Senior Site Reliability Engineer (m/f/d)

Stuttgart · Full-time · No remote work possible

As a Senior Site Reliability Engineer in our Platform Squad, you'll own critical reliability domains end-to-end and drive the technical direction within the squad – leading architectural decisions on our platform, mentoring teammates, and continuously raising the reliability bar inside the team.

This role is for an engineer with a proven track record of building and operating high-throughput, highly available systems, who wants senior-level technical ownership and real impact through deep engineering work inside a tight, well-scoped team.

What awaits you

  • Co-own the architecture: Help drive the architecture and evolution of our cloud infrastructure on Azure and our Kubernetes clusters – designed for high throughput and high availability – to support Flip's rapid growth across the globe.
  • Drive the resilience strategy: Define how we approach global scaling, zero‑downtime deployments, rollback mechanisms and disaster recovery, and make sure the platform stays available around the clock.
  • Evolve our observability stack: Improve our LGTM stack (Loki, Grafana, Tempo, Mimir) into a foundation our engineers can trust.
  • Improve our IaC Platform: Eliminate toil at the source, and make our infrastructure truly self‑service for engineering teams.
  • Lead in incidents: Take a leading role in platform‑related major incidents, drive blameless post‑mortems for the squad, and translate findings into systemic improvements.
  • Mentor within the squad: Coach teammates, run RFCs and design reviews inside the team, and help engineers grow into stronger SREs.
  • Shape our roadmap: Partner with your squad to define the platform's direction.

What you bring to the table

We're looking for a hands‑on, SaaS‑minded senior Site Reliability Engineer who treats scalability and reliability as a first‑class product concern.

Must‑Have Qualifications

  • 5+ years of hands‑on experience as a Site Reliability Engineer (SRE), Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus.
  • Proven track record building and operating high‑throughput, highly available systems in production.
  • Deep, production‑level experience with Kubernetes on any hyperscaler.
  • Strong experience with modern observability stacks (e.g. Prometheus, Mimir, VictoriaMetrics, Dash0, Loki, ELK) and a clear point of view on SLIs, SLOs and error budgets.
  • Solid software development skills in Go (strongly preferred, since our IaC runs on Pulumi in Go) or Python.
  • Hands‑on experience with Infrastructure as Code (Pulumi, OpenTofu, Terraform), GitOps (e.g. ArgoCD), and CI/CD pipeline design.
  • Demonstrated ability to lead complex infrastructure initiatives from design to production – including writing RFCs and driving architecture decisions within your team.
  • Experience mentoring engineers and raising the technical bar within a team.
  • Comfortable owning major incidents end‑to‑end and turning learnings into systemic change.
  • Strong communication skills and business‑fluent English.
  • Willingness to participate in on‑call rotations to ensure the reliability of our platform.

Nice‑to‑Have Qualifications

  • Rolled out production‑ready API gateways with Gateway API (e.g. Envoy Gateway).
  • Operated multi‑cluster service meshes (e.g. Cilium, Linkerd, Istio).
  • Deployed and maintained Kubernetes Operators (e.g. Strimzi, CNPG).
  • Operated highly available PostgreSQL in production.

Contact details:

Flip GmbH Recruiting Team