This is a remote position.
Core Expertise
- SRE Foundations & Practices
Deep understanding of SRE principles (SLIs, SLOs, error budgets, toil reduction, reliability vs. velocity trade-offs).
Proven experience driving SRE adoption and culture change across teams and applications.
Strong knowledge of incident management, on-call practices, and blameless postmortems.
- Cloud & Infrastructure
5+ years of experience with Google Cloud Platform (GCP) services.
Solid expertise with Kubernetes, including scaling, workload optimization, network policies, service mesh, and troubleshooting.
Experience with infrastructure as code.
- Reliability & Observability
Strong knowledge of monitoring, logging, and tracing.
Proven ability to design and implement alerting strategies aligned with SLOs/SLIs.
Hands-on experience optimizing application performance, resiliency, and cost efficiency in cloud-native environments.
- Automation & Tooling
Proficiency in at least one modern programming language (preferably Python) for automation, reliability tooling, and operational improvements.
Familiarity with CI/CD pipelines and release engineering best practices.
Expertise in automating reliability tasks, reducing toil, and scaling best practices across multiple applications.
Leadership & Collaboration
- Ability to evangelize SRE best practices and influence engineering/product teams in adopting them.
- Experience mentoring engineers and establishing communities of practice around reliability.
- Strong stakeholder management skills to balance product delivery goals with reliability requirements.
- Excellent communication skills.
Requirements
Preferred Qualifications
- Hands-on experience migrating applications to SRE operating models in multi-team/multi-application settings.
- Certification(s): Google Cloud Professional DevOps Engineer, Kubernetes CKA/CKS, or equivalent.
Benefits
● Full-time employment with competitive salary and benefits
● Medical, dental, and vision insurance coverage