Job Description
This is a remote position.
Core Expertise
SRE Foundations & Practices
Deep understanding of SRE principles (SLIs, SLOs, error budgets, toil reduction, reliability vs. velocity trade-offs).
SRE adoption and culture change across teams and applications.
Incident management, on-call practices, and blameless postmortems.
Cloud & Infrastructure
5+ years of experience with Google Cloud Platform (GCP) services.
Kubernetes, including scaling, workload optimization, network policies, service mesh, and troubleshooting.
Infrastructure as code.
Reliability & Observability
Strong knowledge of monitoring, logging, and tracing.
Alerting strategies aligned with SLOs/SLIs.
Application performance, resiliency, and cost efficiency in cloud-native environments.
Automation & Tooling
Proficiency in at least one modern programming language (preferably Python) for automation, reliability tooling, and operational improvements.
CI/CD pipelines and release engineering best practices.
Automating reliability tasks, reducing toil, and scaling best practices across multiple applications.
Leadership & Collaboration
Evangelize SRE best practices and influence engineering/product teams in adopting them.
Mentoring engineers and establishing communities of practice around reliability.
Requirements
Preferred Qualifications
SRE operating models in multi-team/multi-application settings.
Requirements
Kubernetes, CI/CD, Infrastructure as Code, GCP
Site Engineer • Makati, 00, ph