Infrastructure Security SME
Primary Skills : Prometheus, Grafana, Splunk, Datadog, Dynatrace, ELK, AppDynamics, AWS, Azure, GCP, Site Reliability Engineering, ML / AI, Kubernetes, Docker, OpenShift.
Location : Manila
Job Description
We are seeking a highly skilled Site Reliability Engineering (SRE) Subject Matter Expert (SME) to lead and advance our observability, performance engineering, reliability, and AIOps practices. The SME will be responsible for designing, implementing, and evangelizing modern SRE capabilities that improve system reliability, scalability, and efficiency across our IT ecosystem. This role requires deep technical expertise, hands‑on problem‑solving skills, and the ability to influence cross‑functional teams.
Key Responsibilities
- Observability & Monitoring
- Define and implement observability frameworks across logs, metrics, traces, and events.
- Establish SLOs, SLIs, and error budgets in collaboration with product and engineering teams.
- Drive proactive incident detection and root cause analysis.
- Performance Engineering
- Lead performance benchmarking, load / stress testing, and scalability assessments of applications and infrastructure.
- Build performance models and capacity planning strategies for critical business systems.
- Partner with development teams to identify performance bottlenecks and optimize application / infrastructure efficiency.
- Reliability Engineering
- Design and implement automation for incident response, disaster recovery, and self‑healing systems.
- Lead chaos engineering and resilience testing initiatives.
- Drive reliability reviews, postmortems, and blameless RCA culture.
- Ensure best practices for fault tolerance, availability, and resilience are embedded in system design.
- AIOps
- Define AIOps strategy and deploy ML / AI‑driven observability and incident response capabilities.
- Leverage anomaly detection, event correlation, and predictive analytics for proactive IT operations.
- Integrate AIOps platforms with ITSM tools for intelligent ticketing, alert suppression, and automated remediation.
- Leadership & Collaboration
- Act as a thought leader in SRE practices, mentoring engineers and influencing leadership decisions.
- Partner with development, infrastructure, and business teams to embed SRE principles across the enterprise.
- Drive a culture of continuous improvement in availability, scalability, and operational excellence.
Required Qualifications
- 10+ years of experience in IT Operations, Reliability Engineering, or Performance Engineering.
- Deep expertise in observability and monitoring platforms (Prometheus, Grafana, Splunk, Datadog, Dynatrace, ELK, AppDynamics, etc.).
- Strong background in performance testing tools (JMeter, LoadRunner, Gatling, k6, etc.) and capacity planning.
- Hands‑on experience in cloud platforms (AWS, Azure, GCP) and containerized environments (Kubernetes, Docker, OpenShift).
- Experience with AIOps platforms (Moogsoft, BigPanda, Dynatrace Davis AI, ServiceNow AIOps, etc.) and ML‑driven IT operations.
- Strong understanding of distributed systems, networking, CI / CD, and DevOps practices.
Preferred Qualifications
- Prior experience leading enterprise‑wide SRE / Observability transformations.
- Knowledge of Chaos Engineering platforms (Gremlin, Chaos Mesh, Litmus).
- Exposure to ITSM / ITIL processes and modern incident management practices.
- Strong communication skills with the ability to influence CxO‑level stakeholders.
- Certifications : Google SRE, AWS DevOps Engineer, Azure SRE Expert, Dynatrace / Datadog certifications (preferred).
- Strategic and analytical thinker with a problem‑solving mindset.
- Strong leadership, mentorship, and stakeholder engagement skills.
- Passionate about automation, scalability, and resilience engineering.
- Ability to balance reliability with velocity in fast‑paced environments.
About CLPS RiDiK
RiDiK is a global technology solutions provider and a subsidiary of CLPS Incorporation (NASDAQ : CLPS), delivering cutting‑edge end‑to‑end services across banking, wealth management, and e‑commerce. With deep expertise in AI, cloud, big data, and blockchain, we support clients across Asia, North America, and the Middle East in driving digital transformation and achieving sustainable growth. Operating from regional hubs in 10 countries and backed by a global delivery network, we combine local insight with technical excellence to deliver real, measurable impact. Join RiDiK and be part of an innovative, fast‑growing team shaping the future of technology across industries.