Salary : 100,000 - 150,000
Required Experience :
Bachelor’s degree in Computer Science, Information Technology, or a related field; or equivalent work experience.
5+ years of extensive experience as a Site Reliability Engineer.
Hands-on experience managing monitoring tools, including but not limited to SolarWinds and Nagios.
Demonstrated understanding of what observability is and what it does
Proficient with major cloud platforms such as AWS, GCP, Azure, and Alibaba Cloud
Hands-on experience with SNMP-based monitoring tools such as SolarWinds, Nagios, and Checkmk
Good grasp of observability platforms such as Splunk and Dynatrace
Experience with containerization platforms such as Docker and Kubernetes
Extensive experience with virtualization technology such as VMware
Strong knowledge of networking using collapsed architecture or similar enterprise networking technology
Knowledgeable in scripting languages such as Python, Bash, or PowerShell.
AWS Certified Solutions Architect, Azure Solutions Architect, or equivalent certification.
Certified Kubernetes Administrator (CKA)
Solid understanding of disaster recovery and business continuity practices.
The Senior Engineer – Site Reliability is responsible for ensuring the reliability, performance, and availability of applications, services, and underlying infrastructure by employing monitoring and observability solutions and by creating and maintaining automation scripts that keep the entire technology stack operating at an optimum level.
MAJOR RESPONSIBILITIES AND DUTIES :
Configure and maintain the enterprise monitoring tool to provide real-time visibility into the state of health across the technology stack
Design and create dashboards that provide multi-level views based on functional requirements, such as executive and tactical views
Create and maintain key thresholds across all monitoring elements to ensure proactive, early detection of impending incidents or problems
Analyze events and correlate them across all observability and monitoring tools to capture trends and behavior patterns that support proactive courses of action
Design, develop, and use automation tools and scripts to handle repetitive actions and, where possible, create corrective courses of action to prevent or shorten outages
Work closely with the operations team during incident and problem management to respond quickly to issues identified by the monitoring tools
Regularly review and optimize infrastructure performance using logs, metrics, and traces, adjusting thresholds and monitoring requirements as part of continuous improvement as the environment changes
Develop and maintain a robust alerting strategy, including integration with on-call tools to ensure timely escalation and resolution of critical issues.
Implement and manage end-to-end event lifecycle processes to ensure accurate incident detection and efficient response.
Site Reliability Engineer • Paranaque, National Capital Region, Philippines