Job Description:
As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability, scalability, and performance of our systems. You will collaborate with cross-functional teams to design, build, and maintain scalable infrastructure, automate operational processes, and respond to incidents swiftly. The ideal candidate is passionate about automation, has a deep understanding of system architecture, and is dedicated to delivering high-quality, reliable services.
Responsibilities:
System and Service Reliability
- Ensure overall reliability and performance
- Monitor system health
- Perform root cause analysis
Incident / Ticket Management
- Alert management
- Incident response
- Triaging, investigating & mitigating incidents
- Coordinating with cross-functional teams
Automation and Tooling
- Automation and process improvements
- Developing automation tools, scripts and infrastructure
- Identify and automate repetitive tasks to reduce manual work
Capacity Planning and Scalability
- Collaborate with development and infrastructure teams
- Conduct capacity planning exercises
- Forecast resource requirements
- Optimize system scalability to handle increased workloads
Performance and Optimization
- Monitor and analyze performance metrics
- Identify bottlenecks and recommend optimizations
- Collaborate to optimize application code, database queries and system configurations
Reliability Engineering Practices
- Advocate and implement reliability engineering practices
- Error budgeting and reviews
- Conduct blameless postmortems
Continuous Improvement
- Analyze incident trends and monitor system metrics
- Gather feedback from DevOps, application developers and customers
- Identify areas of improvement and collaborate with development teams
Collaboration and Communication
- Foster collaboration with development, operations & cross-functional teams
- Act as a bridge between different teams
- Share knowledge and promote effective communication
- Create and contribute to documentation & share best practices