Key Responsibilities
SRE Activation & Operating Model
∙ Drive adoption of the SRE operating model across application teams
∙ Clarify roles and responsibilities among:
o SRE
o Production Support Engineering (PSE)
o Application teams
∙ Ensure SRE practices are embedded into the development lifecycle, not treated as post-production activities
Reliability Standards & Governance
∙ Define and enforce:
o SLIs, SLOs, and Error Budgets
o Production readiness criteria
o Reliability best practices
∙ Lead SLO adoption and compliance reviews across the organization
∙ Establish governance frameworks to ensure consistent application of standards
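For illustration, the relationship between an SLI, an SLO, and an error budget can be sketched as below. This is a minimal example, not a prescribed standard: the function name, the request-based SLI, and the sample numbers are all assumptions chosen for clarity.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize one SLO window for a request-based SLI (hypothetical helper).

    slo_target   -- target fraction of good events, e.g. 0.999
    total_events -- all events observed in the window
    bad_events   -- events that failed the SLI criterion
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # SLI: fraction of events that were good.
    sli = (total_events - bad_events) / total_events
    # Error budget: number of bad events the SLO tolerates in this window.
    allowed_bad = (1.0 - slo_target) * total_events
    # Fraction of the budget still unspent (negative once the SLO is breached).
    budget_remaining = 1.0 - bad_events / allowed_bad

    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_remaining": budget_remaining,
    }


# Illustrative numbers only: a 99.9% SLO over 1,000,000 requests with 400 failures.
result = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=400)
```

With these sample inputs the SLO tolerates about 1,000 failed requests, so roughly 60% of the budget remains. A compliance review could compare this remaining fraction across teams.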
Cross-Team Coordination & Enablement
∙ Partner with:
o Application product teams
o Production Support Engineering (MG team)
o Platform / Infrastructure / Observability teams
∙ Drive alignment and reduce friction between engineering and operations
∙ Ensure clear handoffs, escalation models, and operational ownership
Observability & Monitoring Strategy
∙ Lead adoption of centralized observability standards across:
o Metrics
o Logging
o Tracing
∙ Align tooling (AppDynamics, Splunk, Prometheus, etc.) with these standards
∙ Ensure monitoring and alerting are SLO-driven and actionable rather than noisy
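One common way to make alerting SLO-driven is to page on error-budget burn rate rather than on raw error counts. The sketch below assumes a multi-window check; the function names, the 14.4 threshold, and the sample error rates are illustrative assumptions, not values taken from this role's tooling.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    return error_rate / (1.0 - slo_target)


def should_page(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed example threshold, tune per service
) -> bool:
    """Page only when both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which cuts down on stale or noisy pages.
    """
    return (
        burn_rate(long_window_error_rate, slo_target) >= threshold
        and burn_rate(short_window_error_rate, slo_target) >= threshold
    )


# Sustained 2-3% errors against a 99.9% SLO: both windows burn fast, so page.
page_now = should_page(0.02, 0.03, slo_target=0.999)
# Errors have already subsided in the short window: no page.
page_recovered = should_page(0.02, 0.0005, slo_target=0.999)
```

Requiring both windows to exceed the threshold is a design choice that trades a slightly later page for fewer alerts on issues that have already resolved.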
Incident Management & Continuous Improvement
∙ Partner with PSE to strengthen:
o Incident management processes
o Root Cause Analysis (RCA) standards
∙ Identify recurring incident patterns and systemic issues
∙ Ensure learnings translate into engineering improvements and automation
Automation & Reliability Engineering
∙ Identify opportunities to:
o Reduce manual operational work
o Improve system resilience
o Enable self-healing capabilities
∙ Promote a culture that favors engineering solutions over reactive firefighting
Reporting & Organizational Insight
∙ Define and track reliability metrics across FS&I
∙ Build reporting that provides visibility into:
o System health
o Incident trends
o SLO performance
∙ Translate technical data into actionable business insights
Required Qualifications
∙ 10+ years in engineering, operations, or SRE roles
∙ 5+ years leading SRE, platform, or reliability-focused teams
∙ Proven experience implementing SRE practices at scale (SLIs, SLOs, error budgets)
∙ Strong background in cloud environments (AWS, Azure, GCP)
∙ Hands-on experience with observability tools (Splunk, AppDynamics, Prometheus, etc.)
∙ Experience in incident management and production operations at scale
∙ Ability to operate effectively in high-pressure and complex enterprise environments
Preferred Qualifications
∙ Experience driving organizational transformation (not just technical implementation)
∙ Strong understanding of CI/CD, DevOps, and automation practices
∙ Experience working in regulated or large enterprise environments
∙ Familiarity with AIOps or advanced automation strategies
Key Success Indicators
∙ Increased adoption of SLOs and reliability standards
∙ Reduction in high-severity incidents over time
∙ Improved MTTR and operational efficiency
∙ Increased adoption of standardized observability practices
∙ Reduction in reactive, ticket-driven work across teams
∙ Clear alignment between SRE, PSE, and application teams
Core Competencies
∙ Strategic thinking with strong execution focus
∙ Ability to drive alignment across multiple teams and stakeholders
∙ Strong communication and influence skills
∙ Bias toward structure, clarity, and accountability
∙ Ability to operate with urgency and discipline in complex environments