Job description

Key Responsibilities
SRE Activation & Operating Model
∙ Drive adoption of the SRE operating model across application teams
∙ Establish clear role boundaries among:
o SRE
o Production Support Engineering (PSE)
o Application teams
∙ Ensure SRE practices are embedded into the development lifecycle, not treated as post-production activities

Reliability Standards & Governance
∙ Define and enforce:
o SLIs, SLOs, and Error Budgets
o Production readiness criteria
o Reliability best practices
∙ Lead SLO adoption and compliance reviews across the organization
∙ Establish governance frameworks to ensure consistent application of standards

Cross-Team Coordination & Enablement
∙ Partner with:
o Application product teams
o Production Support Engineering (MG team)
o Platform / Infrastructure / Observability teams
∙ Drive alignment and reduce friction between engineering and operations
∙ Ensure clear handoffs, escalation models, and operational ownership

Observability & Monitoring Strategy
∙ Lead adoption of centralized observability standards across:
o Metrics
o Logging
o Tracing
∙ Align tooling (AppDynamics, Splunk, Prometheus, etc.) with those standards
∙ Ensure monitoring and alerting are SLO-driven and actionable rather than noisy

Incident Management & Continuous Improvement
∙ Partner with PSE to strengthen:
o Incident management processes
o RCA (Root Cause Analysis) standards
∙ Drive identification of patterns and systemic issues
∙ Ensure learnings translate into engineering improvements and automation

Automation & Reliability Engineering
∙ Identify opportunities to:
o Reduce manual operational work
o Improve system resilience
o Enable self-healing capabilities
∙ Promote a culture of engineering over reaction

Reporting & Organizational Insight
∙ Define and track reliability metrics across FS&I
∙ Build reporting that provides visibility into:
o System health
o Incident trends
o SLO performance
∙ Translate technical data into actionable business insights

Required Qualifications
∙ 10+ years in engineering, operations, or SRE roles
∙ 5+ years leading SRE, platform, or reliability-focused teams
∙ Proven experience implementing SRE practices at scale (SLIs, SLOs, error budgets)
∙ Strong background in cloud environments (AWS, Azure, GCP)
∙ Hands-on experience with observability tools (Splunk, AppDynamics, Prometheus, etc.)
∙ Experience in incident management and production operations at scale
∙ Ability to operate effectively in high-pressure and complex enterprise environments

Preferred Qualifications
∙ Experience driving organizational transformation (not just technical implementation)
∙ Strong understanding of CI/CD, DevOps, and automation practices
∙ Experience working in regulated or large enterprise environments
∙ Familiarity with AIOps or advanced automation strategies

Key Success Indicators
∙ Increased adoption of SLOs and reliability standards
∙ Reduction in high-severity incidents over time
∙ Improved MTTR and operational efficiency
∙ Increased adoption of standardized observability practices
∙ Reduction in reactive, ticket-driven work across teams
∙ Clear alignment between SRE, PSE, and application teams

Core Competencies
∙ Strategic thinking with strong execution focus
∙ Ability to drive alignment across multiple teams and stakeholders
∙ Strong communication and influence skills
∙ Bias toward structure, clarity, and accountability
∙ Ability to operate with urgency and discipline in complex environments

Senior Manager SRE
