Site Reliability Manager

Remote: Full Remote
Contract:
Salary: 140 - 180K yearly
Experience: Senior (5-10 years)
Work from: District of Columbia (USA), United States

Offer summary

Qualifications:

  • Bachelor’s degree in computer science or a related field; Master's preferred.
  • 10+ years of experience managing site reliability engineers in AWS.
  • Deep understanding of cloud platforms and containerization technologies.
  • Strong knowledge of infrastructure as code tools and CI/CD pipelines.
  • Certifications like AWS DevOps Engineer are a plus; Public Trust clearance required.

Key responsibilities:

  • Lead a service delivery team and define best practices for infrastructure.
  • Collaborate with teams to design scalable architectures.
  • Develop and maintain SLOs and KPIs for system performance.
  • Conduct post-mortems for incidents and drive continuous improvement initiatives.
  • Mentor team members and foster a culture of learning.
Karsun Solutions, LLC SME https://www.karsun-llc.com/
201 - 500 Employees

Job description

We are seeking a highly skilled and experienced Site Reliability Manager to join our team. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and services. They will lead a team of engineers in designing, implementing, and maintaining robust infrastructure and automation solutions. The ideal candidate must reside in the Washington DC area and be available to work on site in downtown Washington DC as required.

  • Lead a service delivery team of 8-20 people (service support specialists, DevSecOps engineers, and site reliability engineers).
  • Define and implement best practices for infrastructure as code, deployment automation, and monitoring.
  • Collaborate with cross-functional teams to design scalable and fault-tolerant architectures.
  • Develop and maintain service level objectives (SLOs) and key performance indicators (KPIs) to measure system reliability and performance.
  • Conduct post-mortems and root cause analyses for incidents and implement preventive measures to mitigate future incidents.
  • Drive continuous improvement initiatives to enhance the reliability, scalability, and efficiency of our systems and services.
  • Mentor and coach team members to foster a culture of learning and innovation.

Required:

  • Bachelor’s degree in Computer Science, Engineering, or a related field; Master's degree preferred.
  • 10+ years of experience in a similar role managing a team of site reliability engineers and delivering on the AWS cloud platform.
  • Proven track record of managing high-performance teams.
  • 5+ years of experience supporting operations and maintenance for cloud-native production applications that are fault-tolerant, self-healing, scalable, and highly available.
  • Deep understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
  • Strong knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines.
  • Experience with monitoring, logging, and observability tools such as Datadog, AWS CloudWatch, ELK, Prometheus, and Splunk.
  • Excellent communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams.
  • Strong problem-solving and analytical skills, with keen attention to detail.
  • Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus.
  • Ability to obtain and maintain a Public Trust clearance.

Preferred:

  • Understanding of modern architectures (e.g., microservices, event-driven architecture) and caution against overcomplexity and overengineering.
  • Experience with monitoring and metrics platforms (e.g., New Relic, Prometheus, InfluxDB, Grafana, Splunk).
  • Experience designing and operating distributed systems and cloud infrastructure at scale.

In accordance with pay transparency guidelines, the proposed salary range for this position is $140,000.00 to $180,000.00. Final salary will be determined based on various factors such as relevant skills, experience and certifications.

Find Your Next at Karsun Solutions and transform your career with the company transforming possible for the US Government.

At Karsun, collaboration drives our community. We’re committed to building an environment where team members from diverse backgrounds can innovate, learn and grow with us. Here at Karsun, the only limit to your potential is the limit of your curiosity.

And because we know well-being empowers us to thrive, we offer robust and comprehensive benefits including:

  • Health, Life & Disability Insurance – Medical, Dental, Life and Disability coverage is paid for by Karsun for full-time employees.
  • Paid Parental Leave
  • 401(k) Retirement Plan – with pre-tax and post-tax Roth contribution offerings and immediate vesting with a per pay period match
  • Generous time off programs including 11 paid holidays per year
  • Supplemental plans such as Vision, Pet Insurance and 529 Savings Plan
  • Employee Assistance Program with behavioral health, physical wellness and financial advice
  • Employee Discounts & Perks
  • In-house Technical/Skills Training

Join Team Karsun and Find Your Next.

Karsun Solutions is an Equal Employment Opportunity (EEO) employer. It is the policy of the Company to provide equal employment opportunities to all qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran or disabled status, or genetic information.

Karsun does not accept unsolicited resumes through or from search firms or staffing agencies. All unsolicited resumes will be considered the property of Karsun and Karsun will not be obligated to pay a placement fee.

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Other Skills

  • Analytical Skills
  • Team Leadership
  • Communication
  • Problem Solving
