Offer summary

Qualifications:

5+ years of Site Reliability Engineering or DevOps experience

Deep experience with Kubernetes administration and troubleshooting

Hands-on experience deploying and maintaining observability tools like Prometheus and Grafana

Strong understanding of Helm charts and GitOps practices

Key responsibilities:

Deploy and maintain observability stack across multiple customer clusters and DoD networks

Build automation to streamline monitoring deployments for new customers

Troubleshoot and debug complex Kubernetes issues and monitoring stack failures

Collaborate with security teams to ensure compliance with NIST requirements and DoD standards

Job description

ABOUT THE ROLE

Second Front Systems' (2F) Product team is seeking a highly skilled and motivated Senior Site Reliability Engineer to join our Observability team. We are a small team working to accelerate the deployment of emerging technology into national security use cases. We are seeking technical professionals who want to operate on the front lines of an exciting and disruptive mission.

As a Senior SRE for Second Front Systems, you'll be responsible for deploying, maintaining, and scaling our observability infrastructure across multiple DoD networks. You'll work with Kubernetes-based platforms, BigBang charts from DoD Platform One, and build automation to make our monitoring stack easier to deploy for new customers. You'll be empowered to collaborate with others to implement infrastructure that delivers unique capabilities for our commercial and government customers, including the Department of Defense.

The Observability team is looking for a strong SRE with deep DevSecOps and Kubernetes experience: someone who has deployed and maintained monitoring infrastructure at scale, with an eye for security in highly regulated environments. Experience with DoD software deployments, Platform One, and single-tenant architectures is highly valued.

We are a fast-growing entrepreneurial team working at the convergence of technology and national security. If this type of effort interests you, come join us!

Note: This position requires U.S. citizenship due to government contract requirements.

What You’ll Do

Deploy and maintain observability stack (Grafana, Mimir, Prometheus) across multiple customer clusters and DoD networks

Build Helm chart abstractions and automation to streamline monitoring deployments for new customers

Troubleshoot and debug complex Kubernetes issues, networking problems, and monitoring stack failures

Configure and maintain BigBang charts and DoD Platform One integrations

Design and implement infrastructure automation using tools like Pulumi, ArgoCD, and Flux

Work with Istio service mesh and Keycloak for authentication in secure environments

Monitor and optimize performance of monitoring infrastructure across multiple environments

Collaborate with security teams to ensure compliance with NIST requirements and DoD standards

Participate in on-call rotation and incident response for production environments

Skills You’ll Bring to Our Team

5+ years of Site Reliability Engineering or DevOps experience

Deep experience with Kubernetes administration, troubleshooting, and scaling

Hands-on experience deploying and maintaining observability tools (Prometheus, Grafana, Mimir/Cortex)

Strong understanding of Helm charts, GitOps practices, and CNCF tooling

Experience with service mesh technologies (Istio preferred)

Proven ability to debug complex distributed systems and networking issues

Understanding of authentication systems and security in regulated environments

Ability to work independently and collaborate with team members in a remote environment

Preferred Qualifications

Active security clearance or ability to obtain a Secret-level security clearance

Previous experience with DoD software deployments and Platform One

Experience with BigBang charts and Iron Bank containers

Experience working in national security or highly regulated environments

Familiarity with compliance frameworks (NIST, FedRAMP, etc.)

Experience with infrastructure as code (Pulumi, Terraform)

Technologies We Use

Observability: Grafana stack, Prometheus, custom alerting tools

Kubernetes: Helm, ArgoCD, Flux, Tekton, BigBang charts

Security: Istio, Keycloak, Kyverno

Infrastructure: AWS/GCP/Azure, Pulumi, Git/GitLab

Languages: YAML, Bash, Go


Site Reliability Engineer - Observability
