
Site Reliability Engineer (SRE)

Benefits: extra holidays, extra parental leave
Remote: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

5+ years in SRE or DevOps; expertise with AWS, especially GovCloud; strong experience with IaC tools such as Terraform; hands-on experience with CI/CD pipelines; proficiency with monitoring tools and scripting languages.

Key responsibilities:

  • Ensure availability and performance of My HealtheVet platform
  • Develop and maintain CI/CD pipelines for deployment consistency
  • Implement monitoring tools for timely incident resolution
  • Conduct performance testing and optimizations
  • Collaborate on best practices and compliance with healthcare regulations
By Light Professional IT Services (large company): https://www.bylight.com/
1,001-5,000 employees

Job description

Company Overview:

By Light Professional IT Services LLC readies warfighters and federal agencies with technology and systems engineered to connect, protect, and prepare individuals and teams for whatever comes next. Headquartered in McLean, VA, By Light supports defense, civilian, and commercial IT customers worldwide.

Position Overview:

We are seeking a Site Reliability Engineer (SRE) to join our team supporting My HealtheVet, the U.S. Department of Veterans Affairs’ (VA) online health portal. This role focuses on ensuring the platform’s performance, scalability, and reliability while enhancing automation and infrastructure-as-code practices. The SRE will collaborate with development, operations, and security teams to deploy and maintain robust systems that meet the unique needs of the healthcare industry, with special attention to compliance, data privacy, and integration with Oracle Health (Cerner) systems.

Responsibilities:
  • Maintain System Reliability: Oversee availability, reliability, and performance of the My HealtheVet platform, proactively identifying potential issues and ensuring minimal service disruptions.
  • Automation & Infrastructure-as-Code: Develop and maintain CI/CD pipelines using tools such as Jenkins, GitLab CI/CD, or AWS CodePipeline; automate deployment processes to ensure consistency and scalability.
  • Monitoring & Incident Response: Implement and manage monitoring tools (e.g., Prometheus, Grafana, CloudWatch) to detect and resolve incidents quickly; participate in an on-call rotation providing 24/7 support.
  • Performance Tuning & Optimization: Conduct regular performance testing and tuning, ensuring the platform meets user demands and operates efficiently.
  • Collaboration: Work closely with development, QA, and security teams to align on deployment strategies and secure development best practices, including healthcare-specific compliance (e.g., HIPAA).
  • Continuous Improvement: Identify and resolve bottlenecks, implement best practices, and contribute to team knowledge sharing to improve overall system stability and team efficiency.
  • Security & Compliance: Enforce security protocols and work with compliance teams to ensure My HealtheVet meets HIPAA, FISMA, and federal security guidelines.
Required Experience/Qualifications:
  • Experience in SRE/DevOps: 5+ years of experience as an SRE or DevOps engineer, preferably within a healthcare or government environment.
  • Proficiency in Cloud Platforms: Expertise with AWS (GovCloud experience preferred); familiarity with core AWS services such as EC2, RDS, Lambda, VPC, and IAM.
  • Infrastructure as Code (IaC): Strong experience with tools like Terraform, AWS CloudFormation, or Ansible for managing infrastructure as code.
  • CI/CD Pipelines: Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI/CD, AWS CodePipeline) to streamline deployment and reduce time to market.
  • Monitoring & Logging: Skilled in monitoring and logging platforms such as Prometheus, Grafana, ELK stack, CloudWatch, and Splunk.
  • Automation Scripting: Proficiency in scripting languages such as Python, Bash, or PowerShell for automation and process improvement.
  • Networking & Security: Strong understanding of network protocols, VPNs, firewalls, and security practices; experience with compliance frameworks like HIPAA, FISMA, and FedRAMP is a plus.
  • Containerization & Orchestration: Experience with containerization (Docker) and orchestration platforms (Kubernetes, OpenShift) to manage scalable deployments.
  • Problem-Solving Skills: Ability to troubleshoot complex issues in a high-availability production environment.
  • Soft Skills: Strong communication and collaboration skills, especially in cross-functional team settings.

Preferred Experience/Qualifications:
  • Healthcare IT Standards: Familiarity with HL7 and FHIR standards for data interoperability within healthcare applications.
  • Experience with Oracle Health (Cerner): Knowledge of Oracle Health (Cerner) systems and experience with integrations in healthcare IT environments.
  • Security Certifications: Certifications such as CompTIA Security+, CISSP, or AWS Certified Security - Specialty.
  • ITIL Foundation: Knowledge of ITIL practices, especially around incident, change, and problem management.

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Other Skills

  • Problem Solving
  • Collaboration
  • Communication
