Site Reliability Engineer

Remote: Full Remote
Experience: Mid-level (2-5 years)

Offer summary

Qualifications:

  • 3+ years of experience in site reliability engineering.
  • Deep expertise in Kubernetes and containers.
  • Strong understanding of cloud infrastructure.
  • Experience with monitoring and logging tools.

Key responsibilities:

  • Monitor performance and reliability of systems.
  • Automate routine maintenance tasks and respond to incidents.

Job description

Overview:

Site Reliability Engineer

Remote | US Based - EST Timezone Preferred

US Citizenship - Clearable; Ability to obtain a Secret Clearance

 

Summary

Our client is a leading provider of generative AI solutions designed for both government and commercial sectors. Their platform focuses on secure, versatile, and cloud-agnostic AI applications, supporting a wide range of large language models (LLMs) for data processing across text, images, and audio. Our client’s technology is built to support various cloud infrastructures, offering strong security features, including zero-trust access controls and compliance with high government security standards. The platform also aims to enhance productivity by enabling users to analyze diverse data formats efficiently.

 

Responsibilities

Our client is seeking a Site Reliability Engineer to join their team! The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, performance, and scalability of our client's software, websites, and applications. This role requires a combination of software engineering and systems administration skills to monitor, control, and automate systems. The ideal candidate will have a deep understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance. This position plays a critical role in maintaining the overall health and efficiency of our client's platform.

  • Monitor the performance and reliability of our client's Kubernetes clusters, software, websites, and applications
  • Automate routine maintenance tasks to ensure system stability and performance
  • Respond to and resolve incidents in a timely manner, minimizing downtime and impact on users
  • Conduct root cause analysis to identify and address underlying issues
  • Develop and implement strategies to prevent future incidents and improve system resilience
  • Design, build, and maintain automated systems and processes to improve efficiency and reduce manual intervention
  • Manage cloud infrastructure, including provisioning, scaling, and optimizing resources
  • Collaborate with development teams to ensure seamless deployment and integration of new features and updates
  • Analyze system performance and identify areas for improvement
  • Implement performance tuning and optimization techniques to enhance system efficiency
  • Collaborate with cross-functional teams to ensure optimal performance of all components
  • Ensure compliance with security best practices and industry standards
  • Implement and maintain security measures to protect systems and data
  • Conduct regular security audits and vulnerability assessments
  • Maintain accurate and up-to-date documentation of systems, processes, and procedures
  • Generate and analyze reports on system performance, incidents, and other key metrics
  • Provide regular updates to management and stakeholders on system health and performance
  • Identify opportunities for improving system reliability, performance, and scalability
  • Stay up to date with industry trends and best practices in site reliability engineering
  • Participate in training and development opportunities to enhance skills and knowledge

Requirements

  • 3+ years of experience in site reliability engineering, Kubernetes administration, or a related role
  • Deep expertise in Kubernetes and containers is required
  • Strong understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance
  • Experience with monitoring and logging tools such as Loki and Grafana
  • Excellent problem-solving skills and attention to detail
  • Strong communication and interpersonal skills, with the ability to work effectively with cross-functional teams

Education/Certification Requirements

  • None

Preferred Requirements

  • Candidates local to Washington, D.C. are preferred
  • Experience working within a start-up environment is highly preferred

Clearance Requirements

  • Applicants selected will be subject to a security investigation and may need to meet eligibility requirements for access to classified information. Must be able to obtain a US Government Secret-level clearance after starting the position.

Other Duties
Please note this job description is not designed to cover or contain a comprehensive listing of activities, duties, or responsibilities that are required of the employee for this job. Duties, responsibilities, and activities may change at any time with or without notice.
 
--------------
 
About Us
Northern Virginia-based Precision Solutions is an expert in staffing solutions for companies of any size that open the door to new opportunities and seek outstanding talent. We pride ourselves on being versatile enough to tailor our relationships to the needs of each individual client, being agile in the fast-paced marketplace, and being precise in meeting the needs of any company.
 
Equal Opportunity Employer Statement
Precision Solutions is an equal opportunity employer. We prohibit discrimination and harassment of any kind based on race, color, sex, religion, sexual orientation, national origin, disability, genetic information, pregnancy, or any other protected characteristic as outlined by federal, state, or local laws.

Required profile

Experience

Level of experience: Mid-level (2-5 years)
Spoken language(s):
English

Other Skills

  • Communication
  • Problem Solving
  • Social Skills
