Site Reliability Engineer

Remote: Full Remote

Contract: Full time

Experience: Senior (5-10 years)

Work from: California (USA), United States

Offer summary

Qualifications:

5+ years in site reliability engineering
Experience in a 24/7 enterprise environment
Hands-on with automation tools (Ansible, Chef, Puppet, Terraform)
Strong skills in Docker, Kubernetes, and Terraform
Proficient in Python and C++ programming

Key responsibilities:

Design and maintain scalable infrastructure for AI models
Develop infrastructure automation tools using Docker and Terraform
Ensure system reliability and performance through monitoring
Collaborate on distributed systems design with teams
Optimize GPU and HPC clusters for AI training

Genmo AI Art TPE https://www.genmo.ai

2 - 10 Employees

Job description

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

As a Site Reliability Engineer (SRE) at Genmo, you will be responsible for designing, implementing, and maintaining the infrastructure that powers our large generative AI models. You will work on infrastructure automation and distributed systems design, and you will manage high-performance computing (HPC) and GPU clusters. The ideal candidate will have a strong background in infrastructure automation and distributed systems, along with experience in GPU and HPC environments.

Responsibilities:

Design, implement, and maintain scalable infrastructure to support our generative AI models.
Develop and maintain infrastructure automation tools using technologies like Docker, Kubernetes, and Terraform.
Ensure the reliability, availability, and performance of our systems through proactive monitoring and incident response.
Collaborate with software engineers and researchers to design and implement distributed systems.
Manage and optimize GPU and HPC clusters for efficient AI model training and inference.
Develop and maintain CI/CD pipelines to streamline development and deployment processes.
Implement and maintain security best practices across the infrastructure.

Qualifications:

5+ years of experience in site reliability engineering or a similar role.
Experience working in a 24/7 enterprise environment.
Hands-on experience with infrastructure as code and automation tools (Ansible, Chef, Puppet, Terraform).
Strong experience with infrastructure automation tools such as Docker, Kubernetes, and Terraform.
Expertise in designing and maintaining distributed systems.
Proficiency in scripting and programming languages, particularly Python and C++.
Strong understanding of networking, security, and system performance.
Excellent problem-solving skills and the ability to work in a fast-paced environment.

Bonus points:

Experience with cloud providers like AWS, GCP, or Azure.
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Familiarity with CI/CD tools and practices (e.g., Jenkins, GitLab CI/CD).
Experience working with AI and machine learning models.
Strong passion for artificial intelligence and the drive to learn new technologies.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.