Offer summary
Qualifications:
- 5+ years of experience in site reliability engineering
- Experience in a 24/7 enterprise environment
- Hands-on experience with automation tools (Ansible, Chef, Puppet, Terraform)
- Strong skills in Docker, Kubernetes, and Terraform
- Proficiency in Python and C++ programming
Key responsibilities:
- Design and maintain scalable infrastructure for AI models
- Develop infrastructure automation tools using Docker and Terraform
- Ensure system reliability and performance through monitoring
- Collaborate with other teams on distributed systems design
- Optimize GPU and HPC clusters for AI training