2+ years of Site Reliability experience. Proficiency with Prometheus, Kubernetes, GitLab CI, and Ansible. Understanding of TCP/IP, web services, and REST APIs. High degree of emotional intelligence and professional fluency in English.
Key responsibilities:
Ensure high availability of services and document infrastructure details.
Develop infrastructure solutions to simplify operations and maintain 99.99% uptime.
Train developers to solve infrastructure problems independently and improve observability.
Build monitoring systems and automate service reliability processes.
TripleTen (formerly Practicum): Empowering all to thrive in tech. Our coding bootcamp provides versatile learners with the skills and confidence for sustainable careers. Unlock your potential with us! 🌱💡💻
TripleTen for Business empowers companies to achieve their business goals by bridging talent gaps in Data Science, AI for professionals, Python Development, and Management.
Our transformative approach includes tailored training programs, informed by comprehensive pre-training assessments, ensuring precise alignment with client needs. With expert-led content and personalized mentoring, we help employees excel and achieve new levels of proficiency.
We are looking for a Senior Site Reliability Engineer. In this role, you will take ownership of ensuring high service availability, documenting infrastructure details, and empowering developers through training and guidance on working with the infrastructure.
What you will do
Develop infrastructure and write solutions that simplify operations.
Build processes to achieve and maintain 99.99% uptime, and improve the exercise process.
Develop automation and service reliability, plan resources, and reduce ops in development.
Build infrastructure and monitoring, help developers solve infrastructure problems, train developers to solve problems independently, and improve the observability of infrastructure, monitoring, schedules, and alerts.
Requirements
2+ years of Site Reliability Experience.
Experience working with Prometheus - must have.
Experience working with Kubernetes, GitLab CI, and Ansible.
Experience working with Unix systems (we have Ubuntu) and the console.
Understanding of TCP/IP basics for building networks, how web services work, REST APIs, and gRPC.
Experience performing diagnostics, including interpreting the output of ps, top, strace, perf, and tcpdump.
Understanding of how user applications interact with the operating system, including familiarity with system calls, processes, and threads.
Willingness to build high-load systems and understanding of how to do that.
Understanding of fault tolerance and service scaling.
High degree of emotional intelligence, ability to find common ground with colleagues and work as part of a team.
Must be professionally fluent in English.
Nice To Have
Experience working with AWS and Terraform.
Experience programming in Python / Golang or desire to learn how.
What we can offer you
Full-time remote collaboration with a convenient schedule. Professional freedom, where we trust your experience instead of wasting each other's time and effort micromanaging;
A diverse and tight-knit team. Our teammates are spread out across Serbia, the US, Israel, Georgia, Armenia, Latin America, and more. They've worked at big tech companies, ed-tech companies, design agencies, and cultural institutions;
Comfortable digital workspace. We use Miro, Notion, Google Workspace, Jira, and other tools to make working together seamless.