Senior Site Reliability Engineer

Remote: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

5-8 years of SRE experience, including experience with GCP, monitoring tools, and alerting tools.

Key responsibilities:

  • Proactively ensure infrastructure reliability
  • Lead incident management and post-incident reviews
Tech Holding Scaleup https://www.techholding.co/
51 - 200 Employees

Job description

Your missions

About us:

Working at Tech Holding isn't just a job; it's an opportunity to be part of something bigger. We are a full-service consulting firm founded on the premise of delivering predictable outcomes and high-quality solutions to our clients. Our founders and team members have held senior positions in a wide variety of companies – from emerging startups to large Fortune 50 firms – and we have combined that experience into a unique approach built on deep expertise, integrity, transparency, and dependability.

The Role:  

We are seeking a highly skilled and experienced Senior Site Reliability Engineer to join our growing team. You will play a central role in ensuring the reliability, scalability, and performance of our critical infrastructure and applications. Beyond core SRE responsibilities, you will also serve as a key liaison across teams, fostering collaboration and ensuring seamless operations.

Responsibilities:

Site Reliability Engineering:

  • Proactively identify and mitigate potential issues impacting infrastructure and applications.
  • Partner with development teams to implement best practices for building reliable and scalable systems.
  • Stay up-to-date on the latest SRE trends and technologies.

Monitoring and Observability:

  • Design, implement, and maintain robust monitoring solutions using tools like Prometheus and Grafana.
  • Develop and configure alerts within tools like PagerDuty to ensure timely notification of potential issues.
  • Analyze and troubleshoot issues using collected application and infrastructure metrics.

Incident Management:

  • Lead incident response, ensuring timely resolution and minimizing downtime.
  • Document and communicate incident details effectively to stakeholders.
  • Conduct post-incident reviews to identify root causes and implement preventative measures.

Service Level Agreements (SLAs):

  • Collaborate with product and engineering teams to define clear and measurable SLAs for our SaaS offerings.
  • Establish Service Level Objectives (SLOs) for key metrics based on SLA requirements.
  • Define Service Level Indicators (SLIs) to track progress towards achieving SLOs.
  • Monitor SLO compliance and proactively identify potential SLA breaches.

Automation:

  • Identify opportunities for automation to improve efficiency and reliability.
  • Develop and implement automation scripts in languages like Python or Bash.
  • Automate routine tasks and incident response workflows.

Cross-Team Collaboration:

  • Act as a liaison between SRE, Product, Security, Application Engineering, and Customer Operations teams.
  • Facilitate communication and information sharing across teams to ensure smooth operations.
  • Work collaboratively to define and implement solutions that meet the needs of all stakeholders.

Mentorship and Knowledge Sharing:

  • Mentor and collaborate with junior SREs.
  • Share knowledge and best practices within the team.
  • Contribute to the development and documentation of internal SRE processes.

Required Skills:

  • 5-8 years of experience as a Site Reliability Engineer (SRE) or related role.
  • Experience with Google Cloud Platform (GCP).
  • Proven experience with monitoring tools like Prometheus and Grafana.
  • Strong understanding of incident management best practices.
  • Experience with alerting tools like PagerDuty.
  • Experience with scripting languages like Python or Bash for automation.
  • Excellent communication and collaboration skills.
  • Ability to work independently and as part of a team.
  • Strong problem-solving and analytical skills.
  • Passion for building reliable and scalable systems.

Nice to Have:

  • Experience with container orchestration platforms like Kubernetes.
  • Experience with chaos engineering principles.
  • Experience with configuration management tools like Ansible or Chef.

What we offer:

  • Remote Work Opportunities
  • Flexible Work Hours

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Soft Skills

  • Team Communication
  • Teamwork
  • Problem Solving
  • Analytical Thinking
  • Reliability
