Site Reliability Engineer (SRE)

Work set-up: 
Full Remote
Experience: 
Senior (5-10 years)

Offer summary

Qualifications:

  • At least 3 years of experience in software engineering, with 2+ years in SRE or DevOps roles.
  • Hands-on experience managing high-availability production systems.
  • Proficiency in programming languages like Go or Python, focusing on automation.
  • Strong knowledge of observability tools and cloud infrastructure such as AWS and DigitalOcean.

Key responsibilities:

  • Design and implement observability solutions for backend services, web applications, and databases.
  • Develop and maintain cloud and self-hosted infrastructure using tools like Terraform and Ansible.
  • Support developers in improving service reliability and automating deployments.
  • Build and maintain CI/CD pipelines and track SLI/SLOs for continuous improvement.

Art2Hire Tech Recruiters Human Resources, Staffing & Recruiting TPE https://www.art2hire.com/

Job description

Our client, a new, profitable Silicon Valley-based B2C product startup building innovative mobile solutions for the planet, is looking for an experienced Site Reliability Engineer to help build reliable, scalable, and observable systems. You will work closely with backend services (Python/Go), web applications, and databases to ensure performance, stability, and fast recovery in case of failures.

Location: Poland
Type: Remote, Full-time
Start date: ASAP
About project and position:

The company is a new mobile innovator, based in Silicon Valley and backed by top-tier VCs, delivering exciting new products for consumers across the planet.
Its flagship VPN application has over 1B downloads and ensures online privacy and anonymity for users by creating a private network from a public internet connection.

Responsibilities:

  • Design and implement observability solutions (monitoring, logging, alerting, tracing) for backend services, web applications, and databases
  • Develop and maintain cloud and self-hosted infrastructure (AWS, DigitalOcean) using infrastructure-as-code and configuration management tools such as Terraform and Ansible
  • Support developers in improving service reliability and automating deployments
  • Build and maintain CI/CD pipelines (e.g. GitHub Actions, Jenkins)
  • Track and improve SLI/SLOs; run root cause analyses and post-mortems
  • Promote a strong reliability and continuous improvement culture

Requirements:

  • 3+ years of experience in software engineering, including 2+ years in an SRE or DevOps role
  • Experience managing high-availability production systems
  • Hands-on experience managing and operating Kubernetes clusters in production
  • Proficiency in at least one programming language (e.g. Go, Python), with a focus on automation and code quality
  • Strong knowledge of observability platforms (e.g. Datadog, CloudWatch, Prometheus, Grafana, ClickHouse)
  • Experience with cloud (AWS, DigitalOcean) and self-hosted infrastructure
  • Good understanding of incident management, disaster recovery, and monitoring best practices (e.g. DORA metrics, post-mortems, SLOs/SLIs)
  • Solid Linux administration, networking, and basic security knowledge
  • Experience building and maintaining CI/CD pipelines (e.g. Jenkins, AWS CodePipeline)
  • English - Intermediate, spoken and written

Nice to have:

  • Security knowledge (e.g. OWASP, threat modeling, vulnerability scanning)
  • Experience with OpenTelemetry or similar tracing tools

Required profile

Experience

Level of experience: Senior (5-10 years)
Industry:
Human Resources, Staffing & Recruiting
Spoken language(s):
English

Other Skills

  • Collaboration
  • Problem Solving
