Offer summary

Qualifications:

At least 3 years of experience in software engineering, with 2+ years in SRE or DevOps roles.

Hands-on experience managing high-availability production systems.

Proficiency in programming languages like Go or Python, focusing on automation.

Strong knowledge of observability tools and cloud infrastructure such as AWS and DigitalOcean.

Key responsibilities:

Design and implement observability solutions for backend services, web applications, and databases.

Develop and maintain cloud and self-hosted infrastructure using tools like Terraform and Ansible.

Support developers in improving service reliability and automating deployments.

Build and maintain CI/CD pipelines and track SLI/SLOs for continuous improvement.

Job description

Our client, a new and profitable Silicon Valley-based B2C product startup building innovative mobile solutions for the planet, is now looking for an experienced Site Reliability Engineer to help build reliable, scalable, and observable systems. You will work closely with backend services (Python/Go), web applications, and databases to ensure performance, stability, and fast recovery in case of failures.

Location: Poland
Type: Remote, Full-time
Start date: ASAP

About the project and position:

Our client is a new mobile innovator, based in Silicon Valley and backed by top-tier VCs, delivering exciting new products for consumers across the planet.
Its flagship VPN application has over 1B downloads and ensures online privacy and anonymity for its users by creating a private network from a public internet connection.

Responsibilities:

Design and implement observability solutions (monitoring, logging, alerting, tracing) for backend services, web applications, and databases
Develop and maintain cloud and self-hosted infrastructure (AWS, DigitalOcean) using infrastructure-as-code and configuration management tools such as Terraform and Ansible
Support developers in improving service reliability and automating deployments
Build and maintain CI/CD pipelines (e.g. GitHub Actions, Jenkins)
Track and improve SLI/SLOs; run root cause analyses and post-mortems
Promote a strong reliability and continuous improvement culture

Requirements:

3+ years of experience in software engineering, including 2+ years in an SRE or DevOps role
Experience managing high-availability production systems
Hands-on experience managing and operating Kubernetes clusters in production
Proficiency in at least one programming language (e.g. Go, Python), with a focus on automation and code quality
Strong knowledge of observability platforms (e.g. Datadog, CloudWatch, Prometheus, Grafana, ClickHouse)
Experience with cloud (AWS, DigitalOcean) and self-hosted infrastructure
Good understanding of incident management, disaster recovery, and monitoring best practices (e.g. DORA metrics, post-mortems, SLOs/SLIs)
Solid Linux administration, networking, and basic security knowledge
Experience building and maintaining CI/CD pipelines (e.g. Jenkins, AWS CodePipeline)
Intermediate English, spoken and written

Nice to have: