Senior Site Reliability Engineer

Remote: Full Remote

Offer summary

Qualifications:

  • Bachelor’s degree in Computer Science, Software Engineering, or related field.
  • Minimum 3 years of experience in SRE, DevOps, or cloud infrastructure roles.
  • Hands-on expertise with AWS Cloud Services and Infrastructure as Code tools like Terraform.
  • Proficiency in at least one programming language such as Python or Node.js.

Key responsibilities:

  • Design and maintain Infrastructure as Code solutions using Terraform across AWS environments.
  • Collaborate with teams to build scalable and secure cloud infrastructure.
  • Implement monitoring and alerting solutions to ensure system reliability.
  • Drive incident response processes and mentor junior engineers.

Razer Inc. SME https://www.razer.com/
1001 - 5000 Employees

Job description

Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team spread across 5 continents. Razer is also a great place to work, providing the unique, gamer-centric #LifeAtRazer experience that will accelerate your growth, both personally and professionally.

Job Responsibilities:

We are seeking a skilled and driven Senior Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team. The ideal candidate will have hands-on experience in Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

REQUIREMENTS:

  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
  • Minimum 3 years of experience in SRE, DevOps, cloud infrastructure, or system administration roles.
  • Hands-on expertise with AWS Cloud Services, including:
      • Compute & Containerization: EC2, Lambda, ECS, EKS, Auto Scaling
      • Networking: Load Balancers, VPC, Route 53, Security Groups, Firewalls
      • Storage & Databases: RDS, ElastiCache, Athena, S3
      • Messaging: SQS, SES
  • Deep understanding of Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
  • Proficiency in at least one programming/scripting language: Python, Node.js, Bash, Ruby, or related.
  • Experience operating and troubleshooting across Linux, Windows, and container-based environments.
  • Strong understanding of distributed systems, cloud networking (routers, switches), firewalls, DNS, and HTTP/TLS.
  • Experience implementing monitoring and alerting systems and working with incident management processes.
  • Experience with zero-downtime deployments, such as blue/green or canary deployments.
  • Familiarity with cost optimization and right-sizing AWS resources.
  • Exposure to multi-region, multi-account AWS architecture.
  • Understanding of API gateways or edge networking (e.g., Akamai, CloudFront).

JOB DESCRIPTION:

  • Design, implement, and maintain Infrastructure as Code (IaC) solutions using Terraform and/or CloudFormation across multi-account AWS environments.
  • Collaborate with developers, architects, and DevOps teams to build scalable, secure, and observable cloud infrastructure.
  • Lead and participate in architecture design sessions, focusing on system reliability, scalability, security, and performance.
  • Implement and manage robust monitoring, alerting, and observability solutions (e.g., CloudWatch, Prometheus, ELK, Datadog).
  • Set and monitor Key Performance Indicators (KPIs) for system uptime, latency, throughput, and overall reliability.
  • Drive incident response processes, including coordination, triaging, resolution, documentation, and post-incident reviews (PIRs).
  • Supervise and mentor junior SREs and infrastructure engineers, fostering knowledge-sharing and team growth.
  • Collaborate across development, operations, and security teams to ensure secure and compliant deployments.
  • Automate manual tasks and workflows through scripting and tooling (Python, Node.js, Bash, Ruby, JSON/YAML).
  • Troubleshoot complex infrastructure issues across Linux, Windows, Docker, and cloud-native environments.
  • Provide IaC and CI/CD best practices to ensure repeatability, scalability, and compliance across all environments.
  • Provide on-call support, participate in incident rotations, and lead technical investigations during outages or degradations.
  • Strong understanding of and experience with Disaster Recovery (DR).
  • Support the 5:00 PM to 2:00 AM (UTC+8) shift to ensure continuous SRE coverage.
  • Complete an initial familiarization period during regular working hours before transitioning to the designated shift.
  • Provide support and resolution for assigned incidents and tickets.


 

Are you game?

Required profile

Spoken language(s):
English

Other Skills

  • Troubleshooting (Problem Solving)
  • Teamwork
  • Communication
  • Problem Solving
