Site Reliability Engineer

Work set-up: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

  • 5+ years of experience in Site Reliability Engineering or similar roles.
  • Deep expertise in Datadog for monitoring, alerting, and troubleshooting.
  • Proficiency in Python programming for developing automation tools.
  • Solid understanding of AWS cloud services and cloud-native architectures.

Key responsibilities:

  • Design and maintain highly available and scalable systems.
  • Manage and optimize Datadog monitoring and alerting practices.
  • Develop software and automation tools to improve system reliability.
  • Participate in incident management and post-mortems, and define reliability metrics.

QAD Large https://www.qad.com
1001 - 5000 Employees

Job description

Company Description

Redzone is the #1 Connected Workforce Solution for manufacturers big and small. We work to improve efficiency in plants, provide coaching for best practices, and enable the frontline worker to improve the quality of their work and their work life by providing them with tools, processes, and collaboration tools to keep their manufacturing lines running smoothly and efficiently.

At Redzone we focus on the customer experience, listening to the customer, and providing solutions that create great outcomes. We are a combination of great leadership, years of manufacturing experience, and an incredible technology team that all work together to create great products.

This role is fully remote, but candidates must be based in Mexico with full work authorization already in effect. No visa sponsorship is available.

Job Description

We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of our mission-critical services that power our customer experience. This is an exciting opportunity to shape our SRE practices, drive automation, and significantly impact our products' operational excellence.

What You'll Do:

  • Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experience.
  • Datadog Expert: Be one of the go-to experts for Datadog. You will be responsible for defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing across our entire AWS environment. This includes deep hands-on configuration, dashboarding, troubleshooting, and optimization within Datadog.
  • Software Development for Reliability: Develop robust, well-tested, and maintainable software and tooling to automate operational tasks, create self-service capabilities for engineering teams, and enhance system reliability. This will involve building applications, not just scripts.
  • Toil Reduction Champion: Identify and eliminate toil through automation, process improvements, and systematic problem-solving. Work proactively to shift our operational focus from reactive firefighting to proactive engineering.
  • Incident Management & Post-Mortems: Contribute to and evolve our incident response framework, participating in on-call rotations (using OpsGenie). Lead blameless post-mortems, extracting actionable insights and driving systemic improvements to prevent recurrence.
  • Reliability Metrics & Goals: Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Use these metrics to drive continuous improvement and make data-driven decisions about reliability investments.
  • Infrastructure as Code: Leverage and contribute to our infrastructure as code (IaC) efforts, moving towards a fully automated environment using Terraform and GitHub Actions.
  • System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
  • Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.

Qualifications

What You'll Bring:

  • 5+ years of direct Site Reliability Engineering (SRE) experience or equivalent experience in a production engineering role focused on system reliability.
  • Deep expertise and hands-on experience with Datadog. Proven ability to implement, manage, and optimize Datadog for comprehensive monitoring (APM, infrastructure, logs, synthetics, RUM), alerting, and troubleshooting in complex cloud environments.
  • Strong software development proficiency in Python (required). Demonstrated ability to build applications, tools, and automation frameworks beyond simple scripting.
  • Experience with Golang (desired).
  • Solid understanding of cloud-native architectures and best practices, specifically within AWS (EKS, Load Balancers, Aurora RDS Serverless Postgres, S3, Secrets Manager, MSK, Bedrock, SageMaker, Route53).
  • Experience with containerization and orchestration technologies, particularly Kubernetes (EKS).
  • Familiarity with CI/CD pipelines and tools (Jenkins, GitHub Actions).
  • A strong understanding of distributed systems concepts, networking, and security principles.
  • Experience with incident management processes and tools.
  • Excellent problem-solving skills, with a methodical and data-driven approach to troubleshooting complex systems.
  • Strong communication and collaboration skills, with the ability to work effectively with diverse engineering teams.
  • A proactive mindset, with a passion for automation, continuous improvement, and blameless culture.

Bonus Points (Nice to Have):

  • Experience defining and working with SLOs, SLIs, and Error Budgets.
  • Familiarity with other observability tools or concepts beyond Datadog.
  • Experience with feature flagging platforms like LaunchDarkly.

Additional Information

Why Join Us?

  • Be a key member of a growing SRE team and help shape our operational future.
  • Work on challenging problems at the intersection of software engineering, operations, and customer experience.
  • Opportunity to significantly reduce toil and drive impactful automation.
  • Collaborate with talented engineers in a supportive and learning-oriented environment.
  • Your health and well-being are important to us. We provide programs that help you strike a healthy work-life balance.
  • Opportunity to join a growing business, launching into its next phase of expansion and transformation.
  • Collaborative culture of smart and hardworking people who support one another to get the job done.
  • An atmosphere of growth and opportunity, where idea-sharing is always prioritized over level or hierarchy.
  • Compensation packages based on experience and desired skill set.

About QAD and QAD Redzone:

QAD Inc. is a leading provider of adaptive, cloud-based enterprise software and services for global manufacturing companies. Global manufacturers face ever-increasing disruption caused by technology-driven innovation and changing consumer preferences. In order to survive and thrive, manufacturers must be able to innovate and change business models at unprecedented rates of speed. QAD calls these companies Adaptive Manufacturing Enterprises.

QAD Redzone helps to enable QAD’s vision for the Adaptive Enterprise. Labor productivity improvements directly impact efficiency. Productive and empowered employees increase the effective capacity of your plant and accelerate time to productivity for new employees. This gives manufacturers the agility to increase production beyond what was previously possible without having to invest in production equipment or new plants, and reduces the amount and impact of employee attrition. Empowered employees with a growth mindset take extreme ownership of challenges that impact their production goals, creating resilience in the face of disruption.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state, or local protected class.


Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Other Skills

  • Open Mindset
  • Collaboration
  • Communication
  • Problem Solving
