Senior Site Reliability Engineer

Remote: Full Remote

Offer summary

Qualifications:

  • Strong background as a Site Reliability Engineer (SRE) in a 24x7 production environment for SaaS or cloud services.
  • Experience with AWS and Infrastructure as Code tools like Terraform and CloudFormation.
  • Proficient in monitoring and observability tools such as Prometheus, Grafana, and New Relic.
  • Familiarity with programming languages like Java, Python, or Go, and scripting languages like Bash or PowerShell.

Key responsibilities:

  • Ensure availability, performance, and reliability of production environments while maintaining uptime.
  • Analyze and resolve operational challenges to meet defined Service Level Objectives (SLOs).
  • Develop and implement automated observability solutions and manage incident response processes.
  • Collaborate with software development teams to enhance operational readiness and improve system performance.

CentralReach SME https://centralreach.com/
201 - 500 Employees

Job description

CentralReach is the #1 provider of SaaS solutions for autism and IDD care. Trusted by more than 150,000 users, we enable therapy providers, educators, and employers to scale the way they deliver Applied Behavior Analysis and related therapies with innovative technology, market-leading industry expertise, and world-class customer satisfaction. The Engineering Operations group at CentralReach builds the underlying technologies that power our Public and Private Cloud Platforms worldwide. The group is responsible for storage, data infrastructure, IT, observability systems, DevOps, SRE, provisioning, compute, orchestration platform, internal tools, internal platforms (laptops, networks, systems, etc.), and services: all the components that make up the CentralReach Platform.

If you have a passion for the future, thrive in an agile, fast-moving, ever-changing startup environment, welcome technical challenges of all shapes and sizes, have excellent interpersonal skills and a sense of humor, and enjoy rolling up your sleeves and jumping in, then read on! As a Sr. SRE, you will work closely with key stakeholders in Software Engineering to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership.

Key Accountabilities: 

  • Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, capacity planning, setting and maintaining SLOs, SLIs, and Error Budgets, and creating dashboards.
  • Analyze, troubleshoot, and resolve operational challenges to meet defined SLOs.
  • Manage site stability, performance, reliability, and maintain uptime for production environments.
  • Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs from usage patterns.
  • Strive for automation to reduce toil and increase development velocity.
  • Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
  • Identify changes to the product architecture from a reliability, performance, and availability perspective, using a data-driven approach.
  • Document resolution runbooks and standard operating procedures.
  • Actively look for opportunities to improve the availability and performance of the system by applying what is learned from monitoring and observation.
  • Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams.
  • Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana).

Desired Skills and Experience: 

  • Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
  • Strong experience with AWS and Infrastructure as Code (Terraform, CloudFormation).
  • Understanding of High Availability best practices in AWS.
  • Solid experience with Monitoring/APM/Observability tools (Splunk, New Relic, etc.).
  • Solid experience with Prometheus and Grafana.
  • Experience implementing observability plans around logs, metrics, and traces.
  • Extensive experience with Kubernetes, Helm, CI/CD, and config management tools such as Ansible and Chef.
  • Experience with release automation, system administration, and configuration management.
  • Experience with programming languages (Java, Python, Go, etc.).
  • Experience with scripting languages (Bash, PowerShell).
  • Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.

CentralReach was developed for Clinicians by Clinicians. The story of CentralReach begins in 2012 when the company’s founder, a practicing Board Certified Behavioral Analyst, decided there had to be a better way to manage her operations so she could spend more time on what mattered most: working with her clients and patients. To help ABA practices focus on what they do best, CentralReach launched the first iteration of its EMR and practice management platform. Today, under the leadership of Chris Sullens, an award-winning CEO in the technology space, CentralReach is committed to its mission of providing cutting-edge technology and services to help clinicians and educators produce superior client and patient outcomes. Already a market leader, CentralReach is expected to grow exponentially through its four core tenets: hire and develop great people; build industry-leading products; provide exceptional service to customers; and continuously invest in systems, processes, and infrastructure.

We value our employees and offer a robust benefits package including health and dental, paid time off, life insurance, disability coverage, and 401(k) matching. We also provide comprehensive onboarding, ongoing training, mentoring, and career pathing to help you develop your career. We pride ourselves on our fun and energetic environment that also provides our employees with a meaningful way to make a difference by helping clinicians and educators produce superior outcomes for children and adults with disabilities.

CentralReach will not contact you or schedule interviews via Facebook. Please note that social media (LinkedIn, Instagram, and Facebook) is a current sourcing tool for talent acquisition, and was used for our recent job fair through CentralReach company marketing, but we have a direct link to our website, where all viable jobs are listed and directly tracked to our company page.

Required profile

Spoken language(s):
English

Other Skills

  • Social Skills
  • Communication
  • Problem Solving
