Senior Site Reliability Engineer

Work set-up: Full Remote

Penbrothers SME http://www.penbrothers.com/
201 - 500 Employees

Job description

About Penbrothers

Penbrothers is an HR & remote talent management partner and one of the fastest-growing companies in the Philippines. We provide talented Filipinos with global opportunities in high-growth startups and dynamic companies, from the comfort of their own homes.

About the Client

The client, a pioneer in medical recruitment, is seeking an experienced Tech Lead to drive their mission to enhance doctors' well-being. This is an opportunity to contribute your unique skills and expertise to create technology that truly matters, impacting lives on a daily basis.

About the Role

We are looking for a Senior SRE/DevOps Specialist to play a vital role in ensuring the reliability of our Salesforce and web/mobile application environments. You will work closely with our engineers to continually improve our platform in line with world-class best practices.

Service reliability and observability

  • Analysing resource utilization and forecasting capacity needs to ensure the system can handle expected traffic and workloads without performance issues.

  • Writing code and scripts to automate repetitive operational tasks, configuration management, and deployment processes to reduce human error and increase efficiency.

  • Managing changes to production systems and services, ensuring that new releases and configuration changes are rolled out with minimal disruption and risk.

  • Identifying and addressing performance bottlenecks, optimizing software and infrastructure to improve response times and reduce resource consumption.

  • Maintaining thorough documentation of systems, configurations, and incident response procedures to facilitate knowledge sharing and onboarding of new team members.

  • Defining and maintaining service level objectives (SLOs) that specify the acceptable level of service quality, such as uptime and latency, for a particular system or service.

  • Defining the service level indicators (SLIs), the key performance metrics used to measure the system's performance and reliability, such as error rates and response times.

  • Designing and implementing monitoring systems to track the SLIs and using alerting mechanisms to notify the team when the system deviates from its defined SLOs.

Incident management & Disaster recovery planning

  • Responding to and mitigating incidents that impact service availability or performance, following an incident management process, and conducting post-incident reviews to learn and improve.

  • Planning, implementing, and executing disaster recovery and backup strategies to ensure data and service availability in case of failures or disasters.

Security

  • Ensure systems and infrastructure are securely configured and hardened by default

  • Manage secrets, credentials, and access controls across environments

  • Monitor for security-related events and support incident response efforts

  • Maintain secure CI/CD pipelines and enforce safe deployment practices

Continuous Improvement

  • Continuously evaluating and improving system reliability, efficiency, cost optimization and automation to meet our evolving business needs and customer expectations.

  • Rationalizing, evaluating, and integrating third-party developer tooling and services.

  • Troubleshooting platform issues with development teams

  • Providing tooling support and access management for development teams

  • Staying ahead of the tech curve and bringing new tools and frameworks to the table

Required profile

Spoken language(s): English

Other Skills

  • Collaboration
  • Adaptability
  • Problem Solving
