Site Reliability Engineer

Remote: Full Remote
Work from: South Africa

Offer summary

Qualifications:

Bachelor's degree in Computer Science or a related field; experience with programming languages, cloud platforms, monitoring tools, and databases.

Key responsibilities:

  • Ensure infrastructure reliability and performance
  • Automate aspects of infrastructure and operations
  • Manage data security and disaster recovery

Job description

Job Overview


The Site Reliability Engineer (SRE) will apply software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.

You’ll be exposed to unique challenges assisting with the maintenance and stability of our global and local infrastructure, and have the opportunity to contribute to internal and external open source projects. We believe in choosing the right tools for the job, and support creativity in solving problems.


Key Focus Areas


You will primarily be responsible for:

  • Infrastructure reliability and performance:
    • Monitoring, measuring, and improving the reliability and performance of our systems
    • Maintenance, upgrades, and security updates
  • Automation and tooling:
    • Designing and developing software and scripts that automate and streamline various aspects of infrastructure and operations
  • Assisting other teams with deployment and updates of their applications and services
  • Supporting the internal management of the organisation's technological infrastructure
  • Data Management & Security:
    • Working with the SRE, Data Security, Legal, and project teams to develop and enforce policies and procedures for data collection, storage, and access; ensuring compliance with data privacy regulations; implementing and monitoring security measures to protect sensitive health information; and managing data backups and disaster recovery.


Responsibilities and Duties


Your primary responsibilities will include but not be limited to:

  • Assisting with the resources that facilitate engineering services, and keeping them operational. This includes continuous integration systems, software deployment, basic troubleshooting of code, and the creation and management of software repositories.
  • Ensuring servers are patched against security exploits promptly, managing secure access to servers and repositories for partners and internal staff, and securing the interconnections between systems.
  • Ensuring servers are configured in a documented and repeatable way. 
  • Ensuring system and server architecture is appropriate to the requirements of projects, easily maintainable in the long term, and provides appropriate levels of redundancy.
  • Provide timeous uptime assurance, and support issue investigation and recovery procedures.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems, ensuring adoption of best practices.
  • Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team, and assist other engineering teams with the same.
  • General support (problems, password changes, etc.) for office infrastructure and services such as Google Workspace, Slack, and PPM Pro.
  • Site load testing, unit testing, disaster recovery testing, and quality assurance on a system level including backend performance, deployment sanity, security, scalability and stability. 
  • Providing data security expertise within SRE and supporting the organisation and its projects in complying with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.
  • Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
  • Commit to writing testable software.
  • Work well within cross-functional teams in order to produce world-class products and programmes that empower end users.


Qualifications


  • A bachelor’s degree in Computer Science, Engineering or related field, or equivalent experience.


Skills and Experience Required

 

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
  • Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
  • Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
  • Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
  • Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
  • Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.

Required profile

Spoken language(s): English