Senior Site Reliability Engineer (SRE) (Brazil)

Work set-up:

Full Remote

Experience:

Senior (5-10 years)

Offer summary

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
At least 5 years of experience in DevOps, SRE, or similar roles.
Strong proficiency with cloud platforms such as AWS, GCP, or Azure.
Hands-on experience with infrastructure as code tools like Terraform or CloudFormation.

Key responsibilities:

Design and maintain scalable, highly available infrastructure for AI platforms.
Implement monitoring, alerting, and observability solutions to ensure system health.
Automate deployment, scaling, and management of cloud-native infrastructure.
Collaborate with development teams to build reliable and efficient AI systems.

Articul8 AI https://www.articul8.ai

51 - 200 Employees

Job description

About Us

Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.

Position Overview

We are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.

Key Responsibilities

Architect and maintain scalable, highly available infrastructure for our GenAI platform.
Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
Lead incident response efforts, conduct thorough postmortems, and champion continuous improvement initiatives.
Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
Implement and enforce security best practices across all systems and environments.
Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

Qualifications

Required

Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
5+ years of experience in DevOps, SRE, or similar roles
Strong experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
Solid background in containerization technologies (Docker, Kubernetes)
Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
Strong understanding of CI/CD pipelines and automation
Exceptional troubleshooting and problem-solving skills, particularly with complex systems

Preferred

Experience supporting AI/ML systems in production
Knowledge of GPU infrastructure management and optimization
Familiarity with distributed systems and high-performance computing
Experience with database systems (SQL and NoSQL)
Certifications in cloud platforms (AWS, GCP, Azure)
Experience with chaos engineering and resilience testing
Knowledge of security best practices and compliance requirements

Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!

Required profile

Experience

Level of experience: Senior (5-10 years)

Spoken language(s):

English

Other Skills

Troubleshooting (Problem Solving)
Problem Solving
