Role overview

Key facts

Remote from: India
Full time
Senior (5-10 years)
Site Reliability Engineer (SRE)
English

Hard skills

Site Reliability Engineering
DevOps
Observability
gRPC
Amazon Web Services
Terraform
Kubernetes
Prometheus (Software)
Grafana
Telemetry
Python (Programming Language)
Bash (Scripting Language)
Jenkins
Argo CD
Incident Management

Other skills

Collaboration
Problem Solving
Troubleshooting (Problem Solving)

About the company

HighLevel

Information Technology & Services

One white-labeled marketing app to rule them all. HighLevel is everything your agency needs to succeed! Capture leads using our landing pages, surveys, forms, calendars, inbound phone system & more! Automatically message leads via voicemail, forced calls, SMS, emails, FB Messenger & more! Use our built-in tools to collect payments, schedule appointments, and track analytics!

Company details

Company type: Scaleup

Industry: Information Technology & Services

Company size: 201 - 500

Job description

About HighLevel:
HighLevel is an AI-powered business operating system that gives agencies, entrepreneurs and SMBs the infrastructure to build, automate and scale. Today, HighLevel supports SMBs across 150+ countries, fueling community-driven growth rooted in real customer outcomes.

To date, businesses operating on HighLevel have generated over $7 billion in ecosystem value, demonstrating the impact of shared infrastructure at scale. By centralizing conversations, automation and intelligence into one system, we help businesses move faster, reduce complexity and execute efficiently.

Behind the platform, HighLevel powers more than 4 billion API hits and 2.5 billion message events daily. With 250 terabytes of distributed data, 250+ microservices and over 1 million domain names supported, our architecture is built for performance, resilience and long-term scalability.

Our People:
With over 2,000 team members across 10+ countries, HighLevel operates as a global, remote-first organization built for speed and ownership. We value initiative, clarity and execution, creating space for ambitious people to build systems that support millions of businesses worldwide. Here, innovation thrives, ideas are celebrated and people come first, no matter where they call home.

Our Impact:
Every month, HighLevel enables more than 1.5 billion messages, 200 million leads and 20 million conversations for the more than 1 million businesses we support. Behind those numbers are real people building independence, expanding opportunity and creating measurable impact. We’re proud to be a part of that.

Learn more about us on our YouTube Channel or Blog Posts.

About the Role:

We are looking for a Site Reliability Engineer (SRE) to join our team and help ensure the availability, performance, and scalability of our critical systems. You will work closely with development and operations teams to automate processes, enhance system reliability, and improve observability.

Responsibilities:

Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).

Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues.

Collaborate with developers to enhance application reliability, scalability, and performance.

Drive cost optimization efforts in cloud environments.

Work with multiple databases and data stores, including MongoDB, Redis, Elasticsearch (ES), and queue-based systems.

Requirements:

Experience: 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.

Cloud Expertise: Hands-on experience with GCP and AWS.

Infrastructure as Code (IaC): Terraform, Helm, or equivalent tools.

Containerization & Orchestration: Docker, Kubernetes (GKE).

Observability: Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools.

Programming/Scripting: Proficiency in Python, Bash, or Shell scripting. Basic understanding of API parsing and JSON manipulation.

CI/CD Pipelines: Hands-on experience with Jenkins, GitHub Actions, Argo CD, or similar tools.

Incident Management: Experience with on-call rotations, SLOs, SLIs, SLAs, escalation policies, and incident resolution.

Databases: Experience monitoring MongoDB, Redis, Elasticsearch (ES), and queue-based systems.

EEO Statement:
The company is an Equal Opportunity Employer. As an employer subject to affirmative action regulations, we invite you to voluntarily provide the following demographic information. This information is used solely for compliance with government record-keeping, reporting, and other legal requirements. Providing this information is voluntary and refusal to do so will not affect your application status. This data will be kept separate from your application and will not be used in the hiring decision.

We encourage you to review our Privacy Policy before submitting your application.
