Role overview

Qualifications

4+ years of experience operating large-scale systems
Experience with GCP or other public cloud platforms
Experience with Kubernetes (GKE) in production
Strong troubleshooting and communication skills

Responsibilities

Participate in 24/7 on-call rotations for core infrastructure systems
Execute incident response during production events, including triage, mitigation, and recovery
Improve reliability of core infrastructure components including: Kubernetes (GKE) clusters, Cloud networking and load balancing
Automate repetitive operational and security tasks

Key facts

Remote from: India
Full time
Senior (5-10 years)
DevOps Engineer
English

Hard skills

Kubernetes, gRPC, Incident Response, Reliability Engineering, Standard Operating Procedure, Capacity Planning, Cloudflare, Scripting, Vulnerability Management

Other skills

Troubleshooting (Problem Solving)
Communication
Collaboration
Mentorship

About the company

HighLevel

Information Technology & Services

One white-labeled marketing app to rule them all. HighLevel is everything your agency needs to succeed! Capture leads using our landing pages, surveys, forms, calendars, inbound phone system & more! Automatically message leads via voicemail, forced calls, SMS, emails, FB Messenger & more! Use our built-in tools to collect payments, schedule appointments, and track analytics!

Company details

Company type: Scaleup

Industry: Information Technology & Services

Company size: 201 - 500

Job description

About HighLevel:
HighLevel is an AI-powered business operating system that gives agencies, entrepreneurs and SMBs the infrastructure to build, automate and scale. Today, HighLevel supports SMBs across 150+ countries, fueling community-driven growth rooted in real customer outcomes.

To date, businesses operating on HighLevel have generated over $7 billion in ecosystem value, demonstrating the impact of shared infrastructure at scale. By centralizing conversations, automation and intelligence into one system, we help businesses move faster, reduce complexity and execute efficiently.

Behind the platform, HighLevel powers more than 4 billion API hits and 2.5 billion message events daily. With 250 terabytes of distributed data, 250+ microservices and over 1 million domain names supported, our architecture is built for performance, resilience and long-term scalability.

Our People:
With over 2,000 team members across 10+ countries, HighLevel operates as a global, remote-first organization built for speed and ownership. We value initiative, clarity and execution, creating space for ambitious people to build systems that support millions of businesses worldwide. Here, innovation thrives, ideas are celebrated and people come first, no matter where they call home.

Our Impact:
Every month, HighLevel enables more than 1.5 billion messages, 200 million leads and 20 million conversations for the more than 1 million businesses we support. Behind those numbers are real people building independence, expanding opportunity and creating measurable impact. We’re proud to be a part of that.

Learn more about us on our YouTube Channel or Blog Posts

About the Role:
We are seeking SDE3 engineers to join HighLevel’s Core Infrastructure SRE Operations & Security team. This role focuses on operating, securing, and improving HighLevel’s production infrastructure, with responsibilities spanning on-call operations, incident response, reliability engineering, and security remediation.

You will work closely with Cloud Infrastructure, Platform Engineering, Data Infrastructure, and Security teams to ensure systems are stable, resilient, and secure. This is a hands-on role with a strong operational and security mindset, critical to HighLevel’s platform maturity.

Responsibilities:

Production Operations & Reliability:
-> Participate in 24/7 on-call rotations for core infrastructure systems
-> Execute incident response during production events, including triage, mitigation, and recovery
-> Maintain and improve runbooks, operational procedures, and escalation paths
-> Help reduce MTTR and prevent repeat incidents through engineering solutions

Infrastructure Reliability Engineering:
-> Improve reliability of core infrastructure components, including Kubernetes (GKE) clusters, cloud networking and load balancing, and edge services (Cloudflare)
-> Identify systemic reliability issues and drive corrective actions
-> Support capacity planning, scaling, and resilience testing

Security Operations & Remediation:
-> Execute security remediations across cloud and Kubernetes environments
-> Support enforcement of IAM least-privilege access, network security controls, and runtime security policies
-> Partner with Platform Security on vulnerability management and remediation
-> Support security incident response and post-incident reviews

Automation & Tooling:
-> Automate repetitive operational and security tasks
-> Build tooling to improve incident response speed, operational visibility, and security posture enforcement
-> Reduce manual toil through scripts, tooling, and process improvements

Change Management & Governance:
-> Support safe execution of infrastructure and configuration changes
-> Ensure changes follow defined change management and audit requirements
-> Contribute to incident reviews, postmortems, and continuous improvement initiatives

Collaboration & Growth:
-> Work closely with Cloud Infrastructure, SRE, Platform, Data, and Security teams
-> Contribute to shared documentation and operational standards
-> Mentor junior engineers and lead small reliability or security initiatives

Requirements:

4+ years of experience operating large-scale systems

Experience with GCP or other public cloud platforms

Experience with Kubernetes (GKE) in production

Ability to identify systemic issues and propose long-term fixes

Experience leading incident response or reliability initiatives

Strong understanding of reliability, security, and operational best practices

Comfortable working in on-call and incident response environments

Strong troubleshooting and communication skills

Experience supporting or operating production systems

Comfortable mentoring junior engineers and influencing peers

Nice to have:

Familiarity with Cloudflare, networking, or edge security

Exposure to security tooling or vulnerability management

Scripting or automation experience (Python, Go, Bash, etc.)

Experience in compliance- or audit-driven environments (SOC2, ISO)

EEO Statement:
The company is an Equal Opportunity Employer. As an employer subject to affirmative action regulations, we invite you to voluntarily provide the following demographic information. This information is used solely for compliance with government record-keeping, reporting, and other legal requirements. Providing this information is voluntary and refusal to do so will not affect your application status. This data will be kept separate from your application and will not be used in the hiring decision.

We encourage you to review our Privacy Policy before submitting your application.

