Key Facts

Remote From:

India

Full time

Senior (9-15 years)

English

Other Skills

• Communication
• Analytical Skills
• Leadership
• Mentorship

Requirements

9-15 years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles.
Deep expertise in Kubernetes, container orchestration, and production-grade Docker deployments.
Strong understanding of Infrastructure-as-Code (Terraform, CloudFormation, etc.).
Expertise in CI/CD automation and release management.

Roles & Responsibilities

Lead end-to-end DevOps strategy, including CI/CD pipelines, automation, infrastructure-as-code, and release engineering, while establishing reliability standards and operational governance.
Architect and manage large-scale Kubernetes environments for production workloads, optimize workloads across clusters for performance, reliability, and cost efficiency, and drive multi-cluster/multi-region deployments.
Own infrastructure cost visibility and savings initiatives, including rightsizing, reserved capacity planning, auto-scaling optimization, and workload scheduling; partner with finance for budgeting, forecasting, and reporting; create dashboards to track infrastructure ROI and spend trends.
Design and implement comprehensive observability using Grafana and related tools; build real-time dashboards, establish alerting, and drive incident response improvements; automate provisioning, deployments, scaling, and disaster recovery processes.

Job description

This role is for one of Weekday's clients.

Min Experience: 9 years

Location: Remote (India)

Job Type: Full-time

As a Staff Engineer, you will architect and evolve our DevOps ecosystem, champion cloud cost governance, and implement best-in-class container orchestration practices. You will work cross-functionally with engineering, security, and finance teams to ensure operational excellence while proactively managing infrastructure spend.

Requirements

Key Responsibilities

DevOps Leadership & Architecture

Lead end-to-end DevOps strategy, including CI/CD pipelines, automation, infrastructure-as-code, and release engineering.
Design scalable, resilient cloud-native architectures aligned with business growth.
Establish DevOps best practices, reliability standards, and operational governance.

Kubernetes & Containerization

Architect and manage large-scale Kubernetes environments for production workloads.
Optimize workloads across clusters for performance, reliability, and cost efficiency.
Build and maintain containerized applications using Docker and Kubernetes, ensuring portability and scalability.
Drive multi-cluster, multi-region deployments where necessary.

Cost Savings & Cost Planning

Own infrastructure cost visibility and optimization initiatives.
Implement cloud cost-saving strategies including rightsizing, reserved capacity planning, auto-scaling optimization, and workload scheduling.
Partner with finance teams for budgeting, forecasting, and cost planning.
Create dashboards and reporting mechanisms to track infrastructure ROI and spend trends.
Continuously identify inefficiencies and implement measurable cost-reduction initiatives without compromising performance.

Monitoring & Observability

Design and implement comprehensive monitoring systems using Grafana and related observability tools.
Build real-time dashboards for system health, performance metrics, and cost insights.
Establish alerting frameworks to minimize downtime and improve incident response.
Drive improvements in system reliability through data-driven monitoring and post-incident analysis.

Automation & Reliability

Automate provisioning, deployments, scaling, and recovery processes.
Improve system resilience, availability, and disaster recovery strategies.
Lead root cause analysis for major incidents and implement preventive measures.

Required Qualifications

9–15 years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles.
Deep expertise in Kubernetes, container orchestration, and production-grade Docker implementations.
Strong hands-on experience with Grafana, monitoring systems, and observability frameworks.
Proven track record in cost savings initiatives and infrastructure cost planning in cloud environments.
Experience designing highly available, scalable systems in AWS, Azure, or GCP.
Strong understanding of Infrastructure-as-Code (Terraform, CloudFormation, etc.).
Expertise in CI/CD automation and release management.
Solid knowledge of networking, security best practices, and cloud architecture patterns.

Preferred Attributes

Experience managing large-scale production environments with strict SLAs.
Strong analytical skills with the ability to translate technical metrics into financial impact.
Leadership mindset with experience mentoring engineers and influencing cross-functional teams.
Excellent communication and stakeholder management skills.