Key Facts

Remote From: Canada

Full time

Senior (5-10 years)

English

Hard Skills

Other Skills

• Decision Making
• Accountability
• Communication
• Leadership
• Plan Execution
• Problem Solving

Requirements:

5+ years in SRE, DevOps, or infrastructure engineering roles
Infrastructure-as-code proficiency with Terraform (modules, state management, multi-environment patterns)
Deep AWS experience including EKS, EC2, IAM, S3, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA
Kubernetes expertise (cluster operations, node pools, autoscaling, Helm, RBAC) and GitOps/CD tooling experience (ArgoCD, GitHub Actions, Jenkins)

Roles & Responsibilities

Own and evolve AWS-based infrastructure and EKS operations across production regions; optimize performance and reliability; manage node pools, AMI lifecycles, autoscaling, and workload health; support GitOps deployments via infrastructure-as-code
Lead incident response including SLO measurement, incident investigations, root cause analysis, postmortems, and automated remediation to reduce MTTR
Ensure security and compliance through IAM governance, architecture reviews, audit-readiness, partner certifications, and responses to security questionnaires
Collaborate with cross-functional teams (application, customer, finance) to deliver secure, highly available platforms; support GPU/batch workloads with simulation/ML teams

Parallel Domain

About Parallel Domain

Training and testing autonomous systems in the real world is a slow, expensive and cumbersome process. Parallel Domain is the smartest way to prepare both your machines and human operators for the real world, while minimizing the time and miles spent there. Connect to the Parallel Domain API and tap into the power of synthetic data to accelerate your autonomous system development. Parallel Domain works with perception, machine learning, data operations, and simulation teams at autonomous systems companies, from autonomous vehicles to delivery drones. Our platform generates synthetic labeled data sets, simulation worlds, and controllable sensor feeds so they can develop, train, and test their algorithms safely before putting these systems into the real world.

Company type: Scaleup

Founded: 2018

Company size: 51 - 200


Job description

About the Role

Parallel Domain is looking for a Principal Site Reliability Engineer to own the reliability, scalability, and security of our cloud infrastructure - the backbone that runs simulation workloads for some of the most demanding customers in autonomous vehicle development.

This is a hands-on, high-ownership role. You'll be the primary infrastructure owner across our multi-region AWS/EKS platform, working closely with a small platform engineering team and partnering with engineering leads across simulation and ML as well as our customer-facing teams.

What You'll Do

Infrastructure Ownership & Cloud Operations

Own and evolve our AWS-based infrastructure, improving platform performance and availability today, and building toward deployable configurations that support enterprise customer environments tomorrow.
Own EKS cluster operations across production regions: node pool strategy, AMI lifecycle, autoscaling, and Kubernetes workload health.
Support the GitOps deployment pipeline - define, deploy, and manage applications across clusters using infrastructure-as-code.
Manage complex networking: VPC design, cross-region connectivity, DNS, and load balancing.
Lead infrastructure deprecation and migration efforts with minimal disruption.

Reliability Engineering & Incident Response

Own SLO measurement infrastructure; enable proactive triage of emerging issues before they impact customers.
Lead incident investigation, root cause analysis and postmortems, driving systemic fixes rather than one-off patches.
Design and improve automated remediation systems to reduce MTTR.

Security & Access Management

Review and provide security-conscious feedback on platform architecture decisions.
Own cloud IAM governance - roles, policies, and access boundaries across accounts and services.
Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.

Cross-Functional Collaboration

Partner with application development teams to build an inherently secure platform and drive next-generation deployment architecture.
Partner with customer teams to ensure availability for expected utilization.
Partner with Finance on cloud cost optimization - lifecycle policies, right-sizing, and spend visibility.
Support GPU and batch workloads in collaboration with simulation and ML engineering teams.

Platform Tooling & Developer Experience

Improve CI/CD pipelines and automated infrastructure validation.
Support engineering teams with infra-side debugging, log analysis, and environment configuration.

What We're Looking For

Technical Depth

5+ years in SRE, DevOps, or infrastructure engineering roles.
Infrastructure-as-code proficiency - Terraform modules, state management, and multi-environment patterns.
Deep AWS experience - EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA.
Kubernetes expertise - cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.
CI/CD - experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).
Solid networking fundamentals - CIDR design, security groups, DNS, load balancing, VPN, cross-region connectivity.
Experience with monitoring and observability tooling - Prometheus, Grafana, Elasticsearch.
Comfort with Python and Bash for tooling and automation.
Familiarity working across Linux and Windows environments. Operational familiarity with Windows Server is a meaningful advantage.

Communication & Ownership

You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact.
You advocate for SRE best practices and can effectively operationalize an informed and principled view on security.
You take end-to-end ownership of complex, multi-team efforts - from planning through execution and post-change verification.
You know when to push for a clean solution vs. when to accept a pragmatic one, and you communicate that tradeoff clearly.

Nice to Have

Experience with Windows-based workloads on EKS.
Experience supporting simulation, ML, or rendering workloads in cloud infrastructure; running GPU workloads on Kubernetes, including NVIDIA and DirectX device plugin configuration.
Experience with AWS Storage Gateway or Transfer Family integrations.
Familiarity with Envoy Gateway or similar.
Experience with container-optimized OS images and image build tooling (e.g., Bottlerocket, Packer).
Experience with cloud cost optimization at scale.

Core Tools

Terraform · AWS · Kubernetes · Helm · ArgoCD · Kustomize · Grafana · Prometheus · Elasticsearch · VictoriaLogs · Fluent Bit · GitHub Actions · Jenkins · Docker · Python · Bash

Why This Role

PD's simulation platform runs at the intersection of high-performance compute, distributed systems, and customer-critical reliability. The infrastructure problems here are genuinely interesting - multi-region GPU scheduling, Windows workloads on Kubernetes, startup latency optimization, and an enterprise product direction that will require rethinking how we deploy and manage the platform entirely.

The Principal SRE role at PD is not a ticket-taking job - it's a high-trust, high-autonomy position where you'll have genuine influence over infrastructure architecture, cross-team process, and customer experience.
