DevOps / Platform Engineer (4631)

Key Facts

Remote From: 
Full time
Senior (5-10 years)
English

Other Skills

  • Communication
  • Teamwork
  • Problem Solving

Roles & Responsibilities

  • Design, build, and operate cloud-agnostic infrastructure, CI/CD pipelines, and a developer platform to support AI/digital innovation across AWS, Azure, and GCP
  • Lead end-to-end CI/CD engineering including multi-stage pipelines with automated tests and gates, deployment strategies (blue/green, canary, rolling updates), artifact management, and GitOps
  • Design and operate ML training and serving infrastructure, model registry, experiment tracking, AI monitoring, and related components (including vector databases for RAG and LLM gateway layers)
  • Implement security, compliance, and reliability practices through security-as-code, zero-trust networking, secrets management, IAM, audit logging, and automated compliance controls

Requirements:

  • 6+ years of experience in DevOps, SRE, or platform engineering, with at least 2 years supporting AI/ML workloads in production
  • Expert-level experience with infrastructure-as-code using Terraform (primary), with exposure to Pulumi, CloudFormation, or Bicep
  • Production experience with Kubernetes (EKS, AKS, or GKE) including cluster management, Helm charts, operators, auto-scaling, and troubleshooting
  • Deep experience with CI/CD pipeline design and implementation (GitHub Actions, GitLab CI, Azure DevOps Pipelines, or Jenkins) including multi-stage pipelines with automated quality gates

Job description


Come work for a large global financial and insurance products company! This is your chance!


Start a successful career in a renowned company in the international market! Great opportunity!


Global insurance and asset management company seeks a responsible, organized, dynamic and team-oriented person.



RESPONSIBILITIES AND ASSIGNMENTS


Role Summary

We are seeking a Senior DevOps / Platform Engineer to design, build, and operate the cloud infrastructure, CI/CD pipelines, and developer platform that underpin our AI and digital innovation initiatives. This is a cloud-agnostic role — you will architect infrastructure and platform capabilities that work across AWS, Azure, and GCP, ensuring our engineering teams can build, deploy, and operate AI-powered applications with speed, security, and reliability.


A distinguishing aspect of this role is the MLOps dimension. You will build and maintain the infrastructure for AI/ML model lifecycle management: training environments, model serving, experiment tracking, automated evaluation, and production monitoring. You will ensure that deploying an AI model to production is as reliable, repeatable, and observable as deploying a traditional software service. 


Key Responsibilities


CI/CD Pipeline Engineering

  • Design and maintain end-to-end CI/CD pipelines for all engineering workstreams: application code, infrastructure-as-code, AI/ML models, data pipelines, and automation scripts;
  • Build multi-stage deployment pipelines with automated testing gates: unit tests, integration tests, security scans (SAST/DAST/SCA), AI model evaluation, and infrastructure validation;
  • Implement deployment strategies: blue/green, canary, rolling updates, and feature flags — for both traditional services and AI model endpoints;
  • Design and maintain artifact management: container registries, model registries, package repositories, and versioned infrastructure modules;
  • Build pipeline observability: deployment frequency tracking, lead time for changes, change failure rate, and mean time to recovery (DORA metrics);
  • Implement GitOps workflows using ArgoCD, Flux, or equivalent for declarative infrastructure and application deployment.


Cloud Infrastructure (Cloud-Agnostic)

  • Design and maintain cloud infrastructure across AWS, Azure, and/or GCP — with emphasis on portability and avoiding deep vendor lock-in where practical;
  • Implement infrastructure-as-code using Terraform (primary), Pulumi, or CloudFormation/Bicep with modular, reusable, and well-tested infrastructure modules;
  • Design and operate Kubernetes clusters (EKS, AKS, GKE) for containerized workloads — including AI model serving, API services, and batch processing;
  • Build and manage serverless compute infrastructure (Lambda, Azure Functions, Cloud Functions) for event-driven workflows and lightweight AI inference;
  • Implement cloud cost optimization: right-sizing, reserved capacity planning, spot/preemptible instance strategies, and automated cost monitoring and alerting;
  • Design multi-environment strategies: development, staging, production — with proper isolation, data governance, and promotion workflows.


Security & Compliance Infrastructure

  • Implement security-as-code: infrastructure security policies (Checkov, tfsec, Sentinel), container image scanning (Trivy, Snyk), and runtime security monitoring;
  • Design and enforce zero-trust networking: service mesh (Istio, Linkerd), network policies, private endpoints, and API gateway security;
  • Implement secrets management using HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or equivalent;
  • Build and maintain identity and access management: service accounts, workload identity, least-privilege IAM policies, and RBAC for Kubernetes and cloud resources;
  • Ensure infrastructure compliance with SOC 2, ISO 27001, GDPR, and industry-specific regulations;
  • Implement audit logging, security alerting, and automated compliance scanning across all infrastructure.


MLOps & AI Infrastructure

  • Design and build ML training infrastructure: GPU/TPU compute provisioning, distributed training support, and experiment tracking (MLflow, Weights & Biases);
  • Build model serving infrastructure: containerized model endpoints, auto-scaling (including GPU-based scaling), A/B testing, and model routing;
  • Implement model registry and lifecycle management: model versioning, staging, approval workflows, and automated deployment pipelines;
  • Build AI-specific monitoring: model latency, throughput, error rates, input/output drift detection, and token usage cost tracking;
  • Design and operate vector database infrastructure for RAG systems: deployment, scaling, backup, and disaster recovery;
  • Implement LLM gateway/proxy infrastructure: centralized API routing, rate limiting, cost controls, caching, and provider failover.


Reliability & Observability

  • Design and implement comprehensive observability stack: metrics (Prometheus/Grafana, Datadog), logs (ELK, Loki, CloudWatch), traces (Jaeger, OpenTelemetry), and AI-specific monitoring;
  • Build and maintain alerting systems with proper escalation policies, runbooks, and automated remediation where possible;
  • Implement SLI/SLO frameworks for all production services — including AI model endpoints — with error budget tracking;
  • Design disaster recovery and business continuity plans: multi-region deployment, data replication, backup strategies, and failover testing;
  • Build chaos engineering practices: fault injection, game days, and resilience testing for both infrastructure and AI systems;
  • Maintain incident management processes: on-call rotations, incident response playbooks, and post-incident review facilitation.


Developer Experience & Platform

  • Build and maintain an Internal Developer Platform (IDP) that enables self-service infrastructure provisioning, environment management, and deployment;
  • Design developer workflows: local development environments (dev containers, Codespaces), preview environments, and rapid feedback loops;
  • Build and maintain developer documentation: architecture decision records (ADRs), runbooks, onboarding guides, and platform usage guidelines;
  • Implement platform abstractions that reduce cognitive load on application developers while maintaining flexibility for power users;
  • Design and operate shared services: database provisioning, cache infrastructure, message queue clusters, and monitoring stack.



REQUIREMENTS AND QUALIFICATIONS


Required Qualifications / Skills


  • 6+ years of experience in DevOps, SRE, or platform engineering, with at least 2 years supporting AI/ML workloads in production;
  • Expert-level experience with infrastructure-as-code: Terraform (primary), with exposure to Pulumi, CloudFormation, or Bicep;
  • Production experience with Kubernetes (EKS, AKS, or GKE): cluster management, Helm charts, operators, auto-scaling, and troubleshooting;
  • Deep experience with CI/CD pipeline design: GitHub Actions, GitLab CI, Azure DevOps Pipelines, or Jenkins — including multi-stage pipelines with automated quality gates;
  • Strong cloud infrastructure experience across at least two of: AWS, Azure, GCP — with hands-on skills in networking, compute, storage, identity, and security services;
  • Proficiency in scripting and automation: Python, Bash, PowerShell, and at least one of: Go, TypeScript;
  • Experience building observability stacks: Prometheus, Grafana, Datadog, ELK, OpenTelemetry, and alerting/on-call systems (PagerDuty, Opsgenie);
  • Strong understanding of security engineering: secrets management, network security, IAM, container security, and compliance automation;
  • Experience with GitOps practices and tools: ArgoCD, Flux, or equivalent;
  • Fluent English, both written and spoken;
  • Proven experience in international projects, including collaboration with global and multicultural teams;
  • Strong communication, stakeholder management, and problem-solving skills;
  • Previous experience mentoring engineers or acting as a technical lead is strongly preferred.


Preferred Qualifications


  • Hands-on MLOps experience: model serving (vLLM, TensorRT, Triton Inference Server, SageMaker Endpoints, Azure ML), model registries (MLflow, Weights & Biases), and GPU infrastructure management;
  • Experience building LLM gateway/proxy infrastructure: LiteLLM, AI Gateway, or custom routing layers;
  • Familiarity with platform engineering tools: Backstage, Port, Humanitec, or custom developer portals;
  • Experience with service mesh technologies: Istio, Linkerd, or Consul Connect;
  • Knowledge of FinOps practices: cloud cost management, tagging strategies, showback/chargeback models;
  • Experience in insurance, financial services, or other regulated industries with strict compliance requirements;
  • Certifications: CKA/CKAD (Kubernetes), AWS Solutions Architect / DevOps Engineer, Azure DevOps Engineer Expert, HashiCorp Terraform Associate;
  • Experience with chaos engineering tools: Chaos Monkey, Litmus, Gremlin;
  • Familiarity with edge/hybrid deployment patterns for AI models;
  • Experience building and operating data platform infrastructure: Spark clusters, Kafka, Airflow/Prefect deployments.


Base Requirements


  • DevOps Experience | All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
  • Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles — not just DevOps;
  • Cloud Platforms | Working knowledge of at least one major cloud provider (AWS, Azure, or GCP);
  • Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.


Education

  • Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field is preferred.



ADDITIONAL INFORMATION


Contract type:

  • PJ (Pessoa Jurídica, i.e. contracting through a legal entity)


Work arrangement:

  • 100% remote



WELCOME TO KEEP SIMPLE 👇🏽


We are an IT consulting company with more than 10 years in the market and a team of tech recruiting specialists. Our process is 100% focused on the experience of the person who matters most: the candidate.


We choose to make a difference, and we are proud to say that everyone who comes through Keep Simple feels special. We have a relaxed, collaborative environment, and we practice agile for real.


Be part of our story, #vemprakeep 💙🚀

