BS/MS in Computer Science or a related field (PhD preferred), 5-7+ years of experience managing Kubernetes in large-scale production environments, 5-7+ years of experience building cloud-native infrastructure across AWS, Azure, and GCP, hands-on experience with IaC tools such as Terraform and CloudFormation.
Key responsibilities:
Architect scalable systems for the Kumo platform to handle large datasets
Build and automate CI/CD pipelines for continuous delivery across cloud providers
Develop infrastructure microservices for usage tracking, diagnostics, and monitoring
Lead automation efforts to streamline global deployment and manage model lifecycles
Democratizing AI on the Modern Data Stack
The team behind PyG (PyG.org) is working on a turnkey solution for AI over large-scale data warehouses. We believe the future of ML is a seamless integration between modern cloud data warehouses and AI algorithms. Our ML infrastructure massively simplifies the training and deployment of ML models on complex data.
With over 40,000 monthly downloads and nearly 13,000 GitHub stars, PyG is the ultimate platform for training and development of Graph Neural Network (GNN) architectures. GNNs, currently one of the hottest areas of machine learning, are a class of deep learning models that generalize Transformer and CNN architectures and enable us to apply the power of deep learning to complex data. GNNs are unique in the sense that they can be applied to data of different shapes and modalities.
The Cloud Infrastructure team at Kumo manages the Kubernetes-based, cloud-native Kumo AI platform. They define service level objectives, ensure capacity, maintain cost visibility, and uphold security compliance for the Multi-Cloud Platform.
As a key team member, you will architect scalable systems for the Kumo platform, making it the top choice for Big Data and AI workloads. Joining early, you'll design the platform to handle large datasets, enhancing productivity for engineers and users. Collaborating with ML scientists, product engineers, and leaders, you'll influence how we scale our ML technology, develop tools for speed, and craft full-stack experiences. Engineers at Kumo wear many hats, leading the design of core systems from scratch and shaping product direction. You'll dive into foundational work, managing model lifecycles, ML Ops, CI/CD, and deployment strategies.
The Value You'll Add:
Build and extend components of the core Kumo Cloud Infrastructure and Kumo infrastructure
Define a culture of engineering excellence and operational efficiency, especially as it relates to development and productization
Build and automate CI/CD pipelines and release tooling to support continuous delivery and true zero-downtime deployments across different cloud providers using the latest cloud-native technologies
Work on advanced tools developed for the world’s leading cloud-native machine learning engine that uses graph deep learning technology
Develop the infrastructure microservices for features such as usage tracking, diagnostics, monitoring, and alerting at cloud scale
Lead automation efforts to streamline global deployment
Build the Kumo ML Ops platform, which will be able to detect data drift, track model versions, report on production model performance, alert the team of any anomalous model behavior, and run programmatic A/B tests on production models
Your Foundation:
Education: BS/MS in Computer Science or a related field (PhD preferred)
Kubernetes Expertise: 5-7+ years of experience managing Kubernetes (e.g., EKS, GKE, AKS, or open source) in large-scale production environments, with deep knowledge of Kubernetes internals, controllers, operators, networking, and connectivity.
Cloud Infrastructure: 5-7+ years of experience building cloud-native infrastructure across AWS, Azure, and GCP.
Platform Engineering: 5-7+ years of experience developing platform engineering services using tools like Traefik, Istio/Envoy, and Calico/Tigera.
Software Development: 5-7+ years of experience writing production code in Python, Go, Rust, or similar languages.
Infrastructure-as-Code: Hands-on experience with IaC tools such as Terraform, CloudFormation, Ansible, Chef, and Bash scripting.
B2B SaaS & Distributed Systems: Experience in architecting large-scale distributed systems for B2B SaaS applications.
Cloud Application Deployment: Strong background in productionizing cloud applications, including Docker and Kubernetes.
CI/CD & Automation: Experience with CI/CD pipelines, advanced packaging, versioning, deployment orchestration, and infrastructure provisioning strategies.
Your Extra Special Sauce:
Experience with popular MLOps tooling from cloud vendors, such as GCP (Vertex AI), AWS (SageMaker), or Azure Machine Learning, as well as MLflow, Kubeflow, etc.
Experience with managing popular data platforms such as AWS EMR, Snowflake, Databricks, etc.
Experience with industry-standard security practices, such as security testing, vulnerability assessments, ISO 27001, and governance, risk, and compliance (GRC)
Extensive experience with Docker/Containers, Jenkins/Flux/Argo, and Terraform in a Linux environment
Experience with monitoring tools such as Prometheus, Grafana, etc.
Proficiency in developing customer-facing Web Front Ends or public APIs/SDKs for the application
Benefits:
Stock
Competitive Salaries
Medical Insurance
Dental Insurance
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Required profile
Spoken language(s):
English