Location: Remote (Global)
Type: Full-time
Company: Yotta Labs
Apply: careers@yottalabs.ai
🧠 About Yotta Labs
Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.
🛠️ Role Overview
We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring the high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.
🎯 Responsibilities
Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot production issues.
Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems; define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints and tolerations) to improve GPU resource utilization.
Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alerting mechanisms.
Coordinate with IDC providers on planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.
✅ Qualifications
Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or a related field; 3+ years of experience in systems engineering or DevOps.
5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrades, scaling, and disaster recovery.
Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
Understanding of high-performance interconnects and communication protocols such as InfiniBand, RoCE, NVLink, and PCIe.
Strong communication, self-motivation, and team collaboration skills.
🌟 Preferred Experience
Experience in developing and operating MaaS (Model-as-a-Service) platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.
Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the H100.
Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks such as memory fragmentation and low compute efficiency.
Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks such as Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.
🌐 Why Join Yotta Labs?
Be part of a visionary team aiming to redefine AI infrastructure.
Work on cutting-edge technologies that bridge AI and decentralized computing.
Collaborate with experts from leading institutions and tech companies.
Enjoy a flexible, remote work environment that values innovation and autonomy.
📩 How to Apply
Interested candidates should apply directly or send their resume and a brief cover letter to careers@yottalabs.ai. Please include links to any relevant projects or contributions.