Solid experience in DevOps and cloud infrastructure. Expertise in deploying machine learning models. Proficient in Kubernetes, Terraform, and network optimization. Continuous learner of recent technological advancements.
Key responsibilities:
Deploy and manage Large Language Models (LLMs).
Optimize computing infrastructure for high performance.
We are democratizing the payments industry in Brazil by empowering entrepreneurs through technological, inclusive, and life-changing solutions.
Based in Brazil, CloudWalk is a high-end global payment network built on modern technology and a proprietary blockchain, focused on bringing a revolution to the payment ecosystem for small and medium-sized businesses. As a unicorn, the company has provided its customers with more than R$ 1 billion in savings by charging fair fees on its transactions and is now present in more than 300,000 businesses across 5,000 Brazilian cities.
With investors such as Valor Capital Group, HIVE Ventures and Coatue, the company has already raised US$ 365.5 million in investments and R$ 3.4 billion in FIDCs for anticipation of receivables in its network of financial solutions. In 2022, it was the only Brazilian fintech to be featured in "The Retail Tech 100" ranking by CB Insights, in the "Protection Solutions for Payments and Frauds" category.
At CloudWalk, we're at the cutting edge of AI, pioneering the use of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to drive innovation. As an MLOps Engineer, you will play a critical role in operationalizing the visionary work of our LLM Data Scientists. Your expertise will ensure the smooth deployment, efficient management, and scalable performance of LLMs across our extensive infrastructure. Your contributions will turn advanced AI research into scalable, high-performance solutions, with a particular focus on optimizing network communication and parallel processing capabilities.
What You’ll Do:
Deploy and Manage LLMs: Employ Kubernetes, Terraform, and cloud services to deploy and scale LLMs efficiently, ensuring their adaptability to high-demand scenarios.
Optimize Computing Infrastructure: Focus on enhancing GPU utilization, distributed training, bandwidth efficiency between machines, and VPC connections to maximize system performance.
Leverage Cutting-Edge Technologies: Utilize libraries such as Hugging Face's Accelerate and PyTorch's torchrun to facilitate parallel training across multiple machines in a cluster, optimizing our AI models' training and inference processes.
Collaborate on Innovation: Partner with our R&D team to transition LLM and RAG technologies from conceptual stages to scalable, production-ready systems.
Monitor and Improve System Performance: Implement advanced monitoring and logging practices to ensure system reliability and performance, continuously seeking improvements.
Stay Updated on Industry Advances: Actively pursue the latest developments in MLOps, cloud computing, and AI technologies to implement innovative solutions and maintain our infrastructure's leading edge.
Technologies You Will Work With:
Kubernetes, Terraform, and cloud computing platforms for scalable AI model deployment.
CI/CD pipelines, Git for version control, and Bash scripting for operational efficiency.
Hugging Face's Accelerate and PyTorch's torchrun for parallel training and optimization across multiple machines.
A comprehensive understanding of network infrastructure to optimize bandwidth and secure VPC connections is essential.
What We Expect From You:
Technical Mastery: Solid experience with DevOps, cloud infrastructure, and deploying machine learning models. Expertise in network optimization and parallel computing is crucial.
Problem-Solving Mindset: The ability to navigate complex challenges, strategically manage resources, and improve system efficiency.
Collaborative Approach: Strong communication skills and the ability to contribute effectively within a dynamic, interdisciplinary team.
Lifelong Learner: A commitment to continuous learning, staying abreast of the latest technological advancements, and applying innovative solutions.
Why CloudWalk?
By joining CloudWalk, you become part of a team that's reshaping the future with technological innovations. We cherish creativity, teamwork, and a dedication to excellence. Here, your work contributes to a mission of driving forward technological advancements.
Dare to innovate, dare to impact, dare to join the Wolfpack. Apply now!
Required profile
Spoken language(s):
English