Distinguished Engineer - AI Systems & Machine Learning Infrastructure

Remote: Full Remote

Contract: Full time

Experience: Senior (5-10 years)

Work from: New York (USA), United States

Offer summary

Qualifications:

Bachelor’s degree in Computer Science or a related field; 7 years of experience in distributed computing and ML systems; 5 years of experience developing AI and ML algorithms in Python or C/C++; 3 years of experience with the ML lifecycle in the public cloud. Master's degree or PhD with a focus on AI techniques preferred.

Key responsibilities:

Collaborate to design secure AI infrastructure
Develop large-scale training clusters and deploy LLMs
Envision future AI system capabilities and developments
Create fault-tolerant infrastructure for long-running tasks
Implement infrastructure for serving large models

Get It Recruit - Information Technology
Human Resources, Staffing & Recruiting
TPE
https://www.get.it/

2 - 10 Employees

Job description

We are on a mission to revolutionize the banking industry through trustworthy, reliable, and human-centered AI systems. Our team has led the way in utilizing machine learning to create real-time, intelligent, and automated customer experiences that simplify banking.

Our AI-driven applications have been pivotal in enhancing the customer experience, whether it’s notifying customers about unusual charges or providing instant answers to their questions. With strong investments in public cloud infrastructure and machine learning platforms, we are uniquely positioned to leverage AI’s transformative power.

As a Distinguished Engineer in AI Systems, you will play a crucial role in bringing emerging AI capabilities to life, reimagining how we serve our customers and businesses.

What You’ll Do

Innovate: Collaborate with AI engineers and researchers to design and implement secure, robust, and scalable infrastructure that supports our enterprise AI capabilities.
Build: Develop large-scale distributed training clusters, deploy Large Language Models (LLMs) on GPU instances for real-time use cases, and support cutting-edge AI research in a public cloud environment.
Lead: Envision the future state of our AI systems and guide the development of key services that will drive our AI capabilities forward.
Design: Create fault-tolerant infrastructure to reliably support long-running large-scale training tasks, even in the face of individual node failures, using containers and checkpointing libraries to ensure resilience.
Deploy: Implement infrastructure for serving large ML models in a public cloud, and optimize storage and networking stacks for high-performance training clusters.
Measure: Design and implement benchmarks to assess the performance of our AI systems, making informed recommendations on technology selection.
Develop: Build applications that leverage LLMs and foundation models, and contribute to MLOps capabilities that support the deployment and maintenance of these models.

Qualifications

Basic Qualifications:

Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field.
At least 7 years of experience designing and building distributed computing, HPC, and large-scale ML systems.
At least 5 years of experience developing AI and ML algorithms using Python or C/C++.
At least 3 years of experience with the full ML development lifecycle in public cloud environments.

Preferred Qualifications:

Master's degree or PhD in Engineering, Computer Science, or a related technical field, with a focus on modern AI techniques.
Expertise in designing large-scale distributed platforms and/or systems in cloud environments such as AWS, Azure, or GCP.
Experience architecting cloud systems with a focus on security, availability, performance, scalability, and cost efficiency.
Proven track record in delivering large models through the MLOps lifecycle, from exploration to deployment.
Proficiency in building GPU clusters in the public cloud with tightly coupled storage and networking.
In-depth knowledge of the complete stack for distributed training of large models, including ML compilers, distributed training frameworks, and ML development frameworks like PyTorch, TensorFlow, and Lightning.
Experience with AI technology stack areas such as prompt engineering, guardrails, vector databases/knowledge bases, LLM hosting, and fine-tuning.
Publications in top peer-reviewed conferences or industry-recognized contributions in neural networks, distributed training, and SysML.

Compensation

This position offers a competitive salary, which may vary based on location and experience. Additionally, the role is eligible for performance-based incentive compensation, which may include cash bonuses and/or long-term incentives.

Benefits

We offer a comprehensive, competitive, and inclusive set of health, financial, and other benefits that support your overall well-being.

Application Process

This role will accept applications for a minimum of five business days. If you require any accommodations during the application process, please reach out to our recruiting team.

Equal Opportunity Employer

We are an equal opportunity employer committed to fostering diversity and inclusion in the workplace. All qualified applicants will receive consideration for employment without regard to gender, race, color, age, national origin, religion, disability, or any other protected status.

Employment Type: Full-Time