Job Summary:
We are looking for an MLOps Engineer to streamline and automate the deployment, monitoring, and maintenance of machine learning models at scale. In this role, you will bridge the gap between data science and IT operations by building reliable, scalable, and efficient infrastructure for machine learning systems. You will work closely with data scientists, data engineers, and DevOps teams to deploy models continuously and to implement robust monitoring and performance optimization.
Key Responsibilities:
● Design, build, and maintain end-to-end MLOps pipelines to automate the development, testing, deployment, and monitoring of machine learning models.
● Develop tools and frameworks to support continuous integration (CI) and continuous deployment (CD) of machine learning models in production environments.
● Work with data scientists and engineers to ensure that models can be deployed efficiently and maintained over their lifecycle, including model retraining, versioning, and scaling.
● Implement and maintain robust monitoring and alerting systems to track the performance and health of deployed models (e.g., data drift detection, accuracy monitoring).
● Collaborate with cross-functional teams to ensure smooth integration of machine learning models into production systems.
● Optimize the performance of machine learning models and pipelines, including tuning for inference speed, scalability, and cost-effectiveness.
● Deploy and scale machine learning systems on cloud infrastructure (e.g., AWS, GCP, Azure), using containerization and orchestration tools such as Docker and Kubernetes.
● Automate data workflows, feature pipelines, and ETL processes to ensure reliable, real-time data feeds for machine learning models.
● Ensure models are version-controlled, reproducible, and maintainable using tools such as MLflow, DVC, or similar.
● Implement and enforce security, governance, and compliance practices for machine learning systems, ensuring that data and models are securely handled in production.
● Stay up to date with the latest best practices in MLOps, DevOps, and cloud infrastructure for machine learning.
Required Qualifications:
● Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Technology, or a related technical field.
● 5+ years of experience in DevOps, Data Engineering, or MLOps, including hands-on experience deploying and managing ML models in production.
● Strong experience with CI/CD pipelines and tools such as Jenkins, CircleCI, GitLab CI, or similar.
● Proficiency with cloud platforms (e.g., AWS, GCP, Azure) and experience deploying ML models using cloud-native services (e.g., SageMaker, Vertex AI, Azure ML).
● Hands-on experience with containerization tools like Docker and orchestration platforms like Kubernetes for deploying scalable machine learning models.
● Proficiency in Python, Bash, or other scripting languages, with experience working with machine learning frameworks such as TensorFlow, PyTorch, or scikit-learn.
● Familiarity with infrastructure as code (IaC) tools such as Terraform or CloudFormation for managing cloud resources.
● Experience with model versioning, reproducibility, and experiment tracking using tools like MLflow, DVC, or Kubeflow.
● Knowledge of monitoring and logging systems for machine learning models, including tools like Prometheus, Grafana, or the ELK Stack.
● Strong understanding of the full ML lifecycle, including data preprocessing, model training, validation, deployment, and monitoring.
● Excellent problem-solving and troubleshooting skills with a strong focus on performance optimization and scalability.
Preferred Qualifications:
● Experience with big data tools such as Apache Spark, Kafka, or Hadoop for handling large-scale data.
● Knowledge of MLOps platforms and frameworks such as Kubeflow, Metaflow, or Tecton.
● Experience implementing A/B tests, shadow deployments, and blue/green deployments for machine learning models.
● Familiarity with data governance and security practices in machine learning, including data lineage tracking, access controls, and compliance regulations.
● Experience in feature store development and management for reusable and scalable feature engineering pipelines.
● Exposure to edge computing and IoT environments for deploying ML models in distributed systems.
Benefits:
● Competitive salary and performance-based bonuses
● Comprehensive health insurance (medical, dental, vision)
● Generous PTO and flexible working hours
● Learning and development opportunities, including certifications, conferences, and workshops
● Access to the latest machine learning and cloud technologies
● Collaborative and innovative work environment with opportunities for career growth