Job description
Location: Atlanta, GA - Remote option
Duration: 6 months
Job Title: AI/ML Ops Application Support Engineer
Job Summary:
We are seeking an experienced AI/ML Ops Application Support Engineer to support, monitor, and maintain AI/ML platforms and applications in a production environment. The role involves ensuring the stability, performance, and reliability of machine learning pipelines, model deployments, and related cloud infrastructure, with a strong focus on operational excellence and incident management.
Key Responsibilities:
Provide L2/L3 production support for AI/ML applications, pipelines, and model deployments
Monitor model performance, drift, and data quality issues in production environments
Troubleshoot and resolve incidents, alerts, and system failures across ML workflows
Support CI/CD pipelines for model deployment and versioning
Collaborate with Data Scientists, ML Engineers, and DevOps teams for issue resolution and enhancements
Manage model retraining schedules, batch/real-time pipelines, and inference jobs
Perform root cause analysis (RCA) and implement preventive measures
Ensure adherence to SLA/SLO requirements and maintain operational dashboards/reporting
Handle application release support, patching, and environment maintenance
Maintain documentation for runbooks, troubleshooting guides, and standard operating procedures
Required Skills:
Strong experience in production support / application support roles (AI/ML systems preferred)
Hands-on experience with Python, SQL, and scripting for troubleshooting
Knowledge of the ML lifecycle (training, validation, deployment, monitoring)
Experience with cloud platforms (Azure/AWS/GCP)
Familiarity with ML Ops tools (e.g., MLflow, Kubeflow, SageMaker, Azure ML)
Experience with containerization (Docker) and orchestration (Kubernetes)
Exposure to CI/CD tools (Jenkins, GitHub Actions, Azure DevOps)
Understanding of monitoring tools (Grafana, Prometheus, ELK, Azure Monitor)
Strong debugging and incident management skills
Preferred Skills:
Experience in the healthcare/payer domain (claims, enrollment, analytics platforms)
Knowledge of data engineering tools (Spark, Airflow, Databricks)
Familiarity with model explainability and governance frameworks
ITIL process knowledge (incident, problem, change management)
Education & Experience:
Bachelor’s/Master’s degree in Computer Science, Data Science, or related field
Typically 5+ years of experience in application/production support, with exposure to AI/ML systems
Role Description: AI/ML Ops application support
Essential Skills: AI/ML Ops application support
Skills: Digital: Python, Agile Specialisation, AI Agents, AI & Gen AI - Products & Tools, Application Server Deployment & Administration
Experience Required: 8-10 years
Diverse Lynx LLC is an Equal Employment Opportunity employer. All qualified applicants will receive due consideration for employment without any discrimination. All applicants will be evaluated solely on the basis of their ability, competence and their proven capability to perform the functions outlined in the corresponding role. We promote and support a diverse workforce across all levels in the company.