Job description
This is a remote position.
This role owns the eval harness and quality gate from the beginning. It replaces the old late-stage “Evals Specialist” model with a standing owner for measurable agent quality.
Key Responsibilities
• Build and maintain the MVP eval harness: golden tasks, exception tasks, scorecard metrics, and regression packs.
• Wire evals into CI so quality regressions fail builds and releases.
• Define and maintain release-gate thresholds with Product and the Tech Lead.
• Lay the path for later adversarial and drift-testing expansion without overbuilding MVP scope.
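As a rough illustration of the kind of gate this role would own, the sketch below checks a scorecard against release-gate thresholds and returns a non-zero exit code when any metric falls short. The metric names (`golden_pass_rate`, `exception_pass_rate`), the threshold values, and the `Gate`, `failed_gates`, and `exit_code` helpers are all hypothetical; the actual harness design is for the person in this role to define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One release-gate rule: a scorecard metric and its minimum value."""
    metric: str      # hypothetical metric name, e.g. "golden_pass_rate"
    minimum: float   # hypothetical threshold agreed with Product and the Tech Lead


def failed_gates(scorecard, gates):
    """Return one message per violated gate; an empty list means all gates pass."""
    failures = []
    for gate in gates:
        value = scorecard.get(gate.metric)
        if value is None:
            # A metric that was never measured is treated as a failure,
            # so a broken eval run cannot silently pass the gate.
            failures.append(f"{gate.metric}: missing from scorecard")
        elif value < gate.minimum:
            failures.append(f"{gate.metric}: {value:.3f} < {gate.minimum:.3f}")
    return failures


def exit_code(scorecard, gates):
    """Print each failure and return 1 if any gate failed, else 0."""
    failures = failed_gates(scorecard, gates)
    for message in failures:
        print("GATE FAILED", message)
    return 1 if failures else 0
```

In a CI job, a small wrapper script could load the latest scorecard and call `sys.exit(exit_code(scorecard, gates))`, so that a quality regression fails the build the same way a failing unit test would.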
Requirements
Must-Have Qualifications
• Experience evaluating ML, LLM, or non-deterministic systems.
• Strong test and benchmark design capability.
• Comfort working with noisy metrics, thresholds, and probabilistic behavior.
• Good scripting and automation skills.
AI-First Expectations
• Uses AI to generate candidate eval cases and failure hypotheses, but never confuses generated tests with validated quality.
• Approaches AI quality as an operating system, not a QA afterthought.
What Success Looks Like in the First 90 Days
• The first reference agent has a published scorecard and gated eval path.
• Golden and exception tests run automatically.
• The team can explain what “good enough to ship” means in measurable terms.