Role Overview
We are seeking a Senior AI Engineer specializing in LLMs to lead the design, evaluation, and deployment of production-grade generative AI systems. In this role, you will own end-to-end LLM solutions, from prototyping to scalable production, while establishing best practices in evaluation, reliability, and responsible AI.
Own and evolve the LLM evaluation (evals) strategy, including designing gold-standard datasets and benchmarks, building automated eval pipelines and scoring systems, and defining metrics for factuality, grounding, robustness, and user impact.
Diagnose and resolve complex failure modes (hallucinations, retrieval issues, agent breakdowns).
Optimize systems for latency, cost, scalability, and reliability in production.
Mentor junior engineers and guide best practices in LLM development and evaluation.
Collaborate cross-functionally with product, data, and leadership to shape AI strategy.
Set standards for responsible AI, including safety, bias mitigation, and observability.
5+ years of experience in software engineering, machine learning, and applied AI with a track record of driving projects to completion.
Strong software engineering fundamentals (testing, modular design, dependency injection) in Python.
A track record of taking AI and LLM-powered features from initial concept through deployment and long-term production maintenance.
Experience implementing automated testing strategies for non-deterministic systems, and strong debugging and analytical skills for ambiguous model behavior.
A strong understanding of prompt engineering and prompt lifecycle management, RAG architectures and retrieval evaluation, and LLM limitations and failure patterns.
Solid experience using data analytics techniques (SQL, analysis, and visualization) to inform product decisions and delivery.
A strong product mindset, with a deep understanding of our product and our customers' needs that guides you to design the right solutions for them.
Strong tech leadership and mentorship skills, and the ability to independently drive projects to completion.
Clear communication of trade-offs, risks, and system performance to stakeholders.
Proven experience driving ambiguous projects to completion, mentoring teams, and communicating complex technical risks to stakeholders.
The ability to design robust, production-grade evaluation at scale using advanced metrics and statistical validation.
Deep expertise in model fine-tuning, adversarial red-teaming, and safety testing to protect the system from edge-case vulnerabilities.
