Senior Research Data Engineer

Work set-up: Full Remote
Contract:
Experience: Senior (5-10 years)
Work from:
Offer summary

Qualifications:

  • Proficiency in Python programming.
  • Experience with distributed data frameworks such as Ray or Spark.
  • Track record of designing and managing large-scale data pipelines.
  • Knowledge of synthetic data pipelines and lakehouse paradigms.

Key responsibilities:

  • Design and implement data sourcing, synthetic generation, and curation pipelines.
  • Build high-throughput data pipelines for ingesting, generating, and filtering multi-modal data.
  • Collaborate with ML researchers to develop foundation models.
  • Ensure data quality, relevance, and integrity at petabyte scale.

Mirumee Software SME http://mirumee.com/
51 - 200 Employees

Job description

Kaiko’s Multimodal Large Language Model (MLLM) is trained on domain-specific, high-complexity medical data. To reach clinical-grade performance, we’ll need to ramp up our data efforts to manage massive scale, ensure consistent quality, and tightly control data relevance and integrity.

As a Senior Research Data Engineer, you will design and implement our data‑sourcing, synthetic‑generation, and curation pipelines. High‑quality datasets are the fuel for frontier‑scale language models, and you will play a pivotal role in producing them.

You will build high‑throughput data pipelines that:

  • Ingest multi‑modal data at petabyte scale.
  • Generate large volumes of synthetic data.
  • Filter and rate content by topic, quality, and policy compliance.

You will work closely with ML researchers and help steer the development of our state-of-the-art foundation models. You will be based in Zurich or Amsterdam, with the expectation of spending half of your time at the office.
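The filter-and-rate stage above can be illustrated with a minimal, single-process sketch in pure Python. All names here (`Record`, `filter_records`, the score threshold, and the topic list) are hypothetical; a production pipeline at this scale would run distributed on a framework such as Ray or Spark.

```python
# Minimal sketch of a filter-and-rate pipeline stage (hypothetical names).
from dataclasses import dataclass
from typing import Iterable, Iterator, FrozenSet

@dataclass
class Record:
    text: str
    quality: float  # e.g. a model-assigned quality score in [0, 1]
    topic: str

def filter_records(
    records: Iterable[Record],
    min_quality: float = 0.5,
    allowed_topics: FrozenSet[str] = frozenset({"radiology", "pathology"}),
) -> Iterator[Record]:
    """Keep only records that pass both the quality and topic gates."""
    for rec in records:
        if rec.quality >= min_quality and rec.topic in allowed_topics:
            yield rec

batch = [
    Record("CT scan report ...", 0.9, "radiology"),
    Record("lorem ipsum", 0.2, "radiology"),    # fails the quality gate
    Record("cooking blog post", 0.8, "food"),   # fails the topic gate
]
kept = list(filter_records(batch))
# only the first record survives both gates
```

The generator form lets stages be composed lazily, so records stream through ingestion, generation, and filtering without materializing intermediate datasets.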

Profile

  • Excellent programming skills in Python and deep experience with distributed frameworks such as Ray or Spark.
  • Proven track record designing and operating large-scale data pipelines and running data-quality experiments.
  • Experience building or integrating synthetic-data pipelines for LLMs.
  • Deep familiarity with lakehouse paradigms (Delta, Iceberg) and columnar formats (Parquet, ORC).
  • Experience with core data-processing primitives (hashing, deduplication, chunking, etc.) and the associated scalability/performance trade-offs.
  • Strong communication skills and the ability to present experimental results and technical concepts clearly and concisely.

Nice to have:

  • Hands-on production experience orchestrating complex DAGs in Dagster (preferred) or similar workflow engines.
  • Expertise in data-quality and validation frameworks and monitoring/observability tooling.
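Two of the data-processing primitives named above, exact deduplication via content hashing and fixed-size chunking, can be sketched with the standard library alone. This is illustrative only; at petabyte scale both operations would be sharded across workers, and deduplication would typically extend to near-duplicate detection (e.g. MinHash).

```python
# Sketch of two core data-processing primitives: exact deduplication
# via content hashing, and fixed-size chunking (illustrative only).
import hashlib
from typing import Iterable, Iterator, List

def dedupe(docs: Iterable[str]) -> Iterator[str]:
    """Drop exact duplicates by SHA-256 digest of the normalized text."""
    seen: set = set()
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc

def chunk(text: str, size: int = 16) -> List[str]:
    """Split text into fixed-size character chunks (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = [
    "Patient presents with ...",
    "patient presents with ...",  # exact duplicate after normalization
    "Normal chest X-ray.",
]
unique = list(dedupe(docs))
pieces = chunk(unique[0], size=10)
```

Keeping only the digest (rather than the full text) in the `seen` set bounds memory per record, which is the kind of scalability/performance trade-off the role calls out.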

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s):
English

Other Skills

  • Communication
