Overview
We're looking for a hands-on NLP/ML engineer to lead the development of an intelligent
document understanding pipeline that extracts structured data from complex, unstructured RFQ
documents (40-100+ pages, in German and English). You will be responsible for building
scalable systems that combine document parsing, layout analysis, entity extraction, and
knowledge graph construction, ultimately feeding downstream applications (e.g., analytics
and LLM applications).
Key Responsibilities
- Design and implement document hierarchy and section segmentation pipelines using
  layout-aware models (e.g., DocLayout-YOLO, LayoutLM, Donut).
- Build multilingual entity recognition and relation extraction systems across both English
  and German texts.
- Use tools like NLTK, transformers, and spaCy to develop custom tokenization, parsing,
  and information extraction logic.
- Construct and maintain knowledge graphs representing semantic relationships
  between extracted elements using graph data structures and graph databases (e.g.,
  Neo4j).
- Integrate outputs into structured, LLM-friendly formats (e.g., JSON, Markdown) for
  downstream extraction of building material elements.
- Collaborate with product and domain experts to align on information schema, ontology,
  and validation methods.
What We're Looking For
- Strong experience in NLP, document understanding, and information extraction
  from unstructured/multilingual documents.
- Proficiency in Python, with experience using libraries such as transformers, spaCy,
  and NLTK.
- Hands-on experience with layout-aware models like DocLayout-YOLO, LayoutLM,
  Donut, or similar.
- Familiarity with knowledge graphs and graph databases such as Neo4j, RDF