5+ years of experience in NLP/Machine Learning, with hands-on work on document understanding and information extraction from multilingual documents.
Proficiency in Python and experience with libraries such as transformers, spaCy, and NLTK.
Hands-on experience with layout-aware models such as DocLayout-YOLO, LayoutLM, Donut, or similar.
Familiarity with knowledge graphs and graph databases (e.g., Neo4j, RDF).
Requirements:
Design and implement document hierarchy and section segmentation pipelines using layout-aware models (e.g., DocLayout-YOLO, LayoutLM, Donut).
Build multilingual entity recognition and relation extraction systems across English and German texts.
Construct and maintain knowledge graphs representing semantic relationships between extracted elements using graph data structures and graph databases (e.g., Neo4j).
Integrate outputs into structured LLM-friendly formats (e.g., JSON, Markdown) for downstream extraction and analytics.
Overview
We're looking for a hands-on NLP/ML engineer to lead the development of an intelligent
document understanding pipeline for extracting structured data from complex, unstructured RFQ
documents (40 to 100+ pages, in German and English).
You will be responsible for building
scalable systems that combine document parsing, layout analysis, entity extraction, and
knowledge graph construction, ultimately feeding downstream applications (e.g., analytics
and LLM applications).
Key Responsibilities
- Design and implement document hierarchy and section segmentation pipelines using
  layout-aware models (e.g., DocLayout-YOLO, LayoutLM, Donut).
- Build multilingual entity recognition and relation extraction systems across both English
  and German texts.
- Use tools like NLTK, transformers, and spaCy to develop custom tokenization, parsing,
  and information extraction logic.
- Construct and maintain knowledge graphs representing semantic relationships
  between extracted elements using graph data structures and graph databases (e.g.,
  Neo4j).
- Integrate outputs into structured LLM-friendly formats (e.g., JSON, Markdown) for
  downstream extraction of building material elements.
- Collaborate with product and domain experts to align on information schema, ontology,
  and validation methods.
What We're Looking For
- Strong experience in NLP, document understanding, and information extraction
  from unstructured/multilingual documents.
- Proficiency in Python, with experience using libraries such as transformers, spaCy,
  and NLTK.
- Hands-on experience with layout-aware models like DocLayout-YOLO, LayoutLM,
  Donut, or similar.
- Familiarity with knowledge graphs and graph databases such as Neo4j and RDF.