Key Facts

Remote From:

Freelance

Expert & Leadership (>10 years)

English

Hard Skills

Text-To-Speech
Python (Programming Language)
Open Source Development
3D Animation
Computer Vision
Speech Processing
Data Architecture
WebRTC
Real Time Streaming Protocol (RTSP)
Linguistics

Requirements

Building low-latency or real-time streaming/async pipelines (OSS)
Integrating multiple AI components (LLMs, TTS, computer vision models)
Working with text-to-speech or voice cloning systems
Experience with lip-sync / talking-head models (e.g. Wav2Lip, SadTalker)

Roles & Responsibilities

Design and implement a real-time multimodal pipeline connecting LLMs, TTS/voice cloning, and facial animation for a synchronised digital human
Build streaming pipelines and orchestrate token/chunk-level data flow across models to minimise latency
Integrate voice cloning / TTS systems and implement lip-sync and viseme animation for natural speech
Optimise end-to-end latency (target under 2 seconds) and ensure stability under continuous interaction

Hiredge Solutions

Hrtech: Human Resources + Technology

About Hiredge Solutions

Our mission is to revolutionise the freelance industry with a talent platform that uses AI technology to connect highly skilled, expert professionals with businesses in the ICT sector, seamlessly and efficiently. We aim to empower client businesses by matching them with the right talent for their project-based needs, enabling them to thrive in a rapidly evolving market. In doing so, we push the boundaries of conventional staffing firms and deliver unparalleled value to both businesses and freelancers in the industry.

Company type: Startup

Industry: Hrtech: Human Resources + Technology

Founded: 2018

Company size: 2 - 10


Job description

Real-Time Multimodal AI Engineer (Digital Human Systems)

We are engaging a Multimodal AI Engineer to build a real-time AI digital human that combines LLMs, voice cloning (text-to-speech), and facial animation (a talking avatar). The project is led by our client, a leading global, strategy-led technology build consultancy.

The goal is not a demo. It is a low-latency, production-grade system in which responses are generated live, speech sounds natural and personalised, lip movement is synchronised with audio, and the entire interaction feels coherent and human.

What You'll Work On

You'll design and implement a real-time multimodal pipeline that connects multiple AI systems into a single, synchronised experience. You will:

Build streaming pipelines (LLM → speech → avatar output)
Orchestrate token/chunk-level data flow across models
Integrate voice cloning / TTS systems (preferably phoneme-aware)
Implement lip-sync pipelines (audio → viseme → animation)
Handle audio–video synchronisation in real time
Optimise latency across the full pipeline (<2 seconds target)
Ensure stability under continuous interaction (not just single-turn demos)

What We're Actually Looking For

This is not a generic AI role. You should be comfortable working at the intersection of:

real-time systems
speech processing
computer vision / animation
applied AI integration

Must-Have Experience

Building low-latency or real-time systems (streaming, async pipelines) in OSS
Integrating multiple AI components (LLMs, TTS, CV models)
Working with text-to-speech or voice cloning systems
Working with lip-sync / talking-head models (e.g. Wav2Lip, SadTalker, or similar)
Handling audio/video synchronisation or time-aligned data
Strong Python skills (or equivalent for ML/system integration)

Nice-to-Have

Experience with WebRTC or real-time media streaming
Knowledge of phonemes / prosody / speech timing
GPU inference optimisation (latency tuning)
Exposure to multimodal models (audio + video + text)
Experience building production-grade AI systems (not just notebooks)

If you have experience making multiple AI models behave like a single, real-time human interaction system, we'd like to hear from you. Please contact zxie@hiredgesolutions.com for next steps and details.
