Real-Time Multimodal AI Engineer (Digital Human Systems)
We are engaging a Multimodal AI Engineer to build a real-time AI digital human that combines LLMs, voice cloning (text-to-speech), and facial animation / talking avatars. The project is led by my client, a leading global strategy-led technology build consultancy.
The goal is not a demo; it is a low-latency, production-grade system where responses are generated live, speech sounds natural and personalised, lip movement is synchronised with audio, and the entire interaction feels coherent and human.
What You'll Work On
You'll design and implement a real-time multimodal pipeline connecting multiple AI systems into a single synchronised experience, including:
- Building streaming pipelines (LLM → speech → avatar output)
- Orchestrating token/chunk-level data flow across models
- Integrating voice cloning / TTS systems (preferably phoneme-aware)
- Implementing lip-sync pipelines (audio → viseme → animation)
- Handling audio/video synchronisation in real time
- Optimising latency across the full pipeline (<2 seconds target)
- Ensuring stability under continuous interaction (not just single-turn demos)
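To make the streaming flow above concrete, here is a minimal sketch of the kind of token/chunk-level orchestration involved (LLM → TTS → viseme stage), using Python async generators. All three stages are hypothetical placeholders for illustration, not any specific model or API; a real system would stream actual audio and drive an animation renderer.

```python
import asyncio

async def llm_tokens():
    # Placeholder LLM stage: yields text chunks as they are "generated".
    for tok in ["Hel", "lo ", "there", "."]:
        await asyncio.sleep(0)  # simulate incremental streaming
        yield tok

async def tts_chunks(tokens):
    # Placeholder TTS stage: converts each text chunk into a fake audio buffer
    # as soon as it arrives, rather than waiting for the full sentence.
    async for tok in tokens:
        yield {"text": tok, "audio": bytes(len(tok))}

async def viseme_frames(chunks):
    # Placeholder lip-sync stage: maps each audio chunk to a viseme frame.
    async for chunk in chunks:
        yield {"text": chunk["text"], "n_samples": len(chunk["audio"])}

async def run_pipeline():
    # Stages are chained as async generators, so each chunk flows through
    # the whole pipeline without blocking on upstream completion.
    frames = []
    async for frame in viseme_frames(tts_chunks(llm_tokens())):
        frames.append(frame)
    return frames

frames = asyncio.run(run_pipeline())
```

The point of the chained-generator shape is latency: downstream stages start work on the first chunk while the LLM is still producing later ones, which is what makes a sub-2-second end-to-end target plausible.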
What We're Actually Looking For
This is not a generic AI role; you should be comfortable working at the intersection of:
- real-time systems
- speech processing
- computer vision / animation
- applied AI integration
Must-Have Experience
- Building low-latency or real-time systems (streaming, async pipelines) in OSS
- Integrating multiple AI components (LLMs, TTS, CV models)
- Working with text-to-speech or voice cloning systems
- Experience with lip-sync / talking-head models (e.g. Wav2Lip, SadTalker, or similar)
- Handling audio/video synchronisation or time-aligned data
- Strong Python skills (or equivalent for ML/system integration)
Nice-to-Have
- Experience with WebRTC or real-time media streaming
- Knowledge of phonemes / prosody / speech timing
- GPU inference optimisation (latency tuning)
- Exposure to multimodal models (audio, video, text)
- Experience building production-grade AI systems (not just notebooks)
If you are experienced in making multiple AI models behave like a single real-time human interaction system, we'd like to hear from you.