Voice AI · Real-time Audio Processing · 3D Avatar Animation · Browser Extension · Accessibility Tech

SignalCast — Real-Time Voice AI & 3D Avatar Translation Engine

A real-time voice AI system that extracts, transcribes, and translates spoken audio from video content — then drives a 3D animated avatar to perform the translation live, in sync with the video.

Core modules built

Real-time audio-to-avatar sync

Multilingual audio transcription support

Live avatar rendering

The Problem

1.5 billion people can't access online video

Over 1.5 billion people worldwide live with some degree of hearing loss, yet most online video remains hard or impossible for them to follow. Closed captions, the usual fix, are static, often imprecise, and miss much of the nuance of spoken language. No existing system translated video audio into dynamic sign language in real time and with awareness of context, let alone one that handled multiple languages and ran directly inside a browser.

“There is no complete, globally applicable solution that provides real-time, context-aware sign language translation for video content across multiple languages.”

The Voice AI Pipeline

Video URL → live 3D sign language

🎬 Video URL → 🔊 Audio Extract → 📝 Speech-to-Text → 🧠 NLP Context → 🤟 Sign Mapping → 🧍 3D Avatar → 📺 Live Overlay

User pastes a video URL — The browser extension validates the link, detects video format and metadata, and initiates the processing pipeline.
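
As a rough illustration of the validation step above, the sketch below checks a pasted URL using only the Python standard library. The function name, the returned fields, and the list of accepted file extensions are assumptions made for this example; the project's actual checks are not described here.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of direct-file formats; the real list is not stated.
SUPPORTED_EXTENSIONS = (".mp4", ".webm", ".m3u8")


def validate_video_url(url: str) -> dict:
    """Return basic facts about a pasted URL, or raise ValueError if unusable."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        raise ValueError("only http(s) URLs are accepted")
    if not parsed.netloc:
        raise ValueError("URL has no host")

    # Detect the container format from the path, ignoring case and query string.
    path = parsed.path.lower()
    extension = next((ext for ext in SUPPORTED_EXTENSIONS if path.endswith(ext)), None)
    return {
        "host": parsed.netloc,
        "extension": extension,
        "is_direct_file": extension is not None,
    }
```

A page URL such as a watch link has no file extension, so it would come back with `is_direct_file` set to `False` and would need a separate resolution step before audio extraction.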

Audio is extracted and preprocessed — Background noise is filtered, audio quality optimised using OpenCV pipelines to maximise transcription accuracy.
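
To make the preprocessing idea concrete, here is a minimal sketch of two common clean-up steps: a noise gate followed by peak normalisation. It deliberately swaps in plain Python over a list of float samples rather than the OpenCV pipeline named above, whose details are not given. The threshold and target peak values are assumptions chosen for illustration.

```python
def noise_gate(samples, threshold=0.02):
    """Silence samples whose absolute amplitude is below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def peak_normalise(samples, target_peak=0.9):
    """Scale the signal so its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(samples):
    """Gate low-level noise first, then bring the remainder to a consistent level."""
    return peak_normalise(noise_gate(samples))
```

Gating before normalising matters: done the other way round, the gain stage would amplify the background noise before the gate had a chance to remove it.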

Speech-to-text via Groq Cloud API — Multilingual audio transcribed with high accuracy in real time; NLTK cleans and normalises the text output.
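
The clean-up half of this step can be sketched as follows. The example leaves out the transcription call itself and uses standard-library regular expressions as a stand-in for NLTK, so it shows the kind of normalisation involved rather than the project's actual code. The filler-word list is a hypothetical placeholder.

```python
import re
import unicodedata

# Hypothetical filler words to drop; a real list would be language-specific.
FILLERS = {"um", "uh", "erm"}

# A word is a run of letters (optionally with one internal apostrophe) or a run of digits.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+")


def normalise_transcript(text: str) -> list[str]:
    """Split a raw transcript into cleaned, lower-cased sentences."""
    text = unicodedata.normalize("NFKC", text)
    # Break on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        words = [w for w in WORD_PATTERN.findall(sentence.lower()) if w not in FILLERS]
        if words:  # skip sentences that were only fillers or punctuation
            cleaned.append(" ".join(words))
    return cleaned
```

Keeping sentence boundaries in the output, instead of returning one flat word list, gives the later context-analysis stage whole sentences to work with.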

Context-aware sign language mapping — The system analyses sentence context using NLP — not just word-by-word mapping — producing natural and semantically accurate sign sequences.
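
One small piece of what "not word-by-word" can mean is matching multi-word phrases to a single sign before falling back to individual words. The toy sketch below does this with greedy longest-match lookup. The lexicon, the gloss labels, and the fingerspelling fallback are all invented for illustration; real sign languages have their own grammar, and the project's context analysis is presumably richer than a phrase table.

```python
# Hypothetical lexicon: keys are word tuples, values are sign glosses.
LEXICON = {
    ("thank", "you"): "THANK-YOU",
    ("good", "morning"): "GOOD-MORNING",
    ("how", "are", "you"): "HOW-YOU",
    ("you",): "YOU",
    ("good",): "GOOD",
    ("morning",): "MORNING",
}
MAX_PHRASE_LEN = max(len(key) for key in LEXICON)


def fingerspell(word: str) -> str:
    """Fallback for words with no sign in the lexicon."""
    return "FS:" + word.upper()


def map_to_glosses(sentence: str) -> list[str]:
    """Map a cleaned sentence to glosses, preferring the longest known phrase."""
    words = sentence.lower().split()
    glosses = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink.
        for size in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            gloss = LEXICON.get(tuple(words[i:i + size]))
            if gloss is not None:
                glosses.append(gloss)
                i += size
                break
        else:
            # No phrase of any length matched at this position.
            glosses.append(fingerspell(words[i]))
            i += 1
    return glosses
```

With this lexicon, "thank you" becomes the single gloss `THANK-YOU` instead of a fingerspelled "thank" followed by `YOU`, which is the difference between phrase-level and word-level mapping in miniature.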

3D avatar rendered via Three.js + Unity — A real-time animated character performs the sign language gestures, rigged and synced frame-by-frame with the video timeline.

Live overlay inside the video frame — The avatar is rendered directly within the video player, not as a separate window — creating a seamless, immersive viewing experience.

Key Features

What's inside

Real-time audio extraction from video URLs

Multilingual speech-to-text transcription

Context-aware NLP sign language mapping

Live 3D avatar animation (Unity + Three.js)

In-video overlay rendering (browser-native)

Real-time video-avatar sync engine

User dashboard with history & controls

Social login + account management

Tech Stack

Built with

Frontend — React JS 18.2, HTML5, CSS3

3D Rendering — Three.js 0.159, Unity 2022.3

Voice AI — Groq Cloud API 4.2 (speech-to-text)

NLP — NLTK 3.8 (text processing & context)

Vision / ML — OpenCV 4.8 (audio optimisation)

Backend — Python 3.13

Design — Figma, PyCharm

Engineering Challenge

Why this was technically hard

Most voice AI projects stop at transcription. SignalCast goes three layers deeper, and each layer adds significant engineering complexity.

The hardest part wasn't speech-to-text — it was synchronising a live 3D avatar with real-time buffered audio inside a browser extension, while maintaining context-aware (not word-for-word) sign language accuracy across multiple languages simultaneously.

Solving the real-time buffering challenge — where sign language gestures must stay in sync with the video even as transcription lag occurs — required building a custom frame-sync engine between the Python backend and the Three.js/Unity 3D renderer.
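
The design of the frame-sync engine is not described here, so the following is only one plausible shape for it, sketched under stated assumptions. Gesture cues arrive from the backend tagged with video timestamps, possibly out of order and possibly late. On every rendered frame the renderer asks which cue applies at the current video time, and cues that are too stale are dropped instead of played late. All class, field, and parameter names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class GestureCue:
    start: float                       # video time (s) when the gesture begins
    end: float = field(compare=False)  # video time (s) when the gesture ends
    gloss: str = field(compare=False)  # which sign to animate


class SyncBuffer:
    """Keeps cues ordered by start time and answers 'what plays at time t?'."""

    def __init__(self, max_lateness: float = 0.5):
        self._heap: list[GestureCue] = []
        self.max_lateness = max_lateness  # grace period after a cue's end, in seconds
        self.dropped: list[str] = []

    def push(self, cue: GestureCue) -> None:
        heapq.heappush(self._heap, cue)

    def cue_at(self, video_time: float) -> Optional[GestureCue]:
        # Drop cues that finished too long ago to be worth showing.
        while self._heap and self._heap[0].end + self.max_lateness < video_time:
            self.dropped.append(heapq.heappop(self._heap).gloss)
        if self._heap and self._heap[0].start <= video_time:
            return self._heap[0]
        return None  # nothing due yet: the avatar idles


def animation_progress(cue: GestureCue, video_time: float) -> float:
    """Fraction of the gesture clip (0.0 to 1.0) that should have played by video_time."""
    duration = cue.end - cue.start
    if duration <= 0:
        return 1.0
    return min(1.0, max(0.0, (video_time - cue.start) / duration))
```

Driving the animation from the video clock, and not from the time a cue happened to arrive, is what keeps the avatar aligned when transcription lags: a late cue simply starts partway through its clip or is skipped.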

The Outcome

A genuinely novel voice AI system

SignalCast delivered a working real-time voice AI system that converts a video's spoken audio into live 3D sign language, rendered directly inside the browser and in sync with the video. It goes beyond transcription into real-time motion generation and avatar animation.

This project showcases our capability to build end-to-end voice AI pipelines — from audio extraction and multilingual speech processing to NLP context analysis and real-time 3D rendering. The same pipeline architecture applies directly to voice agents, AI avatars, real-time translation tools, and multimedia automation systems.
