metadata
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.50.0
python_version: '3.12'
app_file: app.py
pinned: false
tags:
  - track:backyard
  - achievement:offgrid
  - achievement:offbrand
  - achievement:llama

A local-first AI stack that translates sign language, intent, and expression into natural speech.

Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon. It takes an uploaded or webcam ASL clip, extracts sign candidates and facial expression signals, converts the result into a compact intent JSON, then speaks the sentence with a local small-model voice stack.

Scope Note

We would have loved to push Sign2Voice further toward open-ended ASL conversation during the hackathon. The main constraint was not the app shell or the voice stack; it was data. Public ASL resources are still uneven for this task: many usable datasets focus on isolated signs, dictionary retrieval, or fingerspelling, while fluent ASL needs signer diversity, real-world lighting and camera angles, temporal boundaries, facial grammar, body movement, and careful expert annotation.

That is why this submission is intentionally evidence-first. It exposes top sign candidates, confidence thresholds, segment diagnostics, and emotion metadata instead of pretending that a small hackathon model can solve full ASL translation end to end. Microsoft Research describes sign-language modeling as being far behind spoken-language modeling largely because of a lack of appropriate training data, and recent SLR survey work calls data acquisition and annotation the main bottleneck for systems that work on fluent signing.

References:

Hackathon Fit

The project is aimed at the Backyard AI track: a practical communication tool for people who need fast sign-to-speech support without sending every clip to a cloud API.

Build Small constraints covered by this repo:

  • Gradio app, ready for a Hugging Face Space.
  • Small-model stack under the 32B parameter limit.
  • Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS speech generation run in the app process.
  • Step-by-step demo flow that shows intermediate evidence instead of hiding uncertainty.

Badges that fit this build:

  • Off the Grid / Local-first: no cloud inference API is required at runtime.
  • Llama Champion: the intent-to-speech text step uses llama.cpp.
  • Off-Brand / Custom UI: the Space uses custom Gradio styling.
  • Field Notes: claim this only after publishing the build write-up.

Badges not claimed:

  • Well-Tuned: this repo uses published models; it does not publish a new fine-tuned model.
  • Sharing is Caring: no public agent trace is included yet.

Pipeline

Video upload or camera capture
-> Sequential ASL frame sampling
-> MediaPipe landmarks, when installed
-> WLASL2000 I3D or TFLite ASL classifier
-> Ordered gloss sequence with confidence diagnostics
-> DeepFace emotion aggregation, when installed
-> llama.cpp subtitle and voice instruction
-> Qwen3-TTS audio

Each brick can fail independently and return diagnostics instead of blocking the whole interface at startup.
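The idea of bricks that fail independently can be illustrated with a small wrapper. This is only a sketch under assumptions: `BrickResult`, `run_brick`, and the stand-in stage functions are hypothetical names chosen for illustration, not the repo's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BrickResult:
    """Outcome of one pipeline stage (hypothetical structure, not the repo's)."""
    name: str
    status: str  # "ok" or "error"
    output: Any = None
    diagnostics: dict = field(default_factory=dict)


def run_brick(name: str, fn: Callable[[Any], Any], payload: Any) -> BrickResult:
    """Run one stage and turn any exception into diagnostics."""
    try:
        return BrickResult(name=name, status="ok", output=fn(payload))
    except Exception as exc:  # deliberately broad: one failing brick must not crash the UI
        return BrickResult(
            name=name,
            status="error",
            diagnostics={"error_type": type(exc).__name__, "message": str(exc)},
        )


def failing_emotion_brick(_clip: str) -> str:
    # Stand-in for an optional dependency that is not installed.
    raise ImportError("deepface is not installed")


results = [
    run_brick("asl", lambda clip: ["HELLO"], "clip.mp4"),
    run_brick("emotion", failing_emotion_brick, "clip.mp4"),
]
for result in results:
    print(result.name, result.status, result.diagnostics)
```

With this shape, a missing optional dependency shows up as an `error` entry next to the successful stages rather than as a startup crash.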

The demo screen is intentionally staged:

1. Analyze ASL -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech -> Qwen3-TTS audio

When no ASL classifier is available, Sign2Voice reports model_missing and does not invent ASL words. The manual gloss override is empty by default and lives under advanced debug controls for downstream LLM/TTS testing only.

Run Locally

Install the default CPU-friendly dependency set:

pip install -r requirements.txt

Start the Gradio app:

python3 app.py

Run unit tests:

python3 -m pytest -q

Run each brick directly:

python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py

scripts/test_asl_brick.py creates a tiny temporary clip when no video path is supplied, and it also reports the path of the debug overlay video. To test the transparent fallback:

python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"

ASL Model Files

The ASL classifier assets live under:

data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json

They come from jamesjbustos/sign-language-recognition.

Without these files, the ASL brick still samples frames and emits model_missing diagnostics. The TFLite model recognizes only the isolated signs listed in sign_to_prediction_index_map.json; it is not a full-sentence or fingerspelling recognizer. Predictions below ASL_CONFIDENCE_THRESHOLD (default 0.70) are reported as low_confidence and are not forwarded as detected glosses.
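The gating rule described above might look roughly like the following. It is a minimal sketch, assuming a hypothetical `gate_prediction` helper and result dictionary; only the `model_missing` and `low_confidence` status names and the `ASL_CONFIDENCE_THRESHOLD` variable come from this README, and the `accepted` status is an assumed label.

```python
import os
from typing import Optional


def load_threshold() -> float:
    """Read the threshold from the environment, falling back to 0.70."""
    return float(os.environ.get("ASL_CONFIDENCE_THRESHOLD", "0.70"))


def gate_prediction(label: Optional[str], confidence: float, threshold: float) -> dict:
    """Decide whether a classifier output may be forwarded as a gloss."""
    if label is None:
        # No classifier was available, so no sign is invented.
        return {"status": "model_missing", "gloss": None, "confidence": 0.0}
    if confidence < threshold:
        # Keep the candidate visible for diagnostics, but do not forward it.
        return {
            "status": "low_confidence",
            "gloss": None,
            "candidate": label,
            "confidence": confidence,
        }
    return {"status": "accepted", "gloss": label, "confidence": confidence}


threshold = load_threshold()
for label, confidence in [("water", 0.91), ("who", 0.42), (None, 0.0)]:
    print(gate_prediction(label, confidence, threshold))
```

Keeping the rejected candidate in the result is one way to stay evidence-first: the UI can still show what the model guessed without passing it downstream.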

Uploaded videos use a phrase-prototype mode: frames are read in temporal order, landmarks are extracted once, and the ASL model runs over sliding windows. Accepted window predictions are collapsed into an ordered gloss sequence before llama.cpp rewrites them as natural speech.

Tune it with:

ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70

This is still not full continuous ASL translation, but it lets a recorded phrase clip become a gloss sequence such as "hello where water" instead of a single global class.
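To make the sliding-window idea concrete, here is a minimal sketch. It assumes that "collapsing" means merging consecutive duplicate predictions, which this README does not spell out, and it uses a toy majority-vote classifier in place of the real model. All function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import Callable, List, Sequence, Tuple

Prediction = Tuple[str, float]


def window_starts(num_frames: int, window: int = 30, stride: int = 15) -> List[int]:
    """Start indices of full windows; a short clip yields one partial window."""
    if num_frames <= 0:
        return []
    if num_frames < window:
        return [0]
    return list(range(0, num_frames - window + 1, stride))


def collapse_repeats(glosses: Sequence[str]) -> List[str]:
    """Merge consecutive duplicates (assumed collapse rule)."""
    return [gloss for gloss, _ in groupby(glosses)]


def gloss_sequence(
    frames: Sequence[str],
    predict: Callable[[Sequence[str]], Prediction],
    window: int = 30,
    stride: int = 15,
    threshold: float = 0.70,
) -> List[str]:
    """Classify each window in temporal order and keep confident predictions."""
    accepted = []
    for start in window_starts(len(frames), window, stride):
        label, confidence = predict(frames[start:start + window])
        if confidence >= threshold:
            accepted.append(label)
    return collapse_repeats(accepted)


def toy_predict(window_frames: Sequence[str]) -> Prediction:
    """Stand-in classifier: majority vote over per-frame labels."""
    label, count = Counter(window_frames).most_common(1)[0]
    return label, count / len(window_frames)


# Each "frame" is just its true label here, to keep the example runnable.
clip = ["hello"] * 45 + ["where"] * 30 + ["water"] * 45
print(gloss_sequence(clip, toy_predict))
```

In this toy clip, windows that straddle two signs score only 0.5 and are dropped by the threshold, while repeated confident windows for the same sign merge into one gloss.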

An experimental WLASL2000 I3D backend is also available for broader vocabulary coverage. It uses raghuhasan/asl2000-i3d from Hugging Face and downloads the I3D architecture helper if needed. With ASL_DETECTOR_BACKEND=auto, the app falls back to the TFLite detector when the I3D backend cannot be initialized.

ASL_DETECTOR_BACKEND=auto       # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite     # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d  # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224

The WLASL backend is heavier and more experimental. Its model card reports 2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so the UI exposes top candidates and segment diagnostics instead of hiding uncertainty.
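The ASL_DETECTOR_BACKEND behavior can be pictured as an ordered list of loaders tried in turn. The sketch below is an assumption about the shape of that logic, not the repo's code: `select_backend` and the loader callables are hypothetical, and the failure is simulated.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Backend order per mode, following the settings listed above.
BACKEND_ORDER: Dict[str, List[str]] = {
    "auto": ["wlasl_i3d", "tflite"],
    "tflite": ["tflite"],
    "wlasl_i3d": ["wlasl_i3d"],
}


def select_backend(
    mode: str, loaders: Dict[str, Callable[[], object]]
) -> Tuple[Optional[str], Optional[object], List[str]]:
    """Try each backend allowed by `mode`; collect a note for every failure."""
    if mode not in BACKEND_ORDER:
        raise ValueError(f"unknown ASL_DETECTOR_BACKEND: {mode!r}")
    diagnostics: List[str] = []
    for name in BACKEND_ORDER[mode]:
        try:
            return name, loaders[name](), diagnostics
        except Exception as exc:  # record the failure and move to the next backend
            diagnostics.append(f"{name}: {type(exc).__name__}: {exc}")
    return None, None, diagnostics  # caller can report model_missing


def load_wlasl_i3d() -> object:
    # Simulated failure, e.g. weights unavailable offline.
    raise RuntimeError("weights could not be downloaded")


def load_tflite() -> object:
    return object()  # placeholder for a loaded interpreter


loaders = {"wlasl_i3d": load_wlasl_i3d, "tflite": load_tflite}
name, model, notes = select_backend("auto", loaders)
print(name, notes)
```

Returning the failure notes alongside the chosen backend lets the UI show why the heavier model was skipped.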

Live camera debug prioritizes speed over long temporal batching. It starts predicting after LIVE_ASL_MIN_FRAMES=4, keeps a rolling buffer of LIVE_ASL_MAX_FRAMES=12, and runs ASL prediction every LIVE_ASL_PREDICT_EVERY=1 frame. DeepFace emotion is heavier, so it runs every LIVE_EMOTION_EVERY=45 frames by default.
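The live-camera cadence above can be sketched with a bounded deque. `LiveScheduler` is a hypothetical class for illustration. Whether the real app counts frames this way, or runs emotion detection on the first frame, is not stated here, so treat those details as assumptions.

```python
from collections import deque


class LiveScheduler:
    """Decide, per incoming frame, which models should run (illustrative only)."""

    def __init__(self, min_frames=4, max_frames=12, predict_every=1, emotion_every=45):
        self.min_frames = min_frames
        self.predict_every = predict_every
        self.emotion_every = emotion_every
        self.buffer = deque(maxlen=max_frames)  # oldest frames drop off automatically
        self.frame_count = 0

    def push(self, frame) -> dict:
        self.buffer.append(frame)
        self.frame_count += 1
        run_asl = (
            len(self.buffer) >= self.min_frames
            and self.frame_count % self.predict_every == 0
        )
        # Assumption: emotion runs on every Nth frame, counted from the first frame.
        run_emotion = self.frame_count % self.emotion_every == 0
        return {"run_asl": run_asl, "run_emotion": run_emotion, "buffered": len(self.buffer)}


scheduler = LiveScheduler()
asl_runs = emotion_runs = 0
for frame_index in range(90):  # integers stand in for camera frames
    decision = scheduler.push(frame_index)
    asl_runs += decision["run_asl"]
    emotion_runs += decision["run_emotion"]

print(asl_runs, emotion_runs, decision["buffered"])
```

The short rolling buffer keeps latency low, while the sparse emotion cadence keeps the heavier model off the per-frame path.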

Good first signs to test, since they are in the model vocabulary:

hello, where, who, why, yes, no, thankyou, please, water, happy, sad

Reference clips and GIFs from the upstream demo list:

hello    https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where    https://lifeprint.com/asl101/gifs/w/where.gif
who      https://lifeprint.com/asl101/gifs/w/who.gif
why      https://lifeprint.com/asl101/gifs/w/why.gif
yes      https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no       https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou https://lifeprint.com/asl101/gifs/t/thank-you.gif
please   https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water    https://lifeprint.com/asl101/gifs/w/water-2.gif

GPU Dependencies

flash-attn is only useful on a CUDA GPU Space with compatible PyTorch/CUDA versions. Keep the default requirements.txt for CPU Spaces. If the Space is moved to a compatible GPU runtime, install the GPU dependency set instead:

pip install -r requirements-gpu.txt --no-build-isolation

Full ASL Dependencies

The default requirements keep the Space build lighter. For a full local ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection, install:

pip install -r requirements-asl-full.txt