---
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
tags:
- track:backyard
- achievement:offgrid
- achievement:offbrand
- achievement:llama
---
A local-first AI stack that translates sign language, intent, and expression
into natural speech.
Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon.
It takes an uploaded or webcam ASL clip, extracts sign candidates and facial
expression signals, converts the result into a compact intent JSON, then speaks
the sentence with a local small-model voice stack.
## Scope Note
We would have loved to push Sign2Voice further toward open-ended ASL
conversation during the hackathon. The main constraint was not the app shell or
the voice stack; it was data. Public ASL resources are still uneven for this
task: many usable datasets focus on isolated signs, dictionary retrieval, or
fingerspelling, while fluent ASL needs signer diversity, real-world lighting and
camera angles, temporal boundaries, facial grammar, body movement, and careful
expert annotation.
That is why this submission is intentionally evidence-first. It exposes top sign
candidates, confidence thresholds, segment diagnostics, and emotion metadata
instead of pretending that a small hackathon model can solve full ASL
translation end to end. Microsoft Research describes sign-language modeling as
being far behind spoken-language modeling largely because of a lack of
appropriate training data, and recent SLR survey work calls data acquisition and
annotation the main bottleneck for systems that work on fluent signing.
References:
- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)
## Hackathon Fit
The project is aimed at the Backyard AI track: a practical communication tool
for people who need fast sign-to-speech support without sending every clip to a
cloud API.
Build Small constraints covered by this repo:
- Gradio app, ready for a Hugging Face Space.
- Small-model stack under the 32B parameter limit.
- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS
speech generation run in the app process.
- Step-by-step demo flow that shows intermediate evidence instead of hiding
uncertainty.
Badges that fit this build:
- Off the Grid / Local-first: no cloud inference API is required at runtime.
- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
- Off-Brand / Custom UI: the Space uses custom Gradio styling.
- Field Notes: claim this only after publishing the build write-up.
Badges not claimed:
- Well-Tuned: this repo uses published models; it does not publish a new
fine-tuned model.
- Sharing is Caring: no public agent trace is included yet.
## Pipeline
```text
Video upload or camera capture
-> Sequential ASL frame sampling
-> MediaPipe landmarks, when installed
-> WLASL2000 I3D or TFLite ASL classifier
-> Ordered gloss sequence with confidence diagnostics
-> DeepFace emotion aggregation, when installed
-> llama.cpp subtitle and voice instruction
-> Qwen3-TTS audio
```
Each brick (pipeline stage) can fail independently and return diagnostics
instead of blocking the whole interface at startup.
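One way to picture this failure isolation is a thin wrapper that turns any exception raised by a stage into a diagnostics record. This is a minimal sketch, not the repo's actual code: `BrickResult`, `run_brick`, and the diagnostic keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BrickResult:
    """Outcome of one pipeline stage (hypothetical structure)."""
    name: str
    ok: bool
    output: Any = None
    diagnostics: dict = field(default_factory=dict)


def run_brick(name: str, fn: Callable[[], Any]) -> BrickResult:
    """Run one stage and convert any exception into diagnostics."""
    try:
        return BrickResult(name=name, ok=True, output=fn())
    except Exception as exc:  # broad on purpose: one brick must not take down the UI
        return BrickResult(
            name=name,
            ok=False,
            diagnostics={
                "status": "error",
                "error_type": type(exc).__name__,
                "message": str(exc),
            },
        )


def _missing_emotion_backend():
    # Stand-in for an optional dependency that is not installed.
    raise ImportError("deepface is not installed")


results = [
    run_brick("asl", lambda: ["hello", "where", "water"]),
    run_brick("emotion", _missing_emotion_backend),
]
for result in results:
    print(result.name, result.ok, result.diagnostics)
```

In this sketch the emotion stage fails while the ASL stage still returns its glosses, so the interface can show partial results next to the error details.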
The demo screen is intentionally staged:
```text
1. Analyze ASL -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech -> Qwen3-TTS audio
```
When no ASL classifier is available, Sign2Voice reports `model_missing` and
does not invent ASL words. The manual gloss override is empty by default and
lives under advanced debug controls for downstream LLM/TTS testing only.
## Run Locally
Install the default CPU-friendly dependency set:
```bash
pip install -r requirements.txt
```
Start the Gradio app:
```bash
python3 app.py
```
Run unit tests:
```bash
python3 -m pytest -q
```
Run each brick directly:
```bash
python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py
```
`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is
supplied. It also writes a debug overlay video path. To test the transparent
fallback, pass a manual gloss override:
```bash
python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
```
## ASL Model Files
The ASL classifier assets live under:
```text
data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json
```
They come from
[jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition).
Without these files, the ASL brick still samples frames and emits
`model_missing` diagnostics. The TFLite model recognizes the isolated signs in
`sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling
recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD`, defaulting to `0.70`,
are reported as `low_confidence` and are not forwarded as detected glosses.
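The gating rule above can be sketched as a small function. Only the `model_missing` and `low_confidence` statuses and the `ASL_CONFIDENCE_THRESHOLD` variable come from this README; the function name, the `detected` status, and the dictionary layout are assumptions made for illustration.

```python
import os


def default_threshold() -> float:
    """Read the threshold from the environment, falling back to 0.70."""
    return float(os.environ.get("ASL_CONFIDENCE_THRESHOLD", "0.70"))


def gate_prediction(gloss, confidence, threshold=None):
    """Decide whether a single prediction may be forwarded as a gloss."""
    if threshold is None:
        threshold = default_threshold()
    if gloss is None:
        # No classifier output at all: never invent a word.
        return {"status": "model_missing", "gloss": None, "confidence": 0.0}
    if confidence < threshold:
        # Keep the candidate for the debug view, but do not forward it.
        return {
            "status": "low_confidence",
            "gloss": None,
            "candidate": gloss,
            "confidence": confidence,
        }
    return {"status": "detected", "gloss": gloss, "confidence": confidence}


print(gate_prediction("hello", 0.91, threshold=0.70))
print(gate_prediction("water", 0.42, threshold=0.70))
print(gate_prediction(None, 0.0, threshold=0.70))
```

Keeping the rejected candidate in a separate field is one way to show evidence in the UI without letting it reach the LLM step.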
Uploaded videos use a phrase-prototype mode: frames are read in temporal order,
landmarks are extracted once, and the ASL model runs over sliding windows.
Accepted window predictions are collapsed into an ordered gloss sequence before
`llama.cpp` rewrites them as natural speech.
Tune it with:
```text
ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70
```
This is still not full continuous ASL translation, but it lets recorded phrase
clips become `hello where water`-style gloss sequences instead of one global
class.
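The upload path described above (ordered frame sampling, sliding windows, then collapsing accepted predictions) can be sketched in plain Python. This is an illustration under stated assumptions, not the repo's implementation: the function names are hypothetical, the per-window predictions are hard-coded stand-ins for classifier output, and how the app handles a trailing partial window is not documented here, so this sketch simply drops it.

```python
def sample_frame_indices(total_frames, source_fps, target_fps=12.0, max_frames=240):
    """Pick frame indices in temporal order at roughly target_fps, capped at max_frames."""
    if total_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        return []
    step = max(source_fps / target_fps, 1.0)  # never upsample
    indices, position = [], 0.0
    while int(position) < total_frames and len(indices) < max_frames:
        indices.append(int(position))
        position += step
    return indices


def window_starts(num_frames, window=30, stride=15):
    """Start offsets of sliding windows; a short clip yields one window at 0."""
    if num_frames <= 0:
        return []
    if num_frames <= window:
        return [0]
    return list(range(0, num_frames - window + 1, stride))


def collapse_glosses(window_predictions, threshold=0.70):
    """Drop low-confidence windows, then merge consecutive repeats."""
    sequence = []
    for gloss, confidence in window_predictions:
        if confidence < threshold:
            continue
        if not sequence or sequence[-1] != gloss:
            sequence.append(gloss)
    return sequence


# A 10-second clip at 30 fps, sampled at the default 12 fps.
indices = sample_frame_indices(total_frames=300, source_fps=30.0)
starts = window_starts(len(indices))

# Stand-in classifier output, one (gloss, confidence) pair per window.
fake_predictions = [
    ("hello", 0.90), ("hello", 0.85), ("where", 0.40), ("where", 0.80),
    ("where", 0.78), ("water", 0.95), ("water", 0.75),
]
glosses = collapse_glosses(fake_predictions)
print(len(indices), starts, glosses)
```

With a stride of half the window, neighbouring windows overlap, so the same sign usually appears in two or more consecutive windows; merging repeats is what turns that into a single gloss. A side effect of this particular sketch is that a sign repeated on purpose, with only low-confidence windows in between, would also be merged.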
An experimental WLASL2000 I3D backend is also available for broader vocabulary
coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face, downloads the I3D
architecture helper if needed, and falls back to the TFLite detector when
`ASL_DETECTOR_BACKEND=auto` cannot initialize it.
```text
ASL_DETECTOR_BACKEND=auto # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224
```
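The fallback order for `ASL_DETECTOR_BACKEND` can be sketched as follows. The three backend names come from the settings above; `load_detector`, the loader callables, and the returned tuple are hypothetical and only illustrate the "try WLASL2000, then TFLite" behaviour of `auto`.

```python
import os

# Which backends to try, in order, for each setting.
BACKEND_ORDER = {
    "auto": ["wlasl_i3d", "tflite"],
    "tflite": ["tflite"],
    "wlasl_i3d": ["wlasl_i3d"],
}


def load_detector(backend, loaders):
    """Return (backend_name, detector, errors); detector is None if all fail."""
    if backend not in BACKEND_ORDER:
        raise ValueError(f"unknown ASL_DETECTOR_BACKEND: {backend!r}")
    errors = {}
    for name in BACKEND_ORDER[backend]:
        try:
            return name, loaders[name](), errors
        except Exception as exc:  # record the reason and try the next backend
            errors[name] = f"{type(exc).__name__}: {exc}"
    return None, None, errors


def _load_wlasl():
    # Stand-in for a failed download or a missing heavy dependency.
    raise RuntimeError("I3D weights unavailable")


def _load_tflite():
    return "tflite-detector"  # stand-in for a real interpreter object


fake_loaders = {"wlasl_i3d": _load_wlasl, "tflite": _load_tflite}

requested = os.environ.get("ASL_DETECTOR_BACKEND", "auto")
print(requested, load_detector(requested, fake_loaders))
```

Returning the collected errors alongside the chosen backend keeps the failure of the heavier model visible in diagnostics even when the fallback succeeds.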
The WLASL backend is heavier and more experimental. Its model card reports
2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so
the UI exposes top candidates and segment diagnostics instead of hiding
uncertainty.
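With top-1 accuracy this low, ranking several candidates is more informative than reporting a single label. A minimal sketch of that idea, assuming the classifier yields a gloss-to-probability mapping; the function name, the output fields, and the example probabilities are made up for illustration.

```python
def top_candidates(probabilities, k=5, threshold=0.20):
    """Rank glosses by probability and flag which ones clear the threshold."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]
    return [
        {"gloss": gloss, "confidence": prob, "accepted": prob >= threshold}
        for gloss, prob in ranked
    ]


# Made-up scores for one video segment.
example_scores = {"water": 0.31, "drink": 0.22, "cup": 0.12, "thirsty": 0.05}
candidates = top_candidates(example_scores, k=3)
for candidate in candidates:
    print(candidate)
```

The default threshold here mirrors `WLASL_I3D_CONFIDENCE_THRESHOLD=0.20`; whether the app applies it per candidate or only to the top prediction is not stated in this README.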
Live camera debug prioritizes speed over long temporal batching. It starts
predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of
`LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every
`LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every
`LIVE_EMOTION_EVERY=45` frames by default.
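The live scheduling described above can be sketched with a bounded `collections.deque`. The four defaults mirror the `LIVE_*` settings; the class and method names are hypothetical, and the real app may count frames differently.

```python
from collections import deque


class LiveScheduler:
    """Decide, per incoming frame, whether to run ASL and emotion inference."""

    def __init__(self, min_frames=4, max_frames=12, predict_every=1, emotion_every=45):
        self.buffer = deque(maxlen=max_frames)  # oldest frames fall off automatically
        self.min_frames = min_frames
        self.predict_every = predict_every
        self.emotion_every = emotion_every
        self.frame_count = 0

    def push(self, frame):
        """Store the frame and return (run_asl, run_emotion) for this step."""
        self.buffer.append(frame)
        self.frame_count += 1
        run_asl = (
            len(self.buffer) >= self.min_frames
            and self.frame_count % self.predict_every == 0
        )
        run_emotion = self.frame_count % self.emotion_every == 0
        return run_asl, run_emotion


scheduler = LiveScheduler()
decisions = [scheduler.push(frame_id) for frame_id in range(50)]
print(decisions[:5], len(scheduler.buffer))
```

A small rolling buffer keeps latency low because each prediction only looks at the most recent frames, at the cost of less temporal context than the upload path.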
Good first signs to test because they are in the model vocabulary:
```text
hello, where, who, why, yes, no, thankyou, please, water, happy, sad
```
Reference clips and GIFs from the upstream demo list:
```text
hello https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where https://lifeprint.com/asl101/gifs/w/where.gif
who https://lifeprint.com/asl101/gifs/w/who.gif
why https://lifeprint.com/asl101/gifs/w/why.gif
yes https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou https://lifeprint.com/asl101/gifs/t/thank-you.gif
please https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water https://lifeprint.com/asl101/gifs/w/water-2.gif
```
## GPU Dependencies
`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA
versions. Keep the default `requirements.txt` for CPU Spaces. If the Space is
moved to a compatible GPU runtime, install the GPU dependency set instead:
```bash
pip install -r requirements-gpu.txt --no-build-isolation
```
## Full ASL Dependencies
The default requirements keep the Space build lighter. For a full local
ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection,
install:
```bash
pip install -r requirements-asl-full.txt
```