---
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
tags:
- track:backyard
- achievement:offgrid
- achievement:offbrand
- achievement:llama
---
A local-first AI stack that translates sign language, intent, and expression
into natural speech.
Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon.
It takes an uploaded or webcam ASL clip, extracts sign candidates and facial
expression signals, converts the result into a compact intent JSON, then speaks
the sentence with a local small-model voice stack.
## Scope Note
We would have loved to push Sign2Voice further toward open-ended ASL
conversation during the hackathon. The main constraint was not the app shell or
the voice stack; it was data. Public ASL resources are still uneven for this
task: many usable datasets focus on isolated signs, dictionary retrieval, or
fingerspelling, while fluent ASL needs signer diversity, real-world lighting and
camera angles, temporal boundaries, facial grammar, body movement, and careful
expert annotation.
That is why this submission is intentionally evidence-first. It exposes top sign
candidates, confidence thresholds, segment diagnostics, and emotion metadata
instead of pretending that a small hackathon model can solve full ASL
translation end to end. Microsoft Research describes sign-language modeling as
being far behind spoken-language modeling largely because of a lack of
appropriate training data, and recent SLR survey work calls data acquisition and
annotation the main bottleneck for systems that work on fluent signing.
References:
- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)
## Hackathon Fit
The project is aimed at the Backyard AI track: a practical communication tool
for people who need fast sign-to-speech support without sending every clip to a
cloud API.
Build Small constraints covered by this repo:
- Gradio app, ready for a Hugging Face Space.
- Small-model stack under the 32B parameter limit.
- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS
speech generation run in the app process.
- Step-by-step demo flow that shows intermediate evidence instead of hiding
uncertainty.
Badges that fit this build:
- Off the Grid / Local-first: no cloud inference API is required at runtime.
- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
- Off-Brand / Custom UI: the Space uses custom Gradio styling.
- Field Notes: claim this only after publishing the build write-up.
Badges not claimed:
- Well-Tuned: this repo uses published models; it does not publish a new
fine-tuned model.
- Sharing is Caring: no public agent trace is included yet.
## Pipeline
```text
Video upload or camera capture
-> Sequential ASL frame sampling
-> MediaPipe landmarks, when installed
-> WLASL2000 I3D or TFLite ASL classifier
-> Ordered gloss sequence with confidence diagnostics
-> DeepFace emotion aggregation, when installed
-> llama.cpp subtitle and voice instruction
-> Qwen3-TTS audio
```
Each brick can fail independently and return diagnostics instead of blocking
the whole interface at startup.
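A minimal sketch of that failure-isolation pattern is shown here. The `BrickResult` container, the `run_brick` helper, and the `dependency_missing` and `error` status strings are hypothetical names chosen for illustration; they are not taken from the Sign2Voice code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BrickResult:
    """Hypothetical result container, not the app's actual type."""
    name: str
    ok: bool
    output: Any = None
    diagnostics: dict = field(default_factory=dict)


def run_brick(name: str, fn: Callable[[], Any]) -> BrickResult:
    """Run one pipeline stage and convert any failure into diagnostics."""
    try:
        return BrickResult(name=name, ok=True, output=fn())
    except ImportError as exc:
        # An optional dependency (for example MediaPipe or DeepFace) is absent.
        return BrickResult(
            name, False,
            diagnostics={"status": "dependency_missing", "detail": str(exc)},
        )
    except Exception as exc:
        # Any other failure is reported instead of crashing the interface.
        return BrickResult(
            name, False,
            diagnostics={"status": "error", "detail": str(exc)},
        )


def broken_emotion_brick():
    # Stand-in for a stage whose optional dependency is not installed.
    raise ImportError("deepface is not installed")


results = [
    run_brick("asl", lambda: ["HELLO"]),
    run_brick("emotion", broken_emotion_brick),
]
for result in results:
    print(result.name, result.ok, result.diagnostics)
```

With this shape, the UI can render the ASL output even when the emotion stage reports a failure.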
The demo screen is intentionally staged:
```text
1. Analyze ASL -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech -> Qwen3-TTS audio
```
When no ASL classifier is available, Sign2Voice reports `model_missing` and
does not invent ASL words. The manual gloss override is empty by default and
lives under advanced debug controls for downstream LLM/TTS testing only.
## Run Locally
Install the default CPU-friendly dependency set:
```bash
pip install -r requirements.txt
```
Start the Gradio app:
```bash
python3 app.py
```
Run unit tests:
```bash
python3 -m pytest -q
```
Run each brick directly:
```bash
python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py
```
`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is
supplied. It also outputs the path of the debug overlay video. To test the
transparent fallback, pass a manual gloss override:
```bash
python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
```
## ASL Model Files
The ASL classifier assets live under:
```text
data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json
```
They come from
[jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition).
Without these files, the ASL brick still samples frames and emits
`model_missing` diagnostics. The TFLite model recognizes the isolated signs in
`sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling
recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD`, defaulting to `0.70`,
are reported as `low_confidence` and are not forwarded as detected glosses.
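The gating rule can be sketched as follows. The `gate_prediction` function, the `detected` status string, and the returned dictionary layout are assumptions made for illustration; only the `ASL_CONFIDENCE_THRESHOLD` variable, its `0.70` default, and the `low_confidence` status come from this README.

```python
import os

# Default mirrors the documented ASL_CONFIDENCE_THRESHOLD value.
DEFAULT_THRESHOLD = float(os.environ.get("ASL_CONFIDENCE_THRESHOLD", "0.70"))


def gate_prediction(gloss: str, confidence: float,
                    threshold: float = DEFAULT_THRESHOLD) -> dict:
    """Forward a gloss only when its confidence reaches the threshold.

    Hypothetical helper: the real app may structure its diagnostics differently.
    """
    if confidence < threshold:
        # Keep the candidate visible for debugging, but do not forward it.
        return {
            "status": "low_confidence",
            "gloss": None,
            "candidate": gloss,
            "confidence": confidence,
        }
    return {
        "status": "detected",
        "gloss": gloss,
        "candidate": gloss,
        "confidence": confidence,
    }


accepted = gate_prediction("hello", 0.91, threshold=0.70)
rejected = gate_prediction("water", 0.42, threshold=0.70)
print(accepted["status"], accepted["gloss"])
print(rejected["status"], rejected["gloss"])
```

Keeping the rejected candidate in the result is one way to expose evidence without passing an unreliable gloss to the text-generation step.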
Uploaded videos use a phrase-prototype mode: frames are read in temporal order,
landmarks are extracted once, and the ASL model runs over sliding windows.
Accepted window predictions are collapsed into an ordered gloss sequence before
`llama.cpp` rewrites them as natural speech.
Tune it with:
```text
ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70
```
This is still not full continuous ASL translation, but it lets recorded phrase
clips become `hello where water`-style gloss sequences instead of one global
class.
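One way to picture the windowing and collapsing steps is sketched below. The `sliding_windows` and `collapse_glosses` helpers are hypothetical, and two behaviors are assumptions rather than documented facts: trailing frames that do not fill a whole window are dropped, and repeated glosses are merged even when a rejected window sits between them.

```python
from typing import List, Sequence, Tuple


def sliding_windows(n_frames: int, window: int = 30,
                    stride: int = 15) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive."""
    if n_frames <= 0:
        return []
    if n_frames <= window:
        # A short clip becomes a single window.
        return [(0, n_frames)]
    # Assumption: a trailing partial window is dropped.
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]


def collapse_glosses(window_predictions: Sequence[Tuple[str, float]],
                     threshold: float = 0.70) -> List[str]:
    """Drop low-confidence windows, then merge consecutive duplicates."""
    sequence: List[str] = []
    for gloss, confidence in window_predictions:
        if confidence < threshold:
            continue
        if sequence and sequence[-1] == gloss:
            continue
        sequence.append(gloss)
    return sequence


windows = sliding_windows(75, window=30, stride=15)
# Made-up per-window predictions, used only to exercise the helper.
predictions = [("hello", 0.90), ("hello", 0.80), ("where", 0.40),
               ("where", 0.85), ("water", 0.92)]
glosses = collapse_glosses(predictions)
print(windows)
print(glosses)
```

Because the stride is half the window in the documented defaults, neighboring windows overlap, which is why the same sign tends to appear in consecutive predictions and needs collapsing.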
An experimental WLASL2000 I3D backend is also available for broader vocabulary
coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face, downloads the I3D
architecture helper if needed, and falls back to the TFLite detector when
`ASL_DETECTOR_BACKEND=auto` cannot initialize it.
```text
ASL_DETECTOR_BACKEND=auto # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224
```
The WLASL backend is heavier and more experimental. Its model card reports
2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so
the UI exposes top candidates and segment diagnostics instead of hiding
uncertainty.
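The `auto` fallback order can be sketched like this. The `select_backend` function and the loader callables are hypothetical stand-ins, since the real initialization code is not shown in this README; only the `ASL_DETECTOR_BACKEND` values and the WLASL2000-then-TFLite order come from the text above.

```python
import os
from typing import Callable, Dict, List, Optional, Tuple


def select_backend(
    loaders: Dict[str, Callable[[], object]],
    requested: Optional[str] = None,
) -> Tuple[Optional[str], Optional[object], List[str]]:
    """Try backends in order and return (name, model, collected errors)."""
    requested = requested or os.environ.get("ASL_DETECTOR_BACKEND", "auto")
    # "auto" tries WLASL2000 I3D first, then the lightweight TFLite model.
    order = ["wlasl_i3d", "tflite"] if requested == "auto" else [requested]
    errors: List[str] = []
    for name in order:
        loader = loaders.get(name)
        if loader is None:
            errors.append(f"{name}: unknown backend")
            continue
        try:
            return name, loader(), errors
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    # Nothing loaded: the caller can surface a model_missing diagnostic.
    return None, None, errors


def load_wlasl():
    # Stand-in for a failed download or initialization.
    raise RuntimeError("weights unavailable")


def load_tflite():
    # Stand-in for a successfully loaded TFLite interpreter.
    return "tflite-model"


fake_loaders = {"wlasl_i3d": load_wlasl, "tflite": load_tflite}
auto_choice = select_backend(fake_loaders, "auto")
strict_choice = select_backend(fake_loaders, "wlasl_i3d")
print(auto_choice)
print(strict_choice)
```

Returning the collected errors alongside the chosen backend keeps the reason for a fallback visible in diagnostics.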
Live camera debug prioritizes speed over long temporal batching. It starts
predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of
`LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every
`LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every
`LIVE_EMOTION_EVERY=45` frames by default.
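A rough sketch of that scheduling logic follows. The `LiveScheduler` class is hypothetical, and it assumes that the emotion step first fires on frame 45 rather than on the first frame; the actual app may count differently.

```python
from collections import deque
from typing import Any, Tuple


class LiveScheduler:
    """Hypothetical helper deciding which bricks run for each live frame."""

    def __init__(self, min_frames: int = 4, max_frames: int = 12,
                 predict_every: int = 1, emotion_every: int = 45) -> None:
        # deque(maxlen=...) drops the oldest frame once the buffer is full.
        self.buffer: deque = deque(maxlen=max_frames)
        self.min_frames = min_frames
        self.predict_every = predict_every
        self.emotion_every = emotion_every
        self.frame_index = 0

    def push(self, frame: Any) -> Tuple[bool, bool]:
        """Store a frame and return (run_asl, run_emotion) for it."""
        self.buffer.append(frame)
        self.frame_index += 1
        run_asl = (len(self.buffer) >= self.min_frames
                   and self.frame_index % self.predict_every == 0)
        run_emotion = self.frame_index % self.emotion_every == 0
        return run_asl, run_emotion


scheduler = LiveScheduler()
# Integers stand in for camera frames.
decisions = [scheduler.push(i) for i in range(50)]
print(decisions[:5])
print(len(scheduler.buffer))
```

The bounded buffer keeps per-frame latency flat, while the sparse emotion schedule avoids paying the heavier DeepFace cost on every frame.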
Good first signs to test, since they are in the model vocabulary:
```text
hello, where, who, why, yes, no, thankyou, please, water, happy, sad
```
Reference clips and GIFs from the upstream demo list:
```text
hello https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where https://lifeprint.com/asl101/gifs/w/where.gif
who https://lifeprint.com/asl101/gifs/w/who.gif
why https://lifeprint.com/asl101/gifs/w/why.gif
yes https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou https://lifeprint.com/asl101/gifs/t/thank-you.gif
please https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water https://lifeprint.com/asl101/gifs/w/water-2.gif
```
## GPU Dependencies
`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA
versions. Keep the default `requirements.txt` for CPU Spaces. If the Space is
moved to a compatible GPU runtime, install the GPU dependency set instead:
```bash
pip install -r requirements-gpu.txt --no-build-isolation
```
## Full ASL Dependencies
The default requirements keep the Space build lighter. For a full local
ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection,
install:
```bash
pip install -r requirements-asl-full.txt
```