---
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
tags:
  - track:backyard
  - achievement:offgrid
  - achievement:offbrand
  - achievement:llama
---

A local-first AI stack that translates sign language, intent, and expression
into natural speech.

Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon.
It takes an uploaded or webcam ASL clip, extracts sign candidates and facial
expression signals, converts the result into a compact intent JSON, then speaks
the sentence with a local small-model voice stack.

## Scope Note

We would have loved to push Sign2Voice further toward open-ended ASL
conversation during the hackathon. The main constraint was not the app shell or
the voice stack; it was data. Public ASL resources are still uneven for this
task: many usable datasets focus on isolated signs, dictionary retrieval, or
fingerspelling, while fluent ASL needs signer diversity, real-world lighting and
camera angles, temporal boundaries, facial grammar, body movement, and careful
expert annotation.

That is why this submission is intentionally evidence-first. It exposes top sign
candidates, confidence thresholds, segment diagnostics, and emotion metadata
instead of pretending that a small hackathon model can solve full ASL
translation end to end. Microsoft Research describes sign-language modeling as
being far behind spoken-language modeling largely because of a lack of
appropriate training data, and recent SLR survey work calls data acquisition and
annotation the main bottleneck for systems that work on fluent signing.

References:

- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)

## Hackathon Fit

The project is aimed at the Backyard AI track: a practical communication tool
for people who need fast sign-to-speech support without sending every clip to a
cloud API.

Build Small constraints covered by this repo:

- Gradio app, ready for a Hugging Face Space.
- Small-model stack under the 32B parameter limit.
- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS
  speech generation run in the app process.
- Step-by-step demo flow that shows intermediate evidence instead of hiding
  uncertainty.

Badges that fit this build:

- Off the Grid / Local-first: no cloud inference API is required at runtime.
- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
- Off-Brand / Custom UI: the Space uses custom Gradio styling.
- Field Notes: claim this only after publishing the build write-up.

Badges not claimed:

- Well-Tuned: this repo uses published models; it does not publish a new
  fine-tuned model.
- Sharing is Caring: no public agent trace is included yet.

## Pipeline

```text
Video upload or camera capture
  -> Sequential ASL frame sampling
  -> MediaPipe landmarks, when installed
  -> WLASL2000 I3D or TFLite ASL classifier
  -> Ordered gloss sequence with confidence diagnostics
  -> DeepFace emotion aggregation, when installed
  -> llama.cpp subtitle and voice instruction
  -> Qwen3-TTS audio
```

Each brick can fail independently and return diagnostics instead of blocking
the whole interface at startup.
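
One way to get this kind of isolation is to wrap every brick so that an
exception becomes a diagnostics record rather than a crash. The sketch below is
illustrative only: `BrickResult`, `run_brick`, and the `ok` / `error` status
strings are hypothetical names chosen for this example, not the repo's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BrickResult:
    """Outcome of one pipeline stage (hypothetical structure)."""
    name: str
    status: str  # "ok" or "error" in this sketch
    output: Any = None
    diagnostics: dict = field(default_factory=dict)


def run_brick(name: str, fn: Callable[[], Any]) -> BrickResult:
    """Run one stage and turn any failure into diagnostics."""
    try:
        return BrickResult(name=name, status="ok", output=fn())
    except Exception as exc:  # broad on purpose: one brick must not take down the UI
        return BrickResult(
            name=name,
            status="error",
            diagnostics={"error_type": type(exc).__name__, "message": str(exc)},
        )


def missing_dependency() -> None:
    # Stand-in for an optional dependency that is not installed.
    raise ImportError("mediapipe is not installed")


results = [
    run_brick("landmarks", missing_dependency),
    run_brick("subtitle", lambda: "Hello, where is the water?"),
]
for result in results:
    print(result.name, result.status, result.diagnostics)
```

With this shape, the interface can render the diagnostics of a failed stage
next to the output of the stages that did succeed.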

The demo screen is intentionally staged:

```text
1. Analyze ASL       -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech   -> Qwen3-TTS audio
```

When no ASL classifier is available, Sign2Voice reports `model_missing` and
does not invent ASL words. The manual gloss override is empty by default and
lives under advanced debug controls for downstream LLM/TTS testing only.

## Run Locally

Install the default CPU-friendly dependency set:

```bash
pip install -r requirements.txt
```

Start the Gradio app:

```bash
python3 app.py
```

Run unit tests:

```bash
python3 -m pytest -q
```

Run each brick directly:

```bash
python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py
```

`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is
supplied. It also writes a debug overlay video path. To test the transparent
fallback:

```bash
python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
```

## ASL Model Files

The ASL classifier assets live under:

```text
data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json
```

They come from
[jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition).

Without these files, the ASL brick still samples frames and emits
`model_missing` diagnostics. The TFLite model recognizes the isolated signs in
`sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling
recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD`, defaulting to `0.70`,
are reported as `low_confidence` and are not forwarded as detected glosses.
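
The gating rule can be sketched as follows. This is a minimal illustration,
assuming the threshold is read from the environment. The function names, the
result dictionary keys, and the `detected` status are hypothetical; only
`model_missing`, `low_confidence`, and the `0.70` default come from the text
above.

```python
import os
from typing import Optional


def confidence_threshold() -> float:
    """Read ASL_CONFIDENCE_THRESHOLD, falling back to the documented 0.70."""
    return float(os.environ.get("ASL_CONFIDENCE_THRESHOLD", "0.70"))


def gate_prediction(gloss: Optional[str], confidence: float, threshold: float) -> dict:
    """Decide whether a classifier output may be forwarded as a gloss."""
    if gloss is None:
        # No classifier was available: never invent a sign.
        return {"status": "model_missing", "gloss": None, "confidence": 0.0}
    if confidence < threshold:
        # Keep the candidate for diagnostics, but do not forward it.
        return {
            "status": "low_confidence",
            "gloss": None,
            "candidate": gloss,
            "confidence": confidence,
        }
    return {"status": "detected", "gloss": gloss, "confidence": confidence}


threshold = confidence_threshold()
print(gate_prediction("water", 0.91, threshold))
print(gate_prediction("water", 0.42, threshold))
print(gate_prediction(None, 0.0, threshold))
```

Keeping the rejected candidate in the record is what lets the UI show
near-misses without ever speaking them.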

Uploaded videos use a phrase-prototype mode: frames are read in temporal order,
landmarks are extracted once, and the ASL model runs over sliding windows.
Accepted window predictions are collapsed into an ordered gloss sequence before
`llama.cpp` rewrites them as natural speech.

Tune it with:

```text
ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70
```

This is still not full continuous ASL translation, but it lets recorded phrase
clips become `hello where water`-style gloss sequences instead of one global
class.
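
To make the windowing concrete, here is a small sketch of how window boundaries
and the collapse step could work with the default window of 30 and stride of 15.
The helper names are hypothetical, and two behaviors are assumptions rather than
documented facts: rejected windows are dropped before merging, and consecutive
duplicates are merged into one gloss.

```python
from itertools import groupby
from typing import Iterator, List, Optional, Tuple


def window_bounds(num_frames: int, window: int = 30, stride: int = 15) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame index pairs for each sliding window."""
    if num_frames <= 0:
        return
    if num_frames < window:
        # Clip shorter than one window: use everything as a single window.
        yield (0, num_frames)
        return
    for start in range(0, num_frames - window + 1, stride):
        yield (start, start + window)


def collapse_glosses(window_glosses: List[Optional[str]]) -> List[str]:
    """Drop rejected windows (None), then merge consecutive duplicates."""
    accepted = [gloss for gloss in window_glosses if gloss is not None]
    return [gloss for gloss, _ in groupby(accepted)]


bounds = list(window_bounds(60))
print(bounds)

# One entry per window; None marks a window rejected by the confidence gate.
per_window = ["hello", "hello", None, "where", "water", "water"]
print(collapse_glosses(per_window))
```

Because the stride is half the window, a sign usually appears in two adjacent
windows, which is why some form of duplicate merging is needed.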

An experimental WLASL2000 I3D backend is also available for broader vocabulary
coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face, downloads the I3D
architecture helper if needed, and falls back to the TFLite detector when
`ASL_DETECTOR_BACKEND=auto` cannot initialize it.

```text
ASL_DETECTOR_BACKEND=auto        # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite      # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d   # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224
```

The WLASL backend is heavier and more experimental. Its model card reports
2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so
the UI exposes top candidates and segment diagnostics instead of hiding
uncertainty.
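
The `auto` fallback order described above could be implemented along these
lines. This is a sketch under assumptions: `resolve_backend` and the loader
callables are hypothetical, and returning `model_missing` when nothing loads is
inferred from the diagnostics the README describes, not taken from the code.

```python
import os
from typing import Callable, Dict, Optional, Tuple


def resolve_backend(
    loaders: Dict[str, Callable[[], object]],
    requested: Optional[str] = None,
) -> Tuple[str, Optional[object], Dict[str, str]]:
    """Return (backend_name, model, errors), trying backends in order."""
    requested = requested or os.environ.get("ASL_DETECTOR_BACKEND", "auto")
    # "auto" tries WLASL2000 I3D first, then TFLite; other values try one backend.
    order = ["wlasl_i3d", "tflite"] if requested == "auto" else [requested]
    errors: Dict[str, str] = {}
    for name in order:
        try:
            return name, loaders[name](), errors
        except Exception as exc:  # record the failure and try the next backend
            errors[name] = f"{type(exc).__name__}: {exc}"
    return "model_missing", None, errors


def broken_i3d() -> object:
    # Stand-in for a heavy backend that cannot initialize.
    raise RuntimeError("weights unavailable")


loaders = {"wlasl_i3d": broken_i3d, "tflite": lambda: "tflite-model"}
name, model, errors = resolve_backend(loaders, requested="auto")
print(name, model, errors)
```

Returning the collected errors alongside the chosen backend keeps the reason
for a fallback visible in the diagnostics.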

Live camera debug prioritizes speed over long temporal batching. It starts
predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of
`LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every
`LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every
`LIVE_EMOTION_EVERY=45` frames by default.
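
A rolling buffer with these defaults can be sketched with `collections.deque`.
The `LiveScheduler` class is hypothetical, and the exact frame on which emotion
detection first fires (here, frame 45) is an assumption.

```python
from collections import deque

# Defaults taken from the settings named above.
LIVE_ASL_MIN_FRAMES = 4
LIVE_ASL_MAX_FRAMES = 12
LIVE_ASL_PREDICT_EVERY = 1
LIVE_EMOTION_EVERY = 45


class LiveScheduler:
    """Decide, per incoming frame, which models should run."""

    def __init__(self) -> None:
        # deque(maxlen=...) silently discards the oldest frame when full.
        self.buffer = deque(maxlen=LIVE_ASL_MAX_FRAMES)
        self.frame_count = 0

    def push(self, frame: object) -> dict:
        self.buffer.append(frame)
        self.frame_count += 1
        run_asl = (
            len(self.buffer) >= LIVE_ASL_MIN_FRAMES
            and self.frame_count % LIVE_ASL_PREDICT_EVERY == 0
        )
        run_emotion = self.frame_count % LIVE_EMOTION_EVERY == 0
        return {"run_asl": run_asl, "run_emotion": run_emotion}


scheduler = LiveScheduler()
decisions = [scheduler.push(f"frame-{i}") for i in range(1, 91)]
asl_runs = sum(d["run_asl"] for d in decisions)
emotion_runs = sum(d["run_emotion"] for d in decisions)
print(asl_runs, emotion_runs, len(scheduler.buffer))
```

The short buffer keeps latency low, while the sparse emotion schedule keeps the
heavier model from slowing the per-frame loop.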

Good first signs to test because they are in the model vocabulary:

```text
hello, where, who, why, yes, no, thankyou, please, water, happy, sad
```

Reference clips and GIFs from the upstream demo list:

```text
hello     https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where     https://lifeprint.com/asl101/gifs/w/where.gif
who       https://lifeprint.com/asl101/gifs/w/who.gif
why       https://lifeprint.com/asl101/gifs/w/why.gif
yes       https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no        https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou  https://lifeprint.com/asl101/gifs/t/thank-you.gif
please    https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water     https://lifeprint.com/asl101/gifs/w/water-2.gif
```

## GPU Dependencies

`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA
versions. Keep the default `requirements.txt` for CPU Spaces. If the Space is
moved to a compatible GPU runtime, install the GPU dependency set instead:

```bash
pip install -r requirements-gpu.txt --no-build-isolation
```

## Full ASL Dependencies

The default requirements keep the Space build lighter. For a full local
ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection,
install:

```bash
pip install -r requirements-asl-full.txt
```