---
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
tags:
  - track:backyard
  - achievement:offgrid
  - achievement:offbrand
  - achievement:llama
---

A local-first AI stack that translates sign language, intent, and expression into natural speech.

Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon. It takes an uploaded or webcam ASL clip, extracts sign candidates and facial expression signals, converts the result into a compact intent JSON, then speaks the sentence with a local small-model voice stack.

## Scope Note

We would have loved to push Sign2Voice further toward open-ended ASL conversation during the hackathon. The main constraint was not the app shell or the voice stack; it was data. Public ASL resources are still uneven for this task: many usable datasets focus on isolated signs, dictionary retrieval, or fingerspelling, while fluent ASL needs signer diversity, real-world lighting and camera angles, temporal boundaries, facial grammar, body movement, and careful expert annotation.

That is why this submission is intentionally evidence-first. It exposes top sign candidates, confidence thresholds, segment diagnostics, and emotion metadata instead of pretending that a small hackathon model can solve full ASL translation end to end.

Microsoft Research describes sign-language modeling as being far behind spoken-language modeling largely because of a lack of appropriate training data, and recent SLR survey work calls data acquisition and annotation the main bottleneck for systems that work on fluent signing.

References:

- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)

## Hackathon Fit

The project is aimed at the Backyard AI track: a practical communication tool for people who need fast sign-to-speech support without sending every clip to a cloud API.

Build Small constraints covered by this repo:

- Gradio app, ready for a Hugging Face Space.
- Small-model stack under the 32B parameter limit.
- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS speech generation run in the app process.
- Step-by-step demo flow that shows intermediate evidence instead of hiding uncertainty.

Badges that fit this build:

- Off the Grid / Local-first: no cloud inference API is required at runtime.
- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
- Off-Brand / Custom UI: the Space uses custom Gradio styling.
- Field Notes: claim this only after publishing the build write-up.

Badges not claimed:

- Well-Tuned: this repo uses published models; it does not publish a new fine-tuned model.
- Sharing is Caring: no public agent trace is included yet.

## Pipeline

```text
Video upload or camera capture
  -> Sequential ASL frame sampling
  -> MediaPipe landmarks, when installed
  -> WLASL2000 I3D or TFLite ASL classifier
  -> Ordered gloss sequence with confidence diagnostics
  -> DeepFace emotion aggregation, when installed
  -> llama.cpp subtitle and voice instruction
  -> Qwen3-TTS audio
```

Each brick can fail independently and return diagnostics instead of blocking the whole interface at startup.

The demo screen is intentionally staged:

```text
1. Analyze ASL       -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech   -> Qwen3-TTS audio
```

When no ASL classifier is available, Sign2Voice reports `model_missing` and does not invent ASL words. The manual gloss override is empty by default and lives under advanced debug controls for downstream LLM/TTS testing only.

## Run Locally

Install the default CPU-friendly dependency set:

```bash
pip install -r requirements.txt
```

Start the Gradio app:

```bash
python3 app.py
```

Run unit tests:

```bash
python3 -m pytest -q
```

Run each brick directly:

```bash
python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py
```

`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is supplied. It also writes a debug overlay video path.

To test the transparent fallback:

```bash
python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
```

## ASL Model Files

The ASL classifier assets live under:

```text
data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json
```

They come from [jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition). Without these files, the ASL brick still samples frames and emits `model_missing` diagnostics.

The TFLite model recognizes the isolated signs in `sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD`, defaulting to `0.70`, are reported as `low_confidence` and are not forwarded as detected glosses.

Uploaded videos use a phrase-prototype mode: frames are read in temporal order, landmarks are extracted once, and the ASL model runs over sliding windows. Accepted window predictions are collapsed into an ordered gloss sequence before `llama.cpp` rewrites them as natural speech.

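The windowing and collapsing steps described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function names `sliding_windows` and `collapse_glosses` are hypothetical, the defaults follow the documented `ASL_SEQUENCE_WINDOW=30`, `ASL_SEQUENCE_STRIDE=15`, and `ASL_CONFIDENCE_THRESHOLD=0.70` values, and both the handling of short clips and the rule of merging only adjacent repeats are assumptions.

```python
from typing import List, Sequence, Tuple


def sliding_windows(n_frames: int, window: int = 30, stride: int = 15) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive.

    Assumption: a clip shorter than one window is treated as a single
    short window, and trailing frames that do not fill a whole window
    are ignored.
    """
    if n_frames <= 0:
        return []
    if n_frames <= window:
        return [(0, n_frames)]
    return [(start, start + window) for start in range(0, n_frames - window + 1, stride)]


def collapse_glosses(
    predictions: Sequence[Tuple[str, float]], threshold: float = 0.70
) -> List[str]:
    """Drop low-confidence windows, then merge adjacent repeats in order."""
    sequence: List[str] = []
    for gloss, confidence in predictions:
        if confidence < threshold:
            # Would be reported as low_confidence, not forwarded as a gloss.
            continue
        if not sequence or sequence[-1] != gloss:
            sequence.append(gloss)
    return sequence


if __name__ == "__main__":
    # Hypothetical per-window (gloss, confidence) outputs for one clip.
    windows = [("hello", 0.91), ("hello", 0.84), ("where", 0.75), ("water", 0.40), ("water", 0.95)]
    print(sliding_windows(60))
    print(collapse_glosses(windows))
```

Because overlapping windows often see the same sign twice, merging adjacent repeats keeps a held sign from being spoken twice; the trade-off in this sketch is that a sign deliberately repeated back to back would also be merged.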
Tune the phrase-prototype upload mode with:

```text
ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70
```

This is still not full continuous ASL translation, but it lets recorded phrase clips become `hello where water`-style gloss sequences instead of one global class.

An experimental WLASL2000 I3D backend is also available for broader vocabulary coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face and downloads the I3D architecture helper if needed. With `ASL_DETECTOR_BACKEND=auto`, the app falls back to the TFLite detector when the I3D backend cannot be initialized.

```text
ASL_DETECTOR_BACKEND=auto       # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite     # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d  # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224
```

The WLASL backend is heavier and more experimental. Its model card reports 2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so the UI exposes top candidates and segment diagnostics instead of hiding uncertainty.

Live camera debug prioritizes speed over long temporal batching. It starts predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of `LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every `LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every `LIVE_EMOTION_EVERY=45` frames by default.

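The live scheduling described above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the app's implementation: the `LiveScheduler` class and its `push` method are hypothetical names, the defaults copy the documented `LIVE_*` values, and the choice to count frames from 1 (so the first emotion pass lands on frame 45) is a guess.

```python
from collections import deque
from typing import Any, Deque, Tuple


class LiveScheduler:
    """Decide, per incoming frame, which heavy steps should run."""

    def __init__(
        self,
        min_frames: int = 4,      # LIVE_ASL_MIN_FRAMES
        max_frames: int = 12,     # LIVE_ASL_MAX_FRAMES
        predict_every: int = 1,   # LIVE_ASL_PREDICT_EVERY
        emotion_every: int = 45,  # LIVE_EMOTION_EVERY
    ) -> None:
        # deque(maxlen=...) drops the oldest frame automatically,
        # which gives the rolling buffer behaviour.
        self.buffer: Deque[Any] = deque(maxlen=max_frames)
        self.min_frames = min_frames
        self.predict_every = predict_every
        self.emotion_every = emotion_every
        self.frame_count = 0

    def push(self, frame: Any) -> Tuple[bool, bool]:
        """Store a frame and return (run_asl, run_emotion) for it."""
        self.buffer.append(frame)
        self.frame_count += 1
        run_asl = (
            len(self.buffer) >= self.min_frames
            and self.frame_count % self.predict_every == 0
        )
        run_emotion = self.frame_count % self.emotion_every == 0
        return run_asl, run_emotion


if __name__ == "__main__":
    scheduler = LiveScheduler()
    for index in range(50):
        run_asl, run_emotion = scheduler.push(f"frame-{index}")
        if run_emotion:
            print(f"frame {index + 1}: emotion pass, buffer holds {len(scheduler.buffer)} frames")
```

Keeping the two cadences independent means the cheap ASL prediction can run on nearly every frame while the heavier emotion pass stays rare.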
Good first signs to test because they are in the model vocabulary:

```text
hello, where, who, why, yes, no, thankyou, please, water, happy, sad
```

Reference clips and GIFs from the upstream demo list:

```text
hello     https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where     https://lifeprint.com/asl101/gifs/w/where.gif
who       https://lifeprint.com/asl101/gifs/w/who.gif
why       https://lifeprint.com/asl101/gifs/w/why.gif
yes       https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no        https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou  https://lifeprint.com/asl101/gifs/t/thank-you.gif
please    https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water     https://lifeprint.com/asl101/gifs/w/water-2.gif
```

## GPU Dependencies

`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA versions. Keep the default `requirements.txt` for CPU Spaces.

If the Space is moved to a compatible GPU runtime, install the GPU dependency set instead:

```bash
pip install -r requirements-gpu.txt --no-build-isolation
```

## Full ASL Dependencies

The default requirements keep the Space build lighter. For a full local ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection, install:

```bash
pip install -r requirements-asl-full.txt
```
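
To confirm which optional pieces are importable after installing, a small check like the one below may help. It is a hedged sketch, not a script shipped in this repo: `optional_dependency_report` is a hypothetical helper, and it assumes the optional packages are imported under the module names `mediapipe` and `deepface`.

```python
from importlib.util import find_spec
from typing import Dict, Mapping

# Assumed import names for the optional ASL/emotion dependencies.
OPTIONAL_MODULES: Mapping[str, str] = {
    "mediapipe": "MediaPipe landmarks",
    "deepface": "DeepFace emotion detection",
}


def optional_dependency_report(modules: Mapping[str, str] = OPTIONAL_MODULES) -> Dict[str, bool]:
    """Map each module name to whether it can be located, without importing it."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in optional_dependency_report().items():
        label = OPTIONAL_MODULES[name]
        print(f"{label}: {'available' if available else 'not installed'}")
```

Using `find_spec` only locates the modules, so the check stays fast and avoids loading heavy libraries just to report their presence.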