---
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
tags:
  - track:backyard
  - achievement:offgrid
  - achievement:offbrand
  - achievement:llama
---

A local-first AI stack that translates sign language, intent, and expression
into natural speech.

Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon.
It takes an uploaded or webcam ASL clip, extracts sign candidates and facial
expression signals, converts the result into a compact intent JSON, then speaks
the sentence with a local small-model voice stack.

## Scope Note

We would have loved to push Sign2Voice further toward open-ended ASL
conversation during the hackathon. The main constraint was not the app shell or
the voice stack; it was data. Public ASL resources are still uneven for this
task: many usable datasets focus on isolated signs, dictionary retrieval, or
fingerspelling, while fluent ASL needs signer diversity, real-world lighting and
camera angles, temporal boundaries, facial grammar, body movement, and careful
expert annotation.

That is why this submission is intentionally evidence-first. It exposes top sign
candidates, confidence thresholds, segment diagnostics, and emotion metadata
instead of pretending that a small hackathon model can solve full ASL
translation end to end. Microsoft Research describes sign-language modeling as
being far behind spoken-language modeling largely because of a lack of
appropriate training data, and recent SLR survey work calls data acquisition and
annotation the main bottleneck for systems that work on fluent signing.

References:

- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)

## Hackathon Fit

The project is aimed at the Backyard AI track: a practical communication tool
for people who need fast sign-to-speech support without sending every clip to a
cloud API.

Build Small constraints covered by this repo:

- Gradio app, ready for a Hugging Face Space.
- Small-model stack under the 32B parameter limit.
- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS
  speech generation run in the app process.
- Step-by-step demo flow that shows intermediate evidence instead of hiding
  uncertainty.

Badges that fit this build:

- Off the Grid / Local-first: no cloud inference API is required at runtime.
- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
- Off-Brand / Custom UI: the Space uses custom Gradio styling.
- Field Notes: claim this only after publishing the build write-up.

Badges not claimed:

- Well-Tuned: this repo uses published models; it does not publish a new
  fine-tuned model.
- Sharing is Caring: no public agent trace is included yet.

## Pipeline

```text
Video upload or camera capture
-> Sequential ASL frame sampling
-> MediaPipe landmarks, when installed
-> WLASL2000 I3D or TFLite ASL classifier
-> Ordered gloss sequence with confidence diagnostics
-> DeepFace emotion aggregation, when installed
-> llama.cpp subtitle and voice instruction
-> Qwen3-TTS audio
```

Each brick can fail independently and return diagnostics instead of blocking
the whole interface at startup.
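The isolation idea above can be sketched as a small wrapper. This is an illustration only: `run_brick` and the result keys (`brick`, `status`, `output`, `error`) are hypothetical names, not the ones used in `app.py`.

```python
from typing import Any, Callable


def run_brick(name: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> dict:
    """Run one pipeline brick and always return a diagnostic dict.

    Hypothetical helper: a failing brick produces an "error" record
    instead of raising, so the other bricks and the UI keep working.
    """
    try:
        return {"brick": name, "status": "ok", "output": fn(*args, **kwargs)}
    except Exception as exc:
        # Keep the exception type and message so the UI can display them.
        return {"brick": name, "status": "error", "error": f"{type(exc).__name__}: {exc}"}


def broken_emotion_brick(frames):
    # Stand-in for an optional dependency that is not installed.
    raise ImportError("deepface is not installed")


asl_result = run_brick("asl", lambda frames: ["hello"], [0, 1, 2])
emotion_result = run_brick("emotion", broken_emotion_brick, [0, 1, 2])
```

Here `asl_result` carries its output while `emotion_result` carries only the error text, which is the kind of per-brick diagnostic the interface can show.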

The demo screen is intentionally staged:

```text
1. Analyze ASL -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech -> Qwen3-TTS audio
```

When no ASL classifier is available, Sign2Voice reports `model_missing` and
does not invent ASL words. The manual gloss override is empty by default and
lives under advanced debug controls for downstream LLM/TTS testing only.

## Run Locally

Install the default CPU-friendly dependency set:

```bash
pip install -r requirements.txt
```

Start the Gradio app:

```bash
python3 app.py
```

Run unit tests:

```bash
python3 -m pytest -q
```

Run each brick directly:

```bash
python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py
```

`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is
supplied. It also reports the path of the debug overlay video. To test the
transparent fallback with a manual gloss override:

```bash
python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
```

## ASL Model Files

The ASL classifier assets live under:

```text
data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json
```

They come from
[jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition).

Without these files, the ASL brick still samples frames and emits
`model_missing` diagnostics. The TFLite model recognizes the isolated signs in
`sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling
recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD`, defaulting to `0.70`,
are reported as `low_confidence` and are not forwarded as detected glosses.
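A minimal sketch of that gating rule, assuming the threshold is read from the `ASL_CONFIDENCE_THRESHOLD` environment variable. The function name `gate_prediction`, the `ok` status, and the dict keys are hypothetical; only `low_confidence` and the `0.70` default come from this README.

```python
import os

# Assumption: the threshold comes from the environment, defaulting to 0.70.
DEFAULT_THRESHOLD = float(os.environ.get("ASL_CONFIDENCE_THRESHOLD", "0.70"))


def gate_prediction(candidate: str, confidence: float,
                    threshold: float = DEFAULT_THRESHOLD) -> dict:
    """Forward a gloss only when its confidence reaches the threshold."""
    if confidence < threshold:
        # The candidate stays visible for diagnostics but is not forwarded.
        return {"status": "low_confidence", "gloss": None,
                "candidate": candidate, "confidence": confidence}
    return {"status": "ok", "gloss": candidate,
            "candidate": candidate, "confidence": confidence}


accepted = gate_prediction("water", 0.91, threshold=0.70)
rejected = gate_prediction("water", 0.42, threshold=0.70)
```

Keeping the rejected candidate in the result is what lets the UI show evidence without passing an uncertain gloss to the LLM step.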

Uploaded videos use a phrase-prototype mode: frames are read in temporal order,
landmarks are extracted once, and the ASL model runs over sliding windows.
Accepted window predictions are collapsed into an ordered gloss sequence before
`llama.cpp` rewrites them as natural speech.

Tune it with:

```text
ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70
```

This is still not full continuous ASL translation, but it lets recorded phrase
clips become `hello where water`-style gloss sequences instead of a single
class for the whole clip.
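The sliding-window pass and the collapse step can be sketched as follows. This is a simplified illustration, not the app's code: `window_starts` and `collapse_glosses` are hypothetical names, clips shorter than one window are simply skipped, and a rejected window between two identical glosses does not split them.

```python
from itertools import groupby


def window_starts(num_frames: int, window: int = 30, stride: int = 15) -> list[int]:
    """Start indices of every full window over a clip of num_frames frames."""
    if num_frames < window:
        return []  # assumption: no padding for short clips in this sketch
    return list(range(0, num_frames - window + 1, stride))


def collapse_glosses(window_predictions: list[tuple[str, float]],
                     threshold: float = 0.70) -> list[str]:
    """Drop low-confidence windows, then merge consecutive repeats of a gloss."""
    accepted = [gloss for gloss, conf in window_predictions if conf >= threshold]
    return [gloss for gloss, _ in groupby(accepted)]


starts = window_starts(240)  # ASL_UPLOAD_MAX_FRAMES with the default window/stride
sequence = collapse_glosses([
    ("hello", 0.92), ("hello", 0.81), ("where", 0.77),
    ("no", 0.31), ("where", 0.88), ("water", 0.95),
])
```

With the default window of 30 and stride of 15, consecutive windows overlap by half, so a single sign usually appears in more than one window; merging repeats keeps it from being spoken twice.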

An experimental WLASL2000 I3D backend is also available for broader vocabulary
coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face, downloads the I3D
architecture helper if needed, and falls back to the TFLite detector when
`ASL_DETECTOR_BACKEND=auto` cannot initialize it.

```text
ASL_DETECTOR_BACKEND=auto       # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite     # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d  # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224
```
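The `auto` fallback order can be sketched like this. The function `select_backend`, the `loaders` mapping, and the returned tuple are hypothetical; only the backend names, the `auto` order, and the `model_missing` status come from this README.

```python
from typing import Any, Callable

# Order in which each ASL_DETECTOR_BACKEND value tries backends.
BACKEND_ORDER = {
    "auto": ["wlasl_i3d", "tflite"],
    "tflite": ["tflite"],
    "wlasl_i3d": ["wlasl_i3d"],
}


def select_backend(requested: str, loaders: dict[str, Callable[[], Any]]):
    """Return (backend_name, detector, errors), trying backends in order."""
    errors: dict[str, str] = {}
    for name in BACKEND_ORDER[requested]:
        try:
            return name, loaders[name](), errors
        except Exception as exc:
            errors[name] = str(exc)  # keep the reason for the diagnostics panel
    return "model_missing", None, errors


def failing_i3d_loader():
    # Stand-in for a download or initialization failure.
    raise RuntimeError("could not download weights")


fake_loaders = {"wlasl_i3d": failing_i3d_loader, "tflite": lambda: "tflite-detector"}
chosen, detector, errors = select_backend("auto", fake_loaders)
```

In this example `auto` ends up on the TFLite detector and still records why the I3D backend was skipped.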

The WLASL backend is heavier and more experimental. Its model card reports
2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so
the UI exposes top candidates and segment diagnostics instead of hiding
uncertainty.

Live camera debug prioritizes speed over long temporal batching. It starts
predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of
`LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every
`LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every
`LIVE_EMOTION_EVERY=45` frames by default.
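A rough sketch of that live scheduling, using a bounded `collections.deque` as the rolling buffer. The `LiveScheduler` class is hypothetical and only decides which bricks should run on each frame; the defaults mirror the values above.

```python
from collections import deque


class LiveScheduler:
    """Hypothetical helper that tracks the rolling buffer and run cadence."""

    def __init__(self, min_frames=4, max_frames=12, predict_every=1, emotion_every=45):
        self.buffer = deque(maxlen=max_frames)  # oldest frames drop off automatically
        self.min_frames = min_frames
        self.predict_every = predict_every
        self.emotion_every = emotion_every
        self.count = 0

    def push(self, frame):
        """Add a frame and return (run_asl, run_emotion) for this frame."""
        self.buffer.append(frame)
        self.count += 1
        run_asl = (len(self.buffer) >= self.min_frames
                   and self.count % self.predict_every == 0)
        run_emotion = self.count % self.emotion_every == 0
        return run_asl, run_emotion


scheduler = LiveScheduler()
decisions = [scheduler.push(frame_index) for frame_index in range(45)]
```

With the defaults, ASL prediction starts on the fourth frame and runs on every frame afterwards, while the emotion brick first runs on frame 45.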

Good first signs to test because they are in the model vocabulary:

```text
hello, where, who, why, yes, no, thankyou, please, water, happy, sad
```

Reference clips and GIFs from the upstream demo list:

```text
hello    https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where    https://lifeprint.com/asl101/gifs/w/where.gif
who      https://lifeprint.com/asl101/gifs/w/who.gif
why      https://lifeprint.com/asl101/gifs/w/why.gif
yes      https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no       https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou https://lifeprint.com/asl101/gifs/t/thank-you.gif
please   https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water    https://lifeprint.com/asl101/gifs/w/water-2.gif
```

## GPU Dependencies

`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA
versions. Keep the default `requirements.txt` for CPU Spaces. If the Space is
moved to a compatible GPU runtime, install the GPU dependency set instead:

```bash
pip install -r requirements-gpu.txt --no-build-isolation
```

## Full ASL Dependencies

The default requirements keep the Space build lighter. For a full local
ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection,
install:

```bash
pip install -r requirements-asl-full.txt
```