---
title: Sign2Voice
emoji: 🗣️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: "5.50.0"
python_version: "3.12"
app_file: app.py
pinned: false
tags:
  - track:backyard
  - achievement:offgrid
  - achievement:offbrand
  - achievement:llama
---

A local-first AI stack that translates sign language, intent, and expression
into natural speech.

Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon.
It takes an uploaded or webcam ASL clip, extracts sign candidates and facial
expression signals, converts the result into a compact intent JSON, then speaks
the sentence with a local small-model voice stack.

## Scope Note

We would have loved to push Sign2Voice further toward open-ended ASL
conversation during the hackathon. The main constraint was not the app shell or
the voice stack; it was data. Public ASL resources are still uneven for this
task: many usable datasets focus on isolated signs, dictionary retrieval, or
fingerspelling, while fluent ASL needs signer diversity, real-world lighting and
camera angles, temporal boundaries, facial grammar, body movement, and careful
expert annotation.

That is why this submission is intentionally evidence-first. It exposes top sign
candidates, confidence thresholds, segment diagnostics, and emotion metadata
instead of pretending that a small hackathon model can solve full ASL
translation end to end. Microsoft Research describes sign-language modeling as
being far behind spoken-language modeling largely because of a lack of
appropriate training data, and recent SLR survey work calls data acquisition and
annotation the main bottleneck for systems that work on fluent signing.

References:

- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)

## Hackathon Fit

The project is aimed at the Backyard AI track: a practical communication tool
for people who need fast sign-to-speech support without sending every clip to a
cloud API.

Build Small constraints covered by this repo:

- Gradio app, ready for a Hugging Face Space.
- Small-model stack under the 32B parameter limit.
- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS
  speech generation run in the app process.
- Step-by-step demo flow that shows intermediate evidence instead of hiding
  uncertainty.

Badges that fit this build:

- Off the Grid / Local-first: no cloud inference API is required at runtime.
- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
- Off-Brand / Custom UI: the Space uses custom Gradio styling.
- Field Notes: claim this only after publishing the build write-up.

Badges not claimed:

- Well-Tuned: this repo uses published models; it does not publish a new
  fine-tuned model.
- Sharing is Caring: no public agent trace is included yet.

## Pipeline

```text
Video upload or camera capture
-> Sequential ASL frame sampling
-> MediaPipe landmarks, when installed
-> WLASL2000 I3D or TFLite ASL classifier
-> Ordered gloss sequence with confidence diagnostics
-> DeepFace emotion aggregation, when installed
-> llama.cpp subtitle and voice instruction
-> Qwen3-TTS audio
```

Each stage of the pipeline (a "brick") can fail independently and return
diagnostics instead of blocking the whole interface at startup.
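
The failure isolation described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `BrickResult`, `run_brick`, and the diagnostic keys are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BrickResult:
    """Outcome of one pipeline stage (hypothetical structure)."""
    name: str
    status: str  # "ok" or "error"
    output: Any = None
    diagnostics: dict = field(default_factory=dict)


def run_brick(name: str, fn: Callable[[], Any]) -> BrickResult:
    """Run one stage and turn any exception into diagnostics."""
    try:
        return BrickResult(name=name, status="ok", output=fn())
    except Exception as exc:  # broad on purpose: one brick must not take down the UI
        return BrickResult(
            name=name,
            status="error",
            diagnostics={"error_type": type(exc).__name__, "message": str(exc)},
        )


def failing_emotion_brick() -> str:
    # Stand-in for an optional dependency that is not installed.
    raise ImportError("deepface is not installed")


results = [
    run_brick("asl", lambda: ["hello", "water"]),
    run_brick("emotion", failing_emotion_brick),
]
for result in results:
    print(result.name, result.status, result.diagnostics)
```

With this shape, the interface can render the ASL output and show the emotion brick's error side by side, rather than refusing to start.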

The demo screen is intentionally staged:

```text
1. Analyze ASL -> debug overlay + intent JSON
2. Generate subtitle -> llama.cpp output
3. Generate speech -> Qwen3-TTS audio
```

When no ASL classifier is available, Sign2Voice reports `model_missing` and
does not invent ASL words. The manual gloss override is empty by default and
lives under advanced debug controls for downstream LLM/TTS testing only.

## Run Locally

Install the default CPU-friendly dependency set:

```bash
pip install -r requirements.txt
```

Start the Gradio app:

```bash
python3 app.py
```

Run unit tests:

```bash
python3 -m pytest -q
```

Run each brick directly:

```bash
python3 scripts/test_asl_brick.py
python3 scripts/test_llm_brick.py
python3 scripts/test_tts_brick.py
python3 scripts/test_full_pipeline.py
```

`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is
supplied. It also writes a debug overlay video and reports its path. To test
the manual gloss override, which is the transparent fallback:

```bash
python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
```

## ASL Model Files

The ASL classifier assets live under:

```text
data/models/asl/model.tflite
data/models/asl/train.csv
data/models/asl/sign_to_prediction_index_map.json
```

They come from
[jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition).

Without these files, the ASL brick still samples frames and emits
`model_missing` diagnostics. The TFLite model recognizes the isolated signs in
`sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling
recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD`, defaulting to `0.70`,
are reported as `low_confidence` and are not forwarded as detected glosses.

Uploaded videos use a phrase-prototype mode: frames are read in temporal order,
landmarks are extracted once, and the ASL model runs over sliding windows.
Accepted window predictions are collapsed into an ordered gloss sequence before
`llama.cpp` rewrites them as natural speech.

Tune it with:

```text
ASL_UPLOAD_TARGET_FPS=12
ASL_UPLOAD_MAX_FRAMES=240
ASL_SEQUENCE_WINDOW=30
ASL_SEQUENCE_STRIDE=15
ASL_CONFIDENCE_THRESHOLD=0.70
```

This is still not full continuous ASL translation, but it lets recorded phrase
clips become `hello where water`-style gloss sequences instead of one global
class.
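
The windowing and collapsing steps can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: `window_starts` and `collapse_glosses` are hypothetical helpers, and the rule for clips shorter than one window (a single partial window) is a guess.

```python
def window_starts(n_frames: int, window: int = 30, stride: int = 15) -> list[int]:
    """Start indices of sliding windows over a sampled clip."""
    if n_frames <= 0:
        return []
    if n_frames <= window:
        # Assumption: a short clip is classified as one partial window.
        return [0]
    return list(range(0, n_frames - window + 1, stride))


def collapse_glosses(
    predictions: list[tuple[str, float]], threshold: float = 0.70
) -> list[str]:
    """Drop low-confidence windows, then merge consecutive repeats."""
    sequence: list[str] = []
    for label, confidence in predictions:
        if confidence < threshold:
            continue
        if not sequence or sequence[-1] != label:
            sequence.append(label)
    return sequence


# Per-window (label, confidence) pairs, in temporal order.
window_predictions = [
    ("hello", 0.90),
    ("hello", 0.80),
    ("where", 0.40),  # rejected: below the threshold
    ("where", 0.85),
    ("water", 0.95),
    ("water", 0.75),
]
print(len(window_starts(240)))
print(collapse_glosses(window_predictions))
```

Because the stride is half the window, neighbouring windows overlap and often agree, which is why consecutive repeats are merged. One side effect of this simple rule is that a sign repeated on purpose, with only rejected windows in between, also collapses to a single gloss.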

An experimental WLASL2000 I3D backend is also available for broader vocabulary
coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face, downloads the I3D
architecture helper if needed, and falls back to the TFLite detector when
`ASL_DETECTOR_BACKEND=auto` cannot initialize it.

```text
ASL_DETECTOR_BACKEND=auto       # default: WLASL2000, then TFLite
ASL_DETECTOR_BACKEND=tflite     # lightweight TFLite-only backend
ASL_DETECTOR_BACKEND=wlasl_i3d  # WLASL2000 I3D only
WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
WLASL_I3D_SEQUENCE_WINDOW=64
WLASL_I3D_SEQUENCE_STRIDE=32
WLASL_I3D_FRAME_SIZE=224
```

The WLASL backend is heavier and more experimental. Its model card reports
2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so
the UI exposes top candidates and segment diagnostics instead of hiding
uncertainty.
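
As a rough illustration of what exposing top candidates means, the sketch below ranks per-class scores and keeps the best `k`. The function name and the example scores are made up for the illustration and are not model output.

```python
def top_candidates(scores: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented scores for one segment, not real classifier output.
example_scores = {"hello": 0.31, "help": 0.27, "water": 0.05, "who": 0.22}
ranked = top_candidates(example_scores, k=3)
print(ranked)
```

With top-1 accuracy near one in three, showing the runners-up gives a reviewer a fair chance of spotting the intended sign even when the first guess is wrong.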

Live camera debug prioritizes speed over long temporal batching. It starts
predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of
`LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every
`LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every
`LIVE_EMOTION_EVERY=45` frames by default.
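
The live scheduling can be sketched with a bounded buffer and two frame counters. `LiveScheduler` is a hypothetical class, and the choice to run emotion detection on every 45th frame (rather than on the first frame) is an assumption.

```python
from collections import deque

# Defaults quoted from this README.
LIVE_ASL_MIN_FRAMES = 4
LIVE_ASL_MAX_FRAMES = 12
LIVE_ASL_PREDICT_EVERY = 1
LIVE_EMOTION_EVERY = 45


class LiveScheduler:
    """Decide, per incoming frame, which bricks should run."""

    def __init__(self) -> None:
        # A deque with maxlen drops the oldest frame automatically.
        self.buffer: deque = deque(maxlen=LIVE_ASL_MAX_FRAMES)
        self.frame_count = 0

    def push(self, frame: object) -> dict:
        self.buffer.append(frame)
        self.frame_count += 1
        run_asl = (
            len(self.buffer) >= LIVE_ASL_MIN_FRAMES
            and self.frame_count % LIVE_ASL_PREDICT_EVERY == 0
        )
        run_emotion = self.frame_count % LIVE_EMOTION_EVERY == 0
        return {
            "run_asl": run_asl,
            "run_emotion": run_emotion,
            "buffered": len(self.buffer),
        }


scheduler = LiveScheduler()
decisions = [scheduler.push(frame_index) for frame_index in range(50)]
print(decisions[3])
print(decisions[44])
```

The short minimum keeps the first prediction fast, while the bounded buffer limits how much past motion can influence the current guess.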

Good first signs to test, since they are in the model vocabulary:

```text
hello, where, who, why, yes, no, thankyou, please, water, happy, sad
```

Reference clips and GIFs from the upstream demo list:

```text
hello    https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
where    https://lifeprint.com/asl101/gifs/w/where.gif
who      https://lifeprint.com/asl101/gifs/w/who.gif
why      https://lifeprint.com/asl101/gifs/w/why.gif
yes      https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
no       https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
thankyou https://lifeprint.com/asl101/gifs/t/thank-you.gif
please   https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
water    https://lifeprint.com/asl101/gifs/w/water-2.gif
```

## GPU Dependencies

`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA
versions. Keep the default `requirements.txt` for CPU Spaces. If the Space is
moved to a compatible GPU runtime, install the GPU dependency set instead:

```bash
pip install -r requirements-gpu.txt --no-build-isolation
```

## Full ASL Dependencies

The default requirements keep the Space build lighter. For a full local
ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection,
install:

```bash
pip install -r requirements-asl-full.txt
```