	---
	title: Sign2Voice
	emoji: 🗣️
	colorFrom: green
	colorTo: yellow
	sdk: gradio
	sdk_version: "5.50.0"
	python_version: "3.12"
	app_file: app.py
	pinned: false
	tags:
	- track:backyard
	- achievement:offgrid
	- achievement:offbrand
	- achievement:llama
	---

	A local-first AI stack that translates sign language, intent, and expression
	into natural speech.

	Sign2Voice is a Gradio demo built for the Hugging Face Build Small Hackathon.
	It takes an uploaded or webcam ASL clip, extracts sign candidates and facial
	expression signals, converts the result into a compact intent JSON, then speaks
	the sentence with a local small-model voice stack.

	## Scope Note

	We would have loved to push Sign2Voice further toward open-ended ASL
	conversation during the hackathon. The main constraint was not the app shell or
	the voice stack; it was data. Public ASL resources are still uneven for this
	task: many usable datasets focus on isolated signs, dictionary retrieval, or
	fingerspelling, while fluent ASL needs signer diversity, real-world lighting and
	camera angles, temporal boundaries, facial grammar, body movement, and careful
	expert annotation.

	That is why this submission is intentionally evidence-first. It exposes top sign
	candidates, confidence thresholds, segment diagnostics, and emotion metadata
	instead of pretending that a small hackathon model can solve full ASL
	translation end to end. Microsoft Research describes sign-language modeling as
	being far behind spoken-language modeling largely because of a lack of
	appropriate training data, and recent SLR survey work calls data acquisition and
	annotation the main bottleneck for systems that work on fluent signing.

	References:

	- [ASL Citizen, Microsoft Research](https://www.microsoft.com/en-us/research/project/asl-citizen/)
	- [Trends and Challenges for Sign Language Recognition with Machine Learning, ESANN 2023](https://www.esann.org/sites/default/files/proceedings/2023/ES2023-7.pdf)

	## Hackathon Fit

	The project is aimed at the Backyard AI track: a practical communication tool
	for people who need fast sign-to-speech support without sending every clip to a
	cloud API.

	Build Small constraints covered by this repo:

	- Gradio app, ready for a Hugging Face Space.
	- Small-model stack under the 32B parameter limit.
	- Local-first runtime: ASL detection, llama.cpp text generation, and Qwen3-TTS
	speech generation run in the app process.
	- Step-by-step demo flow that shows intermediate evidence instead of hiding
	uncertainty.

	Badges that fit this build:

	- Off the Grid / Local-first: no cloud inference API is required at runtime.
	- Llama Champion: the intent-to-speech text step uses `llama.cpp`.
	- Off-Brand / Custom UI: the Space uses custom Gradio styling.
	- Field Notes: claim this only after publishing the build write-up.

	Badges not claimed:

	- Well-Tuned: this repo uses published models; it does not publish a new
	fine-tuned model.
	- Sharing is Caring: no public agent trace is included yet.

	## Pipeline

	```text
	Video upload or camera capture
	-> Sequential ASL frame sampling
	-> MediaPipe landmarks, when installed
	-> WLASL2000 I3D or TFLite ASL classifier
	-> Ordered gloss sequence with confidence diagnostics
	-> DeepFace emotion aggregation, when installed
	-> llama.cpp subtitle and voice instruction
	-> Qwen3-TTS audio
	```

	Each brick (pipeline stage) can fail independently and return diagnostics
	instead of blocking the whole interface at startup.
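
The independent-failure behavior described above could be sketched roughly as follows. This is an illustrative pattern, not the repo's actual code: `BrickResult`, `run_brick`, `failing_emotion_brick`, and the `"ok"` / `"error"` status strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BrickResult:
    """Outcome of one pipeline stage (hypothetical structure)."""
    name: str
    status: str  # "ok" or "error" (illustrative values, not the repo's)
    output: Any = None
    diagnostics: dict = field(default_factory=dict)


def run_brick(name: str, fn: Callable[..., Any], *args: Any) -> BrickResult:
    """Run one stage and turn any exception into diagnostics."""
    try:
        return BrickResult(name=name, status="ok", output=fn(*args))
    except Exception as exc:  # deliberately broad so the UI stays up
        return BrickResult(
            name=name,
            status="error",
            diagnostics={"error_type": type(exc).__name__, "message": str(exc)},
        )


def failing_emotion_brick(frames):
    # Stand-in for an optional dependency that is not installed.
    raise ImportError("deepface is not installed")


result = run_brick("emotion", failing_emotion_brick, [])
print(result.status, result.diagnostics)
```

Wrapping each stage this way keeps an import or model-loading failure local to that stage, so the remaining stages can still report their own results.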

	The demo screen is intentionally staged:

	```text
	1. Analyze ASL -> debug overlay + intent JSON
	2. Generate subtitle -> llama.cpp output
	3. Generate speech -> Qwen3-TTS audio
	```

	When no ASL classifier is available, Sign2Voice reports `model_missing` and
	does not invent ASL words. The manual gloss override is empty by default and
	lives under advanced debug controls for downstream LLM/TTS testing only.

	## Run Locally

	Install the default CPU-friendly dependency set:

	```bash
	pip install -r requirements.txt
	```

	Start the Gradio app:

	```bash
	python3 app.py
	```

	Run unit tests:

	```bash
	python3 -m pytest -q
	```

	Run each brick directly:

	```bash
	python3 scripts/test_asl_brick.py
	python3 scripts/test_llm_brick.py
	python3 scripts/test_tts_brick.py
	python3 scripts/test_full_pipeline.py
	```

	`scripts/test_asl_brick.py` creates a tiny temporary clip when no video path is
	supplied. It also outputs the path of a debug overlay video. To test the
	transparent fallback (the manual gloss override):

	```bash
	python3 scripts/test_asl_brick.py --gloss-override "I LOVE YOU"
	```

	## ASL Model Files

	The ASL classifier assets live under:

	```text
	data/models/asl/model.tflite
	data/models/asl/train.csv
	data/models/asl/sign_to_prediction_index_map.json
	```

	They come from
	[jamesjbustos/sign-language-recognition](https://github.com/jamesjbustos/sign-language-recognition).

	Without these files, the ASL brick still samples frames and emits
	`model_missing` diagnostics. The TFLite model recognizes the isolated signs in
	`sign_to_prediction_index_map.json`; it is not a full sentence or fingerspelling
	recognizer. Predictions below `ASL_CONFIDENCE_THRESHOLD` (default `0.70`) are
	reported as `low_confidence` and are not forwarded as detected glosses.
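
As a rough illustration of the thresholding rule above, the sketch below tags each prediction and forwards only those that pass. The function names, the `"detected"` status string, and the dictionary layout are assumptions made for this example; only `ASL_CONFIDENCE_THRESHOLD`, its `0.70` default, and `low_confidence` come from this README.

```python
import os
from typing import Dict, List

# Documented default; reading it via os.environ is an assumption of this sketch.
THRESHOLD = float(os.environ.get("ASL_CONFIDENCE_THRESHOLD", "0.70"))


def label_prediction(gloss: str, confidence: float, threshold: float = THRESHOLD) -> Dict:
    """Tag a prediction as detected or low_confidence (hypothetical schema)."""
    status = "detected" if confidence >= threshold else "low_confidence"
    return {"gloss": gloss, "confidence": confidence, "status": status}


def forwarded_glosses(predictions: List[Dict]) -> List[str]:
    """Keep only the glosses that passed the threshold."""
    return [p["gloss"] for p in predictions if p["status"] == "detected"]


predictions = [
    label_prediction("hello", 0.91, threshold=0.70),
    label_prediction("water", 0.42, threshold=0.70),
]
print(forwarded_glosses(predictions))
```

Low-confidence entries stay in the list with their scores, which matches the idea of showing them as diagnostics without passing them downstream.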

	Uploaded videos use a phrase-prototype mode: frames are read in temporal order,
	landmarks are extracted once, and the ASL model runs over sliding windows.
	Accepted window predictions are collapsed into an ordered gloss sequence before
	`llama.cpp` rewrites them as natural speech.

	Tune it with:

	```text
	ASL_UPLOAD_TARGET_FPS=12
	ASL_UPLOAD_MAX_FRAMES=240
	ASL_SEQUENCE_WINDOW=30
	ASL_SEQUENCE_STRIDE=15
	ASL_CONFIDENCE_THRESHOLD=0.70
	```

	This is still not full continuous ASL translation, but it lets recorded phrase
	clips become `hello where water`-style gloss sequences instead of one global
	class.
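
The sliding-window and collapse steps described above might look something like this minimal sketch. It assumes that rejected windows are represented as `None` and that repeated glosses are merged after rejected windows are dropped; the repo may handle both differently. `window_starts` and `collapse_glosses` are hypothetical helper names.

```python
from typing import List, Optional


def window_starts(num_frames: int, window: int = 30, stride: int = 15) -> List[int]:
    """Start indices of every full window over a frame sequence."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, stride))


def collapse_glosses(window_glosses: List[Optional[str]]) -> List[str]:
    """Drop rejected windows (None), then merge consecutive repeats."""
    sequence: List[str] = []
    for gloss in window_glosses:
        if gloss is not None and (not sequence or sequence[-1] != gloss):
            sequence.append(gloss)
    return sequence


# 90 sampled frames with the documented window and stride values.
print(window_starts(90))
# One prediction per window; None marks a window below the threshold.
print(collapse_glosses(["hello", "hello", None, "where", "water", "water"]))
```

With a stride of half the window, adjacent windows overlap, so the same sign tends to be predicted twice in a row. That is why a collapse step is needed before the glosses are rewritten as speech.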

	An experimental WLASL2000 I3D backend is also available for broader vocabulary
	coverage. It uses `raghuhasan/asl2000-i3d` from Hugging Face, downloads the I3D
	architecture helper if needed, and falls back to the TFLite detector when
	`ASL_DETECTOR_BACKEND=auto` cannot initialize it.

	```text
	ASL_DETECTOR_BACKEND=auto # default: WLASL2000, then TFLite
	ASL_DETECTOR_BACKEND=tflite # lightweight TFLite-only backend
	ASL_DETECTOR_BACKEND=wlasl_i3d # WLASL2000 I3D only
	WLASL_I3D_CONFIDENCE_THRESHOLD=0.20
	WLASL_I3D_SEQUENCE_WINDOW=64
	WLASL_I3D_SEQUENCE_STRIDE=32
	WLASL_I3D_FRAME_SIZE=224
	```

	The WLASL backend is heavier and more experimental. Its model card reports
	2,000 classes with 32.48% top-1, 57.31% top-5, and 66.31% top-10 accuracy, so
	the UI exposes top candidates and segment diagnostics instead of hiding
	uncertainty.
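
The `auto` fallback order described above can be approximated with a small selection helper. This is a sketch under stated assumptions: `select_backend`, the `loaders` mapping, and the returned dictionary layout are hypothetical, and real loaders would initialize the I3D or TFLite models. Only the backend names, the fallback order, and the `model_missing` status come from this README.

```python
from typing import Callable, Dict


def select_backend(loaders: Dict[str, Callable[[], object]], mode: str = "auto") -> Dict:
    """Try backends in the documented order and report what happened."""
    order = ["wlasl_i3d", "tflite"] if mode == "auto" else [mode]
    errors: Dict[str, str] = {}
    for name in order:
        try:
            detector = loaders[name]()
            return {"backend": name, "detector": detector, "status": "ok", "errors": errors}
        except Exception as exc:  # record the failure and try the next backend
            errors[name] = f"{type(exc).__name__}: {exc}"
    # No classifier could be initialized: report it, never invent glosses.
    return {"backend": None, "detector": None, "status": "model_missing", "errors": errors}


def broken_i3d_loader():
    # Stand-in for a failed weight download or missing dependency.
    raise RuntimeError("weights not found")


loaders = {"wlasl_i3d": broken_i3d_loader, "tflite": lambda: "tflite-detector"}
print(select_backend(loaders, "auto")["backend"])
print(select_backend(loaders, "wlasl_i3d")["status"])
```

Keeping the per-backend error messages in the result lets the UI show why a heavier backend was skipped.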

	Live camera debug prioritizes speed over long temporal batching. It starts
	predicting after `LIVE_ASL_MIN_FRAMES=4`, keeps a rolling buffer of
	`LIVE_ASL_MAX_FRAMES=12`, and runs ASL prediction every
	`LIVE_ASL_PREDICT_EVERY=1` frame. DeepFace emotion is heavier, so it runs every
	`LIVE_EMOTION_EVERY=45` frames by default.
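
One way to picture the live scheduling above is a rolling buffer plus two frame counters. The sketch below is an assumption about how such a loop could be organized, not the repo's implementation: `LiveScheduler` and `push` are hypothetical names, and running emotion detection on every 45th frame (not on the first) is a choice made for this example.

```python
from collections import deque


class LiveScheduler:
    """Decide, per incoming frame, which bricks should run (illustrative)."""

    def __init__(self, min_frames=4, max_frames=12, predict_every=1, emotion_every=45):
        self.buffer = deque(maxlen=max_frames)  # oldest frames drop automatically
        self.min_frames = min_frames
        self.predict_every = predict_every
        self.emotion_every = emotion_every
        self.count = 0

    def push(self, frame):
        """Add a frame and return which bricks should run on it."""
        self.buffer.append(frame)
        self.count += 1
        run_asl = (
            len(self.buffer) >= self.min_frames
            and self.count % self.predict_every == 0
        )
        run_emotion = self.count % self.emotion_every == 0
        return {"asl": run_asl, "emotion": run_emotion}


scheduler = LiveScheduler()
decisions = [scheduler.push(i) for i in range(45)]
print(decisions[3], len(scheduler.buffer))
```

A bounded `deque` keeps memory and per-prediction cost constant no matter how long the camera stays on.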

	Good first signs to test, all of which are in the model vocabulary:

	```text
	hello, where, who, why, yes, no, thankyou, please, water, happy, sad
	```

	Reference clips and GIFs from the upstream demo list:

	```text
	hello https://media.giphy.com/media/3o7TKNKOfKlIhbD3gY/giphy.gif
	where https://lifeprint.com/asl101/gifs/w/where.gif
	who https://lifeprint.com/asl101/gifs/w/who.gif
	why https://lifeprint.com/asl101/gifs/w/why.gif
	yes https://media.tenor.com/oYIirlyIih0AAAAC/yes-asl.gif
	no https://lifeprint.com/asl101/gifs/n/no-2-movement.gif
	thankyou https://lifeprint.com/asl101/gifs/t/thank-you.gif
	please https://lifeprint.com/asl101/gifs-animated/pleasecloseup.gif
	water https://lifeprint.com/asl101/gifs/w/water-2.gif
	```

	## GPU Dependencies

	`flash-attn` is only useful on a CUDA GPU Space with compatible PyTorch/CUDA
	versions. Keep the default `requirements.txt` for CPU Spaces. If the Space is
	moved to a compatible GPU runtime, install the GPU dependency set instead:

	```bash
	pip install -r requirements-gpu.txt --no-build-isolation
	```

	## Full ASL Dependencies

	The default requirements keep the Space build lighter. For a full local
	ASL/emotion runtime with MediaPipe landmarks and DeepFace emotion detection,
	install:

	```bash
	pip install -r requirements-asl-full.txt
	```