File size: 5,416 Bytes
18d028b
 
 
 
 
961668b
 
18d028b
96fe5d4
4499b6e
d7d6b24
7e77fa5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
18d028b
 
 
 
 
 
 
 
8b64ea8
18d028b
 
 
 
a5ffd9e
 
 
 
 
 
 
 
 
 
 
 
18d028b
 
a5ffd9e
 
 
 
 
 
18d028b
 
 
 
 
 
 
 
 
 
a5ffd9e
18d028b
 
 
 
 
 
 
 
 
277d6c0
 
 
 
18d028b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5952553
 
18d028b
 
 
 
 
4499b6e
18d028b
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
---
title: SignBridge
emoji: 🀟
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
thumbnail: assets/cover.png
license: mit
short_description: Real-time ASL β†’ English speech on AMD MI300X.
tags:
  - accessibility
  - sign-language
  - asl
  - vision
  - multimodal
  - speech-synthesis
  - qwen
  - qwen3-vl
  - amd
  - amd-mi300x
  - rocm
  - vllm
  - lora
  - fine-tuning
  - mediapipe
  - gradio
  - hackathon
---

# SignBridge β€” real-time ASL β†’ speech

Two people who couldn't communicate, now can.

A deaf person signs into the webcam. SignBridge β€” a multi-stage vision + reasoning + voice pipeline running on a single AMD Instinct MI300X β€” translates the signs into spoken English in under 2 seconds.

Submission for the **AMD Developer Hackathon** (LabLab.ai, May 2026) β€” **Track 3: Vision & Multimodal AI**.

## How it works

```
                  β”Œβ”€β–Ί MediaPipe Hand β†’ trained MLP    (90% acc, 50ms CPU)
webcam frame ─────                       β”‚
                  └─► fine-tuned Qwen3-VL-8B (LoRA on AMD MI300X)
                                          β”‚      (92% acc, motion + fallback)
                                          β–Ό
                          Qwen3-8B sentence composer
                                          β”‚   (AMD MI300X)
                                          β–Ό
                              Coqui XTTS-v2 TTS
                                          β”‚
                                          β–Ό
                                       πŸ”Š speech
```

A hybrid pipeline: a small classical-ML classifier handles static fingerspelling at 90% accuracy with 50 ms CPU latency; a LoRA-fine-tuned Qwen3-VL-8B handles motion-dependent signs and ambiguous static frames; Qwen3-8B turns sign tokens into natural English. The two LLMs run **concurrently on a single AMD Instinct MI300X** via vLLM 0.17.1 on ROCm 7.2 β€” combined ~34 GB on a 192 GB GPU.

The fine-tune itself was trained on a single MI300X in **54 minutes** with LoRA (rank 16, target q/k/v/o, 2 epochs on 9,786 ASL Alphabet samples). Final eval loss 0.48; gold-set accuracy 92.3% β€” a 4.8Γ— lift over the 19.2% zero-shot baseline.

- Fine-tuned model: `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`
- Landmark classifier: `huggingface.co/LucasLooTan/signbridge-asl-classifier`

## V1 use cases

1. **ASL fingerspelling alphabet** β€” sign A–Z and 0–9 β†’ AI speaks the letters / numbers
2. **Top-50 WLASL signs** (hello, thank you, name, please, sorry, family, eat, drink, work, …) β†’ AI composes grammatical English sentences

V1 is **one-way**: deaf signs β†’ hearing hears. Reverse direction (speech β†’ on-screen text) is V2.

## Why AMD

The MI300X did three jobs in this project on a single GPU: (1) ran the LoRA fine-tune of Qwen3-VL-8B in 54 minutes; (2) hosts the merged model for inference via vLLM; (3) hosts the Qwen3-8B composer in parallel for sentence composition. 192 GB HBM3 means we never had to reload weights, swap, or shard between training and serving. NVIDIA H100 (80 GB) would require a 3-GPU cluster for the same V2 70B reasoner upgrade β€” practical accessibility tools running globally need the cost-and-availability profile that AMD enables.

## Why this matters (business case)

Sign-language interpreters cost **$50–200 per hour** and are scarce. Courts, hospitals, schools, and public services **must by law** provide interpretation (ADA Title II/III in the US, EAA 2025 in the EU). Sorenson VRS β€” the dominant relay-services provider β€” books **$4B+ in annual revenue** in this space. SignBridge is the open-source backbone that any country, NGO, or enterprise can deploy on their own AMD compute.

## Privacy

Session-only. Frames and audio are processed in-memory and not persisted server-side beyond the WebSocket / HTTP session.

## For Deaf-led teams

SignBridge is open-source under MIT license and intentionally scoped to ASL-only V1. The pipeline is a substrate, not a finished product β€” Deaf-led organisations (schools-for-the-Deaf, NGOs, ministries) are the intended deployers. Other sign languages (BSL, MSL, CSL, ISL, +200 more) deserve their own teams, training data, and Deaf community leadership. See [`docs/walkthrough.md`](docs/walkthrough.md) β†’ "Deployment ethics" for the design principles drawn from the Deaf-led academic literature.

## Local dev

```bash
# Setup
pip install -r requirements.txt
cp .env.example .env   # fill in HF_TOKEN, AMD_DEV_CLOUD_*, OPENAI_API_KEY (fallback)

# Run the Gradio app
python app.py

# Run the inference backend (point at AMD Dev Cloud or local ROCm)
python -m signbridge.backend

# Train the classifier on WLASL Top-100 (Day 2 task β€” run on AMD Dev Cloud)
python -m signbridge.scripts.train_classifier --dataset data/wlasl --epochs 30
```

## Datasets used

- [WLASL](https://github.com/dxli94/WLASL) β€” Word-Level American Sign Language; we use the Top-100 subset
- ASL fingerspelling alphabet (open dataset)

## Models pulled from Hugging Face Hub

- `Qwen/Qwen3-VL-32B-Instruct` β€” sign vision (recognizer)
- `Qwen/Qwen3-8B` β€” sentence composer
- `coqui/XTTS-v2` β€” text-to-speech
- (V2 stretch) `openai/whisper-large-v3` β€” for the reverse direction

## License

MIT. See [`LICENSE`](LICENSE).

## Status

Active development β€” see `CLAUDE.md` for the working state and `docs/walkthrough.md` for the technical writeup.