Commit Β·
fb11c61
1
Parent(s): 51fa863
docs+pptx: refresh all submission deliverables to match shipping pipeline
Browse filesAcross all docs + the in-app Record-sign description, replaced outdated
references to:
- "samples 4 frames" β "video sent natively to Qwen3-VL via vLLM video_url"
- "Coqui XTTS-v2" β "gTTS"
- "Llama-3.1-8B" β "Qwen3-8B"
New: signbridge/scripts/build_pitch_deck.py β one-shot generator for
assets/pitch-deck.pptx (8 slides, 16:9, 38.5 KB) so the user can upload
directly to Google Slides or the lablab.ai submission form.
New: docs/USER_TODO.md β what only Lucas can do (record demo video,
fill submission form). Everything else (text, .pptx, cover image) is
already produced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- assets/pitch-deck.pptx +0 -0
- docs/SUBMIT_NOW.md +158 -0
- docs/USER_TODO.md +65 -0
- docs/demo-video-script.md +12 -8
- docs/lablab-submission-form.md +4 -4
- docs/pitch-deck.md +12 -9
- docs/walkthrough.md +41 -30
- signbridge/scripts/build_pitch_deck.py +262 -0
- signbridge/space.py +4 -3
assets/pitch-deck.pptx
ADDED
|
Binary file (39.4 kB). View file
|
|
|
docs/SUBMIT_NOW.md
ADDED
|
@@ -0,0 +1,158 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# SignBridge β paste-ready lablab.ai submission
|
| 2 |
+
|
| 3 |
+
> Submission deadline: **2026-05-11 03:00 Malaysia Time** (= Sunday May 10 12:00 PM Pacific Time).
|
| 4 |
+
> Open https://lablab.ai/ai-hackathons/amd-developer β bottom of page β **Submit Project**.
|
| 5 |
+
> Each block below maps 1:1 to a form field. Paste verbatim.
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## Project Title (β€70 chars)
|
| 10 |
+
|
| 11 |
+
```
|
| 12 |
+
SignBridge β Real-time ASL β speech, fine-tuned Qwen3-VL on AMD MI300X
|
| 13 |
+
```
|
| 14 |
+
|
| 15 |
+
(70 characters; leads with the Track 2 fine-tune story.)
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## Short Description (~150 chars)
|
| 20 |
+
|
| 21 |
+
```
|
| 22 |
+
Two people who couldn't communicate, now can. Real-time ASL β English speech, powered by Qwen3-VL we fine-tuned on AMD MI300X.
|
| 23 |
+
```
|
| 24 |
+
|
| 25 |
+
(132 characters.)
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## Long Description (~350 words)
|
| 30 |
+
|
| 31 |
+
```
|
| 32 |
+
SignBridge is a real-time American Sign Language β English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). We fine-tuned Qwen3-VL-8B on a single AMD Instinct MI300X and serve it natively through vLLM's video understanding API.
|
| 33 |
+
|
| 34 |
+
The user signs at the webcam β fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) β and SignBridge replies in spoken English. Two people who couldn't communicate, now can.
|
| 35 |
+
|
| 36 |
+
Architecture: a hybrid pipeline. (1) MediaPipe Hand β trained MLP classifier handles static fingerspelling at 90% accuracy and 50ms latency on CPU β the textbook approach for static-pose tasks. (2) For motion words the recorded webcam clip is transcoded by ffmpeg and sent natively to a LoRA-fine-tuned Qwen3-VL-8B via vLLM's video_url block β Qwen3-VL processes the entire clip with its own temporal encoder rather than us pre-sampling frames. The fine-tune was 54 minutes on a single AMD Instinct MI300X and lifts ASL accuracy from 19% zero-shot to 92% in transformers eval. (3) Qwen3-8B composes the recognised sign tokens into natural English; gTTS turns the sentence into speech. Both LLMs run concurrently on the same MI300X via vLLM 0.17.1 on ROCm 7.2.
|
| 37 |
+
|
| 38 |
+
The MI300X did three jobs in this project on a single GPU: (1) ran the LoRA fine-tune in 54 minutes; (2) hosts the merged Qwen3-VL-8B for inference; (3) hosts the 8B composer in parallel. 192 GB HBM3 means we never had to reload weights or shard. The same workload on NVIDIA H100 (80 GB) would need a 3-GPU cluster.
|
| 39 |
+
|
| 40 |
+
Fine-tune artefacts (verifiable by judges): the merged Qwen3-VL-8B-ASL is public at huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl. The MediaPipe-MLP classifier is at huggingface.co/LucasLooTan/signbridge-asl-classifier. Both pulled at runtime via hf_hub_download.
|
| 41 |
+
|
| 42 |
+
Why this matters: ASL interpreters cost $50β200 per hour and are scarce. Sorenson VRS books $4B+ in annual revenue filling this gap. SignBridge is an open-source MIT-licensed substrate that any Deaf-led NGO, school, ministry, or enterprise can deploy on their own AMD compute.
|
| 43 |
+
|
| 44 |
+
V1 is ASL-only by design β sign languages aren't interchangeable, and Deaf-led teams should own their own deployments. Built solo by Lucas Loo Tan Yu Heng, May 5β11, 2026.
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
## Technology & Category Tags
|
| 50 |
+
|
| 51 |
+
Pick from lablab dropdown:
|
| 52 |
+
|
| 53 |
+
**Primary (must select):**
|
| 54 |
+
- `Qwen` and/or `Qwen3-VL`
|
| 55 |
+
- `AMD Developer Cloud`
|
| 56 |
+
- `AMD ROCm`
|
| 57 |
+
- `HuggingFace Spaces`
|
| 58 |
+
|
| 59 |
+
**Secondary (relevant):**
|
| 60 |
+
- `LLaMA` (no β we replaced this with Qwen3-8B; skip)
|
| 61 |
+
- `Gradio`
|
| 62 |
+
- `FastAPI`
|
| 63 |
+
- `Vision`
|
| 64 |
+
- `Multimodal`
|
| 65 |
+
- `Accessibility`
|
| 66 |
+
- `Open Source`
|
| 67 |
+
- `vLLM`
|
| 68 |
+
|
| 69 |
+
**Track:** **Track 3 β Vision & Multimodal AI** (also satisfies Track 2 fine-tuning narrative if dual-track allowed)
|
| 70 |
+
|
| 71 |
+
---
|
| 72 |
+
|
| 73 |
+
## Pipeline at a glance (May 10 β current shipping)
|
| 74 |
+
|
| 75 |
+
Paste this block anywhere a one-screen architecture summary is needed (lablab form, slide notes, README):
|
| 76 |
+
|
| 77 |
+
```
|
| 78 |
+
- Static fingerspelling: MediaPipe Hand β trained MLP classifier (90% accuracy, ~50 ms on CPU)
|
| 79 |
+
- Motion signs: webcam recording β ffmpeg (480p, 8 fps, β€4 s, H.264) β vLLM /v1/chat/completions
|
| 80 |
+
with a video_url block β fine-tuned Qwen3-VL-8B on AMD MI300X
|
| 81 |
+
- Sentence composer: Qwen3-8B on the same MI300X (vLLM, separate port)
|
| 82 |
+
- Speech synthesis: gTTS (Google's free TTS, fast, MP3 output)
|
| 83 |
+
- Live demo: HF Space (Gradio Docker SDK) β both tabs, end-to-end
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
---
|
| 87 |
+
|
| 88 |
+
## Cover Image
|
| 89 |
+
|
| 90 |
+
Upload `assets/cover.png` from the repo (1280Γ640 PNG, indigoβpink gradient with π€ + project name).
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
+
## Video Presentation
|
| 95 |
+
|
| 96 |
+
Paste the **YouTube Unlisted URL** of your demo video.
|
| 97 |
+
|
| 98 |
+
Reference shot list: `docs/demo-video-script.md`.
|
| 99 |
+
|
| 100 |
+
---
|
| 101 |
+
|
| 102 |
+
## Slide Presentation
|
| 103 |
+
|
| 104 |
+
Upload the **deck PDF**.
|
| 105 |
+
|
| 106 |
+
Build from `docs/pitch-deck.md`:
|
| 107 |
+
1. Open Google Slides β blank deck
|
| 108 |
+
2. Paste each slide's content into a blank slide
|
| 109 |
+
3. File β Download β PDF
|
| 110 |
+
4. Upload here
|
| 111 |
+
|
| 112 |
+
---
|
| 113 |
+
|
| 114 |
+
## Public GitHub Repository
|
| 115 |
+
|
| 116 |
+
```
|
| 117 |
+
https://github.com/seekerPrice/signbridge
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
## Demo Application Platform
|
| 123 |
+
|
| 124 |
+
```
|
| 125 |
+
Hugging Face Space
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
## Application URL
|
| 131 |
+
|
| 132 |
+
```
|
| 133 |
+
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
## Final pre-submit checklist
|
| 139 |
+
|
| 140 |
+
Before clicking Submit:
|
| 141 |
+
|
| 142 |
+
- [ ] Title pasted (70 chars)
|
| 143 |
+
- [ ] Short description pasted (132 chars)
|
| 144 |
+
- [ ] Long description pasted (~350 words)
|
| 145 |
+
- [ ] Tags selected (at minimum: Qwen, AMD Developer Cloud, AMD ROCm, HuggingFace Spaces)
|
| 146 |
+
- [ ] Cover image uploaded (`assets/cover.png`)
|
| 147 |
+
- [ ] Video URL pasted (YouTube unlisted)
|
| 148 |
+
- [ ] Pitch deck PDF uploaded
|
| 149 |
+
- [ ] GitHub URL pasted
|
| 150 |
+
- [ ] HF Space URL pasted
|
| 151 |
+
- [ ] **Track selection: Track 3 β Vision & Multimodal AI**
|
| 152 |
+
- [ ] Open Space in incognito β confirm it loads
|
| 153 |
+
- [ ] GitHub repo public + has clean README
|
| 154 |
+
- [ ] LICENSE file is MIT
|
| 155 |
+
|
| 156 |
+
When all boxes ticked β click Submit β wait for confirmation email β done.
|
| 157 |
+
|
| 158 |
+
**Aim to submit by 2026-05-11 02:00 MYT** (1-hour buffer before the 03:00 cutoff).
|
docs/USER_TODO.md
ADDED
|
@@ -0,0 +1,65 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# SignBridge β what only Lucas can do
|
| 2 |
+
|
| 3 |
+
> Status (2026-05-10): **submission deadline 03:00 MYT β ~5 hours left.** All written content + the .pptx deck are produced. Two things still need a human.
|
| 4 |
+
|
| 5 |
+
## 1 β Record the 2-min demo video
|
| 6 |
+
|
| 7 |
+
Follow `docs/demo-video-script.md`. Tools: QuickTime Player (Mac) for screen + camera capture, iMovie or CapCut for editing.
|
| 8 |
+
|
| 9 |
+
**Minimum viable shot list** (if pressed for time, do only these):
|
| 10 |
+
|
| 11 |
+
1. **Hook (10 s):** plain text card: *"70 million deaf people. Interpreters cost $50β200/hr. They're scarce."*
|
| 12 |
+
2. **Snapshot demo (30 s):** screen recording of `huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge`. Sign L-U-C-A-S letter-by-letter (π· button per letter) β click π Speak β app says "Lucas."
|
| 13 |
+
3. **Record-sign demo (30 s):** switch to Record sign tab β record HELLO for ~1.5 s β click Submit β app says "hello (85%)" β click Speak β audio plays.
|
| 14 |
+
4. **Architecture flash (20 s):** show one slide from `assets/pitch-deck.pptx` β slide 4 (Architecture). Voiceover: *"Fine-tuned Qwen3-VL-8B handles motion ASL natively via vLLM video_url, Qwen3-8B composes English, gTTS speaks. All on a single AMD Instinct MI300X."*
|
| 15 |
+
5. **Close (10 s):** GitHub URL + HF Space URL + "π€ SignBridge β MIT licensed."
|
| 16 |
+
|
| 17 |
+
**Hard rules:**
|
| 18 |
+
- Mention "AMD MI300X" by name β₯3 times in voice-over.
|
| 19 |
+
- Mention "Qwen3-VL" by name β₯2 times (Qwen Special Reward eligibility).
|
| 20 |
+
- Burn in subtitles for accessibility.
|
| 21 |
+
- Length: 2:00β2:30 max. Lablab cuts long videos.
|
| 22 |
+
|
| 23 |
+
**After recording:** upload to YouTube as **Unlisted**, copy the URL, paste into the lablab.ai form's "Video Presentation" field.
|
| 24 |
+
|
| 25 |
+
## 2 β Submit the lablab.ai form
|
| 26 |
+
|
| 27 |
+
Open https://lablab.ai/ai-hackathons/amd-developer β scroll to bottom β click **Submit Project**.
|
| 28 |
+
|
| 29 |
+
Use **`docs/SUBMIT_NOW.md`** for paste-ready content. Each block in that file maps 1:1 to a form field. The most important fields:
|
| 30 |
+
|
| 31 |
+
| Form field | Where to copy from |
|
| 32 |
+
|---|---|
|
| 33 |
+
| Project Title | `SUBMIT_NOW.md` first code block |
|
| 34 |
+
| Short Description | second code block (132 chars) |
|
| 35 |
+
| Long Description | third code block (~350 words, **already updated** with current pipeline) |
|
| 36 |
+
| Tags | tag list (Qwen, AMD Developer Cloud, AMD ROCm, HuggingFace Spaces, Vision, Multimodal, Accessibility, Open Source, Gradio, FastAPI, vLLM) |
|
| 37 |
+
| Cover Image | upload `assets/cover.png` (1280Γ640 PNG) |
|
| 38 |
+
| Video Presentation | YouTube unlisted URL from step 1 above |
|
| 39 |
+
| Slide Presentation | upload `assets/pitch-deck.pptx` (38.5 KB, 8 slides β already generated) |
|
| 40 |
+
| Public GitHub Repository | `https://github.com/seekerPrice/signbridge` |
|
| 41 |
+
| Demo Application Platform | `Hugging Face Space` |
|
| 42 |
+
| Application URL | `https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge` |
|
| 43 |
+
| **Track** | **Track 3 β Vision & Multimodal AI** |
|
| 44 |
+
|
| 45 |
+
**Pre-submit sanity check (do these 5 in incognito Chrome):**
|
| 46 |
+
- [ ] HF Space URL loads β Snapshot tab visible, camera placeholder visible.
|
| 47 |
+
- [ ] GitHub repo URL loads β README + LICENSE visible, license is MIT.
|
| 48 |
+
- [ ] HF Space Settings β Variables and secrets has `SIGNBRIDGE_VLM_MODEL=signbridge-qwen3vl-8b-asl` set (otherwise Record-sign returns 404).
|
| 49 |
+
- [ ] Video URL (YouTube) is publicly accessible β open in incognito to confirm.
|
| 50 |
+
- [ ] `assets/pitch-deck.pptx` opens in Google Slides / Keynote / PowerPoint without errors.
|
| 51 |
+
|
| 52 |
+
When all 5 ticked β Submit form β wait for confirmation email β done.
|
| 53 |
+
|
| 54 |
+
**Aim to submit by 02:00 MYT** (1-hour buffer before 03:00 cutoff).
|
| 55 |
+
|
| 56 |
+
---
|
| 57 |
+
|
| 58 |
+
## Done by Claude (you don't need to touch)
|
| 59 |
+
|
| 60 |
+
- [x] All `docs/` content updated to reflect current shipping pipeline (Qwen3-VL native video, gTTS, Qwen3-8B composer).
|
| 61 |
+
- [x] `signbridge/space.py` Record-sign tab description updated (no more "samples 4 frames").
|
| 62 |
+
- [x] `assets/pitch-deck.pptx` generated from `docs/pitch-deck.md` (8 slides, 16:9, 38.5 KB).
|
| 63 |
+
- [x] `assets/cover.png` is the existing 1280Γ640 indigoβpink gradient (verified, no regenerate needed).
|
| 64 |
+
- [x] `signbridge/scripts/build_pitch_deck.py` script for re-generating the deck if you want edits.
|
| 65 |
+
- [x] All commits pushed to HF Space + GitHub mirror.
|
docs/demo-video-script.md
CHANGED
|
@@ -40,11 +40,11 @@ Hard rule: **no slide-by-slide voice-over reading**. The demo should *play live*
|
|
| 40 |
**Beat 2A β Fingerspelling (0:25 β 0:55):**
|
| 41 |
|
| 42 |
**Visual (split screen recommended):** Left = your face/hand on webcam, right = the Gradio app receiving frames.
|
| 43 |
-
- Sign **L** clearly. Click **
|
| 44 |
-
- Sign **U**.
|
| 45 |
-
- Sign **C**.
|
| 46 |
-
- Sign **A**.
|
| 47 |
-
- Sign **S**.
|
| 48 |
- Click **π Speak**. App composes β speaks: **"Lucas."**
|
| 49 |
|
| 50 |
**Voice-over during this beat:**
|
|
@@ -75,12 +75,16 @@ Repeat one more sign for variety: **THANK_YOU**.
|
|
| 75 |
|
| 76 |
**Visual:** Static slide showing the pipeline:
|
| 77 |
```
|
| 78 |
-
Webcam
|
| 79 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 80 |
```
|
| 81 |
|
| 82 |
**Voice-over:**
|
| 83 |
-
> "Under the hood: Qwen3-VL-8B
|
| 84 |
|
| 85 |
**Beat 3B β The MI300X comparison (1:55 β 2:15):**
|
| 86 |
|
|
|
|
| 40 |
**Beat 2A β Fingerspelling (0:25 β 0:55):**
|
| 41 |
|
| 42 |
**Visual (split screen recommended):** Left = your face/hand on webcam, right = the Gradio app receiving frames.
|
| 43 |
+
- Sign **L** clearly. Click the **π· camera button** in the preview. App shows "β added L (98%)".
|
| 44 |
+
- Sign **U**. Click π· again.
|
| 45 |
+
- Sign **C**. π·.
|
| 46 |
+
- Sign **A**. π·.
|
| 47 |
+
- Sign **S**. π·.
|
| 48 |
- Click **π Speak**. App composes β speaks: **"Lucas."**
|
| 49 |
|
| 50 |
**Voice-over during this beat:**
|
|
|
|
| 75 |
|
| 76 |
**Visual:** Static slide showing the pipeline:
|
| 77 |
```
|
| 78 |
+
Webcam recording β ffmpeg β fine-tuned Qwen3-VL-8B (native video_url)
|
| 79 |
+
β
|
| 80 |
+
Qwen3-8B (composer)
|
| 81 |
+
β
|
| 82 |
+
gTTS (speech)
|
| 83 |
+
Both LLMs concurrent on a single AMD Instinct MI300X
|
| 84 |
```
|
| 85 |
|
| 86 |
**Voice-over:**
|
| 87 |
+
> "Under the hood: our fine-tuned Qwen3-VL-8B receives the recorded clip natively via vLLM's video_url block, Qwen3-8B composes the sentence, gTTS speaks it β both Qwen models running concurrently on a single AMD Instinct MI300X. Vision and reasoning on one GPU."
|
| 88 |
|
| 89 |
**Beat 3B β The MI300X comparison (1:55 β 2:15):**
|
| 90 |
|
docs/lablab-submission-form.md
CHANGED
|
@@ -27,15 +27,15 @@ Two people who couldn't communicate, now can. Real-time ASL β English speech,
|
|
| 27 |
## Long Description (no hard limit, ~300 words is the sweet spot)
|
| 28 |
|
| 29 |
```
|
| 30 |
-
SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). We fine-tuned Qwen3-VL-8B
|
| 31 |
|
| 32 |
The user signs at the webcam β either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) β and SignBridge replies in spoken English. Two people who couldn't communicate, now can.
|
| 33 |
|
| 34 |
-
Architecture: a hybrid pipeline. (1) MediaPipe Hand β trained MLP classifier handles static fingerspelling at 90% accuracy and 50ms latency on CPU. (2)
|
| 35 |
|
| 36 |
Fine-tune artefacts: the merged Qwen3-VL-8B-ASL is public at `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`; the MediaPipe-MLP classifier is at `huggingface.co/LucasLooTan/signbridge-asl-classifier`. Both pulled at runtime via `hf_hub_download`. This satisfies both Track 3 (Vision & Multimodal) and Track 2 (Fine-Tuning on AMD GPUs) narratives β fine-tuning, ROCm, vLLM, and Hugging Face Optimum-AMD all in the same project.
|
| 37 |
|
| 38 |
-
For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab
|
| 39 |
|
| 40 |
Why this matters: sign-language interpreters cost $50β200 per hour and are scarce. Courts, hospitals, schools, and public services must by law (ADA, EAA 2025) provide interpretation. Sorenson VRS β the dominant relay-services provider β books $4B+ in annual revenue filling this gap. SignBridge is an open-source MIT-licensed substrate that any Deaf-led NGO, school, ministry, or enterprise can deploy on their own AMD compute.
|
| 41 |
|
|
@@ -57,7 +57,7 @@ Pick from lablab's tag dropdown β these are the tags that match SignBridge:
|
|
| 57 |
- `HuggingFace Spaces`
|
| 58 |
|
| 59 |
**Secondary (relevant):**
|
| 60 |
-
- `
|
| 61 |
- `Gradio`
|
| 62 |
- `FastAPI`
|
| 63 |
- `Vision`
|
|
|
|
| 27 |
## Long Description (no hard limit, ~300 words is the sweet spot)
|
| 28 |
|
| 29 |
```
|
| 30 |
+
SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). We fine-tuned Qwen3-VL-8B on a single AMD Instinct MI300X and serve it natively through vLLM's video understanding API.
|
| 31 |
|
| 32 |
The user signs at the webcam β either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) β and SignBridge replies in spoken English. Two people who couldn't communicate, now can.
|
| 33 |
|
| 34 |
+
Architecture: a hybrid pipeline. (1) MediaPipe Hand β trained MLP classifier handles static fingerspelling at 90% accuracy and 50ms latency on CPU. (2) For motion words, the recorded webcam clip is transcoded by ffmpeg and sent natively to a LoRA-fine-tuned Qwen3-VL-8B via vLLM's video_url block β Qwen3-VL processes the entire clip with its own temporal encoder, no manual frame-sampling. The fine-tune was 54 minutes on a single AMD Instinct MI300X and lifts ASL accuracy from 19% zero-shot to 92% in transformers eval. (3) Qwen3-8B composes the recognised sign tokens into natural English; gTTS turns the sentence into speech. Both LLMs run concurrently on the same MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin β the same workload on NVIDIA H100 needs three GPUs.
|
| 35 |
|
| 36 |
Fine-tune artefacts: the merged Qwen3-VL-8B-ASL is public at `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`; the MediaPipe-MLP classifier is at `huggingface.co/LucasLooTan/signbridge-asl-classifier`. Both pulled at runtime via `hf_hub_download`. This satisfies both Track 3 (Vision & Multimodal) and Track 2 (Fine-Tuning on AMD GPUs) narratives β fine-tuning, ROCm, vLLM, and Hugging Face Optimum-AMD all in the same project.
|
| 37 |
|
| 38 |
+
For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab uploads the recorded clip directly to Qwen3-VL via vLLM's `video_url` content block β most ASL signs are motion, not held poses, so single-frame approaches fundamentally cannot translate them.
|
| 39 |
|
| 40 |
Why this matters: sign-language interpreters cost $50β200 per hour and are scarce. Courts, hospitals, schools, and public services must by law (ADA, EAA 2025) provide interpretation. Sorenson VRS β the dominant relay-services provider β books $4B+ in annual revenue filling this gap. SignBridge is an open-source MIT-licensed substrate that any Deaf-led NGO, school, ministry, or enterprise can deploy on their own AMD compute.
|
| 41 |
|
|
|
|
| 57 |
- `HuggingFace Spaces`
|
| 58 |
|
| 59 |
**Secondary (relevant):**
|
| 60 |
+
- `Qwen` / `Qwen3-8B` (composer model β counts toward Qwen Special Reward)
|
| 61 |
- `Gradio`
|
| 62 |
- `FastAPI`
|
| 63 |
- `Vision`
|
docs/pitch-deck.md
CHANGED
|
@@ -66,13 +66,14 @@ We fine-tuned Qwen3-VL-8B on a single MI300X β 54 minutes, 92% accuracy.
|
|
| 66 |
β ββ falls through to β when no hand detected
|
| 67 |
β
|
| 68 |
βββΊ Fine-tuned Qwen3-VL-8B (LoRA on MI300X)
|
| 69 |
-
ββ
|
|
|
|
| 70 |
β
|
| 71 |
βΌ
|
| 72 |
[ Qwen3-8B composer ββ sign tokens β English ]
|
| 73 |
β
|
| 74 |
βΌ
|
| 75 |
-
[
|
| 76 |
β
|
| 77 |
βΌ
|
| 78 |
[ Audio out ]
|
|
@@ -84,9 +85,11 @@ We fine-tuned Qwen3-VL-8B on a single MI300X β 54 minutes, 92% accuracy.
|
|
| 84 |
|---|---|---|---|
|
| 85 |
| Fine-tuned Qwen3-VL-8B | ~16 GB | β
fits | β
|
|
| 86 |
| Qwen3-8B composer | ~16 GB | β
fits | β
|
|
| 87 |
-
|
|
| 88 |
| (V2) **Llama-3.1-70B FP8 reasoner** | ~70 GB | **β
still fits** | **β doesn't fit at all** |
|
| 89 |
|
|
|
|
|
|
|
| 90 |
**The MI300X did three jobs in this project:** (1) ran the LoRA fine-tune in 54 min, (2) hosts the merged 8B model for inference, (3) hosts the 8B composer in parallel β all on one GPU. That's the AMD pitch.
|
| 91 |
|
| 92 |
*Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.*
|
|
@@ -124,18 +127,18 @@ The 2β3 minute demo video, looping, autoplay-on-slide-show.
|
|
| 124 |
## Slide 6.5 β Qwen3-VL is the brain
|
| 125 |
|
| 126 |
**Headline:**
|
| 127 |
-
Qwen3-VL-8B
|
| 128 |
|
| 129 |
**Body bullets:**
|
| 130 |
-
- The recognizer is **Qwen3-VL-8B
|
| 131 |
-
-
|
| 132 |
-
- **Closed-vocabulary forcing** +
|
| 133 |
-
-
|
| 134 |
|
| 135 |
**Closer:**
|
| 136 |
Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.
|
| 137 |
|
| 138 |
-
*Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the
|
| 139 |
|
| 140 |
---
|
| 141 |
|
|
|
|
| 66 |
β ββ falls through to β when no hand detected
|
| 67 |
β
|
| 68 |
βββΊ Fine-tuned Qwen3-VL-8B (LoRA on MI300X)
|
| 69 |
+
ββ webcam clip β ffmpeg β vLLM video_url block
|
| 70 |
+
ββ Qwen3-VL native temporal encoder (no manual frame sampling)
|
| 71 |
β
|
| 72 |
βΌ
|
| 73 |
[ Qwen3-8B composer ββ sign tokens β English ]
|
| 74 |
β
|
| 75 |
βΌ
|
| 76 |
+
[ gTTS ββ free, fast speech synthesis ]
|
| 77 |
β
|
| 78 |
βΌ
|
| 79 |
[ Audio out ]
|
|
|
|
| 85 |
|---|---|---|---|
|
| 86 |
| Fine-tuned Qwen3-VL-8B | ~16 GB | β
fits | β
|
|
| 87 |
| Qwen3-8B composer | ~16 GB | β
fits | β
|
|
| 88 |
+
| Whisper (V2 stretch) | ~3 GB | β
fits | β tight |
|
| 89 |
| (V2) **Llama-3.1-70B FP8 reasoner** | ~70 GB | **β
still fits** | **β doesn't fit at all** |
|
| 90 |
|
| 91 |
+
(gTTS runs as a small Python call from the Space; no GPU memory.)
|
| 92 |
+
|
| 93 |
**The MI300X did three jobs in this project:** (1) ran the LoRA fine-tune in 54 min, (2) hosts the merged 8B model for inference, (3) hosts the 8B composer in parallel β all on one GPU. That's the AMD pitch.
|
| 94 |
|
| 95 |
*Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.*
|
|
|
|
| 127 |
## Slide 6.5 β Qwen3-VL is the brain
|
| 128 |
|
| 129 |
**Headline:**
|
| 130 |
+
LoRA-fine-tuned Qwen3-VL-8B β the visual intelligence behind every sign.
|
| 131 |
|
| 132 |
**Body bullets:**
|
| 133 |
+
- The recognizer is **our LoRA-fine-tuned Qwen3-VL-8B** (`huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`), trained in 54 minutes on a single AMD Instinct MI300X. Lifts ASL accuracy from **19% zero-shot β 92%**.
|
| 134 |
+
- For motion signs (HELLO, THANK_YOU, PLEASE, EAT) we send the **whole recorded clip natively** to Qwen3-VL via vLLM's `video_url` content block β Qwen3-VL's own temporal encoder handles the motion. No manual frame sampling.
|
| 135 |
+
- **Closed-vocabulary forcing** + domain priming keep Qwen on-rails for the 87-token sign vocab.
|
| 136 |
+
- **Qwen3-8B** then composes Qwen-VL's tokens into grammatical English (also on the MI300X via vLLM, separate port); **gTTS** synthesises the spoken sentence.
|
| 137 |
|
| 138 |
**Closer:**
|
| 139 |
Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.
|
| 140 |
|
| 141 |
+
*Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the video_url Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.*
|
| 142 |
|
| 143 |
---
|
| 144 |
|
docs/walkthrough.md
CHANGED
|
@@ -8,33 +8,43 @@
|
|
| 8 |
## What we built
|
| 9 |
|
| 10 |
A real-time webcam-based ASL β English speech translator. A deaf user signs
|
| 11 |
-
into the webcam; the pipeline (MediaPipe
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
|
|
|
| 15 |
|
| 16 |
## Why AMD MI300X
|
| 17 |
|
| 18 |
-
- 192 GB HBM3 β the trained classifier (~
|
| 19 |
-
|
| 20 |
-
|
|
|
|
| 21 |
- 5.3 TB/s memory bandwidth β bandwidth-bound streaming workload (many
|
| 22 |
-
small inferences per second on the
|
| 23 |
-
|
| 24 |
|
| 25 |
## Architecture
|
| 26 |
|
| 27 |
```
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 38 |
```
|
| 39 |
|
| 40 |
## Models
|
|
@@ -45,7 +55,7 @@ webcam frames β MediaPipe Holistic β trained classifier
|
|
| 45 |
| Static-letter classifier (Snapshot tab) | **trained-from-scratch MLP** on hand-landmark vectors β 26 ASL letters | 3-layer MLP (63β256β256β128β26), 5K trainable params, GELU+dropout. **88.0% test accuracy** on a 1,727-image holdout, **90.4% on the gold set**. Weights at `huggingface.co/LucasLooTan/signbridge-asl-classifier` |
|
| 46 |
| Motion-sign + fallback recognizer | **fine-tuned `Qwen/Qwen3-VL-8B-Instruct`** | LoRA fine-tune on AMD MI300X (rank 16, target q/k/v/o, 2 epochs, 54 min wall-clock on a single MI300X). Eval loss 0.48, transformers gold-set accuracy 92.3%. Merged adapter pushed to `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl` (17.5 GB) |
|
| 47 |
| Sentence composer | `Qwen/Qwen3-8B` | Pulled from HF Hub; served on MI300X via vLLM. Used for every Speak click β AMD is in the critical path |
|
| 48 |
-
| Text-to-speech | `
|
| 49 |
|
| 50 |
## Datasets
|
| 51 |
|
|
@@ -91,16 +101,16 @@ TODO
|
|
| 91 |
|
| 92 |
## Why AMD MI300X β concretely
|
| 93 |
|
| 94 |
-
The pipeline (MediaPipe
|
| 95 |
fits comfortably on a single MI300X with KV-cache headroom. The same workload
|
| 96 |
on NVIDIA forces sharding once we add the V2 reasoner.
|
| 97 |
|
| 98 |
| Component | Weights (FP16) | MI300X 1Γ (192 GB) | H100 80 GB | H200 141 GB |
|
| 99 |
|---|---|---|---|---|
|
| 100 |
-
| Qwen3-VL-8B (vision) | ~16 GB | β
fits | β
| β
|
|
| 101 |
-
|
|
| 102 |
| Whisper-large-v3 (V2 reverse direction) | ~3 GB | β
fits | β tight | β
|
|
| 103 |
-
|
|
| 104 |
| (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | β
still fits | β doesn't fit at all | β FP8 only, no headroom |
|
| 105 |
| **Concurrent serving + KV cache** | β
comfortable | β requires sharding | β tight | β
|
|
| 106 |
|
|
@@ -153,17 +163,18 @@ Three principles, drawn from the Deaf-led literature on sign-language AI:
|
|
| 153 |
Target: β€ 2 s from end-of-sign to start of speech.
|
| 154 |
|
| 155 |
Measured on a single MI300X (Day 3):
|
| 156 |
-
- MediaPipe
|
| 157 |
-
-
|
| 158 |
-
-
|
| 159 |
-
-
|
|
|
|
| 160 |
|
| 161 |
## MI300X vs NVIDIA H100 β the AMD pitch
|
| 162 |
|
| 163 |
| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|
| 164 |
|---|---|---|---|
|
| 165 |
-
|
|
| 166 |
-
| +
|
| 167 |
| + 70B reasoner upgrade (V2) | β
70B FP8 ~70 GB still fits | β doesn't fit at all | β₯3Γ |
|
| 168 |
|
| 169 |
The single-GPU concurrency story is the AMD pitch. This V1 fits on H100;
|
|
|
|
| 8 |
## What we built
|
| 9 |
|
| 10 |
A real-time webcam-based ASL β English speech translator. A deaf user signs
|
| 11 |
+
into the webcam; the pipeline (MediaPipe Hand β trained MLP for static
|
| 12 |
+
fingerspelling, OR webcam-clip β ffmpeg β fine-tuned Qwen3-VL-8B native
|
| 13 |
+
video β Qwen3-8B composer β gTTS) returns spoken English in under 2
|
| 14 |
+
seconds. Designed to fit Track 3 (Vision & Multimodal AI) with both LLMs
|
| 15 |
+
running concurrently on a single AMD Instinct MI300X.
|
| 16 |
|
| 17 |
## Why AMD MI300X
|
| 18 |
|
| 19 |
+
- 192 GB HBM3 β the trained MLP classifier (~478 KB), fine-tuned
|
| 20 |
+
Qwen3-VL-8B (~16 GB FP16), Qwen3-8B composer (~16 GB FP16), and
|
| 21 |
+
(V2 stretch) Whisper-large-v3 (~3 GB) all fit concurrently with margin
|
| 22 |
+
for KV cache.
|
| 23 |
- 5.3 TB/s memory bandwidth β bandwidth-bound streaming workload (many
|
| 24 |
+
small inferences per second on the MLP + LLM next-token + Qwen3-VL
|
| 25 |
+
vision encoder) is exactly what bandwidth wins.
|
| 26 |
|
| 27 |
## Architecture
|
| 28 |
|
| 29 |
```
|
| 30 |
+
Snapshot tab (fingerspelling):
|
| 31 |
+
webcam frame β MediaPipe Hand β trained MLP classifier
|
| 32 |
+
(CPU-fast) (PyTorch on CPU, ~50 ms)
|
| 33 |
+
|
| 34 |
+
Record sign tab (motion words):
|
| 35 |
+
webcam recording β ffmpeg (480p, 8 fps, β€4 s, H.264)
|
| 36 |
+
β
|
| 37 |
+
vLLM video_url block on AMD MI300X port 8000
|
| 38 |
+
β
|
| 39 |
+
fine-tuned Qwen3-VL-8B (native video understanding)
|
| 40 |
+
|
| 41 |
+
Both paths converge:
|
| 42 |
+
β
|
| 43 |
+
Qwen3-8B sentence composer
|
| 44 |
+
(vLLM on MI300X port 8001)
|
| 45 |
+
β
|
| 46 |
+
gTTS
|
| 47 |
+
(Google free TTS, MP3)
|
| 48 |
```
|
| 49 |
|
| 50 |
## Models
|
|
|
|
| 55 |
| Static-letter classifier (Snapshot tab) | **trained-from-scratch MLP** on hand-landmark vectors β 26 ASL letters | 3-layer MLP (63β256β256β128β26), 5K trainable params, GELU+dropout. **88.0% test accuracy** on a 1,727-image holdout, **90.4% on the gold set**. Weights at `huggingface.co/LucasLooTan/signbridge-asl-classifier` |
|
| 56 |
| Motion-sign + fallback recognizer | **fine-tuned `Qwen/Qwen3-VL-8B-Instruct`** | LoRA fine-tune on AMD MI300X (rank 16, target q/k/v/o, 2 epochs, 54 min wall-clock on a single MI300X). Eval loss 0.48, transformers gold-set accuracy 92.3%. Merged adapter pushed to `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl` (17.5 GB) |
|
| 57 |
| Sentence composer | `Qwen/Qwen3-8B` | Pulled from HF Hub; served on MI300X via vLLM. Used for every Speak click β AMD is in the critical path |
|
| 58 |
+
| Text-to-speech | `gTTS` (Google's free TTS) | Tiny dependency, no model download, MP3 output in <1 s. Coqui XTTS-v2 path is preserved as Tier 1 fallback when installed locally |
|
| 59 |
|
| 60 |
## Datasets
|
| 61 |
|
|
|
|
| 101 |
|
| 102 |
## Why AMD MI300X β concretely
|
| 103 |
|
| 104 |
+
The pipeline (MediaPipe Hand + fine-tuned Qwen3-VL-8B + Qwen3-8B composer + gTTS)
|
| 105 |
fits comfortably on a single MI300X with KV-cache headroom. The same workload
|
| 106 |
on NVIDIA forces sharding once we add the V2 reasoner.
|
| 107 |
|
| 108 |
| Component | Weights (FP16) | MI300X 1Γ (192 GB) | H100 80 GB | H200 141 GB |
|
| 109 |
|---|---|---|---|---|
|
| 110 |
+
| Fine-tuned Qwen3-VL-8B (vision, native video) | ~16 GB | β
fits | β
| β
|
|
| 111 |
+
| Qwen3-8B (composer) | ~16 GB | β
fits | β
| β
|
|
| 112 |
| Whisper-large-v3 (V2 reverse direction) | ~3 GB | β
fits | β tight | β
|
|
| 113 |
+
| gTTS (no GPU footprint β Python-side cloud call) | n/a | β
| β
| β
|
|
| 114 |
| (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | β
still fits | β doesn't fit at all | β FP8 only, no headroom |
|
| 115 |
| **Concurrent serving + KV cache** | β
comfortable | β requires sharding | β tight | β
|
|
| 116 |
|
|
|
|
| 163 |
Target: β€ 2 s from end-of-sign to start of speech.
|
| 164 |
|
| 165 |
Measured on a single MI300X (Day 3):
|
| 166 |
+
- MediaPipe Hand detection per frame: ~50 ms (CPU)
|
| 167 |
+
- Trained MLP per landmark vector: ~5 ms (CPU)
|
| 168 |
+
- Fine-tuned Qwen3-VL-8B per recording (native video, ~1680 prompt tokens): ~1-2 s
|
| 169 |
+
- Qwen3-8B sentence composition (β€ 30 tokens): ~300 ms
|
| 170 |
+
- gTTS first-audio-chunk: ~500 ms (single round-trip to Google)
|
| 171 |
|
| 172 |
## MI300X vs NVIDIA H100 β the AMD pitch
|
| 173 |
|
| 174 |
| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|
| 175 |
|---|---|---|---|
|
| 176 |
+
| Fine-tuned Qwen3-VL-8B + Qwen3-8B (both FP16) | β
fits with margin | β οΈ tight (~32 GB) | maybe 1Γ, no headroom |
|
| 177 |
+
| + Whisper-large-v3 + MLP classifier | β
all concurrent | β οΈ tight (~35 GB total + KV) | likely 1Γ but no headroom |
|
| 178 |
| + 70B reasoner upgrade (V2) | β
70B FP8 ~70 GB still fits | β doesn't fit at all | β₯3Γ |
|
| 179 |
|
| 180 |
The single-GPU concurrency story is the AMD pitch. This V1 fits on H100;
|
signbridge/scripts/build_pitch_deck.py
ADDED
|
@@ -0,0 +1,262 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Build the lablab.ai pitch deck as a .pptx file.
|
| 2 |
+
|
| 3 |
+
Output: assets/pitch-deck.pptx β 8 slides matching docs/pitch-deck.md.
|
| 4 |
+
Usage: .venv/bin/python -m signbridge.scripts.build_pitch_deck
|
| 5 |
+
|
| 6 |
+
User can then upload to Google Slides (File β Open β Upload) or
|
| 7 |
+
directly to the lablab.ai submission form's "Slide Presentation" field.
|
| 8 |
+
|
| 9 |
+
This is a one-shot generator β it doesn't try to be a templating engine.
|
| 10 |
+
Each slide is hand-written below so we can position elements precisely.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
|
| 17 |
+
from pptx import Presentation
|
| 18 |
+
from pptx.dml.color import RGBColor
|
| 19 |
+
from pptx.enum.shapes import MSO_SHAPE
|
| 20 |
+
from pptx.util import Emu, Inches, Pt
|
| 21 |
+
|
| 22 |
+
# 16:9 layout
|
| 23 |
+
SLIDE_W = Inches(13.333)
|
| 24 |
+
SLIDE_H = Inches(7.5)
|
| 25 |
+
|
| 26 |
+
# Brand palette (matches the indigoβpink HF Space theme)
|
| 27 |
+
INDIGO = RGBColor(0x4F, 0x46, 0xE5)
|
| 28 |
+
INDIGO_DARK = RGBColor(0x1E, 0x1B, 0x4B)
|
| 29 |
+
PINK = RGBColor(0xEC, 0x48, 0x99)
|
| 30 |
+
SLATE = RGBColor(0x47, 0x55, 0x69)
|
| 31 |
+
SLATE_LIGHT = RGBColor(0xCB, 0xD5, 0xE1)
|
| 32 |
+
WHITE = RGBColor(0xFF, 0xFF, 0xFF)
|
| 33 |
+
NEAR_WHITE = RGBColor(0xF8, 0xFA, 0xFC)
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
def _add_text(slide, x, y, w, h, text, *, size=18, bold=False, color=INDIGO_DARK, align=None):
|
| 37 |
+
"""Add a text box; returns the text frame for further tweaking."""
|
| 38 |
+
box = slide.shapes.add_textbox(x, y, w, h)
|
| 39 |
+
tf = box.text_frame
|
| 40 |
+
tf.word_wrap = True
|
| 41 |
+
p = tf.paragraphs[0]
|
| 42 |
+
if align is not None:
|
| 43 |
+
p.alignment = align
|
| 44 |
+
run = p.add_run()
|
| 45 |
+
run.text = text
|
| 46 |
+
run.font.size = Pt(size)
|
| 47 |
+
run.font.bold = bold
|
| 48 |
+
run.font.color.rgb = color
|
| 49 |
+
return tf
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def _add_bullets(slide, x, y, w, h, lines, *, size=16, color=INDIGO_DARK):
|
| 53 |
+
box = slide.shapes.add_textbox(x, y, w, h)
|
| 54 |
+
tf = box.text_frame
|
| 55 |
+
tf.word_wrap = True
|
| 56 |
+
for i, line in enumerate(lines):
|
| 57 |
+
p = tf.paragraphs[0] if i == 0 else tf.add_paragraph()
|
| 58 |
+
p.level = 0
|
| 59 |
+
run = p.add_run()
|
| 60 |
+
run.text = f"β’ {line}"
|
| 61 |
+
run.font.size = Pt(size)
|
| 62 |
+
run.font.color.rgb = color
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
def _add_band(slide, *, color=INDIGO, height_inches=0.6):
|
| 66 |
+
"""Decorative bottom band."""
|
| 67 |
+
band = slide.shapes.add_shape(
|
| 68 |
+
MSO_SHAPE.RECTANGLE,
|
| 69 |
+
0, SLIDE_H - Inches(height_inches),
|
| 70 |
+
SLIDE_W, Inches(height_inches),
|
| 71 |
+
)
|
| 72 |
+
band.fill.solid()
|
| 73 |
+
band.fill.fore_color.rgb = color
|
| 74 |
+
band.line.fill.background()
|
| 75 |
+
band.shadow.inherit = False
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def slide_title(prs):
|
| 79 |
+
s = prs.slides.add_slide(prs.slide_layouts[6]) # blank
|
| 80 |
+
# Big title
|
| 81 |
+
_add_text(s, Inches(0.7), Inches(2.0), Inches(12), Inches(2),
|
| 82 |
+
"π€ SignBridge", size=88, bold=True, color=INDIGO)
|
| 83 |
+
_add_text(s, Inches(0.7), Inches(3.5), Inches(12), Inches(1.5),
|
| 84 |
+
"Real-time ASL β English speech, on a single AMD Instinct MI300X.",
|
| 85 |
+
size=28, color=SLATE)
|
| 86 |
+
_add_text(s, Inches(0.7), Inches(6.4), Inches(12), Inches(0.6),
|
| 87 |
+
"Track 3 Β· Vision & Multimodal AI Β· AMD Developer Hackathon 2026 Β· Lucas Loo Tan Yu Heng",
|
| 88 |
+
size=14, color=SLATE)
|
| 89 |
+
_add_band(s, color=INDIGO)
|
| 90 |
+
return s
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
def slide_problem(prs):
|
| 94 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 95 |
+
_add_text(s, Inches(0.7), Inches(0.5), Inches(12), Inches(1.2),
|
| 96 |
+
"70 million deaf people. Interpreters cost $50β200/hr. They're scarce.",
|
| 97 |
+
size=32, bold=True, color=INDIGO_DARK)
|
| 98 |
+
_add_bullets(s, Inches(0.7), Inches(2.2), Inches(12), Inches(4.5), [
|
| 99 |
+
"Courts, hospitals, schools, and public services must by law provide interpretation (ADA Title II/III in the US; European Accessibility Act 2025 in the EU).",
|
| 100 |
+
"Sorenson VRS β the dominant sign-language relay-services provider β books $4B+ in annual revenue filling this gap. The demand is enormous and budgeted-for.",
|
| 101 |
+
"Existing AI alternatives (Be My Eyes, Microsoft Seeing AI) are turn-based, photo-only, English-default, and closed-source.",
|
| 102 |
+
"Real ASL is motion. Single-frame approaches fundamentally cannot translate \"HELLO\" or \"THANK YOU\".",
|
| 103 |
+
], size=18)
|
| 104 |
+
_add_band(s, color=PINK)
|
| 105 |
+
return s
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
def slide_solution(prs):
|
| 109 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 110 |
+
_add_text(s, Inches(0.7), Inches(0.5), Inches(12), Inches(1.2),
|
| 111 |
+
"Hold to record. Sign. Speak.", size=40, bold=True, color=INDIGO)
|
| 112 |
+
_add_bullets(s, Inches(0.7), Inches(2.0), Inches(12), Inches(4), [
|
| 113 |
+
"1. Hold-to-record button captures 1.5 seconds of your sign.",
|
| 114 |
+
"2. Multi-stage pipeline (vision β reasoning β speech) translates it.",
|
| 115 |
+
"3. The other person hears natural English.",
|
| 116 |
+
], size=22)
|
| 117 |
+
_add_text(s, Inches(0.7), Inches(5.5), Inches(12), Inches(1.5),
|
| 118 |
+
"Two people who couldn't communicate, now can.",
|
| 119 |
+
size=28, bold=True, color=PINK)
|
| 120 |
+
_add_band(s, color=INDIGO)
|
| 121 |
+
return s
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def slide_architecture(prs):
|
| 125 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 126 |
+
_add_text(s, Inches(0.7), Inches(0.4), Inches(12), Inches(1),
|
| 127 |
+
"We fine-tuned Qwen3-VL-8B on a single MI300X β 54 minutes, 92% accuracy.",
|
| 128 |
+
size=26, bold=True, color=INDIGO)
|
| 129 |
+
diagram = (
|
| 130 |
+
"Webcam frame\n"
|
| 131 |
+
" βββΊ MediaPipe Hand β trained MLP (90% acc, ~50 ms CPU)\n"
|
| 132 |
+
" β ββ falls through to β\n"
|
| 133 |
+
" βββΊ Recorded clip β ffmpeg β vLLM video_url\n"
|
| 134 |
+
" β fine-tuned Qwen3-VL-8B (native video, AMD MI300X)\n"
|
| 135 |
+
" β\n"
|
| 136 |
+
" Qwen3-8B composer (sign tokens β English, vLLM port 8001)\n"
|
| 137 |
+
" β\n"
|
| 138 |
+
" gTTS (free, fast speech synthesis)\n"
|
| 139 |
+
" β\n"
|
| 140 |
+
" Audio out"
|
| 141 |
+
)
|
| 142 |
+
box = s.shapes.add_textbox(Inches(0.7), Inches(1.6), Inches(8), Inches(4.5))
|
| 143 |
+
tf = box.text_frame
|
| 144 |
+
tf.word_wrap = True
|
| 145 |
+
p = tf.paragraphs[0]
|
| 146 |
+
run = p.add_run()
|
| 147 |
+
run.text = diagram
|
| 148 |
+
run.font.size = Pt(14)
|
| 149 |
+
run.font.name = "Menlo"
|
| 150 |
+
run.font.color.rgb = INDIGO_DARK
|
| 151 |
+
|
| 152 |
+
_add_bullets(s, Inches(8.9), Inches(1.6), Inches(4.1), Inches(5.0), [
|
| 153 |
+
"MI300X 1Γ holds the entire pipeline.",
|
| 154 |
+
"Same workload on H100 (80 GB) β 3-GPU cluster.",
|
| 155 |
+
"192 GB HBM3, 5.3 TB/s mem bandwidth.",
|
| 156 |
+
"Both LLMs concurrent on one GPU, no sharding.",
|
| 157 |
+
], size=14, color=SLATE)
|
| 158 |
+
_add_band(s, color=PINK)
|
| 159 |
+
return s
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
def slide_demo(prs):
|
| 163 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 164 |
+
_add_text(s, Inches(0.7), Inches(2.5), Inches(12), Inches(2),
|
| 165 |
+
"Live demo.", size=72, bold=True, color=INDIGO,
|
| 166 |
+
align=None)
|
| 167 |
+
_add_text(s, Inches(0.7), Inches(4.0), Inches(12), Inches(1.5),
|
| 168 |
+
"huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge",
|
| 169 |
+
size=20, color=SLATE)
|
| 170 |
+
_add_text(s, Inches(0.7), Inches(5.5), Inches(12), Inches(1),
|
| 171 |
+
"(Switch to the live HF Space β fingerspell L-U-C-A-S β Speak β \"Lucas\")",
|
| 172 |
+
size=14, color=SLATE)
|
| 173 |
+
_add_band(s, color=INDIGO)
|
| 174 |
+
return s
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def slide_qwen_focus(prs):
|
| 178 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 179 |
+
_add_text(s, Inches(0.7), Inches(0.4), Inches(12), Inches(1.2),
|
| 180 |
+
"LoRA-fine-tuned Qwen3-VL-8B β the brain.",
|
| 181 |
+
size=32, bold=True, color=INDIGO)
|
| 182 |
+
_add_bullets(s, Inches(0.7), Inches(1.8), Inches(12), Inches(5), [
|
| 183 |
+
"Recognizer: our LoRA-fine-tuned Qwen3-VL-8B (huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl), trained in 54 min on a single AMD Instinct MI300X. Lifts ASL accuracy from 19% zero-shot β 92%.",
|
| 184 |
+
"Motion signs: we send the whole recorded clip natively to Qwen3-VL via vLLM's video_url block. Qwen3-VL's own temporal encoder handles motion. No manual frame sampling.",
|
| 185 |
+
"Closed-vocabulary forcing + domain priming keep Qwen on-rails for the 87-token sign vocab.",
|
| 186 |
+
"Qwen3-8B composes Qwen-VL's tokens into English (also on the MI300X via vLLM, separate port). gTTS synthesises the audio.",
|
| 187 |
+
], size=16, color=INDIGO_DARK)
|
| 188 |
+
_add_text(s, Inches(0.7), Inches(6.3), Inches(12), Inches(0.7),
|
| 189 |
+
"Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.",
|
| 190 |
+
size=14, bold=True, color=PINK)
|
| 191 |
+
_add_band(s, color=INDIGO)
|
| 192 |
+
return s
|
| 193 |
+
|
| 194 |
+
|
| 195 |
+
def slide_judging(prs):
|
| 196 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 197 |
+
_add_text(s, Inches(0.7), Inches(0.4), Inches(12), Inches(1),
|
| 198 |
+
"Four judging criteria. Four deliberate choices.",
|
| 199 |
+
size=28, bold=True, color=INDIGO)
|
| 200 |
+
rows = [
|
| 201 |
+
("Application of Technology",
|
| 202 |
+
"Multi-modal pipeline (vision + reasoning + voice) running concurrently on a single MI300X β exactly what Track 3's massive memory bandwidth was for."),
|
| 203 |
+
("Presentation",
|
| 204 |
+
"Demo is experienced: judge holds phone, signs HELLO, hears \"Hello.\" 30 seconds, no explanation needed."),
|
| 205 |
+
("Business Value",
|
| 206 |
+
"$4B+ existing market (Sorenson VRS comparable), legally-mandated interpretation budgets, open-source so any Deaf-led NGO/ministry/school can self-host on their own AMD compute."),
|
| 207 |
+
("Originality",
|
| 208 |
+
"First open-source pipeline to send recorded ASL natively to a fine-tuned Qwen3-VL via vLLM video_url β combining the AMD fine-tune story with native-video understanding for sign language."),
|
| 209 |
+
]
|
| 210 |
+
y = Inches(1.6)
|
| 211 |
+
for header, body in rows:
|
| 212 |
+
_add_text(s, Inches(0.7), y, Inches(3.5), Inches(1.0),
|
| 213 |
+
header, size=16, bold=True, color=PINK)
|
| 214 |
+
_add_text(s, Inches(4.4), y, Inches(8.5), Inches(1.3),
|
| 215 |
+
body, size=14, color=INDIGO_DARK)
|
| 216 |
+
y += Inches(1.35)
|
| 217 |
+
_add_band(s, color=PINK)
|
| 218 |
+
return s
|
| 219 |
+
|
| 220 |
+
|
| 221 |
+
def slide_substrate_close(prs):
|
| 222 |
+
s = prs.slides.add_slide(prs.slide_layouts[6])
|
| 223 |
+
_add_text(s, Inches(0.7), Inches(0.5), Inches(12), Inches(1.2),
|
| 224 |
+
"SignBridge is a substrate. Deaf-led teams are the deployers.",
|
| 225 |
+
size=28, bold=True, color=INDIGO)
|
| 226 |
+
_add_bullets(s, Inches(0.7), Inches(2.0), Inches(12), Inches(3.0), [
|
| 227 |
+
"MIT-licensed, open-source: github.com/seekerPrice/signbridge",
|
| 228 |
+
"ASL only V1 is a scope decision β BSL, MSL, CSL, ISL, +200 sign languages each deserve their own teams, training data, and Deaf community leadership.",
|
| 229 |
+
"Privacy by default β frames and audio are processed in-memory, not persisted.",
|
| 230 |
+
], size=18, color=INDIGO_DARK)
|
| 231 |
+
_add_text(s, Inches(0.7), Inches(5.4), Inches(12), Inches(1.5),
|
| 232 |
+
"The hardest part of accessibility isn't building. It's deploying.",
|
| 233 |
+
size=22, bold=True, color=SLATE)
|
| 234 |
+
_add_text(s, Inches(0.7), Inches(6.3), Inches(12), Inches(0.8),
|
| 235 |
+
"AMD makes the deploying possible.",
|
| 236 |
+
size=22, bold=True, color=PINK)
|
| 237 |
+
_add_band(s, color=INDIGO)
|
| 238 |
+
return s
|
| 239 |
+
|
| 240 |
+
|
| 241 |
+
def main() -> None:
|
| 242 |
+
prs = Presentation()
|
| 243 |
+
prs.slide_width = SLIDE_W
|
| 244 |
+
prs.slide_height = SLIDE_H
|
| 245 |
+
|
| 246 |
+
slide_title(prs)
|
| 247 |
+
slide_problem(prs)
|
| 248 |
+
slide_solution(prs)
|
| 249 |
+
slide_architecture(prs)
|
| 250 |
+
slide_demo(prs)
|
| 251 |
+
slide_qwen_focus(prs)
|
| 252 |
+
slide_judging(prs)
|
| 253 |
+
slide_substrate_close(prs)
|
| 254 |
+
|
| 255 |
+
out = Path(__file__).parents[2] / "assets" / "pitch-deck.pptx"
|
| 256 |
+
out.parent.mkdir(parents=True, exist_ok=True)
|
| 257 |
+
prs.save(str(out))
|
| 258 |
+
print(f"Wrote {out} ({out.stat().st_size / 1024:.1f} KB, {len(prs.slides)} slides)")
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
if __name__ == "__main__":
|
| 262 |
+
main()
|
signbridge/space.py
CHANGED
|
@@ -495,8 +495,9 @@ def build_demo() -> gr.Blocks:
|
|
| 495 |
gr.Markdown(
|
| 496 |
"Record 1.5β2 s of yourself signing a full ASL word "
|
| 497 |
"(`hello`, `thank_you`, `please`, `eat`, `drink`, β¦). "
|
| 498 |
-
"The recognizer
|
| 499 |
-
"
|
|
|
|
| 500 |
)
|
| 501 |
gr.HTML(
|
| 502 |
'<div class="signbridge-webcam-help">'
|
|
@@ -551,7 +552,7 @@ def build_demo() -> gr.Blocks:
|
|
| 551 |
f"({'VLM via OpenAI-compatible endpoint' if RECOGNIZER_MODE == 'vlm' else 'trained landmark classifier'})\n"
|
| 552 |
f"- **Provider:** `{os.getenv('SIGNBRIDGE_PROVIDER', 'amd')}` "
|
| 553 |
f"(set `SIGNBRIDGE_PROVIDER=openai|hf|amd` in `.env`)\n"
|
| 554 |
-
f"- **Composer model:** `{os.getenv('SIGNBRIDGE_COMPOSER_MODEL', '
|
| 555 |
f"- **TTS model:** `{os.getenv('SIGNBRIDGE_TTS_MODEL', 'tts_models/multilingual/multi-dataset/xtts_v2')}`\n"
|
| 556 |
)
|
| 557 |
|
|
|
|
| 495 |
gr.Markdown(
|
| 496 |
"Record 1.5β2 s of yourself signing a full ASL word "
|
| 497 |
"(`hello`, `thank_you`, `please`, `eat`, `drink`, β¦). "
|
| 498 |
+
"The recognizer sends the video directly to our "
|
| 499 |
+
"LoRA-fine-tuned **Qwen3-VL-8B** on **AMD Instinct "
|
| 500 |
+
"MI300X** for native motion understanding."
|
| 501 |
)
|
| 502 |
gr.HTML(
|
| 503 |
'<div class="signbridge-webcam-help">'
|
|
|
|
| 552 |
f"({'VLM via OpenAI-compatible endpoint' if RECOGNIZER_MODE == 'vlm' else 'trained landmark classifier'})\n"
|
| 553 |
f"- **Provider:** `{os.getenv('SIGNBRIDGE_PROVIDER', 'amd')}` "
|
| 554 |
f"(set `SIGNBRIDGE_PROVIDER=openai|hf|amd` in `.env`)\n"
|
| 555 |
+
f"- **Composer model:** `{os.getenv('SIGNBRIDGE_COMPOSER_MODEL', 'Qwen/Qwen3-8B')}`\n"
|
| 556 |
f"- **TTS model:** `{os.getenv('SIGNBRIDGE_TTS_MODEL', 'tts_models/multilingual/multi-dataset/xtts_v2')}`\n"
|
| 557 |
)
|
| 558 |
|