signbridge / docs /pitch-deck.md
LucasLooTan's picture
docs+pptx: refresh all submission deliverables to match shipping pipeline
fb11c61
# SignBridge β€” Pitch Deck (8 slides)
> Open a Google Slides deck (or Pitch). Paste each slide's content into the matching blank slide. Visuals are described in italics β€” replace with actual screenshots / diagrams / table renders.
> Aspect ratio: 16:9. Theme: indigo→pink gradient (matches HF Space card).
---
## Slide 1 β€” Title
**Title (huge):**
SignBridge
**Subtitle:**
Real-time ASL β†’ English speech, on a single AMD Instinct MI300X.
**Footer (small):**
Track 3 Β· Vision & Multimodal AI Β· AMD Developer Hackathon 2026 Β· Lucas Loo Tan Yu Heng
*Visual: the cover.png we already shipped (1280Γ—640 indigoβ†’pink gradient with 🀟 + project name).*
---
## Slide 2 β€” The problem
**Headline:**
70 million deaf people. Sign-language interpreters cost $50–200 per hour. They're scarce.
**Body bullets:**
- Courts, hospitals, schools, public services **must by law** provide interpretation (ADA Title II/III in the US; European Accessibility Act 2025 in the EU).
- **Sorenson VRS**, the dominant sign-language relay-services provider, books **$4B+ in annual revenue** filling this gap β€” proof the demand is enormous and budgeted-for.
- Existing AI alternatives (Be My Eyes, Microsoft Seeing AI) are turn-based, photo-only, English-default, and closed-source. Real ASL is *motion* β€” they fundamentally can't translate "HELLO" or "THANK YOU".
*Visual: a row of three context icons β€” courthouse / hospital / classroom β€” labeled with the mandates.*
---
## Slide 3 β€” The solution
**Headline:**
Hold to record. Sign. Speak.
**Body (3-step arc):**
1. **Hold-to-record button** captures 1.5 seconds of your sign.
2. A multi-stage pipeline (vision β†’ reasoning β†’ speech) translates it.
3. The other person hears natural English.
**Tag line under the arc:**
Two people who couldn't communicate, now can.
*Visual: 3 screenshots of the live Gradio Space β€” (a) user signing into webcam; (b) "detected: HELLO (85%)"; (c) audio waveform playing "Hello.".*
*If single screenshot: just the Gradio "Record sign" tab mid-demo.*
---
## Slide 4 β€” Architecture (the AMD pitch)
**Headline:**
We fine-tuned Qwen3-VL-8B on a single MI300X β€” 54 minutes, 92% accuracy.
**Diagram (build in Slides; described as bullets):**
```
[ Webcam frame ]
β”‚
β”œβ”€β–Ί MediaPipe Hand β†’ trained MLP classifier
β”‚ (90% on ASL fingerspelling, 50ms CPU)
β”‚ └─ falls through to ↓ when no hand detected
β”‚
└─► Fine-tuned Qwen3-VL-8B (LoRA on MI300X)
── webcam clip β†’ ffmpeg β†’ vLLM video_url block
── Qwen3-VL native temporal encoder (no manual frame sampling)
β”‚
β–Ό
[ Qwen3-8B composer ── sign tokens β†’ English ]
β”‚
β–Ό
[ gTTS ── free, fast speech synthesis ]
β”‚
β–Ό
[ Audio out ]
```
**Comparison table (small print under diagram):**
| Component | Weights (FP16) | MI300X 1Γ— (192 GB) | H100 80 GB |
|---|---|---|---|
| Fine-tuned Qwen3-VL-8B | ~16 GB | βœ… fits | βœ… |
| Qwen3-8B composer | ~16 GB | βœ… fits | βœ… |
| Whisper (V2 stretch) | ~3 GB | βœ… fits | ⚠ tight |
| (V2) **Llama-3.1-70B FP8 reasoner** | ~70 GB | **βœ… still fits** | **❌ doesn't fit at all** |
(gTTS runs as a small Python call from the Space; no GPU memory.)
**The MI300X did three jobs in this project:** (1) ran the LoRA fine-tune in 54 min, (2) hosts the merged 8B model for inference, (3) hosts the 8B composer in parallel β€” all on one GPU. That's the AMD pitch.
*Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.*
---
## Slide 5 β€” Live demo
**Headline:**
*(blank β€” this slide is the live demo)*
**Speaker note:**
Switch to the live HF Space at huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge. 30 seconds:
1. **Snapshot tab** β€” fingerspell L-U-C-A-S β†’ click Speak β†’ AI says "Lucas."
2. **Record sign tab** β€” record HELLO β†’ click Submit β†’ "hello" detected β†’ click Speak β†’ AI says "Hello."
If demo fails / network down β†’ fall back to the pre-recorded 2-min video on slide 6.
*Visual: leave the slide blank or use a single QR code linking to the Space URL for the audience to scan and try themselves.*
---
## Slide 6 β€” Demo video (fallback)
**Headline:**
*(blank β€” this slide embeds the demo video)*
**Embed:**
The 2–3 minute demo video, looping, autoplay-on-slide-show.
*Visual: video player.*
---
## Slide 6.5 β€” Qwen3-VL is the brain
**Headline:**
LoRA-fine-tuned Qwen3-VL-8B β€” the visual intelligence behind every sign.
**Body bullets:**
- The recognizer is **our LoRA-fine-tuned Qwen3-VL-8B** (`huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`), trained in 54 minutes on a single AMD Instinct MI300X. Lifts ASL accuracy from **19% zero-shot β†’ 92%**.
- For motion signs (HELLO, THANK_YOU, PLEASE, EAT) we send the **whole recorded clip natively** to Qwen3-VL via vLLM's `video_url` content block β€” Qwen3-VL's own temporal encoder handles the motion. No manual frame sampling.
- **Closed-vocabulary forcing** + domain priming keep Qwen on-rails for the 87-token sign vocab.
- **Qwen3-8B** then composes Qwen-VL's tokens into grammatical English (also on the MI300X via vLLM, separate port); **gTTS** synthesises the spoken sentence.
**Closer:**
Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.
*Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the video_url Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.*
---
## Slide 7 β€” Why this is the right submission for Track 3
**Headline:**
Four judging criteria, four deliberate choices.
**Two-column layout:**
| Judging criterion | Our choice |
|---|---|
| **Application of Technology** | Multi-modal pipeline (vision + reasoning + voice) running concurrently on a single MI300X β€” exactly what Track 3's "massive memory bandwidth of AMD GPUs" was for. |
| **Presentation** | Demo is *experienced*: judge holds phone, signs HELLO, hears "Hello." 30 seconds, no explanation needed. |
| **Business Value** | $4B+ existing market (Sorenson VRS comparable), legally-mandated interpretation budgets, open-source so any Deaf-led NGO / ministry / school can self-host on their own AMD compute. |
| **Originality** | Streaming continuous multi-frame VLM agent for sign language β€” no peer-reviewed benchmark exists for this approach yet (we checked the literature). Real ASL motion-words, not just fingerspelling. |
*Visual: 2Γ—2 grid of icons, one per criterion.*
---
## Slide 8 β€” Substrate, not product Β· Open Β· Deaf-led future
**Headline:**
SignBridge is a substrate. Deaf-led teams are the deployers.
**Body:**
- **MIT-licensed**, code at github.com/seekerPrice/signbridge β€” anyone can self-host.
- **ASL only V1 is a scope decision.** BSL, MSL, CSL, ISL, +200 sign languages each deserve their own teams, training data, and Deaf community leadership. (Citing Bragg et al., *"Systemic Biases in Sign Language AI Research"*, arXiv 2403.02563.)
- **Privacy by default** β€” frames and audio are processed in-memory and not persisted server-side beyond the request lifetime.
**Closing line (large):**
The hardest part of accessibility isn't building. It's deploying. AMD makes the deploying possible.
*Visual: world map outline with sign-language regional dots; or just the SignBridge logo with the closing tagline.*
---
## Speaker-note tips (read these before recording)
1. **Lead with the human problem (Slide 2), not the architecture.** Architecture is for criterion 1; emotion is what closes criteria 2–4.
2. **Time the live demo** β€” 30 seconds max. If it fails, switch to fallback video without comment.
3. **Always say "AMD MI300X" by name** at least 3 times in the talk track. Sponsors notice.
4. **End on the substrate framing** β€” pre-empts the "savior tech" critique that Deaf-AI judges look out for.
---
## Export
Once filled in: File β†’ Download β†’ PDF document β†’ upload to lablab.ai submission form's "Slide Presentation" field.