Spaces:

lablab-ai-amd-developer-hackathon
/

signbridge

Sleeping

App Files Files Community

signbridge / docs /pitch-deck.md

LucasLooTan

docs+pptx: refresh all submission deliverables to match shipping pipeline

fb11c61 27 days ago

preview code

raw

history blame contribute delete

8.36 kB

	# SignBridge — Pitch Deck (8 slides)

	> Open a Google Slides deck (or Pitch). Paste each slide's content into the matching blank slide. Visuals are described in italics — replace with actual screenshots / diagrams / table renders.
	> Aspect ratio: 16:9. Theme: indigo→pink gradient (matches HF Space card).

	---

	## Slide 1 — Title

	Title (huge):
	SignBridge

	Subtitle:
	Real-time ASL → English speech, on a single AMD Instinct MI300X.

	Footer (small):
	Track 3 · Vision & Multimodal AI · AMD Developer Hackathon 2026 · Lucas Loo Tan Yu Heng

	Visual: the cover.png we already shipped (1280×640 indigo→pink gradient with 🤟 + project name).

	---

	## Slide 2 — The problem

	Headline:
	70 million deaf people. Sign-language interpreters cost $50–200 per hour. They're scarce.

	Body bullets:
	- Courts, hospitals, schools, public services must by law provide interpretation (ADA Title II/III in the US; European Accessibility Act 2025 in the EU).
	- Sorenson VRS, the dominant sign-language relay-services provider, books $4B+ in annual revenue filling this gap — proof the demand is enormous and budgeted-for.
	- Existing AI alternatives (Be My Eyes, Microsoft Seeing AI) are turn-based, photo-only, English-default, and closed-source. Real ASL is motion — they fundamentally can't translate "HELLO" or "THANK YOU".

	Visual: a row of three context icons — courthouse / hospital / classroom — labeled with the mandates.

	---

	## Slide 3 — The solution

	Headline:
	Hold to record. Sign. Speak.

	Body (3-step arc):
	1. Hold-to-record button captures 1.5 seconds of your sign.
	2. A multi-stage pipeline (vision → reasoning → speech) translates it.
	3. The other person hears natural English.

	Tag line under the arc:
	Two people who couldn't communicate, now can.

	Visual: 3 screenshots of the live Gradio Space — (a) user signing into webcam; (b) "detected: HELLO (85%)"; (c) audio waveform playing "Hello.".
	If single screenshot: just the Gradio "Record sign" tab mid-demo.

	---

	## Slide 4 — Architecture (the AMD pitch)

	Headline:
	We fine-tuned Qwen3-VL-8B on a single MI300X — 54 minutes, 92% accuracy.

	Diagram (build in Slides; described as bullets):
	```
	[ Webcam frame ]
	│
	├─► MediaPipe Hand → trained MLP classifier
	│ (90% on ASL fingerspelling, 50ms CPU)
	│ └─ falls through to ↓ when no hand detected
	│
	└─► Fine-tuned Qwen3-VL-8B (LoRA on MI300X)
	── webcam clip → ffmpeg → vLLM video_url block
	── Qwen3-VL native temporal encoder (no manual frame sampling)
	│
	▼
	[ Qwen3-8B composer ── sign tokens → English ]
	│
	▼
	[ gTTS ── free, fast speech synthesis ]
	│
	▼
	[ Audio out ]
	```

	Comparison table (small print under diagram):

	\| Component \| Weights (FP16) \| MI300X 1× (192 GB) \| H100 80 GB \|
	\|---\|---\|---\|---\|
	\| Fine-tuned Qwen3-VL-8B \| ~16 GB \| ✅ fits \| ✅ \|
	\| Qwen3-8B composer \| ~16 GB \| ✅ fits \| ✅ \|
	\| Whisper (V2 stretch) \| ~3 GB \| ✅ fits \| ⚠ tight \|
	\| (V2) Llama-3.1-70B FP8 reasoner \| ~70 GB \| ✅ still fits \| ❌ doesn't fit at all \|

	(gTTS runs as a small Python call from the Space; no GPU memory.)

	The MI300X did three jobs in this project: (1) ran the LoRA fine-tune in 54 min, (2) hosts the merged 8B model for inference, (3) hosts the 8B composer in parallel — all on one GPU. That's the AMD pitch.

	Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.

	---

	## Slide 5 — Live demo

	Headline:
	(blank — this slide is the live demo)

	Speaker note:
	Switch to the live HF Space at huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge. 30 seconds:
	1. Snapshot tab — fingerspell L-U-C-A-S → click Speak → AI says "Lucas."
	2. Record sign tab — record HELLO → click Submit → "hello" detected → click Speak → AI says "Hello."

	If demo fails / network down → fall back to the pre-recorded 2-min video on slide 6.

	Visual: leave the slide blank or use a single QR code linking to the Space URL for the audience to scan and try themselves.

	---

	## Slide 6 — Demo video (fallback)

	Headline:
	(blank — this slide embeds the demo video)

	Embed:
	The 2–3 minute demo video, looping, autoplay-on-slide-show.

	Visual: video player.

	---

	## Slide 6.5 — Qwen3-VL is the brain

	Headline:
	LoRA-fine-tuned Qwen3-VL-8B — the visual intelligence behind every sign.

	Body bullets:
	- The recognizer is our LoRA-fine-tuned Qwen3-VL-8B (`huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`), trained in 54 minutes on a single AMD Instinct MI300X. Lifts ASL accuracy from 19% zero-shot → 92%.
	- For motion signs (HELLO, THANK_YOU, PLEASE, EAT) we send the whole recorded clip natively to Qwen3-VL via vLLM's `video_url` content block — Qwen3-VL's own temporal encoder handles the motion. No manual frame sampling.
	- Closed-vocabulary forcing + domain priming keep Qwen on-rails for the 87-token sign vocab.
	- Qwen3-8B then composes Qwen-VL's tokens into grammatical English (also on the MI300X via vLLM, separate port); gTTS synthesises the spoken sentence.

	Closer:
	Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.

	Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the video_url Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.

	---

	## Slide 7 — Why this is the right submission for Track 3

	Headline:
	Four judging criteria, four deliberate choices.

	Two-column layout:

	\| Judging criterion \| Our choice \|
	\|---\|---\|
	\| Application of Technology \| Multi-modal pipeline (vision + reasoning + voice) running concurrently on a single MI300X — exactly what Track 3's "massive memory bandwidth of AMD GPUs" was for. \|
	\| Presentation \| Demo is experienced: judge holds phone, signs HELLO, hears "Hello." 30 seconds, no explanation needed. \|
	\| Business Value \| $4B+ existing market (Sorenson VRS comparable), legally-mandated interpretation budgets, open-source so any Deaf-led NGO / ministry / school can self-host on their own AMD compute. \|
	\| Originality \| Streaming continuous multi-frame VLM agent for sign language — no peer-reviewed benchmark exists for this approach yet (we checked the literature). Real ASL motion-words, not just fingerspelling. \|

	Visual: 2×2 grid of icons, one per criterion.

	---

	## Slide 8 — Substrate, not product · Open · Deaf-led future

	Headline:
	SignBridge is a substrate. Deaf-led teams are the deployers.

	Body:
	- MIT-licensed, code at github.com/seekerPrice/signbridge — anyone can self-host.
	- ASL only V1 is a scope decision. BSL, MSL, CSL, ISL, +200 sign languages each deserve their own teams, training data, and Deaf community leadership. (Citing Bragg et al., "Systemic Biases in Sign Language AI Research", arXiv 2403.02563.)
	- Privacy by default — frames and audio are processed in-memory and not persisted server-side beyond the request lifetime.

	Closing line (large):
	The hardest part of accessibility isn't building. It's deploying. AMD makes the deploying possible.

	Visual: world map outline with sign-language regional dots; or just the SignBridge logo with the closing tagline.

	---

	## Speaker-note tips (read these before recording)

	1. Lead with the human problem (Slide 2), not the architecture. Architecture is for criterion 1; emotion is what closes criteria 2–4.
	2. Time the live demo — 30 seconds max. If it fails, switch to fallback video without comment.
	3. Always say "AMD MI300X" by name at least 3 times in the talk track. Sponsors notice.
	4. End on the substrate framing — pre-empts the "savior tech" critique that Deaf-AI judges look out for.

	---

	## Export

	Once filled in: File → Download → PDF document → upload to lablab.ai submission form's "Slide Presentation" field.

	# SignBridge — Pitch Deck (8 slides)

	> Open a Google Slides deck (or Pitch). Paste each slide's content into the matching blank slide. Visuals are described in italics — replace with actual screenshots / diagrams / table renders.
	> Aspect ratio: 16:9. Theme: indigo→pink gradient (matches HF Space card).

	---

	## Slide 1 — Title

	Title (huge):
	SignBridge

	Subtitle:
	Real-time ASL → English speech, on a single AMD Instinct MI300X.

	Footer (small):
	Track 3 · Vision & Multimodal AI · AMD Developer Hackathon 2026 · Lucas Loo Tan Yu Heng

	Visual: the cover.png we already shipped (1280×640 indigo→pink gradient with 🤟 + project name).

	---

	## Slide 2 — The problem

	Headline:
	70 million deaf people. Sign-language interpreters cost $50–200 per hour. They're scarce.

	Body bullets:
	- Courts, hospitals, schools, public services must by law provide interpretation (ADA Title II/III in the US; European Accessibility Act 2025 in the EU).
	- Sorenson VRS, the dominant sign-language relay-services provider, books $4B+ in annual revenue filling this gap — proof the demand is enormous and budgeted-for.
	- Existing AI alternatives (Be My Eyes, Microsoft Seeing AI) are turn-based, photo-only, English-default, and closed-source. Real ASL is motion — they fundamentally can't translate "HELLO" or "THANK YOU".

	Visual: a row of three context icons — courthouse / hospital / classroom — labeled with the mandates.

	---

	## Slide 3 — The solution

	Headline:
	Hold to record. Sign. Speak.

	Body (3-step arc):
	1. Hold-to-record button captures 1.5 seconds of your sign.
	2. A multi-stage pipeline (vision → reasoning → speech) translates it.
	3. The other person hears natural English.

	Tag line under the arc:
	Two people who couldn't communicate, now can.

	Visual: 3 screenshots of the live Gradio Space — (a) user signing into webcam; (b) "detected: HELLO (85%)"; (c) audio waveform playing "Hello.".
	If single screenshot: just the Gradio "Record sign" tab mid-demo.

	---

	## Slide 4 — Architecture (the AMD pitch)

	Headline:
	We fine-tuned Qwen3-VL-8B on a single MI300X — 54 minutes, 92% accuracy.

	Diagram (build in Slides; described as bullets):
	```
	[ Webcam frame ]
	│
	├─► MediaPipe Hand → trained MLP classifier
	│ (90% on ASL fingerspelling, 50ms CPU)
	│ └─ falls through to ↓ when no hand detected
	│
	└─► Fine-tuned Qwen3-VL-8B (LoRA on MI300X)
	── webcam clip → ffmpeg → vLLM video_url block
	── Qwen3-VL native temporal encoder (no manual frame sampling)
	│
	▼
	[ Qwen3-8B composer ── sign tokens → English ]
	│
	▼
	[ gTTS ── free, fast speech synthesis ]
	│
	▼
	[ Audio out ]
	```

	Comparison table (small print under diagram):

	\| Component \| Weights (FP16) \| MI300X 1× (192 GB) \| H100 80 GB \|
	\|---\|---\|---\|---\|
	\| Fine-tuned Qwen3-VL-8B \| ~16 GB \| ✅ fits \| ✅ \|
	\| Qwen3-8B composer \| ~16 GB \| ✅ fits \| ✅ \|
	\| Whisper (V2 stretch) \| ~3 GB \| ✅ fits \| ⚠ tight \|
	\| (V2) Llama-3.1-70B FP8 reasoner \| ~70 GB \| ✅ still fits \| ❌ doesn't fit at all \|

	(gTTS runs as a small Python call from the Space; no GPU memory.)

	The MI300X did three jobs in this project: (1) ran the LoRA fine-tune in 54 min, (2) hosts the merged 8B model for inference, (3) hosts the 8B composer in parallel — all on one GPU. That's the AMD pitch.

	Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.

	---

	## Slide 5 — Live demo

	Headline:
	(blank — this slide is the live demo)

	Speaker note:
	Switch to the live HF Space at huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge. 30 seconds:
	1. Snapshot tab — fingerspell L-U-C-A-S → click Speak → AI says "Lucas."
	2. Record sign tab — record HELLO → click Submit → "hello" detected → click Speak → AI says "Hello."

	If demo fails / network down → fall back to the pre-recorded 2-min video on slide 6.

	Visual: leave the slide blank or use a single QR code linking to the Space URL for the audience to scan and try themselves.

	---

	## Slide 6 — Demo video (fallback)

	Headline:
	(blank — this slide embeds the demo video)

	Embed:
	The 2–3 minute demo video, looping, autoplay-on-slide-show.

	Visual: video player.

	---

	## Slide 6.5 — Qwen3-VL is the brain

	Headline:
	LoRA-fine-tuned Qwen3-VL-8B — the visual intelligence behind every sign.

	Body bullets:
	- The recognizer is our LoRA-fine-tuned Qwen3-VL-8B (`huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`), trained in 54 minutes on a single AMD Instinct MI300X. Lifts ASL accuracy from 19% zero-shot → 92%.
	- For motion signs (HELLO, THANK_YOU, PLEASE, EAT) we send the whole recorded clip natively to Qwen3-VL via vLLM's `video_url` content block — Qwen3-VL's own temporal encoder handles the motion. No manual frame sampling.
	- Closed-vocabulary forcing + domain priming keep Qwen on-rails for the 87-token sign vocab.
	- Qwen3-8B then composes Qwen-VL's tokens into grammatical English (also on the MI300X via vLLM, separate port); gTTS synthesises the spoken sentence.

	Closer:
	Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.

	Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the video_url Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.

	---

	## Slide 7 — Why this is the right submission for Track 3

	Headline:
	Four judging criteria, four deliberate choices.

	Two-column layout:

	\| Judging criterion \| Our choice \|
	\|---\|---\|
	\| Application of Technology \| Multi-modal pipeline (vision + reasoning + voice) running concurrently on a single MI300X — exactly what Track 3's "massive memory bandwidth of AMD GPUs" was for. \|
	\| Presentation \| Demo is experienced: judge holds phone, signs HELLO, hears "Hello." 30 seconds, no explanation needed. \|
	\| Business Value \| $4B+ existing market (Sorenson VRS comparable), legally-mandated interpretation budgets, open-source so any Deaf-led NGO / ministry / school can self-host on their own AMD compute. \|
	\| Originality \| Streaming continuous multi-frame VLM agent for sign language — no peer-reviewed benchmark exists for this approach yet (we checked the literature). Real ASL motion-words, not just fingerspelling. \|

	Visual: 2×2 grid of icons, one per criterion.

	---

	## Slide 8 — Substrate, not product · Open · Deaf-led future

	Headline:
	SignBridge is a substrate. Deaf-led teams are the deployers.

	Body:
	- MIT-licensed, code at github.com/seekerPrice/signbridge — anyone can self-host.
	- ASL only V1 is a scope decision. BSL, MSL, CSL, ISL, +200 sign languages each deserve their own teams, training data, and Deaf community leadership. (Citing Bragg et al., "Systemic Biases in Sign Language AI Research", arXiv 2403.02563.)
	- Privacy by default — frames and audio are processed in-memory and not persisted server-side beyond the request lifetime.

	Closing line (large):
	The hardest part of accessibility isn't building. It's deploying. AMD makes the deploying possible.

	Visual: world map outline with sign-language regional dots; or just the SignBridge logo with the closing tagline.

	---

	## Speaker-note tips (read these before recording)

	1. Lead with the human problem (Slide 2), not the architecture. Architecture is for criterion 1; emotion is what closes criteria 2–4.
	2. Time the live demo — 30 seconds max. If it fails, switch to fallback video without comment.
	3. Always say "AMD MI300X" by name at least 3 times in the talk track. Sponsors notice.
	4. End on the substrate framing — pre-empts the "savior tech" critique that Deaf-AI judges look out for.

	---

	## Export

	Once filled in: File → Download → PDF document → upload to lablab.ai submission form's "Slide Presentation" field.