Spaces:

lablab-ai-amd-developer-hackathon
/

signbridge

Sleeping

App Files Files Community

signbridge / docs /walkthrough.md

LucasLooTan

docs+pptx: refresh all submission deliverables to match shipping pipeline

fb11c61 11 days ago

preview code

raw

history blame contribute delete

11.2 kB

	# SignBridge — technical walkthrough

	> Internal technical record of the build. Not a submission deliverable
	> (Build-in-Public extra challenge was dropped on 2026-05-07).
	> Kept around because it documents the AMD-specific engineering thinking
	> and is useful if anyone later asks "why these design choices?".

	## What we built

	A real-time webcam-based ASL → English speech translator. A deaf user signs
	into the webcam; the pipeline (MediaPipe Hand → trained MLP for static
	fingerspelling, OR webcam-clip → ffmpeg → fine-tuned Qwen3-VL-8B native
	video → Qwen3-8B composer → gTTS) returns spoken English in under 2
	seconds. Designed to fit Track 3 (Vision & Multimodal AI) with both LLMs
	running concurrently on a single AMD Instinct MI300X.

	## Why AMD MI300X

	- 192 GB HBM3 — the trained MLP classifier (~478 KB), fine-tuned
	Qwen3-VL-8B (~16 GB FP16), Qwen3-8B composer (~16 GB FP16), and
	(V2 stretch) Whisper-large-v3 (~3 GB) all fit concurrently with margin
	for KV cache.
	- 5.3 TB/s memory bandwidth — bandwidth-bound streaming workload (many
	small inferences per second on the MLP + LLM next-token + Qwen3-VL
	vision encoder) is exactly what bandwidth wins.

	## Architecture

	```
	Snapshot tab (fingerspelling):
	webcam frame → MediaPipe Hand → trained MLP classifier
	(CPU-fast) (PyTorch on CPU, ~50 ms)

	Record sign tab (motion words):
	webcam recording → ffmpeg (480p, 8 fps, ≤4 s, H.264)
	↓
	vLLM video_url block on AMD MI300X port 8000
	↓
	fine-tuned Qwen3-VL-8B (native video understanding)

	Both paths converge:
	↓
	Qwen3-8B sentence composer
	(vLLM on MI300X port 8001)
	↓
	gTTS
	(Google free TTS, MP3)
	```

	## Models

	\| Component \| Source \| Notes \|
	\|---\|---\|---\|
	\| Hand-pose extractor \| MediaPipe HandLandmarker (Google) \| CPU-only, ~50ms/frame — runs on the HF Space CPU \|
	\| Static-letter classifier (Snapshot tab) \| trained-from-scratch MLP on hand-landmark vectors → 26 ASL letters \| 3-layer MLP (63→256→256→128→26), 5K trainable params, GELU+dropout. 88.0% test accuracy on a 1,727-image holdout, 90.4% on the gold set. Weights at `huggingface.co/LucasLooTan/signbridge-asl-classifier` \|
	\| Motion-sign + fallback recognizer \| fine-tuned `Qwen/Qwen3-VL-8B-Instruct` \| LoRA fine-tune on AMD MI300X (rank 16, target q/k/v/o, 2 epochs, 54 min wall-clock on a single MI300X). Eval loss 0.48, transformers gold-set accuracy 92.3%. Merged adapter pushed to `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl` (17.5 GB) \|
	\| Sentence composer \| `Qwen/Qwen3-8B` \| Pulled from HF Hub; served on MI300X via vLLM. Used for every Speak click — AMD is in the critical path \|
	\| Text-to-speech \| `gTTS` (Google's free TTS) \| Tiny dependency, no model download, MP3 output in <1 s. Coqui XTTS-v2 path is preserved as Tier 1 fallback when installed locally \|

	## Datasets

	- Marxulia/asl_sign_languages_alphabets_v03 (HF Hub) — 10,873 photographic ASL letter samples; we extracted MediaPipe landmarks (8,639 hands detected) + used the same images for the LoRA fine-tune (9,786 train / 1,087 eval split)
	- [WLASL](https://github.com/dxli94/WLASL) Top-100 subset — referenced for V2 motion-sign training (not used in V1)

	## ROCm / AMD Developer Cloud experience

	### Day 1 — environment + sanity
	- Provisioned an MI300X-1× droplet (192 GB HBM3, 240 GB RAM, 5 TB scratch) at $1.99/hr in ATL1
	- Selected the prebuilt vLLM 0.17.1 / ROCm 7.2 Quick-Start image — saved ~30 min vs hand-installing
	- ROCm reported the GPU correctly via `rocm-smi`; vLLM spun up Qwen3-VL-32B + Qwen3-8B in parallel within 12 minutes
	- One real friction: vLLM's default `0.0.0.0` binding tripped a Gloo/NCCL error on the host's NIC; fixed by setting `VLLM_HOST_IP=127.0.0.1` and `GLOO_SOCKET_IFNAME=lo` env vars

	### Day 2 — fine-tuning Qwen3-VL-8B with LoRA on MI300X
	- Used the AMD-provided `rocm:latest` Docker image — torch 2.9.1+ROCm, transformers 4.57.6, peft 0.18.1, accelerate 1.13.0 all preinstalled
	- LoRA rank 16 on q/k/v/o projections, FP16, gradient checkpointing with `use_reentrant=False`
	- Critical fix for PEFT + grad-checkpoint: call `model.enable_input_require_grads()` BEFORE wrapping in PEFT (without it, training stalls at step 0 with "None of the inputs have requires_grad=True")
	- 1,224 steps × 4×4 effective batch = 9,786 samples × 2 epochs in 54 minutes; eval loss 0.48
	- Spent ~$2 of the $100 credit on this single fine-tune

	### Day 3 — serving + accuracy comparison
	- Three approaches benchmarked on the same 52-image gold set:
	- Qwen3-VL-32B zero-shot: 19.2% — VLMs without ASL-specific tuning struggle with subtle hand shapes
	- MediaPipe + 5K-param MLP: 90.4% — the textbook approach for static pose classification still wins for cost/accuracy ratio
	- LoRA-tuned Qwen3-VL-8B (transformers eval): 92.3% — best, but 4× slower per inference
	- Hybrid pipeline ships: MediaPipe+MLP for typical fingerspelling (50ms, ~90%), fine-tuned VLM for motion signs and as fallback when no hand is detected
	- Latency on MI300X: Qwen3-8B composer ~0.5s/call, fine-tuned 8B vision recognizer ~1.3s/call

	### What worked well
	- AMD Developer Cloud provisioning was 5 min from "approved" to SSH — credit landed via email and the Quick-Start vLLM image meant zero ROCm setup pain
	- 192 GB HBM3 hosted both the 32B vision model and the 8B composer concurrently (gpu-mem 0.55 + 0.30) with margin for KV cache
	- Fine-tuning + inference + composing on a single MI300X with no swapping or reloading — the multi-tenant story is real
	- The `rocm:latest` Docker image had the entire training stack (torch, transformers, peft, accelerate, datasets) preinstalled and tested

	### What we'd flag as friction
	- vLLM 0.17.1's image-preprocessing for Qwen3-VL doesn't exactly match transformers' processor — the LoRA-tuned model that scored 92.3% in transformers eval drops to 63.5% via the OpenAI-compatible vLLM endpoint. This is upstream and not AMD-specific, but it limited how aggressively we could lean on the fine-tune for the live demo
	- The `low-power state` warning in `rocm-smi` while the GPU was idle was cosmetic but confusing — clarifying that "low-power" doesn't mean "stalled" would help first-time users
	- Setting `VLLM_HOST_IP=127.0.0.1` for single-GPU vLLM on a Gloo backend isn't documented in the AMD vLLM Quick-Start; we found it from a vLLM GitHub issue

	### What we'd flag as friction
	TODO

	## Why AMD MI300X — concretely

	The pipeline (MediaPipe Hand + fine-tuned Qwen3-VL-8B + Qwen3-8B composer + gTTS)
	fits comfortably on a single MI300X with KV-cache headroom. The same workload
	on NVIDIA forces sharding once we add the V2 reasoner.

	\| Component \| Weights (FP16) \| MI300X 1× (192 GB) \| H100 80 GB \| H200 141 GB \|
	\|---\|---\|---\|---\|---\|
	\| Fine-tuned Qwen3-VL-8B (vision, native video) \| ~16 GB \| ✅ fits \| ✅ \| ✅ \|
	\| Qwen3-8B (composer) \| ~16 GB \| ✅ fits \| ✅ \| ✅ \|
	\| Whisper-large-v3 (V2 reverse direction) \| ~3 GB \| ✅ fits \| ⚠ tight \| ✅ \|
	\| gTTS (no GPU footprint — Python-side cloud call) \| n/a \| ✅ \| ✅ \| ✅ \|
	\| (V2) Llama-3.1-70B FP8 reasoner upgrade \| ~70 GB \| ✅ still fits \| ❌ doesn't fit at all \| ⚠ FP8 only, no headroom \|
	\| Concurrent serving + KV cache \| ✅ comfortable \| ❌ requires sharding \| ⚠ tight \| ✅ \|

	The single-GPU concurrency story is the AMD pitch. V1 fits anywhere; the
	architecture has clear MI300X headroom for V2 model upgrades that NVIDIA
	H100 cannot match without sharding across multiple cards.

	## Deployment ethics

	SignBridge is a substrate, not a finished product. We ship the open-source
	multi-modal pipeline so Deaf-led organisations — schools-for-the-Deaf, regional
	NGOs, ministries of social services — can deploy on their own AMD compute,
	fine-tune for their dialect, and own the deployment.

	Three principles, drawn from the Deaf-led literature on sign-language AI:

	1. ASL only V1 is a scope decision. Sign languages are not interchangeable
	— BSL, ISL, MSL, CSL, and 200+ other sign languages each deserve their own
	teams, training data, and Deaf community leadership. Bragg et al.
	["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1)
	(2024, Deaf-led position paper) is direct on this point.

	2. Deaf community engagement before deployment. Per the ACM ASSETS 2025
	["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390)
	the productive ML/Deaf collaboration question isn't "how do we build this?"
	but "should we build this, for whom, with whom?". Any deployment
	downstream of this code must answer that locally.

	3. Privacy by default. SignBridge sessions are ephemeral — webcam frames
	and audio are processed in-memory and not persisted server-side beyond the
	request lifetime. In the spirit of [Privacy-Aware Sign Language Translation
	at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024).

	## Future work — academic foundations we'd build on next

	- SignCLIP ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)) —
	learned text↔sign embeddings; replaces the prompt-only composer with a
	CLIP-style alignment head for higher-quality sign-to-English mapping.
	- SL-SLR ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)) —
	self-supervised representation learning with motion-aware data augmentation;
	the right path if we ever train a custom classifier on raw signer footage.
	- Continuous SLT trained models (Swin-MSTP, Stack Transformer) — the
	current trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM
	zero-shot path we ship here is a deployment-cost play, not an
	accuracy-ceiling play; SignCLIP-style learned embeddings are the natural
	V2 step toward that ceiling.

	## Latency

	Target: ≤ 2 s from end-of-sign to start of speech.

	Measured on a single MI300X (Day 3):
	- MediaPipe Hand detection per frame: ~50 ms (CPU)
	- Trained MLP per landmark vector: ~5 ms (CPU)
	- Fine-tuned Qwen3-VL-8B per recording (native video, ~1680 prompt tokens): ~1-2 s
	- Qwen3-8B sentence composition (≤ 30 tokens): ~300 ms
	- gTTS first-audio-chunk: ~500 ms (single round-trip to Google)

	## MI300X vs NVIDIA H100 — the AMD pitch

	\| Item \| MI300X (1 GPU) \| H100 (1 GPU) \| H100 cluster needed \|
	\|---\|---\|---\|---\|
	\| Fine-tuned Qwen3-VL-8B + Qwen3-8B (both FP16) \| ✅ fits with margin \| ⚠️ tight (~32 GB) \| maybe 1×, no headroom \|
	\| + Whisper-large-v3 + MLP classifier \| ✅ all concurrent \| ⚠️ tight (~35 GB total + KV) \| likely 1× but no headroom \|
	\| + 70B reasoner upgrade (V2) \| ✅ 70B FP8 ~70 GB still fits \| ❌ doesn't fit at all \| ≥3× \|

	The single-GPU concurrency story is the AMD pitch. This V1 fits on H100;
	the architecture has clear headroom on MI300X for higher-quality V2 models.

	## License

	MIT.