# SignBridge: technical walkthrough

> Internal technical record of the build. Not a submission deliverable
> (Build-in-Public extra challenge was dropped on 2026-05-07).
> Kept around because it documents the AMD-specific engineering thinking
> and is useful if anyone later asks "why these design choices?".

## What we built

A real-time, webcam-based ASL → English speech translator. A deaf user signs into the webcam; the pipeline (MediaPipe Hand → trained MLP for static fingerspelling, OR webcam clip → ffmpeg → fine-tuned Qwen3-VL-8B native video, then in both cases → Qwen3-8B composer → gTTS) returns spoken English in under 2 seconds. Designed to fit Track 3 (Vision & Multimodal AI) with both LLMs running concurrently on a single AMD Instinct MI300X.

## Why AMD MI300X

- 192 GB HBM3: the trained MLP classifier (~478 KB), fine-tuned Qwen3-VL-8B (~16 GB FP16), Qwen3-8B composer (~16 GB FP16), and (V2 stretch) Whisper-large-v3 (~3 GB) all fit concurrently with margin for KV cache.
- 5.3 TB/s memory bandwidth: the workload is bandwidth-bound and streaming (many small inferences per second on the MLP, LLM next-token decoding, and the Qwen3-VL vision encoder), which is exactly where memory bandwidth pays off.

## Architecture

```
Snapshot tab (fingerspelling):
  webcam frame → MediaPipe Hand → trained MLP classifier
                 (CPU-fast)       (PyTorch on CPU, ~50 ms)

Record sign tab (motion words):
  webcam recording → ffmpeg (480p, 8 fps, ≤4 s, H.264)
        ↓
  vLLM video_url block on AMD MI300X port 8000
        ↓
  fine-tuned Qwen3-VL-8B (native video understanding)

Both paths converge:
        ↓
  Qwen3-8B sentence composer (vLLM on MI300X port 8001)
        ↓
  gTTS (Google free TTS, MP3)
```

## Models

| Component | Source | Notes |
|---|---|---|
| Hand-pose extractor | MediaPipe HandLandmarker (Google) | CPU-only, ~50 ms/frame; runs on the HF Space CPU |
| Static-letter classifier (Snapshot tab) | **trained-from-scratch MLP** on hand-landmark vectors → 26 ASL letters | MLP with three hidden layers (63→256→256→128→26), ~118K trainable params, GELU+dropout. **88.0% test accuracy** on a 1,727-image holdout, **90.4% on the gold set**. Weights at `huggingface.co/LucasLooTan/signbridge-asl-classifier` |
| Motion-sign + fallback recognizer | **fine-tuned `Qwen/Qwen3-VL-8B-Instruct`** | LoRA fine-tune on AMD MI300X (rank 16, target q/k/v/o, 2 epochs, 54 min wall-clock on a single MI300X). Eval loss 0.48, transformers gold-set accuracy 92.3%. Merged adapter pushed to `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl` (17.5 GB) |
| Sentence composer | `Qwen/Qwen3-8B` | Pulled from HF Hub; served on MI300X via vLLM. Used for every Speak click, so AMD is in the critical path |
| Text-to-speech | `gTTS` (Google's free TTS) | Tiny dependency, no model download, MP3 output in <1 s. Coqui XTTS-v2 path is preserved as Tier 1 fallback when installed locally |

## Datasets

- **Marxulia/asl_sign_languages_alphabets_v03** (HF Hub): 10,873 photographic ASL letter samples; we extracted MediaPipe landmarks (8,639 hands detected) and used the same images for the LoRA fine-tune (9,786 train / 1,087 eval split)
- [WLASL](https://github.com/dxli94/WLASL) Top-100 subset: referenced for V2 motion-sign training (not used in V1)

## ROCm / AMD Developer Cloud experience

### Day 1: environment + sanity

- Provisioned an MI300X-1× droplet (192 GB HBM3, 240 GB RAM, 5 TB scratch) at $1.99/hr in ATL1
- Selected the prebuilt **vLLM 0.17.1 / ROCm 7.2** Quick-Start image, which saved ~30 min vs hand-installing
- ROCm reported the GPU correctly via `rocm-smi`; vLLM spun up Qwen3-VL-32B + Qwen3-8B in parallel within 12 minutes
- One real friction: vLLM's default `0.0.0.0` binding tripped a Gloo/NCCL error on the host's NIC; fixed by setting the `VLLM_HOST_IP=127.0.0.1` and `GLOO_SOCKET_IFNAME=lo` env vars

### Day 2: fine-tuning Qwen3-VL-8B with LoRA on MI300X

- Used the AMD-provided `rocm:latest` Docker image, with torch 2.9.1+ROCm, transformers 4.57.6, peft 0.18.1, and accelerate 1.13.0 all preinstalled
- LoRA rank 16 on q/k/v/o projections, FP16, gradient checkpointing
  with `use_reentrant=False`
- Critical fix for PEFT + gradient checkpointing: call `model.enable_input_require_grads()` BEFORE wrapping in PEFT (without it, training stalls at step 0 with "None of the inputs have requires_grad=True")
- 1,224 steps × 4×4 effective batch (16) = 9,786 samples × 2 epochs in 54 minutes; eval loss 0.48
- Spent ~$2 of the $100 credit on this single fine-tune

### Day 3: serving + accuracy comparison

- **Three approaches benchmarked on the same 52-image gold set:**
  - Qwen3-VL-32B zero-shot: **19.2%**. VLMs without ASL-specific tuning struggle with subtle hand shapes
  - MediaPipe + ~118K-param MLP: **90.4%**. The textbook approach for static pose classification still wins on cost/accuracy ratio
  - LoRA-tuned Qwen3-VL-8B (transformers eval): **92.3%**. Best, but 4× slower per inference
- Hybrid pipeline ships: MediaPipe+MLP for typical fingerspelling (50 ms, ~90%), fine-tuned VLM for motion signs and as fallback when no hand is detected
- Latency on MI300X: Qwen3-8B composer ~0.5 s/call, fine-tuned 8B vision recognizer ~1.3 s/call

### What worked well

- AMD Developer Cloud provisioning was 5 min from "approved" to SSH: credit landed via email, and the Quick-Start vLLM image meant zero ROCm setup pain
- 192 GB HBM3 hosted both the 32B vision model and the 8B composer concurrently (gpu-mem 0.55 + 0.30) with margin for KV cache
- Fine-tuning + inference + composing on a single MI300X with no swapping or reloading: the multi-tenant story is real
- The `rocm:latest` Docker image had the entire training stack (torch, transformers, peft, accelerate, datasets) preinstalled and tested

### What we'd flag as friction

- vLLM 0.17.1's image preprocessing for Qwen3-VL doesn't exactly match transformers' processor: the LoRA-tuned model that scored 92.3% in transformers eval drops to 63.5% via the OpenAI-compatible vLLM endpoint.
  This is upstream and not AMD-specific, but it limited how aggressively we could lean on the fine-tune for the live demo
- The `low-power state` warning in `rocm-smi` while the GPU was idle was cosmetic but confusing; clarifying that "low-power" doesn't mean "stalled" would help first-time users
- Setting `VLLM_HOST_IP=127.0.0.1` for single-GPU vLLM on a Gloo backend isn't documented in the AMD vLLM Quick-Start; we found it from a vLLM GitHub issue

## Why AMD MI300X, concretely

The pipeline (MediaPipe Hand + fine-tuned Qwen3-VL-8B + Qwen3-8B composer + gTTS) fits comfortably on a single MI300X with KV-cache headroom. The same workload on an NVIDIA H100 forces sharding once we add the V2 reasoner.

| Component | Weights (FP16 unless noted) | MI300X 1× (192 GB) | H100 80 GB | H200 141 GB |
|---|---|---|---|---|
| Fine-tuned Qwen3-VL-8B (vision, native video) | ~16 GB | ✅ fits | ✅ | ✅ |
| Qwen3-8B (composer) | ~16 GB | ✅ fits | ✅ | ✅ |
| Whisper-large-v3 (V2 reverse direction) | ~3 GB | ✅ fits | ⚠ tight | ✅ |
| gTTS (no GPU footprint; Python-side cloud call) | n/a | ✅ | ✅ | ✅ |
| (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | ✅ still fits | ❌ doesn't fit at all | ⚠ FP8 only, no headroom |
| **Concurrent serving + KV cache** | n/a | ✅ comfortable | ❌ requires sharding | ⚠ tight |

The single-GPU concurrency story is the AMD pitch. V1 fits anywhere; the architecture has clear MI300X headroom for V2 model upgrades that NVIDIA H100 cannot match without sharding across multiple cards.

## Deployment ethics

SignBridge is a *substrate*, not a finished product. We ship the open-source multi-modal pipeline so Deaf-led organisations (schools-for-the-Deaf, regional NGOs, ministries of social services) can deploy on their own AMD compute, fine-tune for their dialect, and own the deployment.

Three principles, drawn from the Deaf-led literature on sign-language AI:

1. **ASL only V1** is a scope decision.
   Sign languages are not interchangeable: BSL, ISL, MSL, CSL, and 200+ other sign languages each deserve their own teams, training data, and Deaf community leadership. Desai et al., ["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1) (2024, Deaf-led position paper), is direct on this point.
2. **Deaf community engagement before deployment.** Per the ACM ASSETS 2025 paper ["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390), the productive ML/Deaf collaboration question isn't "how do we build this?" but "*should* we build this, *for whom*, *with whom*?". Any deployment downstream of this code must answer that locally.
3. **Privacy by default.** SignBridge sessions are ephemeral: webcam frames and audio are processed in-memory and not persisted server-side beyond the request lifetime. This follows the spirit of [Privacy-Aware Sign Language Translation at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024).

## Future work: academic foundations we'd build on next

- **SignCLIP** ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)): learned text↔sign embeddings; replaces the prompt-only composer with a CLIP-style alignment head for higher-quality sign-to-English mapping.
- **SL-SLR** ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)): self-supervised representation learning with motion-aware data augmentation; the right path if we ever train a custom classifier on raw signer footage.
- **Continuous SLT trained models** (Swin-MSTP, Stack Transformer): the current trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM path we ship here is a *deployment-cost* play, not an accuracy-ceiling play; SignCLIP-style learned embeddings are the natural V2 step toward that ceiling.

## Latency

Target: ≤ 2 s from end-of-sign to start of speech.
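
As a rough check against this target, the per-stage figures reported in this section can be summed per path. The sketch below is illustrative only: the stage keys and path names are hypothetical labels, it assumes the stages run strictly one after another, and it leaves out ffmpeg transcoding and network time, which are not reported here.

```python
# Per-stage latency in milliseconds as (low, high), taken from the figures
# reported in this section. Single-valued stages repeat the same number.
STAGE_MS = {
    "mediapipe_hand": (50, 50),
    "mlp_classifier": (5, 5),
    "qwen3_vl_video": (1000, 2000),  # reported as ~1-2 s per recording
    "qwen3_composer": (300, 300),
    "gtts_first_chunk": (500, 500),
}

# Assumption: each path executes its stages sequentially with no overlap.
PATHS = {
    "snapshot": ["mediapipe_hand", "mlp_classifier",
                 "qwen3_composer", "gtts_first_chunk"],
    "record_sign": ["qwen3_vl_video", "qwen3_composer", "gtts_first_chunk"],
}

TARGET_MS = 2000  # the <= 2 s end-of-sign to start-of-speech target


def path_budget(path: str) -> tuple[int, int]:
    """Return (best-case, worst-case) end-to-end latency in ms for a path."""
    low = sum(STAGE_MS[stage][0] for stage in PATHS[path])
    high = sum(STAGE_MS[stage][1] for stage in PATHS[path])
    return low, high


if __name__ == "__main__":
    for name in PATHS:
        low, high = path_budget(name)
        verdict = "within" if high <= TARGET_MS else "may exceed"
        print(f"{name}: {low}-{high} ms ({verdict} the {TARGET_MS} ms target)")
```

Under these assumptions the snapshot path sums to 855 ms and the record-sign path to between 1.8 s and 2.8 s, so whether the latter stays inside the 2 s target depends on where in the 1-2 s range the video model lands.
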
Measured on a single MI300X (Day 3):

- MediaPipe Hand detection per frame: ~50 ms (CPU)
- Trained MLP per landmark vector: ~5 ms (CPU)
- Fine-tuned Qwen3-VL-8B per recording (native video, ~1,680 prompt tokens): ~1-2 s
- Qwen3-8B sentence composition (≤ 30 tokens): ~300 ms
- gTTS first-audio-chunk: ~500 ms (single round-trip to Google)

## MI300X vs NVIDIA H100: the AMD pitch

| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|---|---|---|---|
| Fine-tuned Qwen3-VL-8B + Qwen3-8B (both FP16) | ✅ fits with margin | ⚠️ tight (~32 GB) | maybe 1×, no headroom |
| + Whisper-large-v3 + MLP classifier | ✅ all concurrent | ⚠️ tight (~35 GB total + KV) | likely 1× but no headroom |
| + 70B reasoner upgrade (V2) | ✅ 70B FP8 ~70 GB still fits | ❌ doesn't fit at all | ≥3× |

The single-GPU concurrency story is the AMD pitch. This V1 fits on H100; the architecture has clear headroom on MI300X for higher-quality V2 models.

## License

MIT.