| # FormScout: Functional Movement Screening, scored small | |
| **Project specification & architecture documentation** | |
| *Build Small Hackathon (Gradio × Hugging Face), Track: Backyard AI* | |
| *Working title; rename freely. Doc version 0.1, June 2026.* | |
| --- | |
| ## 1. One-paragraph pitch | |
| A basketball team's physiotherapist screens players with the **Functional Movement Screen (FMS)**: seven movement patterns, each scored 0-3 by eye. The scoring is slow, subjective, and hard to reproduce across raters or across months. FormScout is a Gradio app that takes a video of an athlete performing an FMS test, extracts 2D and 3D body pose, measures the biomechanics the FMS rubric actually cares about, and produces a 0-3 score *with a written rationale and an annotated overlay*, anchored to the physio's own previously scored clips. It is a **screening aid that standardizes and speeds up the physio's first pass**, not a diagnosis and not an injury predictor. Everything runs on models that fit on a laptop. | |
| --- | |
| ## 2. The problem, honestly | |
| The FMS is a seven-test battery (Deep Squat, Hurdle Step, In-Line Lunge, Shoulder Mobility, Active Straight-Leg Raise, Trunk Stability Push-Up, Rotary Stability), each scored 0-3 for a composite of 0-21. A score of 0 means **pain** during the movement and is an automatic red flag for clinical referral. Three of the tests have associated **clearing tests** (shoulder, spinal extension, spinal flexion) that also force a 0 on pain. | |
| Two facts shape this project and should be stated plainly in the demo and the writeup: | |
| - **Inter-rater reliability is decent but not perfect.** Composite-score reliability is moderate-to-good (ICC roughly 0.7-0.8), but novice and less-experienced raters grade component scores inconsistently. This is the real, addressable pain point: **variance between raters and over time.** | |
| - **Predictive validity for injury is weak/mixed.** The popular "≤14 = higher injury risk" cutoff is not a reliable predictor on its own. So FormScout must **not** be sold as injury prediction. | |
| **Where FormScout genuinely helps:** | |
| 1. A repeatable, objective **digital baseline** to track an athlete over a season. | |
| 2. **Asymmetry detection** (left vs. right), which is one of the FMS's most defensible outputs. | |
| 3. A fast, consistent **first-pass / second opinion** that reduces rater variance. | |
| 4. **Explainability**: it shows *which compensation* it saw, not just a number. | |
| This honest framing is also strategic: the Backyard AI track is judged partly on "honest fit between problem and the small-model constraint." Overclaiming clinical power would hurt the submission, not help it. | |
| --- | |
| ## 3. Why this fits the hackathon | |
| | Hackathon rule | How FormScout satisfies it | | |
| |---|---| | |
| | **Total params ≤ 32B** | Recommended config sums to ~18B. A portfolio of small specialists beats one monolith, which is on-theme for "think small." | | |
| | **Built on Gradio, hosted as a HF Space** | Gradio app with `gr.Video` input, a custom-styled results panel, on-Space inference (ZeroGPU or llama.cpp). | | |
| | **Show, Don't Tell** | Demo video = physio uploads a real player clip, gets a scored overlay in seconds. Social post = before/after of a manual vs. assisted screening session. | | |
| | **Track: Backyard AI** | The "someone you know" is the team physiotherapist. The deliverable is something they *actually use* on real players. | | |
| **Badge targets (aim for all six):** | |
| - **Off the Grid**: no cloud APIs; all models served on the Space. | |
| - **Well-Tuned**: the skeletal-temporal scoring head is fine-tuned on the physio's labels and published to the Hub. | |
| - **Off-Brand**: custom Gradio frontend (scorecard UI, video overlay, per-test rubric panel), pushing past default Gradio. | |
| - **Llama Champion**: VLM + embedding model served through llama.cpp (GGUF builds exist for both). | |
| - **Sharing is Caring**: publish the agent trace (one full screening run, agent by agent) to the Hub. | |
| - **Field Notes**: a blog post on building a clinical-adjacent AQA pipeline under a 32B budget, with the honesty section front and center. | |
| --- | |
| ## 4. Core technical framing: FMS *is* Action Quality Assessment | |
| Don't reinvent this from scratch. **Action Quality Assessment (AQA)** is the established field for "score how well a movement was performed." Skeleton-based AQA (sports scoring, surgical-skill and rehab assessment) is the directly relevant lineage. The "Skeletal-Temporal Transformer" idea maps onto the **AQA scoring head**. | |
| The key design constraint is the **tiny labeled dataset** (a couple of physio-scored videos). That rules out training a large score regressor from scratch and dictates a hybrid approach: | |
| 1. **Deterministic biomechanics** carry most of the load. The FMS rubric is, to a large degree, a set of *angle and alignment thresholds* (e.g. Deep Squat "3" = femur below horizontal, torso parallel to tibia, knees tracking over feet, dowel over feet). These are computable from 3D pose with **zero training** and are inherently interpretable, which is exactly what earns a physio's trust. | |
| 2. **A small learned head** (ST-GCN or a compact temporal transformer) refines the score and captures the patterns rules miss. It is small enough to fine-tune on a few labeled clips, *especially* if pre-trained on public AQA/pose datasets first. | |
| 3. **Retrieval over the physio's labeled clips** (RAG) gives the language model few-shot anchors at judgment time, the right move when you have examples but not enough to train on. | |
| 4. **A VLM as the judge/explainer** synthesizes rubric + measurements + retrieved exemplars into a final score and a human-readable rationale, and conservatively flags anything pain-related for a human. | |
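To make item 1 concrete, here is a minimal sketch of the most basic deterministic measurement: the angle at a joint, computed from three 3D points. The function name and the example coordinates are illustrative assumptions, not part of any pose library or of the FormScout codebase.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle at vertex b, in degrees, formed by the 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0.0:
        raise ValueError("degenerate joint: two points coincide")
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Hypothetical hip/knee/ankle positions in metres (y is up).
standing_knee = joint_angle_deg((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0))
seated_knee = joint_angle_deg((0.5, 0.5, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0))
```

A straight leg yields an angle near 180 degrees and a thigh held level over a vertical shin yields about 90, so rubric thresholds can be expressed directly in degrees.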
| --- | |
| ## 5. Parameter budget (the single most important table) | |
| Assume "total parameters" = **sum of all model weights in the pipeline**. Design to this; confirm the exact interpretation in the Discord AMA. | |
| ### Recommended config: "Portfolio of specialists" (~18B) | |
| | Component | Model | Params | Role | | |
| |---|---|---:|---| | |
| | 2D pose + tracking | YOLO26-Pose (L/X) | ~0.05B | Per-frame 17-keypoint skeletons, multi-person tracking | | |
| | Segmentation | SAM 3.1 (base) | ~0.85B | Clean athlete mask, occlusion handling, prompt for 3D | | |
| | 3D body | SAM 3D Body | ~0.7-1B* | Single-image 3D mesh → true joint angles, view-invariant | | |
| | Scoring head | ST-GCN / temporal transformer (fine-tuned) | ~0.01-0.05B | Pose-sequence → candidate 0-3 + confidence | | |
| | Judge / explainer | Qwen3-VL-8B-Instruct | 8B | Movement ID, rubric reasoning, final score + rationale | | |
| | Retrieval | Qwen3-VL-Embedding-8B | 8B | Nearest physio-scored reference clips (RAG) | | |
| | **Total** | | **~17.8B** | Comfortable headroom under 32B | | |
| \* SAM 3D Body's exact count isn't published prominently, so verify it on the model card. It's SAM-3-family and, by the estimate above, at most about a billion parameters; budget impact is small either way. The two 8B Qwen models **share the Qwen3-VL-8B backbone** (the embedder is built on the instruct model), which is conceptually clean and operationally efficient. | |
| ### Alternative config: "Heavy reasoner" (~28.7B) | |
| Swap the 8B judge for **Qwen3.6-27B** (multimodal, strong tool-calling, MTP speedups on llama.cpp). Budget then = 27 + ~0.85 + ~0.7-1 + small ≈ **28.7B**. This **leaves no room for the 8B embedder**, so you'd drop RAG (or replace it with a sub-0.5B embedder, or use pose-feature similarity for retrieval). Note: Qwen3.6-27B's MTP speculative decoding currently can't run simultaneously with image input (`--mmproj`), so for vision you run it without MTP. | |
| **Recommendation: ship the ~18B portfolio config.** RAG over the physio's few labeled clips is worth more than raw reasoning horsepower on this task, the headroom de-risks the budget, and "many small specialists" is the better hackathon story. | |
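The budget arithmetic for both configs can be checked in a few lines. This is a sketch that assumes the "sum of all weights" reading of the rule and uses midpoints of the ranges in the table (0.85B for SAM 3D Body, 0.03B for the scoring head); the dictionary keys are just labels.

```python
BUDGET_B = 32.0  # hackathon cap, in billions of parameters

# Midpoint estimates from the table above; verify each on its model card.
portfolio = {
    "YOLO26-Pose": 0.05,
    "SAM 3.1": 0.85,
    "SAM 3D Body": 0.85,
    "ST-GCN head": 0.03,
    "Qwen3-VL-8B-Instruct": 8.0,
    "Qwen3-VL-Embedding-8B": 8.0,
}

def total_params(config):
    """Sum of all model weights in the pipeline, in billions."""
    return sum(config.values())

# Heavy reasoner: swap the 8B judge for the 27B model and drop the embedder.
heavy = {name: size for name, size in portfolio.items() if not name.startswith("Qwen3-VL")}
heavy["Qwen3.6-27B"] = 27.0

# What the heavy config would cost if the 8B embedder were kept anyway.
heavy_with_rag = dict(heavy, **{"Qwen3-VL-Embedding-8B": 8.0})

for name, cfg in [("portfolio", portfolio), ("heavy", heavy), ("heavy + RAG", heavy_with_rag)]:
    total = total_params(cfg)
    print(f"{name}: {total:.2f}B, fits={total <= BUDGET_B}")
```

With these estimates the portfolio lands near 17.8B, the heavy config near 28.8B once the small models are counted, and the heavy config plus the 8B embedder overshoots the cap.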
| --- | |
| ## 6. Model selection rationale | |
| **YOLO26-Pose**: current-generation YOLO pose; single forward pass for detection + keypoints, NMS-free, real-time even on edge. Tiny param cost. It also handles **multiple people in frame** (important: team videos often have other players/staff visible) and feeds keypoints downstream. Off-the-shelf it predicts COCO human keypoints; can be fine-tuned for custom landmarks (e.g. dowel endpoints) if needed. | |
| **SAM 3.1**: gives a clean athlete mask and stable multi-object video tracking (Object Multiplex makes it fast). Two jobs: (a) isolate the target athlete from teammates/background so pose and 3D aren't polluted, (b) provide the mask prompt that SAM 3D Body consumes. Concept prompts ("the person in the blue jersey performing the squat") are a bonus for disambiguation. | |
| **SAM 3D Body**: *the addition that makes the scores trustworthy.* FMS criteria are joint angles and symmetry; 2D pose can't measure these reliably across camera angles (projection ambiguity). 3D mesh recovery from a single image, promptable with the 2D keypoints + mask you already have, yields view-invariant joint angles (the MHR rig even separates skeletal structure from soft-tissue shape, which is convenient for angle extraction). This is the difference between "looks bent" and "femur is 4° above horizontal, so not a 3." | |
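As an illustration of the "femur is 4° above horizontal" style of measurement, the sketch below computes the signed elevation of the thigh segment from two 3D joints. It assumes a y-up coordinate frame with the ground plane spanned by x and z; the function name and coordinates are hypothetical, and SAM 3D Body's real output frame should be checked before reusing the sign convention.

```python
import math

def femur_elevation_deg(hip, knee):
    """Signed angle of the knee-to-hip segment relative to the horizontal plane.

    Positive: hip is above the knee (femur above horizontal).
    Negative: hip has dropped below the knee (femur below horizontal).
    Assumes y is the vertical axis.
    """
    dx, dy, dz = (hip[i] - knee[i] for i in range(3))
    horizontal = math.hypot(dx, dz)  # length of the segment's ground projection
    return math.degrees(math.atan2(dy, horizontal))

# Hypothetical bottom-of-squat positions in metres.
shallow = femur_elevation_deg(hip=(-0.40, 0.528, 0.0), knee=(0.0, 0.50, 0.0))
deep = femur_elevation_deg(hip=(-0.40, 0.45, 0.0), knee=(0.0, 0.50, 0.0))
```

Because the angle is taken against the horizontal plane in 3D rather than against image axes, it does not change when the camera moves around the athlete, which is the point of the view-invariance argument above.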
| **Skeletal-temporal scoring head**: your AQA component and your **Well-Tuned** badge. Recommend a compact **ST-GCN** (graph conv over the skeleton, temporal conv over frames) over a from-scratch transformer, because it's far more data-efficient on a tiny labeled set. Pre-train on public AQA / pose-action data, then fine-tune on the physio's labels. Output: per-test candidate score + a confidence the judge can weigh. | |
| **Qwen3-VL-8B-Instruct**: the judge. Strong video temporal modeling (Interleaved-MRoPE, timestamp alignment) suits movement clips. It identifies which of the 7 tests is being performed, reads the biomechanics, considers retrieved exemplars and the head's candidate, and emits the final score + rationale + detected compensation. GGUF → llama.cpp → Llama Champion. | |
| **Qwen3-VL-Embedding-8B**: retrieval. Embeds the query clip (or its keyframes/pose-render) and finds the physio's most similar already-scored clips to anchor the judge. Top multimodal retriever on MMEB-V2; same backbone as the judge; GGUF available. | |
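The retrieval step itself is plain nearest-neighbour search over stored embeddings. The sketch below assumes the embeddings have already been produced by the embedding model and uses toy 3-dimensional vectors; the index layout, clip ids, and scores are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, index, k=3):
    """Return the k most similar labeled clips as exemplar dicts.

    index maps clip_id -> (embedding, physio_score).
    """
    scored = [
        {"clip_id": clip_id, "score": score, "similarity": cosine(query, emb)}
        for clip_id, (emb, score) in index.items()
    ]
    return sorted(scored, key=lambda e: e["similarity"], reverse=True)[:k]

# Toy index: real embeddings would be much higher-dimensional.
index = {
    "squat_a": ([1.0, 0.0, 0.0], 3),
    "squat_b": ([0.8, 0.6, 0.0], 2),
    "lunge_a": ([0.0, 0.0, 1.0], 2),
}
exemplars = top_k([0.9, 0.1, 0.0], index, k=2)
```

With only a handful of labeled clips, a brute-force scan like this is enough; a vector database would add little at this scale.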
| --- | |
| ## 7. Architecture: an agentic pipeline | |
| Structured as cooperating specialist agents (maps naturally onto an OFP-style orchestration, with a Director coordinating and quality-gating). Each agent has one job and a typed output. | |
| ``` | |
|                ┌────────────────────────────────────────────────┐ | |
| video upload ─▶│ IngestAgent                                    │ | |
|                │ decode, normalize FPS, sample frames           │ | |
|                └────────────────────────┬───────────────────────┘ | |
|                                         ▼ | |
|                ┌────────────────────────────────────────────────┐ | |
|                │ SegmentationAgent (SAM 3.1)                    │ | |
|                │ athlete mask + track id (reject teammates)     │ | |
|                └────────────────────────┬───────────────────────┘ | |
|                                         ▼ | |
|               ┌─────────────────────────┴────────────────────────┐ | |
|               ▼                                                  ▼ | |
| ┌───────────────────────────┐                      ┌───────────────────────────┐ | |
| │ PoseAgent (YOLO26-Pose)   │                      │ Body3DAgent (SAM 3D Body) │ | |
| │ 2D keypoints per frame    │ ──keypoints+mask───▶ │ 3D mesh / joint angles    │ | |
| └─────────────┬─────────────┘                      └─────────────┬─────────────┘ | |
|               └─────────────────────────┬────────────────────────┘ | |
|                                         ▼ | |
|                ┌────────────────────────────────────────────────┐ | |
|                │ MovementClassifierAgent                        │ | |
|                │ which of the 7 FMS tests? (VLM or small CLS)   │ | |
|                └────────────────────────┬───────────────────────┘ | |
|                                         ▼ | |
|           ┌─────────────────────────────┼─────────────────────────┐ | |
|           ▼                             ▼                         ▼ | |
| ┌────────────────────┐  ┌─────────────────────────┐  ┌────────────────────────┐ | |
| │ BiomechanicsAgent  │  │ ScoringAgent (ST-GCN)   │  │ RetrievalAgent         │ | |
| │ rubric angles,     │  │ candidate 0-3 + conf    │  │ (Qwen3-VL-Embedding)   │ | |
| │ ROM, symmetry,     │  │ from pose sequence      │  │ k nearest physio clips │ | |
| │ alignment, timing  │  │                         │  │ + their scores         │ | |
| └─────────┬──────────┘  └───────────────┬─────────┘  └────────────┬───────────┘ | |
|           └─────────────────────────────┼─────────────────────────┘ | |
|                                         ▼ | |
|                ┌────────────────────────────────────────────────┐ | |
|                │ JudgeAgent (Qwen3-VL-8B)                       │ | |
|                │ rubric + measurements + exemplars + candidate  │ | |
|                │ → final 0-3, rationale, compensation tag,      │ | |
|                │ corrective hint, PAIN/CLEARING → defer         │ | |
|                └────────────────────────┬───────────────────────┘ | |
|                                         ▼ | |
|                ┌────────────────────────────────────────────────┐ | |
|                │ ReportAgent                                    │ | |
|                │ per-test card, composite 0-21, asymmetry       │ | |
|                │ flags, annotated video, exportable PDF         │ | |
|                └────────────────────────────────────────────────┘ | |
| ``` | |
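The IngestAgent's "normalize FPS, sample frames" step can be sketched as an index computation that is independent of the video decoder. The function name and the choice of nearest-frame sampling are assumptions; a real implementation might interpolate or rely on the decoder's own resampling.

```python
def resample_indices(n_frames, src_fps, target_fps):
    """Indices of source frames that approximate a constant target_fps.

    Uses nearest-frame sampling; frames repeat when target_fps > src_fps.
    """
    if n_frames <= 0 or src_fps <= 0 or target_fps <= 0:
        raise ValueError("n_frames and frame rates must be positive")
    # Small epsilon so exact ratios are not lost to floating-point error.
    n_out = max(1, int(n_frames * target_fps / src_fps + 1e-9))
    step = src_fps / target_fps
    return [min(n_frames - 1, round(i * step)) for i in range(n_out)]

# A 2-second clip shot at 60 fps, normalized to 15 fps: every 4th frame.
idx = resample_indices(120, 60.0, 15.0)
```

Normalizing every clip to one frame rate keeps temporal features (and the ST-GCN input) comparable across phones and cameras.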
| **Agent contracts (sketch):** | |
| - `IngestAgent` → `{frames[], fps, duration, n_people}` | |
| - `SegmentationAgent` → `{athlete_track_id, masks[]}` | |
| - `PoseAgent` → `{keypoints_2d[frame][joint]={x,y,conf}}` | |
| - `Body3DAgent` → `{joints_3d[frame][joint]={x,y,z}, mesh_optional}` | |
| - `MovementClassifierAgent` → `{test_name, side: left|right|n/a, confidence}` | |
| - `BiomechanicsAgent` → `{features: {torso_tibia_angle, hip_flexion_deg, knee_valgus_deg, dowel_alignment, L_R_symmetry, ...}}` | |
| - `ScoringAgent` → `{candidate_score: 0-3, confidence}` | |
| - `RetrievalAgent` → `{exemplars: [{clip_id, score, similarity}]}` | |
| - `JudgeAgent` → `{score: 0-3, rationale, compensation_tags[], corrective_hint, needs_human: bool}` | |
| - `ReportAgent` → `{per_test[], composite, asymmetries[], overlay_video, pdf}` | |
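One way to make these contracts "typed" is with frozen dataclasses that validate on construction. The sketch below covers three of the ten outputs; the class names, the 0-1 confidence range, and the validation rules are assumptions layered on the field names listed above.

```python
from dataclasses import dataclass
from typing import Literal

def _check_score(value):
    if value not in (0, 1, 2, 3):
        raise ValueError(f"FMS score must be 0-3, got {value!r}")

@dataclass(frozen=True)
class MovementClassification:
    test_name: str
    side: Literal["left", "right", "n/a"]
    confidence: float

@dataclass(frozen=True)
class ScoringResult:
    candidate_score: int
    confidence: float  # assumed to be a probability-like value in [0, 1]

    def __post_init__(self):
        _check_score(self.candidate_score)
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")

@dataclass(frozen=True)
class JudgeResult:
    score: int
    rationale: str
    compensation_tags: tuple = ()
    corrective_hint: str = ""
    needs_human: bool = False

    def __post_init__(self):
        _check_score(self.score)

candidate = ScoringResult(candidate_score=2, confidence=0.71)
verdict = JudgeResult(score=2, rationale="Heels lift at depth.", compensation_tags=("heel_rise",))
```

Validating at the boundary means a malformed VLM response fails loudly at the JudgeAgent instead of silently producing a bad report.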
| **Quality gating:** if the ST-GCN candidate and the JudgeAgent disagree by ≥1 point (that is, at all, since both scores are integers), or any agent confidence is low, the report marks the test **"low confidence: physio review recommended."** This keeps the human in the loop and is itself a selling point. | |
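The gate reduces to a small predicate. This is a sketch only: the function name is hypothetical and the 0.6 confidence floor is an arbitrary placeholder, since the spec does not define what counts as "low".

```python
def needs_physio_review(candidate_score, judge_score, confidences, min_confidence=0.6):
    """True when the test should be marked low confidence.

    confidences: iterable of per-agent confidence values in [0, 1].
    min_confidence: placeholder threshold; tune it against physio feedback.
    """
    disagreement = abs(candidate_score - judge_score) >= 1
    low_confidence = any(c < min_confidence for c in confidences)
    return disagreement or low_confidence

agree_and_sure = needs_physio_review(2, 2, [0.9, 0.8])
disagree = needs_physio_review(3, 2, [0.9, 0.8])
unsure = needs_physio_review(2, 2, [0.9, 0.4])
```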
| --- | |
| ## 8. Scoring methodology, per test | |
| The seven tests reduce to measurable quantities. Build a small rubric module, one scoring function per test, that consumes the 3D features and returns a score with the triggering reason. Examples: | |
| - **Deep Squat (3):** femur below horizontal AND torso parallel to tibia AND knees tracking over feet AND dowel over feet. **(2):** same but achieved only with heels elevated. **(1):** criteria unmet even with heels elevated. All four conditions are angle/alignment checks on the 3D pose. | |
| - **Hurdle Step / In-Line Lunge / Shoulder Mobility / ASLR:** bilateral, so score each side, **record the lower** as the test score, and **always emit the asymmetry** even when the score is the same. | |
| - **Trunk Stability Push-Up / Rotary Stability:** trunk rigidity and timing of limb movement, measured as temporal features from the pose sequence; the ST-GCN head is most valuable here. Rotary Stability is also performed on both sides, so the lower-side rule above applies to it as well. | |
| - **Pain / clearing tests (0):** the system **cannot** detect pain. Any clearing test, or a visible distress/abort, sets `needs_human = true` and the test is **not auto-scored**. Defer to the physio. State this loudly. | |
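A rubric function for the Deep Squat might look like the sketch below. It assumes the four criteria have already been reduced to booleans by upstream angle checks; the class, field, and function names are illustrative, and returning `None` when the heels-elevated attempt is missing is a design choice of this sketch rather than something the FMS rubric prescribes.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class SquatChecks:
    """Boolean rubric checks for one squat attempt."""
    femur_below_horizontal: bool
    torso_parallel_to_tibia: bool
    knees_over_feet: bool
    dowel_over_feet: bool

def unmet(checks):
    """Names of the criteria that failed."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]

def score_deep_squat(flat, heels_elevated=None, needs_human=False):
    """Return (score, reason); score is None when the test is not auto-scored."""
    if needs_human:
        return None, "pain or clearing test flagged: deferred to the physio"
    if not unmet(flat):
        return 3, "all four criteria met with heels flat"
    if heels_elevated is None:
        return None, "heels-elevated attempt missing: cannot separate a 2 from a 1"
    if not unmet(heels_elevated):
        return 2, "criteria met only with heels elevated; flat attempt failed: " + ", ".join(unmet(flat))
    return 1, "criteria unmet even with heels elevated: " + ", ".join(unmet(heels_elevated))

clean = SquatChecks(True, True, True, True)
shallow = SquatChecks(False, True, True, True)
score, reason = score_deep_squat(flat=shallow, heels_elevated=clean)
```

Returning the reason alongside the number is what lets the report show which condition decided the score.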
| Final composite = sum of seven test scores (0-21), plus an asymmetry summary. The number is never shown without its rationale. | |
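The lower-side rule and the composite can be sketched as two small functions. The names, the use of one representative angle per side for the asymmetry, and the choice to withhold the composite while any test is deferred are assumptions of this sketch.

```python
def bilateral(left_score, right_score, left_deg, right_deg):
    """Combine per-side results: keep the lower score, always report asymmetry."""
    return {
        "score": min(left_score, right_score),
        "score_gap": abs(left_score - right_score),
        "angle_gap_deg": abs(left_deg - right_deg),
    }

def composite(test_scores):
    """Sum seven test scores; None entries mean 'deferred to the physio'."""
    if len(test_scores) != 7:
        raise ValueError("expected exactly seven FMS tests")
    pending = sorted(name for name, s in test_scores.items() if s is None)
    if pending:
        return {"composite": None, "pending": pending}
    return {"composite": sum(test_scores.values()), "pending": []}

# Same score on both sides, yet the angle gap is still surfaced.
hurdle = bilateral(left_score=2, right_score=2, left_deg=88.0, right_deg=74.5)

scores = {
    "deep_squat": 2, "hurdle_step": hurdle["score"], "inline_lunge": 3,
    "shoulder_mobility": 2, "aslr": 3, "trunk_stability_pushup": 2,
    "rotary_stability": None,  # deferred: clearing test required
}
summary = composite(scores)
```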
| --- | |
| ## 9. Data & fine-tuning plan (tiny-dataset survival guide) | |
| You have "a couple" of physio-scored clips. Treat them as gold, not as a training set. | |
| 1. **Deterministic backbone first.** Get the biomechanics rubric working with no training. Validate the measured angles against the physio's scores qualitatively. This alone may be demo-ready. | |
| 2. **Pre-train the ST-GCN** on public pose-action / AQA data (action recognition or generic AQA) so it learns temporal movement structure, not FMS labels. | |
| 3. **Fine-tune on the physio's clips** with heavy augmentation: temporal crops/speed jitter, mirror (left↔right, doubles your bilateral data), camera-angle perturbation in 3D, joint noise. Few-shot, regularized, early-stopped. | |
| 4. **Hold out at least one physio-scored clip** as a sanity check the judge never sees. | |
| 5. **RAG instead of more training.** Every labeled clip goes into the embedding index as a scoring anchor. New clips added later improve the system with no retraining, a nice longitudinal story for the physio. | |
| 6. **Publish the fine-tuned head** to the Hub with a model card (→ Well-Tuned badge). Include the augmentation recipe and the honest "trained on N clips, treat as assistive" caveat. | |
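Three of the augmentations from step 3 can be sketched on raw pose sequences. The sketch assumes joints are stored in COCO-17 order with x measured from the athlete's midline; if the 3D rig uses a different joint layout, the swap table must be rebuilt. Function names and parameters are illustrative.

```python
import random

# COCO-17 left/right swap: index k of the mirrored pose reads joint COCO_SWAP[k].
COCO_SWAP = [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15]

def mirror(seq):
    """Flip left and right: negate x and swap paired joints. seq[frame][joint] = (x, y, z)."""
    return [[(-f[j][0], f[j][1], f[j][2]) for j in COCO_SWAP] for f in seq]

def speed_jitter(seq, rate):
    """Resample the sequence as if performed `rate` times faster (nearest frame)."""
    n_out = max(2, round(len(seq) / rate))
    return [seq[min(len(seq) - 1, round(i * rate))] for i in range(n_out)]

def joint_noise(seq, sigma, rng):
    """Add independent Gaussian noise (std sigma, same units as the pose) to every coordinate."""
    return [[tuple(c + rng.gauss(0.0, sigma) for c in joint) for joint in f] for f in seq]

# Toy clip: 10 frames, joint j sits at x = j so swaps are easy to see.
clip = [[(float(j), 1.0, 0.0) for j in range(17)] for _ in range(10)]
flipped = mirror(clip)
faster = speed_jitter(clip, rate=2.0)
noisy = joint_noise(clip, sigma=0.01, rng=random.Random(0))
```

When a clip of a bilateral test is mirrored, its `side` label has to be flipped as well, otherwise the augmented sample contradicts its label.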
| **Label schema to collect from the physio** (if you can get a bit more data): `clip_id, athlete_id, test_name, side, score(0-3), pain(bool), compensation_notes, camera_view`. Even 20-30 well-labeled clips help meaningfully. | |
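If the labels arrive as a CSV with those columns, a small loader can reject inconsistent rows early. The CSV format, the test-name slugs, and the sample rows are all assumptions made for illustration; the pain rule follows the definition above that a score of 0 means pain.

```python
import csv
import io
from dataclasses import dataclass

FMS_TESTS = {"deep_squat", "hurdle_step", "inline_lunge", "shoulder_mobility",
             "aslr", "trunk_stability_pushup", "rotary_stability"}
SIDES = {"left", "right", "n/a"}

@dataclass(frozen=True)
class ClipLabel:
    clip_id: str
    athlete_id: str
    test_name: str
    side: str
    score: int
    pain: bool
    compensation_notes: str
    camera_view: str

def parse_labels(csv_text):
    """Parse and validate physio labels from CSV text."""
    labels = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = int(row["score"])
        pain = row["pain"].strip().lower() == "true"
        if row["test_name"] not in FMS_TESTS or row["side"] not in SIDES:
            raise ValueError(f"{row['clip_id']}: unknown test or side")
        if score not in (0, 1, 2, 3):
            raise ValueError(f"{row['clip_id']}: score must be 0-3")
        if pain != (score == 0):
            raise ValueError(f"{row['clip_id']}: pain and a score of 0 must go together")
        labels.append(ClipLabel(row["clip_id"], row["athlete_id"], row["test_name"],
                                row["side"], score, pain,
                                row["compensation_notes"], row["camera_view"]))
    return labels

# Made-up sample rows.
sample = "\n".join([
    "clip_id,athlete_id,test_name,side,score,pain,compensation_notes,camera_view",
    'c001,a07,deep_squat,n/a,2,false,"heels lift, knees cave",front',
    "c002,a07,hurdle_step,left,3,false,,side",
])
labels = parse_labels(sample)
```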
| --- | |
| ## 10. Gradio Space & deployment | |
| **UI (targets Off-Brand badge):** | |
| - `gr.Video` upload (or webcam capture) + a test-type selector (auto-detect, with manual override). | |
| - Results panel: the 0-3 score as a large dial/patch, the composite 0-21, an asymmetry strip (L/R bars), and the **rationale text**. | |
| - The annotated overlay video: skeleton + the specific angle that decided the score drawn on the frame where it mattered. | |
| - A rubric drawer that shows the official 3/2/1 criteria for the detected test, with the met/unmet conditions checked off. | |
| - A persistent **"Screening aid, not a diagnosis. Pain or clearing tests require a clinician."** banner. | |
| - Custom CSS / `gr.Server` for a non-default look (scout/trail-map theme would rhyme with the hackathon, and with your design instincts). | |
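A bare-bones version of this UI can be sketched with `gr.Blocks`. Everything below is a starting point under assumptions: `screen_fn` stands in for the real pipeline and is expected to return a JudgeAgent-style dict, the layout is default Gradio rather than the custom theme, and Gradio is imported lazily so the formatting helper can be tested without it installed.

```python
BANNER = "**Screening aid, not a diagnosis. Pain or clearing tests require a clinician.**"
TESTS = ["Auto-detect", "Deep Squat", "Hurdle Step", "In-Line Lunge", "Shoulder Mobility",
         "Active Straight-Leg Raise", "Trunk Stability Push-Up", "Rotary Stability"]

def format_result(result):
    """Render a JudgeAgent-style dict as Markdown for the results panel."""
    lines = []
    if result.get("needs_human"):
        lines.append("**Low confidence: physio review recommended.**")
        lines.append("")
    score = result.get("score")
    lines.append("## Score: not auto-scored" if score is None else f"## Score: {score} / 3")
    lines.append("")
    lines.append(result.get("rationale", ""))
    return "\n".join(lines)

def build_app(screen_fn):
    """Build the Blocks app; screen_fn(video_path, test_choice) -> result dict."""
    import gradio as gr  # lazy import: only needed when the UI is actually built

    with gr.Blocks() as demo:
        gr.Markdown(BANNER)
        with gr.Row():
            video = gr.Video(label="FMS clip")
            test = gr.Dropdown(TESTS, value="Auto-detect", label="Test")
        run = gr.Button("Score")
        panel = gr.Markdown()
        run.click(lambda v, t: format_result(screen_fn(v, t)),
                  inputs=[video, test], outputs=panel)
    return demo

preview = format_result({"score": 2, "rationale": "Heels lift at depth.", "needs_human": False})
```

Calling `build_app(my_pipeline).launch()` would serve it, where `my_pipeline` is whatever function wraps the agents; the overlay video, dial, and rubric drawer would be added as further components.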
| **Compute:** | |
| - ZeroGPU (H200 slice) can host the ~18B portfolio; load pose/SAM/3D eagerly, the VLM + embedder via llama.cpp. | |
| - For **Off the Grid**, ensure zero external API calls; everything is served on-Space. | |
| - For **Llama Champion**, route the VLM + embedding through llama.cpp (GGUF builds exist for Qwen3-VL-8B-Instruct, Qwen3-VL-Embedding-8B, and Qwen3.6-27B). On a Space, watch the CUDA/llama-cpp build flags: recent hackathon Spaces hit `libcudart` issues; a CPU-only or pinned-CUDA build is the usual fix. | |
| - Persist the embedding index and accumulated labels in Space storage for the longitudinal baseline. | |
| --- | |
| ## 11. Clinical safety & ethics (bake this in, don't bolt it on) | |
| - **Not a medical device.** Screening aid only. No diagnosis, no injury prediction, no treatment advice beyond generic FMS-style correctives. | |
| - **Pain is out of scope** for automatic scoring β always defer to the physio. | |
| - **Human-in-the-loop by design:** low-confidence and disagreement cases are surfaced, not hidden. | |
| - **Consent & privacy:** athlete videos are biometric data. Get consent; don't log/persist clips beyond what the physio approves; document retention in the writeup. | |
| - **Honesty in the demo:** show a case the system gets right *and* one it flags as uncertain. Judges (and physios) trust calibrated tools more than confident ones. | |
| --- | |
| ## 12. Build plan: two weekends (June 5-15) | |
| **Weekend 1 (the spine works end to end):** | |
| - Day 1: Space scaffold, `gr.Video` in → skeleton overlay out (YOLO26-Pose). Ingest + Segmentation + Pose agents. | |
| - Day 2: SAM 3D Body integrated; BiomechanicsAgent computing Deep-Squat angles; first deterministic score on a real clip. | |
| - Goal: upload a squat video, get a rationalized 0-3. *This alone is a viable demo.* | |
| **Midweek:** wire the JudgeAgent (Qwen3-VL via llama.cpp), MovementClassifier, and the rubric module for all 7 tests. Attend the AMA to confirm the param-sum interpretation. | |
| **Weekend 2 (make it sing):** | |
| - ST-GCN pre-train + few-shot fine-tune on physio clips; publish to Hub. | |
| - RetrievalAgent + embedding index over labeled clips. | |
| - Custom UI polish, asymmetry view, PDF export, safety banners. | |
| - Record the demo video (physio uses it on a real player), write the social post, publish the agent trace and the blog post. | |
| --- | |
| ## 13. Risks & open questions | |
| - **Param-sum interpretation**: biggest unknown. The ~18B config is safe under either reading; confirm anyway. | |
| - **SAM 3D Body on a Space**: verify weights, license, and that it runs within ZeroGPU limits; have a 2D-only fallback (angles from 2D + camera-angle caveats) if it's too heavy. | |
| - **Single-camera angle limits** even with 3D: note it; recommend a consistent capture protocol (fixed camera position) for the physio, which also improves the longitudinal baseline. | |
| - **Tiny dataset**: the deterministic rubric must stand on its own so the demo doesn't hinge on the learned head generalizing from a few clips. | |
| - **llama.cpp + vision build** on Spaces: budget time for the CUDA build dance; CPU fallback for the embedder is fine. | |
| - **Movement misclassification**: if the wrong test is detected, scoring is meaningless; keep the manual override prominent. | |
| --- | |
| ## 14. Quick reference: the stack | |
| | Layer | Choice | Badge it helps | | |
| |---|---|---| | |
| | 2D pose | YOLO26-Pose | - | | |
| | Segmentation/track | SAM 3.1 | - | | |
| | 3D biomechanics | SAM 3D Body | - | | |
| | Learned scoring | ST-GCN (fine-tuned, published) | Well-Tuned | | |
| | Judge/explainer | Qwen3-VL-8B-Instruct (llama.cpp) | Llama Champion | | |
| | Retrieval | Qwen3-VL-Embedding-8B (llama.cpp) | Llama Champion | | |
| | Serving | On-Space, no cloud APIs | Off the Grid | | |
| | Frontend | Custom Gradio (scout theme) | Off-Brand | | |
| | Trace | Published agent run on Hub | Sharing is Caring | | |
| | Writeup | Blog post w/ honesty section | Field Notes | | |
| *Total ≈ 18B params. Honest, explainable, human-in-the-loop, runs on a laptop.* | |