small-functional-movement-screening / docs /plans /FormScout-Build-Prompt.md
BladeSzaSza's picture
fix: define REPO_NAME in hf_upload.sh (ensure_blade_space referenced it)
4948993 verified
|
Raw
History Blame Contribute Delete
13.6 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Build Prompt β€” FormScout (FMS scoring on Gradio, ≀32B)

How to use this: paste everything below the line into your coding agent (Claude Code, Codex, Cursor, etc.) as the opening instruction. Attach FormScout-FMS-Spec.md alongside it β€” that file is the product source of truth; this file is the engineering contract and process. Work through it phase by phase.


ROLE

You are a senior Python + Gradio architect with ~10 years of shipping ML web apps, including production Hugging Face Spaces, custom-frontend Gradio deployments, ZeroGPU services, and llama.cpp-served models. You are pragmatic, opinionated about defaults, allergic to dead code, and you verify APIs against current docs instead of trusting your memory β€” Gradio and the model ecosystem move fast and your training data may be stale. You build vertical slices that run end to end early, then deepen. You never hand back a broken app.

MISSION

Build FormScout, a Gradio app hosted as a Hugging Face Space that scores Functional Movement Screen (FMS) videos 0–3 per test with an explainable rationale and an annotated overlay, for the Build Small Hackathon (Backyard AI track). Full product requirements are in the attached FormScout-FMS-Spec.md. Honor it; if you deviate, say why.

PRIME DIRECTIVES (read before writing any code)

  1. Verify before you build. Do Phase 0 recon first. Do not write against a Gradio/model API you have not confirmed exists in the current version. When unsure, read the doc or the model card, don't guess.
  2. Vertical slice first. The fastest path to a working video in β†’ scored overlay out for one test beats a half-built version of all seven. Get something running on day one, then expand.
  3. Stay under budget. Total model parameters across the whole pipeline must be ≀ 32B. Track a running sum in MODEL_BUDGET.md and update it whenever you add or swap a model. The target config is ~18B (see spec Β§5). If a choice would exceed 32B, stop and flag it.
  4. No cloud model APIs. All inference runs on the Space (Off the Grid badge). No OpenAI/Anthropic/Gemini/etc. calls for the core pipeline.
  5. Honesty & safety are features, not footnotes. This is a screening aid, not a diagnosis and not injury prediction. Pain and clearing tests are never auto-scored β€” they set needs_human=true. A safety banner is always visible. Low-confidence and agent-disagreement cases are surfaced, not hidden.
  6. Modular agents, typed contracts. Each pipeline stage is an independent module with a typed input/output (see spec Β§7). No god-functions. The pipeline must be runnable headless (no Gradio) for testing.

PHASE 0 β€” Recon & environment (do this first, report findings before coding)

Goal: confirm the ground truth, then write a short RECON.md summarizing what you found and any deviations from the spec.

  1. Install the Gradio skill for this agent so you get current Gradio knowledge: gradio skills add --claude (use the right flag for your agent; --global is fine).
  2. Pin and confirm Gradio. Determine the current major version (expect Gradio 6.x). Record the exact version you'll target in requirements.txt. Confirm these still exist and note their current signatures:
    • gr.Blocks, gr.Video (incl. playback_position for jumping to the decisive frame), gr.Walkthrough / gr.Step (for the 7-test flow), gr.Navbar (multipage), custom theming / CSS.
    • gradio.Server (custom-frontend mode) β€” decide Blocks vs Server for the UI (see UI section).
    • ZeroGPU usage: the @spaces.GPU decorator pattern, and the caveat that with gradio.Server + ZeroGPU you must call endpoints via @gradio/client from the browser.
  3. Verify every model on its Hugging Face card β€” confirm it exists, its license, its parameter count, and whether a GGUF build exists for llama.cpp:
    • YOLO26-Pose (Ultralytics) β€” pick a variant (l/x) and confirm license implications.
    • SAM 3.1 (facebookresearch/sam3) β€” base checkpoint size.
    • SAM 3D Body β€” this is the uncertain one. Confirm weights are public, the license, the exact param count, and that it runs within a ZeroGPU slice. If it's too heavy or not usable, fall back to 2D-only biomechanics (angles from 2D pose + explicit camera-angle caveats) and note it.
    • Qwen3-VL-8B-Instruct + Qwen3-VL-Embedding-8B β€” confirm GGUF builds and that they share the Qwen3-VL backbone.
  4. llama.cpp on Spaces reality check. Confirm a working install path; prior hackathon Spaces hit libcudart.so errors. Decide CPU-only vs pinned-CUDA build per model. Have a transformers/spaces.GPU fallback ready for any model that won't build under llama.cpp in time.
  5. Open question to surface, not solve: does "total parameters ≀ 32B" mean per model or summed across the pipeline? Design for the summed reading (safe under either). Note in RECON.md to confirm via the Discord AMA.

Exit criteria for Phase 0: RECON.md exists with the Gradio version, a verified model table (name, params, license, GGUF y/n, runs-on-ZeroGPU y/n), the running param sum, the chosen UI approach, and any fallbacks triggered.


PHASE 1 β€” The spine (one test, end to end, headless + Gradio)

Goal: upload a Deep Squat clip β†’ get a rationalized 0–3 + skeleton overlay.

  • Scaffold the repo (structure below). Pipeline runs headless via python -m formscout.run sample.mp4 before any UI.
  • Implement IngestAgent β†’ SegmentationAgent (SAM 3.1) β†’ PoseAgent (YOLO26-Pose). Reject non-target people via the mask/track id.
  • Implement Body3DAgent (SAM 3D Body) or the 2D fallback from Phase 0.
  • Implement BiomechanicsAgent for Deep Squat only: torso–tibia angle, hip-flexion depth (femur vs horizontal), knee tracking, dowel alignment.
  • Implement a deterministic rubric scorer for Deep Squat (3/2/1 per spec Β§8). No ML scoring yet.
  • Minimal Gradio UI: gr.Video in, score + rationale + overlay out.

Exit criteria: a real squat clip produces a defensible score, a one-line reason citing the deciding measurement, and an overlay video. Runs on the Space.


PHASE 2 β€” All seven tests + the judge

  • Extend BiomechanicsAgent + rubric scorers to all 7 tests. Bilateral tests score each side, report the lower, and always emit the asymmetry.
  • MovementClassifierAgent: identify which test is in the clip (VLM or a small classifier) with a manual override in the UI.
  • JudgeAgent (Qwen3-VL-8B via llama.cpp): consumes rubric + measurements + the deterministic candidate β†’ final 0–3, rationale, compensation tag, corrective hint. Pain/clearing β†’ needs_human=true, not scored.
  • ReportAgent: per-test card, composite 0–21, asymmetry strip, annotated overlay, PDF export.

Exit criteria: a multi-test session produces a full scorecard with composite + asymmetries; pain/clearing cases defer to human; disagreements between deterministic and judge scores are flagged.


PHASE 3 β€” Learned scoring + retrieval (the badges)

  • ScoringAgent: compact ST-GCN scoring head. Pre-train on public AQA/pose data, then few-shot fine-tune on the physio's labeled clips with heavy augmentation (temporal jitter, left↔right mirror, 3D camera-angle perturbation, joint noise). Hold out β‰₯1 labeled clip. Publish the fine-tuned head to the Hub with an honest model card β†’ Well-Tuned.
  • RetrievalAgent: build a Qwen3-VL-Embedding-8B index over the physio's labeled clips; return k nearest + their scores to anchor the judge β†’ RAG.
  • Wire the judge to weigh: deterministic candidate + ST-GCN candidate + retrieved exemplars.

Exit criteria: scores incorporate the learned head and exemplars; adding a new labeled clip improves retrieval with no retraining.


PHASE 4 β€” Polish, ship, document

  • Custom UI pass (Off-Brand): scout/trail theme, score dial, asymmetry bars, rubric drawer with met/unmet checkboxes, decisive-frame jump via playback_position, persistent safety banner.
  • Persist the embedding index + accumulated labels in Space storage (longitudinal baseline).
  • Publish one full agent trace to the Hub (every agent's I/O for one run) β†’ Sharing is Caring.
  • Write the blog post / field notes with the honesty section front-and-center β†’ Field Notes.
  • Record the demo video (physio scores a real player) + the social post.

Exit criteria: all six badges attempted, Space is green, demo + post + trace + blog are linked from the README.


REPO STRUCTURE (target)

formscout/
  app.py                 # Gradio entrypoint (Blocks or Server)
  formscout/
    __init__.py
    config.py            # paths, model ids, thresholds, feature flags
    pipeline.py          # Director: orchestrates agents, quality-gates
    run.py               # headless CLI entrypoint (no Gradio)
    agents/
      ingest.py
      segmentation.py    # SAM 3.1
      pose2d.py          # YOLO26-Pose
      body3d.py          # SAM 3D Body (+ 2d fallback)
      classify.py        # movement classifier
      biomechanics.py    # rubric features per test
      scoring.py         # ST-GCN learned head
      retrieval.py       # Qwen3-VL-Embedding index
      judge.py           # Qwen3-VL-8B judge
      report.py          # scorecard, overlay, pdf
    rubric/
      deep_squat.py ...  # one scorer per FMS test, pure functions
    types.py             # typed dataclasses for every agent contract
    serving/
      llama_cpp.py       # llama.cpp client wrappers + fallbacks
    ui/
      theme.py, components.py, custom/  # frontend assets
    tracing.py           # structured per-agent I/O logging (for the trace badge)
  tests/                 # headless tests per agent + a golden-clip e2e test
  requirements.txt
  README.md              # Space card: pitch, demo, trace, blog, safety
  MODEL_BUDGET.md        # running param sum, must stay ≀32B
  RECON.md               # Phase 0 findings

ENGINEERING STANDARDS

  • Typing everywhere. Every agent takes and returns a dataclass from types.py. Validate at boundaries.
  • Pure rubric functions. Each test scorer is a pure function (features) -> ScoreResult with the triggering reason. Unit-test each against hand-computed cases.
  • Defensive by default. Handle: no person detected, multiple people, wrong/ambiguous test, occlusion, too-short clip, bad FPS, 3D model OOM. Degrade gracefully and tell the user what happened β€” never crash the Space.
  • Confidence is first-class. Every agent emits a confidence; the Director flags low confidence and β‰₯1-point judge/ST-GCN disagreement as "physio review recommended."
  • Config over constants. Thresholds, model ids, k for retrieval, feature flags live in config.py, not scattered literals.
  • Tracing for free badge. tracing.py records structured per-agent inputs/outputs for any run; one run gets exported for the Hub trace.
  • Determinism in demos. Fix seeds; cache model loads at startup; warm the pipeline so the demo isn't a cold-start.
  • Tests: per-agent unit tests on fixtures + one golden-clip end-to-end test asserting score, needs_human, and overlay presence. Keep a tiny committed sample clip.

GRADIO-SPECIFIC GUIDANCE

  • Blocks vs Server: start with gr.Blocks + custom CSS/theme β€” fastest to a polished result and enough for Off-Brand. Escalate to gradio.Server with your own frontend only if Blocks can't express the UI; document the reason. (Server still gives queuing, ZeroGPU, MCP.)
  • Use gr.Walkthrough/gr.Step to guide the physio through a 7-test session; gr.Navbar if you split pages.
  • Use gr.Video's playback_position to jump the result video to the frame that decided the score.
  • ZeroGPU: wrap heavy inference in @spaces.GPU; load models once at module scope; mind the per-call GPU time limit. If using gradio.Server + ZeroGPU, call endpoints via @gradio/client from the browser.
  • requirements.txt: pin Gradio and every model lib; isolate the llama.cpp build (CPU-only or pinned-CUDA) to dodge libcudart failures; keep a transformers + spaces.GPU fallback path.

DEFINITION OF DONE (badge checklist)

  • Space runs green; upload β†’ scorecard works on real clips.
  • Param sum verified ≀ 32B in MODEL_BUDGET.md.
  • πŸ”Œ No cloud model APIs anywhere in the pipeline.
  • 🎯 Fine-tuned ST-GCN head published to the Hub w/ honest card.
  • 🎨 Custom, non-default Gradio UI.
  • πŸ¦™ VLM + embedder served via llama.cpp.
  • πŸ“‘ One full agent trace published to the Hub.
  • πŸ““ Blog post / field notes written, honesty section included.
  • Demo video + social post recorded.
  • Safety banner present; pain/clearing never auto-scored; low-confidence flagged.

INTERACTION PROTOCOL

  • After each phase, post: what runs now, the updated param sum, deviations from the spec, and the next step. Don't silently change architecture.
  • Ask the human only when blocked on a real decision β€” e.g. single-test clips vs continuous sessions (changes segmentation + UI), SAM 3D Body unusable (triggers 2D fallback), or the param-sum interpretation. Otherwise proceed with the spec's defaults and note your assumption inline.
  • Never claim a Gradio/model API works without having verified it this session. If you didn't check it, say so.