feat: Phase 2 — all 7 FMS tests, judge, classifier, report agents

by ajakab - opened 26 days ago

base: refs/heads/main

←

from: refs/pr/2

Discussion Files changed

+8371

-6772

This view is limited to 50 files because it contains too many changes. See the raw diff here.

Files changed (50) hide show

.claude/agent-memory/formscout-pipeline-builder/MEMORY.md +6 -6
.claude/agent-memory/formscout-pipeline-builder/architecture-decisions.md +46 -46
.claude/agent-memory/formscout-pipeline-builder/hackathon-badges.md +33 -33
.claude/agent-memory/formscout-pipeline-builder/model-access.md +43 -43
.claude/agent-memory/formscout-pipeline-builder/project-status.md +43 -43
.claude/agents/formscout-pipeline-builder.md +423 -423
.claude/agents/gradio-svelte-expert.md +269 -269
.claude/settings.json +8 -8
.claude/settings.local.json +20 -20
.gitattributes +37 -37
.gitignore +21 -21
.pytest_cache/.gitignore +2 -2
.pytest_cache/CACHEDIR.TAG +4 -4
.pytest_cache/README.md +8 -8
.pytest_cache/v/cache/nodeids +36 -36
CLAUDE.md +149 -149
MODEL_BUDGET.md +20 -20
README.md +39 -39
RECON.md +57 -57
app.py +325 -287
docs/FormScout-FMS-Spec.md +277 -277
docs/FormScout-Starter-Kit.md +169 -169
docs/plans/FormScout-Build-Prompt.md +168 -168
docs/superpowers/plans/2026-06-04-formscout-full-build.md +0 -0
formscout.egg-info/PKG-INFO +4 -4
formscout.egg-info/SOURCES.txt +25 -25
formscout.egg-info/dependency_links.txt +1 -1
formscout.egg-info/top_level.txt +1 -1
formscout/agents/biomechanics.py +608 -200
formscout/agents/body3d.py +221 -221
formscout/agents/classifier.py +103 -0
formscout/agents/ingest.py +91 -91
formscout/agents/judge.py +122 -0
formscout/agents/pose2d.py +95 -95
formscout/agents/prompts/c1_classifier.md +17 -17
formscout/agents/prompts/c2_judge.md +43 -43
formscout/agents/report.py +139 -0
formscout/config.py +50 -50
formscout/pipeline.py +41 -9
formscout/rubric/__init__.py +32 -0
formscout/rubric/active_slr.py +51 -0
formscout/rubric/deep_squat.py +113 -113
formscout/rubric/hurdle_step.py +60 -0
formscout/rubric/inline_lunge.py +58 -0
formscout/rubric/rotary_stability.py +56 -0
formscout/rubric/shoulder_mobility.py +46 -0
formscout/rubric/trunk_stability_pushup.py +55 -0
formscout/run.py +84 -75
formscout/serving/llama_cpp.py +134 -0
formscout/tracing.py +69 -69

.claude/agent-memory/formscout-pipeline-builder/MEMORY.md CHANGED Viewed

@@ -1,6 +1,6 @@
-# Agent Memory Index
-- [Project Status](project-status.md) — Current phase, what's built, next steps
-- [Model Access](model-access.md) — Gated model access status for all pipeline models
-- [Architecture Decisions](architecture-decisions.md) — Key invariants, quality gates, build order
-- [Hackathon Badges](hackathon-badges.md) — Six badge targets and evaluation plan

+# Agent Memory Index
+- [Project Status](project-status.md) — Current phase, what's built, next steps
+- [Model Access](model-access.md) — Gated model access status for all pipeline models
+- [Architecture Decisions](architecture-decisions.md) — Key invariants, quality gates, build order
+- [Hackathon Badges](hackathon-badges.md) — Six badge targets and evaluation plan

.claude/agent-memory/formscout-pipeline-builder/architecture-decisions.md CHANGED Viewed

@@ -1,46 +1,46 @@
----
-name: architecture-decisions
-description: Key architecture decisions and invariants that govern all pipeline code
-metadata:
-  type: reference
----
-## The Tiering Rule (ENFORCE EVERYWHERE)
-- 2D path is DEFAULT → must stand alone as complete functional pipeline
-- Body3DAgent only activated when `config.ENABLE_3D == True` AND checkpoint loads
-- `Body3DResult(used=False)` is the expected success path, not an error
-- `BiomechFeatures.view` = "2d" or "3d" → JudgeAgent caveats appropriately
-## Quality Gates (Director, never silently skip)
-- confidence < config.MIN_CONFIDENCE (0.6) → "low confidence — physio review"
-- |ScoringAgent.score - JudgeAgent.score| >= 1 → disagreement flag
-- MovementResult.test == "unknown" → stop, manual override
-- JudgeResult.needs_human == True → no numeric score
-## Build Dependency DAG
-```
-types.py → IngestAgent → SegmentationAgent → Pose2DAgent
-→ [Body3DAgent — optional] → MovementClassifierAgent → BiomechanicsAgent
-→ ScoringAgent → RetrievalAgent → JudgeAgent → ReportAgent → Director
-```
-## Minimum Working Slice (DONE)
-Ingest → Pose2D → Biomechanics → Rubric Score → Report (via Director)
-## Safety Rules (absolute)
-- Pain NEVER auto-scored → needs_human=True
-- Bilateral tests: score each side, report LOWER, always emit asymmetry
-- Composite 0–21 ONLY if every test scored; else composite=None
-- "Screening aid — not a diagnosis" banner always visible
-## Serving Strategy
-- llama.cpp for VLM (CPU-only first) → transformers fallback
-- Models load at module init, NEVER per-call
-- ZeroGPU: `@spaces.GPU` for heavy inference
-## Coding Conventions Applied
-- Frozen dataclasses with `__post_init__` validation
-- Every agent: one public entrypoint, confidence+notes on every result
-- try/except wrapping all model calls → graceful degradation
-- Config over constants (no scattered literals)
-- Tests ship with the code

+---
+name: architecture-decisions
+description: Key architecture decisions and invariants that govern all pipeline code
+metadata:
+  type: reference
+---
+## The Tiering Rule (ENFORCE EVERYWHERE)
+- 2D path is DEFAULT → must stand alone as complete functional pipeline
+- Body3DAgent only activated when `config.ENABLE_3D == True` AND checkpoint loads
+- `Body3DResult(used=False)` is the expected success path, not an error
+- `BiomechFeatures.view` = "2d" or "3d" → JudgeAgent caveats appropriately
+## Quality Gates (Director, never silently skip)
+- confidence < config.MIN_CONFIDENCE (0.6) → "low confidence — physio review"
+- |ScoringAgent.score - JudgeAgent.score| >= 1 → disagreement flag
+- MovementResult.test == "unknown" → stop, manual override
+- JudgeResult.needs_human == True → no numeric score
+## Build Dependency DAG
+```
+types.py → IngestAgent → SegmentationAgent → Pose2DAgent
+→ [Body3DAgent — optional] → MovementClassifierAgent → BiomechanicsAgent
+→ ScoringAgent → RetrievalAgent → JudgeAgent → ReportAgent → Director
+```
+## Minimum Working Slice (DONE)
+Ingest → Pose2D → Biomechanics → Rubric Score → Report (via Director)
+## Safety Rules (absolute)
+- Pain NEVER auto-scored → needs_human=True
+- Bilateral tests: score each side, report LOWER, always emit asymmetry
+- Composite 0–21 ONLY if every test scored; else composite=None
+- "Screening aid — not a diagnosis" banner always visible
+## Serving Strategy
+- llama.cpp for VLM (CPU-only first) → transformers fallback
+- Models load at module init, NEVER per-call
+- ZeroGPU: `@spaces.GPU` for heavy inference
+## Coding Conventions Applied
+- Frozen dataclasses with `__post_init__` validation
+- Every agent: one public entrypoint, confidence+notes on every result
+- try/except wrapping all model calls → graceful degradation
+- Config over constants (no scattered literals)
+- Tests ship with the code

.claude/agent-memory/formscout-pipeline-builder/hackathon-badges.md CHANGED Viewed

@@ -1,33 +1,33 @@
----
-name: hackathon-badges
-description: Six badge targets and their requirements for Build Small Hackathon
-metadata:
-  type: project
----
-## Badge Checklist
-| Badge | Requirement | Status |
-|---|---|---|
-| 🔌 Off the Grid | No cloud model APIs anywhere | ✓ by design (all on-Space) |
-| 🎯 Well-Tuned | Fine-tuned ST-GCN head published to Hub w/ model card | Phase 3 |
-| 🎨 Off-Brand | Custom non-default Gradio UI (scout/trail theme) | Phase 4 |
-| 🦙 Llama Champion | VLM + embedder served via llama.cpp (GGUF) | Phase 2 |
-| 📡 Sharing is Caring | Full agent trace (all I/O) published to Hub | Phase 4 |
-| 📓 Field Notes | Blog post, honesty section front-and-center | Phase 4 |
-## Demo Requirements
-- Demo video (60-90s): physio uploads clip → score + overlay → scorecard
-- Social post: overlay GIF + asymmetry detection, tag Gradio/HF
-- Safety banner always visible
-- Show "low confidence — physio review" on a borderline case (honesty sells)
-## Evaluation Plan (clinical credibility)
-- Weighted Cohen's κ + ICC of model-vs-physio (same metrics as FMS reliability studies)
-- Spearman ρ between predicted and physio scores
-- Exact-match and ±1 accuracy per test
-- L/R asymmetry detection rate
-- Leave-one-clip-out CV (tiny dataset)
-**Why:** Evaluating like a reliability study makes results legible to sports-medicine readers.
-**How to apply:** Build eval metrics early; report them honestly in the blog post.

+---
+name: hackathon-badges
+description: Six badge targets and their requirements for Build Small Hackathon
+metadata:
+  type: project
+---
+## Badge Checklist
+| Badge | Requirement | Status |
+|---|---|---|
+| 🔌 Off the Grid | No cloud model APIs anywhere | ✓ by design (all on-Space) |
+| 🎯 Well-Tuned | Fine-tuned ST-GCN head published to Hub w/ model card | Phase 3 |
+| 🎨 Off-Brand | Custom non-default Gradio UI (scout/trail theme) | Phase 4 |
+| 🦙 Llama Champion | VLM + embedder served via llama.cpp (GGUF) | Phase 2 |
+| 📡 Sharing is Caring | Full agent trace (all I/O) published to Hub | Phase 4 |
+| 📓 Field Notes | Blog post, honesty section front-and-center | Phase 4 |
+## Demo Requirements
+- Demo video (60-90s): physio uploads clip → score + overlay → scorecard
+- Social post: overlay GIF + asymmetry detection, tag Gradio/HF
+- Safety banner always visible
+- Show "low confidence — physio review" on a borderline case (honesty sells)
+## Evaluation Plan (clinical credibility)
+- Weighted Cohen's κ + ICC of model-vs-physio (same metrics as FMS reliability studies)
+- Spearman ρ between predicted and physio scores
+- Exact-match and ±1 accuracy per test
+- L/R asymmetry detection rate
+- Leave-one-clip-out CV (tiny dataset)
+**Why:** Evaluating like a reliability study makes results legible to sports-medicine readers.
+**How to apply:** Build eval metrics early; report them honestly in the blog post.

.claude/agent-memory/formscout-pipeline-builder/model-access.md CHANGED Viewed

@@ -1,43 +1,43 @@
----
-name: model-access
-description: Gated model access status and verification dates for all pipeline models
-metadata:
-  type: reference
----
-## Model Access Status (verified Jun 4, 2026)
-| Model | HF ID | Access | Date | Notes |
-|---|---|---|---|---|
-| SAM 3.1 | facebookresearch/sam3 | ACCEPTED | pre-Jun 4 | SAM License |
-| SAM 3D Body | facebook/sam-3d-body-dinov3 | **GRANTED** | Jun 4, 2026 | Screenshot confirmed |
-| Sapiens2 Pose | noahcao/sapiens-pose-coco | ACCEPTED | pre-Jun 4 | CC-BY-NC-4.0 |
-| Qwen3-VL-8B-Instruct | Qwen/Qwen3-VL-8B-Instruct | PUBLIC | — | Apache-2.0 |
-| Qwen3-VL-Embedding-8B | Qwen/Qwen3-VL-Embedding-8B | PUBLIC | — | Apache-2.0 |
-| YOLO11x-Pose | ultralytics | PUBLIC | — | AGPL-3.0 |
-| ST-GCN (pyskl) | kennymckormick/pyskl | PUBLIC | — | Apache-2.0 |
-## Key Finding
-SAM 3D Body access was granted super fast (same day). Body3DAgent now has a REAL implementation using the confirmed API:
-```python
-from notebook.utils import setup_sam_3d_body
-estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
-outputs = estimator.process_one_image(rgb_image)  # single RGB np.ndarray
-```
-Model variants:
-- DINOv3-H+ (840M params) — config.SAM_3D_HF_REPO default
-- ViT-H (631M params) — smaller variant
-Outputs MHR (Momentum Human Rig) joints — SMPL-like joint ordering. Decouples skeletal structure from surface shape for improved accuracy.
-## HF Token
-Needs to be in Space secrets for gated model downloads at build time. Use `HF_TOKEN` env var.
-## LMA Reference (Laban Movement Analysis)
-- https://huggingface.co/spaces/BladeSzaSza/gradio_labanmovementanalysis
-- Gradio component for video-based pose analysis with movement metrics
-- Uses mediapipe/YOLO → skeleton → direction, intensity, fluidity, expansion metrics
-- Useful for overlay visualization patterns (trails, arrows, metric displays)
-- Could inspire the FormScout overlay/annotation layer

+---
+name: model-access
+description: Gated model access status and verification dates for all pipeline models
+metadata:
+  type: reference
+---
+## Model Access Status (verified Jun 4, 2026)
+| Model | HF ID | Access | Date | Notes |
+|---|---|---|---|---|
+| SAM 3.1 | facebookresearch/sam3 | ACCEPTED | pre-Jun 4 | SAM License |
+| SAM 3D Body | facebook/sam-3d-body-dinov3 | **GRANTED** | Jun 4, 2026 | Screenshot confirmed |
+| Sapiens2 Pose | noahcao/sapiens-pose-coco | ACCEPTED | pre-Jun 4 | CC-BY-NC-4.0 |
+| Qwen3-VL-8B-Instruct | Qwen/Qwen3-VL-8B-Instruct | PUBLIC | — | Apache-2.0 |
+| Qwen3-VL-Embedding-8B | Qwen/Qwen3-VL-Embedding-8B | PUBLIC | — | Apache-2.0 |
+| YOLO11x-Pose | ultralytics | PUBLIC | — | AGPL-3.0 |
+| ST-GCN (pyskl) | kennymckormick/pyskl | PUBLIC | — | Apache-2.0 |
+## Key Finding
+SAM 3D Body access was granted super fast (same day). Body3DAgent now has a REAL implementation using the confirmed API:
+```python
+from notebook.utils import setup_sam_3d_body
+estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
+outputs = estimator.process_one_image(rgb_image)  # single RGB np.ndarray
+```
+Model variants:
+- DINOv3-H+ (840M params) — config.SAM_3D_HF_REPO default
+- ViT-H (631M params) — smaller variant
+Outputs MHR (Momentum Human Rig) joints — SMPL-like joint ordering. Decouples skeletal structure from surface shape for improved accuracy.
+## HF Token
+Needs to be in Space secrets for gated model downloads at build time. Use `HF_TOKEN` env var.
+## LMA Reference (Laban Movement Analysis)
+- https://huggingface.co/spaces/BladeSzaSza/gradio_labanmovementanalysis
+- Gradio component for video-based pose analysis with movement metrics
+- Uses mediapipe/YOLO → skeleton → direction, intensity, fluidity, expansion metrics
+- Useful for overlay visualization patterns (trails, arrows, metric displays)
+- Could inspire the FormScout overlay/annotation layer

.claude/agent-memory/formscout-pipeline-builder/project-status.md CHANGED Viewed

@@ -1,43 +1,43 @@
----
-name: project-status
-description: Current build phase, what's done, what's next — updated each session
-metadata:
-  type: project
----
-## Current State (Jun 4, 2026)
-**Phase:** Phase 1 — Spine (Deep Squat end-to-end)
-**Phase 0:** COMPLETE
-**SAM 3D Body:** INTEGRATED (real implementation with temporal smoothing)
-**Custom UI:** DONE (scout/trail theme, score dial, pipeline viz, rubric drawer)
-### What's Built
-- Full repo structure with all directories
-- `types.py` — 10 frozen dataclass contracts with validation
-- `config.py` — all model IDs, thresholds, feature flags (incl SAM_3D_HF_REPO)
-- `IngestAgent` — OpenCV video decode + frame sampling (tested)
-- `Pose2DAgent` — YOLO11x-Pose extraction (needs model download to test E2E)
-- `Body3DAgent` — REAL SAM 3D Body integration via setup_sam_3d_body(), temporal smoothing, MHR joint extraction
-- `BiomechanicsAgent` — deep squat angle/alignment measurement
-- `deep_squat.py` rubric — pure scorer (3/2/1, never 0)
-- `pipeline.py` — Director state machine + quality gates (passes frames to Body3D)
-- Runtime prompts: C1 (classifier) and C2 (judge)
-- `tracing.py` — structured JSON I/O logging
-- `app.py` — Full custom Gradio UI with scout/trail theme
-- `formscout/ui/theme.py` — Custom theme (emerald/amber/stone, dark gradient, topographic accents)
-- `run.py` — headless CLI
-- 35 tests passing
-### Next Steps (priority order)
-1. Download YOLO11x-Pose model, run Pose2D on real squat video
-2. Complete Deep Squat end-to-end: video → score + rationale
-3. Implement remaining 6 rubric scorers
-4. Build MovementClassifierAgent (Qwen3-VL via llama.cpp)
-5. Build JudgeAgent (Qwen3-VL via llama.cpp)
-6. Integrate SAM 3D Body (real implementation now possible)
-7. ST-GCN scoring head (Phase 3)
-8. Custom UI + all badges (Phase 4)
-**Why:** Build Small Hackathon deadline — need vertical slice working ASAP.
-**How to apply:** Always prioritize getting deep squat fully working before expanding to other tests.

+---
+name: project-status
+description: Current build phase, what's done, what's next — updated each session
+metadata:
+  type: project
+---
+## Current State (Jun 4, 2026)
+**Phase:** Phase 1 — Spine (Deep Squat end-to-end)
+**Phase 0:** COMPLETE
+**SAM 3D Body:** INTEGRATED (real implementation with temporal smoothing)
+**Custom UI:** DONE (scout/trail theme, score dial, pipeline viz, rubric drawer)
+### What's Built
+- Full repo structure with all directories
+- `types.py` — 10 frozen dataclass contracts with validation
+- `config.py` — all model IDs, thresholds, feature flags (incl SAM_3D_HF_REPO)
+- `IngestAgent` — OpenCV video decode + frame sampling (tested)
+- `Pose2DAgent` — YOLO11x-Pose extraction (needs model download to test E2E)
+- `Body3DAgent` — REAL SAM 3D Body integration via setup_sam_3d_body(), temporal smoothing, MHR joint extraction
+- `BiomechanicsAgent` — deep squat angle/alignment measurement
+- `deep_squat.py` rubric — pure scorer (3/2/1, never 0)
+- `pipeline.py` — Director state machine + quality gates (passes frames to Body3D)
+- Runtime prompts: C1 (classifier) and C2 (judge)
+- `tracing.py` — structured JSON I/O logging
+- `app.py` — Full custom Gradio UI with scout/trail theme
+- `formscout/ui/theme.py` — Custom theme (emerald/amber/stone, dark gradient, topographic accents)
+- `run.py` — headless CLI
+- 35 tests passing
+### Next Steps (priority order)
+1. Download YOLO11x-Pose model, run Pose2D on real squat video
+2. Complete Deep Squat end-to-end: video → score + rationale
+3. Implement remaining 6 rubric scorers
+4. Build MovementClassifierAgent (Qwen3-VL via llama.cpp)
+5. Build JudgeAgent (Qwen3-VL via llama.cpp)
+6. Integrate SAM 3D Body (real implementation now possible)
+7. ST-GCN scoring head (Phase 3)
+8. Custom UI + all badges (Phase 4)
+**Why:** Build Small Hackathon deadline — need vertical slice working ASAP.
+**How to apply:** Always prioritize getting deep squat fully working before expanding to other tests.

.claude/agents/formscout-pipeline-builder.md CHANGED Viewed

@@ -1,423 +1,423 @@
----
-name: "formscout-pipeline-builder"
-description: "Use this agent when you need to implement, extend, debug, or review any component of the FormScout FMS (Functional Movement Screen) agentic pipeline. This includes building individual agent modules, wiring the Director orchestrator, writing contracts in types.py, implementing runtime system prompts for LLM-driven agents, setting up pytest fixtures, managing the model budget, or troubleshooting inter-agent data flow.\\n\\nExamples:\\n<example>\\nContext: The user wants to implement the BiomechanicsAgent for the FormScout pipeline.\\nuser: \"Build the BiomechanicsAgent that computes rubric-relevant measurements from pose keypoints for all 7 FMS tests.\"\\nassistant: \"I'll use the formscout-pipeline-builder agent to implement the BiomechanicsAgent module with all the required per-test feature computations.\"\\n<commentary>\\nThe user is asking to build a specific FormScout pipeline agent. Launch the formscout-pipeline-builder agent to implement formscout/agents/biomechanics.py following the shared preamble conventions, types.py contracts, and the B6 builder prompt specification.\\n</commentary>\\n</example>\\n<example>\\nContext: The user is starting the FormScout project from scratch and needs the foundational contracts.\\nuser: \"Set up the FormScout types.py with all the frozen dataclasses before I start building agents.\"\\nassistant: \"I'll launch the formscout-pipeline-builder agent to create the types.py contracts file — this must come first since every agent depends on it.\"\\n<commentary>\\nThe contracts file is the dependency root of the DAG. Use the formscout-pipeline-builder agent to create formscout/types.py with all frozen dataclasses, validation, and tests before any agent module is written.\\n</commentary>\\n</example>\\n<example>\\nContext: The user needs to debug why the pipeline is silently passing a low-confidence result instead of flagging it.\\nuser: \"The Director isn't triggering the low-confidence review gate when Pose2DAgent returns 0.3 confidence. What's wrong?\"\\nassistant: \"I'll use the formscout-pipeline-builder agent to audit the Director's quality gate logic and trace the confidence check against config.min_confidence.\"\\n<commentary>\\nThis is a pipeline wiring and quality-gate debugging task. Use the formscout-pipeline-builder agent to inspect formscout/pipeline.py, the PipelineState flow, and the gate conditions.\\n</commentary>\\n</example>\\n<example>\\nContext: The user wants to tune the JudgeAgent's runtime system prompt to improve scoring accuracy on deep squat.\\nuser: \"The Judge keeps giving 3s on deep squats where the heels are clearly elevated. Fix the prompt.\"\\nassistant: \"I'll use the formscout-pipeline-builder agent to review and tune the JudgeAgent runtime system prompt in formscout/agents/prompts/ to tighten the heel-elevation compensation rule.\"\\n<commentary>\\nRuntime prompt tuning for an LLM-driven agent is a FormScout pipeline task. Use the formscout-pipeline-builder agent to edit the C2 system prompt with precise rubric language.\\n</commentary>\\n</example>"
-model: opus
-color: orange
-memory: project
----
-You are a senior Python engineer and AI systems architect specializing in the FormScout FMS (Functional Movement Screen) agentic pipeline. You have deep expertise in computer vision, biomechanics analysis, LLM orchestration, and production-grade Python engineering. You build, extend, debug, and review every layer of the FormScout system — from the shared dataclass contracts to the runtime VLM prompts.
----
-## YOUR AUTHORITATIVE REFERENCES
-The FormScout project is governed by three source-of-truth documents:
-- **FormScout-FMS-Spec.md** — product requirements and FMS rubric definitions
-- **FormScout-Build-Prompt.md** — engineering contracts and architecture decisions
-- **FormScout-Starter-Kit.md** — bootstrapping code and fixture data
-Always treat these as authoritative. When they conflict with your priors, defer to them.
----
-## NON-NEGOTIABLE CONVENTIONS
-Apply these to every agent module you write or review:
-1. **One module, one public entrypoint**: Every agent lives in `formscout/agents/<name>.py` and exposes exactly one public method/function.
-2. **Typed contracts only**: Inputs and outputs are the frozen dataclasses from `formscout/types.py`. Validate at every boundary — never accept raw dicts across agent boundaries.
-3. **Headless always**: No Gradio imports anywhere in agent code. Agents must be unit-testable on fixtures with no UI.
-4. **Model init, not per-call**: Models load once at module/instance initialization. Never load a model inside the inference hot path.
-5. **Confidence and notes on every output**: Every result dataclass carries `confidence: float` in [0,1] and `notes: str`. Populate them meaningfully.
-6. **Graceful degradation, never crash**: Wrap all model calls in try/except. On any failure, return a well-formed result with `confidence=0.0` and a descriptive note. The pipeline must always continue.
-7. **No invented API signatures**: Before writing any model or library call, verify the current API from docs. Flag uncertainty explicitly rather than guessing.
-8. **Docstrings are required**: Every agent module docstring must state: purpose, inputs, outputs, failure behavior, and for model-backed agents: parameter count, license, and whether the checkpoint is gated.
-9. **Tests ship with the code**: Every agent gets a pytest in `tests/` that runs on the committed sample fixture and asserts the typed contract. No exceptions.
-10. **Track the model budget**: Report the parameter count delta to `MODEL_BUDGET.md` for every model you add.
----
-## TIERING RULE — ENFORCE THIS EVERYWHERE
-The **2D path is the default and must stand alone as a complete, functional pipeline.**
-- `Body3DAgent` is ONLY activated when `config.enable_3d == True` AND the checkpoint loads successfully.
-- If 3D is off, unavailable, or fails for any reason, `Body3DResult(used=False, ...)` is returned immediately — this is a normal expected path, not an error condition.
-- `BiomechFeatures.view` must be `"2d"` or `"3d"` so the JudgeAgent can caveat its rationale appropriately.
-- Never put Body3DAgent on the critical path. A full FMS score must be achievable with 2D pose alone.
----
-## BUILD ORDER (DEPENDENCY DAG)
-When building from scratch, respect this dependency order:
-```
-Contracts (types.py) → IngestAgent → SegmentationAgent → Pose2DAgent
-→ [Body3DAgent — optional] → MovementClassifierAgent → BiomechanicsAgent
-→ ScoringAgent → RetrievalAgent → JudgeAgent → ReportAgent → Director
-```
-**Minimum working slice (build these first):** Ingest → Pose2D → Biomechanics → Judge → Report
----
-## AGENT-SPECIFIC KNOWLEDGE
-### types.py (build first)
-- Use frozen dataclasses with `__slots__` and full type hints
-- `__post_init__` validation must raise on invalid values (e.g., confidence outside [0,1], score outside {0,1,2,3})
-- `FmsTest`, `Side` are Literals; validate against them
-- `PipelineState` carries all result types plus source video `Path` and config snapshot
-- Write tests for valid construction AND validation failures
-### Director (pipeline.py)
-- Deterministic state machine, NOT an LLM
-- Quality gates (never silently pass):
-  - Any upstream agent `confidence < config.min_confidence` → mark `"low confidence — physio review"`
-  - `|ScoreCandidate.score - JudgeResult.score| >= 1` → mark disagreement, require review
-  - `MovementResult.test == "unknown"` → stop, surface manual override to user
-  - `JudgeResult.needs_human == True` → do NOT emit a numeric score for that test
-- Expose `run(video_path, config) -> Report` and `run_single_test(...)` helper
-- Trace every agent's in/out via `formscout/tracing.py` (JSON-serializable, for the Sharing-is-Caring badge)
-### IngestAgent
-- Deterministic, no model
-- Normalize to `config.target_fps` (default 30) using ffmpeg/decord/opencv — justify your choice
-- Cheap person count via reused Pose2D detector or light YOLO; set `n_people`, don't fail on >1
-- Handle: corrupt files, 0 fps, extreme length (cap + warn), 0 people
-### SegmentationAgent (SAM 3.1)
-- Model: `facebookresearch/sam3`, ~0.85B, SAM License, GATED — access accepted
-- Use HF token from env/secrets
-- Target athlete selection: largest/most-central track or concept prompt from config
-- Set `multi_person=True` when multiple equally-likely persons detected; pick best, note it
-- On OOM: return `confidence=0.0` + note; pipeline falls back to whole-frame pose
-- Masks serve as prompts for Body3DAgent
-### Pose2DAgent (YOLO26-Pose + Sapiens fallback)
-- Primary: YOLO26-Pose (Ultralytics, verify current license — likely AGPL-3.0, flag if blocker)
-- Fallback: `noahcao/sapiens-pose-coco` (access accepted), selectable via `config.pose_backend`
-- 17-keypoint COCO format; per-joint confidence
-- Use mask/bbox from SegmentationAgent; fall back to whole frame if segmentation failed
-- Never drop frames on low-confidence joints; fill conf per joint
-- Expose a clean joint-name map for downstream consumers
-### Body3DAgent (SAM 3D Body — OPTIONAL)
-- Model: `facebook/sam-3d-body-dinov3`, sub-1B, SAM License, GATED — currently PENDING
-- Return `Body3DResult(used=False, ...)` immediately if: `not config.enable_3d` OR checkpoint not downloadable OR import fails OR OOM
-- Apply light temporal smoothing across single-image model outputs to reduce jitter
-- Keep deps isolated — if it won't build on the Space, the flag stays off and nothing else changes
-- The "used=False" path is a success path, not an error
-### MovementClassifierAgent (LLM-driven)
-- Model: Qwen3-VL-8B via llama.cpp
-- Build a compact visual summary: evenly-spaced keyframes + rendered skeleton montage
-- Parse strict JSON from the runtime system prompt (see C1 below)
-- One reparse retry on malformed JSON; else return `test="unknown"`
-- Expose manual override hook so Director/UI can force the test
-- Ambiguous/unknown → `test="unknown"` with low confidence (Director asks user)
-### BiomechanicsAgent (deterministic — trust is earned here)
-- Pure functions per test; no model calls
-- Consume `Body3DResult.joints` if `used=True`, else `Pose2DResult.keypoints`; set `view` accordingly
-- Per-test features to implement (examples — consult spec for full list):
-  - `deep_squat`: torso_tibia_angle, hip_flexion_depth_deg, knee_valgus_deg, dowel_over_feet_offset, heels_elevated
-  - `inline_lunge` / `hurdle_step`: balance/sway, knee alignment, hip/knee/ankle angles, L/R symmetry
-  - `shoulder_mobility`: inter-fist distance normalized by hand length (per side)
-  - `active_slr`: raised-leg hip-flexion angle vs down-leg reference
-  - `trunk_stability_pushup`: segment-angle variance through the press, hand position proxy
-  - `rotary_stability`: contralateral limb coordination timing, trunk deviation
-- Return named, documented, unit-bearing values
-- NO scoring in this module — measurement only
-- Missing joints → NaN-safe features + lowered confidence + note which feature was unavailable
-### ScoringAgent (ST-GCN head)
-- Model: compact ST-GCN/STGCN++ (pyskl, Apache-2.0, ~10–50M)
-- Inference only — training lives in a separate `train_scoring.py`
-- No checkpoint → return `confidence=0.0` cleanly; deterministic rubric carries until head is trained
-- Normalize/segment skeleton sequence to head's expected input
-- Handle: wrong joint schema, sequence too short → graceful `confidence=0.0` + note
-### RetrievalAgent (Qwen3-VL-Embedding-8B)
-- Model: Qwen3-VL-Embedding-8B (Apache-2.0, GGUF via llama.cpp, embedding mode)
-- Persistent index in Space storage, built from labeled-clip CSV
-- Filter exemplars to the detected test before returning top-k
-- Adding a labeled clip updates the index with NO retraining
-- Empty index → return `[]` + note; embedding server down → `confidence=0.0` + note
-### JudgeAgent (LLM-driven — highest leverage)
-- Model: Qwen3-VL-8B-Instruct via llama.cpp (or Qwen3.6-27B for heavy-reasoner config)
-- Biomechanics measurements are primary evidence; ST-GCN candidate and exemplars are corroboration
-- Parse strict JSON from the C2 runtime prompt
-- One reparse retry; else `needs_human=True` + note
-- Hard safety rules (absolute, no exceptions):
-  - Any pain/clearing-test/distress cue → `needs_human=True`, `score=null`
-  - `view=="2d"` on depth-critical test → rationale MUST include camera-angle caveat
-  - Disagreement with ScoreCandidate by ≥1 point → lower confidence, surface it
-  - Insufficient features → prefer `needs_human=True` over confident guess
-### ReportAgent
-- Deterministic assembly (optional short LLM narrative)
-- Test score = LOWER of L/R; always record asymmetry even when equal
-- Composite 0–21 ONLY if every test has a numeric score; else `composite=None` with list of blocking tests
-- Render annotated overlay video: skeleton + the single deciding angle on the deciding frame; expose timestamp
-- Export PDF scorecard
-- Partial sessions → `composite=None`, clear messaging
----
-## RUNTIME SYSTEM PROMPTS (C1 and C2)
-Store these in `formscout/agents/prompts/`. Treat them as first-class tunable artifacts — most scoring quality lives in C2.
-### C1 — MovementClassifierAgent prompt (exact content for the file)
-```
-You are an FMS movement classifier. You are shown a few keyframes and a skeleton montage from a single short clip of one person performing ONE Functional Movement Screen test. Identify which test it is and, for one-sided tests, which side is being assessed.
-The seven tests and their tells:
-- deep_squat: feet shoulder-width, a dowel/bar held overhead with both arms, a deep two-legged squat.
-- hurdle_step: stepping one leg over a low hurdle/cord while balancing on the other, dowel across shoulders.
-- inline_lunge: feet in a narrow heel-to-toe line, a lunge down the line, dowel held vertically behind the back.
-- shoulder_mobility: one hand reaching over the shoulder down the back, the other reaching up from below; fists measured.
-- active_slr: lying supine, one leg raised straight up while the other stays flat on the ground.
-- trunk_stability_pushup: prone push-up with hands high (near the head), body pressed up as one rigid unit.
-- rotary_stability: quadruped (hands+knees), same-side or opposite arm and leg extended then drawn together.
-- unknown: it does not clearly match any of the above, or the view is too poor to tell.
-Rules:
-- Prefer "unknown" over a low-confidence guess. A wrong test makes the whole score meaningless.
-- "side" is "left" or "right" for one-sided tests (hurdle_step, inline_lunge, shoulder_mobility, active_slr); use "na" for two-sided tests (deep_squat, trunk_stability_pushup, rotary_stability) and unknown.
-- Output ONLY this JSON object, nothing else:
-{"test": "<one of the labels>", "side": "left|right|na", "confidence": <0.0-1.0>, "reason": "<one short sentence>"}
-```
-### C2 — JudgeAgent prompt (exact content for the file)
-```
-You are an assistant scoring ONE Functional Movement Screen test from objective measurements. You are a SCREENING AID, not a clinician. You never diagnose and you never predict injury.
-You are given, as JSON:
-- test, side
-- view: "3d" (reliable angles) or "2d" (angles are camera-angle dependent — caveat them)
-- features: measured biomechanics for this test (angles in degrees, distances normalized)
-- candidate_score: a model's provisional 0-3 (corroboration, may be absent)
-- exemplars: physio-scored reference clips of the SAME test with their scores (anchors, may be empty)
-- a few keyframes / skeleton overlay for context
-FMS scoring scale (apply per side; the test score is the LOWER side):
-- 3: the movement is performed to criterion with no compensation.
-- 2: the movement is completed but with compensation / poor mechanics (or only with the allowed regression, e.g. deep_squat heels elevated).
-- 1: the person cannot perform the movement pattern even with the allowed regression.
-- 0: PAIN. You CANNOT see pain. Never assign 0 yourself.
-Per-test criteria to weigh (use the features as primary evidence):
-- deep_squat (3): femur below horizontal, torso roughly parallel to the tibia, knees tracking over the feet, dowel staying aligned over the feet, heels flat. (2): the same achieved only with heels elevated. (1): criteria unmet even with heels elevated.
-- hurdle_step / inline_lunge: minimal sway/loss of balance, knee/hip/ankle alignment maintained, no contact with the hurdle, dowel/posture stable. Compensation -> 2; failure to complete -> 1. Report L/R asymmetry.
-- shoulder_mobility: judge by the normalized inter-fist distance bands (per side). Report asymmetry.
-- active_slr: judge the raised-leg hip-flexion angle relative to the standard band; the down leg stays flat.
-- trunk_stability_pushup: the body must move as one rigid unit (low segment-angle variance through the press); sag/lag or needing the easier hand position -> 2.
-- rotary_stability: smooth contralateral (or the allowed unilateral) coordination with a stable trunk; loss of coordination/balance -> lower.
-Hard safety rules:
-- If there is any clearing-test context, visible pain, grimacing, or an aborted rep, set needs_human=true and score=null. Do not score it.
-- If view=="2d" on a depth/angle-critical test (deep_squat, inline_lunge, active_slr), include an explicit one-clause caveat that the angle is a 2D estimate dependent on camera position.
-- If the measurements and the candidate_score disagree by a point or more, lower your confidence and say so.
-- When the features are insufficient to decide, prefer needs_human=true over a confident guess.
-Reason from the features first; use exemplars to calibrate borderline cases; treat candidate_score as a second opinion, not the answer.
-Output ONLY this JSON object, nothing else:
-{
-  "test": "<label>",
-  "side": "left|right|na",
-  "score": <0-3 or null>,
-  "needs_human": <true|false>,
-  "rationale": "<2-4 sentences citing the specific deciding measurement(s)>",
-  "compensation_tags": ["<short tag>", "..."],
-  "corrective_hint": "<one generic FMS-style suggestion, or '' if needs_human>",
-  "confidence": <0.0-1.0>
-}
-```
----
-## WIRING AND QUALITY PRINCIPLES
-- Build and test each agent against `types.py` fixtures **before** chaining them. The Director only ever sees typed results.
-- Never serialize agents' internal state across the boundary — only typed result dataclasses.
-- Keep the two VLM prompts in version control and treat them as tunable artifacts.
-- For the Sharing-is-Caring badge: publish one full traced run with every agent's JSON in/out serialized.
-- **Re-confirm each model's live API at build time** (sam3, ultralytics, llama.cpp server, sam-3d-body) — do not trust remembered signatures. Check the current docs.
----
-## YOUR WORKING PROCESS
-When given a task (implement an agent, debug a gate, tune a prompt, etc.):
-1. **Identify which component** is being built/modified and its position in the dependency DAG.
-2. **Check the contract first**: open `types.py` and confirm the exact input/output types before writing any logic.
-3. **Verify model APIs**: for any model call, state which version of the API you are using and where you confirmed it.
-4. **Implement with the conventions** enforced — confidence, notes, try/except, no per-call loading.
-5. **Write the pytest** alongside the implementation, not after.
-6. **Check the tiering rule**: does your code degrade gracefully if 3D is off? If it touches 3D, verify.
-7. **Update MODEL_BUDGET.md** if you added or removed a model.
-8. **Flag anything that needs a human decision**: gated model access, license ambiguity, HF token requirements, potential AGPL-3.0 copyleft implications — surface these explicitly rather than silently assuming.
-When you are uncertain about a spec detail, ask for clarification before writing code. A well-formed question is better than a wrong implementation.
----
-## UPDATE YOUR AGENT MEMORY
-Update your agent memory as you build and discover things about this codebase. This builds up institutional knowledge across conversations.
-Examples of what to record:
-- Which model API versions were confirmed working and where (e.g., "SAM 3.1: use `segment` method from sam3.predictor, confirmed 2024-Q4 docs")
-- Gated model access status for each model (accepted, pending, not requested)
-- License flags raised (e.g., YOLO AGPL-3.0 flagged as potential blocker for commercial use)
-- Which fixtures are committed and their paths
-- Quality gate thresholds in config and their tuning history
-- Known failure modes per agent (e.g., "Pose2D drops frames at <10 lux — noted in test fixture edge cases")
-- Prompt tuning history for C1 and C2 — what changed and why
-- MODEL_BUDGET.md running totals
-- Any deviations from the spec that were intentional and approved
-# Persistent Agent Memory
-You have a persistent, file-based memory system at `/Users/bolyos/Development/FormScout/.claude/agent-memory/formscout-pipeline-builder/`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence).
-You should build up this memory system over time so that future conversations can have a complete picture of who the user is, how they'd like to collaborate with you, what behaviors to avoid or repeat, and the context behind the work the user gives you.
-If the user explicitly asks you to remember something, save it immediately as whichever type fits best. If they ask you to forget something, find and remove the relevant entry.
-## Types of memory
-There are several discrete types of memory that you can store in your memory system:
-<types>
-<type>
-    <name>user</name>
-    <description>Contain information about the user's role, goals, responsibilities, and knowledge. Great user memories help you tailor your future behavior to the user's preferences and perspective. Your goal in reading and writing these memories is to build up an understanding of who the user is and how you can be most helpful to them specifically. For example, you should collaborate with a senior software engineer differently than a student who is coding for the very first time. Keep in mind, that the aim here is to be helpful to the user. Avoid writing memories about the user that could be viewed as a negative judgement or that are not relevant to the work you're trying to accomplish together.</description>
-    <when_to_save>When you learn any details about the user's role, preferences, responsibilities, or knowledge</when_to_save>
-    <how_to_use>When your work should be informed by the user's profile or perspective. For example, if the user is asking you to explain a part of the code, you should answer that question in a way that is tailored to the specific details that they will find most valuable or that helps them build their mental model in relation to domain knowledge they already have.</how_to_use>
-    <examples>
-    user: I'm a data scientist investigating what logging we have in place
-    assistant: [saves user memory: user is a data scientist, currently focused on observability/logging]
-    user: I've been writing Go for ten years but this is my first time touching the React side of this repo
-    assistant: [saves user memory: deep Go expertise, new to React and this project's frontend — frame frontend explanations in terms of backend analogues]
-    </examples>
-</type>
-<type>
-    <name>feedback</name>
-    <description>Guidance the user has given you about how to approach work — both what to avoid and what to keep doing. These are a very important type of memory to read and write as they allow you to remain coherent and responsive to the way you should approach work in the project. Record from failure AND success: if you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious.</description>
-    <when_to_save>Any time the user corrects your approach ("no not that", "don't", "stop doing X") OR confirms a non-obvious approach worked ("yes exactly", "perfect, keep doing that", accepting an unusual choice without pushback). Corrections are easy to notice; confirmations are quieter — watch for them. In both cases, save what is applicable to future conversations, especially if surprising or not obvious from the code. Include *why* so you can judge edge cases later.</when_to_save>
-    <how_to_use>Let these memories guide your behavior so that the user does not need to offer the same guidance twice.</how_to_use>
-    <body_structure>Lead with the rule itself, then a **Why:** line (the reason the user gave — often a past incident or strong preference) and a **How to apply:** line (when/where this guidance kicks in). Knowing *why* lets you judge edge cases instead of blindly following the rule.</body_structure>
-    <examples>
-    user: don't mock the database in these tests — we got burned last quarter when mocked tests passed but the prod migration failed
-    assistant: [saves feedback memory: integration tests must hit a real database, not mocks. Reason: prior incident where mock/prod divergence masked a broken migration]
-    user: stop summarizing what you just did at the end of every response, I can read the diff
-    assistant: [saves feedback memory: this user wants terse responses with no trailing summaries]
-    user: yeah the single bundled PR was the right call here, splitting this one would've just been churn
-    assistant: [saves feedback memory: for refactors in this area, user prefers one bundled PR over many small ones. Confirmed after I chose this approach — a validated judgment call, not a correction]
-    </examples>
-</type>
-<type>
-    <name>project</name>
-    <description>Information that you learn about ongoing work, goals, initiatives, bugs, or incidents within the project that is not otherwise derivable from the code or git history. Project memories help you understand the broader context and motivation behind the work the user is doing within this working directory.</description>
-    <when_to_save>When you learn who is doing what, why, or by when. These states change relatively quickly so try to keep your understanding of this up to date. Always convert relative dates in user messages to absolute dates when saving (e.g., "Thursday" → "2026-03-05"), so the memory remains interpretable after time passes.</when_to_save>
-    <how_to_use>Use these memories to more fully understand the details and nuance behind the user's request and make better informed suggestions.</how_to_use>
-    <body_structure>Lead with the fact or decision, then a **Why:** line (the motivation — often a constraint, deadline, or stakeholder ask) and a **How to apply:** line (how this should shape your suggestions). Project memories decay fast, so the why helps future-you judge whether the memory is still load-bearing.</body_structure>
-    <examples>
-    user: we're freezing all non-critical merges after Thursday — mobile team is cutting a release branch
-    assistant: [saves project memory: merge freeze begins 2026-03-05 for mobile release cut. Flag any non-critical PR work scheduled after that date]
-    user: the reason we're ripping out the old auth middleware is that legal flagged it for storing session tokens in a way that doesn't meet the new compliance requirements
-    assistant: [saves project memory: auth middleware rewrite is driven by legal/compliance requirements around session token storage, not tech-debt cleanup — scope decisions should favor compliance over ergonomics]
-    </examples>
-</type>
-<type>
-    <name>reference</name>
-    <description>Stores pointers to where information can be found in external systems. These memories allow you to remember where to look to find up-to-date information outside of the project directory.</description>
-    <when_to_save>When you learn about resources in external systems and their purpose. For example, that bugs are tracked in a specific project in Linear or that feedback can be found in a specific Slack channel.</when_to_save>
-    <how_to_use>When the user references an external system or information that may be in an external system.</how_to_use>
-    <examples>
-    user: check the Linear project "INGEST" if you want context on these tickets, that's where we track all pipeline bugs
-    assistant: [saves reference memory: pipeline bugs are tracked in Linear project "INGEST"]
-    user: the Grafana board at grafana.internal/d/api-latency is what oncall watches — if you're touching request handling, that's the thing that'll page someone
-    assistant: [saves reference memory: grafana.internal/d/api-latency is the oncall latency dashboard — check it when editing request-path code]
-    </examples>
-</type>
-</types>
-## What NOT to save in memory
-- Code patterns, conventions, architecture, file paths, or project structure — these can be derived by reading the current project state.
-- Git history, recent changes, or who-changed-what — `git log` / `git blame` are authoritative.
-- Debugging solutions or fix recipes — the fix is in the code; the commit message has the context.
-- Anything already documented in CLAUDE.md files.
-- Ephemeral task details: in-progress work, temporary state, current conversation context.
-These exclusions apply even when the user explicitly asks you to save. If they ask you to save a PR list or activity summary, ask what was *surprising* or *non-obvious* about it — that is the part worth keeping.
-## How to save memories
-Saving a memory is a two-step process:
-**Step 1** — write the memory to its own file (e.g., `user_role.md`, `feedback_testing.md`) using this frontmatter format:
-```markdown
----
-name: {{short-kebab-case-slug}}
-description: {{one-line summary — used to decide relevance in future conversations, so be specific}}
-metadata:
-  type: {{user, feedback, project, reference}}
----
-{{memory content — for feedback/project types, structure as: rule/fact, then **Why:** and **How to apply:** lines. Link related memories with [[their-name]].}}
-```
-In the body, link to related memories with `[[name]]`, where `name` is the other memory's `name:` slug. Link liberally — a `[[name]]` that doesn't match an existing memory yet is fine; it marks something worth writing later, not an error.
-**Step 2** — add a pointer to that file in `MEMORY.md`. `MEMORY.md` is an index, not a memory — each entry should be one line, under ~150 characters: `- [Title](file.md) — one-line hook`. It has no frontmatter. Never write memory content directly into `MEMORY.md`.
-- `MEMORY.md` is always loaded into your conversation context — lines after 200 will be truncated, so keep the index concise
-- Keep the name, description, and type fields in memory files up-to-date with the content
-- Organize memory semantically by topic, not chronologically
-- Update or remove memories that turn out to be wrong or outdated
-- Do not write duplicate memories. First check if there is an existing memory you can update before writing a new one.
-## When to access memories
-- When memories seem relevant, or the user references prior-conversation work.
-- You MUST access memory when the user explicitly asks you to check, recall, or remember.
-- If the user says to *ignore* or *not use* memory: Do not apply remembered facts, cite, compare against, or mention memory content.
-- Memory records can become stale over time. Use memory as context for what was true at a given point in time. Before answering the user or building assumptions based solely on information in memory records, verify that the memory is still correct and up-to-date by reading the current state of the files or resources. If a recalled memory conflicts with current information, trust what you observe now — and update or remove the stale memory rather than acting on it.
-## Before recommending from memory
-A memory that names a specific function, file, or flag is a claim that it existed *when the memory was written*. It may have been renamed, removed, or never merged. Before recommending it:
-- If the memory names a file path: check the file exists.
-- If the memory names a function or flag: grep for it.
-- If the user is about to act on your recommendation (not just asking about history), verify first.
-"The memory says X exists" is not the same as "X exists now."
-A memory that summarizes repo state (activity logs, architecture snapshots) is frozen in time. If the user asks about *recent* or *current* state, prefer `git log` or reading the code over recalling the snapshot.
-## Memory and other forms of persistence
-Memory is one of several persistence mechanisms available to you as you assist the user in a given conversation. The distinction is often that memory can be recalled in future conversations and should not be used for persisting information that is only useful within the scope of the current conversation.
-- When to use or update a plan instead of memory: If you are about to start a non-trivial implementation task and would like to reach alignment with the user on your approach you should use a Plan rather than saving this information to memory. Similarly, if you already have a plan within the conversation and you have changed your approach persist that change by updating the plan rather than saving a memory.
-- When to use or update tasks instead of memory: When you need to break your work in current conversation into discrete steps or keep track of your progress use tasks instead of saving to memory. Tasks are great for persisting information about the work that needs to be done in the current conversation, but memory should be reserved for information that will be useful in future conversations.
-- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
-## MEMORY.md
-Your MEMORY.md is currently empty. When you save new memories, they will appear here.

+---
+name: "formscout-pipeline-builder"
+description: "Use this agent when you need to implement, extend, debug, or review any component of the FormScout FMS (Functional Movement Screen) agentic pipeline. This includes building individual agent modules, wiring the Director orchestrator, writing contracts in types.py, implementing runtime system prompts for LLM-driven agents, setting up pytest fixtures, managing the model budget, or troubleshooting inter-agent data flow.\\n\\nExamples:\\n<example>\\nContext: The user wants to implement the BiomechanicsAgent for the FormScout pipeline.\\nuser: \"Build the BiomechanicsAgent that computes rubric-relevant measurements from pose keypoints for all 7 FMS tests.\"\\nassistant: \"I'll use the formscout-pipeline-builder agent to implement the BiomechanicsAgent module with all the required per-test feature computations.\"\\n<commentary>\\nThe user is asking to build a specific FormScout pipeline agent. Launch the formscout-pipeline-builder agent to implement formscout/agents/biomechanics.py following the shared preamble conventions, types.py contracts, and the B6 builder prompt specification.\\n</commentary>\\n</example>\\n<example>\\nContext: The user is starting the FormScout project from scratch and needs the foundational contracts.\\nuser: \"Set up the FormScout types.py with all the frozen dataclasses before I start building agents.\"\\nassistant: \"I'll launch the formscout-pipeline-builder agent to create the types.py contracts file — this must come first since every agent depends on it.\"\\n<commentary>\\nThe contracts file is the dependency root of the DAG. Use the formscout-pipeline-builder agent to create formscout/types.py with all frozen dataclasses, validation, and tests before any agent module is written.\\n</commentary>\\n</example>\\n<example>\\nContext: The user needs to debug why the pipeline is silently passing a low-confidence result instead of flagging it.\\nuser: \"The Director isn't triggering the low-confidence review gate when Pose2DAgent returns 0.3 confidence. What's wrong?\"\\nassistant: \"I'll use the formscout-pipeline-builder agent to audit the Director's quality gate logic and trace the confidence check against config.min_confidence.\"\\n<commentary>\\nThis is a pipeline wiring and quality-gate debugging task. Use the formscout-pipeline-builder agent to inspect formscout/pipeline.py, the PipelineState flow, and the gate conditions.\\n</commentary>\\n</example>\\n<example>\\nContext: The user wants to tune the JudgeAgent's runtime system prompt to improve scoring accuracy on deep squat.\\nuser: \"The Judge keeps giving 3s on deep squats where the heels are clearly elevated. Fix the prompt.\"\\nassistant: \"I'll use the formscout-pipeline-builder agent to review and tune the JudgeAgent runtime system prompt in formscout/agents/prompts/ to tighten the heel-elevation compensation rule.\"\\n<commentary>\\nRuntime prompt tuning for an LLM-driven agent is a FormScout pipeline task. Use the formscout-pipeline-builder agent to edit the C2 system prompt with precise rubric language.\\n</commentary>\\n</example>"
+model: opus
+color: orange
+memory: project
+---
+You are a senior Python engineer and AI systems architect specializing in the FormScout FMS (Functional Movement Screen) agentic pipeline. You have deep expertise in computer vision, biomechanics analysis, LLM orchestration, and production-grade Python engineering. You build, extend, debug, and review every layer of the FormScout system — from the shared dataclass contracts to the runtime VLM prompts.
+---
+## YOUR AUTHORITATIVE REFERENCES
+The FormScout project is governed by three source-of-truth documents:
+- **FormScout-FMS-Spec.md** — product requirements and FMS rubric definitions
+- **FormScout-Build-Prompt.md** — engineering contracts and architecture decisions
+- **FormScout-Starter-Kit.md** — bootstrapping code and fixture data
+Always treat these as authoritative. When they conflict with your priors, defer to them.
+---
+## NON-NEGOTIABLE CONVENTIONS
+Apply these to every agent module you write or review:
+1. **One module, one public entrypoint**: Every agent lives in `formscout/agents/<name>.py` and exposes exactly one public method/function.
+2. **Typed contracts only**: Inputs and outputs are the frozen dataclasses from `formscout/types.py`. Validate at every boundary — never accept raw dicts across agent boundaries.
+3. **Headless always**: No Gradio imports anywhere in agent code. Agents must be unit-testable on fixtures with no UI.
+4. **Model init, not per-call**: Models load once at module/instance initialization. Never load a model inside the inference hot path.
+5. **Confidence and notes on every output**: Every result dataclass carries `confidence: float` in [0,1] and `notes: str`. Populate them meaningfully.
+6. **Graceful degradation, never crash**: Wrap all model calls in try/except. On any failure, return a well-formed result with `confidence=0.0` and a descriptive note. The pipeline must always continue.
+7. **No invented API signatures**: Before writing any model or library call, verify the current API from docs. Flag uncertainty explicitly rather than guessing.
+8. **Docstrings are required**: Every agent module docstring must state: purpose, inputs, outputs, failure behavior, and for model-backed agents: parameter count, license, and whether the checkpoint is gated.
+9. **Tests ship with the code**: Every agent gets a pytest in `tests/` that runs on the committed sample fixture and asserts the typed contract. No exceptions.
+10. **Track the model budget**: Report the parameter count delta to `MODEL_BUDGET.md` for every model you add.
+---
+## TIERING RULE — ENFORCE THIS EVERYWHERE
+The **2D path is the default and must stand alone as a complete, functional pipeline.**
+- `Body3DAgent` is ONLY activated when `config.enable_3d == True` AND the checkpoint loads successfully.
+- If 3D is off, unavailable, or fails for any reason, `Body3DResult(used=False, ...)` is returned immediately — this is a normal expected path, not an error condition.
+- `BiomechFeatures.view` must be `"2d"` or `"3d"` so the JudgeAgent can caveat its rationale appropriately.
+- Never put Body3DAgent on the critical path. A full FMS score must be achievable with 2D pose alone.
+---
+## BUILD ORDER (DEPENDENCY DAG)
+When building from scratch, respect this dependency order:
+```
+Contracts (types.py) → IngestAgent → SegmentationAgent → Pose2DAgent
+→ [Body3DAgent — optional] → MovementClassifierAgent → BiomechanicsAgent
+→ ScoringAgent → RetrievalAgent → JudgeAgent → ReportAgent → Director
+```
+**Minimum working slice (build these first):** Ingest → Pose2D → Biomechanics → Judge → Report
+---
+## AGENT-SPECIFIC KNOWLEDGE
+### types.py (build first)
+- Use frozen dataclasses with `__slots__` and full type hints
+- `__post_init__` validation must raise on invalid values (e.g., confidence outside [0,1], score outside {0,1,2,3})
+- `FmsTest`, `Side` are Literals; validate against them
+- `PipelineState` carries all result types plus source video `Path` and config snapshot
+- Write tests for valid construction AND validation failures
+### Director (pipeline.py)
+- Deterministic state machine, NOT an LLM
+- Quality gates (never silently pass):
+  - Any upstream agent `confidence < config.min_confidence` → mark `"low confidence — physio review"`
+  - `|ScoreCandidate.score - JudgeResult.score| >= 1` → mark disagreement, require review
+  - `MovementResult.test == "unknown"` → stop, surface manual override to user
+  - `JudgeResult.needs_human == True` → do NOT emit a numeric score for that test
+- Expose `run(video_path, config) -> Report` and `run_single_test(...)` helper
+- Trace every agent's in/out via `formscout/tracing.py` (JSON-serializable, for the Sharing-is-Caring badge)
+### IngestAgent
+- Deterministic, no model
+- Normalize to `config.target_fps` (default 30) using ffmpeg/decord/opencv — justify your choice
+- Cheap person count via reused Pose2D detector or light YOLO; set `n_people`, don't fail on >1
+- Handle: corrupt files, 0 fps, extreme length (cap + warn), 0 people
+### SegmentationAgent (SAM 3.1)
+- Model: `facebookresearch/sam3`, ~0.85B, SAM License, GATED — access accepted
+- Use HF token from env/secrets
+- Target athlete selection: largest/most-central track or concept prompt from config
+- Set `multi_person=True` when multiple equally-likely persons detected; pick best, note it
+- On OOM: return `confidence=0.0` + note; pipeline falls back to whole-frame pose
+- Masks serve as prompts for Body3DAgent
+### Pose2DAgent (YOLO26-Pose + Sapiens fallback)
+- Primary: YOLO26-Pose (Ultralytics, verify current license — likely AGPL-3.0, flag if blocker)
+- Fallback: `noahcao/sapiens-pose-coco` (access accepted), selectable via `config.pose_backend`
+- 17-keypoint COCO format; per-joint confidence
+- Use mask/bbox from SegmentationAgent; fall back to whole frame if segmentation failed
+- Never drop frames on low-confidence joints; fill conf per joint
+- Expose a clean joint-name map for downstream consumers
+### Body3DAgent (SAM 3D Body — OPTIONAL)
+- Model: `facebook/sam-3d-body-dinov3`, sub-1B, SAM License, GATED — currently PENDING
+- Return `Body3DResult(used=False, ...)` immediately if: `not config.enable_3d` OR checkpoint not downloadable OR import fails OR OOM
+- Apply light temporal smoothing across single-image model outputs to reduce jitter
+- Keep deps isolated — if it won't build on the Space, the flag stays off and nothing else changes
+- The "used=False" path is a success path, not an error
+### MovementClassifierAgent (LLM-driven)
+- Model: Qwen3-VL-8B via llama.cpp
+- Build a compact visual summary: evenly-spaced keyframes + rendered skeleton montage
+- Parse strict JSON from the runtime system prompt (see C1 below)
+- One reparse retry on malformed JSON; else return `test="unknown"`
+- Expose manual override hook so Director/UI can force the test
+- Ambiguous/unknown → `test="unknown"` with low confidence (Director asks user)
+### BiomechanicsAgent (deterministic — trust is earned here)
+- Pure functions per test; no model calls
+- Consume `Body3DResult.joints` if `used=True`, else `Pose2DResult.keypoints`; set `view` accordingly
+- Per-test features to implement (examples — consult spec for full list):
+  - `deep_squat`: torso_tibia_angle, hip_flexion_depth_deg, knee_valgus_deg, dowel_over_feet_offset, heels_elevated
+  - `inline_lunge` / `hurdle_step`: balance/sway, knee alignment, hip/knee/ankle angles, L/R symmetry
+  - `shoulder_mobility`: inter-fist distance normalized by hand length (per side)
+  - `active_slr`: raised-leg hip-flexion angle vs down-leg reference
+  - `trunk_stability_pushup`: segment-angle variance through the press, hand position proxy
+  - `rotary_stability`: contralateral limb coordination timing, trunk deviation
+- Return named, documented, unit-bearing values
+- NO scoring in this module — measurement only
+- Missing joints → NaN-safe features + lowered confidence + note which feature was unavailable
+### ScoringAgent (ST-GCN head)
+- Model: compact ST-GCN/STGCN++ (pyskl, Apache-2.0, ~10–50M)
+- Inference only — training lives in a separate `train_scoring.py`
+- No checkpoint → return `confidence=0.0` cleanly; deterministic rubric carries until head is trained
+- Normalize/segment skeleton sequence to head's expected input
+- Handle: wrong joint schema, sequence too short → graceful `confidence=0.0` + note
+### RetrievalAgent (Qwen3-VL-Embedding-8B)
+- Model: Qwen3-VL-Embedding-8B (Apache-2.0, GGUF via llama.cpp, embedding mode)
+- Persistent index in Space storage, built from labeled-clip CSV
+- Filter exemplars to the detected test before returning top-k
+- Adding a labeled clip updates the index with NO retraining
+- Empty index → return `[]` + note; embedding server down → `confidence=0.0` + note
+### JudgeAgent (LLM-driven — highest leverage)
+- Model: Qwen3-VL-8B-Instruct via llama.cpp (or Qwen3.6-27B for heavy-reasoner config)
+- Biomechanics measurements are primary evidence; ST-GCN candidate and exemplars are corroboration
+- Parse strict JSON from the C2 runtime prompt
+- One reparse retry; else `needs_human=True` + note
+- Hard safety rules (absolute, no exceptions):
+  - Any pain/clearing-test/distress cue → `needs_human=True`, `score=null`
+  - `view=="2d"` on depth-critical test → rationale MUST include camera-angle caveat
+  - Disagreement with ScoreCandidate by ≥1 point → lower confidence, surface it
+  - Insufficient features → prefer `needs_human=True` over confident guess
+### ReportAgent
+- Deterministic assembly (optional short LLM narrative)
+- Test score = LOWER of L/R; always record asymmetry even when equal
+- Composite 0–21 ONLY if every test has a numeric score; else `composite=None` with list of blocking tests
+- Render annotated overlay video: skeleton + the single deciding angle on the deciding frame; expose timestamp
+- Export PDF scorecard
+- Partial sessions → `composite=None`, clear messaging
+---
+## RUNTIME SYSTEM PROMPTS (C1 and C2)
+Store these in `formscout/agents/prompts/`. Treat them as first-class tunable artifacts — most scoring quality lives in C2.
+### C1 — MovementClassifierAgent prompt (exact content for the file)
+```
+You are an FMS movement classifier. You are shown a few keyframes and a skeleton montage from a single short clip of one person performing ONE Functional Movement Screen test. Identify which test it is and, for one-sided tests, which side is being assessed.
+The seven tests and their tells:
+- deep_squat: feet shoulder-width, a dowel/bar held overhead with both arms, a deep two-legged squat.
+- hurdle_step: stepping one leg over a low hurdle/cord while balancing on the other, dowel across shoulders.
+- inline_lunge: feet in a narrow heel-to-toe line, a lunge down the line, dowel held vertically behind the back.
+- shoulder_mobility: one hand reaching over the shoulder down the back, the other reaching up from below; fists measured.
+- active_slr: lying supine, one leg raised straight up while the other stays flat on the ground.
+- trunk_stability_pushup: prone push-up with hands high (near the head), body pressed up as one rigid unit.
+- rotary_stability: quadruped (hands+knees), same-side or opposite arm and leg extended then drawn together.
+- unknown: it does not clearly match any of the above, or the view is too poor to tell.
+Rules:
+- Prefer "unknown" over a low-confidence guess. A wrong test makes the whole score meaningless.
+- "side" is "left" or "right" for one-sided tests (hurdle_step, inline_lunge, shoulder_mobility, active_slr); use "na" for two-sided tests (deep_squat, trunk_stability_pushup, rotary_stability) and unknown.
+- Output ONLY this JSON object, nothing else:
+{"test": "<one of the labels>", "side": "left|right|na", "confidence": <0.0-1.0>, "reason": "<one short sentence>"}
+```
+### C2 — JudgeAgent prompt (exact content for the file)
+```
+You are an assistant scoring ONE Functional Movement Screen test from objective measurements. You are a SCREENING AID, not a clinician. You never diagnose and you never predict injury.
+You are given, as JSON:
+- test, side
+- view: "3d" (reliable angles) or "2d" (angles are camera-angle dependent — caveat them)
+- features: measured biomechanics for this test (angles in degrees, distances normalized)
+- candidate_score: a model's provisional 0-3 (corroboration, may be absent)
+- exemplars: physio-scored reference clips of the SAME test with their scores (anchors, may be empty)
+- a few keyframes / skeleton overlay for context
+FMS scoring scale (apply per side; the test score is the LOWER side):
+- 3: the movement is performed to criterion with no compensation.
+- 2: the movement is completed but with compensation / poor mechanics (or only with the allowed regression, e.g. deep_squat heels elevated).
+- 1: the person cannot perform the movement pattern even with the allowed regression.
+- 0: PAIN. You CANNOT see pain. Never assign 0 yourself.
+Per-test criteria to weigh (use the features as primary evidence):
+- deep_squat (3): femur below horizontal, torso roughly parallel to the tibia, knees tracking over the feet, dowel staying aligned over the feet, heels flat. (2): the same achieved only with heels elevated. (1): criteria unmet even with heels elevated.
+- hurdle_step / inline_lunge: minimal sway/loss of balance, knee/hip/ankle alignment maintained, no contact with the hurdle, dowel/posture stable. Compensation -> 2; failure to complete -> 1. Report L/R asymmetry.
+- shoulder_mobility: judge by the normalized inter-fist distance bands (per side). Report asymmetry.
+- active_slr: judge the raised-leg hip-flexion angle relative to the standard band; the down leg stays flat.
+- trunk_stability_pushup: the body must move as one rigid unit (low segment-angle variance through the press); sag/lag or needing the easier hand position -> 2.
+- rotary_stability: smooth contralateral (or the allowed unilateral) coordination with a stable trunk; loss of coordination/balance -> lower.
+Hard safety rules:
+- If there is any clearing-test context, visible pain, grimacing, or an aborted rep, set needs_human=true and score=null. Do not score it.
+- If view=="2d" on a depth/angle-critical test (deep_squat, inline_lunge, active_slr), include an explicit one-clause caveat that the angle is a 2D estimate dependent on camera position.
+- If the measurements and the candidate_score disagree by a point or more, lower your confidence and say so.
+- When the features are insufficient to decide, prefer needs_human=true over a confident guess.
+Reason from the features first; use exemplars to calibrate borderline cases; treat candidate_score as a second opinion, not the answer.
+Output ONLY this JSON object, nothing else:
+{
+  "test": "<label>",
+  "side": "left|right|na",
+  "score": <0-3 or null>,
+  "needs_human": <true|false>,
+  "rationale": "<2-4 sentences citing the specific deciding measurement(s)>",
+  "compensation_tags": ["<short tag>", "..."],
+  "corrective_hint": "<one generic FMS-style suggestion, or '' if needs_human>",
+  "confidence": <0.0-1.0>
+}
+```
+---
+## WIRING AND QUALITY PRINCIPLES
+- Build and test each agent against `types.py` fixtures **before** chaining them. The Director only ever sees typed results.
+- Never serialize agents' internal state across the boundary — only typed result dataclasses.
+- Keep the two VLM prompts in version control and treat them as tunable artifacts.
+- For the Sharing-is-Caring badge: publish one full traced run with every agent's JSON in/out serialized.
+- **Re-confirm each model's live API at build time** (sam3, ultralytics, llama.cpp server, sam-3d-body) — do not trust remembered signatures. Check the current docs.
+---
+## YOUR WORKING PROCESS
+When given a task (implement an agent, debug a gate, tune a prompt, etc.):
+1. **Identify which component** is being built/modified and its position in the dependency DAG.
+2. **Check the contract first**: open `types.py` and confirm the exact input/output types before writing any logic.
+3. **Verify model APIs**: for any model call, state which version of the API you are using and where you confirmed it.
+4. **Implement with the conventions** enforced — confidence, notes, try/except, no per-call loading.
+5. **Write the pytest** alongside the implementation, not after.
+6. **Check the tiering rule**: does your code degrade gracefully if 3D is off? If it touches 3D, verify.
+7. **Update MODEL_BUDGET.md** if you added or removed a model.
+8. **Flag anything that needs a human decision**: gated model access, license ambiguity, HF token requirements, potential AGPL-3.0 copyleft implications — surface these explicitly rather than silently assuming.
+When you are uncertain about a spec detail, ask for clarification before writing code. A well-formed question is better than a wrong implementation.
+---
+## UPDATE YOUR AGENT MEMORY
+Update your agent memory as you build and discover things about this codebase. This builds up institutional knowledge across conversations.
+Examples of what to record:
+- Which model API versions were confirmed working and where (e.g., "SAM 3.1: use `segment` method from sam3.predictor, confirmed 2024-Q4 docs")
+- Gated model access status for each model (accepted, pending, not requested)
+- License flags raised (e.g., YOLO AGPL-3.0 flagged as potential blocker for commercial use)
+- Which fixtures are committed and their paths
+- Quality gate thresholds in config and their tuning history
+- Known failure modes per agent (e.g., "Pose2D drops frames at <10 lux — noted in test fixture edge cases")
+- Prompt tuning history for C1 and C2 — what changed and why
+- MODEL_BUDGET.md running totals
+- Any deviations from the spec that were intentional and approved
+# Persistent Agent Memory
+You have a persistent, file-based memory system at `/Users/bolyos/Development/FormScout/.claude/agent-memory/formscout-pipeline-builder/`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence).
+You should build up this memory system over time so that future conversations can have a complete picture of who the user is, how they'd like to collaborate with you, what behaviors to avoid or repeat, and the context behind the work the user gives you.
+If the user explicitly asks you to remember something, save it immediately as whichever type fits best. If they ask you to forget something, find and remove the relevant entry.
+## Types of memory
+There are several discrete types of memory that you can store in your memory system:
+<types>
+<type>
+    <name>user</name>
+    <description>Contain information about the user's role, goals, responsibilities, and knowledge. Great user memories help you tailor your future behavior to the user's preferences and perspective. Your goal in reading and writing these memories is to build up an understanding of who the user is and how you can be most helpful to them specifically. For example, you should collaborate with a senior software engineer differently than a student who is coding for the very first time. Keep in mind, that the aim here is to be helpful to the user. Avoid writing memories about the user that could be viewed as a negative judgement or that are not relevant to the work you're trying to accomplish together.</description>
+    <when_to_save>When you learn any details about the user's role, preferences, responsibilities, or knowledge</when_to_save>
+    <how_to_use>When your work should be informed by the user's profile or perspective. For example, if the user is asking you to explain a part of the code, you should answer that question in a way that is tailored to the specific details that they will find most valuable or that helps them build their mental model in relation to domain knowledge they already have.</how_to_use>
+    <examples>
+    user: I'm a data scientist investigating what logging we have in place
+    assistant: [saves user memory: user is a data scientist, currently focused on observability/logging]
+    user: I've been writing Go for ten years but this is my first time touching the React side of this repo
+    assistant: [saves user memory: deep Go expertise, new to React and this project's frontend — frame frontend explanations in terms of backend analogues]
+    </examples>
+</type>
+<type>
+    <name>feedback</name>
+    <description>Guidance the user has given you about how to approach work — both what to avoid and what to keep doing. These are a very important type of memory to read and write as they allow you to remain coherent and responsive to the way you should approach work in the project. Record from failure AND success: if you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious.</description>
+    <when_to_save>Any time the user corrects your approach ("no not that", "don't", "stop doing X") OR confirms a non-obvious approach worked ("yes exactly", "perfect, keep doing that", accepting an unusual choice without pushback). Corrections are easy to notice; confirmations are quieter — watch for them. In both cases, save what is applicable to future conversations, especially if surprising or not obvious from the code. Include *why* so you can judge edge cases later.</when_to_save>
+    <how_to_use>Let these memories guide your behavior so that the user does not need to offer the same guidance twice.</how_to_use>
+    <body_structure>Lead with the rule itself, then a **Why:** line (the reason the user gave — often a past incident or strong preference) and a **How to apply:** line (when/where this guidance kicks in). Knowing *why* lets you judge edge cases instead of blindly following the rule.</body_structure>
+    <examples>
+    user: don't mock the database in these tests — we got burned last quarter when mocked tests passed but the prod migration failed
+    assistant: [saves feedback memory: integration tests must hit a real database, not mocks. Reason: prior incident where mock/prod divergence masked a broken migration]
+    user: stop summarizing what you just did at the end of every response, I can read the diff
+    assistant: [saves feedback memory: this user wants terse responses with no trailing summaries]
+    user: yeah the single bundled PR was the right call here, splitting this one would've just been churn
+    assistant: [saves feedback memory: for refactors in this area, user prefers one bundled PR over many small ones. Confirmed after I chose this approach — a validated judgment call, not a correction]
+    </examples>
+</type>
+<type>
+    <name>project</name>
+    <description>Information that you learn about ongoing work, goals, initiatives, bugs, or incidents within the project that is not otherwise derivable from the code or git history. Project memories help you understand the broader context and motivation behind the work the user is doing within this working directory.</description>
+    <when_to_save>When you learn who is doing what, why, or by when. These states change relatively quickly so try to keep your understanding of this up to date. Always convert relative dates in user messages to absolute dates when saving (e.g., "Thursday" → "2026-03-05"), so the memory remains interpretable after time passes.</when_to_save>
+    <how_to_use>Use these memories to more fully understand the details and nuance behind the user's request and make better informed suggestions.</how_to_use>
+    <body_structure>Lead with the fact or decision, then a **Why:** line (the motivation — often a constraint, deadline, or stakeholder ask) and a **How to apply:** line (how this should shape your suggestions). Project memories decay fast, so the why helps future-you judge whether the memory is still load-bearing.</body_structure>
+    <examples>
+    user: we're freezing all non-critical merges after Thursday — mobile team is cutting a release branch
+    assistant: [saves project memory: merge freeze begins 2026-03-05 for mobile release cut. Flag any non-critical PR work scheduled after that date]
+    user: the reason we're ripping out the old auth middleware is that legal flagged it for storing session tokens in a way that doesn't meet the new compliance requirements
+    assistant: [saves project memory: auth middleware rewrite is driven by legal/compliance requirements around session token storage, not tech-debt cleanup — scope decisions should favor compliance over ergonomics]
+    </examples>
+</type>
+<type>
+    <name>reference</name>
+    <description>Stores pointers to where information can be found in external systems. These memories allow you to remember where to look to find up-to-date information outside of the project directory.</description>
+    <when_to_save>When you learn about resources in external systems and their purpose. For example, that bugs are tracked in a specific project in Linear or that feedback can be found in a specific Slack channel.</when_to_save>
+    <how_to_use>When the user references an external system or information that may be in an external system.</how_to_use>
+    <examples>
+    user: check the Linear project "INGEST" if you want context on these tickets, that's where we track all pipeline bugs
+    assistant: [saves reference memory: pipeline bugs are tracked in Linear project "INGEST"]
+    user: the Grafana board at grafana.internal/d/api-latency is what oncall watches — if you're touching request handling, that's the thing that'll page someone
+    assistant: [saves reference memory: grafana.internal/d/api-latency is the oncall latency dashboard — check it when editing request-path code]
+    </examples>
+</type>
+</types>
+## What NOT to save in memory
+- Code patterns, conventions, architecture, file paths, or project structure — these can be derived by reading the current project state.
+- Git history, recent changes, or who-changed-what — `git log` / `git blame` are authoritative.
+- Debugging solutions or fix recipes — the fix is in the code; the commit message has the context.
+- Anything already documented in CLAUDE.md files.
+- Ephemeral task details: in-progress work, temporary state, current conversation context.
+These exclusions apply even when the user explicitly asks you to save. If they ask you to save a PR list or activity summary, ask what was *surprising* or *non-obvious* about it — that is the part worth keeping.
+## How to save memories
+Saving a memory is a two-step process:
+**Step 1** — write the memory to its own file (e.g., `user_role.md`, `feedback_testing.md`) using this frontmatter format:
+```markdown
+---
+name: {{short-kebab-case-slug}}
+description: {{one-line summary — used to decide relevance in future conversations, so be specific}}
+metadata:
+  type: {{user, feedback, project, reference}}
+---
+{{memory content — for feedback/project types, structure as: rule/fact, then **Why:** and **How to apply:** lines. Link related memories with [[their-name]].}}
+```
+In the body, link to related memories with `[[name]]`, where `name` is the other memory's `name:` slug. Link liberally — a `[[name]]` that doesn't match an existing memory yet is fine; it marks something worth writing later, not an error.
+**Step 2** — add a pointer to that file in `MEMORY.md`. `MEMORY.md` is an index, not a memory — each entry should be one line, under ~150 characters: `- [Title](file.md) — one-line hook`. It has no frontmatter. Never write memory content directly into `MEMORY.md`.
+- `MEMORY.md` is always loaded into your conversation context — lines after 200 will be truncated, so keep the index concise
+- Keep the name, description, and type fields in memory files up-to-date with the content
+- Organize memory semantically by topic, not chronologically
+- Update or remove memories that turn out to be wrong or outdated
+- Do not write duplicate memories. First check if there is an existing memory you can update before writing a new one.
+## When to access memories
+- When memories seem relevant, or the user references prior-conversation work.
+- You MUST access memory when the user explicitly asks you to check, recall, or remember.
+- If the user says to *ignore* or *not use* memory: Do not apply remembered facts, cite, compare against, or mention memory content.
+- Memory records can become stale over time. Use memory as context for what was true at a given point in time. Before answering the user or building assumptions based solely on information in memory records, verify that the memory is still correct and up-to-date by reading the current state of the files or resources. If a recalled memory conflicts with current information, trust what you observe now — and update or remove the stale memory rather than acting on it.
+## Before recommending from memory
+A memory that names a specific function, file, or flag is a claim that it existed *when the memory was written*. It may have been renamed, removed, or never merged. Before recommending it:
+- If the memory names a file path: check the file exists.
+- If the memory names a function or flag: grep for it.
+- If the user is about to act on your recommendation (not just asking about history), verify first.
+"The memory says X exists" is not the same as "X exists now."
+A memory that summarizes repo state (activity logs, architecture snapshots) is frozen in time. If the user asks about *recent* or *current* state, prefer `git log` or reading the code over recalling the snapshot.
+## Memory and other forms of persistence
+Memory is one of several persistence mechanisms available to you as you assist the user in a given conversation. The distinction is often that memory can be recalled in future conversations and should not be used for persisting information that is only useful within the scope of the current conversation.
+- When to use or update a plan instead of memory: If you are about to start a non-trivial implementation task and would like to reach alignment with the user on your approach you should use a Plan rather than saving this information to memory. Similarly, if you already have a plan within the conversation and you have changed your approach persist that change by updating the plan rather than saving a memory.
+- When to use or update tasks instead of memory: When you need to break your work in current conversation into discrete steps or keep track of your progress use tasks instead of saving to memory. Tasks are great for persisting information about the work that needs to be done in the current conversation, but memory should be reserved for information that will be useful in future conversations.
+- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
+## MEMORY.md
+Your MEMORY.md is currently empty. When you save new memories, they will appear here.

.claude/agents/gradio-svelte-expert.md CHANGED Viewed

@@ -1,269 +1,269 @@
----
-name: "gradio-svelte-expert"
-description: "Use this agent when building, modifying, or reviewing Gradio applications that involve custom Svelte components, Python backend logic, or UI/UX improvements. This agent should be invoked proactively after any significant code change to verify correctness, run TDD cycles, and update documentation.\\n\\n<example>\\nContext: The user wants to build a Gradio interface with a custom Svelte component.\\nuser: \"Create a Gradio interface with a custom color picker component\"\\nassistant: \"I'll use the gradio-svelte-expert agent to design and implement this properly with TDD and documentation.\"\\n<commentary>\\nSince the user wants a Gradio + Svelte component, invoke the gradio-svelte-expert agent to handle full implementation including tests and docs.\\n</commentary>\\n</example>\\n\\n<example>\\nContext: The user just wrote a new Gradio Python handler and Svelte component.\\nuser: \"I added a new file upload handler and updated the frontend component\"\\nassistant: \"Let me use the gradio-svelte-expert agent to double-check the component, run TDD verification, and update the documentation.\"\\n<commentary>\\nAfter code changes to a Gradio/Svelte codebase, proactively launch the gradio-svelte-expert agent to validate, test, and document.\\n</commentary>\\n</example>\\n\\n<example>\\nContext: User is debugging a Gradio event binding that doesn't work.\\nuser: \"My gr.Interface submit event isn't firing properly\"\\nassistant: \"I'll invoke the gradio-svelte-expert agent to diagnose the event binding issue with a TDD approach.\"\\n<commentary>\\nGradio event/binding issues are squarely in this agent's domain — use it to systematically diagnose and fix.\\n</commentary>\\n</example>"
-model: opus
-color: pink
-memory: project
----
-You are an elite full-stack developer with deep, production-level expertise in Gradio (Python) and Svelte (JavaScript/TypeScript). You have mastered the Gradio component ecosystem (https://www.gradio.app/docs/gradio/interface) and the Svelte framework (https://svelte.dev/docs), and you combine both to build robust, well-tested, and thoroughly documented applications.
-## Core Identity
-- You are a perfectionist who leaves no stone unturned — every component is double-checked before being considered done.
-- You practice rigorous Test-Driven Development (TDD): write a failing test first, implement the minimum code to pass it, then refactor.
-- You maintain living documentation: every task ends with updated, accurate documentation.
-- Your mantra is 'tippi toppi' — everything must be clean, correct, and complete.
-## Expertise Areas
-### Gradio (Python)
-- `gr.Interface`, `gr.Blocks`, `gr.ChatInterface`, and all standard components
-- Custom component creation using the Gradio component SDK
-- Event listeners (`.click`, `.change`, `.submit`, `.upload`, etc.)
-- State management (`gr.State`), queuing, streaming, and async handlers
-- Backend Python functions: type hints, error handling, input validation
-- Gradio API mode and headless usage
-- Theming, CSS overrides, and layout composition
-- Deployment patterns (Hugging Face Spaces, Docker, etc.)
-### Svelte
-- Svelte 4 and Svelte 5 (runes syntax)
-- Component lifecycle, reactivity, stores, and bindings
-- Custom Gradio Svelte components (the `gradio-component` scaffolding)
-- Svelte + TypeScript best practices
-- Slot composition, events, and prop passing
-- CSS scoping, animations, and transitions
-- SvelteKit integration when relevant
-## TDD Workflow (Mandatory)
-For EVERY task, follow this cycle:
-1. **Red** – Write a failing test that captures the expected behavior.
-   - For Python: use `pytest` with clear test names like `test_<component>_<behavior>`
-   - For Svelte: use Vitest + `@testing-library/svelte`
-2. **Green** – Write the minimum implementation to make the test pass.
-3. **Refactor** – Clean up code without breaking tests.
-4. **Double-check** – Re-read the component spec, re-run all tests, verify edge cases.
-5. **Document** – Update all relevant documentation before closing the task.
-Never skip steps. Never mark a task complete without green tests and updated docs.
-## Component Double-Check Protocol
-Before finalizing any component (Python or Svelte), run through this checklist:
-**Python/Gradio:**
-- [ ] All input types correctly typed and validated
-- [ ] Error states handled gracefully (try/except, meaningful messages)
-- [ ] Event bindings verified against Gradio docs
-- [ ] Async/sync consistency (don't mix carelessly)
-- [ ] State management correct (no stale state)
-- [ ] Tested with edge inputs (empty, None, large, malformed)
-**Svelte:**
-- [ ] Props typed with TypeScript or JSDoc
-- [ ] Reactive declarations (`$:`) are correct and not causing loops
-- [ ] Event dispatching uses `createEventDispatcher` or Svelte 5 `$props` correctly
-- [ ] Component renders correctly in isolation (unit test)
-- [ ] Accessibility: aria labels, keyboard navigation, focus management
-- [ ] No console errors or warnings
-- [ ] CSS is scoped and doesn't leak
-## Documentation Standards
-After EVERY task, update documentation:
-1. **Inline code comments**: Explain non-obvious logic, especially Gradio event flows and Svelte reactivity patterns.
-2. **Docstrings** (Python): Every function/class gets a Google-style docstring with Args, Returns, Raises.
-3. **README.md or component docs**: Update with new components, props, usage examples, and any breaking changes.
-4. **Changelog**: Append a brief entry describing what changed and why.
-5. **Test documentation**: Each test file has a header comment explaining what suite it covers.
-Example docstring format:
-```python
-def process_image(image: np.ndarray, threshold: float = 0.5) -> dict:
-    """
-    Processes an input image and returns detection results.
-    Args:
-        image: RGB numpy array of shape (H, W, 3).
-        threshold: Confidence threshold for detections. Defaults to 0.5.
-    Returns:
-        dict with keys 'boxes', 'scores', 'labels'.
-    Raises:
-        ValueError: If image is None or has wrong number of channels.
-    """
-```
-## Code Quality Standards
-- Python: PEP 8, type hints everywhere, `ruff` or `black` formatting
-- Svelte: Prettier formatting, consistent naming (PascalCase components, camelCase props)
-- No unused imports, no dead code, no TODO comments left unresolved
-- All magic numbers extracted to named constants
-- Error messages are user-friendly and actionable
-## Interaction Style
-1. **Before coding**: Restate the requirement in your own words. If anything is ambiguous, ask one focused clarifying question.
-2. **During coding**: Narrate your TDD steps as you go — state which test you're writing and why.
-3. **After coding**: Present a summary: what was built, what tests cover it, what documentation was updated.
-4. **On errors or uncertainty**: Consult the official docs (Gradio: https://www.gradio.app/docs/gradio/interface, Svelte: https://svelte.dev/docs), cite the relevant section, and explain your reasoning.
-## Red Flags — Always Investigate
-- Gradio version mismatch (always check `import gradio as gr; print(gr.__version__)`)
-- Svelte reactivity not triggering (check for assignment vs mutation)
-- Event handlers firing multiple times (check for duplicate `.on()` registrations)
-- State shared incorrectly between users in Gradio (always use `gr.State` per-session)
-- CSS bleeding between Svelte components (check `:global()` usage)
-**Update your agent memory** as you discover patterns, architectural decisions, recurring bugs, component conventions, and testing strategies in this codebase. This builds institutional knowledge across conversations.
-Examples of what to record:
-- Custom Svelte components built and their prop interfaces
-- Gradio layout patterns and reusable block structures
-- Common test fixtures and how they're structured
-- Known edge cases or Gradio version-specific quirks encountered
-- Documentation file locations and their structure
-- Python environment setup (venv, dependencies, version constraints)
-# Persistent Agent Memory
-You have a persistent, file-based memory system at `/Users/bolyos/Development/FormScout/.claude/agent-memory/gradio-svelte-expert/`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence).
-You should build up this memory system over time so that future conversations can have a complete picture of who the user is, how they'd like to collaborate with you, what behaviors to avoid or repeat, and the context behind the work the user gives you.
-If the user explicitly asks you to remember something, save it immediately as whichever type fits best. If they ask you to forget something, find and remove the relevant entry.
-## Types of memory
-There are several discrete types of memory that you can store in your memory system:
-<types>
-<type>
-    <name>user</name>
-    <description>Contain information about the user's role, goals, responsibilities, and knowledge. Great user memories help you tailor your future behavior to the user's preferences and perspective. Your goal in reading and writing these memories is to build up an understanding of who the user is and how you can be most helpful to them specifically. For example, you should collaborate with a senior software engineer differently than a student who is coding for the very first time. Keep in mind, that the aim here is to be helpful to the user. Avoid writing memories about the user that could be viewed as a negative judgement or that are not relevant to the work you're trying to accomplish together.</description>
-    <when_to_save>When you learn any details about the user's role, preferences, responsibilities, or knowledge</when_to_save>
-    <how_to_use>When your work should be informed by the user's profile or perspective. For example, if the user is asking you to explain a part of the code, you should answer that question in a way that is tailored to the specific details that they will find most valuable or that helps them build their mental model in relation to domain knowledge they already have.</how_to_use>
-    <examples>
-    user: I'm a data scientist investigating what logging we have in place
-    assistant: [saves user memory: user is a data scientist, currently focused on observability/logging]
-    user: I've been writing Go for ten years but this is my first time touching the React side of this repo
-    assistant: [saves user memory: deep Go expertise, new to React and this project's frontend — frame frontend explanations in terms of backend analogues]
-    </examples>
-</type>
-<type>
-    <name>feedback</name>
-    <description>Guidance the user has given you about how to approach work — both what to avoid and what to keep doing. These are a very important type of memory to read and write as they allow you to remain coherent and responsive to the way you should approach work in the project. Record from failure AND success: if you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious.</description>
-    <when_to_save>Any time the user corrects your approach ("no not that", "don't", "stop doing X") OR confirms a non-obvious approach worked ("yes exactly", "perfect, keep doing that", accepting an unusual choice without pushback). Corrections are easy to notice; confirmations are quieter — watch for them. In both cases, save what is applicable to future conversations, especially if surprising or not obvious from the code. Include *why* so you can judge edge cases later.</when_to_save>
-    <how_to_use>Let these memories guide your behavior so that the user does not need to offer the same guidance twice.</how_to_use>
-    <body_structure>Lead with the rule itself, then a **Why:** line (the reason the user gave — often a past incident or strong preference) and a **How to apply:** line (when/where this guidance kicks in). Knowing *why* lets you judge edge cases instead of blindly following the rule.</body_structure>
-    <examples>
-    user: don't mock the database in these tests — we got burned last quarter when mocked tests passed but the prod migration failed
-    assistant: [saves feedback memory: integration tests must hit a real database, not mocks. Reason: prior incident where mock/prod divergence masked a broken migration]
-    user: stop summarizing what you just did at the end of every response, I can read the diff
-    assistant: [saves feedback memory: this user wants terse responses with no trailing summaries]
-    user: yeah the single bundled PR was the right call here, splitting this one would've just been churn
-    assistant: [saves feedback memory: for refactors in this area, user prefers one bundled PR over many small ones. Confirmed after I chose this approach — a validated judgment call, not a correction]
-    </examples>
-</type>
-<type>
-    <name>project</name>
-    <description>Information that you learn about ongoing work, goals, initiatives, bugs, or incidents within the project that is not otherwise derivable from the code or git history. Project memories help you understand the broader context and motivation behind the work the user is doing within this working directory.</description>
-    <when_to_save>When you learn who is doing what, why, or by when. These states change relatively quickly so try to keep your understanding of this up to date. Always convert relative dates in user messages to absolute dates when saving (e.g., "Thursday" → "2026-03-05"), so the memory remains interpretable after time passes.</when_to_save>
-    <how_to_use>Use these memories to more fully understand the details and nuance behind the user's request and make better informed suggestions.</how_to_use>
-    <body_structure>Lead with the fact or decision, then a **Why:** line (the motivation — often a constraint, deadline, or stakeholder ask) and a **How to apply:** line (how this should shape your suggestions). Project memories decay fast, so the why helps future-you judge whether the memory is still load-bearing.</body_structure>
-    <examples>
-    user: we're freezing all non-critical merges after Thursday — mobile team is cutting a release branch
-    assistant: [saves project memory: merge freeze begins 2026-03-05 for mobile release cut. Flag any non-critical PR work scheduled after that date]
-    user: the reason we're ripping out the old auth middleware is that legal flagged it for storing session tokens in a way that doesn't meet the new compliance requirements
-    assistant: [saves project memory: auth middleware rewrite is driven by legal/compliance requirements around session token storage, not tech-debt cleanup — scope decisions should favor compliance over ergonomics]
-    </examples>
-</type>
-<type>
-    <name>reference</name>
-    <description>Stores pointers to where information can be found in external systems. These memories allow you to remember where to look to find up-to-date information outside of the project directory.</description>
-    <when_to_save>When you learn about resources in external systems and their purpose. For example, that bugs are tracked in a specific project in Linear or that feedback can be found in a specific Slack channel.</when_to_save>
-    <how_to_use>When the user references an external system or information that may be in an external system.</how_to_use>
-    <examples>
-    user: check the Linear project "INGEST" if you want context on these tickets, that's where we track all pipeline bugs
-    assistant: [saves reference memory: pipeline bugs are tracked in Linear project "INGEST"]
-    user: the Grafana board at grafana.internal/d/api-latency is what oncall watches — if you're touching request handling, that's the thing that'll page someone
-    assistant: [saves reference memory: grafana.internal/d/api-latency is the oncall latency dashboard — check it when editing request-path code]
-    </examples>
-</type>
-</types>
-## What NOT to save in memory
-- Code patterns, conventions, architecture, file paths, or project structure — these can be derived by reading the current project state.
-- Git history, recent changes, or who-changed-what — `git log` / `git blame` are authoritative.
-- Debugging solutions or fix recipes — the fix is in the code; the commit message has the context.
-- Anything already documented in CLAUDE.md files.
-- Ephemeral task details: in-progress work, temporary state, current conversation context.
-These exclusions apply even when the user explicitly asks you to save. If they ask you to save a PR list or activity summary, ask what was *surprising* or *non-obvious* about it — that is the part worth keeping.
-## How to save memories
-Saving a memory is a two-step process:
-**Step 1** — write the memory to its own file (e.g., `user_role.md`, `feedback_testing.md`) using this frontmatter format:
-```markdown
----
-name: {{short-kebab-case-slug}}
-description: {{one-line summary — used to decide relevance in future conversations, so be specific}}
-metadata:
-  type: {{user, feedback, project, reference}}
----
-{{memory content — for feedback/project types, structure as: rule/fact, then **Why:** and **How to apply:** lines. Link related memories with [[their-name]].}}
-```
-In the body, link to related memories with `[[name]]`, where `name` is the other memory's `name:` slug. Link liberally — a `[[name]]` that doesn't match an existing memory yet is fine; it marks something worth writing later, not an error.
-**Step 2** — add a pointer to that file in `MEMORY.md`. `MEMORY.md` is an index, not a memory — each entry should be one line, under ~150 characters: `- [Title](file.md) — one-line hook`. It has no frontmatter. Never write memory content directly into `MEMORY.md`.
-- `MEMORY.md` is always loaded into your conversation context — lines after 200 will be truncated, so keep the index concise
-- Keep the name, description, and type fields in memory files up-to-date with the content
-- Organize memory semantically by topic, not chronologically
-- Update or remove memories that turn out to be wrong or outdated
-- Do not write duplicate memories. First check if there is an existing memory you can update before writing a new one.
-## When to access memories
-- When memories seem relevant, or the user references prior-conversation work.
-- You MUST access memory when the user explicitly asks you to check, recall, or remember.
-- If the user says to *ignore* or *not use* memory: Do not apply remembered facts, cite, compare against, or mention memory content.
-- Memory records can become stale over time. Use memory as context for what was true at a given point in time. Before answering the user or building assumptions based solely on information in memory records, verify that the memory is still correct and up-to-date by reading the current state of the files or resources. If a recalled memory conflicts with current information, trust what you observe now — and update or remove the stale memory rather than acting on it.
-## Before recommending from memory
-A memory that names a specific function, file, or flag is a claim that it existed *when the memory was written*. It may have been renamed, removed, or never merged. Before recommending it:
-- If the memory names a file path: check the file exists.
-- If the memory names a function or flag: grep for it.
-- If the user is about to act on your recommendation (not just asking about history), verify first.
-"The memory says X exists" is not the same as "X exists now."
-A memory that summarizes repo state (activity logs, architecture snapshots) is frozen in time. If the user asks about *recent* or *current* state, prefer `git log` or reading the code over recalling the snapshot.
-## Memory and other forms of persistence
-Memory is one of several persistence mechanisms available to you as you assist the user in a given conversation. The distinction is often that memory can be recalled in future conversations and should not be used for persisting information that is only useful within the scope of the current conversation.
-- When to use or update a plan instead of memory: If you are about to start a non-trivial implementation task and would like to reach alignment with the user on your approach you should use a Plan rather than saving this information to memory. Similarly, if you already have a plan within the conversation and you have changed your approach persist that change by updating the plan rather than saving a memory.
-- When to use or update tasks instead of memory: When you need to break your work in current conversation into discrete steps or keep track of your progress use tasks instead of saving to memory. Tasks are great for persisting information about the work that needs to be done in the current conversation, but memory should be reserved for information that will be useful in future conversations.
-- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
-## MEMORY.md
-Your MEMORY.md is currently empty. When you save new memories, they will appear here.

+---
+name: "gradio-svelte-expert"
+description: "Use this agent when building, modifying, or reviewing Gradio applications that involve custom Svelte components, Python backend logic, or UI/UX improvements. This agent should be invoked proactively after any significant code change to verify correctness, run TDD cycles, and update documentation.\\n\\n<example>\\nContext: The user wants to build a Gradio interface with a custom Svelte component.\\nuser: \"Create a Gradio interface with a custom color picker component\"\\nassistant: \"I'll use the gradio-svelte-expert agent to design and implement this properly with TDD and documentation.\"\\n<commentary>\\nSince the user wants a Gradio + Svelte component, invoke the gradio-svelte-expert agent to handle full implementation including tests and docs.\\n</commentary>\\n</example>\\n\\n<example>\\nContext: The user just wrote a new Gradio Python handler and Svelte component.\\nuser: \"I added a new file upload handler and updated the frontend component\"\\nassistant: \"Let me use the gradio-svelte-expert agent to double-check the component, run TDD verification, and update the documentation.\"\\n<commentary>\\nAfter code changes to a Gradio/Svelte codebase, proactively launch the gradio-svelte-expert agent to validate, test, and document.\\n</commentary>\\n</example>\\n\\n<example>\\nContext: User is debugging a Gradio event binding that doesn't work.\\nuser: \"My gr.Interface submit event isn't firing properly\"\\nassistant: \"I'll invoke the gradio-svelte-expert agent to diagnose the event binding issue with a TDD approach.\"\\n<commentary>\\nGradio event/binding issues are squarely in this agent's domain — use it to systematically diagnose and fix.\\n</commentary>\\n</example>"
+model: opus
+color: pink
+memory: project
+---
+You are an elite full-stack developer with deep, production-level expertise in Gradio (Python) and Svelte (JavaScript/TypeScript). You have mastered the Gradio component ecosystem (https://www.gradio.app/docs/gradio/interface) and the Svelte framework (https://svelte.dev/docs), and you combine both to build robust, well-tested, and thoroughly documented applications.
+## Core Identity
+- You are a perfectionist who leaves no stone unturned — every component is double-checked before being considered done.
+- You practice rigorous Test-Driven Development (TDD): write a failing test first, implement the minimum code to pass it, then refactor.
+- You maintain living documentation: every task ends with updated, accurate documentation.
+- Your mantra is 'tippi toppi' — everything must be clean, correct, and complete.
+## Expertise Areas
+### Gradio (Python)
+- `gr.Interface`, `gr.Blocks`, `gr.ChatInterface`, and all standard components
+- Custom component creation using the Gradio component SDK
+- Event listeners (`.click`, `.change`, `.submit`, `.upload`, etc.)
+- State management (`gr.State`), queuing, streaming, and async handlers
+- Backend Python functions: type hints, error handling, input validation
+- Gradio API mode and headless usage
+- Theming, CSS overrides, and layout composition
+- Deployment patterns (Hugging Face Spaces, Docker, etc.)
+### Svelte
+- Svelte 4 and Svelte 5 (runes syntax)
+- Component lifecycle, reactivity, stores, and bindings
+- Custom Gradio Svelte components (the `gradio-component` scaffolding)
+- Svelte + TypeScript best practices
+- Slot composition, events, and prop passing
+- CSS scoping, animations, and transitions
+- SvelteKit integration when relevant
+## TDD Workflow (Mandatory)
+For EVERY task, follow this cycle:
+1. **Red** – Write a failing test that captures the expected behavior.
+   - For Python: use `pytest` with clear test names like `test_<component>_<behavior>`
+   - For Svelte: use Vitest + `@testing-library/svelte`
+2. **Green** – Write the minimum implementation to make the test pass.
+3. **Refactor** – Clean up code without breaking tests.
+4. **Double-check** – Re-read the component spec, re-run all tests, verify edge cases.
+5. **Document** – Update all relevant documentation before closing the task.
+Never skip steps. Never mark a task complete without green tests and updated docs.
+## Component Double-Check Protocol
+Before finalizing any component (Python or Svelte), run through this checklist:
+**Python/Gradio:**
+- [ ] All input types correctly typed and validated
+- [ ] Error states handled gracefully (try/except, meaningful messages)
+- [ ] Event bindings verified against Gradio docs
+- [ ] Async/sync consistency (don't mix carelessly)
+- [ ] State management correct (no stale state)
+- [ ] Tested with edge inputs (empty, None, large, malformed)
+**Svelte:**
+- [ ] Props typed with TypeScript or JSDoc
+- [ ] Reactive declarations (`$:`) are correct and not causing loops
+- [ ] Event dispatching uses `createEventDispatcher` or Svelte 5 `$props` correctly
+- [ ] Component renders correctly in isolation (unit test)
+- [ ] Accessibility: aria labels, keyboard navigation, focus management
+- [ ] No console errors or warnings
+- [ ] CSS is scoped and doesn't leak
+## Documentation Standards
+After EVERY task, update documentation:
+1. **Inline code comments**: Explain non-obvious logic, especially Gradio event flows and Svelte reactivity patterns.
+2. **Docstrings** (Python): Every function/class gets a Google-style docstring with Args, Returns, Raises.
+3. **README.md or component docs**: Update with new components, props, usage examples, and any breaking changes.
+4. **Changelog**: Append a brief entry describing what changed and why.
+5. **Test documentation**: Each test file has a header comment explaining what suite it covers.
+Example docstring format:
+```python
+def process_image(image: np.ndarray, threshold: float = 0.5) -> dict:
+    """
+    Processes an input image and returns detection results.
+    Args:
+        image: RGB numpy array of shape (H, W, 3).
+        threshold: Confidence threshold for detections. Defaults to 0.5.
+    Returns:
+        dict with keys 'boxes', 'scores', 'labels'.
+    Raises:
+        ValueError: If image is None or has wrong number of channels.
+    """
+```
+## Code Quality Standards
+- Python: PEP 8, type hints everywhere, `ruff` or `black` formatting
+- Svelte: Prettier formatting, consistent naming (PascalCase components, camelCase props)
+- No unused imports, no dead code, no TODO comments left unresolved
+- All magic numbers extracted to named constants
+- Error messages are user-friendly and actionable
+## Interaction Style
+1. **Before coding**: Restate the requirement in your own words. If anything is ambiguous, ask one focused clarifying question.
+2. **During coding**: Narrate your TDD steps as you go — state which test you're writing and why.
+3. **After coding**: Present a summary: what was built, what tests cover it, what documentation was updated.
+4. **On errors or uncertainty**: Consult the official docs (Gradio: https://www.gradio.app/docs/gradio/interface, Svelte: https://svelte.dev/docs), cite the relevant section, and explain your reasoning.
+## Red Flags — Always Investigate
+- Gradio version mismatch (always check `import gradio as gr; print(gr.__version__)`)
+- Svelte reactivity not triggering (check for assignment vs mutation)
+- Event handlers firing multiple times (check for duplicate `.on()` registrations)
+- State shared incorrectly between users in Gradio (always use `gr.State` per-session)
+- CSS bleeding between Svelte components (check `:global()` usage)
+**Update your agent memory** as you discover patterns, architectural decisions, recurring bugs, component conventions, and testing strategies in this codebase. This builds institutional knowledge across conversations.
+Examples of what to record:
+- Custom Svelte components built and their prop interfaces
+- Gradio layout patterns and reusable block structures
+- Common test fixtures and how they're structured
+- Known edge cases or Gradio version-specific quirks encountered
+- Documentation file locations and their structure
+- Python environment setup (venv, dependencies, version constraints)
+# Persistent Agent Memory
+You have a persistent, file-based memory system at `/Users/bolyos/Development/FormScout/.claude/agent-memory/gradio-svelte-expert/`. This directory already exists — write to it directly with the Write tool (do not run mkdir or check for its existence).
+You should build up this memory system over time so that future conversations can have a complete picture of who the user is, how they'd like to collaborate with you, what behaviors to avoid or repeat, and the context behind the work the user gives you.
+If the user explicitly asks you to remember something, save it immediately as whichever type fits best. If they ask you to forget something, find and remove the relevant entry.
+## Types of memory
+There are several discrete types of memory that you can store in your memory system:
+<types>
+<type>
+    <name>user</name>
+    <description>Contain information about the user's role, goals, responsibilities, and knowledge. Great user memories help you tailor your future behavior to the user's preferences and perspective. Your goal in reading and writing these memories is to build up an understanding of who the user is and how you can be most helpful to them specifically. For example, you should collaborate with a senior software engineer differently than a student who is coding for the very first time. Keep in mind, that the aim here is to be helpful to the user. Avoid writing memories about the user that could be viewed as a negative judgement or that are not relevant to the work you're trying to accomplish together.</description>
+    <when_to_save>When you learn any details about the user's role, preferences, responsibilities, or knowledge</when_to_save>
+    <how_to_use>When your work should be informed by the user's profile or perspective. For example, if the user is asking you to explain a part of the code, you should answer that question in a way that is tailored to the specific details that they will find most valuable or that helps them build their mental model in relation to domain knowledge they already have.</how_to_use>
+    <examples>
+    user: I'm a data scientist investigating what logging we have in place
+    assistant: [saves user memory: user is a data scientist, currently focused on observability/logging]
+    user: I've been writing Go for ten years but this is my first time touching the React side of this repo
+    assistant: [saves user memory: deep Go expertise, new to React and this project's frontend — frame frontend explanations in terms of backend analogues]
+    </examples>
+</type>
+<type>
+    <name>feedback</name>
+    <description>Guidance the user has given you about how to approach work — both what to avoid and what to keep doing. These are a very important type of memory to read and write as they allow you to remain coherent and responsive to the way you should approach work in the project. Record from failure AND success: if you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious.</description>
+    <when_to_save>Any time the user corrects your approach ("no not that", "don't", "stop doing X") OR confirms a non-obvious approach worked ("yes exactly", "perfect, keep doing that", accepting an unusual choice without pushback). Corrections are easy to notice; confirmations are quieter — watch for them. In both cases, save what is applicable to future conversations, especially if surprising or not obvious from the code. Include *why* so you can judge edge cases later.</when_to_save>
+    <how_to_use>Let these memories guide your behavior so that the user does not need to offer the same guidance twice.</how_to_use>
+    <body_structure>Lead with the rule itself, then a **Why:** line (the reason the user gave — often a past incident or strong preference) and a **How to apply:** line (when/where this guidance kicks in). Knowing *why* lets you judge edge cases instead of blindly following the rule.</body_structure>
+    <examples>
+    user: don't mock the database in these tests — we got burned last quarter when mocked tests passed but the prod migration failed
+    assistant: [saves feedback memory: integration tests must hit a real database, not mocks. Reason: prior incident where mock/prod divergence masked a broken migration]
+    user: stop summarizing what you just did at the end of every response, I can read the diff
+    assistant: [saves feedback memory: this user wants terse responses with no trailing summaries]
+    user: yeah the single bundled PR was the right call here, splitting this one would've just been churn
+    assistant: [saves feedback memory: for refactors in this area, user prefers one bundled PR over many small ones. Confirmed after I chose this approach — a validated judgment call, not a correction]
+    </examples>
+</type>
+<type>
+    <name>project</name>
+    <description>Information that you learn about ongoing work, goals, initiatives, bugs, or incidents within the project that is not otherwise derivable from the code or git history. Project memories help you understand the broader context and motivation behind the work the user is doing within this working directory.</description>
+    <when_to_save>When you learn who is doing what, why, or by when. These states change relatively quickly so try to keep your understanding of this up to date. Always convert relative dates in user messages to absolute dates when saving (e.g., "Thursday" → "2026-03-05"), so the memory remains interpretable after time passes.</when_to_save>
+    <how_to_use>Use these memories to more fully understand the details and nuance behind the user's request and make better informed suggestions.</how_to_use>
+    <body_structure>Lead with the fact or decision, then a **Why:** line (the motivation — often a constraint, deadline, or stakeholder ask) and a **How to apply:** line (how this should shape your suggestions). Project memories decay fast, so the why helps future-you judge whether the memory is still load-bearing.</body_structure>
+    <examples>
+    user: we're freezing all non-critical merges after Thursday — mobile team is cutting a release branch
+    assistant: [saves project memory: merge freeze begins 2026-03-05 for mobile release cut. Flag any non-critical PR work scheduled after that date]
+    user: the reason we're ripping out the old auth middleware is that legal flagged it for storing session tokens in a way that doesn't meet the new compliance requirements
+    assistant: [saves project memory: auth middleware rewrite is driven by legal/compliance requirements around session token storage, not tech-debt cleanup — scope decisions should favor compliance over ergonomics]
+    </examples>
+</type>
+<type>
+    <name>reference</name>
+    <description>Stores pointers to where information can be found in external systems. These memories allow you to remember where to look to find up-to-date information outside of the project directory.</description>
+    <when_to_save>When you learn about resources in external systems and their purpose. For example, that bugs are tracked in a specific project in Linear or that feedback can be found in a specific Slack channel.</when_to_save>
+    <how_to_use>When the user references an external system or information that may be in an external system.</how_to_use>
+    <examples>
+    user: check the Linear project "INGEST" if you want context on these tickets, that's where we track all pipeline bugs
+    assistant: [saves reference memory: pipeline bugs are tracked in Linear project "INGEST"]
+    user: the Grafana board at grafana.internal/d/api-latency is what oncall watches — if you're touching request handling, that's the thing that'll page someone
+    assistant: [saves reference memory: grafana.internal/d/api-latency is the oncall latency dashboard — check it when editing request-path code]
+    </examples>
+</type>
+</types>
+## What NOT to save in memory
+- Code patterns, conventions, architecture, file paths, or project structure — these can be derived by reading the current project state.
+- Git history, recent changes, or who-changed-what — `git log` / `git blame` are authoritative.
+- Debugging solutions or fix recipes — the fix is in the code; the commit message has the context.
+- Anything already documented in CLAUDE.md files.
+- Ephemeral task details: in-progress work, temporary state, current conversation context.
+These exclusions apply even when the user explicitly asks you to save. If they ask you to save a PR list or activity summary, ask what was *surprising* or *non-obvious* about it — that is the part worth keeping.
+## How to save memories
+Saving a memory is a two-step process:
+**Step 1** — write the memory to its own file (e.g., `user_role.md`, `feedback_testing.md`) using this frontmatter format:
+```markdown
+---
+name: {{short-kebab-case-slug}}
+description: {{one-line summary — used to decide relevance in future conversations, so be specific}}
+metadata:
+  type: {{user, feedback, project, reference}}
+---
+{{memory content — for feedback/project types, structure as: rule/fact, then **Why:** and **How to apply:** lines. Link related memories with [[their-name]].}}
+```
+In the body, link to related memories with `[[name]]`, where `name` is the other memory's `name:` slug. Link liberally — a `[[name]]` that doesn't match an existing memory yet is fine; it marks something worth writing later, not an error.
+**Step 2** — add a pointer to that file in `MEMORY.md`. `MEMORY.md` is an index, not a memory — each entry should be one line, under ~150 characters: `- [Title](file.md) — one-line hook`. It has no frontmatter. Never write memory content directly into `MEMORY.md`.
+- `MEMORY.md` is always loaded into your conversation context — lines after 200 will be truncated, so keep the index concise
+- Keep the name, description, and type fields in memory files up-to-date with the content
+- Organize memory semantically by topic, not chronologically
+- Update or remove memories that turn out to be wrong or outdated
+- Do not write duplicate memories. First check if there is an existing memory you can update before writing a new one.
+## When to access memories
+- When memories seem relevant, or the user references prior-conversation work.
+- You MUST access memory when the user explicitly asks you to check, recall, or remember.
+- If the user says to *ignore* or *not use* memory: Do not apply remembered facts, cite, compare against, or mention memory content.
+- Memory records can become stale over time. Use memory as context for what was true at a given point in time. Before answering the user or building assumptions based solely on information in memory records, verify that the memory is still correct and up-to-date by reading the current state of the files or resources. If a recalled memory conflicts with current information, trust what you observe now — and update or remove the stale memory rather than acting on it.
+## Before recommending from memory
+A memory that names a specific function, file, or flag is a claim that it existed *when the memory was written*. It may have been renamed, removed, or never merged. Before recommending it:
+- If the memory names a file path: check the file exists.
+- If the memory names a function or flag: grep for it.
+- If the user is about to act on your recommendation (not just asking about history), verify first.
+"The memory says X exists" is not the same as "X exists now."
+A memory that summarizes repo state (activity logs, architecture snapshots) is frozen in time. If the user asks about *recent* or *current* state, prefer `git log` or reading the code over recalling the snapshot.
+## Memory and other forms of persistence
+Memory is one of several persistence mechanisms available to you as you assist the user in a given conversation. The distinction is often that memory can be recalled in future conversations and should not be used for persisting information that is only useful within the scope of the current conversation.
+- When to use or update a plan instead of memory: If you are about to start a non-trivial implementation task and would like to reach alignment with the user on your approach you should use a Plan rather than saving this information to memory. Similarly, if you already have a plan within the conversation and you have changed your approach persist that change by updating the plan rather than saving a memory.
+- When to use or update tasks instead of memory: When you need to break your work in current conversation into discrete steps or keep track of your progress use tasks instead of saving to memory. Tasks are great for persisting information about the work that needs to be done in the current conversation, but memory should be reserved for information that will be useful in future conversations.
+- Since this memory is project-scope and shared with your team via version control, tailor your memories to this project
+## MEMORY.md
+Your MEMORY.md is currently empty. When you save new memories, they will appear here.

.claude/settings.json CHANGED Viewed

@@ -1,8 +1,8 @@
-{
-  "enabledPlugins": {
-    "context7@claude-plugins-official": true,
-    "code-review@claude-plugins-official": true,
-    "claude-md-management@claude-plugins-official": true,
-    "feature-dev@claude-plugins-official": true
-  }
-}

+{
+  "enabledPlugins": {
+    "context7@claude-plugins-official": true,
+    "code-review@claude-plugins-official": true,
+    "claude-md-management@claude-plugins-official": true,
+    "feature-dev@claude-plugins-official": true
+  }
+}

.claude/settings.local.json CHANGED Viewed

@@ -1,20 +1,20 @@
-{
-  "permissions": {
-    "allow": [
-      "Bash(git -C /Users/bolyos/Development/FormScout status)",
-      "Bash(git init *)",
-      "Bash(git add *)",
-      "Bash(git commit *)",
-      "Bash(huggingface-cli version *)",
-      "Bash(huggingface-cli whoami *)",
-      "Bash(hf auth *)",
-      "Bash(hf whoami *)",
-      "Bash(git remote *)",
-      "Bash(git push *)",
-      "Bash(git fetch *)",
-      "Bash(git pull *)",
-      "Bash(git lfs *)",
-      "Bash(hf upload *)"
-    ]
-  }
-}

+{
+  "permissions": {
+    "allow": [
+      "Bash(git -C /Users/bolyos/Development/FormScout status)",
+      "Bash(git init *)",
+      "Bash(git add *)",
+      "Bash(git commit *)",
+      "Bash(huggingface-cli version *)",
+      "Bash(huggingface-cli whoami *)",
+      "Bash(hf auth *)",
+      "Bash(hf whoami *)",
+      "Bash(git remote *)",
+      "Bash(git push *)",
+      "Bash(git fetch *)",
+      "Bash(git pull *)",
+      "Bash(git lfs *)",
+      "Bash(hf upload *)"
+    ]
+  }
+}

.gitattributes CHANGED Viewed

@@ -1,37 +1,37 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
-docs/FormScout-FMS-Spec.md.pdf filter=lfs diff=lfs merge=lfs -text
-docs/plans/FormScout-Build-Prompt.md.pdf filter=lfs diff=lfs merge=lfs -text

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+docs/FormScout-FMS-Spec.md.pdf filter=lfs diff=lfs merge=lfs -text
+docs/plans/FormScout-Build-Prompt.md.pdf filter=lfs diff=lfs merge=lfs -text

.gitignore CHANGED Viewed

@@ -1,21 +1,21 @@
-__pycache__/
-*.py[cod]
-*$py.class
-*.egg-info/
-dist/
-build/
-.eggs/
-*.egg
-.env
-.venv/
-venv/
-env/
-.DS_Store
-checkpoints/
-*.pt
-*.pth
-*.gguf
-*.bin
-traces/
-*.mp4
-!tests/fixtures/*.mp4

+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+.env
+.venv/
+venv/
+env/
+.DS_Store
+checkpoints/
+*.pt
+*.pth
+*.gguf
+*.bin
+traces/
+*.mp4
+!tests/fixtures/*.mp4

.pytest_cache/.gitignore CHANGED Viewed

	@@ -1,2 +1,2 @@
1	- # Created by pytest automatically.
2	- *


1	+ # Created by pytest automatically.
2	+ *

.pytest_cache/CACHEDIR.TAG CHANGED Viewed

@@ -1,4 +1,4 @@
-Signature: 8a477f597d28d172789f06886806bc55
-# This file is a cache directory tag created by pytest.
-# For information about cache directory tags, see:
-#	https://bford.info/cachedir/spec.html

+Signature: 8a477f597d28d172789f06886806bc55
+# This file is a cache directory tag created by pytest.
+# For information about cache directory tags, see:
+#	https://bford.info/cachedir/spec.html

.pytest_cache/README.md CHANGED Viewed

@@ -1,8 +1,8 @@
-# pytest cache directory #
-This directory contains data from the pytest's cache plugin,
-which provides the `--lf` and `--ff` options, as well as the `cache` fixture.
-**Do not** commit this to version control.
-See [the docs](https://docs.pytest.org/en/stable/how-to/cache.html) for more information.

+# pytest cache directory #
+This directory contains data from the pytest's cache plugin,
+which provides the `--lf` and `--ff` options, as well as the `cache` fixture.
+**Do not** commit this to version control.
+See [the docs](https://docs.pytest.org/en/stable/how-to/cache.html) for more information.

.pytest_cache/v/cache/nodeids CHANGED Viewed

@@ -1,37 +1,37 @@
-[
-  "tests/test_biomechanics.py::TestBiomechanicsAgent::test_no_keypoints_returns_low_confidence",
-  "tests/test_biomechanics.py::TestBiomechanicsAgent::test_unimplemented_test_returns_low_confidence",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_confidence_propagates",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_never_assigns_zero",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_1_femur_not_below",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_1_knees_not_tracking",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_1_torso_not_parallel",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_2_heels_elevated",
-  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_3_all_criteria_met",
-  "tests/test_body3d.py::TestBody3DAgent::test_disabled_returns_not_used",
-  "tests/test_body3d.py::TestBody3DAgent::test_no_frames_returns_not_used",
-  "tests/test_body3d.py::TestBody3DAgent::test_result_type",
-  "tests/test_body3d.py::TestBody3DAgent::test_unavailable_checkpoint_returns_not_used",
-  "tests/test_ingest.py::TestIngestAgent::test_caps_frames",
-  "tests/test_ingest.py::TestIngestAgent::test_rejects_missing_file",
-  "tests/test_ingest.py::TestIngestAgent::test_result_is_frozen",
-  "tests/test_ingest.py::TestIngestAgent::test_returns_typed_result",
-  "tests/test_pose2d.py::TestPose2DAgent::test_graceful_on_empty_frames",
-  "tests/test_pose2d.py::TestPose2DAgent::test_keypoints_per_frame",
-  "tests/test_pose2d.py::TestPose2DAgent::test_returns_typed_result",
-  "tests/test_types.py::TestBiomechFeatures::test_invalid_view_raises",
-  "tests/test_types.py::TestBiomechFeatures::test_valid_views",
-  "tests/test_types.py::TestIngestResult::test_defaults",
-  "tests/test_types.py::TestIngestResult::test_frozen",
-  "tests/test_types.py::TestJudgeResult::test_needs_human_score_must_be_none",
-  "tests/test_types.py::TestJudgeResult::test_needs_human_with_none_score",
-  "tests/test_types.py::TestJudgeResult::test_valid_score",
-  "tests/test_types.py::TestMovementResult::test_invalid_side_raises",
-  "tests/test_types.py::TestMovementResult::test_invalid_test_raises",
-  "tests/test_types.py::TestMovementResult::test_valid_tests",
-  "tests/test_types.py::TestPipelineState::test_defaults",
-  "tests/test_types.py::TestPipelineState::test_mutable",
-  "tests/test_types.py::TestScoreResult::test_invalid_score_raises",
-  "tests/test_types.py::TestScoreResult::test_score_minus_one_invalid_when_not_needs_human",
-  "tests/test_types.py::TestScoreResult::test_valid_score"
 ]

+[
+  "tests/test_biomechanics.py::TestBiomechanicsAgent::test_no_keypoints_returns_low_confidence",
+  "tests/test_biomechanics.py::TestBiomechanicsAgent::test_unimplemented_test_returns_low_confidence",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_confidence_propagates",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_never_assigns_zero",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_1_femur_not_below",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_1_knees_not_tracking",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_1_torso_not_parallel",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_2_heels_elevated",
+  "tests/test_biomechanics.py::TestDeepSquatRubric::test_score_3_all_criteria_met",
+  "tests/test_body3d.py::TestBody3DAgent::test_disabled_returns_not_used",
+  "tests/test_body3d.py::TestBody3DAgent::test_no_frames_returns_not_used",
+  "tests/test_body3d.py::TestBody3DAgent::test_result_type",
+  "tests/test_body3d.py::TestBody3DAgent::test_unavailable_checkpoint_returns_not_used",
+  "tests/test_ingest.py::TestIngestAgent::test_caps_frames",
+  "tests/test_ingest.py::TestIngestAgent::test_rejects_missing_file",
+  "tests/test_ingest.py::TestIngestAgent::test_result_is_frozen",
+  "tests/test_ingest.py::TestIngestAgent::test_returns_typed_result",
+  "tests/test_pose2d.py::TestPose2DAgent::test_graceful_on_empty_frames",
+  "tests/test_pose2d.py::TestPose2DAgent::test_keypoints_per_frame",
+  "tests/test_pose2d.py::TestPose2DAgent::test_returns_typed_result",
+  "tests/test_types.py::TestBiomechFeatures::test_invalid_view_raises",
+  "tests/test_types.py::TestBiomechFeatures::test_valid_views",
+  "tests/test_types.py::TestIngestResult::test_defaults",
+  "tests/test_types.py::TestIngestResult::test_frozen",
+  "tests/test_types.py::TestJudgeResult::test_needs_human_score_must_be_none",
+  "tests/test_types.py::TestJudgeResult::test_needs_human_with_none_score",
+  "tests/test_types.py::TestJudgeResult::test_valid_score",
+  "tests/test_types.py::TestMovementResult::test_invalid_side_raises",
+  "tests/test_types.py::TestMovementResult::test_invalid_test_raises",
+  "tests/test_types.py::TestMovementResult::test_valid_tests",
+  "tests/test_types.py::TestPipelineState::test_defaults",
+  "tests/test_types.py::TestPipelineState::test_mutable",
+  "tests/test_types.py::TestScoreResult::test_invalid_score_raises",
+  "tests/test_types.py::TestScoreResult::test_score_minus_one_invalid_when_not_needs_human",
+  "tests/test_types.py::TestScoreResult::test_valid_score"
 ]

CLAUDE.md CHANGED Viewed

@@ -1,149 +1,149 @@
-# CLAUDE.md
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-## Project overview
-FormScout is a Gradio app (Hugging Face Space) that scores Functional Movement Screen (FMS) videos 0–3 per test with a written rationale and an annotated overlay. It is a **screening aid** — not a diagnosis, not an injury predictor. Built for the Build Small Hackathon (Backyard AI track). Full product spec is in `docs/FormScout-FMS-Spec.md`; the engineering contract is in `docs/plans/FormScout-Build-Prompt.md`.
-## Common commands
-Once the project is scaffolded:
-```bash
-# Headless pipeline test (no Gradio)
-python -m formscout.run sample.mp4
-# Run the Gradio app locally
-python app.py
-# Run all tests
-pytest tests/
-# Run a single test
-pytest tests/test_biomechanics.py::test_deep_squat_score
-# Lint / format (Python)
-ruff check . && ruff format .
-# Run Svelte component tests
-npx vitest run
-```
-## Architecture
-The pipeline is a sequence of **typed specialist agents**. Each agent accepts and returns a frozen dataclass from `formscout/types.py`. The Director in `formscout/pipeline.py` orchestrates them as a deterministic state machine (not an LLM) and applies quality gating.
-### The tiering rule (most important invariant)
-**The 2D path is the default and must stand alone as a complete, functional pipeline.** `Body3DAgent` is only activated when `config.enable_3d == True` AND the checkpoint loads successfully. If 3D is off, unavailable, or fails for any reason, `Body3DResult(used=False, ...)` is returned — this is a normal success path, not an error. `BiomechFeatures.view` is `"2d"` or `"3d"` so the `JudgeAgent` can caveat its rationale appropriately. Never put `Body3DAgent` on the critical path.
-### Build dependency order
-```
-types.py → IngestAgent → SegmentationAgent → Pose2DAgent
-→ [Body3DAgent — optional] → MovementClassifierAgent → BiomechanicsAgent
-→ ScoringAgent → RetrievalAgent → JudgeAgent → ReportAgent → Director
-```
-**Minimum working slice (build first):** Ingest → Pose2D → Biomechanics → Judge → Report
-### Target repo structure
-```
-formscout/
-  app.py                    # Gradio entrypoint
-  formscout/
-    config.py               # model IDs, thresholds, feature flags — no scattered literals
-    pipeline.py             # Director: orchestrates agents, quality-gates
-    run.py                  # headless CLI entrypoint
-    agents/
-      prompts/              # C1 (classifier) and C2 (judge) runtime system prompts — version-controlled
-    rubric/                 # one pure-function scorer per FMS test (deep_squat.py, etc.)
-    types.py                # frozen dataclasses for every agent I/O contract
-    serving/llama_cpp.py    # llama.cpp client wrappers + transformers fallbacks
-    ui/                     # Gradio theme, Svelte custom components, CSS
-    tracing.py              # structured per-agent I/O logging
-  tests/
-  requirements.txt
-  MODEL_BUDGET.md           # running param sum — must stay ≤ 32B
-  RECON.md                  # Phase 0 model/API verification findings
-```
-### Model stack (~18B total — stay under 32B)
-| Component | Model | Params | HF Access |
-|---|---|---|---|
-| 2D pose (primary) | YOLO26-Pose L/X | ~0.05B | Public (verify AGPL-3.0 implications) |
-| 2D pose (fallback) | `noahcao/sapiens-pose-coco` | — | **Accepted** |
-| Segmentation | `facebookresearch/sam3` (SAM 3.1 base) | ~0.85B | **Accepted** |
-| 3D biomechanics | `facebook/sam-3d-body-dinov3` | ~0.7–1B | **Pending** |
-| Learned scoring | ST-GCN via pyskl (fine-tuned) | ~0.01–0.05B | Apache-2.0 |
-| Judge + Classifier | Qwen3-VL-8B-Instruct (llama.cpp) | 8B | Public |
-| Retrieval | Qwen3-VL-Embedding-8B (llama.cpp) | 8B | Public |
-Track the running sum in `MODEL_BUDGET.md`. The two Qwen3-VL-8B models share a backbone. `config.pose_backend` switches between YOLO and Sapiens. ST-GCN training lives in a separate `train_scoring.py`.
-**Open question:** whether "≤ 32B" means per-model or summed across the pipeline — confirm via the hackathon Discord AMA. Design for the summed reading (safe either way).
-**SAM 3D Body access is pending.** `facebook/sam-3d-body-dinov3` is gated; access was requested June 2026 but not yet granted. Until it arrives, the 2D path is the only path — `Body3DAgent` must immediately return `Body3DResult(used=False, ...)` when `config.enable_3d` is off or the checkpoint is unavailable.
-## Key constraints and invariants
-- **No cloud model APIs.** All inference runs on-Space (ZeroGPU). No OpenAI/Anthropic/Gemini calls.
-- **Pain is never auto-scored.** Any clearing test or visible distress sets `needs_human=true` — enforced in rubric functions and `JudgeAgent`.
-- **Quality gates (Director, never silently skip):**
-  - Any agent `confidence < config.min_confidence` → mark "low confidence — physio review"
-  - `|ScoringAgent.score - JudgeAgent.score| >= 1` → mark disagreement, require review
-  - `MovementResult.test == "unknown"` → stop pipeline, surface manual override to user
-  - `JudgeAgent.needs_human == True` → no numeric score emitted for that test
-- **Composite is null** when any test is unscored (pain/unknown/deferred). Never show a partial 0–21 as complete.
-- **Bilateral tests** (Hurdle Step, In-Line Lunge, Shoulder Mobility, ASLR): score each side, report the lower, always emit the asymmetry even when scores are equal.
-- **Rubric functions are pure.** Each scorer in `rubric/` is `(features) -> ScoreResult` with no model calls.
-- **Runtime prompts are tunable artifacts.** C1 (movement classifier) and C2 (judge) live in `formscout/agents/prompts/` under version control. Most scoring quality lives in C2.
-- **Pipeline runs headless.** No Gradio imports in any agent file.
-## Engineering standards
-- Every agent: one public entrypoint, typed dataclass I/O from `types.py`, `confidence: float` and `notes: str` on every result.
-- Models load once at module/instance init — never inside the inference hot path.
-- Every agent module docstring states: purpose, inputs, outputs, failure behavior, model param count, license, and gated status.
-- All model IDs, thresholds, k-values, and feature flags live in `config.py`.
-- `tracing.py` records structured per-agent I/O for any run; one full run gets exported to the Hub.
-- Every agent ships with a pytest in `tests/` that runs on the committed sample fixture and asserts the typed contract.
-- Fix random seeds; cache model loads at startup; warm the pipeline before demo.
-## Gradio + Svelte UI guidance
-The UI uses **Gradio `gr.Blocks`** with **custom Svelte components** for bespoke UI elements (score dial, asymmetry bars, rubric drawer). Use `gradio-svelte-expert` agent for Svelte component work.
-- Default approach: `gr.Blocks` + custom CSS/theme. Escalate to `gradio.Server` only if Blocks can't express the UI.
-- Use `gr.Video`'s `playback_position` to jump the overlay to the decisive frame.
-- Use `gr.Walkthrough`/`gr.Step` for the 7-test session flow; `gr.Navbar` if splitting pages.
-- ZeroGPU: wrap heavy inference in `@spaces.GPU`; load models once at module scope.
-- A **"Screening aid — not a diagnosis. Pain or clearing tests require a clinician."** banner must always be visible.
-- Verify Gradio APIs against current docs before use — the ecosystem moves fast. Pin exact versions in `requirements.txt`.
-- Python: `ruff` + `black`. Svelte: Prettier. Tests: `pytest` (Python), `vitest` + `@testing-library/svelte` (Svelte).
-## Build phases
-No code exists yet. Start with Phase 0. Do not write implementation code before completing Phase 0 recon.
-1. **Phase 0 — Recon:** Verify all models (license, param count, GGUF, ZeroGPU compatibility). Write `RECON.md`. Confirm Gradio version. Confirm SAM 3D Body access status.
-2. **Phase 1 — Spine:** One test (Deep Squat) end-to-end: `video in → score + rationale + overlay`. Headless + Gradio. Deterministic rubric only.
-3. **Phase 2 — All 7 tests:** `MovementClassifierAgent`, `JudgeAgent`, `ReportAgent`, composite scorecard, asymmetry view, PDF export.
-4. **Phase 3 — Learned scoring + retrieval:** ST-GCN fine-tune on physio clips, publish to Hub. Embedding index for RAG via `RetrievalAgent`.
-5. **Phase 4 — Polish + ship:** Custom UI (scout/trail theme), agent trace published to Hub, blog post, demo video.
-## Badge checklist (definition of done)
-- [ ] Space runs green; upload → scorecard works on real clips
-- [ ] Param sum verified ≤ 32B in `MODEL_BUDGET.md`
-- [ ] 🔌 **Off the Grid** — no cloud model APIs anywhere in the pipeline
-- [ ] 🎯 **Well-Tuned** — fine-tuned ST-GCN head published to Hub with honest model card
-- [ ] 🎨 **Off-Brand** — custom, non-default Gradio UI (scout/trail theme)
-- [ ] 🦙 **Llama Champion** — VLM + embedder served via llama.cpp (GGUF)
-- [ ] 📡 **Sharing is Caring** — one full agent trace (all I/O) published to Hub
-- [ ] 📓 **Field Notes** — blog post written, honesty section (FMS limitations) front-and-center
-- [ ] Demo video + social post recorded
-- [ ] Safety banner present; pain/clearing never auto-scored; low-confidence flagged

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## Project overview
+FormScout is a Gradio app (Hugging Face Space) that scores Functional Movement Screen (FMS) videos 0–3 per test with a written rationale and an annotated overlay. It is a **screening aid** — not a diagnosis, not an injury predictor. Built for the Build Small Hackathon (Backyard AI track). Full product spec is in `docs/FormScout-FMS-Spec.md`; the engineering contract is in `docs/plans/FormScout-Build-Prompt.md`.
+## Common commands
+Once the project is scaffolded:
+```bash
+# Headless pipeline test (no Gradio)
+python -m formscout.run sample.mp4
+# Run the Gradio app locally
+python app.py
+# Run all tests
+pytest tests/
+# Run a single test
+pytest tests/test_biomechanics.py::test_deep_squat_score
+# Lint / format (Python)
+ruff check . && ruff format .
+# Run Svelte component tests
+npx vitest run
+```
+## Architecture
+The pipeline is a sequence of **typed specialist agents**. Each agent accepts and returns a frozen dataclass from `formscout/types.py`. The Director in `formscout/pipeline.py` orchestrates them as a deterministic state machine (not an LLM) and applies quality gating.
+### The tiering rule (most important invariant)
+**The 2D path is the default and must stand alone as a complete, functional pipeline.** `Body3DAgent` is only activated when `config.enable_3d == True` AND the checkpoint loads successfully. If 3D is off, unavailable, or fails for any reason, `Body3DResult(used=False, ...)` is returned — this is a normal success path, not an error. `BiomechFeatures.view` is `"2d"` or `"3d"` so the `JudgeAgent` can caveat its rationale appropriately. Never put `Body3DAgent` on the critical path.
+### Build dependency order
+```
+types.py → IngestAgent → SegmentationAgent → Pose2DAgent
+→ [Body3DAgent — optional] → MovementClassifierAgent → BiomechanicsAgent
+→ ScoringAgent → RetrievalAgent → JudgeAgent → ReportAgent → Director
+```
+**Minimum working slice (build first):** Ingest → Pose2D → Biomechanics → Judge → Report
+### Target repo structure
+```
+formscout/
+  app.py                    # Gradio entrypoint
+  formscout/
+    config.py               # model IDs, thresholds, feature flags — no scattered literals
+    pipeline.py             # Director: orchestrates agents, quality-gates
+    run.py                  # headless CLI entrypoint
+    agents/
+      prompts/              # C1 (classifier) and C2 (judge) runtime system prompts — version-controlled
+    rubric/                 # one pure-function scorer per FMS test (deep_squat.py, etc.)
+    types.py                # frozen dataclasses for every agent I/O contract
+    serving/llama_cpp.py    # llama.cpp client wrappers + transformers fallbacks
+    ui/                     # Gradio theme, Svelte custom components, CSS
+    tracing.py              # structured per-agent I/O logging
+  tests/
+  requirements.txt
+  MODEL_BUDGET.md           # running param sum — must stay ≤ 32B
+  RECON.md                  # Phase 0 model/API verification findings
+```
+### Model stack (~18B total — stay under 32B)
+| Component | Model | Params | HF Access |
+|---|---|---|---|
+| 2D pose (primary) | YOLO26-Pose L/X | ~0.05B | Public (verify AGPL-3.0 implications) |
+| 2D pose (fallback) | `noahcao/sapiens-pose-coco` | — | **Accepted** |
+| Segmentation | `facebookresearch/sam3` (SAM 3.1 base) | ~0.85B | **Accepted** |
+| 3D biomechanics | `facebook/sam-3d-body-dinov3` | ~0.7–1B | **Pending** |
+| Learned scoring | ST-GCN via pyskl (fine-tuned) | ~0.01–0.05B | Apache-2.0 |
+| Judge + Classifier | Qwen3-VL-8B-Instruct (llama.cpp) | 8B | Public |
+| Retrieval | Qwen3-VL-Embedding-8B (llama.cpp) | 8B | Public |
+Track the running sum in `MODEL_BUDGET.md`. The two Qwen3-VL-8B models share a backbone. `config.pose_backend` switches between YOLO and Sapiens. ST-GCN training lives in a separate `train_scoring.py`.
+**Open question:** whether "≤ 32B" means per-model or summed across the pipeline — confirm via the hackathon Discord AMA. Design for the summed reading (safe either way).
+**SAM 3D Body access is pending.** `facebook/sam-3d-body-dinov3` is gated; access was requested June 2026 but not yet granted. Until it arrives, the 2D path is the only path — `Body3DAgent` must immediately return `Body3DResult(used=False, ...)` when `config.enable_3d` is off or the checkpoint is unavailable.
+## Key constraints and invariants
+- **No cloud model APIs.** All inference runs on-Space (ZeroGPU). No OpenAI/Anthropic/Gemini calls.
+- **Pain is never auto-scored.** Any clearing test or visible distress sets `needs_human=true` — enforced in rubric functions and `JudgeAgent`.
+- **Quality gates (Director, never silently skip):**
+  - Any agent `confidence < config.min_confidence` → mark "low confidence — physio review"
+  - `|ScoringAgent.score - JudgeAgent.score| >= 1` → mark disagreement, require review
+  - `MovementResult.test == "unknown"` → stop pipeline, surface manual override to user
+  - `JudgeAgent.needs_human == True` → no numeric score emitted for that test
+- **Composite is null** when any test is unscored (pain/unknown/deferred). Never show a partial 0–21 as complete.
+- **Bilateral tests** (Hurdle Step, In-Line Lunge, Shoulder Mobility, ASLR): score each side, report the lower, always emit the asymmetry even when scores are equal.
+- **Rubric functions are pure.** Each scorer in `rubric/` is `(features) -> ScoreResult` with no model calls.
+- **Runtime prompts are tunable artifacts.** C1 (movement classifier) and C2 (judge) live in `formscout/agents/prompts/` under version control. Most scoring quality lives in C2.
+- **Pipeline runs headless.** No Gradio imports in any agent file.
+## Engineering standards
+- Every agent: one public entrypoint, typed dataclass I/O from `types.py`, `confidence: float` and `notes: str` on every result.
+- Models load once at module/instance init — never inside the inference hot path.
+- Every agent module docstring states: purpose, inputs, outputs, failure behavior, model param count, license, and gated status.
+- All model IDs, thresholds, k-values, and feature flags live in `config.py`.
+- `tracing.py` records structured per-agent I/O for any run; one full run gets exported to the Hub.
+- Every agent ships with a pytest in `tests/` that runs on the committed sample fixture and asserts the typed contract.
+- Fix random seeds; cache model loads at startup; warm the pipeline before demo.
+## Gradio + Svelte UI guidance
+The UI uses **Gradio `gr.Blocks`** with **custom Svelte components** for bespoke UI elements (score dial, asymmetry bars, rubric drawer). Use `gradio-svelte-expert` agent for Svelte component work.
+- Default approach: `gr.Blocks` + custom CSS/theme. Escalate to `gradio.Server` only if Blocks can't express the UI.
+- Use `gr.Video`'s `playback_position` to jump the overlay to the decisive frame.
+- Use `gr.Walkthrough`/`gr.Step` for the 7-test session flow; `gr.Navbar` if splitting pages.
+- ZeroGPU: wrap heavy inference in `@spaces.GPU`; load models once at module scope.
+- A **"Screening aid — not a diagnosis. Pain or clearing tests require a clinician."** banner must always be visible.
+- Verify Gradio APIs against current docs before use — the ecosystem moves fast. Pin exact versions in `requirements.txt`.
+- Python: `ruff` + `black`. Svelte: Prettier. Tests: `pytest` (Python), `vitest` + `@testing-library/svelte` (Svelte).
+## Build phases
+No code exists yet. Start with Phase 0. Do not write implementation code before completing Phase 0 recon.
+1. **Phase 0 — Recon:** Verify all models (license, param count, GGUF, ZeroGPU compatibility). Write `RECON.md`. Confirm Gradio version. Confirm SAM 3D Body access status.
+2. **Phase 1 — Spine:** One test (Deep Squat) end-to-end: `video in → score + rationale + overlay`. Headless + Gradio. Deterministic rubric only.
+3. **Phase 2 — All 7 tests:** `MovementClassifierAgent`, `JudgeAgent`, `ReportAgent`, composite scorecard, asymmetry view, PDF export.
+4. **Phase 3 — Learned scoring + retrieval:** ST-GCN fine-tune on physio clips, publish to Hub. Embedding index for RAG via `RetrievalAgent`.
+5. **Phase 4 — Polish + ship:** Custom UI (scout/trail theme), agent trace published to Hub, blog post, demo video.
+## Badge checklist (definition of done)
+- [ ] Space runs green; upload → scorecard works on real clips
+- [ ] Param sum verified ≤ 32B in `MODEL_BUDGET.md`
+- [ ] 🔌 **Off the Grid** — no cloud model APIs anywhere in the pipeline
+- [ ] 🎯 **Well-Tuned** — fine-tuned ST-GCN head published to Hub with honest model card
+- [ ] 🎨 **Off-Brand** — custom, non-default Gradio UI (scout/trail theme)
+- [ ] 🦙 **Llama Champion** — VLM + embedder served via llama.cpp (GGUF)
+- [ ] 📡 **Sharing is Caring** — one full agent trace (all I/O) published to Hub
+- [ ] 📓 **Field Notes** — blog post written, honesty section (FMS limitations) front-and-center
+- [ ] Demo video + social post recorded
+- [ ] Safety banner present; pain/clearing never auto-scored; low-confidence flagged

MODEL_BUDGET.md CHANGED Viewed

@@ -1,20 +1,20 @@
-# MODEL_BUDGET.md
-Running sum must stay ≤ 32B params.
-| Component | Model | Params |
-|---|---|---|
-| 2D Pose (primary) | YOLO26l-Pose | 0.026B |
-| 2D Pose (HQ alt) | YOLO26x-Pose | 0.058B |
-| 2D Pose (fallback) | Sapiens2 Pose | 0.6B |
-| Segmentation | SAM 3.1 base | 0.85B |
-| 3D Body (optional) | SAM 3D Body DINOv3-H+ | 0.84B |
-| Scoring Head | ST-GCN (pyskl) | 0.03B |
-| Judge/Classifier | Qwen3-VL-8B-Instruct | 8B |
-| Retrieval | Qwen3-VL-Embedding-8B | 8B |
-| **Total** | | **~18.37B** |
-Headroom: ~13.63B under 32B cap.
-Note: The two Qwen3-VL-8B models share a backbone (counted separately here for safety).
-Only one pose backend runs at a time (YOLO or Sapiens2, not both).

+# MODEL_BUDGET.md
+Running sum must stay ≤ 32B params.
+| Component | Model | Params |
+|---|---|---|
+| 2D Pose (primary) | YOLO26l-Pose | 0.026B |
+| 2D Pose (HQ alt) | YOLO26x-Pose | 0.058B |
+| 2D Pose (fallback) | Sapiens2 Pose | 0.6B |
+| Segmentation | SAM 3.1 base | 0.85B |
+| 3D Body (optional) | SAM 3D Body DINOv3-H+ | 0.84B |
+| Scoring Head | ST-GCN (pyskl) | 0.03B |
+| Judge/Classifier | Qwen3-VL-8B-Instruct | 8B |
+| Retrieval | Qwen3-VL-Embedding-8B | 8B |
+| **Total** | | **~18.37B** |
+Headroom: ~13.63B under 32B cap.
+Note: The two Qwen3-VL-8B models share a backbone (counted separately here for safety).
+Only one pose backend runs at a time (YOLO or Sapiens2, not both).

README.md CHANGED Viewed

@@ -1,39 +1,39 @@
-# FormScout
-FMS (Functional Movement Screen) scoring pipeline — a screening aid that scores movement videos 0–3 per test with a written rationale and annotated overlay.
-**⚠️ Screening aid — not a diagnosis. Pain or clearing tests require a clinician.**
-## Quick Start
-```bash
-# Install dependencies
-pip install -r requirements.txt
-# Run headless on a video
-python -m formscout.run sample.mp4
-# Launch Gradio app
-python app.py
-# Run tests
-pytest tests/ -v
-```
-## Architecture
-Typed specialist agents orchestrated by a deterministic Director:
-```
-Ingest → Pose2D → [Body3D optional] → Biomechanics → Rubric Score → [Judge] → Report
-```
-See [CLAUDE.md](CLAUDE.md) for full architecture details.
-## Model Budget
-~18B params total (under 32B cap). See [MODEL_BUDGET.md](MODEL_BUDGET.md).
-## License
-Built for the Build Small Hackathon (Backyard AI track).

+# FormScout
+FMS (Functional Movement Screen) scoring pipeline — a screening aid that scores movement videos 0–3 per test with a written rationale and annotated overlay.
+**⚠️ Screening aid — not a diagnosis. Pain or clearing tests require a clinician.**
+## Quick Start
+```bash
+# Install dependencies
+pip install -r requirements.txt
+# Run headless on a video
+python -m formscout.run sample.mp4
+# Launch Gradio app
+python app.py
+# Run tests
+pytest tests/ -v
+```
+## Architecture
+Typed specialist agents orchestrated by a deterministic Director:
+```
+Ingest → Pose2D → [Body3D optional] → Biomechanics → Rubric Score → [Judge] → Report
+```
+See [CLAUDE.md](CLAUDE.md) for full architecture details.
+## Model Budget
+~18B params total (under 32B cap). See [MODEL_BUDGET.md](MODEL_BUDGET.md).
+## License
+Built for the Build Small Hackathon (Backyard AI track).

RECON.md CHANGED Viewed

@@ -1,57 +1,57 @@
-# RECON.md
-Phase 0 reconnaissance findings — model verification, Gradio APIs, access status.
-Updated: June 4, 2026.
-## Gradio
-- Version: TBD (will verify on first `pip install gradio`)
-- gr.Blocks: expected ✓ (used in app.py skeleton)
-- gr.Video: expected ✓
-- gr.Walkthrough / gr.Step: TBD (verify in Phase 2)
-- gr.Navbar: TBD (verify in Phase 2)
-- UI approach: gr.Blocks + custom CSS/theme (escalate to Server only if needed)
-## Python
-- Python 3.13.9 (local dev)
-- pytest 9.0.2, numpy, opencv-python installed
-## Model Verification
-| Model | Params | License | GGUF | ZeroGPU | Status |
-|---|---|---|---|---|---|
-| YOLO26l-Pose (primary) | 0.026B | AGPL-3.0 | n/a | ✓ (6.5ms T4) | ready |
-| YOLO26x-Pose (HQ alt) | 0.058B | AGPL-3.0 | n/a | ✓ (12.2ms T4) | ready |
-| SAM 3.1 base (sam2.1_hiera_base_plus) | ~0.85B | SAM License | n/a | ✓ | access accepted |
-| SAM 3D Body (facebook/sam-3d-body-dinov3) | 0.84B (DINOv3-H+) | SAM License | n/a | ✓ | **INTEGRATED** |
-| Sapiens2 Pose (noahcao/sapiens-pose-coco) | ~0.6B | CC-BY-NC-4.0 | n/a | ✓ | access accepted |
-| ST-GCN (pyskl) | ~0.03B | Apache-2.0 | n/a | ✓ | ready |
-| Qwen3-VL-8B-Instruct | 8B | Apache-2.0 | ✓ | llama.cpp | ready |
-| Qwen3-VL-Embedding-8B | 8B | Apache-2.0 | ✓ | llama.cpp | ready |
-## Param Sum
-~17.63B — well under 32B limit.
-## Gated Access Status (as of Jun 4, 2026)
-- [x] SAM 3.1 (facebookresearch/sam3) — accepted
-- [x] SAM 3D Body (facebook/sam-3d-body-dinov3) — **ACCEPTED** (confirmed Jun 4)
-- [x] Sapiens2 Pose (noahcao/sapiens-pose-coco) — accepted
-## Open Questions
-- [ ] Confirm "≤32B" = summed vs per-model in Discord AMA
-- [ ] AGPL-3.0 YOLO OK for hackathon submission? (Likely yes for non-commercial demo)
-## llama.cpp Build Plan
-- CPU-only build first (avoids libcudart.so issues on Spaces)
-- Fallback: transformers + spaces.GPU for VLM inference
-- GGUF quantized Qwen3-VL-8B at Q4_K_M (~4.5GB)
-## Key Decisions
-- Primary pose: YOLO11x-Pose (fastest, well-tested)
-- Fallback pose: Sapiens2 (more keypoints, slower)
-- 3D body: INTEGRATED — uses `setup_sam_3d_body()` from `notebook.utils`, outputs MHR joints
-  - API: `estimator.process_one_image(rgb_image)` — single RGB np.ndarray
-  - Model variants: DINOv3-H+ (840M) default, ViT-H (631M) smaller
-  - Temporal smoothing via EMA (alpha=0.3) to reduce single-frame jitter
-  - config.enable_3d=False by default; flipped when checkpoint verified on Space
-- VLM: Qwen3-VL-8B via llama.cpp (Judge + Classifier)
-- Embeddings: Qwen3-VL-Embedding-8B via llama.cpp (Retrieval)

+# RECON.md
+Phase 0 reconnaissance findings — model verification, Gradio APIs, access status.
+Updated: June 4, 2026.
+## Gradio
+- Version: TBD (will verify on first `pip install gradio`)
+- gr.Blocks: expected ✓ (used in app.py skeleton)
+- gr.Video: expected ✓
+- gr.Walkthrough / gr.Step: TBD (verify in Phase 2)
+- gr.Navbar: TBD (verify in Phase 2)
+- UI approach: gr.Blocks + custom CSS/theme (escalate to Server only if needed)
+## Python
+- Python 3.13.9 (local dev)
+- pytest 9.0.2, numpy, opencv-python installed
+## Model Verification
+| Model | Params | License | GGUF | ZeroGPU | Status |
+|---|---|---|---|---|---|
+| YOLO26l-Pose (primary) | 0.026B | AGPL-3.0 | n/a | ✓ (6.5ms T4) | ready |
+| YOLO26x-Pose (HQ alt) | 0.058B | AGPL-3.0 | n/a | ✓ (12.2ms T4) | ready |
+| SAM 3.1 base (sam2.1_hiera_base_plus) | ~0.85B | SAM License | n/a | ✓ | access accepted |
+| SAM 3D Body (facebook/sam-3d-body-dinov3) | 0.84B (DINOv3-H+) | SAM License | n/a | ✓ | **INTEGRATED** |
+| Sapiens2 Pose (noahcao/sapiens-pose-coco) | ~0.6B | CC-BY-NC-4.0 | n/a | ✓ | access accepted |
+| ST-GCN (pyskl) | ~0.03B | Apache-2.0 | n/a | ✓ | ready |
+| Qwen3-VL-8B-Instruct | 8B | Apache-2.0 | ✓ | llama.cpp | ready |
+| Qwen3-VL-Embedding-8B | 8B | Apache-2.0 | ✓ | llama.cpp | ready |
+## Param Sum
+~17.63B — well under 32B limit.
+## Gated Access Status (as of Jun 4, 2026)
+- [x] SAM 3.1 (facebookresearch/sam3) — accepted
+- [x] SAM 3D Body (facebook/sam-3d-body-dinov3) — **ACCEPTED** (confirmed Jun 4)
+- [x] Sapiens2 Pose (noahcao/sapiens-pose-coco) — accepted
+## Open Questions
+- [ ] Confirm "≤32B" = summed vs per-model in Discord AMA
+- [ ] AGPL-3.0 YOLO OK for hackathon submission? (Likely yes for non-commercial demo)
+## llama.cpp Build Plan
+- CPU-only build first (avoids libcudart.so issues on Spaces)
+- Fallback: transformers + spaces.GPU for VLM inference
+- GGUF quantized Qwen3-VL-8B at Q4_K_M (~4.5GB)
+## Key Decisions
+- Primary pose: YOLO11x-Pose (fastest, well-tested)
+- Fallback pose: Sapiens2 (more keypoints, slower)
+- 3D body: INTEGRATED — uses `setup_sam_3d_body()` from `notebook.utils`, outputs MHR joints
+  - API: `estimator.process_one_image(rgb_image)` — single RGB np.ndarray
+  - Model variants: DINOv3-H+ (840M) default, ViT-H (631M) smaller
+  - Temporal smoothing via EMA (alpha=0.3) to reduce single-frame jitter
+  - config.enable_3d=False by default; flipped when checkpoint verified on Space
+- VLM: Qwen3-VL-8B via llama.cpp (Judge + Classifier)
+- Embeddings: Qwen3-VL-Embedding-8B via llama.cpp (Retrieval)

app.py CHANGED Viewed

@@ -1,287 +1,325 @@
-"""
-FormScout — Gradio app entrypoint.
-Screening aid for Functional Movement Screen (FMS) scoring.
-NOT a diagnosis. NOT an injury predictor.
-Custom scout/trail themed UI with score dial, pipeline visualization,
-rubric breakdown, and persistent safety banner.
-"""
-from __future__ import annotations
-import gradio as gr
-from formscout.pipeline import Director
-from formscout.rubric.deep_squat import score_deep_squat
-from formscout.ui.theme import formscout_theme, FORMSCOUT_CSS
-# ─── Constants ───────────────────────────────────────────────────────────────
-DISCLAIMER = (
-    "⚠️ **Screening aid — not a diagnosis. "
-    "Pain or clearing tests require a clinician.**"
-)
-FMS_TESTS = [
-    ("Deep Squat", "deep_squat"),
-    ("Hurdle Step", "hurdle_step"),
-    ("In-Line Lunge", "inline_lunge"),
-    ("Shoulder Mobility", "shoulder_mobility"),
-    ("Active Straight-Leg Raise", "active_slr"),
-    ("Trunk Stability Push-Up", "trunk_stability_pushup"),
-    ("Rotary Stability", "rotary_stability"),
-]
-SCORE_DESCRIPTIONS = {
-    3: "Movement performed to criterion — no compensation",
-    2: "Movement completed with compensation or regression",
-    1: "Unable to perform the movement pattern",
-    0: "Pain reported — clinician referral required",
-}
-# ─── Processing ──────────────────────────────────────────────────────────────
-def process_video(video_path: str, test_name: str, side: str):
-    """Process an uploaded video through the FormScout pipeline."""
-    if not video_path:
-        return (
-            _render_empty_state(),
-            "Upload a video to begin analysis.",
-            "",
-            "",
-        )
-    director = Director()
-    state = director.run(video_path, test_name=test_name, side=side)
-    # ─── Score card ───
-    score_html = _render_empty_state()
-    score_details = ""
-    if state.features and test_name == "deep_squat":
-        result = score_deep_squat(state.features)
-        score_html = _render_score_card(result.score, result.confidence, result.needs_human)
-        score_details = _render_score_details(result, state.features)
-    # ─── Pipeline info ───
-    pipeline_md = _render_pipeline_status(state)
-    # ─── Warnings/errors ───
-    alerts = _render_alerts(state)
-    return score_html, pipeline_md, score_details, alerts
-def _render_score_card(score: int, confidence: float, needs_human: bool) -> str:
-    """Render the score dial as HTML."""
-    if needs_human:
-        return """
-        <div class="score-card needs-review">
-            <div style="font-size: 1.2em; color: #fbbf24; margin-bottom: 8px;">⚠️ Needs Clinician Review</div>
-            <div style="font-size: 0.9em; color: #94a3b8;">Pain or clearing test detected — cannot auto-score</div>
-        </div>
-        """
-    conf_pct = int(confidence * 100)
-    conf_color = "#059669" if confidence >= 0.7 else "#f59e0b" if confidence >= 0.4 else "#ef4444"
-    return f"""
-    <div class="score-card">
-        <div class="score-value">{score}/3</div>
-        <div style="font-size: 0.95em; color: #94a3b8; margin-top: 4px;">
-            {SCORE_DESCRIPTIONS.get(score, '')}
-        </div>
-        <div style="margin-top: 12px;">
-            <div style="display: flex; justify-content: space-between; font-size: 0.8em; color: #64748b;">
-                <span>Confidence</span>
-                <span style="color: {conf_color};">{conf_pct}%</span>
-            </div>
-            <div class="confidence-bar">
-                <div class="confidence-fill" style="width: {conf_pct}%;"></div>
-            </div>
-        </div>
-    </div>
-    """
-def _render_empty_state() -> str:
-    """Render placeholder when no video processed yet."""
-    return """
-    <div class="score-card" style="opacity: 0.5;">
-        <div style="font-size: 2em; margin-bottom: 8px;">🏔️</div>
-        <div style="color: #64748b;">Upload a video to begin</div>
-    </div>
-    """
-def _render_score_details(result, features) -> str:
-    """Render the rubric breakdown."""
-    parts = [f"### Rationale\n{result.rationale}\n"]
-    if features.angles:
-        parts.append("### Measurements")
-        for key, val in features.angles.items():
-            label = key.replace("_", " ").title()
-            parts.append(f"- **{label}:** {val:.1f}°")
-    if features.alignments:
-        parts.append("\n### Alignment Checks")
-        for key, val in features.alignments.items():
-            label = key.replace("_", " ").title()
-            icon = "✓" if val else "✗"
-            parts.append(f"- {icon} {label}")
-    if features.view == "2d":
-        parts.append(
-            "\n> ⚠️ *2D estimate — angles are camera-angle dependent. "
-            "For best accuracy, film from the side at hip height.*"
-        )
-    return "\n".join(parts)
-def _render_pipeline_status(state) -> str:
-    """Render pipeline step summary."""
-    parts = []
-    if state.ingest:
-        parts.append(
-            f"📹 **Ingest:** {len(state.ingest.frames)} frames · "
-            f"{state.ingest.fps:.0f}fps · {state.ingest.duration:.1f}s · "
-            f"{state.ingest.width}×{state.ingest.height}"
-        )
-    if state.pose2d:
-        n = sum(1 for kps in state.pose2d.keypoints if kps)
-        parts.append(
-            f"🦴 **Pose2D:** {n}/{len(state.pose2d.keypoints)} frames detected · "
-            f"conf={state.pose2d.confidence:.0%}"
-        )
-    if state.body3d:
-        if state.body3d.used:
-            parts.append(f"🧊 **Body3D:** active · conf={state.body3d.confidence:.0%}")
-        else:
-            parts.append("🧊 **Body3D:** 2D-only path (normal)")
-    if state.features:
-        parts.append(
-            f"📐 **Biomechanics:** view={state.features.view} · "
-            f"conf={state.features.confidence:.0%}"
-        )
-    return "\n\n".join(parts) if parts else "*Processing...*"
-def _render_alerts(state) -> str:
-    """Render errors and warnings."""
-    parts = []
-    if state.errors:
-        for e in state.errors:
-            parts.append(f"🚨 {e}")
-    if state.warnings:
-        for w in state.warnings:
-            parts.append(f"⚠️ {w}")
-    return "\n\n".join(parts)
-# ─── App Builder ─────────────────────────────────────────────────────────────
-def build_app() -> gr.Blocks:
-    """Build the FormScout Gradio app with custom scout/trail theme."""
-    with gr.Blocks(
-        title="FormScout — FMS Screening Aid",
-        theme=formscout_theme(),
-        css=FORMSCOUT_CSS,
-    ) as app:
-        # Header
-        gr.HTML("""
-        <div class="formscout-header">
-            <h1>🏔️ FormScout</h1>
-            <p style="color: #94a3b8; font-size: 0.95em;">
-                Functional Movement Screen · Automated Scoring Aid
-            </p>
-        </div>
-        """)
-        # Safety banner (always visible — non-negotiable)
-        gr.HTML(f'<div class="safety-banner">{DISCLAIMER}</div>')
-        with gr.Row(equal_height=False):
-            # Left column: Input
-            with gr.Column(scale=2):
-                gr.Markdown("### 📹 Input")
-                video_input = gr.Video(label="Upload FMS Video")
-                with gr.Row():
-                    test_dropdown = gr.Dropdown(
-                        choices=[name for name, _ in FMS_TESTS],
-                        value="Deep Squat",
-                        label="FMS Test",
-                        scale=2,
-                    )
-                    side_dropdown = gr.Dropdown(
-                        choices=["N/A", "Left", "Right"],
-                        value="N/A",
-                        label="Side",
-                        scale=1,
-                    )
-                submit_btn = gr.Button(
-                    "🎯 Score Movement",
-                    variant="primary",
-                    size="lg",
-                )
-                gr.Markdown(
-                    "*Tip: Film from the side at hip height for best accuracy. "
-                    "One athlete, one rep per clip.*",
-                    elem_classes=["topo-accent"],
-                )
-            # Right column: Results
-            with gr.Column(scale=3):
-                gr.Markdown("### 📊 Results")
-                # Score display
-                score_html = gr.HTML(value=_render_empty_state())
-                # Tabs for details
-                with gr.Tabs():
-                    with gr.TabItem("📐 Rubric Breakdown"):
-                        score_details = gr.Markdown("")
-                    with gr.TabItem("🔧 Pipeline"):
-                        pipeline_md = gr.Markdown("*Waiting for video...*")
-                    with gr.TabItem("⚠️ Alerts"):
-                        alerts_md = gr.Markdown("")
-        # Footer safety banner
-        gr.HTML(f'<div class="safety-banner" style="margin-top: 20px;">{DISCLAIMER}</div>')
-        gr.Markdown(
-            "<center style='color: #64748b; font-size: 0.8em; margin-top: 12px;'>"
-            "FormScout · ~18B params · Off the Grid · "
-            "<a href='https://github.com/' style='color: #86efac;'>Built for Build Small Hackathon</a>"
-            "</center>"
-        )
-        # ─── Event wiring ────────────────────────────────────────────────────
-        def _map_inputs(video, test_display_name, side_display):
-            """Map UI display values to internal values."""
-            test_map = {name: val for name, val in FMS_TESTS}
-            test_name = test_map.get(test_display_name, "deep_squat")
-            side = {"N/A": "na", "Left": "left", "Right": "right"}.get(side_display, "na")
-            return process_video(video, test_name, side)
-        submit_btn.click(
-            fn=_map_inputs,
-            inputs=[video_input, test_dropdown, side_dropdown],
-            outputs=[score_html, pipeline_md, score_details, alerts_md],
-        )
-    return app
-if __name__ == "__main__":
-    app = build_app()
-    app.launch()

+"""
+FormScout — Gradio app entrypoint.
+Screening aid for Functional Movement Screen (FMS) scoring.
+NOT a diagnosis. NOT an injury predictor.
+Custom scout/trail themed UI with score dial, pipeline visualization,
+rubric breakdown, and persistent safety banner.
+"""
+from __future__ import annotations
+import gradio as gr
+from formscout.pipeline import Director
+from formscout.rubric import score_test
+from formscout.ui.theme import formscout_theme, FORMSCOUT_CSS
+# ─── Constants ───────────────────────────────────────────────────────────────
+DISCLAIMER = (
+    "⚠️ **Screening aid — not a diagnosis. "
+    "Pain or clearing tests require a clinician.**"
+)
+FMS_TESTS = [
+    ("Deep Squat", "deep_squat"),
+    ("Hurdle Step", "hurdle_step"),
+    ("In-Line Lunge", "inline_lunge"),
+    ("Shoulder Mobility", "shoulder_mobility"),
+    ("Active Straight-Leg Raise", "active_slr"),
+    ("Trunk Stability Push-Up", "trunk_stability_pushup"),
+    ("Rotary Stability", "rotary_stability"),
+]
+SCORE_DESCRIPTIONS = {
+    3: "Movement performed to criterion — no compensation",
+    2: "Movement completed with compensation or regression",
+    1: "Unable to perform the movement pattern",
+    0: "Pain reported — clinician referral required",
+}
+# ─── Processing ──────────────────────────────────────────────────────────────
+def process_video(video_path: str, test_name: str, side: str):
+    """Process an uploaded video through the FormScout pipeline."""
+    if not video_path:
+        return (
+            _render_empty_state(),
+            "Upload a video to begin analysis.",
+            "",
+            "",
+        )
+    director = Director()
+    state = director.run(video_path, test_name=test_name, side=side)
+    # ─── Score card ───
+    score_html = _render_empty_state()
+    score_details = ""
+    if state.features:
+        result = score_test(state.features)
+        # Use judge result if available, otherwise rubric
+        judge = state.judge
+        if judge and judge.score is not None:
+            score_html = _render_score_card(judge.score, judge.confidence, judge.needs_human)
+            score_details = _render_score_details_judge(judge, result, state.features)
+        elif judge and judge.needs_human:
+            score_html = _render_score_card(0, 0, True)
+            score_details = f"### Needs Clinician Review\n{judge.rationale}"
+        else:
+            score_html = _render_score_card(result.score, result.confidence, result.needs_human)
+            score_details = _render_score_details(result, state.features)
+    # ─── Pipeline info ───
+    pipeline_md = _render_pipeline_status(state)
+    # ─── Warnings/errors ───
+    alerts = _render_alerts(state)
+    return score_html, pipeline_md, score_details, alerts
+def _render_score_card(score: int, confidence: float, needs_human: bool) -> str:
+    """Render the score dial as HTML."""
+    if needs_human:
+        return """
+        <div class="score-card needs-review">
+            <div style="font-size: 1.2em; color: #fbbf24; margin-bottom: 8px;">⚠️ Needs Clinician Review</div>
+            <div style="font-size: 0.9em; color: #94a3b8;">Pain or clearing test detected — cannot auto-score</div>
+        </div>
+        """
+    conf_pct = int(confidence * 100)
+    conf_color = "#059669" if confidence >= 0.7 else "#f59e0b" if confidence >= 0.4 else "#ef4444"
+    return f"""
+    <div class="score-card">
+        <div class="score-value">{score}/3</div>
+        <div style="font-size: 0.95em; color: #94a3b8; margin-top: 4px;">
+            {SCORE_DESCRIPTIONS.get(score, '')}
+        </div>
+        <div style="margin-top: 12px;">
+            <div style="display: flex; justify-content: space-between; font-size: 0.8em; color: #64748b;">
+                <span>Confidence</span>
+                <span style="color: {conf_color};">{conf_pct}%</span>
+            </div>
+            <div class="confidence-bar">
+                <div class="confidence-fill" style="width: {conf_pct}%;"></div>
+            </div>
+        </div>
+    </div>
+    """
+def _render_empty_state() -> str:
+    """Render placeholder when no video processed yet."""
+    return """
+    <div class="score-card" style="opacity: 0.5;">
+        <div style="font-size: 2em; margin-bottom: 8px;">🏔️</div>
+        <div style="color: #64748b;">Upload a video to begin</div>
+    </div>
+    """
+def _render_score_details(result, features) -> str:
+    """Render the rubric breakdown."""
+    parts = [f"### Rationale\n{result.rationale}\n"]
+    if features.angles:
+        parts.append("### Measurements")
+        for key, val in features.angles.items():
+            label = key.replace("_", " ").title()
+            parts.append(f"- **{label}:** {val:.1f}°")
+    if features.alignments:
+        parts.append("\n### Alignment Checks")
+        for key, val in features.alignments.items():
+            label = key.replace("_", " ").title()
+            icon = "✓" if val else "✗"
+            parts.append(f"- {icon} {label}")
+    if features.view == "2d":
+        parts.append(
+            "\n> ⚠️ *2D estimate — angles are camera-angle dependent. "
+            "For best accuracy, film from the side at hip height.*"
+        )
+    return "\n".join(parts)
+def _render_score_details_judge(judge, rubric, features) -> str:
+    """Render judge + rubric combined breakdown."""
+    parts = [f"### Judge Rationale\n{judge.rationale}\n"]
+    if judge.compensation_tags:
+        parts.append(f"**Compensations:** {', '.join(judge.compensation_tags)}")
+    if judge.corrective_hint:
+        parts.append(f"**Corrective:** {judge.corrective_hint}")
+    parts.append(f"\n### Rubric Score: {rubric.score}/3")
+    parts.append(f"*{rubric.rationale}*")
+    if features.angles:
+        parts.append("\n### Measurements")
+        for key, val in features.angles.items():
+            label = key.replace("_", " ").title()
+            parts.append(f"- **{label}:** {val:.1f}°" if isinstance(val, float) else f"- **{label}:** {val}")
+    if features.symmetry_delta is not None:
+        parts.append(f"\n### Asymmetry\n- **L/R Delta:** {features.symmetry_delta:.1f}°")
+    if features.view == "2d":
+        parts.append(
+            "\n> ⚠️ *2D estimate — angles are camera-angle dependent.*"
+        )
+    return "\n".join(parts)
+def _render_pipeline_status(state) -> str:
+    """Render pipeline step summary."""
+    parts = []
+    if state.ingest:
+        parts.append(
+            f"📹 **Ingest:** {len(state.ingest.frames)} frames · "
+            f"{state.ingest.fps:.0f}fps · {state.ingest.duration:.1f}s · "
+            f"{state.ingest.width}×{state.ingest.height}"
+        )
+    if state.pose2d:
+        n = sum(1 for kps in state.pose2d.keypoints if kps)
+        parts.append(
+            f"🦴 **Pose2D:** {n}/{len(state.pose2d.keypoints)} frames detected · "
+            f"conf={state.pose2d.confidence:.0%}"
+        )
+    if state.body3d:
+        if state.body3d.used:
+            parts.append(f"🧊 **Body3D:** active · conf={state.body3d.confidence:.0%}")
+        else:
+            parts.append("🧊 **Body3D:** 2D-only path (normal)")
+    if state.features:
+        parts.append(
+            f"📐 **Biomechanics:** view={state.features.view} · "
+            f"conf={state.features.confidence:.0%}"
+        )
+    return "\n\n".join(parts) if parts else "*Processing...*"
+def _render_alerts(state) -> str:
+    """Render errors and warnings."""
+    parts = []
+    if state.errors:
+        for e in state.errors:
+            parts.append(f"🚨 {e}")
+    if state.warnings:
+        for w in state.warnings:
+            parts.append(f"⚠️ {w}")
+    return "\n\n".join(parts)
+# ─── App Builder ─────────────────────────────────────────────────────────────
+def build_app() -> gr.Blocks:
+    """Build the FormScout Gradio app with custom scout/trail theme."""
+    with gr.Blocks(
+        title="FormScout — FMS Screening Aid",
+        theme=formscout_theme(),
+        css=FORMSCOUT_CSS,
+    ) as app:
+        # Header
+        gr.HTML("""
+        <div class="formscout-header">
+            <h1>🏔️ FormScout</h1>
+            <p style="color: #94a3b8; font-size: 0.95em;">
+                Functional Movement Screen · Automated Scoring Aid
+            </p>
+        </div>
+        """)
+        # Safety banner (always visible — non-negotiable)
+        gr.HTML(f'<div class="safety-banner">{DISCLAIMER}</div>')
+        with gr.Row(equal_height=False):
+            # Left column: Input
+            with gr.Column(scale=2):
+                gr.Markdown("### 📹 Input")
+                video_input = gr.Video(label="Upload FMS Video")
+                with gr.Row():
+                    test_dropdown = gr.Dropdown(
+                        choices=[name for name, _ in FMS_TESTS],
+                        value="Deep Squat",
+                        label="FMS Test",
+                        scale=2,
+                    )
+                    side_dropdown = gr.Dropdown(
+                        choices=["N/A", "Left", "Right"],
+                        value="N/A",
+                        label="Side",
+                        scale=1,
+                    )
+                submit_btn = gr.Button(
+                    "🎯 Score Movement",
+                    variant="primary",
+                    size="lg",
+                )
+                gr.Markdown(
+                    "*Tip: Film from the side at hip height for best accuracy. "
+                    "One athlete, one rep per clip.*",
+                    elem_classes=["topo-accent"],
+                )
+            # Right column: Results
+            with gr.Column(scale=3):
+                gr.Markdown("### 📊 Results")
+                # Score display
+                score_html = gr.HTML(value=_render_empty_state())
+                # Tabs for details
+                with gr.Tabs():
+                    with gr.TabItem("📐 Rubric Breakdown"):
+                        score_details = gr.Markdown("")
+                    with gr.TabItem("🔧 Pipeline"):
+                        pipeline_md = gr.Markdown("*Waiting for video...*")
+                    with gr.TabItem("⚠️ Alerts"):
+                        alerts_md = gr.Markdown("")
+        # Footer safety banner
+        gr.HTML(f'<div class="safety-banner" style="margin-top: 20px;">{DISCLAIMER}</div>')
+        gr.Markdown(
+            "<center style='color: #64748b; font-size: 0.8em; margin-top: 12px;'>"
+            "FormScout · ~18B params · Off the Grid · "
+            "<a href='https://github.com/' style='color: #86efac;'>Built for Build Small Hackathon</a>"
+            "</center>"
+        )
+        # ─── Event wiring ────────────────────────────────────────────────────
+        def _map_inputs(video, test_display_name, side_display):
+            """Map UI display values to internal values."""
+            test_map = {name: val for name, val in FMS_TESTS}
+            test_name = test_map.get(test_display_name, "deep_squat")
+            side = {"N/A": "na", "Left": "left", "Right": "right"}.get(side_display, "na")
+            return process_video(video, test_name, side)
+        submit_btn.click(
+            fn=_map_inputs,
+            inputs=[video_input, test_dropdown, side_dropdown],
+            outputs=[score_html, pipeline_md, score_details, alerts_md],
+        )
+    return app
+if __name__ == "__main__":
+    app = build_app()
+    app.launch()

docs/FormScout-FMS-Spec.md CHANGED Viewed

@@ -1,277 +1,277 @@
-# FormScout — Functional Movement Screening, scored small
-**Project specification & architecture documentation**
-*Build Small Hackathon (Gradio × Hugging Face) — Track: Backyard AI*
-*Working title; rename freely. Doc version 0.1, June 2026.*
----
-## 1. One-paragraph pitch
-A basketball team's physiotherapist screens players with the **Functional Movement Screen (FMS)** — seven movement patterns, each scored 0–3 by eye. The scoring is slow, subjective, and hard to reproduce across raters or across months. FormScout is a Gradio app that takes a video of an athlete performing an FMS test, extracts 2D and 3D body pose, measures the biomechanics the FMS rubric actually cares about, and produces a 0–3 score *with a written rationale and an annotated overlay* — anchored to the physio's own previously-scored clips. It is a **screening aid that standardizes and speeds up the physio's first pass**, not a diagnosis and not an injury predictor. Everything runs on models that fit on a laptop.
----
-## 2. The problem, honestly
-The FMS is a seven-test battery (Deep Squat, Hurdle Step, In-Line Lunge, Shoulder Mobility, Active Straight-Leg Raise, Trunk Stability Push-Up, Rotary Stability), each scored 0–3 for a composite 0–21. A score of 0 means **pain** during the movement and is an automatic red flag for clinical referral. Three of the tests have associated **clearing tests** (shoulder, spinal extension, spinal flexion) that also force a 0 on pain.
-Two facts shape this project and should be stated plainly in the demo and the writeup:
-- **Inter-rater reliability is decent but not perfect.** Composite-score reliability is moderate-to-good (ICC roughly 0.7–0.8), but novice and less-experienced raters grade component scores inconsistently. This is the real, addressable pain point: **variance between raters and over time.**
-- **Predictive validity for injury is weak/mixed.** The popular "≤14 = higher injury risk" cutoff is not a reliable predictor on its own. So FormScout must **not** be sold as injury prediction.
-**Where FormScout genuinely helps:**
-1. A repeatable, objective **digital baseline** to track an athlete over a season.
-2. **Asymmetry detection** (left vs. right), which is one of the FMS's most defensible outputs.
-3. A fast, consistent **first-pass / second opinion** that reduces rater variance.
-4. **Explainability** — it shows *which compensation* it saw, not just a number.
-This honest framing is also strategic: the Backyard AI track is judged partly on "honest fit between problem and the small-model constraint." Overclaiming clinical power would hurt the submission, not help it.
----
-## 3. Why this fits the hackathon
-| Hackathon rule | How FormScout satisfies it |
-|---|---|
-| **Total params ≤ 32B** | Recommended config sums to ~18B. A portfolio of small specialists beats one monolith — which is on-theme for "think small." |
-| **Built on Gradio, hosted as a HF Space** | Gradio app with `gr.Video` input, a custom-styled results panel, on-Space inference (ZeroGPU or llama.cpp). |
-| **Show, Don't Tell** | Demo video = physio uploads a real player clip, gets a scored overlay in seconds. Social post = before/after of a manual vs. assisted screening session. |
-| **Track: Backyard AI** | The "someone you know" is the team physiotherapist. The deliverable is something they *actually use* on real players. |
-**Badge targets (aim for all six):**
-- 🔌 **Off the Grid** — no cloud APIs; all models served on the Space.
-- 🎯 **Well-Tuned** — the skeletal-temporal scoring head is fine-tuned on the physio's labels and published to the Hub.
-- 🎨 **Off-Brand** — custom Gradio frontend (scorecard UI, video overlay, per-test rubric panel), pushing past default Gradio.
-- 🦙 **Llama Champion** — VLM + embedding model served through llama.cpp (GGUF builds exist for both).
-- 📡 **Sharing is Caring** — publish the agent trace (one full screening run, agent by agent) to the Hub.
-- 📓 **Field Notes** — a blog post on building a clinical-adjacent AQA pipeline under a 32B budget, with the honesty section front and center.
----
-## 4. Core technical framing: FMS *is* Action Quality Assessment
-Don't reinvent this from scratch. **Action Quality Assessment (AQA)** is the established field for "score how well a movement was performed." Skeleton-based AQA (sports scoring, surgical-skill and rehab assessment) is the directly relevant lineage. The "Skeletal-Temporal Transformer" idea maps onto the **AQA scoring head**.
-The key design constraint is the **tiny labeled dataset** (a couple of physio-scored videos). That rules out training a large score regressor from scratch and dictates a hybrid approach:
-1. **Deterministic biomechanics** carry most of the load. The FMS rubric is, to a large degree, a set of *angle and alignment thresholds* (e.g. Deep Squat "3" = femur below horizontal, torso parallel to tibia, knees tracking over feet, dowel over feet). These are computable from 3D pose with **zero training** and are inherently interpretable — exactly what earns a physio's trust.
-2. **A small learned head** (ST-GCN or a compact temporal transformer) refines the score and captures the patterns rules miss. It is small enough to fine-tune on a few labeled clips, *especially* if pre-trained on public AQA/pose datasets first.
-3. **Retrieval over the physio's labeled clips** (RAG) gives the language model few-shot anchors at judgment time — the right move when you have examples but not enough to train on.
-4. **A VLM as the judge/explainer** synthesizes rubric + measurements + retrieved exemplars into a final score and a human-readable rationale, and conservatively flags anything pain-related for a human.
----
-## 5. Parameter budget (the single most important table)
-Assume "total parameters" = **sum of all model weights in the pipeline**. Design to this; confirm the exact interpretation in the Discord AMA.
-### Recommended config — "Portfolio of specialists" (~18B)
-| Component | Model | Params | Role |
-|---|---|---:|---|
-| 2D pose + tracking | YOLO26-Pose (L/X) | ~0.05B | Per-frame 17-keypoint skeletons, multi-person tracking |
-| Segmentation | SAM 3.1 (base) | ~0.85B | Clean athlete mask, occlusion handling, prompt for 3D |
-| 3D body | SAM 3D Body | ~0.7–1B* | Single-image 3D mesh → true joint angles, view-invariant |
-| Scoring head | ST-GCN / temporal transformer (fine-tuned) | ~0.01–0.05B | Pose-sequence → candidate 0–3 + confidence |
-| Judge / explainer | Qwen3-VL-8B-Instruct | 8B | Movement ID, rubric reasoning, final score + rationale |
-| Retrieval | Qwen3-VL-Embedding-8B | 8B | Nearest physio-scored reference clips (RAG) |
-| **Total** | | **~17.8B** | Comfortable headroom under 32B |
-\* SAM 3D Body's exact count isn't published prominently — verify on the model card. It's SAM-3-family and sub-billion-class; budget impact is small either way. The two 8B Qwen models **share the Qwen3-VL-8B backbone** (the embedder is built on the instruct model), which is conceptually clean and operationally efficient.
-### Alternative config — "Heavy reasoner" (~28.7B)
-Swap the 8B judge for **Qwen3.6-27B** (multimodal, strong tool-calling, MTP speedups on llama.cpp). Budget then = 27 + ~0.85 + ~1 + small ≈ **28.7B**. This **leaves no room for the 8B embedder**, so you'd drop RAG (or replace it with a sub-0.5B embedder, or use pose-feature similarity for retrieval). Note: Qwen3.6-27B's MTP speculative decoding currently can't run simultaneously with image input (`--mmproj`), so for vision you run it without MTP.
-**Recommendation: ship the ~18B portfolio config.** RAG over the physio's few labeled clips is worth more than raw reasoning horsepower on this task, the headroom de-risks the budget, and "many small specialists" is the better hackathon story.
----
-## 6. Model selection rationale
-**YOLO26-Pose** — current-generation YOLO pose; single forward pass for detection + keypoints, NMS-free, real-time even on edge. Tiny param cost. It also handles **multiple people in frame** (important: team videos often have other players/staff visible) and feeds keypoints downstream. Off-the-shelf it predicts COCO human keypoints; can be fine-tuned for custom landmarks (e.g. dowel endpoints) if needed.
-**SAM 3.1** — gives a clean athlete mask and stable multi-object video tracking (Object Multiplex makes it fast). Two jobs: (a) isolate the target athlete from teammates/background so pose and 3D aren't polluted, (b) provide the mask prompt that SAM 3D Body consumes. Concept prompts ("the person in the blue jersey performing the squat") are a bonus for disambiguation.
-**SAM 3D Body** — *the addition that makes the scores trustworthy.* FMS criteria are joint angles and symmetry; 2D pose can't measure these reliably across camera angles (projection ambiguity). 3D mesh recovery from a single image, promptable with the 2D keypoints + mask you already have, yields view-invariant joint angles (the MHR rig even separates skeletal structure from soft-tissue shape, which is convenient for angle extraction). This is the difference between "looks bent" and "femur is 4° above horizontal → not a 3."
-**Skeletal-temporal scoring head** — your AQA component and your **Well-Tuned** badge. Recommend a compact **ST-GCN** (graph conv over the skeleton, temporal conv over frames) over a from-scratch transformer, because it's far more data-efficient on a tiny labeled set. Pre-train on public AQA / pose-action data, then fine-tune on the physio's labels. Output: per-test candidate score + a confidence the judge can weigh.
-**Qwen3-VL-8B-Instruct** — the judge. Strong video temporal modeling (Interleaved-MRoPE, timestamp alignment) suits movement clips. It identifies which of the 7 tests is being performed, reads the biomechanics, considers retrieved exemplars and the head's candidate, and emits the final score + rationale + detected compensation. GGUF → llama.cpp → Llama Champion.
-**Qwen3-VL-Embedding-8B** — retrieval. Embeds the query clip (or its keyframes/pose-render) and finds the physio's most similar already-scored clips to anchor the judge. Top multimodal retriever on MMEB-V2; same backbone as the judge; GGUF available.
----
-## 7. Architecture — an agentic pipeline
-Structured as cooperating specialist agents (maps naturally onto an OFP-style orchestration, with a Director coordinating and quality-gating). Each agent has one job and a typed output.
-```
-                         ┌──────────────────────────────────────────────┐
-   video upload  ───────▶│  IngestAgent                                  │
-                         │  decode, normalize FPS, sample frames         │
-                         └───────────────┬──────────────────────────────┘
-                                         ▼
-                         ┌──────────────────────────────────────────────┐
-                         │  SegmentationAgent  (SAM 3.1)                 │
-                         │  athlete mask + track id (reject teammates)   │
-                         └───────────────┬──────────────────────────────┘
-                                         ▼
-              ┌──────────────────────────┴──────────────────────────┐
-              ▼                                                      ▼
-  ┌───────────────────────────┐                      ┌───────────────────────────┐
-  │ PoseAgent (YOLO26-Pose)    │                      │ Body3DAgent (SAM 3D Body)  │
-  │ 2D keypoints per frame     │ ───keypoints+mask──▶ │ 3D mesh / joint angles     │
-  └───────────────┬───────────┘                      └───────────────┬───────────┘
-                  └─────────────────────┬────────────────────────────┘
-                                        ▼
-                         ┌──────────────────────────────────────────────┐
-                         │  MovementClassifierAgent                      │
-                         │  which of the 7 FMS tests? (VLM or small CLS) │
-                         └───────────────┬──────────────────────────────┘
-                                         ▼
-              ┌──────────────────────────┴──────────────────────────┐
-              ▼                          ▼                           ▼
-  ┌────────────────────┐   ┌─────────────────────────┐   ┌────────────────────────┐
-  │ BiomechanicsAgent  │   │ ScoringAgent (ST-GCN)    │   │ RetrievalAgent          │
-  │ rubric angles,     │   │ candidate 0–3 + conf     │   │ (Qwen3-VL-Embedding)    │
-  │ ROM, symmetry,     │   │ from pose sequence       │   │ k nearest physio clips  │
-  │ alignment, timing  │   │                          │   │ + their scores          │
-  └─────────┬──────────┘   └───────────┬─────────────┘   └───────────┬────────────┘
-            └───────────────────────────┴──────────────────────────┘
-                                        ▼
-                         ┌──────────────────────────────────────────────┐
-                         │  JudgeAgent  (Qwen3-VL-8B)                    │
-                         │  rubric + measurements + exemplars + candidate│
-                         │  → final 0–3, rationale, compensation tag,    │
-                         │    corrective hint, PAIN/CLEARING → defer      │
-                         └───────────────┬────────────────────────────���─┘
-                                         ▼
-                         ┌──────────────────────────────────────────────┐
-                         │  ReportAgent                                  │
-                         │  per-test card, composite 0–21, asymmetry     │
-                         │  flags, annotated video, exportable PDF       │
-                         └──────────────────────────────────────────────┘
-```
-**Agent contracts (sketch):**
-- `IngestAgent` → `{frames[], fps, duration, n_people}`
-- `SegmentationAgent` → `{athlete_track_id, masks[]}`
-- `PoseAgent` → `{keypoints_2d[frame][joint]={x,y,conf}}`
-- `Body3DAgent` → `{joints_3d[frame][joint]={x,y,z}, mesh_optional}`
-- `MovementClassifierAgent` → `{test_name, side: left|right|n/a, confidence}`
-- `BiomechanicsAgent` → `{features: {torso_tibia_angle, hip_flexion_deg, knee_valgus_deg, dowel_alignment, L_R_symmetry, ...}}`
-- `ScoringAgent` → `{candidate_score: 0–3, confidence}`
-- `RetrievalAgent` → `{exemplars: [{clip_id, score, similarity}]}`
-- `JudgeAgent` → `{score: 0–3, rationale, compensation_tags[], corrective_hint, needs_human: bool}`
-- `ReportAgent` → `{per_test[], composite, asymmetries[], overlay_video, pdf}`
-**Quality gating:** if the ST-GCN candidate and the JudgeAgent disagree by ≥1 point, or any agent confidence is low, the report marks the test **"low confidence — physio review recommended."** This keeps the human in the loop and is itself a selling point.
----
-## 8. Scoring methodology, per test
-The seven tests reduce to measurable quantities. Build a small rubric module — one scoring function per test — that consumes the 3D features and returns a score with the triggering reason. Examples:
-- **Deep Squat (3):** femur below horizontal AND torso parallel to tibia AND knees tracking over feet AND dowel over feet. **(2):** same but achieved only with heels elevated. **(1):** criteria unmet even with heels elevated. → all four conditions are angle/alignment checks on the 3D pose.
-- **Hurdle Step / In-Line Lunge / Shoulder Mobility / ASLR:** bilateral — score each side, **record the lower** as the test score, and **always emit the asymmetry** even when the score is the same.
-- **Trunk Stability Push-Up / Rotary Stability:** trunk rigidity / timing of limb movement — temporal features from the pose sequence; the ST-GCN head is most valuable here.
-- **Pain / clearing tests (0):** the system **cannot** detect pain. Any clearing test, or a visible distress/abort, sets `needs_human = true` and the test is **not auto-scored**. Defer to the physio. State this loudly.
-Final composite = sum of seven test scores (0–21), plus an asymmetry summary. The number is never shown without its rationale.
----
-## 9. Data & fine-tuning plan (tiny-dataset survival guide)
-You have "a couple" of physio-scored clips. Treat them as gold, not as a training set.
-1. **Deterministic backbone first.** Get the biomechanics rubric working with no training. Validate the measured angles against the physio's scores qualitatively. This alone may be demo-ready.
-2. **Pre-train the ST-GCN** on public pose-action / AQA data (action recognition or generic AQA) so it learns temporal movement structure, not FMS labels.
-3. **Fine-tune on the physio's clips** with heavy augmentation: temporal crops/speed jitter, mirror (left↔right, doubles your bilateral data), camera-angle perturbation in 3D, joint noise. Few-shot, regularized, early-stopped.
-4. **Hold out at least one physio-scored clip** as a sanity check the judge never sees.
-5. **RAG instead of more training.** Every labeled clip goes into the embedding index as a scoring anchor. New clips added later improve the system with no retraining — a nice longitudinal story for the physio.
-6. **Publish the fine-tuned head** to the Hub with a model card (→ Well-Tuned badge). Include the augmentation recipe and the honest "trained on N clips, treat as assistive" caveat.
-**Label schema to collect from the physio** (if you can get a bit more data): `clip_id, athlete_id, test_name, side, score(0–3), pain(bool), compensation_notes, camera_view`. Even 20–30 well-labeled clips meaningfully helps.
----
-## 10. Gradio Space & deployment
-**UI (targets Off-Brand badge):**
-- `gr.Video` upload (or webcam capture) + a test-type selector (auto-detect, with manual override).
-- Results panel: the 0–3 score as a large dial/patch, the composite 0–21, an asymmetry strip (L/R bars), and the **rationale text**.
-- The annotated overlay video: skeleton + the specific angle that decided the score drawn on the frame where it mattered.
-- A rubric drawer that shows the official 3/2/1 criteria for the detected test, with the met/unmet conditions checked off.
-- A persistent **"Screening aid — not a diagnosis. Pain or clearing tests require a clinician."** banner.
-- Custom CSS / `gr.Server` for a non-default look (scout/trail-map theme would rhyme with the hackathon, and with your design instincts).
-**Compute:**
-- ZeroGPU (H200 slice) can host the ~18B portfolio; load pose/SAM/3D eagerly, the VLM + embedder via llama.cpp.
-- For **Off the Grid**, ensure zero external API calls — everything served on-Space.
-- For **Llama Champion**, route the VLM + embedding through llama.cpp (GGUF builds exist for Qwen3-VL-8B-Instruct, Qwen3-VL-Embedding-8B, and Qwen3.6-27B). On a Space, watch the CUDA/llama-cpp build flags — recent hackathon Spaces hit `libcudart` issues; a CPU-only or pinned-CUDA build is the usual fix.
-- Persist the embedding index and accumulated labels in Space storage for the longitudinal baseline.
----
-## 11. Clinical safety & ethics (bake this in, don't bolt it on)
-- **Not a medical device.** Screening aid only. No diagnosis, no injury prediction, no treatment advice beyond generic FMS-style correctives.
-- **Pain is out of scope** for automatic scoring — always defer to the physio.
-- **Human-in-the-loop by design:** low-confidence and disagreement cases are surfaced, not hidden.
-- **Consent & privacy:** athlete videos are biometric data. Get consent; don't log/persist clips beyond what the physio approves; document retention in the writeup.
-- **Honesty in the demo:** show a case the system gets right *and* one it flags as uncertain. Judges (and physios) trust calibrated tools more than confident ones.
----
-## 12. Build plan — two weekends (June 5–15)
-**Weekend 1 — the spine works end to end:**
-- Day 1: Space scaffold, `gr.Video` in → skeleton overlay out (YOLO26-Pose). Ingest + Segmentation + Pose agents.
-- Day 2: SAM 3D Body integrated; BiomechanicsAgent computing Deep-Squat angles; first deterministic score on a real clip.
-- Goal: upload a squat video, get a rationalized 0–3. *This alone is a viable demo.*
-**Midweek:** wire the JudgeAgent (Qwen3-VL via llama.cpp), MovementClassifier, and the rubric module for all 7 tests. Attend the AMA — confirm the param-sum interpretation.
-**Weekend 2 — make it sing:**
-- ST-GCN pre-train + few-shot fine-tune on physio clips; publish to Hub.
-- RetrievalAgent + embedding index over labeled clips.
-- Custom UI polish, asymmetry view, PDF export, safety banners.
-- Record the demo video (physio uses it on a real player), write the social post, publish the agent trace and the blog post.
----
-## 13. Risks & open questions
-- **Param-sum interpretation** — biggest unknown. The ~18B config is safe under either reading; confirm anyway.
-- **SAM 3D Body on a Space** — verify weights, license, and that it runs within ZeroGPU limits; have a 2D-only fallback (angles from 2D + camera-angle caveats) if it's too heavy.
-- **Single-camera angle limits** even with 3D — note it; recommend a consistent capture protocol (fixed camera position) for the physio, which also improves the longitudinal baseline.
-- **Tiny dataset** — the deterministic rubric must stand on its own so the demo doesn't hinge on the learned head generalizing from a few clips.
-- **llama.cpp + vision build** on Spaces — budget time for the CUDA build dance; CPU fallback for the embedder is fine.
-- **Movement misclassification** — if the wrong test is detected, scoring is meaningless; keep the manual override prominent.
----
-## 14. Quick reference — the stack
-| Layer | Choice | Badge it helps |
-|---|---|---|
-| 2D pose | YOLO26-Pose | — |
-| Segmentation/track | SAM 3.1 | — |
-| 3D biomechanics | SAM 3D Body | — |
-| Learned scoring | ST-GCN (fine-tuned, published) | Well-Tuned |
-| Judge/explainer | Qwen3-VL-8B-Instruct (llama.cpp) | Llama Champion |
-| Retrieval | Qwen3-VL-Embedding-8B (llama.cpp) | Llama Champion |
-| Serving | On-Space, no cloud APIs | Off the Grid |
-| Frontend | Custom Gradio (scout theme) | Off-Brand |
-| Trace | Published agent run on Hub | Sharing is Caring |
-| Writeup | Blog post w/ honesty section | Field Notes |
-*Total ≈ 18B params. Honest, explainable, human-in-the-loop, runs on a laptop.*

+# FormScout — Functional Movement Screening, scored small
+**Project specification & architecture documentation**
+*Build Small Hackathon (Gradio × Hugging Face) — Track: Backyard AI*
+*Working title; rename freely. Doc version 0.1, June 2026.*
+---
+## 1. One-paragraph pitch
+A basketball team's physiotherapist screens players with the **Functional Movement Screen (FMS)** — seven movement patterns, each scored 0–3 by eye. The scoring is slow, subjective, and hard to reproduce across raters or across months. FormScout is a Gradio app that takes a video of an athlete performing an FMS test, extracts 2D and 3D body pose, measures the biomechanics the FMS rubric actually cares about, and produces a 0–3 score *with a written rationale and an annotated overlay* — anchored to the physio's own previously-scored clips. It is a **screening aid that standardizes and speeds up the physio's first pass**, not a diagnosis and not an injury predictor. Everything runs on models that fit on a laptop.
+---
+## 2. The problem, honestly
+The FMS is a seven-test battery (Deep Squat, Hurdle Step, In-Line Lunge, Shoulder Mobility, Active Straight-Leg Raise, Trunk Stability Push-Up, Rotary Stability), each scored 0–3 for a composite 0–21. A score of 0 means **pain** during the movement and is an automatic red flag for clinical referral. Three of the tests have associated **clearing tests** (shoulder, spinal extension, spinal flexion) that also force a 0 on pain.
+Two facts shape this project and should be stated plainly in the demo and the writeup:
+- **Inter-rater reliability is decent but not perfect.** Composite-score reliability is moderate-to-good (ICC roughly 0.7–0.8), but novice and less-experienced raters grade component scores inconsistently. This is the real, addressable pain point: **variance between raters and over time.**
+- **Predictive validity for injury is weak/mixed.** The popular "≤14 = higher injury risk" cutoff is not a reliable predictor on its own. So FormScout must **not** be sold as injury prediction.
+**Where FormScout genuinely helps:**
+1. A repeatable, objective **digital baseline** to track an athlete over a season.
+2. **Asymmetry detection** (left vs. right), which is one of the FMS's most defensible outputs.
+3. A fast, consistent **first-pass / second opinion** that reduces rater variance.
+4. **Explainability** — it shows *which compensation* it saw, not just a number.
+This honest framing is also strategic: the Backyard AI track is judged partly on "honest fit between problem and the small-model constraint." Overclaiming clinical power would hurt the submission, not help it.
+---
+## 3. Why this fits the hackathon
+| Hackathon rule | How FormScout satisfies it |
+|---|---|
+| **Total params ≤ 32B** | Recommended config sums to ~18B. A portfolio of small specialists beats one monolith — which is on-theme for "think small." |
+| **Built on Gradio, hosted as a HF Space** | Gradio app with `gr.Video` input, a custom-styled results panel, on-Space inference (ZeroGPU or llama.cpp). |
+| **Show, Don't Tell** | Demo video = physio uploads a real player clip, gets a scored overlay in seconds. Social post = before/after of a manual vs. assisted screening session. |
+| **Track: Backyard AI** | The "someone you know" is the team physiotherapist. The deliverable is something they *actually use* on real players. |
+**Badge targets (aim for all six):**
+- 🔌 **Off the Grid** — no cloud APIs; all models served on the Space.
+- 🎯 **Well-Tuned** — the skeletal-temporal scoring head is fine-tuned on the physio's labels and published to the Hub.
+- 🎨 **Off-Brand** — custom Gradio frontend (scorecard UI, video overlay, per-test rubric panel), pushing past default Gradio.
+- 🦙 **Llama Champion** — VLM + embedding model served through llama.cpp (GGUF builds exist for both).
+- 📡 **Sharing is Caring** — publish the agent trace (one full screening run, agent by agent) to the Hub.
+- 📓 **Field Notes** — a blog post on building a clinical-adjacent AQA pipeline under a 32B budget, with the honesty section front and center.
+---
+## 4. Core technical framing: FMS *is* Action Quality Assessment
+Don't reinvent this from scratch. **Action Quality Assessment (AQA)** is the established field for "score how well a movement was performed." Skeleton-based AQA (sports scoring, surgical-skill and rehab assessment) is the directly relevant lineage. The "Skeletal-Temporal Transformer" idea maps onto the **AQA scoring head**.
+The key design constraint is the **tiny labeled dataset** (a couple of physio-scored videos). That rules out training a large score regressor from scratch and dictates a hybrid approach:
+1. **Deterministic biomechanics** carry most of the load. The FMS rubric is, to a large degree, a set of *angle and alignment thresholds* (e.g. Deep Squat "3" = femur below horizontal, torso parallel to tibia, knees tracking over feet, dowel over feet). These are computable from 3D pose with **zero training** and are inherently interpretable — exactly what earns a physio's trust.
+2. **A small learned head** (ST-GCN or a compact temporal transformer) refines the score and captures the patterns rules miss. It is small enough to fine-tune on a few labeled clips, *especially* if pre-trained on public AQA/pose datasets first.
+3. **Retrieval over the physio's labeled clips** (RAG) gives the language model few-shot anchors at judgment time — the right move when you have examples but not enough to train on.
+4. **A VLM as the judge/explainer** synthesizes rubric + measurements + retrieved exemplars into a final score and a human-readable rationale, and conservatively flags anything pain-related for a human.
+---
+## 5. Parameter budget (the single most important table)
+Assume "total parameters" = **sum of all model weights in the pipeline**. Design to this; confirm the exact interpretation in the Discord AMA.
+### Recommended config — "Portfolio of specialists" (~18B)
+| Component | Model | Params | Role |
+|---|---|---:|---|
+| 2D pose + tracking | YOLO26-Pose (L/X) | ~0.05B | Per-frame 17-keypoint skeletons, multi-person tracking |
+| Segmentation | SAM 3.1 (base) | ~0.85B | Clean athlete mask, occlusion handling, prompt for 3D |
+| 3D body | SAM 3D Body | ~0.7–1B* | Single-image 3D mesh → true joint angles, view-invariant |
+| Scoring head | ST-GCN / temporal transformer (fine-tuned) | ~0.01–0.05B | Pose-sequence → candidate 0–3 + confidence |
+| Judge / explainer | Qwen3-VL-8B-Instruct | 8B | Movement ID, rubric reasoning, final score + rationale |
+| Retrieval | Qwen3-VL-Embedding-8B | 8B | Nearest physio-scored reference clips (RAG) |
+| **Total** | | **~17.8B** | Comfortable headroom under 32B |
+\* SAM 3D Body's exact count isn't published prominently — verify on the model card. It's SAM-3-family and sub-billion-class; budget impact is small either way. The two 8B Qwen models **share the Qwen3-VL-8B backbone** (the embedder is built on the instruct model), which is conceptually clean and operationally efficient.
+### Alternative config — "Heavy reasoner" (~28.7B)
+Swap the 8B judge for **Qwen3.6-27B** (multimodal, strong tool-calling, MTP speedups on llama.cpp). Budget then = 27 + ~0.85 + ~1 + small ≈ **28.7B**. This **leaves no room for the 8B embedder**, so you'd drop RAG (or replace it with a sub-0.5B embedder, or use pose-feature similarity for retrieval). Note: Qwen3.6-27B's MTP speculative decoding currently can't run simultaneously with image input (`--mmproj`), so for vision you run it without MTP.
+**Recommendation: ship the ~18B portfolio config.** RAG over the physio's few labeled clips is worth more than raw reasoning horsepower on this task, the headroom de-risks the budget, and "many small specialists" is the better hackathon story.
+---
+## 6. Model selection rationale
+**YOLO26-Pose** — current-generation YOLO pose; single forward pass for detection + keypoints, NMS-free, real-time even on edge. Tiny param cost. It also handles **multiple people in frame** (important: team videos often have other players/staff visible) and feeds keypoints downstream. Off-the-shelf it predicts COCO human keypoints; can be fine-tuned for custom landmarks (e.g. dowel endpoints) if needed.
+**SAM 3.1** — gives a clean athlete mask and stable multi-object video tracking (Object Multiplex makes it fast). Two jobs: (a) isolate the target athlete from teammates/background so pose and 3D aren't polluted, (b) provide the mask prompt that SAM 3D Body consumes. Concept prompts ("the person in the blue jersey performing the squat") are a bonus for disambiguation.
+**SAM 3D Body** — *the addition that makes the scores trustworthy.* FMS criteria are joint angles and symmetry; 2D pose can't measure these reliably across camera angles (projection ambiguity). 3D mesh recovery from a single image, promptable with the 2D keypoints + mask you already have, yields view-invariant joint angles (the MHR rig even separates skeletal structure from soft-tissue shape, which is convenient for angle extraction). This is the difference between "looks bent" and "femur is 4° above horizontal → not a 3."
+**Skeletal-temporal scoring head** — your AQA component and your **Well-Tuned** badge. Recommend a compact **ST-GCN** (graph conv over the skeleton, temporal conv over frames) over a from-scratch transformer, because it's far more data-efficient on a tiny labeled set. Pre-train on public AQA / pose-action data, then fine-tune on the physio's labels. Output: per-test candidate score + a confidence the judge can weigh.
+**Qwen3-VL-8B-Instruct** — the judge. Strong video temporal modeling (Interleaved-MRoPE, timestamp alignment) suits movement clips. It identifies which of the 7 tests is being performed, reads the biomechanics, considers retrieved exemplars and the head's candidate, and emits the final score + rationale + detected compensation. GGUF → llama.cpp → Llama Champion.
+**Qwen3-VL-Embedding-8B** — retrieval. Embeds the query clip (or its keyframes/pose-render) and finds the physio's most similar already-scored clips to anchor the judge. Top multimodal retriever on MMEB-V2; same backbone as the judge; GGUF available.
+---
+## 7. Architecture — an agentic pipeline
+Structured as cooperating specialist agents (maps naturally onto an OFP-style orchestration, with a Director coordinating and quality-gating). Each agent has one job and a typed output.
+```
+                         ┌──────────────────────────────────────────────┐
+   video upload  ───────▶│  IngestAgent                                  │
+                         │  decode, normalize FPS, sample frames         │
+                         └───────────────┬──────────────────────────────┘
+                                         ▼
+                         ┌──────────────────────────────────────────────┐
+                         │  SegmentationAgent  (SAM 3.1)                 │
+                         │  athlete mask + track id (reject teammates)   │
+                         └───────────────┬──────────────────────────────┘
+                                         ▼
+              ┌──────────────────────────┴──────────────────────────┐
+              ▼                                                      ▼
+  ┌───────────────────────────┐                      ┌───────────────────────────┐
+  │ PoseAgent (YOLO26-Pose)    │                      │ Body3DAgent (SAM 3D Body)  │
+  │ 2D keypoints per frame     │ ───keypoints+mask──▶ │ 3D mesh / joint angles     │
+  └───────────────┬───────────┘                      └───────────────┬───────────┘
+                  └─────────────────────┬────────────────────────────┘
+                                        ▼
+                         ┌──────────────────────────────────────────────┐
+                         │  MovementClassifierAgent                      │
+                         │  which of the 7 FMS tests? (VLM or small CLS) │
+                         └───────────────┬──────────────────────────────┘
+                                         ▼
+              ┌──────────────────────────┴──────────────────────────┐
+              ▼                          ▼                           ▼
+  ┌────────────────────┐   ┌─────────────────────────┐   ┌────────────────────────┐
+  │ BiomechanicsAgent  │   │ ScoringAgent (ST-GCN)    │   │ RetrievalAgent          │
+  │ rubric angles,     │   │ candidate 0–3 + conf     │   │ (Qwen3-VL-Embedding)    │
+  │ ROM, symmetry,     │   │ from pose sequence       │   │ k nearest physio clips  │
+  │ alignment, timing  │   │                          │   │ + their scores          │
+  └─────────┬──────────┘   └───────────┬─────────────┘   └───────────┬────────────┘
+            └───────────────────────────┴──────────────────────────┘
+                                        ▼
+                         ┌──────────────────────────────────────────────┐
+                         │  JudgeAgent  (Qwen3-VL-8B)                    │
+                         │  rubric + measurements + exemplars + candidate│
+                         │  → final 0–3, rationale, compensation tag,    │
+                         │    corrective hint, PAIN/CLEARING → defer      │
+                         └───────────────┬──────────────────────────────┘
+                                         ▼
+                         ┌──────────────────────────────────────────────┐
+                         │  ReportAgent                                  │
+                         │  per-test card, composite 0–21, asymmetry     │
+                         │  flags, annotated video, exportable PDF       │
+                         └──────────────────────────────────────────────┘
+```
+**Agent contracts (sketch):**
+- `IngestAgent` → `{frames[], fps, duration, n_people}`
+- `SegmentationAgent` → `{athlete_track_id, masks[]}`
+- `PoseAgent` → `{keypoints_2d[frame][joint]={x,y,conf}}`
+- `Body3DAgent` → `{joints_3d[frame][joint]={x,y,z}, mesh_optional}`
+- `MovementClassifierAgent` → `{test_name, side: left|right|n/a, confidence}`
+- `BiomechanicsAgent` → `{features: {torso_tibia_angle, hip_flexion_deg, knee_valgus_deg, dowel_alignment, L_R_symmetry, ...}}`
+- `ScoringAgent` → `{candidate_score: 0–3, confidence}`
+- `RetrievalAgent` → `{exemplars: [{clip_id, score, similarity}]}`
+- `JudgeAgent` → `{score: 0–3, rationale, compensation_tags[], corrective_hint, needs_human: bool}`
+- `ReportAgent` → `{per_test[], composite, asymmetries[], overlay_video, pdf}`
+**Quality gating:** if the ST-GCN candidate and the JudgeAgent disagree by ≥1 point, or any agent confidence is low, the report marks the test **"low confidence — physio review recommended."** This keeps the human in the loop and is itself a selling point.
+---
+## 8. Scoring methodology, per test
+The seven tests reduce to measurable quantities. Build a small rubric module — one scoring function per test — that consumes the 3D features and returns a score with the triggering reason. Examples:
+- **Deep Squat (3):** femur below horizontal AND torso parallel to tibia AND knees tracking over feet AND dowel over feet. **(2):** same but achieved only with heels elevated. **(1):** criteria unmet even with heels elevated. → all four conditions are angle/alignment checks on the 3D pose.
+- **Hurdle Step / In-Line Lunge / Shoulder Mobility / ASLR:** bilateral — score each side, **record the lower** as the test score, and **always emit the asymmetry** even when the score is the same.
+- **Trunk Stability Push-Up / Rotary Stability:** trunk rigidity / timing of limb movement — temporal features from the pose sequence; the ST-GCN head is most valuable here.
+- **Pain / clearing tests (0):** the system **cannot** detect pain. Any clearing test, or a visible distress/abort, sets `needs_human = true` and the test is **not auto-scored**. Defer to the physio. State this loudly.
+Final composite = sum of seven test scores (0–21), plus an asymmetry summary. The number is never shown without its rationale.
+---
+## 9. Data & fine-tuning plan (tiny-dataset survival guide)
+You have "a couple" of physio-scored clips. Treat them as gold, not as a training set.
+1. **Deterministic backbone first.** Get the biomechanics rubric working with no training. Validate the measured angles against the physio's scores qualitatively. This alone may be demo-ready.
+2. **Pre-train the ST-GCN** on public pose-action / AQA data (action recognition or generic AQA) so it learns temporal movement structure, not FMS labels.
+3. **Fine-tune on the physio's clips** with heavy augmentation: temporal crops/speed jitter, mirror (left↔right, doubles your bilateral data), camera-angle perturbation in 3D, joint noise. Few-shot, regularized, early-stopped.
+4. **Hold out at least one physio-scored clip** as a sanity check the judge never sees.
+5. **RAG instead of more training.** Every labeled clip goes into the embedding index as a scoring anchor. New clips added later improve the system with no retraining — a nice longitudinal story for the physio.
+6. **Publish the fine-tuned head** to the Hub with a model card (→ Well-Tuned badge). Include the augmentation recipe and the honest "trained on N clips, treat as assistive" caveat.
+**Label schema to collect from the physio** (if you can get a bit more data): `clip_id, athlete_id, test_name, side, score(0–3), pain(bool), compensation_notes, camera_view`. Even 20–30 well-labeled clips meaningfully helps.
+---
+## 10. Gradio Space & deployment
+**UI (targets Off-Brand badge):**
+- `gr.Video` upload (or webcam capture) + a test-type selector (auto-detect, with manual override).
+- Results panel: the 0–3 score as a large dial/patch, the composite 0–21, an asymmetry strip (L/R bars), and the **rationale text**.
+- The annotated overlay video: skeleton + the specific angle that decided the score drawn on the frame where it mattered.
+- A rubric drawer that shows the official 3/2/1 criteria for the detected test, with the met/unmet conditions checked off.
+- A persistent **"Screening aid — not a diagnosis. Pain or clearing tests require a clinician."** banner.
+- Custom CSS / `gr.Server` for a non-default look (scout/trail-map theme would rhyme with the hackathon, and with your design instincts).
+**Compute:**
+- ZeroGPU (H200 slice) can host the ~18B portfolio; load pose/SAM/3D eagerly, the VLM + embedder via llama.cpp.
+- For **Off the Grid**, ensure zero external API calls — everything served on-Space.
+- For **Llama Champion**, route the VLM + embedding through llama.cpp (GGUF builds exist for Qwen3-VL-8B-Instruct, Qwen3-VL-Embedding-8B, and Qwen3.6-27B). On a Space, watch the CUDA/llama-cpp build flags — recent hackathon Spaces hit `libcudart` issues; a CPU-only or pinned-CUDA build is the usual fix.
+- Persist the embedding index and accumulated labels in Space storage for the longitudinal baseline.
+---
+## 11. Clinical safety & ethics (bake this in, don't bolt it on)
+- **Not a medical device.** Screening aid only. No diagnosis, no injury prediction, no treatment advice beyond generic FMS-style correctives.
+- **Pain is out of scope** for automatic scoring — always defer to the physio.
+- **Human-in-the-loop by design:** low-confidence and disagreement cases are surfaced, not hidden.
+- **Consent & privacy:** athlete videos are biometric data. Get consent; don't log/persist clips beyond what the physio approves; document retention in the writeup.
+- **Honesty in the demo:** show a case the system gets right *and* one it flags as uncertain. Judges (and physios) trust calibrated tools more than confident ones.
+---
+## 12. Build plan — two weekends (June 5–15)
+**Weekend 1 — the spine works end to end:**
+- Day 1: Space scaffold, `gr.Video` in → skeleton overlay out (YOLO26-Pose). Ingest + Segmentation + Pose agents.
+- Day 2: SAM 3D Body integrated; BiomechanicsAgent computing Deep-Squat angles; first deterministic score on a real clip.
+- Goal: upload a squat video, get a rationalized 0–3. *This alone is a viable demo.*
+**Midweek:** wire the JudgeAgent (Qwen3-VL via llama.cpp), MovementClassifier, and the rubric module for all 7 tests. Attend the AMA — confirm the param-sum interpretation.
+**Weekend 2 — make it sing:**
+- ST-GCN pre-train + few-shot fine-tune on physio clips; publish to Hub.
+- RetrievalAgent + embedding index over labeled clips.
+- Custom UI polish, asymmetry view, PDF export, safety banners.
+- Record the demo video (physio uses it on a real player), write the social post, publish the agent trace and the blog post.
+---
+## 13. Risks & open questions
+- **Param-sum interpretation** — biggest unknown. The ~18B config is safe under either reading; confirm anyway.
+- **SAM 3D Body on a Space** — verify weights, license, and that it runs within ZeroGPU limits; have a 2D-only fallback (angles from 2D + camera-angle caveats) if it's too heavy.
+- **Single-camera angle limits** even with 3D — note it; recommend a consistent capture protocol (fixed camera position) for the physio, which also improves the longitudinal baseline.
+- **Tiny dataset** — the deterministic rubric must stand on its own so the demo doesn't hinge on the learned head generalizing from a few clips.
+- **llama.cpp + vision build** on Spaces — budget time for the CUDA build dance; CPU fallback for the embedder is fine.
+- **Movement misclassification** — if the wrong test is detected, scoring is meaningless; keep the manual override prominent.
+---
+## 14. Quick reference — the stack
+| Layer | Choice | Badge it helps |
+|---|---|---|
+| 2D pose | YOLO26-Pose | — |
+| Segmentation/track | SAM 3.1 | — |
+| 3D biomechanics | SAM 3D Body | — |
+| Learned scoring | ST-GCN (fine-tuned, published) | Well-Tuned |
+| Judge/explainer | Qwen3-VL-8B-Instruct (llama.cpp) | Llama Champion |
+| Retrieval | Qwen3-VL-Embedding-8B (llama.cpp) | Llama Champion |
+| Serving | On-Space, no cloud APIs | Off the Grid |
+| Frontend | Custom Gradio (scout theme) | Off-Brand |
+| Trace | Published agent run on Hub | Sharing is Caring |
+| Writeup | Blog post w/ honesty section | Field Notes |
+*Total ≈ 18B params. Honest, explainable, human-in-the-loop, runs on a laptop.*

docs/FormScout-Starter-Kit.md CHANGED Viewed

@@ -1,169 +1,169 @@
-# FormScout — Starter Kit & Resource Pack
-Companion to `FormScout-FMS-Spec.md` and `FormScout-Build-Prompt.md`. Every link below was checked. Read §1 first — some items are time-sensitive and block the build if you leave them late.
----
-## 1. Do this NOW (before the hack window — some take hours to clear)
-- [ ] **Request access to the gated Meta checkpoints today.** Both are gated on Hugging Face and approval isn't instant:
-  - SAM 3 / SAM 3.1 — request on the SAM 3 repos (you need the latest code for the 3.1 checkpoints).
-  - SAM 3D Body — `facebook/sam-3d-body-dinov3` and `facebook/sam-3d-body-vith` both require an access request, then an authenticated download. **Note:** data/checkpoints are blocked in sanctioned jurisdictions — shouldn't affect SK, but verify.
-- [ ] **Put your HF token in the Space secrets** so the Space can pull the gated weights at build time.
-- [ ] **Check licenses before you commit to a model** (this affects whether you can even submit):
-  - Qwen3-VL-8B / Qwen3-VL-Embedding-8B / Qwen3.6 → **Apache-2.0** (clean).
-  - SAM 3 / SAM 3.1 / SAM 3D Body → **SAM License** (not Apache; read the terms — there are use restrictions).
-  - Ultralytics YOLO26 → historically **AGPL-3.0** (open-sourcing obligations; commercial license exists). Verify on the model/repo and make sure an AGPL dependency is OK for your submission. If it's a problem, RTMPose/ViTPose are alternatives.
-  - pyskl / MMAction2 → Apache-2.0.
-  - KIMORE / UI-PRMD → academic/research terms; check before redistributing anything derived.
-- [ ] **Confirm the param-counting rule in the Discord AMA.** Specifically: (a) is it summed across the pipeline or per-model? (b) do **frozen** base models count? (c) does a LoRA adapter's base count? Your ~18B config is safe under the strict reading either way, but get it on record.
----
-## 2. Literature package
-### 2.1 The framing that wins — "evaluate like an FMS reliability study"
-The single most credible move in your writeup: evaluate FormScout the way the clinical literature evaluates human FMS raters. Treat the model as a *second rater* and report **weighted Cohen's κ** and **ICC** against the physio, the exact metrics the reliability papers use. That instantly makes your results legible to any sports-medicine reader and is far more honest than a vanity accuracy number.
-| Resource | What it gives you | Link |
-|---|---|---|
-| Physiopedia — FMS | Clean overview of the 7 tests + 0–21 scoring | https://www.physio-pedia.com/Functional_Movement_Screen_(FMS) |
-| FMS reliability study (JOSPT 2012) | The ICC/κ numbers and method you'll mirror in your eval | https://www.jospt.org/doi/10.2519/jospt.2012.3838 |
-| FMS in elite youth soccer (PMC) | Per-test scores, asymmetries, clearing-test order | https://pmc.ncbi.nlm.nih.gov/articles/PMC5675373/ |
-| Clinician's guide to FMS scoring | Per-test 3/2/1 criteria in plain language (rubric source) | https://meloqdevices.com/blogs/meloq-updates/functional-movement-screening |
-> **Honesty anchor for the blog post:** the popular "≤14 → injury risk" cutoff has weak/mixed predictive validity. Sell standardization, asymmetry detection, and a repeatable baseline — not prediction.
-### 2.2 Action Quality Assessment — surveys & living lists
-| Resource | Why | Link |
-|---|---|---|
-| *A Decade of AQA* (survey, 2025, 200+ papers, PRISMA) | The map of the whole field; start here | https://arxiv.org/abs/2502.02817 · code: https://github.com/HaoYin116/Survey_of_AQA |
-| *Comprehensive Survey of AQA: Method & Benchmark* (2024) | Taxonomy by modality (video / **skeleton** / multimodal) + unified benchmark | https://arxiv.org/abs/2412.11149 · page: https://zhoukanglei.github.io/AQA-Survey |
-| Awesome-AQA (ZhouKanglei) | Curated, **has a Medical-Care/rehab section** — your closest analogues | https://github.com/ZhouKanglei/Awesome-AQA |
-| Awesome-AQA (Lyman-Smoker) | Second list; catches papers the other misses (FLEX, ExAct, etc.) | https://github.com/Lyman-Smoker/Awesome-AQA |
-### 2.3 Skeleton-based scoring — the methods your head will borrow from
-| Paper | Relevance to FormScout | Link |
-|---|---|---|
-| ST-GCN (original) | The graph-over-skeleton + temporal-conv backbone | https://github.com/open-mmlab/mmaction2/blob/main/configs/skeleton/stgcn/README.md |
-| AQA via Hierarchical **Pose-guided** Multi-Stage Contrastive Regression (TIP 2025) | Pose-guided + contrastive regression with few labels — close to your setup | https://arxiv.org/abs/2501.03674 |
-| Attention-guided Movement **Quality** Assessment + skeletal augmentation (UI-PRMD/KIMORE) | Transformer MQA on clinician-scored rehab data; **augmentation recipe for tiny sets** | https://arxiv.org/pdf/2204.07840 |
-| SSL-Rehab: self-supervised 3D skeleton + **LoRA** fine-tune (KIMORE/UI-PRMD) | Pretrain→LoRA recipe for small clinical datasets (uses your LoRA muscle) | https://www.sciencedirect.com/science/article/abs/pii/S1077314224003564 |
-| Skeleton-based AQA w/ anomaly-aware DTW (Sensors 2025) | DTW alignment + anomaly scoring; cheap, label-light baseline | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693942/ |
----
-## 3. Models & tooling (verified)
-| Component | Repo / card | Params | License | Gated? |
-|---|---|---:|---|---|
-| YOLO26-Pose | https://docs.ultralytics.com/tasks/pose | <0.1B | AGPL-3.0* | no |
-| SAM 3.1 | https://github.com/facebookresearch/sam3 | ~0.85B | SAM License | **yes** |
-| SAM 3D Body | https://github.com/facebookresearch/sam-3d-body · https://huggingface.co/facebook/sam-3d-body-dinov3 | sub-1B† | SAM License | **yes** |
-| ST-GCN++ / PoseConv3D | https://github.com/kennymckormick/pyskl | ~0.01–0.05B | Apache-2.0 | no |
-| Qwen3-VL-8B-Instruct | https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct | 8B | Apache-2.0 | no |
-| Qwen3-VL-Embedding-8B | https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B (GGUF: dam2452/...-GGUF) | 8B | Apache-2.0 | no |
-| Qwen3.6-27B (alt brain) | https://huggingface.co/unsloth/Qwen3.6-27B-GGUF | 27B | Apache-2.0 | no |
-\* verify the current YOLO26 license. † two variants (`dinov3`, `vith`); confirm exact count on the card — budget impact is small either way. SAM 3 itself is 848M.
-**Useful extras:** SAM 3D Body uses a Momentum Human Rig (MHR) that separates skeleton from soft-tissue shape — convenient for clean joint-angle extraction. The repo ships a notebook combining SAM 3D Body + SAM 3D Objects in one frame of reference. SAM 3D Body demo: https://www.aidemos.meta.com/segment-anything/editor/convert-body-to-3d
----
-## 4. Datasets for transfer / pretraining
-You have a couple of labeled clips. Pretrain on clinician-scored movement-quality data first, then few-shot fine-tune. These are the most transferable to FMS (ranked by relevance):
-| Dataset | Why it's the closest analogue | Link |
-|---|---|---|
-| **KIMORE** | Clinician **scores** of low-back-pain rehab exercises (trunk control, multi-plane) — same "score movement quality" task as FMS; partially overlaps Deep Squat / Rotary Stability / TSPU mechanics | https://www.researchgate.net/publication/333791841 (search "KIMORE dataset") |
-| **UI-PRMD** | 10 rehab movements, correct vs. incorrect executions; standard MQA benchmark, pairs with KIMORE | search "UI-PRMD University of Idaho Physical Rehabilitation Movements" |
-| **Fitness-AQA** | Real gym **squat/deadlift form errors** — directly relevant to Deep Squat compensations | https://github.com/ParitoshParmar/MTL-AQA (links Fitness-AQA) |
-| **FLEX** | Large multi-modal fitness AQA dataset | via Lyman-Smoker/Awesome-AQA |
-| **MTL-AQA / AQA-7 / FineFS** | General sports AQA for backbone pretraining (diving, skating) | https://github.com/ParitoshParmar/MTL-AQA |
-**FMS-specific public video data is scarce** — don't expect a drop-in set. Your physio's clips are the gold; everything above is for pretraining the temporal backbone so it learns movement structure before it ever sees an FMS label.
----
-## 5. Build & deploy tooling
-| Need | Link |
-|---|---|
-| Gradio docs (v6) | https://www.gradio.app/docs |
-| `gradio.Server` — custom frontend + Gradio backend (Off-Brand badge) | https://www.gradio.app/guides/server-mode · blog: https://huggingface.co/blog/introducing-gradio-server |
-| Gradio AI coding-assistant skill | `gradio skills add --claude` (PyPI: https://pypi.org/project/gradio/) |
-| Gradio changelog (confirm `gr.Walkthrough`, `gr.Navbar`, `gr.Video.playback_position`) | https://www.gradio.app/changelog |
-| HF Spaces ZeroGPU (`@spaces.GPU`) | https://huggingface.co/docs/hub/spaces-zerogpu |
-| llama.cpp | https://github.com/ggml-org/llama.cpp |
-| pyskl (ST-GCN++/PoseConv3D, custom-video tutorial incl. diving48) | https://github.com/kennymckormick/pyskl |
-| MMAction2 (broader video understanding) | https://github.com/open-mmlab/mmaction2 |
-| Hackathon's own trailheads (ML Intern, Gradio guides) | https://github.com/huggingface/ml-intern |
-> **Hackathon-specific gotcha already seen in the org:** another team's Space hit `libcudart.so.12` errors and had to swap llama.cpp for transformers + `spaces.GPU`. Plan for it — isolate the llama.cpp build (CPU-only or pinned-CUDA) and keep a transformers fallback. For the scoring head, a small hand-rolled ST-GCN may deploy more cleanly on a Space than the full MMAction2/pyskl stack — prototype with pyskl, ship lean.
----
-## 6. Two artifacts you probably haven't made yet
-### 6.1 Data & capture protocol (highest-leverage non-code work)
-With a tiny dataset, controlling *how* clips are captured beats any model tweak. Give the physio a one-pager:
-- **Camera:** one fixed position, tripod, ~3 m back, lens at hip height, landscape, 1080p/30fps+. Same setup every session — this is what makes 3D consistent and the longitudinal baseline meaningful.
-- **Framing:** whole body in frame for the whole rep, including the dowel. Plain-ish background, even lighting, no backlight.
-- **One athlete in frame** at scoring time (or note who to track). For bilateral tests, capture **both sides** and label each.
-- **Label schema (CSV):** `clip_id, athlete_id, date, test_name, side(L/R/NA), score(0–3), pain(bool), compensation_notes(free text), camera_view, consent_on_file(bool)`.
-- **One rep per clip** to start (simplest). If sessions are continuous, you'll need temporal segmentation first — flag it to the build agent at Phase 1.
-### 6.2 Evaluation plan
-Define "good" before you train, given so few labels:
-- **Primary:** Spearman ρ between predicted and physio scores (the AQA-standard metric), plus **exact-match** and **±1 accuracy** per test.
-- **Clinical credibility:** **weighted Cohen's κ** and **ICC** of model-vs-physio, reported alongside the human inter-rater numbers from the JOSPT study — i.e. "how does FormScout compare to a second human rater?"
-- **Asymmetry:** detection rate of L/R asymmetries the physio flagged (this is one of the FMS's most defensible outputs).
-- **Validation:** leave-one-clip-out CV (you can't afford a held-out test split). Keep ≥1 clip the judge never sees for the demo.
-- **Calibration:** report when the system says "low confidence / physio review" and show it's right to do so. A well-calibrated, humble tool reads as more trustworthy than a confident one.
----
-## 7. Ethics, consent & data handling (EU / Slovakia)
-You're filming identifiable athletes, possibly **minors** on a youth team. This is biometric personal data under GDPR — treat it as first-class, and say so in your submission (judges and physios both reward it):
-- **Consent:** written consent from each athlete (and a parent/guardian for anyone under 18) before any footage is used. No consent → not in the dataset, not in the demo.
-- **Data minimization & retention:** keep only what you need; don't persist raw clips on the Space beyond what's approved; document a retention/deletion policy. Prefer storing derived skeletons over raw video where possible.
-- **Demo footage:** use a consenting adult (you, a teammate) for the public demo video rather than a minor athlete, even if you trained on team data privately.
-- **Framing:** screening aid, not a medical device; pain/clearing tests always defer to the clinician; human-in-the-loop by design.
----
-## 8. The transfer-learning recipe (ties it together)
-1. **Backbone pretrain** — ST-GCN++ on a general skeleton-action set (NTU/Kinetics skeletons via pyskl) so it learns motion structure.
-2. **Domain adapt** — continue on **KIMORE + UI-PRMD** (clinician-scored movement quality) so it learns *quality*, not just *what action*.
-3. **Few-shot fine-tune** — **LoRA** on the physio's FMS clips with heavy augmentation (temporal jitter, **L↔R mirror** to double bilateral data, 3D camera-angle perturbation, joint noise). The SSL-Rehab paper (§2.3) is your blueprint and it's exactly your LoRA wheelhouse.
-4. **Don't over-train the head** — let deterministic biomechanics carry the demo; the learned head and RAG are the refinement and the badges, not the foundation.
----
-## 9. Demo & submission storyboard (the "make it sing" 30%)
-The submission needs a demo video + social post; "Show, Don't Tell" is a literal rule. A tight 60–90s cut:
-1. **0–10s** — the problem: physio eyeballing a squat, scribbling a score. "Same player, two raters, two scores."
-2. **10–35s** — upload the clip to FormScout → skeleton overlay → 0–3 with the *deciding angle drawn on the frame* (`playback_position` jump). The "aha" shot.
-3. **35–55s** — the scorecard: composite 0–21, the L/R asymmetry strip, a "low confidence — physio review" flag on a borderline case (honesty sells).
-4. **55–75s** — the physio reacting / using it on a real player (the Backyard AI "they actually used it" proof).
-5. **End card** — "Runs on a laptop. ~18B params. Screening aid, not a diagnosis." Link the Space, the published head, the agent trace, the blog.
-Social post: lead with the overlay GIF + the asymmetry-detection angle; tag Gradio/HF; one line of honest framing.
----
-*Built to give FormScout the best shot. The two things most teams underinvest in — the capture protocol (§6.1) and the honest, clinical-style evaluation (§6.2, §2.1) — are exactly where this project can out-class flashier entries. Good luck. 🏀*

+# FormScout — Starter Kit & Resource Pack
+Companion to `FormScout-FMS-Spec.md` and `FormScout-Build-Prompt.md`. Every link below was checked. Read §1 first — some items are time-sensitive and block the build if you leave them late.
+---
+## 1. Do this NOW (before the hack window — some take hours to clear)
+- [ ] **Request access to the gated Meta checkpoints today.** Both are gated on Hugging Face and approval isn't instant:
+  - SAM 3 / SAM 3.1 — request on the SAM 3 repos (you need the latest code for the 3.1 checkpoints).
+  - SAM 3D Body — `facebook/sam-3d-body-dinov3` and `facebook/sam-3d-body-vith` both require an access request, then an authenticated download. **Note:** data/checkpoints are blocked in sanctioned jurisdictions — shouldn't affect SK, but verify.
+- [ ] **Put your HF token in the Space secrets** so the Space can pull the gated weights at build time.
+- [ ] **Check licenses before you commit to a model** (this affects whether you can even submit):
+  - Qwen3-VL-8B / Qwen3-VL-Embedding-8B / Qwen3.6 → **Apache-2.0** (clean).
+  - SAM 3 / SAM 3.1 / SAM 3D Body → **SAM License** (not Apache; read the terms — there are use restrictions).
+  - Ultralytics YOLO26 → historically **AGPL-3.0** (open-sourcing obligations; commercial license exists). Verify on the model/repo and make sure an AGPL dependency is OK for your submission. If it's a problem, RTMPose/ViTPose are alternatives.
+  - pyskl / MMAction2 → Apache-2.0.
+  - KIMORE / UI-PRMD → academic/research terms; check before redistributing anything derived.
+- [ ] **Confirm the param-counting rule in the Discord AMA.** Specifically: (a) is it summed across the pipeline or per-model? (b) do **frozen** base models count? (c) does a LoRA adapter's base count? Your ~18B config is safe under the strict reading either way, but get it on record.
+---
+## 2. Literature package
+### 2.1 The framing that wins — "evaluate like an FMS reliability study"
+The single most credible move in your writeup: evaluate FormScout the way the clinical literature evaluates human FMS raters. Treat the model as a *second rater* and report **weighted Cohen's κ** and **ICC** against the physio, the exact metrics the reliability papers use. That instantly makes your results legible to any sports-medicine reader and is far more honest than a vanity accuracy number.
+| Resource | What it gives you | Link |
+|---|---|---|
+| Physiopedia — FMS | Clean overview of the 7 tests + 0–21 scoring | https://www.physio-pedia.com/Functional_Movement_Screen_(FMS) |
+| FMS reliability study (JOSPT 2012) | The ICC/κ numbers and method you'll mirror in your eval | https://www.jospt.org/doi/10.2519/jospt.2012.3838 |
+| FMS in elite youth soccer (PMC) | Per-test scores, asymmetries, clearing-test order | https://pmc.ncbi.nlm.nih.gov/articles/PMC5675373/ |
+| Clinician's guide to FMS scoring | Per-test 3/2/1 criteria in plain language (rubric source) | https://meloqdevices.com/blogs/meloq-updates/functional-movement-screening |
+> **Honesty anchor for the blog post:** the popular "≤14 → injury risk" cutoff has weak/mixed predictive validity. Sell standardization, asymmetry detection, and a repeatable baseline — not prediction.
+### 2.2 Action Quality Assessment — surveys & living lists
+| Resource | Why | Link |
+|---|---|---|
+| *A Decade of AQA* (survey, 2025, 200+ papers, PRISMA) | The map of the whole field; start here | https://arxiv.org/abs/2502.02817 · code: https://github.com/HaoYin116/Survey_of_AQA |
+| *Comprehensive Survey of AQA: Method & Benchmark* (2024) | Taxonomy by modality (video / **skeleton** / multimodal) + unified benchmark | https://arxiv.org/abs/2412.11149 · page: https://zhoukanglei.github.io/AQA-Survey |
+| Awesome-AQA (ZhouKanglei) | Curated, **has a Medical-Care/rehab section** — your closest analogues | https://github.com/ZhouKanglei/Awesome-AQA |
+| Awesome-AQA (Lyman-Smoker) | Second list; catches papers the other misses (FLEX, ExAct, etc.) | https://github.com/Lyman-Smoker/Awesome-AQA |
+### 2.3 Skeleton-based scoring — the methods your head will borrow from
+| Paper | Relevance to FormScout | Link |
+|---|---|---|
+| ST-GCN (original) | The graph-over-skeleton + temporal-conv backbone | https://github.com/open-mmlab/mmaction2/blob/main/configs/skeleton/stgcn/README.md |
+| AQA via Hierarchical **Pose-guided** Multi-Stage Contrastive Regression (TIP 2025) | Pose-guided + contrastive regression with few labels — close to your setup | https://arxiv.org/abs/2501.03674 |
+| Attention-guided Movement **Quality** Assessment + skeletal augmentation (UI-PRMD/KIMORE) | Transformer MQA on clinician-scored rehab data; **augmentation recipe for tiny sets** | https://arxiv.org/pdf/2204.07840 |
+| SSL-Rehab: self-supervised 3D skeleton + **LoRA** fine-tune (KIMORE/UI-PRMD) | Pretrain→LoRA recipe for small clinical datasets (uses your LoRA muscle) | https://www.sciencedirect.com/science/article/abs/pii/S1077314224003564 |
+| Skeleton-based AQA w/ anomaly-aware DTW (Sensors 2025) | DTW alignment + anomaly scoring; cheap, label-light baseline | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12693942/ |
+---
+## 3. Models & tooling (verified)
+| Component | Repo / card | Params | License | Gated? |
+|---|---|---:|---|---|
+| YOLO26-Pose | https://docs.ultralytics.com/tasks/pose | <0.1B | AGPL-3.0* | no |
+| SAM 3.1 | https://github.com/facebookresearch/sam3 | ~0.85B | SAM License | **yes** |
+| SAM 3D Body | https://github.com/facebookresearch/sam-3d-body · https://huggingface.co/facebook/sam-3d-body-dinov3 | sub-1B† | SAM License | **yes** |
+| ST-GCN++ / PoseConv3D | https://github.com/kennymckormick/pyskl | ~0.01–0.05B | Apache-2.0 | no |
+| Qwen3-VL-8B-Instruct | https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct | 8B | Apache-2.0 | no |
+| Qwen3-VL-Embedding-8B | https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B (GGUF: dam2452/...-GGUF) | 8B | Apache-2.0 | no |
+| Qwen3.6-27B (alt brain) | https://huggingface.co/unsloth/Qwen3.6-27B-GGUF | 27B | Apache-2.0 | no |
+\* verify the current YOLO26 license. † two variants (`dinov3`, `vith`); confirm exact count on the card — budget impact is small either way. SAM 3 itself is 848M.
+**Useful extras:** SAM 3D Body uses a Momentum Human Rig (MHR) that separates skeleton from soft-tissue shape — convenient for clean joint-angle extraction. The repo ships a notebook combining SAM 3D Body + SAM 3D Objects in one frame of reference. SAM 3D Body demo: https://www.aidemos.meta.com/segment-anything/editor/convert-body-to-3d
+---
+## 4. Datasets for transfer / pretraining
+You have a couple of labeled clips. Pretrain on clinician-scored movement-quality data first, then few-shot fine-tune. These are the most transferable to FMS (ranked by relevance):
+| Dataset | Why it's the closest analogue | Link |
+|---|---|---|
+| **KIMORE** | Clinician **scores** of low-back-pain rehab exercises (trunk control, multi-plane) — same "score movement quality" task as FMS; partially overlaps Deep Squat / Rotary Stability / TSPU mechanics | https://www.researchgate.net/publication/333791841 (search "KIMORE dataset") |
+| **UI-PRMD** | 10 rehab movements, correct vs. incorrect executions; standard MQA benchmark, pairs with KIMORE | search "UI-PRMD University of Idaho Physical Rehabilitation Movements" |
+| **Fitness-AQA** | Real gym **squat/deadlift form errors** — directly relevant to Deep Squat compensations | https://github.com/ParitoshParmar/MTL-AQA (links Fitness-AQA) |
+| **FLEX** | Large multi-modal fitness AQA dataset | via Lyman-Smoker/Awesome-AQA |
+| **MTL-AQA / AQA-7 / FineFS** | General sports AQA for backbone pretraining (diving, skating) | https://github.com/ParitoshParmar/MTL-AQA |
+**FMS-specific public video data is scarce** — don't expect a drop-in set. Your physio's clips are the gold; everything above is for pretraining the temporal backbone so it learns movement structure before it ever sees an FMS label.
+---
+## 5. Build & deploy tooling
+| Need | Link |
+|---|---|
+| Gradio docs (v6) | https://www.gradio.app/docs |
+| `gradio.Server` — custom frontend + Gradio backend (Off-Brand badge) | https://www.gradio.app/guides/server-mode · blog: https://huggingface.co/blog/introducing-gradio-server |
+| Gradio AI coding-assistant skill | `gradio skills add --claude` (PyPI: https://pypi.org/project/gradio/) |
+| Gradio changelog (confirm `gr.Walkthrough`, `gr.Navbar`, `gr.Video.playback_position`) | https://www.gradio.app/changelog |
+| HF Spaces ZeroGPU (`@spaces.GPU`) | https://huggingface.co/docs/hub/spaces-zerogpu |
+| llama.cpp | https://github.com/ggml-org/llama.cpp |
+| pyskl (ST-GCN++/PoseConv3D, custom-video tutorial incl. diving48) | https://github.com/kennymckormick/pyskl |
+| MMAction2 (broader video understanding) | https://github.com/open-mmlab/mmaction2 |
+| Hackathon's own trailheads (ML Intern, Gradio guides) | https://github.com/huggingface/ml-intern |
+> **Hackathon-specific gotcha already seen in the org:** another team's Space hit `libcudart.so.12` errors and had to swap llama.cpp for transformers + `spaces.GPU`. Plan for it — isolate the llama.cpp build (CPU-only or pinned-CUDA) and keep a transformers fallback. For the scoring head, a small hand-rolled ST-GCN may deploy more cleanly on a Space than the full MMAction2/pyskl stack — prototype with pyskl, ship lean.
+---
+## 6. Two artifacts you probably haven't made yet
+### 6.1 Data & capture protocol (highest-leverage non-code work)
+With a tiny dataset, controlling *how* clips are captured beats any model tweak. Give the physio a one-pager:
+- **Camera:** one fixed position, tripod, ~3 m back, lens at hip height, landscape, 1080p/30fps+. Same setup every session — this is what makes 3D consistent and the longitudinal baseline meaningful.
+- **Framing:** whole body in frame for the whole rep, including the dowel. Plain-ish background, even lighting, no backlight.
+- **One athlete in frame** at scoring time (or note who to track). For bilateral tests, capture **both sides** and label each.
+- **Label schema (CSV):** `clip_id, athlete_id, date, test_name, side(L/R/NA), score(0–3), pain(bool), compensation_notes(free text), camera_view, consent_on_file(bool)`.
+- **One rep per clip** to start (simplest). If sessions are continuous, you'll need temporal segmentation first — flag it to the build agent at Phase 1.
+### 6.2 Evaluation plan
+Define "good" before you train, given so few labels:
+- **Primary:** Spearman ρ between predicted and physio scores (the AQA-standard metric), plus **exact-match** and **±1 accuracy** per test.
+- **Clinical credibility:** **weighted Cohen's κ** and **ICC** of model-vs-physio, reported alongside the human inter-rater numbers from the JOSPT study — i.e. "how does FormScout compare to a second human rater?"
+- **Asymmetry:** detection rate of L/R asymmetries the physio flagged (this is one of the FMS's most defensible outputs).
+- **Validation:** leave-one-clip-out CV (you can't afford a held-out test split). Keep ≥1 clip the judge never sees for the demo.
+- **Calibration:** report when the system says "low confidence / physio review" and show it's right to do so. A well-calibrated, humble tool reads as more trustworthy than a confident one.
+---
+## 7. Ethics, consent & data handling (EU / Slovakia)
+You're filming identifiable athletes, possibly **minors** on a youth team. This is biometric personal data under GDPR — treat it as first-class, and say so in your submission (judges and physios both reward it):
+- **Consent:** written consent from each athlete (and a parent/guardian for anyone under 18) before any footage is used. No consent → not in the dataset, not in the demo.
+- **Data minimization & retention:** keep only what you need; don't persist raw clips on the Space beyond what's approved; document a retention/deletion policy. Prefer storing derived skeletons over raw video where possible.
+- **Demo footage:** use a consenting adult (you, a teammate) for the public demo video rather than a minor athlete, even if you trained on team data privately.
+- **Framing:** screening aid, not a medical device; pain/clearing tests always defer to the clinician; human-in-the-loop by design.
+---
+## 8. The transfer-learning recipe (ties it together)
+1. **Backbone pretrain** — ST-GCN++ on a general skeleton-action set (NTU/Kinetics skeletons via pyskl) so it learns motion structure.
+2. **Domain adapt** — continue on **KIMORE + UI-PRMD** (clinician-scored movement quality) so it learns *quality*, not just *what action*.
+3. **Few-shot fine-tune** — **LoRA** on the physio's FMS clips with heavy augmentation (temporal jitter, **L↔R mirror** to double bilateral data, 3D camera-angle perturbation, joint noise). The SSL-Rehab paper (§2.3) is your blueprint and it's exactly your LoRA wheelhouse.
+4. **Don't over-train the head** — let deterministic biomechanics carry the demo; the learned head and RAG are the refinement and the badges, not the foundation.
+---
+## 9. Demo & submission storyboard (the "make it sing" 30%)
+The submission needs a demo video + social post; "Show, Don't Tell" is a literal rule. A tight 60–90s cut:
+1. **0–10s** — the problem: physio eyeballing a squat, scribbling a score. "Same player, two raters, two scores."
+2. **10–35s** — upload the clip to FormScout → skeleton overlay → 0–3 with the *deciding angle drawn on the frame* (`playback_position` jump). The "aha" shot.
+3. **35–55s** — the scorecard: composite 0–21, the L/R asymmetry strip, a "low confidence — physio review" flag on a borderline case (honesty sells).
+4. **55–75s** — the physio reacting / using it on a real player (the Backyard AI "they actually used it" proof).
+5. **End card** — "Runs on a laptop. ~18B params. Screening aid, not a diagnosis." Link the Space, the published head, the agent trace, the blog.
+Social post: lead with the overlay GIF + the asymmetry-detection angle; tag Gradio/HF; one line of honest framing.
+---
+*Built to give FormScout the best shot. The two things most teams underinvest in — the capture protocol (§6.1) and the honest, clinical-style evaluation (§6.2, §2.1) — are exactly where this project can out-class flashier entries. Good luck. 🏀*

docs/plans/FormScout-Build-Prompt.md CHANGED Viewed

@@ -1,168 +1,168 @@
-# Build Prompt — FormScout (FMS scoring on Gradio, ≤32B)
-> **How to use this:** paste everything below the line into your coding agent (Claude Code, Codex, Cursor, etc.) as the opening instruction. Attach `FormScout-FMS-Spec.md` alongside it — that file is the product source of truth; this file is the engineering contract and process. Work through it phase by phase.
----
-## ROLE
-You are a **senior Python + Gradio architect with ~10 years of shipping ML web apps**, including production Hugging Face Spaces, custom-frontend Gradio deployments, ZeroGPU services, and llama.cpp-served models. You are pragmatic, opinionated about defaults, allergic to dead code, and you **verify APIs against current docs instead of trusting your memory** — Gradio and the model ecosystem move fast and your training data may be stale. You build **vertical slices** that run end to end early, then deepen. You never hand back a broken app.
-## MISSION
-Build **FormScout**, a Gradio app hosted as a Hugging Face Space that scores Functional Movement Screen (FMS) videos 0–3 per test with an explainable rationale and an annotated overlay, for the Build Small Hackathon (Backyard AI track). Full product requirements are in the attached `FormScout-FMS-Spec.md`. Honor it; if you deviate, say why.
-## PRIME DIRECTIVES (read before writing any code)
-1. **Verify before you build.** Do Phase 0 recon first. Do not write against a Gradio/model API you have not confirmed exists in the current version. When unsure, read the doc or the model card, don't guess.
-2. **Vertical slice first.** The fastest path to a working `video in → scored overlay out` for *one* test beats a half-built version of all seven. Get something running on day one, then expand.
-3. **Stay under budget.** Total model parameters across the whole pipeline must be **≤ 32B**. Track a running sum in `MODEL_BUDGET.md` and update it whenever you add or swap a model. The target config is ~18B (see spec §5). If a choice would exceed 32B, stop and flag it.
-4. **No cloud model APIs.** All inference runs on the Space (Off the Grid badge). No OpenAI/Anthropic/Gemini/etc. calls for the core pipeline.
-5. **Honesty & safety are features, not footnotes.** This is a screening aid, not a diagnosis and not injury prediction. Pain and clearing tests are never auto-scored — they set `needs_human=true`. A safety banner is always visible. Low-confidence and agent-disagreement cases are surfaced, not hidden.
-6. **Modular agents, typed contracts.** Each pipeline stage is an independent module with a typed input/output (see spec §7). No god-functions. The pipeline must be runnable headless (no Gradio) for testing.
----
-## PHASE 0 — Recon & environment (do this first, report findings before coding)
-**Goal:** confirm the ground truth, then write a short `RECON.md` summarizing what you found and any deviations from the spec.
-1. **Install the Gradio skill** for this agent so you get current Gradio knowledge:
-   `gradio skills add --claude` (use the right flag for your agent; `--global` is fine).
-2. **Pin and confirm Gradio.** Determine the current major version (expect Gradio 6.x). Record the exact version you'll target in `requirements.txt`. Confirm these still exist and note their current signatures:
-   - `gr.Blocks`, `gr.Video` (incl. `playback_position` for jumping to the decisive frame), `gr.Walkthrough` / `gr.Step` (for the 7-test flow), `gr.Navbar` (multipage), custom theming / CSS.
-   - `gradio.Server` (custom-frontend mode) — decide **Blocks vs Server** for the UI (see UI section).
-   - ZeroGPU usage: the `@spaces.GPU` decorator pattern, and the caveat that with `gradio.Server` + ZeroGPU you must call endpoints via `@gradio/client` from the browser.
-3. **Verify every model** on its Hugging Face card — confirm it exists, its **license**, its **parameter count**, and whether a **GGUF** build exists for llama.cpp:
-   - YOLO26-Pose (Ultralytics) — pick a variant (l/x) and confirm license implications.
-   - SAM 3.1 (`facebookresearch/sam3`) — base checkpoint size.
-   - **SAM 3D Body** — *this is the uncertain one.* Confirm weights are public, the license, the **exact param count**, and that it runs within a ZeroGPU slice. If it's too heavy or not usable, fall back to **2D-only biomechanics** (angles from 2D pose + explicit camera-angle caveats) and note it.
-   - Qwen3-VL-8B-Instruct + Qwen3-VL-Embedding-8B — confirm GGUF builds and that they share the Qwen3-VL backbone.
-4. **llama.cpp on Spaces reality check.** Confirm a working install path; prior hackathon Spaces hit `libcudart.so` errors. Decide CPU-only vs pinned-CUDA build per model. Have a `transformers`/`spaces.GPU` fallback ready for any model that won't build under llama.cpp in time.
-5. **Open question to surface, not solve:** does "total parameters ≤ 32B" mean *per model* or *summed across the pipeline*? Design for the **summed** reading (safe under either). Note in `RECON.md` to confirm via the Discord AMA.
-**Exit criteria for Phase 0:** `RECON.md` exists with the Gradio version, a verified model table (name, params, license, GGUF y/n, runs-on-ZeroGPU y/n), the running param sum, the chosen UI approach, and any fallbacks triggered.
----
-## PHASE 1 — The spine (one test, end to end, headless + Gradio)
-**Goal:** upload a Deep Squat clip → get a rationalized 0–3 + skeleton overlay.
-- Scaffold the repo (structure below). Pipeline runs **headless** via `python -m formscout.run sample.mp4` before any UI.
-- Implement `IngestAgent` → `SegmentationAgent` (SAM 3.1) → `PoseAgent` (YOLO26-Pose). Reject non-target people via the mask/track id.
-- Implement `Body3DAgent` (SAM 3D Body) **or** the 2D fallback from Phase 0.
-- Implement `BiomechanicsAgent` for Deep Squat only: torso–tibia angle, hip-flexion depth (femur vs horizontal), knee tracking, dowel alignment.
-- Implement a **deterministic** rubric scorer for Deep Squat (3/2/1 per spec §8). No ML scoring yet.
-- Minimal Gradio UI: `gr.Video` in, score + rationale + overlay out.
-**Exit criteria:** a real squat clip produces a defensible score, a one-line reason citing the deciding measurement, and an overlay video. Runs on the Space.
----
-## PHASE 2 — All seven tests + the judge
-- Extend `BiomechanicsAgent` + rubric scorers to all 7 tests. Bilateral tests score each side, **report the lower**, and **always emit the asymmetry**.
-- `MovementClassifierAgent`: identify which test is in the clip (VLM or a small classifier) with a **manual override** in the UI.
-- `JudgeAgent` (Qwen3-VL-8B via llama.cpp): consumes rubric + measurements + the deterministic candidate → final 0–3, rationale, compensation tag, corrective hint. Pain/clearing → `needs_human=true`, **not scored**.
-- `ReportAgent`: per-test card, composite 0–21, asymmetry strip, annotated overlay, PDF export.
-**Exit criteria:** a multi-test session produces a full scorecard with composite + asymmetries; pain/clearing cases defer to human; disagreements between deterministic and judge scores are flagged.
----
-## PHASE 3 — Learned scoring + retrieval (the badges)
-- `ScoringAgent`: compact **ST-GCN** scoring head. Pre-train on public AQA/pose data, then **few-shot fine-tune** on the physio's labeled clips with heavy augmentation (temporal jitter, **left↔right mirror**, 3D camera-angle perturbation, joint noise). Hold out ≥1 labeled clip. **Publish the fine-tuned head to the Hub** with an honest model card → *Well-Tuned*.
-- `RetrievalAgent`: build a Qwen3-VL-Embedding-8B index over the physio's labeled clips; return k nearest + their scores to anchor the judge → RAG.
-- Wire the judge to weigh: deterministic candidate + ST-GCN candidate + retrieved exemplars.
-**Exit criteria:** scores incorporate the learned head and exemplars; adding a new labeled clip improves retrieval with **no retraining**.
----
-## PHASE 4 — Polish, ship, document
-- Custom UI pass (Off-Brand): scout/trail theme, score dial, asymmetry bars, rubric drawer with met/unmet checkboxes, decisive-frame jump via `playback_position`, persistent safety banner.
-- Persist the embedding index + accumulated labels in Space storage (longitudinal baseline).
-- **Publish one full agent trace** to the Hub (every agent's I/O for one run) → *Sharing is Caring*.
-- Write the **blog post / field notes** with the honesty section front-and-center → *Field Notes*.
-- Record the demo video (physio scores a real player) + the social post.
-**Exit criteria:** all six badges attempted, Space is green, demo + post + trace + blog are linked from the README.
----
-## REPO STRUCTURE (target)
-```
-formscout/
-  app.py                 # Gradio entrypoint (Blocks or Server)
-  formscout/
-    __init__.py
-    config.py            # paths, model ids, thresholds, feature flags
-    pipeline.py          # Director: orchestrates agents, quality-gates
-    run.py               # headless CLI entrypoint (no Gradio)
-    agents/
-      ingest.py
-      segmentation.py    # SAM 3.1
-      pose2d.py          # YOLO26-Pose
-      body3d.py          # SAM 3D Body (+ 2d fallback)
-      classify.py        # movement classifier
-      biomechanics.py    # rubric features per test
-      scoring.py         # ST-GCN learned head
-      retrieval.py       # Qwen3-VL-Embedding index
-      judge.py           # Qwen3-VL-8B judge
-      report.py          # scorecard, overlay, pdf
-    rubric/
-      deep_squat.py ...  # one scorer per FMS test, pure functions
-    types.py             # typed dataclasses for every agent contract
-    serving/
-      llama_cpp.py       # llama.cpp client wrappers + fallbacks
-    ui/
-      theme.py, components.py, custom/  # frontend assets
-    tracing.py           # structured per-agent I/O logging (for the trace badge)
-  tests/                 # headless tests per agent + a golden-clip e2e test
-  requirements.txt
-  README.md              # Space card: pitch, demo, trace, blog, safety
-  MODEL_BUDGET.md        # running param sum, must stay ≤32B
-  RECON.md               # Phase 0 findings
-```
-## ENGINEERING STANDARDS
-- **Typing everywhere.** Every agent takes and returns a dataclass from `types.py`. Validate at boundaries.
-- **Pure rubric functions.** Each test scorer is a pure function `(features) -> ScoreResult` with the triggering reason. Unit-test each against hand-computed cases.
-- **Defensive by default.** Handle: no person detected, multiple people, wrong/ambiguous test, occlusion, too-short clip, bad FPS, 3D model OOM. Degrade gracefully and tell the user what happened — never crash the Space.
-- **Confidence is first-class.** Every agent emits a confidence; the Director flags low confidence and ≥1-point judge/ST-GCN disagreement as "physio review recommended."
-- **Config over constants.** Thresholds, model ids, k for retrieval, feature flags live in `config.py`, not scattered literals.
-- **Tracing for free badge.** `tracing.py` records structured per-agent inputs/outputs for any run; one run gets exported for the Hub trace.
-- **Determinism in demos.** Fix seeds; cache model loads at startup; warm the pipeline so the demo isn't a cold-start.
-- **Tests:** per-agent unit tests on fixtures + one golden-clip end-to-end test asserting score, `needs_human`, and overlay presence. Keep a tiny committed sample clip.
-## GRADIO-SPECIFIC GUIDANCE
-- **Blocks vs Server:** start with `gr.Blocks` + custom CSS/theme — fastest to a polished result and enough for Off-Brand. Escalate to `gradio.Server` with your own frontend **only if** Blocks can't express the UI; document the reason. (Server still gives queuing, ZeroGPU, MCP.)
-- Use `gr.Walkthrough`/`gr.Step` to guide the physio through a 7-test session; `gr.Navbar` if you split pages.
-- Use `gr.Video`'s `playback_position` to jump the result video to the frame that decided the score.
-- ZeroGPU: wrap heavy inference in `@spaces.GPU`; load models once at module scope; mind the per-call GPU time limit. If using `gradio.Server` + ZeroGPU, call endpoints via `@gradio/client` from the browser.
-- `requirements.txt`: pin Gradio and every model lib; isolate the llama.cpp build (CPU-only or pinned-CUDA) to dodge `libcudart` failures; keep a `transformers` + `spaces.GPU` fallback path.
-## DEFINITION OF DONE (badge checklist)
-- [ ] Space runs green; upload → scorecard works on real clips.
-- [ ] Param sum verified ≤ 32B in `MODEL_BUDGET.md`.
-- [ ] 🔌 No cloud model APIs anywhere in the pipeline.
-- [ ] 🎯 Fine-tuned ST-GCN head published to the Hub w/ honest card.
-- [ ] 🎨 Custom, non-default Gradio UI.
-- [ ] 🦙 VLM + embedder served via llama.cpp.
-- [ ] 📡 One full agent trace published to the Hub.
-- [ ] 📓 Blog post / field notes written, honesty section included.
-- [ ] Demo video + social post recorded.
-- [ ] Safety banner present; pain/clearing never auto-scored; low-confidence flagged.
-## INTERACTION PROTOCOL
-- **After each phase**, post: what runs now, the updated param sum, deviations from the spec, and the next step. Don't silently change architecture.
-- **Ask the human only when blocked on a real decision** — e.g. single-test clips vs continuous sessions (changes segmentation + UI), SAM 3D Body unusable (triggers 2D fallback), or the param-sum interpretation. Otherwise proceed with the spec's defaults and note your assumption inline.
-- **Never claim a Gradio/model API works without having verified it** this session. If you didn't check it, say so.

+# Build Prompt — FormScout (FMS scoring on Gradio, ≤32B)
+> **How to use this:** paste everything below the line into your coding agent (Claude Code, Codex, Cursor, etc.) as the opening instruction. Attach `FormScout-FMS-Spec.md` alongside it — that file is the product source of truth; this file is the engineering contract and process. Work through it phase by phase.
+---
+## ROLE
+You are a **senior Python + Gradio architect with ~10 years of shipping ML web apps**, including production Hugging Face Spaces, custom-frontend Gradio deployments, ZeroGPU services, and llama.cpp-served models. You are pragmatic, opinionated about defaults, allergic to dead code, and you **verify APIs against current docs instead of trusting your memory** — Gradio and the model ecosystem move fast and your training data may be stale. You build **vertical slices** that run end to end early, then deepen. You never hand back a broken app.
+## MISSION
+Build **FormScout**, a Gradio app hosted as a Hugging Face Space that scores Functional Movement Screen (FMS) videos 0–3 per test with an explainable rationale and an annotated overlay, for the Build Small Hackathon (Backyard AI track). Full product requirements are in the attached `FormScout-FMS-Spec.md`. Honor it; if you deviate, say why.
+## PRIME DIRECTIVES (read before writing any code)
+1. **Verify before you build.** Do Phase 0 recon first. Do not write against a Gradio/model API you have not confirmed exists in the current version. When unsure, read the doc or the model card, don't guess.
+2. **Vertical slice first.** The fastest path to a working `video in → scored overlay out` for *one* test beats a half-built version of all seven. Get something running on day one, then expand.
+3. **Stay under budget.** Total model parameters across the whole pipeline must be **≤ 32B**. Track a running sum in `MODEL_BUDGET.md` and update it whenever you add or swap a model. The target config is ~18B (see spec §5). If a choice would exceed 32B, stop and flag it.
+4. **No cloud model APIs.** All inference runs on the Space (Off the Grid badge). No OpenAI/Anthropic/Gemini/etc. calls for the core pipeline.
+5. **Honesty & safety are features, not footnotes.** This is a screening aid, not a diagnosis and not injury prediction. Pain and clearing tests are never auto-scored — they set `needs_human=true`. A safety banner is always visible. Low-confidence and agent-disagreement cases are surfaced, not hidden.
+6. **Modular agents, typed contracts.** Each pipeline stage is an independent module with a typed input/output (see spec §7). No god-functions. The pipeline must be runnable headless (no Gradio) for testing.
+---
+## PHASE 0 — Recon & environment (do this first, report findings before coding)
+**Goal:** confirm the ground truth, then write a short `RECON.md` summarizing what you found and any deviations from the spec.
+1. **Install the Gradio skill** for this agent so you get current Gradio knowledge:
+   `gradio skills add --claude` (use the right flag for your agent; `--global` is fine).
+2. **Pin and confirm Gradio.** Determine the current major version (expect Gradio 6.x). Record the exact version you'll target in `requirements.txt`. Confirm these still exist and note their current signatures:
+   - `gr.Blocks`, `gr.Video` (incl. `playback_position` for jumping to the decisive frame), `gr.Walkthrough` / `gr.Step` (for the 7-test flow), `gr.Navbar` (multipage), custom theming / CSS.
+   - `gradio.Server` (custom-frontend mode) — decide **Blocks vs Server** for the UI (see UI section).
+   - ZeroGPU usage: the `@spaces.GPU` decorator pattern, and the caveat that with `gradio.Server` + ZeroGPU you must call endpoints via `@gradio/client` from the browser.
+3. **Verify every model** on its Hugging Face card — confirm it exists, its **license**, its **parameter count**, and whether a **GGUF** build exists for llama.cpp:
+   - YOLO26-Pose (Ultralytics) — pick a variant (l/x) and confirm license implications.
+   - SAM 3.1 (`facebookresearch/sam3`) — base checkpoint size.
+   - **SAM 3D Body** — *this is the uncertain one.* Confirm weights are public, the license, the **exact param count**, and that it runs within a ZeroGPU slice. If it's too heavy or not usable, fall back to **2D-only biomechanics** (angles from 2D pose + explicit camera-angle caveats) and note it.
+   - Qwen3-VL-8B-Instruct + Qwen3-VL-Embedding-8B — confirm GGUF builds and that they share the Qwen3-VL backbone.
+4. **llama.cpp on Spaces reality check.** Confirm a working install path; prior hackathon Spaces hit `libcudart.so` errors. Decide CPU-only vs pinned-CUDA build per model. Have a `transformers`/`spaces.GPU` fallback ready for any model that won't build under llama.cpp in time.
+5. **Open question to surface, not solve:** does "total parameters ≤ 32B" mean *per model* or *summed across the pipeline*? Design for the **summed** reading (safe under either). Note in `RECON.md` to confirm via the Discord AMA.
+**Exit criteria for Phase 0:** `RECON.md` exists with the Gradio version, a verified model table (name, params, license, GGUF y/n, runs-on-ZeroGPU y/n), the running param sum, the chosen UI approach, and any fallbacks triggered.
+---
+## PHASE 1 — The spine (one test, end to end, headless + Gradio)
+**Goal:** upload a Deep Squat clip → get a rationalized 0–3 + skeleton overlay.
+- Scaffold the repo (structure below). Pipeline runs **headless** via `python -m formscout.run sample.mp4` before any UI.
+- Implement `IngestAgent` → `SegmentationAgent` (SAM 3.1) → `PoseAgent` (YOLO26-Pose). Reject non-target people via the mask/track id.
+- Implement `Body3DAgent` (SAM 3D Body) **or** the 2D fallback from Phase 0.
+- Implement `BiomechanicsAgent` for Deep Squat only: torso–tibia angle, hip-flexion depth (femur vs horizontal), knee tracking, dowel alignment.
+- Implement a **deterministic** rubric scorer for Deep Squat (3/2/1 per spec §8). No ML scoring yet.
+- Minimal Gradio UI: `gr.Video` in, score + rationale + overlay out.
+**Exit criteria:** a real squat clip produces a defensible score, a one-line reason citing the deciding measurement, and an overlay video. Runs on the Space.
+---
+## PHASE 2 — All seven tests + the judge
+- Extend `BiomechanicsAgent` + rubric scorers to all 7 tests. Bilateral tests score each side, **report the lower**, and **always emit the asymmetry**.
+- `MovementClassifierAgent`: identify which test is in the clip (VLM or a small classifier) with a **manual override** in the UI.
+- `JudgeAgent` (Qwen3-VL-8B via llama.cpp): consumes rubric + measurements + the deterministic candidate → final 0–3, rationale, compensation tag, corrective hint. Pain/clearing → `needs_human=true`, **not scored**.
+- `ReportAgent`: per-test card, composite 0–21, asymmetry strip, annotated overlay, PDF export.
+**Exit criteria:** a multi-test session produces a full scorecard with composite + asymmetries; pain/clearing cases defer to human; disagreements between deterministic and judge scores are flagged.
+---
+## PHASE 3 — Learned scoring + retrieval (the badges)
+- `ScoringAgent`: compact **ST-GCN** scoring head. Pre-train on public AQA/pose data, then **few-shot fine-tune** on the physio's labeled clips with heavy augmentation (temporal jitter, **left↔right mirror**, 3D camera-angle perturbation, joint noise). Hold out ≥1 labeled clip. **Publish the fine-tuned head to the Hub** with an honest model card → *Well-Tuned*.
+- `RetrievalAgent`: build a Qwen3-VL-Embedding-8B index over the physio's labeled clips; return k nearest + their scores to anchor the judge → RAG.
+- Wire the judge to weigh: deterministic candidate + ST-GCN candidate + retrieved exemplars.
+**Exit criteria:** scores incorporate the learned head and exemplars; adding a new labeled clip improves retrieval with **no retraining**.
+---
+## PHASE 4 — Polish, ship, document
+- Custom UI pass (Off-Brand): scout/trail theme, score dial, asymmetry bars, rubric drawer with met/unmet checkboxes, decisive-frame jump via `playback_position`, persistent safety banner.
+- Persist the embedding index + accumulated labels in Space storage (longitudinal baseline).
+- **Publish one full agent trace** to the Hub (every agent's I/O for one run) → *Sharing is Caring*.
+- Write the **blog post / field notes** with the honesty section front-and-center → *Field Notes*.
+- Record the demo video (physio scores a real player) + the social post.
+**Exit criteria:** all six badges attempted, Space is green, demo + post + trace + blog are linked from the README.
+---
+## REPO STRUCTURE (target)
+```
+formscout/
+  app.py                 # Gradio entrypoint (Blocks or Server)
+  formscout/
+    __init__.py
+    config.py            # paths, model ids, thresholds, feature flags
+    pipeline.py          # Director: orchestrates agents, quality-gates
+    run.py               # headless CLI entrypoint (no Gradio)
+    agents/
+      ingest.py
+      segmentation.py    # SAM 3.1
+      pose2d.py          # YOLO26-Pose
+      body3d.py          # SAM 3D Body (+ 2d fallback)
+      classify.py        # movement classifier
+      biomechanics.py    # rubric features per test
+      scoring.py         # ST-GCN learned head
+      retrieval.py       # Qwen3-VL-Embedding index
+      judge.py           # Qwen3-VL-8B judge
+      report.py          # scorecard, overlay, pdf
+    rubric/
+      deep_squat.py ...  # one scorer per FMS test, pure functions
+    types.py             # typed dataclasses for every agent contract
+    serving/
+      llama_cpp.py       # llama.cpp client wrappers + fallbacks
+    ui/
+      theme.py, components.py, custom/  # frontend assets
+    tracing.py           # structured per-agent I/O logging (for the trace badge)
+  tests/                 # headless tests per agent + a golden-clip e2e test
+  requirements.txt
+  README.md              # Space card: pitch, demo, trace, blog, safety
+  MODEL_BUDGET.md        # running param sum, must stay ≤32B
+  RECON.md               # Phase 0 findings
+```
+## ENGINEERING STANDARDS
+- **Typing everywhere.** Every agent takes and returns a dataclass from `types.py`. Validate at boundaries.
+- **Pure rubric functions.** Each test scorer is a pure function `(features) -> ScoreResult` with the triggering reason. Unit-test each against hand-computed cases.
+- **Defensive by default.** Handle: no person detected, multiple people, wrong/ambiguous test, occlusion, too-short clip, bad FPS, 3D model OOM. Degrade gracefully and tell the user what happened — never crash the Space.
+- **Confidence is first-class.** Every agent emits a confidence; the Director flags low confidence and ≥1-point judge/ST-GCN disagreement as "physio review recommended."
+- **Config over constants.** Thresholds, model ids, k for retrieval, feature flags live in `config.py`, not scattered literals.
+- **Tracing for free badge.** `tracing.py` records structured per-agent inputs/outputs for any run; one run gets exported for the Hub trace.
+- **Determinism in demos.** Fix seeds; cache model loads at startup; warm the pipeline so the demo isn't a cold-start.
+- **Tests:** per-agent unit tests on fixtures + one golden-clip end-to-end test asserting score, `needs_human`, and overlay presence. Keep a tiny committed sample clip.
+## GRADIO-SPECIFIC GUIDANCE
+- **Blocks vs Server:** start with `gr.Blocks` + custom CSS/theme — fastest to a polished result and enough for Off-Brand. Escalate to `gradio.Server` with your own frontend **only if** Blocks can't express the UI; document the reason. (Server still gives queuing, ZeroGPU, MCP.)
+- Use `gr.Walkthrough`/`gr.Step` to guide the physio through a 7-test session; `gr.Navbar` if you split pages.
+- Use `gr.Video`'s `playback_position` to jump the result video to the frame that decided the score.
+- ZeroGPU: wrap heavy inference in `@spaces.GPU`; load models once at module scope; mind the per-call GPU time limit. If using `gradio.Server` + ZeroGPU, call endpoints via `@gradio/client` from the browser.
+- `requirements.txt`: pin Gradio and every model lib; isolate the llama.cpp build (CPU-only or pinned-CUDA) to dodge `libcudart` failures; keep a `transformers` + `spaces.GPU` fallback path.
+## DEFINITION OF DONE (badge checklist)
+- [ ] Space runs green; upload → scorecard works on real clips.
+- [ ] Param sum verified ≤ 32B in `MODEL_BUDGET.md`.
+- [ ] 🔌 No cloud model APIs anywhere in the pipeline.
+- [ ] 🎯 Fine-tuned ST-GCN head published to the Hub w/ honest card.
+- [ ] 🎨 Custom, non-default Gradio UI.
+- [ ] 🦙 VLM + embedder served via llama.cpp.
+- [ ] 📡 One full agent trace published to the Hub.
+- [ ] 📓 Blog post / field notes written, honesty section included.
+- [ ] Demo video + social post recorded.
+- [ ] Safety banner present; pain/clearing never auto-scored; low-confidence flagged.
+## INTERACTION PROTOCOL
+- **After each phase**, post: what runs now, the updated param sum, deviations from the spec, and the next step. Don't silently change architecture.
+- **Ask the human only when blocked on a real decision** — e.g. single-test clips vs continuous sessions (changes segmentation + UI), SAM 3D Body unusable (triggers 2D fallback), or the param-sum interpretation. Otherwise proceed with the spec's defaults and note your assumption inline.
+- **Never claim a Gradio/model API works without having verified it** this session. If you didn't check it, say so.

docs/superpowers/plans/2026-06-04-formscout-full-build.md CHANGED Viewed

The diff for this file is too large to render. See raw diff

formscout.egg-info/PKG-INFO CHANGED Viewed

@@ -1,4 +1,4 @@
-Metadata-Version: 2.4
-Name: formscout
-Version: 0.1.0
-Requires-Python: >=3.11

+Metadata-Version: 2.4
+Name: formscout
+Version: 0.1.0
+Requires-Python: >=3.11

formscout.egg-info/SOURCES.txt CHANGED Viewed

@@ -1,26 +1,26 @@
-README.md
-pyproject.toml
-formscout/__init__.py
-formscout/config.py
-formscout/pipeline.py
-formscout/run.py
-formscout/tracing.py
-formscout/types.py
-formscout.egg-info/PKG-INFO
-formscout.egg-info/SOURCES.txt
-formscout.egg-info/dependency_links.txt
-formscout.egg-info/top_level.txt
-formscout/agents/__init__.py
-formscout/agents/biomechanics.py
-formscout/agents/body3d.py
-formscout/agents/ingest.py
-formscout/agents/pose2d.py
-formscout/rubric/__init__.py
-formscout/rubric/deep_squat.py
-formscout/serving/__init__.py
-formscout/ui/__init__.py
-tests/test_biomechanics.py
-tests/test_body3d.py
-tests/test_ingest.py
-tests/test_pose2d.py
 tests/test_types.py

+README.md
+pyproject.toml
+formscout/__init__.py
+formscout/config.py
+formscout/pipeline.py
+formscout/run.py
+formscout/tracing.py
+formscout/types.py
+formscout.egg-info/PKG-INFO
+formscout.egg-info/SOURCES.txt
+formscout.egg-info/dependency_links.txt
+formscout.egg-info/top_level.txt
+formscout/agents/__init__.py
+formscout/agents/biomechanics.py
+formscout/agents/body3d.py
+formscout/agents/ingest.py
+formscout/agents/pose2d.py
+formscout/rubric/__init__.py
+formscout/rubric/deep_squat.py
+formscout/serving/__init__.py
+formscout/ui/__init__.py
+tests/test_biomechanics.py
+tests/test_body3d.py
+tests/test_ingest.py
+tests/test_pose2d.py
 tests/test_types.py

formscout.egg-info/dependency_links.txt CHANGED Viewed

	@@ -1 +1 @@
1	-


1	+

formscout.egg-info/top_level.txt CHANGED Viewed

	@@ -1 +1 @@
1	- formscout


1	+ formscout

formscout/agents/biomechanics.py CHANGED Viewed

@@ -1,200 +1,608 @@
-"""
-BiomechanicsAgent — extracts named, documented, unit-bearing measurements from pose data.
-Input:  Pose2DResult (or Body3DResult if used), MovementResult
-Output: BiomechFeatures(test_name, view, angles, alignments, ...)
-Failure: returns BiomechFeatures with confidence=0.0 and notes.
-Params: 0 (pure computation — no model).
-License: n/a.
-Gated: no.
-This module is MEASUREMENT ONLY — no scoring happens here.
-Scoring is done by the rubric functions in formscout/rubric/.
-"""
-from __future__ import annotations
-import math
-from typing import Any
-from formscout.types import (
-    Pose2DResult, Body3DResult, MovementResult, BiomechFeatures,
-)
-from formscout import config
-def _angle_between_points(a: tuple, b: tuple, c: tuple) -> float:
-    """
-    Compute angle at point b formed by segments ba and bc.
-    Returns degrees. Returns NaN if any point is missing.
-    """
-    try:
-        ba = (a[0] - b[0], a[1] - b[1])
-        bc = (c[0] - b[0], c[1] - b[1])
-        dot = ba[0] * bc[0] + ba[1] * bc[1]
-        mag_ba = math.sqrt(ba[0] ** 2 + ba[1] ** 2)
-        mag_bc = math.sqrt(bc[0] ** 2 + bc[1] ** 2)
-        if mag_ba == 0 or mag_bc == 0:
-            return float("nan")
-        cos_angle = max(-1.0, min(1.0, dot / (mag_ba * mag_bc)))
-        return math.degrees(math.acos(cos_angle))
-    except (TypeError, IndexError, ZeroDivisionError):
-        return float("nan")
-def _get_joint(keypoints: dict, joint_id: int) -> tuple | None:
-    """Extract (x, y) for a joint, or None if missing/low-confidence."""
-    j = keypoints.get(joint_id)
-    if j is None:
-        return None
-    if j.get("conf", 0) < config.POSE_CONF_THRESHOLD:
-        return None
-    return (j["x"], j["y"])
-# COCO joint indices
-NOSE, L_EYE, R_EYE, L_EAR, R_EAR = 0, 1, 2, 3, 4
-L_SHOULDER, R_SHOULDER = 5, 6
-L_ELBOW, R_ELBOW = 7, 8
-L_WRIST, R_WRIST = 9, 10
-L_HIP, R_HIP = 11, 12
-L_KNEE, R_KNEE = 13, 14
-L_ANKLE, R_ANKLE = 15, 16
-class BiomechanicsAgent:
-    """Pure-function biomechanics measurement — no model calls."""
-    def run(
-        self,
-        pose2d: Pose2DResult,
-        body3d: Body3DResult,
-        movement: MovementResult,
-    ) -> BiomechFeatures:
-        if not pose2d.keypoints:
-            return BiomechFeatures(
-                test_name=movement.test_name,
-                view="2d",
-                side=movement.side,
-                angles={}, alignments={},
-                symmetry_delta=None, timing={},
-                confidence=0.0,
-                notes="no keypoints available",
-            )
-        view = "3d" if body3d.used else "2d"
-        # Select the analysis frame (deepest point of movement)
-        # For now, use the frame with the lowest hip position (deepest squat)
-        if movement.test_name == "deep_squat":
-            return self._deep_squat(pose2d, view, movement.side)
-        # Add other tests as they are implemented
-        return BiomechFeatures(
-            test_name=movement.test_name,
-            view=view,
-            side=movement.side,
-            angles={}, alignments={},
-            symmetry_delta=None, timing={},
-            confidence=0.3,
-            notes=f"biomechanics not yet implemented for {movement.test_name}",
-        )
-    def _deep_squat(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
-        """Extract deep squat biomechanics from the deepest frame."""
-        # Find the frame with lowest hip Y (deepest squat position)
-        best_frame_idx = 0
-        lowest_hip_y = -1.0
-        for i, kps in enumerate(pose2d.keypoints):
-            l_hip = _get_joint(kps, L_HIP)
-            r_hip = _get_joint(kps, R_HIP)
-            if l_hip and r_hip:
-                mid_hip_y = (l_hip[1] + r_hip[1]) / 2
-                if mid_hip_y > lowest_hip_y:  # higher Y = lower in image
-                    lowest_hip_y = mid_hip_y
-                    best_frame_idx = i
-        kps = pose2d.keypoints[best_frame_idx]
-        notes_parts: list[str] = []
-        # Extract joints
-        l_hip = _get_joint(kps, L_HIP)
-        r_hip = _get_joint(kps, R_HIP)
-        l_knee = _get_joint(kps, L_KNEE)
-        r_knee = _get_joint(kps, R_KNEE)
-        l_ankle = _get_joint(kps, L_ANKLE)
-        r_ankle = _get_joint(kps, R_ANKLE)
-        l_shoulder = _get_joint(kps, L_SHOULDER)
-        r_shoulder = _get_joint(kps, R_SHOULDER)
-        # Compute angles
-        angles: dict[str, float] = {}
-        # Hip-knee-ankle angle (knee flexion) — average of both sides
-        l_knee_angle = _angle_between_points(l_hip, l_knee, l_ankle) if all([l_hip, l_knee, l_ankle]) else float("nan")
-        r_knee_angle = _angle_between_points(r_hip, r_knee, r_ankle) if all([r_hip, r_knee, r_ankle]) else float("nan")
-        if not math.isnan(l_knee_angle):
-            angles["left_knee_flexion_deg"] = l_knee_angle
-        else:
-            notes_parts.append("left knee angle unavailable")
-        if not math.isnan(r_knee_angle):
-            angles["right_knee_flexion_deg"] = r_knee_angle
-        else:
-            notes_parts.append("right knee angle unavailable")
-        # Femur angle from horizontal
-        # Femur = hip to knee. Angle from horizontal = atan2(dy, dx)
-        if l_hip and l_knee:
-            dy = l_knee[1] - l_hip[1]
-            dx = l_knee[0] - l_hip[0]
-            angles["left_femur_from_horizontal_deg"] = abs(math.degrees(math.atan2(dy, dx)))
-        if r_hip and r_knee:
-            dy = r_knee[1] - r_hip[1]
-            dx = r_knee[0] - r_hip[0]
-            angles["right_femur_from_horizontal_deg"] = abs(math.degrees(math.atan2(dy, dx)))
-        # Torso-tibia angle (torso parallel to tibia = score 3 criterion)
-        if l_shoulder and l_hip and l_knee and l_ankle:
-            torso_angle = math.degrees(math.atan2(l_hip[1] - l_shoulder[1], l_hip[0] - l_shoulder[0]))
-            tibia_angle = math.degrees(math.atan2(l_ankle[1] - l_knee[1], l_ankle[0] - l_knee[0]))
-            angles["torso_tibia_angle_deg"] = abs(torso_angle - tibia_angle)
-        # Alignments
-        alignments: dict[str, Any] = {}
-        # Knee valgus check: are knees inside the ankle line?
-        if l_knee and r_knee and l_ankle and r_ankle:
-            knee_width = abs(l_knee[0] - r_knee[0])
-            ankle_width = abs(l_ankle[0] - r_ankle[0])
-            alignments["knees_tracking_over_feet"] = knee_width >= (ankle_width - config.DEEP_SQUAT_KNEE_TRACKING_MARGIN_PX)
-            alignments["knee_valgus_deg"] = 0.0  # placeholder for actual valgus angle
-        # Heels elevated detection (approximation: ankle Y relative to frame bottom)
-        # This is a rough heuristic — proper detection needs foot keypoints or depth
-        alignments["heels_elevated"] = False  # default; refine with better detection
-        # Dowel position (need wrist positions relative to feet)
-        if l_wrist := _get_joint(kps, L_WRIST):
-            if r_wrist := _get_joint(kps, R_WRIST):
-                if l_ankle and r_ankle:
-                    mid_wrist_x = (l_wrist[0] + r_wrist[0]) / 2
-                    mid_ankle_x = (l_ankle[0] + r_ankle[0]) / 2
-                    alignments["dowel_over_feet"] = abs(mid_wrist_x - mid_ankle_x) < 50
-                    alignments["dowel_feet_offset_px"] = mid_wrist_x - mid_ankle_x
-        # Confidence based on how many measurements we got
-        n_expected = 6  # main measurements
-        n_got = len(angles) + len([v for v in alignments.values() if v is not None])
-        confidence = min(1.0, n_got / n_expected) * pose2d.confidence
-        return BiomechFeatures(
-            test_name="deep_squat",
-            view=view,
-            side="na",
-            angles=angles,
-            alignments=alignments,
-            symmetry_delta=None,
-            timing={"deepest_frame": best_frame_idx},
-            confidence=confidence,
-            notes="; ".join(notes_parts) if notes_parts else "",
-        )

+"""
+BiomechanicsAgent — extracts named, documented, unit-bearing measurements from pose data.
+Input:  Pose2DResult (or Body3DResult if used), MovementResult
+Output: BiomechFeatures(test_name, view, angles, alignments, ...)
+Failure: returns BiomechFeatures with confidence=0.0 and notes.
+Params: 0 (pure computation — no model).
+License: n/a.
+Gated: no.
+This module is MEASUREMENT ONLY — no scoring happens here.
+Scoring is done by the rubric functions in formscout/rubric/.
+"""
+from __future__ import annotations
+import math
+from typing import Any
+from formscout.types import (
+    Pose2DResult, Body3DResult, MovementResult, BiomechFeatures,
+)
+from formscout import config
+def _angle_between_points(a: tuple, b: tuple, c: tuple) -> float:
+    """
+    Compute angle at point b formed by segments ba and bc.
+    Returns degrees. Returns NaN if any point is missing.
+    """
+    try:
+        ba = (a[0] - b[0], a[1] - b[1])
+        bc = (c[0] - b[0], c[1] - b[1])
+        dot = ba[0] * bc[0] + ba[1] * bc[1]
+        mag_ba = math.sqrt(ba[0] ** 2 + ba[1] ** 2)
+        mag_bc = math.sqrt(bc[0] ** 2 + bc[1] ** 2)
+        if mag_ba == 0 or mag_bc == 0:
+            return float("nan")
+        cos_angle = max(-1.0, min(1.0, dot / (mag_ba * mag_bc)))
+        return math.degrees(math.acos(cos_angle))
+    except (TypeError, IndexError, ZeroDivisionError):
+        return float("nan")
+def _get_joint(keypoints: dict, joint_id: int) -> tuple | None:
+    """Extract (x, y) for a joint, or None if missing/low-confidence."""
+    j = keypoints.get(joint_id)
+    if j is None:
+        return None
+    if j.get("conf", 0) < config.POSE_CONF_THRESHOLD:
+        return None
+    return (j["x"], j["y"])
+# COCO joint indices
+NOSE, L_EYE, R_EYE, L_EAR, R_EAR = 0, 1, 2, 3, 4
+L_SHOULDER, R_SHOULDER = 5, 6
+L_ELBOW, R_ELBOW = 7, 8
+L_WRIST, R_WRIST = 9, 10
+L_HIP, R_HIP = 11, 12
+L_KNEE, R_KNEE = 13, 14
+L_ANKLE, R_ANKLE = 15, 16
+class BiomechanicsAgent:
+    """Pure-function biomechanics measurement — no model calls."""
+    def run(
+        self,
+        pose2d: Pose2DResult,
+        body3d: Body3DResult,
+        movement: MovementResult,
+    ) -> BiomechFeatures:
+        if not pose2d.keypoints:
+            return BiomechFeatures(
+                test_name=movement.test_name,
+                view="2d",
+                side=movement.side,
+                angles={}, alignments={},
+                symmetry_delta=None, timing={},
+                confidence=0.0,
+                notes="no keypoints available",
+            )
+        view = "3d" if body3d.used else "2d"
+        dispatch = {
+            "deep_squat": self._deep_squat,
+            "hurdle_step": self._hurdle_step,
+            "inline_lunge": self._inline_lunge,
+            "shoulder_mobility": self._shoulder_mobility,
+            "active_slr": self._active_slr,
+            "trunk_stability_pushup": self._trunk_stability_pushup,
+            "rotary_stability": self._rotary_stability,
+        }
+        fn = dispatch.get(movement.test_name)
+        if fn is None:
+            return BiomechFeatures(
+                test_name=movement.test_name, view=view, side=movement.side,
+                angles={}, alignments={}, symmetry_delta=None, timing={},
+                confidence=0.0, notes=f"unknown test: {movement.test_name}",
+            )
+        return fn(pose2d, view, movement.side)
+    def _deep_squat(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """Extract deep squat biomechanics from the deepest frame."""
+        # Find the frame with lowest hip Y (deepest squat position)
+        best_frame_idx = 0
+        lowest_hip_y = -1.0
+        for i, kps in enumerate(pose2d.keypoints):
+            l_hip = _get_joint(kps, L_HIP)
+            r_hip = _get_joint(kps, R_HIP)
+            if l_hip and r_hip:
+                mid_hip_y = (l_hip[1] + r_hip[1]) / 2
+                if mid_hip_y > lowest_hip_y:  # higher Y = lower in image
+                    lowest_hip_y = mid_hip_y
+                    best_frame_idx = i
+        kps = pose2d.keypoints[best_frame_idx]
+        notes_parts: list[str] = []
+        # Extract joints
+        l_hip = _get_joint(kps, L_HIP)
+        r_hip = _get_joint(kps, R_HIP)
+        l_knee = _get_joint(kps, L_KNEE)
+        r_knee = _get_joint(kps, R_KNEE)
+        l_ankle = _get_joint(kps, L_ANKLE)
+        r_ankle = _get_joint(kps, R_ANKLE)
+        l_shoulder = _get_joint(kps, L_SHOULDER)
+        r_shoulder = _get_joint(kps, R_SHOULDER)
+        # Compute angles
+        angles: dict[str, float] = {}
+        # Hip-knee-ankle angle (knee flexion) — average of both sides
+        l_knee_angle = _angle_between_points(l_hip, l_knee, l_ankle) if all([l_hip, l_knee, l_ankle]) else float("nan")
+        r_knee_angle = _angle_between_points(r_hip, r_knee, r_ankle) if all([r_hip, r_knee, r_ankle]) else float("nan")
+        if not math.isnan(l_knee_angle):
+            angles["left_knee_flexion_deg"] = l_knee_angle
+        else:
+            notes_parts.append("left knee angle unavailable")
+        if not math.isnan(r_knee_angle):
+            angles["right_knee_flexion_deg"] = r_knee_angle
+        else:
+            notes_parts.append("right knee angle unavailable")
+        # Femur angle from horizontal
+        # Femur = hip to knee. Angle from horizontal = atan2(dy, dx)
+        if l_hip and l_knee:
+            dy = l_knee[1] - l_hip[1]
+            dx = l_knee[0] - l_hip[0]
+            angles["left_femur_from_horizontal_deg"] = abs(math.degrees(math.atan2(dy, dx)))
+        if r_hip and r_knee:
+            dy = r_knee[1] - r_hip[1]
+            dx = r_knee[0] - r_hip[0]
+            angles["right_femur_from_horizontal_deg"] = abs(math.degrees(math.atan2(dy, dx)))
+        # Torso-tibia angle (torso parallel to tibia = score 3 criterion)
+        if l_shoulder and l_hip and l_knee and l_ankle:
+            torso_angle = math.degrees(math.atan2(l_hip[1] - l_shoulder[1], l_hip[0] - l_shoulder[0]))
+            tibia_angle = math.degrees(math.atan2(l_ankle[1] - l_knee[1], l_ankle[0] - l_knee[0]))
+            angles["torso_tibia_angle_deg"] = abs(torso_angle - tibia_angle)
+        # Alignments
+        alignments: dict[str, Any] = {}
+        # Knee valgus check: are knees inside the ankle line?
+        if l_knee and r_knee and l_ankle and r_ankle:
+            knee_width = abs(l_knee[0] - r_knee[0])
+            ankle_width = abs(l_ankle[0] - r_ankle[0])
+            alignments["knees_tracking_over_feet"] = knee_width >= (ankle_width - config.DEEP_SQUAT_KNEE_TRACKING_MARGIN_PX)
+            alignments["knee_valgus_deg"] = 0.0  # placeholder for actual valgus angle
+        # Heels elevated detection (approximation: ankle Y relative to frame bottom)
+        # This is a rough heuristic — proper detection needs foot keypoints or depth
+        alignments["heels_elevated"] = False  # default; refine with better detection
+        # Dowel position (need wrist positions relative to feet)
+        if l_wrist := _get_joint(kps, L_WRIST):
+            if r_wrist := _get_joint(kps, R_WRIST):
+                if l_ankle and r_ankle:
+                    mid_wrist_x = (l_wrist[0] + r_wrist[0]) / 2
+                    mid_ankle_x = (l_ankle[0] + r_ankle[0]) / 2
+                    alignments["dowel_over_feet"] = abs(mid_wrist_x - mid_ankle_x) < 50
+                    alignments["dowel_feet_offset_px"] = mid_wrist_x - mid_ankle_x
+        # Confidence based on how many measurements we got
+        n_expected = 6  # main measurements
+        n_got = len(angles) + len([v for v in alignments.values() if v is not None])
+        confidence = min(1.0, n_got / n_expected) * pose2d.confidence
+        return BiomechFeatures(
+            test_name="deep_squat",
+            view=view,
+            side="na",
+            angles=angles,
+            alignments=alignments,
+            symmetry_delta=None,
+            timing={"deepest_frame": best_frame_idx},
+            confidence=confidence,
+            notes="; ".join(notes_parts) if notes_parts else "",
+        )
+    # ─── Helper: find peak frame by joint Y ─────────────────────────────────
+    def _find_peak_frame(self, pose2d: Pose2DResult, joint_id: int, maximize: bool = True) -> int:
+        """Find frame where a joint reaches its extreme Y position."""
+        best_idx, best_val = 0, -1.0 if maximize else float("inf")
+        for i, kps in enumerate(pose2d.keypoints):
+            j = _get_joint(kps, joint_id)
+            if j:
+                if (maximize and j[1] > best_val) or (not maximize and j[1] < best_val):
+                    best_val = j[1]
+                    best_idx = i
+        return best_idx
+    def _bilateral_features(
+        self, pose2d: Pose2DResult, view: str, side: str, test_name: str,
+        extractor,
+    ) -> BiomechFeatures:
+        """Run a bilateral test: compute both sides, report the specified side + symmetry_delta."""
+        left = extractor(pose2d, "left")
+        right = extractor(pose2d, "right")
+        # Pick the requested side as primary
+        primary = left if side == "left" else right if side == "right" else left
+        other = right if side == "left" else left if side == "right" else right
+        # Merge angles with side prefix for the primary
+        angles = primary.get("angles", {})
+        alignments = primary.get("alignments", {})
+        timing = primary.get("timing", {})
+        # Compute symmetry delta from the main measurement
+        main_key = primary.get("main_measure_key")
+        sym_delta = None
+        if main_key and main_key in left.get("angles", {}) and main_key in right.get("angles", {}):
+            sym_delta = abs(left["angles"][main_key] - right["angles"][main_key])
+        n_got = len(angles) + len([v for v in alignments.values() if v is not None])
+        confidence = min(1.0, n_got / max(primary.get("expected", 3), 1)) * pose2d.confidence
+        return BiomechFeatures(
+            test_name=test_name, view=view, side=side,
+            angles=angles, alignments=alignments,
+            symmetry_delta=sym_delta, timing=timing,
+            confidence=confidence,
+            notes=primary.get("notes", ""),
+        )
+    # ─── Hurdle Step ─────────────────────────────────────────────────────────
+    def _hurdle_step(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """Hurdle Step: hip/knee flexion of stepping leg, stance stability."""
+        def extract(p2d: Pose2DResult, s: str) -> dict:
+            hip_id = L_HIP if s == "left" else R_HIP
+            knee_id = L_KNEE if s == "left" else R_KNEE
+            ankle_id = L_ANKLE if s == "left" else R_ANKLE
+            # Stance side is opposite
+            stance_hip = R_HIP if s == "left" else L_HIP
+            stance_knee = R_KNEE if s == "left" else L_KNEE
+            stance_ankle = R_ANKLE if s == "left" else L_ANKLE
+            # Peak = frame where stepping knee is highest (lowest Y in image)
+            peak_idx = self._find_peak_frame(p2d, knee_id, maximize=False)
+            kps = p2d.keypoints[peak_idx]
+            hip = _get_joint(kps, hip_id)
+            knee = _get_joint(kps, knee_id)
+            ankle = _get_joint(kps, ankle_id)
+            s_hip = _get_joint(kps, stance_hip)
+            s_knee = _get_joint(kps, stance_knee)
+            s_ankle = _get_joint(kps, stance_ankle)
+            angles = {}
+            alignments = {}
+            notes_parts = []
+            # Hip flexion of stepping leg
+            if all([hip, knee, ankle]):
+                angles["step_knee_flexion_deg"] = _angle_between_points(hip, knee, ankle)
+            # Hip angle (torso-femur)
+            shoulder_id = L_SHOULDER if s == "left" else R_SHOULDER
+            shoulder = _get_joint(kps, shoulder_id)
+            if all([shoulder, hip, knee]):
+                angles["step_hip_flexion_deg"] = _angle_between_points(shoulder, hip, knee)
+            # Stance knee should stay extended
+            if all([s_hip, s_knee, s_ankle]):
+                angles["stance_knee_angle_deg"] = _angle_between_points(s_hip, s_knee, s_ankle)
+                alignments["stance_knee_extended"] = angles["stance_knee_angle_deg"] > 160
+            # Lateral trunk lean: shoulders should be level
+            l_sh = _get_joint(kps, L_SHOULDER)
+            r_sh = _get_joint(kps, R_SHOULDER)
+            if l_sh and r_sh:
+                angles["shoulder_tilt_deg"] = abs(math.degrees(
+                    math.atan2(r_sh[1] - l_sh[1], r_sh[0] - l_sh[0])
+                ))
+                alignments["trunk_stable"] = angles["shoulder_tilt_deg"] < 10
+            return {
+                "angles": angles, "alignments": alignments,
+                "timing": {"peak_step_frame": peak_idx},
+                "main_measure_key": "step_hip_flexion_deg",
+                "expected": 4, "notes": "; ".join(notes_parts),
+            }
+        return self._bilateral_features(pose2d, view, side, "hurdle_step", extract)
+    # ─── In-Line Lunge ───────────────────────────────────────────────────────
+    def _inline_lunge(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """In-Line Lunge: knee flexion depth, trunk upright, balance."""
+        def extract(p2d: Pose2DResult, s: str) -> dict:
+            # Front leg is the assessed side
+            hip_id = L_HIP if s == "left" else R_HIP
+            knee_id = L_KNEE if s == "left" else R_KNEE
+            ankle_id = L_ANKLE if s == "left" else R_ANKLE
+            rear_knee_id = R_KNEE if s == "left" else L_KNEE
+            # Deepest lunge = front knee lowest
+            peak_idx = self._find_peak_frame(p2d, knee_id, maximize=True)
+            kps = p2d.keypoints[peak_idx]
+            hip = _get_joint(kps, hip_id)
+            knee = _get_joint(kps, knee_id)
+            ankle = _get_joint(kps, ankle_id)
+            l_sh = _get_joint(kps, L_SHOULDER)
+            r_sh = _get_joint(kps, R_SHOULDER)
+            l_hip = _get_joint(kps, L_HIP)
+            r_hip = _get_joint(kps, R_HIP)
+            angles = {}
+            alignments = {}
+            # Front knee flexion
+            if all([hip, knee, ankle]):
+                angles["front_knee_flexion_deg"] = _angle_between_points(hip, knee, ankle)
+            # Trunk upright: midline shoulder-to-hip angle from vertical
+            if l_sh and r_sh and l_hip and r_hip:
+                mid_sh = ((l_sh[0] + r_sh[0]) / 2, (l_sh[1] + r_sh[1]) / 2)
+                mid_hip = ((l_hip[0] + r_hip[0]) / 2, (l_hip[1] + r_hip[1]) / 2)
+                trunk_from_vert = abs(math.degrees(
+                    math.atan2(mid_hip[0] - mid_sh[0], mid_sh[1] - mid_hip[1])
+                ))
+                angles["trunk_lean_from_vertical_deg"] = trunk_from_vert
+                alignments["trunk_upright"] = trunk_from_vert < 15
+            # Knee over ankle alignment
+            if knee and ankle:
+                alignments["knee_over_ankle"] = abs(knee[0] - ankle[0]) < 40
+            return {
+                "angles": angles, "alignments": alignments,
+                "timing": {"deepest_lunge_frame": peak_idx},
+                "main_measure_key": "front_knee_flexion_deg",
+                "expected": 3, "notes": "",
+            }
+        return self._bilateral_features(pose2d, view, side, "inline_lunge", extract)
+    # ─── Shoulder Mobility ───────────────────────────────────────────────────
+    def _shoulder_mobility(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """Shoulder Mobility: inter-fist distance normalized to hand length."""
+        def extract(p2d: Pose2DResult, s: str) -> dict:
+            # "side" = the hand reaching over (top hand)
+            top_wrist = L_WRIST if s == "left" else R_WRIST
+            bot_wrist = R_WRIST if s == "left" else L_WRIST
+            # Use mid-sequence frame (static hold)
+            mid_idx = len(p2d.keypoints) // 2
+            kps = p2d.keypoints[mid_idx]
+            top_w = _get_joint(kps, top_wrist)
+            bot_w = _get_joint(kps, bot_wrist)
+            angles = {}
+            alignments = {}
+            if top_w and bot_w:
+                # Vertical distance between fists (normalized by torso length)
+                fist_dist_px = math.sqrt((top_w[0] - bot_w[0])**2 + (top_w[1] - bot_w[1])**2)
+                angles["inter_fist_distance_px"] = fist_dist_px
+                # Normalize by torso length (shoulder to hip)
+                sh_id = L_SHOULDER if s == "left" else R_SHOULDER
+                hip_id = L_HIP if s == "left" else R_HIP
+                sh = _get_joint(kps, sh_id)
+                hip = _get_joint(kps, hip_id)
+                if sh and hip:
+                    torso_len = math.sqrt((sh[0] - hip[0])**2 + (sh[1] - hip[1])**2)
+                    if torso_len > 0:
+                        norm_dist = fist_dist_px / torso_len
+                        angles["inter_fist_normalized"] = norm_dist
+                        # Score 3: fists within 1 hand-length (~0.3 torso)
+                        # Score 2: within 1.5 hand-lengths
+                        alignments["fists_within_one_hand"] = norm_dist < 0.35
+                        alignments["fists_within_1_5_hand"] = norm_dist < 0.55
+            return {
+                "angles": angles, "alignments": alignments,
+                "timing": {"measure_frame": mid_idx},
+                "main_measure_key": "inter_fist_normalized",
+                "expected": 2, "notes": "",
+            }
+        return self._bilateral_features(pose2d, view, side, "shoulder_mobility", extract)
+    # ─── Active Straight-Leg Raise ───────────────────────────────────────────
+    def _active_slr(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """ASLR: hip flexion angle of raised leg; down-leg stays flat."""
+        def extract(p2d: Pose2DResult, s: str) -> dict:
+            hip_id = L_HIP if s == "left" else R_HIP
+            knee_id = L_KNEE if s == "left" else R_KNEE
+            ankle_id = L_ANKLE if s == "left" else R_ANKLE
+            # Down leg
+            d_hip_id = R_HIP if s == "left" else L_HIP
+            d_knee_id = R_KNEE if s == "left" else L_KNEE
+            d_ankle_id = R_ANKLE if s == "left" else L_ANKLE
+            # Peak = raised ankle at highest point (lowest Y)
+            peak_idx = self._find_peak_frame(p2d, ankle_id, maximize=False)
+            kps = p2d.keypoints[peak_idx]
+            hip = _get_joint(kps, hip_id)
+            knee = _get_joint(kps, knee_id)
+            ankle = _get_joint(kps, ankle_id)
+            d_hip = _get_joint(kps, d_hip_id)
+            d_knee = _get_joint(kps, d_knee_id)
+            d_ankle = _get_joint(kps, d_ankle_id)
+            angles = {}
+            alignments = {}
+            # Raised leg hip flexion: angle of femur from horizontal
+            if hip and ankle:
+                dy = hip[1] - ankle[1]  # positive = ankle above hip
+                dx = ankle[0] - hip[0]
+                hip_flex = math.degrees(math.atan2(dy, abs(dx) if abs(dx) > 1 else 1))
+                angles["raised_leg_angle_deg"] = max(0, hip_flex)
+                # Score 3: malleolus past contralateral knee (>70°)
+                # Score 2: between contralateral knee and mid-thigh (45-70°)
+                alignments["past_contralateral_knee"] = hip_flex > 70
+                alignments["past_mid_thigh"] = hip_flex > 45
+            # Down leg: should stay flat (knee angle ~180)
+            if all([d_hip, d_knee, d_ankle]):
+                down_knee_angle = _angle_between_points(d_hip, d_knee, d_ankle)
+                angles["down_leg_knee_angle_deg"] = down_knee_angle
+                alignments["down_leg_flat"] = down_knee_angle > 160
+            return {
+                "angles": angles, "alignments": alignments,
+                "timing": {"peak_raise_frame": peak_idx},
+                "main_measure_key": "raised_leg_angle_deg",
+                "expected": 3, "notes": "",
+            }
+        return self._bilateral_features(pose2d, view, side, "active_slr", extract)
+    # ─── Trunk Stability Push-Up ─────────────────────────────────────────────
+    def _trunk_stability_pushup(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """Trunk Stability Push-Up: body rigidity through the press."""
+        angles = {}
+        alignments = {}
+        notes_parts = []
+        # Analyze multiple frames to detect sag/lag
+        trunk_angles_over_time = []
+        for i, kps in enumerate(pose2d.keypoints):
+            l_sh = _get_joint(kps, L_SHOULDER)
+            r_sh = _get_joint(kps, R_SHOULDER)
+            l_hip = _get_joint(kps, L_HIP)
+            r_hip = _get_joint(kps, R_HIP)
+            l_ankle = _get_joint(kps, L_ANKLE)
+            r_ankle = _get_joint(kps, R_ANKLE)
+            if l_sh and r_sh and l_hip and r_hip and l_ankle and r_ankle:
+                mid_sh = ((l_sh[1] + r_sh[1]) / 2,)
+                mid_hip = ((l_hip[1] + r_hip[1]) / 2,)
+                mid_ankle = ((l_ankle[1] + r_ankle[1]) / 2,)
+                # Sag = hip drops below shoulder-ankle line
+                sh_y = (l_sh[1] + r_sh[1]) / 2
+                hip_y = (l_hip[1] + r_hip[1]) / 2
+                ankle_y = (l_ankle[1] + r_ankle[1]) / 2
+                # In image coords: sag = hip_y > midpoint of shoulder-ankle Y
+                expected_hip_y = (sh_y + ankle_y) / 2
+                sag_px = hip_y - expected_hip_y
+                trunk_angles_over_time.append(sag_px)
+        if trunk_angles_over_time:
+            max_sag = max(trunk_angles_over_time)
+            variance = (sum((x - sum(trunk_angles_over_time) / len(trunk_angles_over_time))**2
+                       for x in trunk_angles_over_time) / len(trunk_angles_over_time)) ** 0.5
+            angles["max_sag_px"] = max_sag
+            angles["trunk_variance_px"] = variance
+            alignments["body_rigid"] = max_sag < 30 and variance < 15
+            alignments["no_sag"] = max_sag < 30
+        else:
+            notes_parts.append("insufficient landmarks for trunk analysis")
+        # Hand position (near head = harder = score 3 position)
+        if pose2d.keypoints:
+            mid_kps = pose2d.keypoints[0]
+            nose = _get_joint(mid_kps, NOSE)
+            l_w = _get_joint(mid_kps, L_WRIST)
+            r_w = _get_joint(mid_kps, R_WRIST)
+            if nose and l_w and r_w:
+                avg_wrist_y = (l_w[1] + r_w[1]) / 2
+                # Hands near head = wrist Y close to nose Y
+                alignments["hands_at_forehead"] = abs(avg_wrist_y - nose[1]) < 50
+        n_got = len(angles) + len([v for v in alignments.values() if v is not None])
+        confidence = min(1.0, n_got / 3) * pose2d.confidence
+        return BiomechFeatures(
+            test_name="trunk_stability_pushup", view=view, side="na",
+            angles=angles, alignments=alignments,
+            symmetry_delta=None,
+            timing={"n_frames_analyzed": len(trunk_angles_over_time)},
+            confidence=confidence,
+            notes="; ".join(notes_parts) if notes_parts else "",
+        )
+    # ─── Rotary Stability ────────────────────────────────────────────────────
+    def _rotary_stability(self, pose2d: Pose2DResult, view: str, side: str) -> BiomechFeatures:
+        """Rotary Stability: coordination of ipsilateral arm/leg extension."""
+        angles = {}
+        alignments = {}
+        notes_parts = []
+        # Look for the frame with max arm+leg extension
+        # Quadruped: hands + knees on ground, extending one arm + one leg
+        best_ext_frame = 0
+        best_ext_val = 0
+        for i, kps in enumerate(pose2d.keypoints):
+            l_w = _get_joint(kps, L_WRIST)
+            r_w = _get_joint(kps, R_WRIST)
+            l_a = _get_joint(kps, L_ANKLE)
+            r_a = _get_joint(kps, R_ANKLE)
+            l_sh = _get_joint(kps, L_SHOULDER)
+            r_sh = _get_joint(kps, R_SHOULDER)
+            # Extension = distance of wrist from shoulder + ankle from hip
+            ext_val = 0
+            if l_w and l_sh:
+                ext_val += abs(l_w[0] - l_sh[0])
+            if r_w and r_sh:
+                ext_val += abs(r_w[0] - r_sh[0])
+            if ext_val > best_ext_val:
+                best_ext_val = ext_val
+                best_ext_frame = i
+        kps = pose2d.keypoints[best_ext_frame] if pose2d.keypoints else {}
+        # Trunk stability: shoulders level, hips level
+        l_sh = _get_joint(kps, L_SHOULDER)
+        r_sh = _get_joint(kps, R_SHOULDER)
+        l_hip = _get_joint(kps, L_HIP)
+        r_hip = _get_joint(kps, R_HIP)
+        if l_sh and r_sh:
+            sh_tilt = abs(l_sh[1] - r_sh[1])
+            angles["shoulder_level_diff_px"] = sh_tilt
+            alignments["shoulders_level"] = sh_tilt < 20
+        if l_hip and r_hip:
+            hip_tilt = abs(l_hip[1] - r_hip[1])
+            angles["hip_level_diff_px"] = hip_tilt
+            alignments["hips_level"] = hip_tilt < 20
+        # Check for trunk sag across frames (similar to pushup)
+        trunk_variance = []
+        for kps_frame in pose2d.keypoints:
+            ls = _get_joint(kps_frame, L_SHOULDER)
+            rs = _get_joint(kps_frame, R_SHOULDER)
+            lh = _get_joint(kps_frame, L_HIP)
+            rh = _get_joint(kps_frame, R_HIP)
+            if ls and rs and lh and rh:
+                mid_sh_y = (ls[1] + rs[1]) / 2
+                mid_hip_y = (lh[1] + rh[1]) / 2
+                trunk_variance.append(mid_hip_y - mid_sh_y)
+        if trunk_variance:
+            std = (sum((x - sum(trunk_variance) / len(trunk_variance))**2
+                   for x in trunk_variance) / len(trunk_variance)) ** 0.5
+            angles["trunk_stability_std_px"] = std
+            alignments["trunk_stable"] = std < 15
+        n_got = len(angles) + len([v for v in alignments.values() if v is not None])
+        confidence = min(1.0, n_got / 3) * pose2d.confidence
+        return BiomechFeatures(
+            test_name="rotary_stability", view=view, side="na",
+            angles=angles, alignments=alignments,
+            symmetry_delta=None,
+            timing={"peak_extension_frame": best_ext_frame},
+            confidence=confidence,
+            notes="; ".join(notes_parts) if notes_parts else "",
+        )

formscout/agents/body3d.py CHANGED Viewed

@@ -1,221 +1,221 @@
-"""
-Body3DAgent — optional 3D mesh/joint angle recovery via SAM 3D Body.
-Input:  Pose2DResult, list of athlete masks, list of frames (np.ndarray BGR)
-Output: Body3DResult(used, joints_3d, confidence)
-Failure: ALWAYS returns Body3DResult(used=False) when enable_3d=False or
-         checkpoint unavailable — this is a normal success path, not an error.
-Model:  facebook/sam-3d-body-dinov3 (840M params, SAM License, GATED).
-Gated:  YES — access GRANTED June 4, 2026.
-Params: ~0.84B (DINOv3-H+ variant).
-API (verified from github.com/facebookresearch/sam-3d-body README, Jun 2026):
-  from notebook.utils import setup_sam_3d_body
-  estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
-  outputs = estimator.process_one_image(rgb_image)  # single RGB np.ndarray
-  # outputs contains MHR joints, body mesh, etc.
-"""
-from __future__ import annotations
-import numpy as np
-from formscout.types import Pose2DResult, Body3DResult, IngestResult
-from formscout import config
-_NOT_USED = Body3DResult(
-    used=False, joints_3d=[], confidence=0.0,
-    notes="3D disabled or checkpoint unavailable",
-)
-# Subsample frames for 3D inference (expensive per-frame)
-_MAX_3D_FRAMES = 30
-class Body3DAgent:
-    """
-    Optional 3D body joint estimation via SAM 3D Body (MHR rig).
-    Falls back gracefully when unavailable — returning Body3DResult(used=False)
-    is the expected success path for the 2D-only pipeline.
-    """
-    def __init__(self, enable_3d: bool | None = None):
-        self._enabled = config.ENABLE_3D if enable_3d is None else enable_3d
-        self._estimator = None
-        if self._enabled:
-            self._estimator = self._try_load()
-    def _try_load(self):
-        """
-        Attempt to load SAM 3D Body from HuggingFace.
-        Returns the estimator object or None on any failure.
-        """
-        try:
-            from notebook.utils import setup_sam_3d_body  # noqa: F401
-            estimator = setup_sam_3d_body(
-                hf_repo_id=config.SAM_3D_HF_REPO,
-            )
-            return estimator
-        except ImportError:
-            return None
-        except Exception:
-            return None
-    def run(
-        self,
-        pose2d: Pose2DResult,
-        masks: list,
-        frames: list | None = None,
-    ) -> Body3DResult:
-        """
-        Run 3D body estimation on selected keyframes.
-        Args:
-            pose2d: 2D pose results (used for confidence weighting)
-            masks: Per-frame athlete masks from SegmentationAgent
-            frames: Raw BGR frames from IngestResult.frames
-        Returns:
-            Body3DResult with used=True and 3D joints if successful,
-            or Body3DResult(used=False) if disabled/unavailable (normal path).
-        """
-        if not self._enabled or self._estimator is None:
-            return _NOT_USED
-        if not frames:
-            return Body3DResult(
-                used=False, joints_3d=[], confidence=0.0,
-                notes="3D enabled but no frames provided",
-            )
-        try:
-            import cv2
-            # Subsample frames evenly for 3D (it's expensive per-image)
-            n_frames = len(frames)
-            step = max(1, n_frames // _MAX_3D_FRAMES)
-            selected_indices = list(range(0, n_frames, step))[:_MAX_3D_FRAMES]
-            joints_3d_per_frame: list[dict] = []
-            confidences: list[float] = []
-            for idx in selected_indices:
-                frame_bgr = frames[idx]
-                # SAM 3D Body expects RGB
-                frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
-                outputs = self._estimator.process_one_image(frame_rgb)
-                # Extract MHR joint positions from outputs
-                # The model returns joints in the MHR (Momentum Human Rig) format
-                frame_joints = self._extract_joints(outputs, idx)
-                joints_3d_per_frame.append(frame_joints)
-                # Confidence from detection quality
-                conf = self._estimate_confidence(outputs)
-                confidences.append(conf)
-            # Apply light temporal smoothing to reduce jitter
-            joints_3d_smoothed = self._temporal_smooth(joints_3d_per_frame)
-            overall_conf = float(np.mean(confidences)) if confidences else 0.0
-            return Body3DResult(
-                used=True,
-                joints_3d=joints_3d_smoothed,
-                confidence=overall_conf,
-                notes=f"3D mesh recovery on {len(selected_indices)}/{n_frames} frames",
-            )
-        except Exception as e:
-            return Body3DResult(
-                used=False, joints_3d=[], confidence=0.0,
-                notes=f"3D inference failed: {e}",
-            )
-    def _extract_joints(self, outputs: dict, frame_idx: int) -> dict:
-        """
-        Extract 3D joint positions from SAM 3D Body outputs.
-        Maps MHR rig joints to a standardized dict format.
-        """
-        joints: dict = {"frame_index": frame_idx}
-        # SAM 3D Body outputs MHR model params including joint positions
-        # The exact key depends on the model output format
-        if hasattr(outputs, "joints_3d"):
-            joint_data = outputs.joints_3d
-        elif isinstance(outputs, dict) and "joints_3d" in outputs:
-            joint_data = outputs["joints_3d"]
-        elif isinstance(outputs, dict) and "pred_joints" in outputs:
-            joint_data = outputs["pred_joints"]
-        else:
-            # Fallback: extract from vertices/body model params
-            joint_data = None
-        if joint_data is not None:
-            if hasattr(joint_data, "cpu"):
-                joint_data = joint_data.cpu().numpy()
-            if isinstance(joint_data, np.ndarray):
-                # Map to named joints (MHR has standard SMPL-like ordering)
-                joint_names = [
-                    "pelvis", "left_hip", "right_hip", "spine1",
-                    "left_knee", "right_knee", "spine2",
-                    "left_ankle", "right_ankle", "spine3",
-                    "left_foot", "right_foot", "neck",
-                    "left_collar", "right_collar", "head",
-                    "left_shoulder", "right_shoulder",
-                    "left_elbow", "right_elbow",
-                    "left_wrist", "right_wrist",
-                ]
-                for i, name in enumerate(joint_names):
-                    if i < len(joint_data):
-                        pos = joint_data[i]
-                        joints[name] = {
-                            "x": float(pos[0]),
-                            "y": float(pos[1]),
-                            "z": float(pos[2]),
-                        }
-        return joints
-    def _estimate_confidence(self, outputs) -> float:
-        """Estimate confidence from the SAM 3D Body output quality."""
-        # If outputs have a confidence/score field, use it
-        if isinstance(outputs, dict):
-            if "confidence" in outputs:
-                return float(outputs["confidence"])
-            if "score" in outputs:
-                return float(outputs["score"])
-        # Default: assume reasonable confidence if we got outputs at all
-        return 0.75
-    def _temporal_smooth(
-        self, joints_3d: list[dict], alpha: float = 0.3
-    ) -> list[dict]:
-        """
-        Apply exponential moving average smoothing to 3D joint positions
-        to reduce per-frame jitter from single-image prediction.
-        """
-        if len(joints_3d) <= 1:
-            return joints_3d
-        smoothed = [joints_3d[0]]
-        for i in range(1, len(joints_3d)):
-            prev = smoothed[-1]
-            curr = joints_3d[i]
-            smooth_frame = {"frame_index": curr.get("frame_index", i)}
-            for key in curr:
-                if key == "frame_index":
-                    continue
-                if key in prev and isinstance(curr[key], dict) and isinstance(prev[key], dict):
-                    smooth_frame[key] = {
-                        "x": alpha * curr[key]["x"] + (1 - alpha) * prev[key]["x"],
-                        "y": alpha * curr[key]["y"] + (1 - alpha) * prev[key]["y"],
-                        "z": alpha * curr[key]["z"] + (1 - alpha) * prev[key]["z"],
-                    }
-                else:
-                    smooth_frame[key] = curr[key]
-            smoothed.append(smooth_frame)
-        return smoothed

+"""
+Body3DAgent — optional 3D mesh/joint angle recovery via SAM 3D Body.
+Input:  Pose2DResult, list of athlete masks, list of frames (np.ndarray BGR)
+Output: Body3DResult(used, joints_3d, confidence)
+Failure: ALWAYS returns Body3DResult(used=False) when enable_3d=False or
+         checkpoint unavailable — this is a normal success path, not an error.
+Model:  facebook/sam-3d-body-dinov3 (840M params, SAM License, GATED).
+Gated:  YES — access GRANTED June 4, 2026.
+Params: ~0.84B (DINOv3-H+ variant).
+API (verified from github.com/facebookresearch/sam-3d-body README, Jun 2026):
+  from notebook.utils import setup_sam_3d_body
+  estimator = setup_sam_3d_body(hf_repo_id="facebook/sam-3d-body-dinov3")
+  outputs = estimator.process_one_image(rgb_image)  # single RGB np.ndarray
+  # outputs contains MHR joints, body mesh, etc.
+"""
+from __future__ import annotations
+import numpy as np
+from formscout.types import Pose2DResult, Body3DResult, IngestResult
+from formscout import config
+_NOT_USED = Body3DResult(
+    used=False, joints_3d=[], confidence=0.0,
+    notes="3D disabled or checkpoint unavailable",
+)
+# Subsample frames for 3D inference (expensive per-frame)
+_MAX_3D_FRAMES = 30
+class Body3DAgent:
+    """
+    Optional 3D body joint estimation via SAM 3D Body (MHR rig).
+    Falls back gracefully when unavailable — returning Body3DResult(used=False)
+    is the expected success path for the 2D-only pipeline.
+    """
+    def __init__(self, enable_3d: bool | None = None):
+        self._enabled = config.ENABLE_3D if enable_3d is None else enable_3d
+        self._estimator = None
+        if self._enabled:
+            self._estimator = self._try_load()
+    def _try_load(self):
+        """
+        Attempt to load SAM 3D Body from HuggingFace.
+        Returns the estimator object or None on any failure.
+        """
+        try:
+            from notebook.utils import setup_sam_3d_body  # noqa: F401
+            estimator = setup_sam_3d_body(
+                hf_repo_id=config.SAM_3D_HF_REPO,
+            )
+            return estimator
+        except ImportError:
+            return None
+        except Exception:
+            return None
+    def run(
+        self,
+        pose2d: Pose2DResult,
+        masks: list,
+        frames: list | None = None,
+    ) -> Body3DResult:
+        """
+        Run 3D body estimation on selected keyframes.
+        Args:
+            pose2d: 2D pose results (used for confidence weighting)
+            masks: Per-frame athlete masks from SegmentationAgent
+            frames: Raw BGR frames from IngestResult.frames
+        Returns:
+            Body3DResult with used=True and 3D joints if successful,
+            or Body3DResult(used=False) if disabled/unavailable (normal path).
+        """
+        if not self._enabled or self._estimator is None:
+            return _NOT_USED
+        if not frames:
+            return Body3DResult(
+                used=False, joints_3d=[], confidence=0.0,
+                notes="3D enabled but no frames provided",
+            )
+        try:
+            import cv2
+            # Subsample frames evenly for 3D (it's expensive per-image)
+            n_frames = len(frames)
+            step = max(1, n_frames // _MAX_3D_FRAMES)
+            selected_indices = list(range(0, n_frames, step))[:_MAX_3D_FRAMES]
+            joints_3d_per_frame: list[dict] = []
+            confidences: list[float] = []
+            for idx in selected_indices:
+                frame_bgr = frames[idx]
+                # SAM 3D Body expects RGB
+                frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
+                outputs = self._estimator.process_one_image(frame_rgb)
+                # Extract MHR joint positions from outputs
+                # The model returns joints in the MHR (Momentum Human Rig) format
+                frame_joints = self._extract_joints(outputs, idx)
+                joints_3d_per_frame.append(frame_joints)
+                # Confidence from detection quality
+                conf = self._estimate_confidence(outputs)
+                confidences.append(conf)
+            # Apply light temporal smoothing to reduce jitter
+            joints_3d_smoothed = self._temporal_smooth(joints_3d_per_frame)
+            overall_conf = float(np.mean(confidences)) if confidences else 0.0
+            return Body3DResult(
+                used=True,
+                joints_3d=joints_3d_smoothed,
+                confidence=overall_conf,
+                notes=f"3D mesh recovery on {len(selected_indices)}/{n_frames} frames",
+            )
+        except Exception as e:
+            return Body3DResult(
+                used=False, joints_3d=[], confidence=0.0,
+                notes=f"3D inference failed: {e}",
+            )
+    def _extract_joints(self, outputs: dict, frame_idx: int) -> dict:
+        """
+        Extract 3D joint positions from SAM 3D Body outputs.
+        Maps MHR rig joints to a standardized dict format.
+        """
+        joints: dict = {"frame_index": frame_idx}
+        # SAM 3D Body outputs MHR model params including joint positions
+        # The exact key depends on the model output format
+        if hasattr(outputs, "joints_3d"):
+            joint_data = outputs.joints_3d
+        elif isinstance(outputs, dict) and "joints_3d" in outputs:
+            joint_data = outputs["joints_3d"]
+        elif isinstance(outputs, dict) and "pred_joints" in outputs:
+            joint_data = outputs["pred_joints"]
+        else:
+            # Fallback: extract from vertices/body model params
+            joint_data = None
+        if joint_data is not None:
+            if hasattr(joint_data, "cpu"):
+                joint_data = joint_data.cpu().numpy()
+            if isinstance(joint_data, np.ndarray):
+                # Map to named joints (MHR has standard SMPL-like ordering)
+                joint_names = [
+                    "pelvis", "left_hip", "right_hip", "spine1",
+                    "left_knee", "right_knee", "spine2",
+                    "left_ankle", "right_ankle", "spine3",
+                    "left_foot", "right_foot", "neck",
+                    "left_collar", "right_collar", "head",
+                    "left_shoulder", "right_shoulder",
+                    "left_elbow", "right_elbow",
+                    "left_wrist", "right_wrist",
+                ]
+                for i, name in enumerate(joint_names):
+                    if i < len(joint_data):
+                        pos = joint_data[i]
+                        joints[name] = {
+                            "x": float(pos[0]),
+                            "y": float(pos[1]),
+                            "z": float(pos[2]),
+                        }
+        return joints
+    def _estimate_confidence(self, outputs) -> float:
+        """Estimate confidence from the SAM 3D Body output quality."""
+        # If outputs have a confidence/score field, use it
+        if isinstance(outputs, dict):
+            if "confidence" in outputs:
+                return float(outputs["confidence"])
+            if "score" in outputs:
+                return float(outputs["score"])
+        # Default: assume reasonable confidence if we got outputs at all
+        return 0.75
+    def _temporal_smooth(
+        self, joints_3d: list[dict], alpha: float = 0.3
+    ) -> list[dict]:
+        """
+        Apply exponential moving average smoothing to 3D joint positions
+        to reduce per-frame jitter from single-image prediction.
+        """
+        if len(joints_3d) <= 1:
+            return joints_3d
+        smoothed = [joints_3d[0]]
+        for i in range(1, len(joints_3d)):
+            prev = smoothed[-1]
+            curr = joints_3d[i]
+            smooth_frame = {"frame_index": curr.get("frame_index", i)}
+            for key in curr:
+                if key == "frame_index":
+                    continue
+                if key in prev and isinstance(curr[key], dict) and isinstance(prev[key], dict):
+                    smooth_frame[key] = {
+                        "x": alpha * curr[key]["x"] + (1 - alpha) * prev[key]["x"],
+                        "y": alpha * curr[key]["y"] + (1 - alpha) * prev[key]["y"],
+                        "z": alpha * curr[key]["z"] + (1 - alpha) * prev[key]["z"],
+                    }
+                else:
+                    smooth_frame[key] = curr[key]
+            smoothed.append(smooth_frame)
+        return smoothed

formscout/agents/classifier.py ADDED Viewed

	@@ -0,0 +1,103 @@

+"""
+MovementClassifierAgent — identifies which FMS test is in the clip.
+Input:  IngestResult (keyframes), Pose2DResult (skeleton context)
+Output: MovementResult(test_name, side, confidence)
+Failure: returns MovementResult(test_name="unknown") — pipeline stops and asks for manual override.
+Model:  Qwen3-VL-8B-Instruct via llama.cpp (8B params, Apache-2.0).
+Gated:  No.
+"""
+from __future__ import annotations
+import json
+import logging
+from pathlib import Path
+from formscout import config
+from formscout.types import IngestResult, Pose2DResult, MovementResult
+from formscout.serving.llama_cpp import LlamaCppClient
+logger = logging.getLogger(__name__)
+_PROMPT_PATH = Path(__file__).parent / "prompts" / "c1_classifier.md"
+class MovementClassifierAgent:
+    """Classifies which FMS test is being performed via VLM or manual override."""
+    def __init__(self):
+        self._client = LlamaCppClient(port=config.LLAMA_CPP_PORT_VLM)
+        self._system_prompt = _PROMPT_PATH.read_text(encoding="utf-8")
+    def run(
+        self,
+        ingest: IngestResult,
+        pose2d: Pose2DResult | None = None,
+        manual_override: str | None = None,
+    ) -> MovementResult:
+        """
+        Classify the movement. If manual_override is provided, use it directly.
+        Otherwise, use VLM inference on keyframes.
+        """
+        if manual_override and manual_override != "unknown":
+            return MovementResult(
+                test_name=manual_override, side="na",
+                confidence=1.0, notes="manual override",
+            )
+        if not self._client.available:
+            return MovementResult(
+                test_name="unknown", side="na", confidence=0.0,
+                notes="VLM server unavailable — use manual override",
+            )
+        # Select keyframes for classification (3 evenly spaced)
+        n = len(ingest.frames)
+        indices = [0, n // 2, n - 1] if n >= 3 else list(range(n))
+        images = self._encode_frames(ingest.frames, indices)
+        prompt = f"{self._system_prompt}\n\nClassify this movement from the keyframes shown."
+        result = self._client.complete(prompt, images=images, max_tokens=256, temperature=0.1)
+        return self._parse_response(result)
+    def _encode_frames(self, frames: list, indices: list[int]) -> list[str]:
+        """Encode selected frames as base64 JPEG for the VLM."""
+        import cv2
+        import base64
+        encoded = []
+        for idx in indices:
+            if idx < len(frames):
+                _, buf = cv2.imencode(".jpg", frames[idx], [cv2.IMWRITE_JPEG_QUALITY, 80])
+                encoded.append(base64.b64encode(buf.tobytes()).decode())
+        return encoded
+    def _parse_response(self, result: dict) -> MovementResult:
+        """Parse VLM JSON response into MovementResult."""
+        if "error" in result:
+            return MovementResult(
+                test_name="unknown", side="na", confidence=0.0,
+                notes=f"VLM error: {result['error']}",
+            )
+        test = result.get("test", "unknown")
+        side = result.get("side", "na")
+        confidence = float(result.get("confidence", 0.0))
+        reason = result.get("reason", "")
+        valid_tests = {
+            "deep_squat", "hurdle_step", "inline_lunge",
+            "shoulder_mobility", "active_slr",
+            "trunk_stability_pushup", "rotary_stability", "unknown",
+        }
+        if test not in valid_tests:
+            test = "unknown"
+        if side not in ("left", "right", "na"):
+            side = "na"
+        return MovementResult(
+            test_name=test, side=side,
+            confidence=confidence, notes=reason,
+        )

formscout/agents/ingest.py CHANGED Viewed

@@ -1,91 +1,91 @@
-"""
-IngestAgent — decodes video, normalizes FPS, samples frames.
-Input:  video file path (str)
-Output: IngestResult(frames, fps, duration, n_people, width, height)
-Failure: returns IngestResult with confidence=0.0 and notes explaining the error.
-Params: 0 (no model — pure OpenCV).
-License: n/a.
-Gated: no.
-"""
-from __future__ import annotations
-import cv2
-from pathlib import Path
-from formscout.types import IngestResult
-from formscout import config
-class IngestAgent:
-    """Deterministic video ingestion — no model, just OpenCV decode + frame sampling."""
-    def run(self, video_path: str) -> IngestResult:
-        p = Path(video_path)
-        if not p.exists():
-            return IngestResult(
-                frames=[], fps=0.0, duration=0.0, n_people=0,
-                width=0, height=0, confidence=0.0,
-                notes=f"video not found: {video_path}",
-            )
-        try:
-            cap = cv2.VideoCapture(str(p))
-        except Exception as e:
-            return IngestResult(
-                frames=[], fps=0.0, duration=0.0, n_people=0,
-                width=0, height=0, confidence=0.0,
-                notes=f"failed to open video: {e}",
-            )
-        if not cap.isOpened():
-            return IngestResult(
-                frames=[], fps=0.0, duration=0.0, n_people=0,
-                width=0, height=0, confidence=0.0,
-                notes=f"could not open video: {video_path}",
-            )
-        fps = cap.get(cv2.CAP_PROP_FPS) or config.TARGET_FPS
-        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
-        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
-        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
-        duration = total / fps if fps > 0 else 0.0
-        notes_parts: list[str] = []
-        if duration > config.MAX_DURATION_SEC:
-            notes_parts.append(
-                f"video is {duration:.1f}s (>{config.MAX_DURATION_SEC}s) — capping frames"
-            )
-        # Sample frames evenly, capped at MAX_FRAMES
-        step = max(1, total // config.MAX_FRAMES)
-        frames: list = []
-        idx = 0
-        while True:
-            ret, frame = cap.read()
-            if not ret:
-                break
-            if idx % step == 0:
-                frames.append(frame)
-            idx += 1
-            if len(frames) >= config.MAX_FRAMES:
-                break
-        cap.release()
-        if not frames:
-            return IngestResult(
-                frames=[], fps=fps, duration=duration, n_people=0,
-                width=w, height=h, confidence=0.0,
-                notes="no frames decoded",
-            )
-        return IngestResult(
-            frames=frames,
-            fps=fps,
-            duration=duration,
-            n_people=-1,  # unknown until segmentation/pose
-            width=w,
-            height=h,
-            confidence=1.0,
-            notes="; ".join(notes_parts) if notes_parts else "",
-        )

+"""
+IngestAgent — decodes video, normalizes FPS, samples frames.
+Input:  video file path (str)
+Output: IngestResult(frames, fps, duration, n_people, width, height)
+Failure: returns IngestResult with confidence=0.0 and notes explaining the error.
+Params: 0 (no model — pure OpenCV).
+License: n/a.
+Gated: no.
+"""
+from __future__ import annotations
+import cv2
+from pathlib import Path
+from formscout.types import IngestResult
+from formscout import config
+class IngestAgent:
+    """Deterministic video ingestion — no model, just OpenCV decode + frame sampling."""
+    def run(self, video_path: str) -> IngestResult:
+        p = Path(video_path)
+        if not p.exists():
+            return IngestResult(
+                frames=[], fps=0.0, duration=0.0, n_people=0,
+                width=0, height=0, confidence=0.0,
+                notes=f"video not found: {video_path}",
+            )
+        try:
+            cap = cv2.VideoCapture(str(p))
+        except Exception as e:
+            return IngestResult(
+                frames=[], fps=0.0, duration=0.0, n_people=0,
+                width=0, height=0, confidence=0.0,
+                notes=f"failed to open video: {e}",
+            )
+        if not cap.isOpened():
+            return IngestResult(
+                frames=[], fps=0.0, duration=0.0, n_people=0,
+                width=0, height=0, confidence=0.0,
+                notes=f"could not open video: {video_path}",
+            )
+        fps = cap.get(cv2.CAP_PROP_FPS) or config.TARGET_FPS
+        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+        duration = total / fps if fps > 0 else 0.0
+        notes_parts: list[str] = []
+        if duration > config.MAX_DURATION_SEC:
+            notes_parts.append(
+                f"video is {duration:.1f}s (>{config.MAX_DURATION_SEC}s) — capping frames"
+            )
+        # Sample frames evenly, capped at MAX_FRAMES
+        step = max(1, total // config.MAX_FRAMES)
+        frames: list = []
+        idx = 0
+        while True:
+            ret, frame = cap.read()
+            if not ret:
+                break
+            if idx % step == 0:
+                frames.append(frame)
+            idx += 1
+            if len(frames) >= config.MAX_FRAMES:
+                break
+        cap.release()
+        if not frames:
+            return IngestResult(
+                frames=[], fps=fps, duration=duration, n_people=0,
+                width=w, height=h, confidence=0.0,
+                notes="no frames decoded",
+            )
+        return IngestResult(
+            frames=frames,
+            fps=fps,
+            duration=duration,
+            n_people=-1,  # unknown until segmentation/pose
+            width=w,
+            height=h,
+            confidence=1.0,
+            notes="; ".join(notes_parts) if notes_parts else "",
+        )

formscout/agents/judge.py ADDED Viewed

	@@ -0,0 +1,122 @@

+"""
+JudgeAgent — VLM-based final scorer with rationale, compensation tags, pain detection.
+Input:  BiomechFeatures, ScoreResult (rubric candidate), MovementResult, keyframes
+Output: JudgeResult(score, rationale, compensation_tags, corrective_hint, needs_human)
+Failure: returns JudgeResult(needs_human=True, score=None) when uncertain.
+Model:  Qwen3-VL-8B-Instruct via llama.cpp (8B params, Apache-2.0).
+Gated:  No.
+Safety: NEVER auto-scores pain. If any indication of pain/clearing test,
+        sets needs_human=True and score=None.
+"""
+from __future__ import annotations
+import json
+import logging
+from pathlib import Path
+from formscout import config
+from formscout.types import (
+    BiomechFeatures, ScoreResult, MovementResult,
+    IngestResult, JudgeResult,
+)
+from formscout.serving.llama_cpp import LlamaCppClient
+logger = logging.getLogger(__name__)
+_PROMPT_PATH = Path(__file__).parent / "prompts" / "c2_judge.md"
+class JudgeAgent:
+    """VLM judge that produces the final FMS score with rationale."""
+    def __init__(self):
+        self._client = LlamaCppClient(port=config.LLAMA_CPP_PORT_VLM)
+        self._system_prompt = _PROMPT_PATH.read_text(encoding="utf-8")
+    def run(
+        self,
+        features: BiomechFeatures,
+        rubric_score: ScoreResult,
+        movement: MovementResult,
+        ingest: IngestResult | None = None,
+    ) -> JudgeResult:
+        """
+        Produce final score. Falls back to rubric score if VLM unavailable.
+        """
+        if not config.ENABLE_JUDGE:
+            return self._fallback_from_rubric(rubric_score, features)
+        if not self._client.available:
+            logger.warning("JudgeAgent: VLM unavailable, using rubric score as final")
+            return self._fallback_from_rubric(rubric_score, features)
+        # Build context for the judge
+        context = {
+            "test": features.test_name,
+            "side": features.side,
+            "view": features.view,
+            "features": {"angles": features.angles, "alignments": features.alignments},
+            "candidate_score": rubric_score.score,
+            "candidate_confidence": rubric_score.confidence,
+            "exemplars": [],  # Phase 3: populated by RetrievalAgent
+        }
+        prompt = f"{self._system_prompt}\n\n{json.dumps(context, indent=2)}"
+        # Optionally include keyframes
+        images = None
+        if ingest and ingest.frames:
+            images = self._encode_keyframes(ingest.frames)
+        result = self._client.complete(prompt, images=images, max_tokens=512, temperature=0.1)
+        return self._parse_response(result)
+    def _encode_keyframes(self, frames: list) -> list[str]:
+        """Encode 3 keyframes for VLM context."""
+        import cv2
+        import base64
+        n = len(frames)
+        indices = [0, n // 2, n - 1] if n >= 3 else list(range(n))
+        encoded = []
+        for idx in indices:
+            _, buf = cv2.imencode(".jpg", frames[idx], [cv2.IMWRITE_JPEG_QUALITY, 70])
+            encoded.append(base64.b64encode(buf.tobytes()).decode())
+        return encoded
+    def _parse_response(self, result: dict) -> JudgeResult:
+        """Parse VLM JSON response into JudgeResult."""
+        if "error" in result:
+            return JudgeResult(
+                score=None, rationale=f"VLM error: {result['error']}",
+                compensation_tags=[], corrective_hint="",
+                confidence=0.0, needs_human=True,
+            )
+        needs_human = result.get("needs_human", False)
+        score = result.get("score") if not needs_human else None
+        if score is not None:
+            score = max(0, min(3, int(score)))
+        return JudgeResult(
+            score=score,
+            rationale=result.get("rationale", ""),
+            compensation_tags=result.get("compensation_tags", []),
+            corrective_hint=result.get("corrective_hint", ""),
+            confidence=float(result.get("confidence", 0.5)),
+            needs_human=needs_human,
+        )
+    def _fallback_from_rubric(self, rubric: ScoreResult, features: BiomechFeatures) -> JudgeResult:
+        """When VLM is unavailable, promote the rubric score as the final score."""
+        return JudgeResult(
+            score=rubric.score,
+            rationale=f"[rubric-only] {rubric.rationale}",
+            compensation_tags=[],
+            corrective_hint="",
+            confidence=rubric.confidence * 0.8,
+            needs_human=rubric.needs_human,
+            notes="VLM unavailable — rubric score used as final",
+        )

formscout/agents/pose2d.py CHANGED Viewed

@@ -1,95 +1,95 @@
-"""
-Pose2DAgent — 2D per-frame keypoint extraction using YOLO or Sapiens2 backends.
-Input:  IngestResult
-Output: Pose2DResult(keypoints per frame, fps, confidence)
-Failure: returns Pose2DResult with confidence=0.0 and notes.
-Model:  YOLO26l-Pose (AGPL-3.0, 25.9M params, mAP50 90.5, public).
-        Alt: YOLO26x-Pose (57.6M, mAP50 91.6) via config.YOLO_POSE_MODEL_HQ.
-        Fallback: Sapiens2 Pose (CC-BY-NC-4.0, ~0.6B, gated — access accepted).
-Gated: Primary no; fallback yes (accepted).
-"""
-from __future__ import annotations
-import numpy as np
-from formscout import config
-from formscout.types import IngestResult, Pose2DResult
-# COCO 17-keypoint names for downstream consumers
-COCO_KEYPOINTS = [
-    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
-    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
-    "left_wrist", "right_wrist", "left_hip", "right_hip",
-    "left_knee", "right_knee", "left_ankle", "right_ankle",
-]
-_model = None
-def _get_model():
-    """Load YOLO pose model once at module level."""
-    global _model
-    if _model is None:
-        try:
-            from ultralytics import YOLO
-            _model = YOLO(config.YOLO_POSE_MODEL)
-        except Exception as e:
-            raise RuntimeError(f"Failed to load YOLO pose model: {e}")
-    return _model
-class Pose2DAgent:
-    """Extracts 2D keypoints per frame from ingested video."""
-    def run(self, ingest: IngestResult) -> Pose2DResult:
-        if not ingest.frames:
-            return Pose2DResult(
-                keypoints=[], fps=ingest.fps,
-                confidence=0.0, notes="no frames in ingest",
-            )
-        try:
-            model = _get_model()
-        except RuntimeError as e:
-            return Pose2DResult(
-                keypoints=[{} for _ in ingest.frames],
-                fps=ingest.fps,
-                confidence=0.0,
-                notes=str(e),
-            )
-        keypoints_per_frame: list[dict] = []
-        total_conf = 0.0
-        n_detected = 0
-        for frame in ingest.frames:
-            try:
-                results = model(frame, verbose=False)
-                frame_kps: dict[int, dict] = {}
-                if results and results[0].keypoints is not None:
-                    kps = results[0].keypoints
-                    if kps.xy is not None and len(kps.xy) > 0:
-                        # Take highest-confidence person (index 0 after NMS sort)
-                        xy = kps.xy[0].cpu().numpy()      # (17, 2)
-                        conf = kps.conf[0].cpu().numpy()  # (17,)
-                        for j in range(len(xy)):
-                            frame_kps[j] = {
-                                "x": float(xy[j, 0]),
-                                "y": float(xy[j, 1]),
-                                "conf": float(conf[j]),
-                            }
-                        total_conf += float(conf.mean())
-                        n_detected += 1
-                keypoints_per_frame.append(frame_kps)
-            except Exception:
-                keypoints_per_frame.append({})
-        overall_conf = (total_conf / n_detected) if n_detected > 0 else 0.0
-        notes = "" if n_detected > 0 else "no person detected in any frame"
-        return Pose2DResult(
-            keypoints=keypoints_per_frame,
-            fps=ingest.fps,
-            confidence=overall_conf,
-            notes=notes,
-        )

+"""
+Pose2DAgent — 2D per-frame keypoint extraction using YOLO or Sapiens2 backends.
+Input:  IngestResult
+Output: Pose2DResult(keypoints per frame, fps, confidence)
+Failure: returns Pose2DResult with confidence=0.0 and notes.
+Model:  YOLO26l-Pose (AGPL-3.0, 25.9M params, mAP50 90.5, public).
+        Alt: YOLO26x-Pose (57.6M, mAP50 91.6) via config.YOLO_POSE_MODEL_HQ.
+        Fallback: Sapiens2 Pose (CC-BY-NC-4.0, ~0.6B, gated — access accepted).
+Gated: Primary no; fallback yes (accepted).
+"""
+from __future__ import annotations
+import numpy as np
+from formscout import config
+from formscout.types import IngestResult, Pose2DResult
+# COCO 17-keypoint names for downstream consumers
+COCO_KEYPOINTS = [
+    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
+    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
+    "left_wrist", "right_wrist", "left_hip", "right_hip",
+    "left_knee", "right_knee", "left_ankle", "right_ankle",
+]
+_model = None
+def _get_model():
+    """Load YOLO pose model once at module level."""
+    global _model
+    if _model is None:
+        try:
+            from ultralytics import YOLO
+            _model = YOLO(config.YOLO_POSE_MODEL)
+        except Exception as e:
+            raise RuntimeError(f"Failed to load YOLO pose model: {e}")
+    return _model
+class Pose2DAgent:
+    """Extracts 2D keypoints per frame from ingested video."""
+    def run(self, ingest: IngestResult) -> Pose2DResult:
+        if not ingest.frames:
+            return Pose2DResult(
+                keypoints=[], fps=ingest.fps,
+                confidence=0.0, notes="no frames in ingest",
+            )
+        try:
+            model = _get_model()
+        except RuntimeError as e:
+            return Pose2DResult(
+                keypoints=[{} for _ in ingest.frames],
+                fps=ingest.fps,
+                confidence=0.0,
+                notes=str(e),
+            )
+        keypoints_per_frame: list[dict] = []
+        total_conf = 0.0
+        n_detected = 0
+        for frame in ingest.frames:
+            try:
+                results = model(frame, verbose=False)
+                frame_kps: dict[int, dict] = {}
+                if results and results[0].keypoints is not None:
+                    kps = results[0].keypoints
+                    if kps.xy is not None and len(kps.xy) > 0:
+                        # Take highest-confidence person (index 0 after NMS sort)
+                        xy = kps.xy[0].cpu().numpy()      # (17, 2)
+                        conf = kps.conf[0].cpu().numpy()  # (17,)
+                        for j in range(len(xy)):
+                            frame_kps[j] = {
+                                "x": float(xy[j, 0]),
+                                "y": float(xy[j, 1]),
+                                "conf": float(conf[j]),
+                            }
+                        total_conf += float(conf.mean())
+                        n_detected += 1
+                keypoints_per_frame.append(frame_kps)
+            except Exception:
+                keypoints_per_frame.append({})
+        overall_conf = (total_conf / n_detected) if n_detected > 0 else 0.0
+        notes = "" if n_detected > 0 else "no person detected in any frame"
+        return Pose2DResult(
+            keypoints=keypoints_per_frame,
+            fps=ingest.fps,
+            confidence=overall_conf,
+            notes=notes,
+        )

formscout/agents/prompts/c1_classifier.md CHANGED Viewed

@@ -1,17 +1,17 @@
-You are an FMS movement classifier. You are shown a few keyframes and a skeleton montage from a single short clip of one person performing ONE Functional Movement Screen test. Identify which test it is and, for one-sided tests, which side is being assessed.
-The seven tests and their tells:
-- deep_squat: feet shoulder-width, a dowel/bar held overhead with both arms, a deep two-legged squat.
-- hurdle_step: stepping one leg over a low hurdle/cord while balancing on the other, dowel across shoulders.
-- inline_lunge: feet in a narrow heel-to-toe line, a lunge down the line, dowel held vertically behind the back.
-- shoulder_mobility: one hand reaching over the shoulder down the back, the other reaching up from below; fists measured.
-- active_slr: lying supine, one leg raised straight up while the other stays flat on the ground.
-- trunk_stability_pushup: prone push-up with hands high (near the head), body pressed up as one rigid unit.
-- rotary_stability: quadruped (hands+knees), same-side or opposite arm and leg extended then drawn together.
-- unknown: it does not clearly match any of the above, or the view is too poor to tell.
-Rules:
-- Prefer "unknown" over a low-confidence guess. A wrong test makes the whole score meaningless.
-- "side" is "left" or "right" for one-sided tests (hurdle_step, inline_lunge, shoulder_mobility, active_slr); use "na" for two-sided tests (deep_squat, trunk_stability_pushup, rotary_stability) and unknown.
-- Output ONLY this JSON object, nothing else:
-{"test": "<one of the labels>", "side": "left|right|na", "confidence": <0.0-1.0>, "reason": "<one short sentence>"}

+You are an FMS movement classifier. You are shown a few keyframes and a skeleton montage from a single short clip of one person performing ONE Functional Movement Screen test. Identify which test it is and, for one-sided tests, which side is being assessed.
+The seven tests and their tells:
+- deep_squat: feet shoulder-width, a dowel/bar held overhead with both arms, a deep two-legged squat.
+- hurdle_step: stepping one leg over a low hurdle/cord while balancing on the other, dowel across shoulders.
+- inline_lunge: feet in a narrow heel-to-toe line, a lunge down the line, dowel held vertically behind the back.
+- shoulder_mobility: one hand reaching over the shoulder down the back, the other reaching up from below; fists measured.
+- active_slr: lying supine, one leg raised straight up while the other stays flat on the ground.
+- trunk_stability_pushup: prone push-up with hands high (near the head), body pressed up as one rigid unit.
+- rotary_stability: quadruped (hands+knees), same-side or opposite arm and leg extended then drawn together.
+- unknown: it does not clearly match any of the above, or the view is too poor to tell.
+Rules:
+- Prefer "unknown" over a low-confidence guess. A wrong test makes the whole score meaningless.
+- "side" is "left" or "right" for one-sided tests (hurdle_step, inline_lunge, shoulder_mobility, active_slr); use "na" for two-sided tests (deep_squat, trunk_stability_pushup, rotary_stability) and unknown.
+- Output ONLY this JSON object, nothing else:
+{"test": "<one of the labels>", "side": "left|right|na", "confidence": <0.0-1.0>, "reason": "<one short sentence>"}

formscout/agents/prompts/c2_judge.md CHANGED Viewed

@@ -1,43 +1,43 @@
-You are an assistant scoring ONE Functional Movement Screen test from objective measurements. You are a SCREENING AID, not a clinician. You never diagnose and you never predict injury.
-You are given, as JSON:
-- test, side
-- view: "3d" (reliable angles) or "2d" (angles are camera-angle dependent — caveat them)
-- features: measured biomechanics for this test (angles in degrees, distances normalized)
-- candidate_score: a model's provisional 0-3 (corroboration, may be absent)
-- exemplars: physio-scored reference clips of the SAME test with their scores (anchors, may be empty)
-- a few keyframes / skeleton overlay for context
-FMS scoring scale (apply per side; the test score is the LOWER side):
-- 3: the movement is performed to criterion with no compensation.
-- 2: the movement is completed but with compensation / poor mechanics (or only with the allowed regression, e.g. deep_squat heels elevated).
-- 1: the person cannot perform the movement pattern even with the allowed regression.
-- 0: PAIN. You CANNOT see pain. Never assign 0 yourself.
-Per-test criteria to weigh (use the features as primary evidence):
-- deep_squat (3): femur below horizontal, torso roughly parallel to the tibia, knees tracking over the feet, dowel staying aligned over the feet, heels flat. (2): the same achieved only with heels elevated. (1): criteria unmet even with heels elevated.
-- hurdle_step / inline_lunge: minimal sway/loss of balance, knee/hip/ankle alignment maintained, no contact with the hurdle, dowel/posture stable. Compensation -> 2; failure to complete -> 1. Report L/R asymmetry.
-- shoulder_mobility: judge by the normalized inter-fist distance bands (per side). Report asymmetry.
-- active_slr: judge the raised-leg hip-flexion angle relative to the standard band; the down leg stays flat.
-- trunk_stability_pushup: the body must move as one rigid unit (low segment-angle variance through the press); sag/lag or needing the easier hand position -> 2.
-- rotary_stability: smooth contralateral (or the allowed unilateral) coordination with a stable trunk; loss of coordination/balance -> lower.
-Hard safety rules:
-- If there is any clearing-test context, visible pain, grimacing, or an aborted rep, set needs_human=true and score=null. Do not score it.
-- If view=="2d" on a depth/angle-critical test (deep_squat, inline_lunge, active_slr), include an explicit one-clause caveat that the angle is a 2D estimate dependent on camera position.
-- If the measurements and the candidate_score disagree by a point or more, lower your confidence and say so.
-- When the features are insufficient to decide, prefer needs_human=true over a confident guess.
-Reason from the features first; use exemplars to calibrate borderline cases; treat candidate_score as a second opinion, not the answer.
-Output ONLY this JSON object, nothing else:
-{
-  "test": "<label>",
-  "side": "left|right|na",
-  "score": <0-3 or null>,
-  "needs_human": <true|false>,
-  "rationale": "<2-4 sentences citing the specific deciding measurement(s)>",
-  "compensation_tags": ["<short tag>", "..."],
-  "corrective_hint": "<one generic FMS-style suggestion, or '' if needs_human>",
-  "confidence": <0.0-1.0>
-}

+You are an assistant scoring ONE Functional Movement Screen test from objective measurements. You are a SCREENING AID, not a clinician. You never diagnose and you never predict injury.
+You are given, as JSON:
+- test, side
+- view: "3d" (reliable angles) or "2d" (angles are camera-angle dependent — caveat them)
+- features: measured biomechanics for this test (angles in degrees, distances normalized)
+- candidate_score: a model's provisional 0-3 (corroboration, may be absent)
+- exemplars: physio-scored reference clips of the SAME test with their scores (anchors, may be empty)
+- a few keyframes / skeleton overlay for context
+FMS scoring scale (apply per side; the test score is the LOWER side):
+- 3: the movement is performed to criterion with no compensation.
+- 2: the movement is completed but with compensation / poor mechanics (or only with the allowed regression, e.g. deep_squat heels elevated).
+- 1: the person cannot perform the movement pattern even with the allowed regression.
+- 0: PAIN. You CANNOT see pain. Never assign 0 yourself.
+Per-test criteria to weigh (use the features as primary evidence):
+- deep_squat (3): femur below horizontal, torso roughly parallel to the tibia, knees tracking over the feet, dowel staying aligned over the feet, heels flat. (2): the same achieved only with heels elevated. (1): criteria unmet even with heels elevated.
+- hurdle_step / inline_lunge: minimal sway/loss of balance, knee/hip/ankle alignment maintained, no contact with the hurdle, dowel/posture stable. Compensation -> 2; failure to complete -> 1. Report L/R asymmetry.
+- shoulder_mobility: judge by the normalized inter-fist distance bands (per side). Report asymmetry.
+- active_slr: judge the raised-leg hip-flexion angle relative to the standard band; the down leg stays flat.
+- trunk_stability_pushup: the body must move as one rigid unit (low segment-angle variance through the press); sag/lag or needing the easier hand position -> 2.
+- rotary_stability: smooth contralateral (or the allowed unilateral) coordination with a stable trunk; loss of coordination/balance -> lower.
+Hard safety rules:
+- If there is any clearing-test context, visible pain, grimacing, or an aborted rep, set needs_human=true and score=null. Do not score it.
+- If view=="2d" on a depth/angle-critical test (deep_squat, inline_lunge, active_slr), include an explicit one-clause caveat that the angle is a 2D estimate dependent on camera position.
+- If the measurements and the candidate_score disagree by a point or more, lower your confidence and say so.
+- When the features are insufficient to decide, prefer needs_human=true over a confident guess.
+Reason from the features first; use exemplars to calibrate borderline cases; treat candidate_score as a second opinion, not the answer.
+Output ONLY this JSON object, nothing else:
+{
+  "test": "<label>",
+  "side": "left|right|na",
+  "score": <0-3 or null>,
+  "needs_human": <true|false>,
+  "rationale": "<2-4 sentences citing the specific deciding measurement(s)>",
+  "compensation_tags": ["<short tag>", "..."],
+  "corrective_hint": "<one generic FMS-style suggestion, or '' if needs_human>",
+  "confidence": <0.0-1.0>
+}

formscout/agents/report.py ADDED Viewed

	@@ -0,0 +1,139 @@

+"""
+ReportAgent — assembles per-test scorecard, composite, asymmetries.
+Input:  List of (MovementResult, BiomechFeatures, ScoreResult, JudgeResult) per test
+Output: ReportResult(per_test, composite, asymmetries, overlay_video_path, pdf_path)
+Failure: returns ReportResult with composite=None if any test unscored.
+Params: 0 (pure assembly — no model).
+License: n/a.
+Gated: no.
+"""
+from __future__ import annotations
+from formscout.types import (
+    MovementResult, BiomechFeatures, ScoreResult, JudgeResult, ReportResult,
+)
+from formscout import config
+# Bilateral tests that need L/R scoring
+BILATERAL_TESTS = {"hurdle_step", "inline_lunge", "shoulder_mobility", "active_slr"}
+class ReportAgent:
+    """Assembles the final screening report from all test results."""
+    def run(self, test_results: list[dict]) -> ReportResult:
+        """
+        Assemble the report.
+        Args:
+            test_results: list of dicts with keys:
+                - movement: MovementResult
+                - features: BiomechFeatures
+                - rubric_score: ScoreResult
+                - judge: JudgeResult
+                - side: str (for bilateral: "left" or "right")
+        """
+        per_test = []
+        asymmetries = []
+        low_confidence_flags = []
+        disagreement_flags = []
+        # Group bilateral tests by test_name
+        bilateral_groups: dict[str, list[dict]] = {}
+        unilateral: list[dict] = []
+        for entry in test_results:
+            test_name = entry["movement"].test_name
+            if test_name in BILATERAL_TESTS:
+                bilateral_groups.setdefault(test_name, []).append(entry)
+            else:
+                unilateral.append(entry)
+        # Process bilateral tests — take the lower score, emit asymmetry
+        for test_name, entries in bilateral_groups.items():
+            scores = []
+            for entry in entries:
+                judge = entry["judge"]
+                side = entry.get("side", entry["movement"].side)
+                score = judge.score if judge.score is not None else None
+                scores.append({"side": side, "score": score, "entry": entry})
+            # Find best entry per side
+            left = next((s for s in scores if s["side"] == "left"), None)
+            right = next((s for s in scores if s["side"] == "right"), None)
+            left_score = left["score"] if left else None
+            right_score = right["score"] if right else None
+            # Report lower
+            if left_score is not None and right_score is not None:
+                final_score = min(left_score, right_score)
+                delta = abs(left_score - right_score)
+                asymmetries.append({
+                    "test": test_name,
+                    "left_score": left_score,
+                    "right_score": right_score,
+                    "delta": delta,
+                })
+            elif left_score is not None:
+                final_score = left_score
+            elif right_score is not None:
+                final_score = right_score
+            else:
+                final_score = None
+            # Use the entry with the lower score for details
+            primary = (left["entry"] if left and (right is None or (left_score or 4) <= (right_score or 4))
+                      else right["entry"] if right else entries[0])
+            per_test.append({
+                "test_name": test_name,
+                "score": final_score,
+                "judge": primary["judge"],
+                "features": primary["features"],
+                "needs_human": primary["judge"].needs_human,
+            })
+            self._check_flags(primary, low_confidence_flags, disagreement_flags)
+        # Process unilateral tests
+        for entry in unilateral:
+            judge = entry["judge"]
+            per_test.append({
+                "test_name": entry["movement"].test_name,
+                "score": judge.score,
+                "judge": judge,
+                "features": entry["features"],
+                "needs_human": judge.needs_human,
+            })
+            self._check_flags(entry, low_confidence_flags, disagreement_flags)
+        # Composite — null if any test unscored
+        all_scores = [t["score"] for t in per_test]
+        composite = sum(all_scores) if all(s is not None for s in all_scores) else None
+        return ReportResult(
+            per_test=per_test,
+            composite=composite,
+            asymmetries=asymmetries,
+            overlay_video_path=None,  # Phase 4
+            pdf_path=None,  # Phase 4
+            low_confidence_flags=low_confidence_flags,
+            disagreement_flags=disagreement_flags,
+        )
+    def _check_flags(self, entry: dict, low_conf: list, disagree: list):
+        """Check quality gates and populate flag lists."""
+        judge = entry["judge"]
+        rubric = entry["rubric_score"]
+        test_name = entry["movement"].test_name
+        if judge.confidence < config.MIN_CONFIDENCE:
+            low_conf.append(f"{test_name}: judge confidence {judge.confidence:.2f}")
+        if (judge.score is not None and rubric.score is not None
+                and abs(judge.score - rubric.score) >= config.SCORE_DISAGREE_THRESH):
+            disagree.append(
+                f"{test_name}: rubric={rubric.score} vs judge={judge.score}"
+            )

formscout/config.py CHANGED Viewed

@@ -1,50 +1,50 @@
-"""
-FormScout pipeline configuration.
-All model IDs, thresholds, k-values, and feature flags live here.
-No scattered literals elsewhere in the codebase.
-"""
-from pathlib import Path
-ROOT = Path(__file__).parent.parent
-# ─── Model IDs ───────────────────────────────────────────────────────────────
-YOLO_POSE_MODEL = str(ROOT / "checkpoints" / "yolo26" / "yolo26l-pose.pt")
-YOLO_POSE_MODEL_HQ = str(ROOT / "checkpoints" / "yolo26" / "yolo26x-pose.pt")
-SAM_CHECKPOINT = "sam2.1_hiera_base_plus.pt"
-SAM_3D_CHECKPOINT = ROOT / "checkpoints" / "sam-3d-body-dinov3" / "model.ckpt"
-SAM_3D_HF_REPO = "facebook/sam-3d-body-dinov3"
-SAM_3D_MHR_PATH = ROOT / "checkpoints" / "sam-3d-body-dinov3" / "assets" / "mhr_model.pt"
-QWEN_VLM_GGUF = "Qwen3-VL-8B-Instruct-Q4_K_M.gguf"
-QWEN_EMBED_GGUF = "Qwen3-VL-Embedding-8B-Q4_K_M.gguf"
-STGCN_CHECKPOINT = ROOT / "checkpoints" / "stgcn_fms.pth"
-# ─── Pipeline flags ──────────────────────────────────────────────────────────
-ENABLE_3D = False           # SAM 3D Body — access granted Jun 2026, off until integrated
-ENABLE_STGCN = False        # Phase 3
-ENABLE_RAG = False          # Phase 3
-ENABLE_JUDGE = False        # Phase 2
-# ─── Thresholds ──────────────────────────────────────────────────────────────
-MIN_CONFIDENCE = 0.6
-SCORE_DISAGREE_THRESH = 1   # flag if |stgcn - judge| >= this
-RETRIEVAL_K = 3
-# ─── Video / Ingest ─────────────────────────────────────────────────────────
-TARGET_FPS = 30.0
-MAX_FRAMES = 300            # hard cap to avoid OOM
-MAX_DURATION_SEC = 60.0     # warn on longer videos
-# ─── Pose ────────────────────────────────────────────────────────────────────
-POSE_BACKEND = "yolo"       # "yolo" | "sapiens"
-POSE_CONF_THRESHOLD = 0.5
-NUM_KEYPOINTS = 17
-# ─── Biomechanics thresholds ────────────────────────────────────────────────
-DEEP_SQUAT_FEMUR_HORIZONTAL_DEG = 90.0
-DEEP_SQUAT_TORSO_TIBIA_MAX_DEG = 15.0
-DEEP_SQUAT_KNEE_TRACKING_MARGIN_PX = 20
-# ─── Serving (llama.cpp) ────────────────────────────────────────────────────
-LLAMA_CPP_HOST = "127.0.0.1"
-LLAMA_CPP_PORT_VLM = 8080
-LLAMA_CPP_PORT_EMBED = 8081

+"""
+FormScout pipeline configuration.
+All model IDs, thresholds, k-values, and feature flags live here.
+No scattered literals elsewhere in the codebase.
+"""
+from pathlib import Path
+ROOT = Path(__file__).parent.parent
+# ─── Model IDs ───────────────────────────────────────────────────────────────
+YOLO_POSE_MODEL = str(ROOT / "checkpoints" / "yolo26" / "yolo26l-pose.pt")
+YOLO_POSE_MODEL_HQ = str(ROOT / "checkpoints" / "yolo26" / "yolo26x-pose.pt")
+SAM_CHECKPOINT = "sam2.1_hiera_base_plus.pt"
+SAM_3D_CHECKPOINT = ROOT / "checkpoints" / "sam-3d-body-dinov3" / "model.ckpt"
+SAM_3D_HF_REPO = "facebook/sam-3d-body-dinov3"
+SAM_3D_MHR_PATH = ROOT / "checkpoints" / "sam-3d-body-dinov3" / "assets" / "mhr_model.pt"
+QWEN_VLM_GGUF = "Qwen3-VL-8B-Instruct-Q4_K_M.gguf"
+QWEN_EMBED_GGUF = "Qwen3-VL-Embedding-8B-Q4_K_M.gguf"
+STGCN_CHECKPOINT = ROOT / "checkpoints" / "stgcn_fms.pth"
+# ─── Pipeline flags ──────────────────────────────────────────────────────────
+ENABLE_3D = False           # SAM 3D Body — access granted Jun 2026, off until integrated
+ENABLE_STGCN = False        # Phase 3
+ENABLE_RAG = False          # Phase 3
+ENABLE_JUDGE = False        # Phase 2
+# ─── Thresholds ──────────────────────────────────────────────────────────────
+MIN_CONFIDENCE = 0.6
+SCORE_DISAGREE_THRESH = 1   # flag if |stgcn - judge| >= this
+RETRIEVAL_K = 3
+# ─── Video / Ingest ─────────────────────────────────────────────────────────
+TARGET_FPS = 30.0
+MAX_FRAMES = 300            # hard cap to avoid OOM
+MAX_DURATION_SEC = 60.0     # warn on longer videos
+# ─── Pose ────────────────────────────────────────────────────────────────────
+POSE_BACKEND = "yolo"       # "yolo" | "sapiens"
+POSE_CONF_THRESHOLD = 0.5
+NUM_KEYPOINTS = 17
+# ─── Biomechanics thresholds ────────────────────────────────────────────────
+DEEP_SQUAT_FEMUR_HORIZONTAL_DEG = 90.0
+DEEP_SQUAT_TORSO_TIBIA_MAX_DEG = 15.0
+DEEP_SQUAT_KNEE_TRACKING_MARGIN_PX = 20
+# ─── Serving (llama.cpp) ────────────────────────────────────────────────────
+LLAMA_CPP_HOST = "127.0.0.1"
+LLAMA_CPP_PORT_VLM = 8080
+LLAMA_CPP_PORT_EMBED = 8081

formscout/pipeline.py CHANGED Viewed

@@ -16,6 +16,10 @@ from formscout.agents.ingest import IngestAgent
 from formscout.agents.pose2d import Pose2DAgent
 from formscout.agents.body3d import Body3DAgent
 from formscout.agents.biomechanics import BiomechanicsAgent
 class Director:
@@ -29,11 +33,14 @@ class Director:
         self._pose2d = Pose2DAgent()
         self._body3d = Body3DAgent()
         self._biomechanics = BiomechanicsAgent()
     def run(self, video_path: str, test_name: str = "deep_squat", side: str = "na") -> PipelineState:
         """
         Run the full pipeline on a single video.
-        For Phase 1, test_name and side are passed explicitly (no classifier yet).
         """
         state = PipelineState(video_path=video_path)
@@ -53,11 +60,20 @@ class Director:
         frames = state.ingest.frames if state.ingest else []
         state.body3d = self._body3d.run(state.pose2d, masks, frames=frames)
-        # ─── Movement classification (Phase 1: manual) ───
-        state.movement = MovementResult(
-            test_name=test_name, side=side,
-            confidence=1.0, notes="manually specified (Phase 1)",
-        )
         # ─── Biomechanics ───
         state.features = self._biomechanics.run(
@@ -70,9 +86,25 @@ class Director:
                 f"biomechanics: low confidence ({state.features.confidence:.2f}) — physio review recommended"
             )
         # ─── Quality gates ───
-        # Gate: unknown test → stop
-        if state.movement.test_name == "unknown":
-            state.errors.append("movement classifier returned 'unknown' — manual override required")
         return state

 from formscout.agents.pose2d import Pose2DAgent
 from formscout.agents.body3d import Body3DAgent
 from formscout.agents.biomechanics import BiomechanicsAgent
+from formscout.agents.classifier import MovementClassifierAgent
+from formscout.agents.judge import JudgeAgent
+from formscout.agents.report import ReportAgent
+from formscout.rubric import score_test
 class Director:
         self._pose2d = Pose2DAgent()
         self._body3d = Body3DAgent()
         self._biomechanics = BiomechanicsAgent()
+        self._classifier = MovementClassifierAgent()
+        self._judge = JudgeAgent()
+        self._report = ReportAgent()
     def run(self, video_path: str, test_name: str = "deep_squat", side: str = "na") -> PipelineState:
         """
         Run the full pipeline on a single video.
+        test_name/side serve as manual override when provided (skips classifier).
         """
         state = PipelineState(video_path=video_path)
         frames = state.ingest.frames if state.ingest else []
         state.body3d = self._body3d.run(state.pose2d, masks, frames=frames)
+        # ─── Movement classification ───
+        if test_name and test_name != "unknown":
+            # Manual override
+            state.movement = MovementResult(
+                test_name=test_name, side=side,
+                confidence=1.0, notes="manually specified",
+            )
+        else:
+            state.movement = self._classifier.run(state.ingest, state.pose2d)
+        # Gate: unknown test → stop
+        if state.movement.test_name == "unknown":
+            state.errors.append("movement classifier returned 'unknown' — manual override required")
+            return state
         # ─── Biomechanics ───
         state.features = self._biomechanics.run(
                 f"biomechanics: low confidence ({state.features.confidence:.2f}) — physio review recommended"
             )
+        # ─── Rubric Score ───
+        rubric_result = score_test(state.features)
+        state.stgcn_score = rubric_result  # Reusing field for rubric until ST-GCN is built
+        # ─── Judge ───
+        state.judge = self._judge.run(
+            state.features, rubric_result, state.movement, state.ingest,
+        )
         # ─── Quality gates ───
+        # Gate: score disagreement
+        if (state.judge.score is not None and rubric_result.score is not None
+                and abs(state.judge.score - rubric_result.score) >= config.SCORE_DISAGREE_THRESH):
+            state.warnings.append(
+                f"score disagreement: rubric={rubric_result.score} vs judge={state.judge.score} — review recommended"
+            )
+        # Gate: needs_human
+        if state.judge.needs_human:
+            state.warnings.append("judge flagged needs_human — no auto-score emitted")
         return state

formscout/rubric/__init__.py CHANGED Viewed

	@@ -0,0 +1,32 @@

+"""
+FormScout rubric scorers — one pure-function scorer per FMS test.
+"""
+from formscout.rubric.deep_squat import score_deep_squat
+from formscout.rubric.hurdle_step import score_hurdle_step
+from formscout.rubric.inline_lunge import score_inline_lunge
+from formscout.rubric.shoulder_mobility import score_shoulder_mobility
+from formscout.rubric.active_slr import score_active_slr
+from formscout.rubric.trunk_stability_pushup import score_trunk_stability_pushup
+from formscout.rubric.rotary_stability import score_rotary_stability
+from formscout.types import BiomechFeatures, ScoreResult
+SCORERS = {
+    "deep_squat": score_deep_squat,
+    "hurdle_step": score_hurdle_step,
+    "inline_lunge": score_inline_lunge,
+    "shoulder_mobility": score_shoulder_mobility,
+    "active_slr": score_active_slr,
+    "trunk_stability_pushup": score_trunk_stability_pushup,
+    "rotary_stability": score_rotary_stability,
+}
+def score_test(features: BiomechFeatures) -> ScoreResult:
+    """Dispatch to the appropriate rubric scorer by test name."""
+    fn = SCORERS.get(features.test_name)
+    if fn is None:
+        return ScoreResult(
+            score=1, rationale=f"No rubric for test '{features.test_name}'",
+            confidence=0.0, notes="unknown test",
+        )
+    return fn(features)

formscout/rubric/active_slr.py ADDED Viewed

	@@ -0,0 +1,51 @@

+"""
+Active Straight-Leg Raise rubric scorer — pure function, no model calls.
+FMS ASLR Criteria (bilateral):
+- Score 3: raised leg malleolus past contralateral knee (>70°), down leg flat.
+- Score 2: malleolus between mid-thigh and knee (45-70°).
+- Score 1: malleolus below mid-thigh (<45°).
+- Score 0: PAIN — never auto-scored.
+"""
+from __future__ import annotations
+from formscout.types import BiomechFeatures, ScoreResult
+def score_active_slr(features: BiomechFeatures) -> ScoreResult:
+    """Pure rubric scorer for active straight-leg raise."""
+    angles = features.angles
+    alignments = features.alignments
+    has_angle = "raised_leg_angle_deg" in angles
+    if not has_angle:
+        return ScoreResult(
+            score=1, rationale="Insufficient data: leg raise angle not measurable",
+            confidence=0.3, notes="missing key measurements",
+        )
+    angle = angles["raised_leg_angle_deg"]
+    past_knee = alignments.get("past_contralateral_knee", False)
+    past_mid = alignments.get("past_mid_thigh", False)
+    down_flat = alignments.get("down_leg_flat", True)
+    rationale_parts = []
+    if past_knee and down_flat:
+        score = 3
+        rationale_parts.append(f"Raised leg at {angle:.0f}° (past contralateral knee)")
+    elif past_mid:
+        score = 2
+        rationale_parts.append(f"Raised leg at {angle:.0f}° (between mid-thigh and knee)")
+        if not down_flat:
+            rationale_parts.append("down leg lifted off surface")
+    else:
+        score = 1
+        rationale_parts.append(f"Raised leg only {angle:.0f}° (below mid-thigh)")
+    confidence = features.confidence * 0.9
+    return ScoreResult(
+        score=score, rationale="; ".join(rationale_parts),
+        confidence=confidence, notes="",
+    )

formscout/rubric/deep_squat.py CHANGED Viewed

@@ -1,113 +1,113 @@
-"""
-Deep Squat rubric scorer — pure function, no model calls.
-FMS Deep Squat Criteria:
-- Score 3: femur below horizontal, torso parallel to tibia, knees tracking
-           over feet, dowel over feet, heels flat.
-- Score 2: criteria met only with heels elevated.
-- Score 1: criteria unmet even with heels elevated.
-- Score 0: PAIN — never auto-scored by this function.
-Input:  BiomechFeatures for deep_squat
-Output: ScoreResult(score, rationale, confidence, needs_human)
-"""
-from __future__ import annotations
-import math
-from formscout.types import BiomechFeatures, ScoreResult
-from formscout import config
-def score_deep_squat(features: BiomechFeatures) -> ScoreResult:
-    """
-    Pure rubric scorer for deep squat.
-    Returns ScoreResult with score 1-3 based on biomechanical measurements.
-    Never assigns score 0 (pain) — that requires needs_human=True from JudgeAgent.
-    """
-    angles = features.angles
-    alignments = features.alignments
-    # Check if we have enough data to score
-    has_femur = any(
-        k in angles for k in ("left_femur_from_horizontal_deg", "right_femur_from_horizontal_deg")
-    )
-    has_torso_tibia = "torso_tibia_angle_deg" in angles
-    if not has_femur:
-        return ScoreResult(
-            score=1,
-            rationale="Insufficient data: femur angle not measurable",
-            confidence=0.3,
-            needs_human=False,
-            notes="missing femur measurements — defaulting to lowest passing score",
-        )
-    # Evaluate criteria
-    # Femur below horizontal: femur angle from horizontal > 90° means above horizontal
-    # In our measurement: angle is from horizontal, so < 90 means below horizontal
-    femur_angles = []
-    if "left_femur_from_horizontal_deg" in angles:
-        femur_angles.append(angles["left_femur_from_horizontal_deg"])
-    if "right_femur_from_horizontal_deg" in angles:
-        femur_angles.append(angles["right_femur_from_horizontal_deg"])
-    # Femur below horizontal means the thigh slopes down steeply (angle > ~60° from horizontal in image coords)
-    femur_below_horizontal = any(a > 60.0 for a in femur_angles) if femur_angles else False
-    # Torso parallel to tibia
-    torso_parallel_tibia = (
-        angles.get("torso_tibia_angle_deg", 999) <= config.DEEP_SQUAT_TORSO_TIBIA_MAX_DEG
-    )
-    # Knee tracking
-    knees_tracking = alignments.get("knees_tracking_over_feet", False)
-    # Dowel alignment
-    dowel_over_feet = alignments.get("dowel_over_feet", False)
-    # Heels
-    heels_elevated = alignments.get("heels_elevated", False)
-    # Scoring logic
-    all_criteria = femur_below_horizontal and torso_parallel_tibia and knees_tracking and dowel_over_feet
-    rationale_parts: list[str] = []
-    if all_criteria and not heels_elevated:
-        score = 3
-        rationale_parts.append("All criteria met with heels flat")
-    elif all_criteria and heels_elevated:
-        score = 2
-        rationale_parts.append("Criteria met only with heels elevated")
-    else:
-        # Check what failed
-        if not femur_below_horizontal:
-            rationale_parts.append("femur not below horizontal")
-        if not torso_parallel_tibia:
-            rationale_parts.append(
-                f"torso-tibia angle {angles.get('torso_tibia_angle_deg', '?')}° "
-                f"exceeds {config.DEEP_SQUAT_TORSO_TIBIA_MAX_DEG}° threshold"
-            )
-        if not knees_tracking:
-            rationale_parts.append("knees not tracking over feet")
-        if not dowel_over_feet:
-            rationale_parts.append("dowel not aligned over feet")
-        if heels_elevated:
-            score = 1
-            rationale_parts.append("criteria unmet even with heels elevated")
-        else:
-            # They might score 2 with heel elevation — but without it, still 1
-            score = 1
-            rationale_parts.append("criteria unmet with heels flat")
-    confidence = features.confidence * (0.9 if has_torso_tibia else 0.6)
-    return ScoreResult(
-        score=score,
-        rationale="; ".join(rationale_parts),
-        confidence=confidence,
-        needs_human=False,
-        notes="",
-    )

+"""
+Deep Squat rubric scorer — pure function, no model calls.
+FMS Deep Squat Criteria:
+- Score 3: femur below horizontal, torso parallel to tibia, knees tracking
+           over feet, dowel over feet, heels flat.
+- Score 2: criteria met only with heels elevated.
+- Score 1: criteria unmet even with heels elevated.
+- Score 0: PAIN — never auto-scored by this function.
+Input:  BiomechFeatures for deep_squat
+Output: ScoreResult(score, rationale, confidence, needs_human)
+"""
+from __future__ import annotations
+import math
+from formscout.types import BiomechFeatures, ScoreResult
+from formscout import config
+def score_deep_squat(features: BiomechFeatures) -> ScoreResult:
+    """
+    Pure rubric scorer for deep squat.
+    Returns ScoreResult with score 1-3 based on biomechanical measurements.
+    Never assigns score 0 (pain) — that requires needs_human=True from JudgeAgent.
+    """
+    angles = features.angles
+    alignments = features.alignments
+    # Check if we have enough data to score
+    has_femur = any(
+        k in angles for k in ("left_femur_from_horizontal_deg", "right_femur_from_horizontal_deg")
+    )
+    has_torso_tibia = "torso_tibia_angle_deg" in angles
+    if not has_femur:
+        return ScoreResult(
+            score=1,
+            rationale="Insufficient data: femur angle not measurable",
+            confidence=0.3,
+            needs_human=False,
+            notes="missing femur measurements — defaulting to lowest passing score",
+        )
+    # Evaluate criteria
+    # Femur below horizontal: femur angle from horizontal > 90° means above horizontal
+    # In our measurement: angle is from horizontal, so < 90 means below horizontal
+    femur_angles = []
+    if "left_femur_from_horizontal_deg" in angles:
+        femur_angles.append(angles["left_femur_from_horizontal_deg"])
+    if "right_femur_from_horizontal_deg" in angles:
+        femur_angles.append(angles["right_femur_from_horizontal_deg"])
+    # Femur below horizontal means the thigh slopes down steeply (angle > ~60° from horizontal in image coords)
+    femur_below_horizontal = any(a > 60.0 for a in femur_angles) if femur_angles else False
+    # Torso parallel to tibia
+    torso_parallel_tibia = (
+        angles.get("torso_tibia_angle_deg", 999) <= config.DEEP_SQUAT_TORSO_TIBIA_MAX_DEG
+    )
+    # Knee tracking
+    knees_tracking = alignments.get("knees_tracking_over_feet", False)
+    # Dowel alignment
+    dowel_over_feet = alignments.get("dowel_over_feet", False)
+    # Heels
+    heels_elevated = alignments.get("heels_elevated", False)
+    # Scoring logic
+    all_criteria = femur_below_horizontal and torso_parallel_tibia and knees_tracking and dowel_over_feet
+    rationale_parts: list[str] = []
+    if all_criteria and not heels_elevated:
+        score = 3
+        rationale_parts.append("All criteria met with heels flat")
+    elif all_criteria and heels_elevated:
+        score = 2
+        rationale_parts.append("Criteria met only with heels elevated")
+    else:
+        # Check what failed
+        if not femur_below_horizontal:
+            rationale_parts.append("femur not below horizontal")
+        if not torso_parallel_tibia:
+            rationale_parts.append(
+                f"torso-tibia angle {angles.get('torso_tibia_angle_deg', '?')}° "
+                f"exceeds {config.DEEP_SQUAT_TORSO_TIBIA_MAX_DEG}° threshold"
+            )
+        if not knees_tracking:
+            rationale_parts.append("knees not tracking over feet")
+        if not dowel_over_feet:
+            rationale_parts.append("dowel not aligned over feet")
+        if heels_elevated:
+            score = 1
+            rationale_parts.append("criteria unmet even with heels elevated")
+        else:
+            # They might score 2 with heel elevation — but without it, still 1
+            score = 1
+            rationale_parts.append("criteria unmet with heels flat")
+    confidence = features.confidence * (0.9 if has_torso_tibia else 0.6)
+    return ScoreResult(
+        score=score,
+        rationale="; ".join(rationale_parts),
+        confidence=confidence,
+        needs_human=False,
+        notes="",
+    )

formscout/rubric/hurdle_step.py ADDED Viewed

	@@ -0,0 +1,60 @@

+"""
+Hurdle Step rubric scorer — pure function, no model calls.
+FMS Hurdle Step Criteria (bilateral — score each side, report lower):
+- Score 3: hips/knees/ankles aligned, minimal trunk movement, dowel/posture stable,
+           no contact with hurdle.
+- Score 2: movement completed with compensation (trunk lean, loss of alignment).
+- Score 1: contact with hurdle, loss of balance, or inability to maintain alignment.
+- Score 0: PAIN — never auto-scored.
+"""
+from __future__ import annotations
+from formscout.types import BiomechFeatures, ScoreResult
+def score_hurdle_step(features: BiomechFeatures) -> ScoreResult:
+    """Pure rubric scorer for hurdle step."""
+    angles = features.angles
+    alignments = features.alignments
+    has_hip_flex = "step_hip_flexion_deg" in angles
+    if not has_hip_flex:
+        return ScoreResult(
+            score=1, rationale="Insufficient data: hip flexion not measurable",
+            confidence=0.3, notes="missing key measurements",
+        )
+    trunk_stable = alignments.get("trunk_stable", False)
+    stance_extended = alignments.get("stance_knee_extended", False)
+    hip_flex = angles.get("step_hip_flexion_deg", 0)
+    rationale_parts = []
+    # Score 3: good hip flexion, trunk stable, stance solid
+    if hip_flex > 90 and trunk_stable and stance_extended:
+        score = 3
+        rationale_parts.append("Hip flexion adequate, trunk stable, stance knee extended")
+    elif hip_flex > 70 or (trunk_stable and stance_extended):
+        score = 2
+        if not trunk_stable:
+            rationale_parts.append("trunk lean detected")
+        if not stance_extended:
+            rationale_parts.append("stance knee flexion")
+        if hip_flex <= 90:
+            rationale_parts.append(f"hip flexion {hip_flex:.0f}° (borderline)")
+        rationale_parts.insert(0, "Movement completed with compensation")
+    else:
+        score = 1
+        rationale_parts.append("Unable to maintain alignment")
+        if not trunk_stable:
+            rationale_parts.append("significant trunk lean")
+        if not stance_extended:
+            rationale_parts.append("stance knee collapse")
+    confidence = features.confidence * 0.85
+    return ScoreResult(
+        score=score, rationale="; ".join(rationale_parts),
+        confidence=confidence, notes="",
+    )

formscout/rubric/inline_lunge.py ADDED Viewed

	@@ -0,0 +1,58 @@

+"""
+In-Line Lunge rubric scorer — pure function, no model calls.
+FMS In-Line Lunge Criteria (bilateral):
+- Score 3: dowel contacts maintained, no torso movement, knee touches behind heel.
+- Score 2: movement completed with compensation (trunk lean, loss of balance).
+- Score 1: loss of balance, inability to maintain foot contact or posture.
+- Score 0: PAIN — never auto-scored.
+"""
+from __future__ import annotations
+from formscout.types import BiomechFeatures, ScoreResult
+def score_inline_lunge(features: BiomechFeatures) -> ScoreResult:
+    """Pure rubric scorer for in-line lunge."""
+    angles = features.angles
+    alignments = features.alignments
+    has_knee = "front_knee_flexion_deg" in angles
+    if not has_knee:
+        return ScoreResult(
+            score=1, rationale="Insufficient data: knee flexion not measurable",
+            confidence=0.3, notes="missing key measurements",
+        )
+    knee_flex = angles.get("front_knee_flexion_deg", 180)
+    trunk_upright = alignments.get("trunk_upright", False)
+    knee_over_ankle = alignments.get("knee_over_ankle", False)
+    rationale_parts = []
+    # Good lunge: knee flexion < 90° (deep), trunk upright, knee aligned
+    deep_enough = knee_flex < 100
+    if deep_enough and trunk_upright and knee_over_ankle:
+        score = 3
+        rationale_parts.append("Deep lunge with trunk upright and knee aligned")
+    elif deep_enough or (trunk_upright and knee_over_ankle):
+        score = 2
+        if not trunk_upright:
+            rationale_parts.append(f"trunk lean {angles.get('trunk_lean_from_vertical_deg', '?')}°")
+        if not knee_over_ankle:
+            rationale_parts.append("knee drifts past ankle")
+        if not deep_enough:
+            rationale_parts.append(f"knee flexion {knee_flex:.0f}° (insufficient depth)")
+        rationale_parts.insert(0, "Completed with compensation")
+    else:
+        score = 1
+        rationale_parts.append("Unable to complete lunge pattern")
+        if not deep_enough:
+            rationale_parts.append(f"knee flexion only {knee_flex:.0f}°")
+    confidence = features.confidence * 0.85
+    return ScoreResult(
+        score=score, rationale="; ".join(rationale_parts),
+        confidence=confidence, notes="",
+    )

formscout/rubric/rotary_stability.py ADDED Viewed

	@@ -0,0 +1,56 @@

+"""
+Rotary Stability rubric scorer — pure function, no model calls.
+FMS Rotary Stability Criteria:
+- Score 3: unilateral (same-side) arm/leg extension with trunk stable,
+           elbow/knee touch performed smoothly.
+- Score 2: contralateral (opposite) arm/leg extension performed with trunk stable.
+- Score 1: inability to maintain trunk stability during contralateral pattern.
+- Score 0: PAIN (spinal flexion clearing test) — never auto-scored.
+"""
+from __future__ import annotations
+from formscout.types import BiomechFeatures, ScoreResult
+def score_rotary_stability(features: BiomechFeatures) -> ScoreResult:
+    """Pure rubric scorer for rotary stability."""
+    angles = features.angles
+    alignments = features.alignments
+    has_data = "trunk_stability_std_px" in angles or "shoulder_level_diff_px" in angles
+    if not has_data:
+        return ScoreResult(
+            score=1, rationale="Insufficient data: trunk stability not measurable",
+            confidence=0.3, notes="missing key measurements",
+        )
+    trunk_stable = alignments.get("trunk_stable", False)
+    shoulders_level = alignments.get("shoulders_level", False)
+    hips_level = alignments.get("hips_level", False)
+    rationale_parts = []
+    # Without video classification of ipsi vs contra, assume contralateral (safer)
+    if trunk_stable and shoulders_level and hips_level:
+        score = 2  # Assume contralateral unless classifier says ipsilateral
+        rationale_parts.append("Trunk stable during extension, shoulders and hips level")
+        rationale_parts.append("scored as contralateral pattern (default)")
+    elif trunk_stable or (shoulders_level and hips_level):
+        score = 2
+        if not trunk_stable:
+            rationale_parts.append("minor trunk instability")
+        rationale_parts.insert(0, "Contralateral pattern with minor compensation")
+    else:
+        score = 1
+        std = angles.get("trunk_stability_std_px", 0)
+        rationale_parts.append(f"Trunk instability detected (std {std:.1f}px)")
+        if not shoulders_level:
+            rationale_parts.append("shoulder asymmetry during extension")
+    confidence = features.confidence * 0.75  # Lower confidence — hard to assess from 2D
+    return ScoreResult(
+        score=score, rationale="; ".join(rationale_parts),
+        confidence=confidence, notes="ipsi/contra distinction requires VLM classifier",
+    )

formscout/rubric/shoulder_mobility.py ADDED Viewed

	@@ -0,0 +1,46 @@

+"""
+Shoulder Mobility rubric scorer — pure function, no model calls.
+FMS Shoulder Mobility Criteria (bilateral):
+- Score 3: fists within one hand-length of each other.
+- Score 2: fists within 1.5 hand-lengths.
+- Score 1: fists more than 1.5 hand-lengths apart.
+- Score 0: PAIN (clearing test) — never auto-scored.
+"""
+from __future__ import annotations
+from formscout.types import BiomechFeatures, ScoreResult
+def score_shoulder_mobility(features: BiomechFeatures) -> ScoreResult:
+    """Pure rubric scorer for shoulder mobility."""
+    alignments = features.alignments
+    angles = features.angles
+    has_measure = "inter_fist_normalized" in angles
+    if not has_measure:
+        return ScoreResult(
+            score=1, rationale="Insufficient data: inter-fist distance not measurable",
+            confidence=0.3, notes="missing key measurements",
+        )
+    norm_dist = angles["inter_fist_normalized"]
+    within_one = alignments.get("fists_within_one_hand", False)
+    within_1_5 = alignments.get("fists_within_1_5_hand", False)
+    if within_one:
+        score = 3
+        rationale = f"Fists within one hand-length (normalized distance {norm_dist:.2f})"
+    elif within_1_5:
+        score = 2
+        rationale = f"Fists within 1.5 hand-lengths (normalized distance {norm_dist:.2f})"
+    else:
+        score = 1
+        rationale = f"Fists beyond 1.5 hand-lengths apart (normalized distance {norm_dist:.2f})"
+    confidence = features.confidence * 0.9
+    return ScoreResult(
+        score=score, rationale=rationale,
+        confidence=confidence, notes="",
+    )

formscout/rubric/trunk_stability_pushup.py ADDED Viewed

	@@ -0,0 +1,55 @@

+"""
+Trunk Stability Push-Up rubric scorer — pure function, no model calls.
+FMS Trunk Stability Push-Up Criteria:
+- Score 3: body moves as one unit (rigid) with hands at forehead level (men)
+           or chin level (women). No sag or segment lag.
+- Score 2: body moves as one unit but with hands at chin (men) or clavicle (women).
+- Score 1: unable to perform with hands lowered; body sags or segments.
+- Score 0: PAIN (spinal extension clearing test) — never auto-scored.
+"""
+from __future__ import annotations
+from formscout.types import BiomechFeatures, ScoreResult
+def score_trunk_stability_pushup(features: BiomechFeatures) -> ScoreResult:
+    """Pure rubric scorer for trunk stability push-up."""
+    angles = features.angles
+    alignments = features.alignments
+    has_data = "max_sag_px" in angles
+    if not has_data:
+        return ScoreResult(
+            score=1, rationale="Insufficient data: trunk rigidity not measurable",
+            confidence=0.3, notes="missing key measurements",
+        )
+    body_rigid = alignments.get("body_rigid", False)
+    no_sag = alignments.get("no_sag", False)
+    hands_high = alignments.get("hands_at_forehead", False)
+    rationale_parts = []
+    if body_rigid and hands_high:
+        score = 3
+        rationale_parts.append("Body rigid as one unit, hands at forehead position")
+    elif body_rigid or no_sag:
+        score = 2
+        if not hands_high:
+            rationale_parts.append("rigid body but hands in lower position")
+        else:
+            rationale_parts.append("minor trunk variance detected")
+        rationale_parts.insert(0, "Completed with regression")
+    else:
+        score = 1
+        sag = angles.get("max_sag_px", 0)
+        variance = angles.get("trunk_variance_px", 0)
+        rationale_parts.append(f"Body sag detected ({sag:.0f}px), variance {variance:.1f}px")
+    confidence = features.confidence * 0.8
+    return ScoreResult(
+        score=score, rationale="; ".join(rationale_parts),
+        confidence=confidence, notes="",
+    )

formscout/run.py CHANGED Viewed

@@ -1,75 +1,84 @@
-"""
-FormScout headless CLI entrypoint.
-Usage: python -m formscout.run sample.mp4
-"""
-from __future__ import annotations
-import sys
-import json
-from pathlib import Path
-from formscout.pipeline import Director
-from formscout.rubric.deep_squat import score_deep_squat
-def main():
-    if len(sys.argv) < 2:
-        print("Usage: python -m formscout.run <video_path> [test_name] [side]")
-        sys.exit(1)
-    video_path = sys.argv[1]
-    test_name = sys.argv[2] if len(sys.argv) > 2 else "deep_squat"
-    side = sys.argv[3] if len(sys.argv) > 3 else "na"
-    print(f"FormScout — processing: {video_path}")
-    print(f"  Test: {test_name}, Side: {side}")
-    print()
-    director = Director()
-    state = director.run(video_path, test_name=test_name, side=side)
-    # Print pipeline state
-    if state.errors:
-        print("ERRORS:")
-        for e in state.errors:
-            print(f"  ✗ {e}")
-        print()
-    if state.warnings:
-        print("WARNINGS:")
-        for w in state.warnings:
-            print(f"  ⚠ {w}")
-        print()
-    if state.ingest:
-        print(f"Ingest: {len(state.ingest.frames)} frames, {state.ingest.fps:.1f}fps, "
-              f"{state.ingest.duration:.1f}s, {state.ingest.width}x{state.ingest.height}")
-    if state.pose2d:
-        n_detected = sum(1 for kps in state.pose2d.keypoints if kps)
-        print(f"Pose2D: {n_detected}/{len(state.pose2d.keypoints)} frames with detections, "
-              f"confidence={state.pose2d.confidence:.2f}")
-    if state.body3d:
-        print(f"Body3D: used={state.body3d.used}")
-    if state.features:
-        print(f"Biomechanics: view={state.features.view}, "
-              f"confidence={state.features.confidence:.2f}")
-        if state.features.angles:
-            print(f"  Angles: {json.dumps({k: round(v, 1) for k, v in state.features.angles.items()}, indent=4)}")
-        if state.features.alignments:
-            print(f"  Alignments: {json.dumps(state.features.alignments, indent=4)}")
-    # Score via rubric
-    if state.features and test_name == "deep_squat":
-        score_result = score_deep_squat(state.features)
-        print(f"\nSCORE: {score_result.score}/3")
-        print(f"  Rationale: {score_result.rationale}")
-        print(f"  Confidence: {score_result.confidence:.2f}")
-        if score_result.needs_human:
-            print("  ⚠ NEEDS HUMAN REVIEW")
-if __name__ == "__main__":
-    main()

+"""
+FormScout headless CLI entrypoint.
+Usage: python -m formscout.run sample.mp4
+"""
+from __future__ import annotations
+import sys
+import json
+from pathlib import Path
+from formscout.pipeline import Director
+from formscout.rubric import score_test
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: python -m formscout.run <video_path> [test_name] [side]")
+        sys.exit(1)
+    video_path = sys.argv[1]
+    test_name = sys.argv[2] if len(sys.argv) > 2 else "deep_squat"
+    side = sys.argv[3] if len(sys.argv) > 3 else "na"
+    print(f"FormScout — processing: {video_path}")
+    print(f"  Test: {test_name}, Side: {side}")
+    print()
+    director = Director()
+    state = director.run(video_path, test_name=test_name, side=side)
+    # Print pipeline state
+    if state.errors:
+        print("ERRORS:")
+        for e in state.errors:
+            print(f"  ✗ {e}")
+        print()
+    if state.warnings:
+        print("WARNINGS:")
+        for w in state.warnings:
+            print(f"  ⚠ {w}")
+        print()
+    if state.ingest:
+        print(f"Ingest: {len(state.ingest.frames)} frames, {state.ingest.fps:.1f}fps, "
+              f"{state.ingest.duration:.1f}s, {state.ingest.width}x{state.ingest.height}")
+    if state.pose2d:
+        n_detected = sum(1 for kps in state.pose2d.keypoints if kps)
+        print(f"Pose2D: {n_detected}/{len(state.pose2d.keypoints)} frames with detections, "
+              f"confidence={state.pose2d.confidence:.2f}")
+    if state.body3d:
+        print(f"Body3D: used={state.body3d.used}")
+    if state.features:
+        print(f"Biomechanics: view={state.features.view}, "
+              f"confidence={state.features.confidence:.2f}")
+        if state.features.angles:
+            print(f"  Angles: {json.dumps({k: round(v, 1) for k, v in state.features.angles.items()}, indent=4)}")
+        if state.features.alignments:
+            print(f"  Alignments: {json.dumps(state.features.alignments, indent=4)}")
+    # Score via rubric
+    if state.features and test_name == "deep_squat":
+        score_result = score_test(state.features)
+        print(f"\nSCORE: {score_result.score}/3")
+        print(f"  Rationale: {score_result.rationale}")
+        print(f"  Confidence: {score_result.confidence:.2f}")
+        if score_result.needs_human:
+            print("  ⚠ NEEDS HUMAN REVIEW")
+    # Judge result
+    if state.judge:
+        print(f"\nJUDGE: score={state.judge.score}, needs_human={state.judge.needs_human}")
+        print(f"  Rationale: {state.judge.rationale}")
+        if state.judge.compensation_tags:
+            print(f"  Compensations: {state.judge.compensation_tags}")
+        if state.judge.corrective_hint:
+            print(f"  Corrective: {state.judge.corrective_hint}")
+if __name__ == "__main__":
+    main()

formscout/serving/llama_cpp.py ADDED Viewed

	@@ -0,0 +1,134 @@

+"""
+llama.cpp HTTP client wrapper for FormScout.
+Wraps the llama.cpp server's /completion and /embedding endpoints.
+Falls back gracefully when the server is unavailable.
+Model: Qwen3-VL-8B-Instruct (Q4_K_M GGUF) for VLM inference.
+Model: Qwen3-VL-Embedding-8B (Q4_K_M GGUF) for embeddings.
+Params: 8B each (shared backbone).
+License: Apache-2.0.
+"""
+from __future__ import annotations
+import base64
+import json
+import logging
+from pathlib import Path
+from typing import Any
+import requests
+from formscout import config
+logger = logging.getLogger(__name__)
+_TIMEOUT = 120  # seconds — VLM can be slow
+class LlamaCppClient:
+    """HTTP client for a llama.cpp server instance."""
+    def __init__(self, host: str | None = None, port: int | None = None):
+        self.host = host or config.LLAMA_CPP_HOST
+        self.port = port or config.LLAMA_CPP_PORT_VLM
+        self.base_url = f"http://{self.host}:{self.port}"
+    @property
+    def available(self) -> bool:
+        """Check if the server is reachable."""
+        try:
+            r = requests.get(f"{self.base_url}/health", timeout=5)
+            return r.status_code == 200
+        except (requests.ConnectionError, requests.Timeout):
+            return False
+    def complete(
+        self,
+        prompt: str,
+        images: list[str] | None = None,
+        max_tokens: int = 512,
+        temperature: float = 0.1,
+        stop: list[str] | None = None,
+    ) -> dict[str, Any]:
+        """
+        Send a completion request. Returns parsed JSON if the response is JSON,
+        otherwise returns {"text": raw_text}.
+        Args:
+            prompt: The text prompt (system + user combined).
+            images: Optional list of base64-encoded images or file paths.
+            max_tokens: Max generation tokens.
+            temperature: Sampling temperature.
+            stop: Stop sequences.
+        """
+        payload: dict[str, Any] = {
+            "prompt": prompt,
+            "n_predict": max_tokens,
+            "temperature": temperature,
+            "stop": stop or ["\n\n"],
+        }
+        # Add images for multimodal (Qwen3-VL via llama.cpp mmproj)
+        if images:
+            image_data = []
+            for img in images:
+                if Path(img).exists():
+                    with open(img, "rb") as f:
+                        image_data.append({"data": base64.b64encode(f.read()).decode()})
+                else:
+                    # Assume already base64
+                    image_data.append({"data": img})
+            payload["image_data"] = image_data
+        try:
+            r = requests.post(
+                f"{self.base_url}/completion",
+                json=payload,
+                timeout=_TIMEOUT,
+            )
+            r.raise_for_status()
+            result = r.json()
+            content = result.get("content", "")
+            # Try to parse as JSON
+            try:
+                return json.loads(content)
+            except (json.JSONDecodeError, TypeError):
+                return {"text": content}
+        except requests.ConnectionError:
+            return {"error": "llama.cpp server not available", "text": ""}
+        except requests.Timeout:
+            return {"error": "llama.cpp server timeout", "text": ""}
+        except Exception as e:
+            return {"error": str(e), "text": ""}
+class EmbeddingClient:
+    """HTTP client for the llama.cpp embedding server."""
+    def __init__(self, host: str | None = None, port: int | None = None):
+        self.host = host or config.LLAMA_CPP_HOST
+        self.port = port or config.LLAMA_CPP_PORT_EMBED
+        self.base_url = f"http://{self.host}:{self.port}"
+    @property
+    def available(self) -> bool:
+        try:
+            r = requests.get(f"{self.base_url}/health", timeout=5)
+            return r.status_code == 200
+        except (requests.ConnectionError, requests.Timeout):
+            return False
+    def embed(self, text: str) -> list[float] | None:
+        """Get embedding vector for text. Returns None on failure."""
+        try:
+            r = requests.post(
+                f"{self.base_url}/embedding",
+                json={"content": text},
+                timeout=30,
+            )
+            r.raise_for_status()
+            data = r.json()
+            return data.get("embedding")
+        except Exception:
+            return None

formscout/tracing.py CHANGED Viewed

@@ -1,69 +1,69 @@
-"""
-Structured per-agent I/O tracing for FormScout.
-Records every agent's input/output as JSON-serializable dicts.
-Used for the Sharing-is-Caring badge (publish full trace to Hub).
-"""
-from __future__ import annotations
-import json
-import time
-from dataclasses import asdict, is_dataclass
-from pathlib import Path
-from typing import Any
-class TraceRecord:
-    """A single agent execution record."""
-    def __init__(self, agent_name: str, input_data: Any, output_data: Any, duration_ms: float):
-        self.agent_name = agent_name
-        self.input_summary = self._summarize(input_data)
-        self.output_summary = self._summarize(output_data)
-        self.duration_ms = duration_ms
-        self.timestamp = time.time()
-    def _summarize(self, data: Any) -> dict:
-        """Convert dataclass or dict to JSON-safe summary."""
-        if is_dataclass(data) and not isinstance(data, type):
-            d = asdict(data)
-            # Don't serialize raw frames (numpy arrays)
-            if "frames" in d:
-                d["frames"] = f"[{len(d['frames'])} frames]"
-            return d
-        if isinstance(data, dict):
-            return data
-        return {"value": str(data)}
-    def to_dict(self) -> dict:
-        return {
-            "agent": self.agent_name,
-            "timestamp": self.timestamp,
-            "duration_ms": self.duration_ms,
-            "input": self.input_summary,
-            "output": self.output_summary,
-        }
-class PipelineTrace:
-    """Collects trace records for a full pipeline run."""
-    def __init__(self):
-        self.records: list[TraceRecord] = []
-        self.start_time = time.time()
-    def add(self, record: TraceRecord):
-        self.records.append(record)
-    def to_dict(self) -> dict:
-        return {
-            "total_duration_ms": (time.time() - self.start_time) * 1000,
-            "n_agents": len(self.records),
-            "agents": [r.to_dict() for r in self.records],
-        }
-    def save(self, path: str | Path):
-        """Save trace as JSON."""
-        p = Path(path)
-        p.parent.mkdir(parents=True, exist_ok=True)
-        with open(p, "w") as f:
-            json.dump(self.to_dict(), f, indent=2, default=str)

+"""
+Structured per-agent I/O tracing for FormScout.
+Records every agent's input/output as JSON-serializable dicts.
+Used for the Sharing-is-Caring badge (publish full trace to Hub).
+"""
+from __future__ import annotations
+import json
+import time
+from dataclasses import asdict, is_dataclass
+from pathlib import Path
+from typing import Any
+class TraceRecord:
+    """A single agent execution record."""
+    def __init__(self, agent_name: str, input_data: Any, output_data: Any, duration_ms: float):
+        self.agent_name = agent_name
+        self.input_summary = self._summarize(input_data)
+        self.output_summary = self._summarize(output_data)
+        self.duration_ms = duration_ms
+        self.timestamp = time.time()
+    def _summarize(self, data: Any) -> dict:
+        """Convert dataclass or dict to JSON-safe summary."""
+        if is_dataclass(data) and not isinstance(data, type):
+            d = asdict(data)
+            # Don't serialize raw frames (numpy arrays)
+            if "frames" in d:
+                d["frames"] = f"[{len(d['frames'])} frames]"
+            return d
+        if isinstance(data, dict):
+            return data
+        return {"value": str(data)}
+    def to_dict(self) -> dict:
+        return {
+            "agent": self.agent_name,
+            "timestamp": self.timestamp,
+            "duration_ms": self.duration_ms,
+            "input": self.input_summary,
+            "output": self.output_summary,
+        }
+class PipelineTrace:
+    """Collects trace records for a full pipeline run."""
+    def __init__(self):
+        self.records: list[TraceRecord] = []
+        self.start_time = time.time()
+    def add(self, record: TraceRecord):
+        self.records.append(record)
+    def to_dict(self) -> dict:
+        return {
+            "total_duration_ms": (time.time() - self.start_time) * 1000,
+            "n_agents": len(self.records),
+            "agents": [r.to_dict() for r in self.records],
+        }
+    def save(self, path: str | Path):
+        """Save trace as JSON."""
+        p = Path(path)
+        p.parent.mkdir(parents=True, exist_ok=True)
+        with open(p, "w") as f:
+            json.dump(self.to_dict(), f, indent=2, default=str)