# Underdog Lab Implementation Plan

_Date: June 12, 2026_

## Objective

Ship a polished Gradio Space by June 15, 2026 that demonstrates a small local language model converting football narratives into validated semantic factors, while a deterministic statistical engine owns all numerical probabilities.

Success means:

- the application runs end to end in a clean Hugging Face Space;
- the runtime model is at most 4B parameters;
- no cloud inference API is required;
- scenario outputs are grammar-constrained and validated;
- historical challenge mode is playable;
- base and fine-tuned extraction quality are measured on a frozen test set;
- the model, dataset or traces, field notes, demo, and social post are published before submission.

## Non-Goals Before Submission

- Proving superiority over bookmaker markets
- Live-data ingestion pipelines
- Dixon-Coles or large ML ensembles
- World Cup tournament simulation
- Coding-agent benchmark
- Betting recommendations
- Hyperparameter sweeps

## User Experience

### Primary Flow

1. User selects a hidden historical match.
2. App shows teams, venue, competition context, and baseline probabilities without revealing the result.
3. User writes a hypothetical or pre-match scenario.
4. Small model extracts typed semantic factors.
5. Validator rejects unsupported output and resolves team references.
6. Rule engine maps factors to bounded model adjustments.
7. Probability bars animate from baseline to adjusted forecast.
8. User submits home/draw/away probabilities.
9. App reveals the historical result.
10. App compares the baseline, adjusted model, and user with log loss and Brier score.

### Supporting Views

- **Scenario Stress Test:** vary one factor's severity and visualize monotonic probability movement.
- **Tiny Model Lab:** base-versus-tuned metrics, latency, example failures, model size, and training details.
- **How It Works:** concise explanation of semantic extraction, deterministic rules, and Poisson forecasting.
## System Architecture

```text
Gradio application
  -> historical match repository
  -> baseline forecast service
  -> scenario extractor interface
       -> llama.cpp base or fine-tuned GGUF
       -> grammar-constrained JSON
       -> Pydantic validation
  -> semantic factor normalizer
  -> deterministic adjustment rules
  -> Poisson forecast service
  -> scoring service
  -> visual components
  -> append-only local trace logger
```

The extractor must be swappable. The rest of the application must work with:

- a deterministic mock extractor for tests;
- the base SmolLM3-3B model;
- the fine-tuned SmolLM3-3B model.
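
One way to keep the extractor swappable is a structural interface that the mock, base, and fine-tuned backends all satisfy. The sketch below is illustrative only: `ScenarioExtractor`, `MockExtractor`, `run_extraction`, and the keyword rule are hypothetical, and plain dataclasses stand in for the Pydantic models described under Domain Contracts.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Factor:
    """Stand-in for ExtractedFactor (hypothetical, stdlib only)."""
    factor_type: str
    team: str  # "home", "away", "both", or "unknown"
    severity: float
    certainty: float
    evidence: str


@dataclass(frozen=True)
class Extraction:
    """Stand-in for ScenarioExtraction (hypothetical, stdlib only)."""
    factors: tuple[Factor, ...] = ()
    unsupported_claims: tuple[str, ...] = ()
    ambiguities: tuple[str, ...] = ()


class ScenarioExtractor(Protocol):
    """Any object with this method can be plugged into the application."""

    def extract(self, text: str, home_team: str, away_team: str) -> Extraction: ...


class MockExtractor:
    """Deterministic keyword lookup, intended for tests only."""

    def extract(self, text: str, home_team: str, away_team: str) -> Extraction:
        lowered = text.lower()
        if "striker" in lowered and "out" in lowered:
            if home_team.lower() in lowered:
                team = "home"
            elif away_team.lower() in lowered:
                team = "away"
            else:
                team = "unknown"
            factor = Factor("key_attacker_unavailable", team, 1.0, 1.0, text)
            return Extraction(factors=(factor,))
        return Extraction()  # nothing recognised: empty extraction


def run_extraction(extractor: ScenarioExtractor, text: str, home: str, away: str) -> Extraction:
    # The caller depends only on the protocol, never on a concrete backend.
    return extractor.extract(text, home, away)
```

A llama.cpp-backed class exposing the same `extract` method could then replace `MockExtractor` without changes to the calling code.
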
## Repository Structure

```text
app.py
pyproject.toml
requirements.txt
README.md
src/underdog_lab/
  config.py
  domain.py
  data/
    repository.py
    schemas.py
  forecasting/
    poisson.py
    adjustments.py
    scoring.py
  scenarios/
    taxonomy.py
    schemas.py
    extractor.py
    llama_cpp_extractor.py
    mock_extractor.py
    grammar.gbnf
    prompts.py
  ui/
    app.py
    components.py
    charts.py
    theme.py
  telemetry/
    traces.py
scripts/
  prepare_matches.py
  generate_synthetic_data.py
  evaluate_extractor.py
  export_gguf.py
  smoke_test_space.py
training/
  modal_train.py
  configs/
    qlora.yaml
data/
  raw/
  processed/
    matches.parquet
    team_strengths.parquet
  scenarios/
    train.jsonl
    validation.jsonl
    test.jsonl
  traces/
models/
  README.md
tests/
  unit/
  property/
  integration/
  fixtures/
docs/
```
## Domain Contracts

### Match Record

```python
from datetime import date

from pydantic import BaseModel


class MatchRecord(BaseModel):
    match_id: str
    kickoff_date: date
    competition: str
    home_team: str
    away_team: str
    neutral_venue: bool
    home_goals: int  # result field: hidden until the forecast is committed
    away_goals: int  # result field: hidden until the forecast is committed
    baseline_home_xg: float
    baseline_away_xg: float
    context: str
    reveal_notes: str | None = None
```

`home_goals` and `away_goals` must remain hidden from the UI until the user commits a forecast.
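
A minimal sketch of one way to enforce this rule, assuming the UI only ever receives a filtered dictionary. `public_view`, `HIDDEN_FIELDS`, and the trimmed `Match` dataclass are hypothetical; hiding `reveal_notes` as well is an assumption, made because those notes may describe the result.

```python
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional

# Fields that must not reach the UI before the forecast is committed.
# Including reveal_notes here is an assumption, not part of the plan.
HIDDEN_FIELDS = frozenset({"home_goals", "away_goals", "reveal_notes"})


@dataclass(frozen=True)
class Match:
    """Trimmed stand-in for MatchRecord (hypothetical, stdlib only)."""
    match_id: str
    kickoff_date: date
    home_team: str
    away_team: str
    home_goals: int
    away_goals: int
    context: str
    reveal_notes: Optional[str] = None


def public_view(match: Match, forecast_committed: bool) -> dict:
    """Return the fields the UI may see in the current state."""
    data = asdict(match)
    if forecast_committed:
        return data
    return {key: value for key, value in data.items() if key not in HIDDEN_FIELDS}
```

Building the pre-reveal state only from `public_view(match, False)` means a rendering bug cannot leak a score that was never passed to the component.
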
### Factor Taxonomy

Freeze the taxonomy before generating data:

```text
key_attacker_unavailable
key_defender_unavailable
goalkeeper_unavailable
multiple_starters_unavailable
squad_rotation
fatigue_disadvantage
rest_advantage
travel_disadvantage
altitude_disadvantage
heat_disadvantage
home_advantage
neutral_venue
defensive_game_state
must_win_incentive
```

Classification-only outputs:

```text
unsupported_claim
ambiguous_claim
irrelevant_text
```

### Extraction Schema

```python
from typing import Literal

from pydantic import BaseModel, Field

# FactorType is the frozen taxonomy above, for example an Enum or Literal
# (assumed here to be defined in scenarios/taxonomy.py).


class ExtractedFactor(BaseModel):
    factor_type: FactorType
    team: Literal["home", "away", "both", "unknown"]
    severity: float = Field(ge=0.0, le=1.0)
    certainty: float = Field(ge=0.0, le=1.0)
    evidence: str


class ScenarioExtraction(BaseModel):
    factors: list[ExtractedFactor] = Field(max_length=6)  # at most six factors
    unsupported_claims: list[str] = Field(default_factory=list)
    ambiguities: list[str] = Field(default_factory=list)
```

The model never emits expected-goal deltas or probabilities.
## Deterministic Adjustment Rules

Store the mapping in versioned Python data or YAML. Example starting ranges:

| Factor | Target | Maximum effect at severity 1.0 |
|---|---|---:|
| key_attacker_unavailable | affected attack | -12% |
| key_defender_unavailable | opponent attack | +8% |
| goalkeeper_unavailable | opponent attack | +10% |
| multiple_starters_unavailable | affected attack/defence | -8% / opponent +6% |
| squad_rotation | affected attack/defence | -6% / opponent +4% |
| fatigue_disadvantage | affected attack/defence | -5% / opponent +3% |
| rest_advantage | affected attack/defence | +4% / opponent -2% |
| travel_disadvantage | affected attack | -4% |
| altitude_disadvantage | affected attack/defence | -5% / opponent +3% |
| heat_disadvantage | affected attack | -3% |
| home_advantage | home attack | +6% |
| neutral_venue | home attack | -6% |
| defensive_game_state | both attacks | -6% |
| must_win_incentive | affected attack/defence | +5% / opponent +2% |

These values are product assumptions, not learned truths. Label them clearly and version them as `ruleset_v1`. Clamp final expected goals to a safe range such as `[0.15, 4.0]`.
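
As a sketch of the "versioned Python data" option, the table could be encoded as follows. `Rule`, `RULES`, and `clamp_xg` are hypothetical names. Reading "affected attack/defence" as one change to the affected team's expected goals plus one change to the opponent's expected goals is an interpretation of the table, not something the plan states.

```python
from dataclasses import dataclass

RULESET_VERSION = "ruleset_v1"
XG_MIN, XG_MAX = 0.15, 4.0  # clamp range quoted in the plan


@dataclass(frozen=True)
class Rule:
    """Maximum fractional change in expected goals at severity 1.0."""
    own_attack: float = 0.0       # applied to the affected team's xG
    opponent_attack: float = 0.0  # applied to the opponent's xG


RULES = {
    "key_attacker_unavailable": Rule(own_attack=-0.12),
    "key_defender_unavailable": Rule(opponent_attack=0.08),
    "goalkeeper_unavailable": Rule(opponent_attack=0.10),
    "multiple_starters_unavailable": Rule(-0.08, 0.06),
    "squad_rotation": Rule(-0.06, 0.04),
    "fatigue_disadvantage": Rule(-0.05, 0.03),
    "rest_advantage": Rule(0.04, -0.02),
    "travel_disadvantage": Rule(own_attack=-0.04),
    "altitude_disadvantage": Rule(-0.05, 0.03),
    "heat_disadvantage": Rule(own_attack=-0.03),
    # The next two always target the home side, whatever team was extracted.
    "home_advantage": Rule(own_attack=0.06),
    "neutral_venue": Rule(own_attack=-0.06),
    # Applied once to each team when the extracted team is "both".
    "defensive_game_state": Rule(own_attack=-0.06),
    "must_win_incentive": Rule(0.05, 0.02),
}


def clamp_xg(value: float) -> float:
    """Keep expected goals inside the safe range."""
    return max(XG_MIN, min(XG_MAX, value))
```

Keeping `RULESET_VERSION` next to the data makes it straightforward to record in traces which assumptions produced a given forecast.
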
## Forecasting Engine

### Baseline

Use precomputed expected goals per match. For a score grid from 0 to 8 goals:

```text
P(home=i, away=j) = Poisson(i; lambda_home) * Poisson(j; lambda_away)
```

Sum score cells into home-win, draw, and away-win probabilities. Normalize after truncation.
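
The grid computation above can be sketched with the standard library alone. `poisson_pmf` and `outcome_probabilities` are hypothetical names; the grid limit of 8 goals comes from the plan.

```python
import math

MAX_GOALS = 8  # the score grid covers 0..8 goals for each side


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals given an expected-goals rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def outcome_probabilities(lambda_home: float, lambda_away: float,
                          max_goals: int = MAX_GOALS) -> tuple[float, float, float]:
    """Return (home win, draw, away win) from independent Poisson scores."""
    home = draw = away = 0.0
    for i in range(max_goals + 1):
        p_home_i = poisson_pmf(i, lambda_home)
        for j in range(max_goals + 1):
            cell = p_home_i * poisson_pmf(j, lambda_away)
            if i > j:
                home += cell
            elif i == j:
                draw += cell
            else:
                away += cell
    # Truncating the grid drops a little probability mass, so renormalize.
    total = home + draw + away
    return home / total, draw / total, away / total
```

For example, `outcome_probabilities(1.6, 1.1)` returns three values that sum to one, with the home side favored because its rate is higher.
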
### Scenario Adjustment

1. Deduplicate equivalent factors.
2. Ignore `unknown` team assignments unless deterministically resolvable.
3. Multiply effects by `severity * certainty`.
4. Apply bounded multiplicative adjustments.
5. Clamp expected goals.
6. Recompute the score matrix and 1X2 probabilities.
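
Steps 1 to 5 can be sketched as below. The names `adjust_xg`, `Factor`, and `clamp` are hypothetical, only three rules are included for brevity, and two choices are assumptions rather than decisions taken in the plan: duplicates are collapsed by keeping the strongest `severity * certainty` weight, and a `both` assignment is expanded into one factor per team.

```python
from dataclasses import dataclass

XG_MIN, XG_MAX = 0.15, 4.0

# (change to affected team's xG, change to opponent's xG) at severity 1.0.
RULES = {
    "key_attacker_unavailable": (-0.12, 0.0),
    "fatigue_disadvantage": (-0.05, 0.03),
    "rest_advantage": (0.04, -0.02),
}


@dataclass(frozen=True)
class Factor:
    factor_type: str
    team: str  # "home", "away", "both", or "unknown"
    severity: float
    certainty: float


def clamp(value: float) -> float:
    return max(XG_MIN, min(XG_MAX, value))


def adjust_xg(home_xg: float, away_xg: float, factors: list) -> tuple:
    strongest = {}
    for factor in factors:
        if factor.factor_type not in RULES:
            continue
        if factor.team == "both":
            teams = ("home", "away")
        elif factor.team in ("home", "away"):
            teams = (factor.team,)
        else:
            continue  # step 2: unresolved "unknown" assignments are ignored
        weight = factor.severity * factor.certainty  # step 3
        for team in teams:
            key = (factor.factor_type, team)
            # Step 1: equivalent factors collapse to the strongest weight.
            strongest[key] = max(strongest.get(key, 0.0), weight)

    multiplier = {"home": 1.0, "away": 1.0}
    for (factor_type, team), weight in strongest.items():
        own, opponent = RULES[factor_type]
        other = "away" if team == "home" else "home"
        multiplier[team] *= 1.0 + own * weight        # step 4
        multiplier[other] *= 1.0 + opponent * weight
    # Step 5: clamp the adjusted expected goals.
    return clamp(home_xg * multiplier["home"]), clamp(away_xg * multiplier["away"])
```

The returned pair is then passed through the same score-grid calculation as the baseline to complete step 6.
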
### Scoring

Implement:

- multiclass log loss with probability clipping;
- multiclass Brier score;
- optional simple points for game presentation only.

Scientific tables use log loss and Brier score, never game points alone.
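
A minimal sketch of the two metrics for a single match follows. The clipping epsilon is an assumed value, and the Brier score here is the sum of squared errors over the three outcomes (range 0 to 2); some sources divide by the number of classes, so the convention should be stated wherever results are reported.

```python
import math

OUTCOMES = ("home", "draw", "away")
EPS = 1e-15  # assumed clipping bound; prevents log(0)


def log_loss(probs: tuple, outcome: str, eps: float = EPS) -> float:
    """Negative log of the probability given to the observed outcome."""
    p = probs[OUTCOMES.index(outcome)]
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p)


def brier_score(probs: tuple, outcome: str) -> float:
    """Sum of squared differences between forecast and one-hot outcome."""
    return sum(
        (p - (1.0 if name == outcome else 0.0)) ** 2
        for p, name in zip(probs, OUTCOMES)
    )
```

Lower is better for both. Averaging each metric over the challenge set gives the comparison between baseline, adjusted model, and user.
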
## Data Plan

### Historical Challenge Set

Prepare 20–30 matches, prioritizing variety rather than only famous upsets:

- favorites winning comfortably;
- draws;
- narrow favorites losing;
- neutral-venue tournament matches;
- high-scoring and low-scoring matches.

Do not expose result-bearing language in `context`. Famous showcase matches must be excluded from extraction training examples and the frozen extractor test set.

### Extraction Dataset

Target:

- 600–800 synthetic training examples;
- 50 validation examples;
- 80–100 frozen test examples.

Balance factor categories and include:

- single and multiple factors;
- pronouns and team-name references;
- negation;
- contradictions;
- irrelevant supporter commentary;
- unsupported facts;
- prompt-injection attempts;
- paraphrase groups;
- severity ladders.

Manually review every validation and test item. Store provenance and split before training.

Example record:

```json
{
  "id": "scenario-0042",
  "home_team": "Argentina",
  "away_team": "Saudi Arabia",
  "text": "Argentina's first-choice striker is confirmed out.",
  "expected": {
    "factors": [
      {
        "factor_type": "key_attacker_unavailable",
        "team": "home",
        "severity": 1.0,
        "certainty": 1.0
      }
    ],
    "unsupported_claims": [],
    "ambiguities": []
  }
}
```
## Constrained Inference

Use `llama.cpp` for both base and tuned models. Enforce the JSON structure with GBNF grammar.

Runtime settings:

- temperature: `0` or near zero;
- short maximum output;
- fixed system prompt;
- one retry only for semantic validation failures;
- deterministic empty extraction fallback;
- cache common demo scenarios.

Grammar guarantees syntax, not semantic correctness. Pydantic and domain validation remain mandatory.
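
The retry and fallback settings can be sketched independently of the inference backend. In the example below, `generate` is a hypothetical callable that wraps the llama.cpp call and returns raw JSON text, and `validate` is a stdlib stand-in for the Pydantic and domain checks with a shortened factor list. Treating a JSON parse failure (for example from truncated output) the same way as a semantic failure is an assumption.

```python
import json
from typing import Callable

ALLOWED_TEAMS = ("home", "away", "both", "unknown")
ALLOWED_FACTORS = ("key_attacker_unavailable", "fatigue_disadvantage")  # subset for brevity


def empty_extraction() -> dict:
    """Deterministic fallback: no factors, so no forecast adjustment."""
    return {"factors": [], "unsupported_claims": [], "ambiguities": []}


def validate(payload: object) -> dict:
    """Raise ValueError when the payload breaks the extraction contract."""
    if not isinstance(payload, dict):
        raise ValueError("extraction must be a JSON object")
    factors = payload.get("factors")
    if not isinstance(factors, list) or len(factors) > 6:
        raise ValueError("factors must be a list of at most 6 items")
    for factor in factors:
        if not isinstance(factor, dict):
            raise ValueError("each factor must be an object")
        if factor.get("factor_type") not in ALLOWED_FACTORS:
            raise ValueError("factor_type is outside the taxonomy")
        if factor.get("team") not in ALLOWED_TEAMS:
            raise ValueError("team is not a recognised value")
        for name in ("severity", "certainty"):
            value = factor.get(name)
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"{name} must be a number")
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")
    return payload


def extract_with_retry(generate: Callable[[str], str], prompt: str,
                       max_attempts: int = 2) -> dict:
    """Try once, retry once, then fall back to the empty extraction."""
    for _ in range(max_attempts):
        try:
            return validate(json.loads(generate(prompt)))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue
    return empty_extraction()
```

Because the fallback contains no factors, a failed extraction leaves the forecast at its baseline instead of blocking the challenge.
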
## Fine-Tuning Plan

### Base Model

Start with `HuggingFaceTB/SmolLM3-3B` or its supported GGUF conversion.

### Method

Run one QLoRA supervised fine-tune on Modal:

- rank: 16;
- alpha: 32;
- dropout: 0.05;
- learning rate: approximately `2e-4`;
- 2–3 epochs;
- sequence length sized to the compact extraction task;
- early stopping or checkpoint selection using validation semantic metrics.

Exact settings may change for compatibility, but do not run a broad sweep.

### Artifacts

Publish:

- LoRA adapter;
- merged model if licensing and storage allow;
- GGUF runtime model;
- dataset or anonymized trace dataset;
- model card with taxonomy, training method, limitations, and evaluation.

### Decision Gate: June 13 Evening

Ship the tuned model only if it:

- improves factor micro-F1 by a meaningful amount;
- does not regress team attribution or unsupported-claim detection materially;
- passes every behavioral property test;
- fits Space memory and startup constraints;
- has acceptable median latency.

Otherwise ship the base model and report the negative result honestly. Do not risk the submission for the Well-Tuned badge.
## Evaluation Plan

### Extraction Metrics

- factor micro-F1 and macro-F1;
- team attribution accuracy;
- severity mean absolute error;
- certainty mean absolute error;
- unsupported-claim precision, recall, and F1;
- ambiguity detection F1;
- exact semantic match rate;
- paraphrase consistency;
- median and p95 latency.

Because constrained decoding guarantees schema shape, schema-validity rate is a runtime health metric, not the main fine-tuning result.
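
As an illustration of the first metric, the sketch below computes micro-F1 and macro-F1 when each example is reduced to a set of factor labels. `factor_f1` is a hypothetical helper; matching on `factor_type` alone, without team attribution, is a simplifying assumption.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def factor_f1(gold: list, predicted: list) -> tuple:
    """Return (micro_f1, macro_f1) over per-example sets of factor labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_set, pred_set in zip(gold, predicted):
        for label in gold_set & pred_set:
            tp[label] += 1
        for label in pred_set - gold_set:
            fp[label] += 1
        for label in gold_set - pred_set:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Micro: pool the counts across labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro
```

Micro-F1 is dominated by frequent factor types, while macro-F1 weights rare ones equally, which is why reporting both is useful for a balanced taxonomy.
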
### Behavioral Property Tests

- An unavailable attacker never improves the affected attack.
- Higher severity never produces a smaller adjustment for the same factor.
- Unsupported or irrelevant text produces no forecast adjustment.
- Equivalent paraphrases map to equivalent factor sets within tolerance.
- Home/away references remain correctly attributed.
- Duplicate factors do not stack without bounds.
- Contradictory factors are flagged or handled deterministically.
- Forecast probabilities are finite, nonnegative, and sum to one.
- Hidden results cannot enter the pre-reveal state.
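
The first two properties can be checked with a plain grid search, as sketched below. `adjusted_attack` and `check_properties` are hypothetical, the -12% figure is the `key_attacker_unavailable` entry from the ruleset table, and a property-testing library could generate the inputs instead of the fixed grid used here.

```python
import math

MAX_EFFECT = -0.12          # key_attacker_unavailable at severity 1.0
XG_MIN, XG_MAX = 0.15, 4.0  # clamp range from the ruleset section


def adjusted_attack(base_xg: float, severity: float, certainty: float = 1.0) -> float:
    """Affected team's expected goals after the attacker-unavailable rule."""
    scaled = base_xg * (1.0 + MAX_EFFECT * severity * certainty)
    return max(XG_MIN, min(XG_MAX, scaled))


def check_properties(bases=(0.15, 0.8, 1.5, 4.0), steps: int = 20) -> int:
    """Assert both properties on a severity grid; return the number of checks."""
    checked = 0
    for base in bases:  # bases are chosen inside the clamp range
        previous = adjusted_attack(base, 0.0)
        assert previous == base  # zero severity changes nothing
        for k in range(1, steps + 1):
            current = adjusted_attack(base, k / steps)
            assert current <= base       # never improves the affected attack
            assert current <= previous   # higher severity, no smaller cut
            assert math.isfinite(current) and current >= XG_MIN
            previous = current
            checked += 1
    return checked
```

The same pattern extends to the other rules by parameterizing the effect size and the expected direction of movement.
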
### Integration Tests

- Select match -> extract -> adjust -> score -> reveal.
- Extractor timeout returns a usable fallback state.
- Invalid model output cannot reach the forecast engine.
- Base and tuned adapters satisfy the same interface.
- Space starts without external inference credentials.

## Gradio UI Plan

### Screen 1: Challenge

- Match card with competition and venue context
- Baseline probability bars
- Scenario textbox with three examples
- Extracted-factor chips with severity and certainty
- Before/after probability animation
- User probability controls constrained to sum to 100%
- Commit and reveal action
- Score comparison
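
One possible way to satisfy the sum-to-100% constraint is to rescale whatever the three controls hold at commit time. This is a sketch of that option only; `normalize_user_forecast` is a hypothetical helper, and rejecting inputs that do not already sum to 100 would be an equally valid design.

```python
def normalize_user_forecast(home: float, draw: float, away: float) -> tuple:
    """Rescale three nonnegative inputs into probabilities that sum to one."""
    values = (home, draw, away)
    if any(value < 0 for value in values):
        raise ValueError("probabilities cannot be negative")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one outcome needs a positive value")
    return tuple(value / total for value in values)
```

Rescaling keeps the controls forgiving, at the cost of silently changing the numbers the user typed, so the UI would need to show the normalized values before the commit action.
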
### Screen 2: Stress Test

- Match selector
- Factor selector
- Team selector
- Severity slider
- Live probability curve
- Plain-language explanation of deterministic assumptions

### Screen 3: Tiny Model Lab

- Base versus tuned metric table
- Confusion by factor category
- Example successes and failures
- Model size, quantization, latency, and Modal training compute
- Links to model, dataset, traces, and field notes

### Visual Direction

Use a broadcast-analysis aesthetic rather than default Gradio controls:

- dark pitch-inspired background;
- strong team-color probability bars;
- compact match cards;
- visible “baseline” and “scenario” states;
- restrained animation;
- mobile-safe layout.

Polish the primary challenge screen before secondary tabs.
## Delivery Schedule

### June 12: Submission-Safe Vertical Slice

- Scaffold repository and dependencies.
- Implement schemas, taxonomy, rules, Poisson model, and scoring.
- Prepare at least 10 historical matches.
- Implement mock extractor and base-model adapter.
- Add GBNF grammar and validation.
- Build minimal Challenge screen.
- Deploy the first working Space.
- Start synthetic-data generation only after taxonomy freeze.

Exit criteria: a first-time user in a clean session can complete one challenge on the deployed Space.

### June 13: Evaluation and Fine-Tuning

- Finalize 20–30 matches.
- Freeze and manually review extractor test set.
- Benchmark the base model.
- Run one QLoRA job on Modal.
- Build reveal, scoring, and stress-test features.
- Implement behavioral and integration tests.
- Evaluate tuned versus base model.
- Apply the decision gate.

Exit criteria: selected runtime model is known and the complete core experience works.

### June 14: Deployment and Submission Assets

- Merge and quantize the selected model.
- Publish model and dataset/trace artifacts.
- Verify llama.cpp inference in the Space.
- Build Tiny Model Lab.
- Finish custom styling.
- Test cold start, mobile layout, and failure paths.
- Record the 60–90 second demo.
- Draft field notes and social post.

Exit criteria: release candidate is frozen and all submission URLs exist.

### June 15: Buffer and Submit

- Fix only critical defects.
- Re-run smoke tests from a clean session.
- Publish field notes and social post.
- Submit Space, demo, and social links.
- Tag the release commit.

No new features on June 15.
## Acceptance Criteria

### Must Have

- Hosted Gradio Space
- Runtime model <=4B
- llama.cpp inference
- No cloud inference dependency
- Historical challenge playable end to end
- Grammar-constrained semantic extraction
- Deterministic validated adjustments
- Probability visualization and scoring
- At least 20 historical matches
- Automated unit and integration tests
- Demo video and social post

### Should Have

- Fine-tuned model with measured semantic improvement
- Stress-test slider
- Tiny Model Lab
- Published model and dataset/traces
- Field notes
- Custom visual styling

### Could Have

- Live World Cup fixture card
- Forecast Courtroom presentation
- Community leaderboard
- More sophisticated baseline calibration
## Risk Register

| Risk | Trigger | Mitigation |
|---|---|---|
| Space cannot load 3B GGUF | OOM or startup timeout | Quantize more aggressively, reduce context, lazy-load, keep mock/demo fallback |
| llama.cpp lacks model compatibility | Conversion or inference failure | Test base GGUF on June 12; switch to a known-supported <=4B model immediately |
| Fine-tune underperforms | No semantic gain on frozen test | Ship base model; document result |
| Synthetic labels are noisy | Low reviewer agreement | Shrink training set and improve labels rather than train longer |
| Demo feels like hindsight | Showcase relies on known upset | Hide result and use only pre-match or explicitly hypothetical context |
| UI looks like a chat wrapper | Scenario box dominates | Make factors and probability movement the central visual artifact |
| Cloud dependency breaks Off the Grid | Runtime calls external endpoint | Keep all inference in the Space and bundle/cache required artifacts |
| Deadline pressure | Core incomplete by June 13 | Cut the secondary tabs and the fine-tune before cutting any part of the deployable core |
## Release Checklist

- [ ] Registration and organization access confirmed
- [ ] Space created under hackathon organization
- [ ] Model parameter total documented
- [ ] Model license checked
- [ ] Dataset licenses and attribution documented
- [ ] No result leakage in challenge context
- [ ] Base benchmark saved
- [ ] Tuned benchmark saved or negative result documented
- [ ] GGUF and llama.cpp versions pinned
- [ ] Clean Space restart passes
- [ ] Mobile smoke test passes
- [ ] Model card published
- [ ] Dataset or trace card published
- [ ] Field notes published
- [ ] Demo video uploaded
- [ ] Social post published
- [ ] Submission completed before deadline

## Final Pitch

> Underdog Lab fine-tunes a 3B local model to turn messy football narratives into validated forecasting evidence. A transparent statistical engine converts those factors into probabilities, letting users stress-test scenarios, challenge the model, and learn why calibrated forecasts are more useful than confident guesses.