09 - Risk Register & Mitigations
Ranked by likelihood × impact. Top of list = address first.
R1 - Reward curve goes flat (HIGH likelihood, HIGH impact)
Symptom: After 100 GRPO steps, mean episode reward stays at baseline (~0.25).
Causes:
- Reward signal too sparse
- Per-step shaping too small relative to terminal reward
- Rollout parsing broken (model outputs gibberish, parser silently fails)
- KL coefficient (β) too high, so the policy can't move
Mitigations:
- Sanity-check rollout parser: print 5 random completions + parsed actions
- Verify shaping rewards firing: log per-step reward by action type
- Reduce β to 0.01
- Increase shaping reward magnitude (×2)
- Simplify rubric: drop InfoGain temporarily, use only FieldMatch
- Pre-warm with SFT on synthetic "ask first" trajectories (1-2 epochs)
Time to detect: 15 min (smoke test of 100 steps). Time to fix: 30-60 min.
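The parser and shaping checks above can be sketched as follows. This is a minimal sketch, not the project's actual parser: `parse_action`, the JSON action format with a `type` key, and `summarize_rollouts` are hypothetical names chosen for illustration. The idea is to count parse failures explicitly instead of letting them pass silently, and to report mean per-step reward by action type so that a shaping reward that never fires becomes visible.

```python
import json
from collections import defaultdict


def parse_action(completion):
    """Hypothetical parser: expects a JSON object with a "type" key.

    Returns the parsed dict, or None when the completion is not usable.
    """
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or "type" not in action:
        return None
    return action


def summarize_rollouts(steps):
    """steps: iterable of (completion_text, step_reward) pairs.

    Returns (parse_failures, mean_reward_by_action_type).
    """
    failures = 0
    totals = defaultdict(float)
    counts = defaultdict(int)
    for completion, reward in steps:
        action = parse_action(completion)
        if action is None:
            failures += 1  # surface silent parser failures explicitly
            continue
        totals[action["type"]] += reward
        counts[action["type"]] += 1
    means = {kind: totals[kind] / counts[kind] for kind in totals}
    return failures, means
```

If the failure count is a large fraction of the sampled steps, the flat curve is more likely a parsing problem than a learning problem.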
R2 - Reward hacking (HIGH likelihood, MEDIUM impact)
Symptom: Reward curve climbs but qualitative outputs are gibberish/repetitive.
Likely hacks:
- Always ask same generic question 6 times then submit empty plan
- Submit JSON with all profile field keys but garbage values
- Output the same action token over and over
Mitigations:
- Duplicate-Q penalty (already in plan)
- HallucinationCheckRubric (already in plan)
- FormatCheck Gate with strict schema (already in plan)
- Add EntropyRubric to penalize repeated actions (new component, only if needed)
- Manual inspection of 10 trained outputs every 100 steps
Time to detect: 100 GRPO steps + manual inspection (15 min). Time to fix: 30 min (add penalty component).
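One simple way to penalize repeated actions is sketched below. It is an assumption about how such a component could look, not the planned EntropyRubric itself: it uses a plain duplicate-fraction penalty rather than an entropy measure, and the function name and the `max_penalty` default are hypothetical.

```python
from collections import Counter


def repetition_penalty(actions, max_penalty=0.5):
    """Return a penalty in [0, max_penalty] that grows with repeated actions.

    actions: list of hashable items, e.g. normalized question strings.
    The penalty is max_penalty times the fraction of actions that
    duplicate an earlier action in the same episode.
    """
    if not actions:
        return 0.0
    counts = Counter(actions)
    duplicates = sum(n - 1 for n in counts.values())
    return max_penalty * duplicates / len(actions)
```

An episode that asks the same question six times and then submits would have 5 duplicates out of 7 actions, while an episode with all-distinct actions is not penalized.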
R3 - Colab session times out mid-training (MEDIUM, MEDIUM)
Symptom: Long training run gets killed by Colab free-tier session limits.
Mitigations:
- Save LoRA checkpoint every 100 steps
- Always run training in resumable form (TRL supports resume from checkpoint)
- Plan training in 100-step chunks, not one mega-run
- Have second Google account ready for backup
Time to detect: live. Time to fix: 5 min (resume from last checkpoint).
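To make resuming quick after a session dies, a small helper can locate the newest checkpoint. The sketch below assumes checkpoints are saved as `checkpoint-<step>` directories under the output directory, which is the Hugging Face `Trainer` naming convention; the helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-<step> directory, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

The returned path can then be passed to the trainer, for example `trainer.train(resume_from_checkpoint=str(path))`, with `save_steps=100` set in the training config so that at most 100 steps are lost.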
R4 - HF Space build fails (MEDIUM, HIGH)
Symptom: git push space main succeeds but Space build errors out.
Common causes:
- Dockerfile issues (missing deps, wrong Python version)
- pyproject.toml resolution failure
- HF Space hardware mismatch
Mitigations:
- Test Docker build LOCALLY before pushing:
  docker build -t clarify-rl . && docker run -p 8000:8000 clarify-rl
- Mirror EXACT Dockerfile from working SRE env (which we know builds)
- Push minimal stub Space FIRST (just FastAPI hello world), confirm builds, then layer on env
- Keep Space build logs open in browser tab while pushing
Time to detect: 5-10 min (HF build logs). Time to fix: 15-30 min (Docker iteration).
R5 - Validator rejects submission (LOW likelihood, FATAL impact)
Symptom: Auto-validator marks submission incomplete; never reaches human judges.
Mitigations:
- Run through every item in the docs/07-deployment.md checklist
- 1-hour pre-deadline buffer for fixes
- Test ALL deliverable links from incognito browser
- Make sure plots are committed as files, not just in notebook outputs
Time to detect: post-submission (TOO LATE: must validate before submitting). Time to fix: depends on what's missing.
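Part of this pre-submission check can be automated. The sketch below only verifies that required files exist on disk and are non-empty; it does not confirm that they are committed or that links work from an incognito browser. The function name and the example paths in the usage note are hypothetical.

```python
from pathlib import Path


def missing_deliverables(repo_root, required_paths):
    """Return the required paths that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in required_paths:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

For example, `missing_deliverables(".", ["README.md", "plots/reward_curve.png"])` would return an empty list only when both files are present and non-empty.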
R6 - Training takes too long on T4 (LOW, MEDIUM)
Symptom: 600 GRPO steps take >2 hours; eats into Day 2 schedule.
Mitigations:
- Use Unsloth (we already are)
- Use 4-bit quantization (we already are)
- Reduce max_seq_length to 2048 if needed
- Reduce num_generations to 2 (instead of 4)
- Stop at 300 steps if curve is good; quality > quantity
Time to detect: 30 min into training (extrapolate). Time to fix: tune config, restart from checkpoint.
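The extrapolation mentioned above is simple linear arithmetic. The sketch below assumes a roughly constant time per step, which may not hold if sequence lengths grow during training; both function names are hypothetical.

```python
def projected_total_minutes(steps_done, elapsed_minutes, target_steps):
    """Linearly extrapolate total wall-clock minutes for target_steps."""
    if steps_done <= 0:
        raise ValueError("need at least one completed step to extrapolate")
    return elapsed_minutes * target_steps / steps_done


def steps_within_budget(steps_done, elapsed_minutes, budget_minutes):
    """Largest step count that fits in budget_minutes at the observed rate."""
    if steps_done <= 0 or elapsed_minutes <= 0:
        raise ValueError("need a positive step count and elapsed time")
    return int(budget_minutes * steps_done / elapsed_minutes)
```

As an illustration with made-up numbers: if 100 steps took 30 minutes, 600 steps project to 180 minutes, which exceeds the 2-hour limit, and only about 400 steps fit in 120 minutes.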
R7 - Rubric doesn't separate good from bad (LOW, HIGH) - VERIFIED OK
Symptom: Even oracle policy gets ~0.5; even random policy gets ~0.5.
Causes:
- Weights wrong, components average out
- FormatCheck too lenient
- HallucinationCheck too punitive
Mitigations:
- Run sanity policies BEFORE training:
- Random: should get ~0.20
- Oracle (asks all critical Qs, perfect plan): should get ~0.95
- Blank plan: should get 0.0
- If gap is small, retune weights and component logic before training
Current status: Oracle scores ~0.89 via smoke_env.py (FormatCheck=1.0, FieldMatch=1.0, InfoGain=1.0, Efficiency=0.5, Hallucination=0.75). Gap is healthy.
Time to detect: 10 min (sanity script). Time to fix: 30-60 min.
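The sanity-policy check can be expressed as a small gate that runs before training. This is a sketch under assumptions: the function name and the `min_gap` threshold of 0.5 are hypothetical, and the scores are expected to come from the real environment (for example via smoke_env.py), not from this code.

```python
def check_separation(scores, min_gap=0.5):
    """Check mean rubric scores for the 'random', 'oracle' and 'blank' policies.

    Returns a list of human-readable problems; an empty list means the
    rubric separates the sanity policies as expected.
    """
    problems = []
    if scores["blank"] != 0.0:
        problems.append(f"blank plan scored {scores['blank']:.2f}, expected 0.0")
    gap = scores["oracle"] - scores["random"]
    if gap < min_gap:
        problems.append(f"oracle-random gap {gap:.2f} is below {min_gap:.2f}")
    return problems
```

With the oracle at about 0.89 and a random policy near the 0.20 target, the gap is about 0.69 and the check passes; the failure case in the symptom (both near 0.5) would be flagged.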
R8 - Profile generator produces unsolvable scenarios (LOW, MEDIUM) - MITIGATED
Symptom: Even oracle can't get high score on some scenarios.
Causes:
- Field vocabulary too sparse, so the user simulator returns the wrong field
- Critical fields not always present
- Request template too vague to even hint at task type
Mitigations:
- Validate generator: 100 random scenarios → oracle scores them → all should be ≥0.7
- Add task_type hint to every request template (subtle, e.g. "dinner" → restaurant)
- Ensure FIELD_KEYWORDS covers all profile fields
Fix applied: scenarios.py now always includes required_keys in the profile for medium/hard difficulty. Hard range adjusted to (6,7) to match actual field pool sizes (max 7).
Time to detect: 5 min (sanity check). Time to fix: 15-30 min.
R9 - One team member becomes unavailable (LOW, HIGH)
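The generator validation described above can be sketched as a loop over seeded random scenarios. The callables `generate(rng)` and `oracle_score(scenario)` are hypothetical stand-ins for the real generator in scenarios.py and the oracle policy, whose actual signatures may differ.

```python
import random


def validate_generator(generate, oracle_score, n=100, threshold=0.7, seed=0):
    """Score n generated scenarios with the oracle and return the failures.

    generate(rng) -> scenario and oracle_score(scenario) -> float are
    hypothetical callables standing in for the real generator and oracle.
    """
    rng = random.Random(seed)
    failures = []
    for index in range(n):
        scenario = generate(rng)
        score = oracle_score(scenario)
        if score < threshold:
            failures.append((index, score, scenario))
    return failures
```

Fixing the seed keeps any failing scenario reproducible, so it can be inspected by index after the run.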
Symptom: Anurag or Kanan can't continue (illness, technical issues, lost device).
Mitigations:
- Both can git-push to both remotes
- Both have HF + GitHub credentials
- Both have Colab access
- Pair-program critical sections (env, rubric)
Time to detect: live. Time to fix: depends, but project should continue.
R10 - Last-minute organizational changes (LOW, VARIABLE)
Symptom: Submission form changes, deadline shifts, theme reinterpretations announced.
Mitigations:
- Monitor Discord every 2 hours
- Both team members on Discord notifications
- Have a Plan B for each deliverable (video OR blog, not both required)
Fallback Plans (graceful degradation)
If we run out of time:
- Cut difficulty levels: Ship only "medium" task; still scores well on Storytelling
- Cut task types: Ship 3 of 5 task types instead of all 5
- Cut training: Ship a model fine-tuned with Unsloth on synthetic SFT data and skip GRPO. Worse story but still ships.
- Cut video: Ship blog post only.
- Cut blog: Ship video only.
The core ship is: HF Space + Colab + plots + README. Everything else is bonus.
Risk Score Summary
| ID | Risk | L | I | Score |
|---|---|---|---|---|
| R1 | Reward curve flat | H | H | 9 |
| R2 | Reward hacking | H | M | 6 |
| R3 | Colab timeout | M | M | 4 |
| R4 | HF Space build fail | M | H | 6 |
| R5 | Validator rejection | L | F | 5 |
| R6 | Training too slow | L | M | 2 |
| R7 | Rubric doesn't separate | L | H | 3 |
| R8 | Bad scenarios | L | M | 2 |
| R9 | Team member down | L | H | 3 |
| R10 | Org changes | L | V | 1 |
Columns: L = likelihood, I = impact. Levels: L = low, M = medium, H = high, F = fatal, V = variable.
Top 3 to actively mitigate during build: R1, R2, R4.
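The scores in the table are consistent with a likelihood × impact product using L=1, M=2, H=3. The weights F=5 and V=1 are inferred from the R5 and R10 rows, not stated in the source, so treat them as assumptions. A minimal sketch that recomputes the scores and the top three:

```python
LIKELIHOOD = {"L": 1, "M": 2, "H": 3}
# F and V weights are inferred from the table rows, not given explicitly.
IMPACT = {"L": 1, "M": 2, "H": 3, "F": 5, "V": 1}

# (likelihood, impact) per risk ID, copied from the summary table.
RISKS = {
    "R1": ("H", "H"), "R2": ("H", "M"), "R3": ("M", "M"), "R4": ("M", "H"),
    "R5": ("L", "F"), "R6": ("L", "M"), "R7": ("L", "H"), "R8": ("L", "M"),
    "R9": ("L", "H"), "R10": ("L", "V"),
}


def risk_score(likelihood, impact):
    """Score = likelihood weight times impact weight."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


def top_risks(risks, k=3):
    """Return the k highest-scoring risk IDs; ties keep table order."""
    ranked = sorted(risks, key=lambda rid: risk_score(*risks[rid]), reverse=True)
    return ranked[:k]
```

Under these weights `top_risks(RISKS)` yields R1, R2 and R4, matching the three risks singled out above.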