Spaces:
Running
Running
File size: 7,133 Bytes
2414d31 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 | # 09 β Risk Register & Mitigations
Ranked by likelihood Γ impact. Top of list = address first.
## R1 β Reward curve goes flat (HIGH likelihood, HIGH impact)
**Symptom**: After 100 GRPO steps, mean episode reward stays at baseline (~0.25).
**Causes**:
- Reward signal too sparse
- Per-step shaping too small relative to terminal reward
- Rollout parsing broken (model outputs gibberish, parser silently fails)
- KL coefficient (Ξ²) too high β policy can't move
**Mitigations**:
- Sanity-check rollout parser: print 5 random completions + parsed actions
- Verify shaping rewards firing: log per-step reward by action type
- Reduce Ξ² to 0.01
- Increase shaping reward magnitude (Γ2)
- Simplify rubric: drop InfoGain temporarily, use only FieldMatch
- Pre-warm with SFT on synthetic "ask first" trajectories (1-2 epochs)
**Time to detect**: 15 min (smoke test of 100 steps)
**Time to fix**: 30-60 min
## R2 β Reward hacking (HIGH likelihood, MEDIUM impact)
**Symptom**: Reward curve climbs but qualitative outputs are gibberish/repetitive.
**Likely hacks**:
- Always ask same generic question 6 times then submit empty plan
- Submit JSON with all profile field keys but garbage values
- Output the same action token over and over
**Mitigations**:
- Duplicate-Q penalty (already in plan)
- HallucinationCheckRubric (already in plan)
- FormatCheck Gate with strict schema (already in plan)
- Add EntropyRubric: penalize repeated actions (component if needed)
- Manual inspection of 10 trained outputs every 100 steps
**Time to detect**: 100 GRPO steps + manual inspection (15 min)
**Time to fix**: 30 min (add penalty component)
## R3 β Colab session times out mid-training (MEDIUM, MEDIUM)
**Symptom**: Long training run gets killed by Colab free-tier session limits.
**Mitigations**:
- Save LoRA checkpoint every 100 steps
- Always run training in resumable form (TRL supports resume from checkpoint)
- Plan training in 100-step chunks, not one mega-run
- Have second Google account ready for backup
**Time to detect**: live
**Time to fix**: 5 min (resume from last checkpoint)
## R4 β HF Space build fails (MEDIUM, HIGH)
**Symptom**: `git push space main` succeeds but Space build errors out.
**Common causes**:
- Dockerfile issues (missing deps, wrong Python version)
- pyproject.toml resolution failure
- HF Space hardware mismatch
**Mitigations**:
- Test Docker build LOCALLY before pushing: `docker build -t clarify-rl . && docker run -p 8000:8000 clarify-rl`
- Mirror EXACT Dockerfile from working SRE env (which we know builds)
- Push minimal stub Space FIRST (just FastAPI hello world), confirm builds, then layer on env
- Keep Space build logs open in browser tab while pushing
**Time to detect**: 5-10 min (HF build logs)
**Time to fix**: 15-30 min (Docker iteration)
## R5 β Validator rejects submission (LOW likelihood, FATAL impact)
**Symptom**: Auto-validator marks submission incomplete; never reaches human judges.
**Mitigations**:
- Run through every item in `docs/07-deployment.md` checklist
- 1-hour pre-deadline buffer for fixes
- Test ALL deliverable links from incognito browser
- Make sure plots are committed as files, not just in notebook outputs
**Time to detect**: post-submission (TOO LATE β must validate before)
**Time to fix**: depends on what's missing
## R6 β Training takes too long on T4 (LOW, MEDIUM)
**Symptom**: 600 GRPO steps take >2 hours; eats into Day 2 schedule.
**Mitigations**:
- Use Unsloth (we already are)
- Use 4-bit quantization (we already are)
- Reduce max_seq_length to 2048 if needed
- Reduce num_generations to 2 (instead of 4)
- Stop at 300 steps if curve is good β quality > quantity
**Time to detect**: 30 min into training (extrapolate)
**Time to fix**: tune config, restart from checkpoint
## R7 β Rubric doesn't separate good from bad (LOW, HIGH) β β
VERIFIED OK
**Symptom**: Even oracle policy gets ~0.5; even random policy gets ~0.5.
**Causes**:
- Weights wrong, components average out
- FormatCheck too lenient
- HallucinationCheck too punitive
**Mitigations**:
- Run sanity policies BEFORE training:
- Random: should get ~0.20
- Oracle (asks all critical Qs, perfect plan): should get ~0.95
- Blank plan: should get 0.0
- If gap is small, retune weights and component logic before training
**Current status**: Oracle scores ~0.89 via `smoke_env.py` (FormatCheck=1.0, FieldMatch=1.0, InfoGain=1.0, Efficiency=0.5, Hallucination=0.75). Gap is healthy.
**Time to detect**: 10 min (sanity script)
**Time to fix**: 30-60 min
## R8 β Profile generator produces unsolvable scenarios (LOW, MEDIUM) β β
MITIGATED
**Symptom**: Even oracle can't get high score on some scenarios.
**Causes**:
- Field vocabulary too sparse β user simulator returns wrong field
- Critical fields not always present
- Request template too vague to even hint at task type
**Mitigations**:
- Validate generator: 100 random scenarios β oracle scores them β all should be β₯0.7
- Add task_type hint to every request template (subtle, e.g. "dinner" β restaurant)
- Ensure FIELD_KEYWORDS covers all profile fields
**Fix applied**: `scenarios.py` now always includes `required_keys` in the profile for medium/hard difficulty. Hard range adjusted to (6,7) to match actual field pool sizes (max 7).
**Time to detect**: 5 min (sanity check)
**Time to fix**: 15-30 min
## R9 β One team member becomes unavailable (LOW, HIGH)
**Symptom**: Anurag or Kanan can't continue (illness, technical issues, lost device).
**Mitigations**:
- Both can git-push to both remotes
- Both have HF + GitHub credentials
- Both have Colab access
- Pair-program critical sections (env, rubric)
**Time to detect**: live
**Time to fix**: depends, but project should continue
## R10 β Last-minute organizational changes (LOW, VARIABLE)
**Symptom**: Submission form changes, deadline shifts, theme reinterpretations announced.
**Mitigations**:
- Monitor Discord every 2 hours
- Both team members on Discord notifications
- Have a Plan B for each deliverable (video OR blog, not both required)
## Fallback Plans (graceful degradation)
If we run out of time:
1. **Cut difficulty levels**: Ship only "medium" task β still scores well on Storytelling
2. **Cut task types**: Ship 3 of 5 task types instead of all 5
3. **Cut training**: Use Unsloth pre-trained on synthetic SFT data, skip GRPO. Worse story but still ships.
4. **Cut video**: Ship blog post only.
5. **Cut blog**: Ship video only.
The core ship is: **HF Space + Colab + plots + README**. Everything else is bonus.
## Risk Score Summary
| ID | Risk | L | I | Score |
|----|------|---|---|-------|
| R1 | Reward curve flat | H | H | 9 |
| R2 | Reward hacking | H | M | 6 |
| R3 | Colab timeout | M | M | 4 |
| R4 | HF Space build fail | M | H | 6 |
| R5 | Validator rejection | L | F | 5 |
| R6 | Training too slow | L | M | 2 |
| R7 | Rubric doesn't separate | L | H | 3 |
| R8 | Bad scenarios | L | M | 2 |
| R9 | Team member down | L | H | 3 |
| R10 | Org changes | L | V | 1 |
L=likelihood, I=impact, F=fatal.
**Top 3 to actively mitigate during build**: R1, R2, R4.
|