# 09 - Risk Register & Mitigations

Ranked by likelihood × impact. Top of list = address first.

## R1 - Reward curve goes flat (HIGH likelihood, HIGH impact)

**Symptom**: After 100 GRPO steps, mean episode reward stays at baseline (~0.25).

**Causes**:
- Reward signal too sparse
- Per-step shaping too small relative to terminal reward
- Rollout parsing broken (model outputs gibberish, parser silently fails)
- KL coefficient (β) too high → policy can't move

**Mitigations**:
- Sanity-check rollout parser: print 5 random completions + parsed actions
- Verify shaping rewards firing: log per-step reward by action type
- Reduce β to 0.01
- Increase shaping reward magnitude (×2)
- Simplify rubric: drop InfoGain temporarily, use only FieldMatch
- Pre-warm with SFT on synthetic "ask first" trajectories (1-2 epochs)
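
A minimal sketch of the first two mitigations, assuming rollouts are available as plain completion strings and per-step logs as `(action_type, reward)` pairs. The helper names and the `parse_fn` argument are hypothetical stand-ins for the project's own parser, not its actual API.

```python
import random
from collections import defaultdict


def sample_parsed(completions, parse_fn, k=5, seed=0):
    """Pick up to k random completions and pair each with its parsed action."""
    rng = random.Random(seed)
    picked = rng.sample(completions, min(k, len(completions)))
    return [(text, parse_fn(text)) for text in picked]


def parse_failure_rate(completions, parse_fn):
    """Fraction of completions where parse_fn returns None (assumed failure signal)."""
    if not completions:
        return 0.0
    failed = sum(1 for text in completions if parse_fn(text) is None)
    return failed / len(completions)


def mean_reward_by_action(steps):
    """steps: iterable of (action_type, reward). Returns {action_type: (count, mean)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for action_type, reward in steps:
        totals[action_type][0] += 1
        totals[action_type][1] += reward
    return {action: (n, total / n) for action, (n, total) in totals.items()}
```

A high parse failure rate points at the parser; an action type whose mean shaping reward is near zero points at shaping that is not firing.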

**Time to detect**: 15 min (smoke test of 100 steps)
**Time to fix**: 30-60 min

## R2 - Reward hacking (HIGH likelihood, MEDIUM impact)

**Symptom**: Reward curve climbs but qualitative outputs are gibberish/repetitive.

**Likely hacks**:
- Always ask same generic question 6 times then submit empty plan
- Submit JSON with all profile field keys but garbage values
- Output the same action token over and over

**Mitigations**:
- Duplicate-Q penalty (already in plan)
- HallucinationCheckRubric (already in plan)
- FormatCheck Gate with strict schema (already in plan)
- Add EntropyRubric: penalize repeated actions (add as a rubric component only if needed)
- Manual inspection of 10 trained outputs every 100 steps
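
One way the EntropyRubric idea could look, as a sketch only: score an episode's action sequence by its normalized Shannon entropy and subtract a penalty when it is low. The function names and the `weight` value are assumptions, not part of the existing rubric.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy of the action sequence, normalized by log(len(actions)).

    1.0 means every action is distinct; 0.0 means one action repeated throughout.
    """
    n = len(actions)
    if n <= 1:
        return 1.0  # a single action cannot be a repetition
    counts = Counter(actions)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def repetition_penalty(actions, weight=0.2):
    """Non-positive reward term; `weight` is a hypothetical tuning knob."""
    return -weight * (1.0 - action_entropy(actions))
```

Asking the same question six times gives entropy 0 and the full penalty; six distinct questions give entropy 1 and no penalty. Normalizing action strings (case, whitespace) before counting is left to the caller.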

**Time to detect**: 100 GRPO steps + manual inspection (15 min)
**Time to fix**: 30 min (add penalty component)

## R3 - Colab session times out mid-training (MEDIUM, MEDIUM)

**Symptom**: Long training run gets killed by Colab free-tier session limits.

**Mitigations**:
- Save LoRA checkpoint every 100 steps
- Always run training in resumable form (TRL supports resume from checkpoint)
- Plan training in 100-step chunks, not one mega-run
- Have a second Google account ready as backup
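
A small sketch of the resume step, assuming checkpoints land in `checkpoint-<step>` subdirectories of the output directory (the Hugging Face `Trainer` naming that TRL trainers inherit). `latest_checkpoint` is a hypothetical helper; the directory name in the usage comment is a placeholder.

```python
import re
from pathlib import Path
from typing import Optional

_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_DIR.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return str(best_path) if best_path is not None else None


# Assumed usage with a trainer configured with save_strategy="steps", save_steps=100:
#   ckpt = latest_checkpoint("outputs")          # "outputs" is a placeholder path
#   trainer.train(resume_from_checkpoint=ckpt)   # None starts a fresh run
```

Running each 100-step chunk through the same entry point keeps a fresh Colab session and a resumed one on identical code paths.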

**Time to detect**: live
**Time to fix**: 5 min (resume from last checkpoint)

## R4 - HF Space build fails (MEDIUM, HIGH)

**Symptom**: `git push space main` succeeds but Space build errors out.

**Common causes**:
- Dockerfile issues (missing deps, wrong Python version)
- pyproject.toml resolution failure
- HF Space hardware mismatch

**Mitigations**:
- Test Docker build LOCALLY before pushing: `docker build -t clarify-rl . && docker run -p 8000:8000 clarify-rl`
- Mirror EXACT Dockerfile from working SRE env (which we know builds)
- Push minimal stub Space FIRST (just FastAPI hello world), confirm builds, then layer on env
- Keep Space build logs open in browser tab while pushing

**Time to detect**: 5-10 min (HF build logs)
**Time to fix**: 15-30 min (Docker iteration)

## R5 - Validator rejects submission (LOW likelihood, FATAL impact)

**Symptom**: Auto-validator marks submission incomplete; never reaches human judges.

**Mitigations**:
- Run through every item in `docs/07-deployment.md` checklist
- 1-hour pre-deadline buffer for fixes
- Test ALL deliverable links from incognito browser
- Make sure plots are committed as files, not just in notebook outputs

**Time to detect**: post-submission (TOO LATE; must validate before submitting)
**Time to fix**: depends on what's missing

## R6 - Training takes too long on T4 (LOW, MEDIUM)

**Symptom**: 600 GRPO steps take >2 hours; eats into Day 2 schedule.

**Mitigations**:
- Use Unsloth (we already are)
- Use 4-bit quantization (we already are)
- Reduce max_seq_length to 2048 if needed
- Reduce num_generations to 2 (instead of 4)
- Stop at 300 steps if curve is good: quality > quantity
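
The 30-minute extrapolation is plain proportional arithmetic. A sketch, with hypothetical helper names and illustrative numbers (not measured timings):

```python
import math


def projected_minutes(steps_done, minutes_elapsed, total_steps):
    """Linear extrapolation of total wall-clock time from progress so far."""
    if steps_done <= 0:
        raise ValueError("need at least one completed step to extrapolate")
    return minutes_elapsed * total_steps / steps_done


def steps_within_budget(steps_done, minutes_elapsed, budget_minutes):
    """How many steps fit in the time budget at the observed pace."""
    if minutes_elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return math.floor(budget_minutes * steps_done / minutes_elapsed)


# Illustrative only: 100 steps after 30 minutes projects 600 steps to 180 minutes,
# and only 400 steps fit in a 2-hour budget.
print(projected_minutes(100, 30, 600))    # 180.0
print(steps_within_budget(100, 30, 120))  # 400
```

If the projection exceeds the 2-hour mark, apply the config reductions above or cap the run at the step count that fits.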

**Time to detect**: 30 min into training (extrapolate)
**Time to fix**: tune config, restart from checkpoint

## R7 - Rubric doesn't separate good from bad (LOW, HIGH) - ✅ VERIFIED OK

**Symptom**: Even oracle policy gets ~0.5; even random policy gets ~0.5.

**Causes**:
- Weights wrong, components average out
- FormatCheck too lenient
- HallucinationCheck too punitive

**Mitigations**:
- Run sanity policies BEFORE training:
  - Random: should get ~0.20
  - Oracle (asks all critical Qs, perfect plan): should get ~0.95
  - Blank plan: should get 0.0
- If gap is small, retune weights and component logic before training
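
A sketch of how the sanity-policy check could be automated. The function name and the `min_gap` threshold are assumptions; the document does not define what counts as a "small" gap.

```python
def rubric_separation_issues(random_score, oracle_score, blank_score,
                             min_gap=0.5, tol=1e-6):
    """Return a list of human-readable problems; an empty list means the rubric separates."""
    issues = []
    if blank_score > tol:
        issues.append(f"blank plan scored {blank_score:.2f}, expected 0.0")
    if not (blank_score <= random_score <= oracle_score):
        issues.append("expected ordering blank <= random <= oracle is violated")
    gap = oracle_score - random_score
    if gap < min_gap:  # min_gap is a hypothetical threshold, tune as needed
        issues.append(f"oracle-random gap is only {gap:.2f}")
    return issues


# Target random score (~0.20) with an oracle score of ~0.89:
print(rubric_separation_issues(random_score=0.20, oracle_score=0.89, blank_score=0.0))  # []
```

Wiring this into the sanity script turns "gap is healthy" into a check that fails loudly after a weight change.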

**Current status**: Oracle scores ~0.89 via `smoke_env.py` (FormatCheck=1.0, FieldMatch=1.0, InfoGain=1.0, Efficiency=0.5, Hallucination=0.75). Gap is healthy.

**Time to detect**: 10 min (sanity script)
**Time to fix**: 30-60 min

## R8 - Profile generator produces unsolvable scenarios (LOW, MEDIUM) - ✅ MITIGATED

**Symptom**: Even oracle can't get high score on some scenarios.

**Causes**:
- Field vocabulary too sparse → user simulator returns wrong field
- Critical fields not always present
- Request template too vague to even hint at task type

**Mitigations**:
- Validate generator: 100 random scenarios → oracle scores them → all should be ≥0.7
- Add task_type hint to every request template (subtle, e.g. "dinner" → restaurant)
- Ensure FIELD_KEYWORDS covers all profile fields
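
The generator validation and the keyword coverage check can be sketched as below. `generate_fn` and `oracle_score_fn` are hypothetical callables standing in for the scenario generator and the oracle rollout; their real signatures in `scenarios.py` may differ.

```python
import random


def failing_scenarios(generate_fn, oracle_score_fn, n=100, threshold=0.7, seed=0):
    """Generate n scenarios and return (index, score) for each one the oracle scores below threshold."""
    rng = random.Random(seed)  # fixed seed so a failing scenario can be reproduced
    failures = []
    for index in range(n):
        scenario = generate_fn(rng)
        score = oracle_score_fn(scenario)
        if score < threshold:
            failures.append((index, score))
    return failures


def uncovered_fields(profile_fields, field_keywords):
    """Profile fields that have no entry in the keyword map."""
    return sorted(set(profile_fields) - set(field_keywords))
```

An empty result from both functions is the pass condition; any returned index can be regenerated with the same seed for inspection.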

**Fix applied**: `scenarios.py` now always includes `required_keys` in the profile for medium/hard difficulty. Hard range adjusted to (6,7) to match actual field pool sizes (max 7).

**Time to detect**: 5 min (sanity check)
**Time to fix**: 15-30 min

## R9 - One team member becomes unavailable (LOW, HIGH)

**Symptom**: Anurag or Kanan can't continue (illness, technical issues, lost device).

**Mitigations**:
- Both can git-push to both remotes
- Both have HF + GitHub credentials
- Both have Colab access
- Pair-program critical sections (env, rubric)

**Time to detect**: live
**Time to fix**: depends, but project should continue

## R10 - Last-minute organizational changes (LOW, VARIABLE)

**Symptom**: Submission form changes, deadline shifts, theme reinterpretations announced.

**Mitigations**:
- Monitor Discord every 2 hours
- Both team members on Discord notifications
- Have a Plan B for each deliverable (video OR blog, not both required)

## Fallback Plans (graceful degradation)

If we run out of time:

1. **Cut difficulty levels**: Ship only the "medium" task; it still scores well on Storytelling
2. **Cut task types**: Ship 3 of 5 task types instead of all 5
3. **Cut training**: Ship the model fine-tuned with Unsloth on synthetic SFT data, skip GRPO. Worse story but still ships.
4. **Cut video**: Ship blog post only.
5. **Cut blog**: Ship video only.

The core ship is: **HF Space + Colab + plots + README**. Everything else is bonus.

## Risk Score Summary

| ID | Risk | L | I | Score |
|----|------|---|---|-------|
| R1 | Reward curve flat | H | H | 9 |
| R2 | Reward hacking | H | M | 6 |
| R3 | Colab timeout | M | M | 4 |
| R4 | HF Space build fail | M | H | 6 |
| R5 | Validator rejection | L | F | 5 |
| R6 | Training too slow | L | M | 2 |
| R7 | Rubric doesn't separate | L | H | 3 |
| R8 | Bad scenarios | L | M | 2 |
| R9 | Team member down | L | H | 3 |
| R10 | Org changes | L | V | 1 |

Columns: L = likelihood, I = impact. Levels: H = high, M = medium, L = low, F = fatal, V = variable.
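
The scores in the table are consistent with Score = likelihood × impact under the numeric mapping L=1, M=2, H=3, with F=5 and V=1 for impact. That mapping is inferred from the table rows, not stated in the document. A sketch that reproduces the scores and the ranking:

```python
# Inferred mapping (assumption): not stated anywhere in the register itself.
LIKELIHOOD = {"L": 1, "M": 2, "H": 3}
IMPACT = {"L": 1, "M": 2, "H": 3, "F": 5, "V": 1}

# (id, likelihood, impact) copied from the summary table.
RISKS = [
    ("R1", "H", "H"), ("R2", "H", "M"), ("R3", "M", "M"), ("R4", "M", "H"),
    ("R5", "L", "F"), ("R6", "L", "M"), ("R7", "L", "H"), ("R8", "L", "M"),
    ("R9", "L", "H"), ("R10", "L", "V"),
]


def score(likelihood, impact):
    """Risk score as likelihood level times impact level."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


def ranked(risks):
    """Highest score first; ties keep their table order because sorted() is stable."""
    return sorted(risks, key=lambda r: score(r[1], r[2]), reverse=True)


print([risk_id for risk_id, _, _ in ranked(RISKS)[:3]])  # ['R1', 'R2', 'R4']
```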

**Top 3 to actively mitigate during build**: R1, R2, R4.