
09 — Risk Register & Mitigations

Ranked by likelihood × impact. Top of list = address first.

R1 — Reward curve goes flat (HIGH likelihood, HIGH impact)

Symptom: After 100 GRPO steps, mean episode reward stays at baseline (~0.25).

Causes:

  • Reward signal too sparse
  • Per-step shaping too small relative to terminal reward
  • Rollout parsing broken (model outputs gibberish, parser silently fails)
  • KL coefficient (β) too high → policy can't move

Mitigations:

  • Sanity-check rollout parser: print 5 random completions + parsed actions
  • Verify shaping rewards firing: log per-step reward by action type
  • Reduce β to 0.01
  • Increase shaping reward magnitude (×2)
  • Simplify rubric: drop InfoGain temporarily, use only FieldMatch
  • Pre-warm with SFT on synthetic "ask first" trajectories (1-2 epochs)

Time to detect: 15 min (smoke test of 100 steps). Time to fix: 30-60 min.
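The first two mitigations (parser sanity check, per-step reward logging by action type) can be scripted in a few lines. A minimal sketch, assuming a hypothetical ASK:/SUBMIT: action grammar — the real parser format may differ:

```python
import random
import re
from collections import Counter

def parse_action(completion: str):
    """Parse one rollout completion into (action_type, payload).

    Assumed grammar: 'ASK: <question>' or 'SUBMIT: <json>'. Returning
    None (instead of raising) makes silent parser failures countable.
    """
    m = re.match(r"^(ASK|SUBMIT):\s*(.+)$", completion.strip(), re.DOTALL)
    return (m.group(1).lower(), m.group(2)) if m else None

def rollout_sanity_check(completions, k=5, seed=0):
    """Print k random completions beside their parsed actions, then
    tally action types so a spike in 'unparsed' is obvious."""
    rng = random.Random(seed)
    for c in rng.sample(completions, min(k, len(completions))):
        print(repr(c[:80]), "->", parse_action(c))
    return Counter((a[0] if (a := parse_action(c)) else "unparsed")
                   for c in completions)
```

If "unparsed" dominates the tally, the flat curve is a parsing bug, not a sparsity problem — fix that before touching β or shaping weights.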

R2 — Reward hacking (HIGH likelihood, MEDIUM impact)

Symptom: Reward curve climbs but qualitative outputs are gibberish/repetitive.

Likely hacks:

  • Ask the same generic question 6 times, then submit an empty plan
  • Submit JSON with all profile field keys but garbage values
  • Output the same action token over and over

Mitigations:

  • Duplicate-Q penalty (already in plan)
  • HallucinationCheckRubric (already in plan)
  • FormatCheck Gate with strict schema (already in plan)
  • Add EntropyRubric: penalize repeated actions (component if needed)
  • Manual inspection of 10 trained outputs every 100 steps

Time to detect: 100 GRPO steps + manual inspection (15 min). Time to fix: 30 min (add a penalty component).
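The duplicate-Q penalty is the cheapest of these to wire in. A minimal sketch; the lowercase/whitespace normalization is an assumption, not the rubric's actual matching rule:

```python
def duplicate_question_penalty(questions, per_dup=0.1):
    """Negative reward that grows with each repeated question.

    'questions' is the list of ASK payloads from one episode;
    per_dup=0.1 is a placeholder magnitude to tune against the
    terminal reward scale.
    """
    seen = set()
    dups = 0
    for q in questions:
        key = " ".join(q.lower().split())  # normalize case + whitespace
        if key in seen:
            dups += 1
        seen.add(key)
    return -per_dup * dups
```

This directly targets the "same generic question 6 times" hack: the 6-repeat episode eats five penalty units before the empty plan is even scored.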

R3 — Colab session times out mid-training (MEDIUM, MEDIUM)

Symptom: Long training run gets killed by Colab free-tier session limits.

Mitigations:

  • Save LoRA checkpoint every 100 steps
  • Always run training in resumable form (TRL supports resume from checkpoint)
  • Plan training in 100-step chunks, not one mega-run
  • Have second Google account ready for backup

Time to detect: live. Time to fix: 5 min (resume from last checkpoint).
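Since TRL's trainers build on the transformers Trainer, a killed run leaves checkpoint-&lt;step&gt; directories in the output dir (transformers also ships trainer_utils.get_last_checkpoint for this). A hand-rolled equivalent, to make the resume logic explicit:

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the highest-step 'checkpoint-<N>' subdirectory, or None.

    Mirrors the transformers checkpoint naming convention so a killed
    Colab run can resume from wherever the last save_steps boundary was.
    """
    pat = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(output_dir):
        m = pat.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(output_dir, name), step
    return best
```

Resume with trainer.train(resume_from_checkpoint=latest_checkpoint("outputs")) — falling back to a fresh run when it returns None.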

R4 — HF Space build fails (MEDIUM, HIGH)

Symptom: `git push space main` succeeds but the Space build errors out.

Common causes:

  • Dockerfile issues (missing deps, wrong Python version)
  • pyproject.toml resolution failure
  • HF Space hardware mismatch

Mitigations:

  • Test Docker build LOCALLY before pushing: `docker build -t clarify-rl . && docker run -p 8000:8000 clarify-rl`
  • Mirror EXACT Dockerfile from working SRE env (which we know builds)
  • Push minimal stub Space FIRST (just FastAPI hello world), confirm builds, then layer on env
  • Keep Space build logs open in browser tab while pushing

Time to detect: 5-10 min (HF build logs). Time to fix: 15-30 min (Docker iteration).

R5 — Validator rejects submission (LOW likelihood, FATAL impact)

Symptom: Auto-validator marks submission incomplete; never reaches human judges.

Mitigations:

  • Run through every item in docs/07-deployment.md checklist
  • 1-hour pre-deadline buffer for fixes
  • Test ALL deliverable links from incognito browser
  • Make sure plots are committed as files, not just in notebook outputs

Time to detect: post-submission (TOO LATE — must validate before). Time to fix: depends on what's missing.
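The link check can be automated alongside the incognito pass. A sketch with the fetcher injected so it is testable offline; the deliverable URL list and `fetch` helper are hypothetical (in practice, urllib.request.urlopen(url).status would do):

```python
def check_links(urls, fetch):
    """Return (url, status) pairs for every link that isn't HTTP 200.

    'fetch(url)' returns a status code; injecting it keeps the check
    runnable without network access and easy to point at any HTTP lib.
    """
    return [(u, s) for u in urls if (s := fetch(u)) != 200]
```

An empty return means every deliverable link resolved — run it once more inside the 1-hour pre-deadline buffer.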

R6 — Training takes too long on T4 (LOW, MEDIUM)

Symptom: 600 GRPO steps take >2 hours; eats into Day 2 schedule.

Mitigations:

  • Use Unsloth (we already are)
  • Use 4-bit quantization (we already are)
  • Reduce max_seq_length to 2048 if needed
  • Reduce num_generations to 2 (instead of 4)
  • Stop at 300 steps if curve is good — quality > quantity

Time to detect: 30 min into training (extrapolate). Time to fix: tune config, restart from checkpoint.
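Most of these knobs live in one config object. A hedged sketch using TRL's GRPOConfig field names; the values are this project's starting points, not tuned settings, and max_seq_length itself is passed to Unsloth's model loader rather than here:

```python
from trl import GRPOConfig

# Sketch only: field names follow TRL's GRPOConfig; values are guesses.
config = GRPOConfig(
    output_dir="outputs",
    max_steps=300,                   # stop early if the curve looks good
    num_generations=2,               # down from 4 to cut rollout cost (R6)
    max_completion_length=512,
    beta=0.01,                       # KL coefficient (see R1)
    save_steps=100,                  # checkpoint every chunk (see R3)
    per_device_train_batch_size=2,   # must be divisible by num_generations
    learning_rate=5e-6,
)
```

Dropping num_generations from 4 to 2 roughly halves generation time per step, at the cost of a noisier GRPO advantage estimate — acceptable if the curve still separates from baseline.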

R7 — Rubric doesn't separate good from bad (LOW, HIGH) — ✅ VERIFIED OK

Symptom: Oracle policy scores ~0.5 and random policy also scores ~0.5; the rubric cannot separate them.

Causes:

  • Weights wrong, components average out
  • FormatCheck too lenient
  • HallucinationCheck too punitive

Mitigations:

  • Run sanity policies BEFORE training:
    • Random: should get ~0.20
    • Oracle (asks all critical Qs, perfect plan): should get ~0.95
    • Blank plan: should get 0.0
  • If gap is small, retune weights and component logic before training

Current status: Oracle scores ~0.89 via smoke_env.py (FormatCheck=1.0, FieldMatch=1.0, InfoGain=1.0, Efficiency=0.5, Hallucination=0.75). Gap is healthy.

Time to detect: 10 min (sanity script). Time to fix: 30-60 min.
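The sanity-policy check reduces to a tiny harness. A sketch in which score_fn stands in for running a named policy through the env + rubric (e.g. via smoke_env.py); the 0.5 minimum oracle-random gap is an assumed threshold, not the project's official one:

```python
def reward_gap(score_fn, policies, min_gap=0.5):
    """Score each sanity policy; report the oracle-random gap and
    whether it clears min_gap. 'score_fn(name)' runs that policy
    end-to-end and returns its mean episode reward."""
    scores = {name: score_fn(name) for name in policies}
    gap = scores["oracle"] - scores["random"]
    return scores, gap, gap >= min_gap
```

With the measured values above (oracle ~0.89, random ~0.20) the gap is ~0.69, comfortably over the threshold — matching the "gap is healthy" status.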

R8 — Profile generator produces unsolvable scenarios (LOW, MEDIUM) — ✅ MITIGATED

Symptom: Even oracle can't get high score on some scenarios.

Causes:

  • Field vocabulary too sparse → user simulator returns wrong field
  • Critical fields not always present
  • Request template too vague to even hint at task type

Mitigations:

  • Validate generator: 100 random scenarios → oracle scores them → all should be ≥0.7
  • Add task_type hint to every request template (subtle, e.g. "dinner" → restaurant)
  • Ensure FIELD_KEYWORDS covers all profile fields

Fix applied: scenarios.py now always includes required_keys in the profile for medium/hard difficulty. Hard range adjusted to (6,7) to match actual field pool sizes (max 7).

Time to detect: 5 min (sanity check). Time to fix: 15-30 min.
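The generator validation in the first mitigation is a short loop. A sketch in which generate and oracle_score stand in for scenarios.py and the oracle policy; the 0.7 floor is the one stated above:

```python
import random

def validate_generator(generate, oracle_score, n=100, floor=0.7, seed=0):
    """Generate n scenarios and flag any the oracle can't solve well.

    'generate(rng)' yields one scenario; 'oracle_score(scenario)' runs
    the oracle policy on it. Returns the failing (scenario, score)
    pairs — an empty list means every sampled scenario is solvable.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(n):
        scenario = generate(rng)
        score = oracle_score(scenario)
        if score < floor:
            failures.append((scenario, score))
    return failures
```

Any non-empty result pinpoints concrete unsolvable scenarios to inspect — which is how the required_keys / hard-range bug above would surface in 5 minutes.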

R9 — One team member becomes unavailable (LOW, HIGH)

Symptom: Anurag or Kanan can't continue (illness, technical issues, lost device).

Mitigations:

  • Both can git-push to both remotes
  • Both have HF + GitHub credentials
  • Both have Colab access
  • Pair-program critical sections (env, rubric)

Time to detect: live. Time to fix: depends, but the project should continue.

R10 — Last-minute organizational changes (LOW, VARIABLE)

Symptom: Submission form changes, deadline shifts, theme reinterpretations announced.

Mitigations:

  • Monitor Discord every 2 hours
  • Both team members on Discord notifications
  • Have a Plan B for each deliverable (video OR blog, not both required)

Fallback Plans (graceful degradation)

If we run out of time:

  1. Cut difficulty levels: Ship only "medium" task — still scores well on Storytelling
  2. Cut task types: Ship 3 of 5 task types instead of all 5
  3. Cut training: Use Unsloth pre-trained on synthetic SFT data, skip GRPO. Worse story but still ships.
  4. Cut video: Ship blog post only.
  5. Cut blog: Ship video only.

The core ship is: HF Space + Colab + plots + README. Everything else is bonus.

Risk Score Summary

| ID  | Risk                    | L | I | Score |
|-----|-------------------------|---|---|-------|
| R1  | Reward curve flat       | H | H | 9     |
| R2  | Reward hacking          | H | M | 6     |
| R3  | Colab timeout           | M | M | 4     |
| R4  | HF Space build fail     | M | H | 6     |
| R5  | Validator rejection     | L | F | 5     |
| R6  | Training too slow       | L | M | 2     |
| R7  | Rubric doesn't separate | L | H | 3     |
| R8  | Bad scenarios           | L | M | 2     |
| R9  | Team member down        | L | H | 3     |
| R10 | Org changes             | L | V | 1     |

L=likelihood, I=impact, F=fatal.

Top 3 to actively mitigate during build: R1, R2, R4.