# 09 — Risk Register & Mitigations

Risks are scored by likelihood × impact (see the Risk Score Summary at the end). Address the highest-scoring risks first.

## R1 — Reward curve goes flat (HIGH likelihood, HIGH impact)

**Symptom**: After 100 GRPO steps, mean episode reward stays at baseline (~0.25).

**Causes**:
- Reward signal too sparse
- Per-step shaping too small relative to terminal reward
- Rollout parsing broken (model outputs gibberish, parser silently fails)
- KL coefficient (β) too high → policy can't move

**Mitigations**:
- Sanity-check rollout parser: print 5 random completions + parsed actions
- Verify shaping rewards firing: log per-step reward by action type
- Reduce β to 0.01
- Increase shaping reward magnitude (×2)
- Simplify rubric: drop InfoGain temporarily, use only FieldMatch
- Pre-warm with SFT on synthetic "ask first" trajectories (1-2 epochs)

**Time to detect**: 15 min (smoke test of 100 steps)
**Time to fix**: 30-60 min
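
The first two mitigations can be scripted before touching any hyperparameters. A minimal sketch, assuming each completion is meant to contain one JSON object with an `action` key; that format, `parse_action`, `parser_report`, and `mean_reward_by_action` are hypothetical names, not the project's actual parser:

```python
import json
import random
from collections import defaultdict


def parse_action(completion):
    """Return the parsed action dict, or None if parsing fails.

    Assumption: one JSON object per completion with an "action" key.
    """
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and "action" in obj else None


def parser_report(completions, k=5, seed=0):
    """Print k random completions with their parsed actions; return the failure rate."""
    if not completions:
        return 0.0
    rng = random.Random(seed)
    for text in rng.sample(completions, min(k, len(completions))):
        print(repr(text[:80]), "->", parse_action(text))
    failures = sum(parse_action(c) is None for c in completions)
    return failures / len(completions)


def mean_reward_by_action(steps):
    """steps: iterable of (action_type, reward) pairs -> {action_type: mean reward}."""
    totals = defaultdict(lambda: [0.0, 0])
    for action_type, reward in steps:
        totals[action_type][0] += reward
        totals[action_type][1] += 1
    return {a: s / n for a, (s, n) in totals.items()}
```

A failure rate well above zero suggests the parser, not the policy, is why reward stays flat. A per-action mean of exactly 0.0 for an action type that should be shaped suggests the shaping reward is not firing.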

## R2 — Reward hacking (HIGH likelihood, MEDIUM impact)

**Symptom**: Reward curve climbs but qualitative outputs are gibberish/repetitive.

**Likely hacks**:
- Always ask same generic question 6 times then submit empty plan
- Submit JSON with all profile field keys but garbage values
- Output the same action token over and over

**Mitigations**:
- Duplicate-Q penalty (already in plan)
- HallucinationCheckRubric (already in plan)
- FormatCheck Gate with strict schema (already in plan)
- Add an EntropyRubric component if needed: penalize repeated actions
- Manual inspection of 10 trained outputs every 100 steps

**Time to detect**: 100 GRPO steps + manual inspection (15 min)
**Time to fix**: 30 min (add penalty component)
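
One way the EntropyRubric could work, as a sketch only: score each episode by the normalized Shannon entropy of its actions, so "same question 6 times" scores 0.0 and an episode with no repeats scores 1.0. The function name and the text normalization are assumptions; the real component may use a different signal.

```python
import math
from collections import Counter


def entropy_score(actions):
    """Normalized Shannon entropy of an episode's actions, in [0, 1].

    1.0 = every action distinct, 0.0 = the same action repeated throughout.
    Episodes with fewer than two actions score 1.0 (nothing to repeat).
    """
    n = len(actions)
    if n < 2:
        return 1.0
    # Normalize case and whitespace so trivial rewordings count as duplicates.
    counts = Counter(" ".join(a.lower().split()) for a in actions)
    if len(counts) == 1:
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)
```

Because the score is already in [0, 1], it can be added to the rubric as one more weighted component without rescaling.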

## R3 — Colab session times out mid-training (MEDIUM, MEDIUM)

**Symptom**: Long training run gets killed by Colab free-tier session limits.

**Mitigations**:
- Save LoRA checkpoint every 100 steps
- Always run training in resumable form (TRL supports resume from checkpoint)
- Plan training in 100-step chunks, not one mega-run
- Have second Google account ready for backup

**Time to detect**: live
**Time to fix**: 5 min (resume from last checkpoint)
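
Resuming takes 5 minutes only if finding the last checkpoint is automatic. A minimal sketch, assuming checkpoints land in `checkpoint-<step>` subfolders of the output directory (the Hugging Face `Trainer` naming convention); `latest_checkpoint` is a hypothetical helper and the `outputs` path is a placeholder:

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the highest-numbered `checkpoint-<step>` directory, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    found = []
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    if not found:
        return None
    # Compare step numbers numerically so checkpoint-200 beats checkpoint-90.
    return str(max(found, key=lambda item: item[0])[1])


# Usage with an already-built trainer (placeholder name `trainer`):
#   trainer.train(resume_from_checkpoint=latest_checkpoint("outputs"))
# Passing None starts a fresh run, so the same cell works for the first chunk too.
```

On Colab free tier the output directory should live somewhere that survives the session (for example a mounted Drive folder), otherwise there is nothing to resume from.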

## R4 — HF Space build fails (MEDIUM, HIGH)

**Symptom**: `git push space main` succeeds but Space build errors out.

**Common causes**:
- Dockerfile issues (missing deps, wrong Python version)
- pyproject.toml resolution failure
- HF Space hardware mismatch

**Mitigations**:
- Test Docker build LOCALLY before pushing: `docker build -t clarify-rl . && docker run -p 8000:8000 clarify-rl`
- Mirror EXACT Dockerfile from working SRE env (which we know builds)
- Push minimal stub Space FIRST (just FastAPI hello world), confirm builds, then layer on env
- Keep Space build logs open in browser tab while pushing

**Time to detect**: 5-10 min (HF build logs)
**Time to fix**: 15-30 min (Docker iteration)

## R5 — Validator rejects submission (LOW likelihood, FATAL impact)

**Symptom**: Auto-validator marks submission incomplete; never reaches human judges.

**Mitigations**:
- Run through every item in `docs/07-deployment.md` checklist
- 1-hour pre-deadline buffer for fixes
- Test ALL deliverable links from incognito browser
- Make sure plots are committed as files, not just in notebook outputs

**Time to detect**: post-submission (TOO LATE — must validate before)
**Time to fix**: depends on what's missing

## R6 — Training takes too long on T4 (LOW, MEDIUM)

**Symptom**: 600 GRPO steps take >2 hours; eats into Day 2 schedule.

**Mitigations**:
- Use Unsloth (we already are)
- Use 4-bit quantization (we already are)
- Reduce max_seq_length to 2048 if needed
- Reduce num_generations to 2 (instead of 4)
- Stop at 300 steps if curve is good — quality > quantity

**Time to detect**: 30 min into training (extrapolate)
**Time to fix**: tune config, restart from checkpoint
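
The "extrapolate at 30 min" check is plain arithmetic. A sketch, assuming step time stays roughly constant (it ignores warm-up and checkpoint-saving overhead); both function names are hypothetical:

```python
def projected_minutes(steps_done, elapsed_minutes, total_steps):
    """Linearly extrapolate total run time from progress so far."""
    if steps_done <= 0:
        raise ValueError("need at least one completed step to extrapolate")
    return elapsed_minutes * total_steps / steps_done


def steps_within_budget(steps_done, elapsed_minutes, budget_minutes):
    """How many steps fit in the time budget at the observed pace."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return int(steps_done * budget_minutes / elapsed_minutes)


# Example: 120 steps done after 30 minutes, target 600 steps, budget 2 hours.
print(projected_minutes(120, 30, 600))    # 150.0 minutes, over the 120-minute budget
print(steps_within_budget(120, 30, 120))  # 480 steps fit in the budget
```

In that example the run would overshoot, so the choice is between the config cuts above and stopping early.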

## R7 — Rubric doesn't separate good from bad (LOW, HIGH) — ✅ VERIFIED OK

**Symptom**: Even oracle policy gets ~0.5; even random policy gets ~0.5.

**Causes**:
- Weights wrong, components average out
- FormatCheck too lenient
- HallucinationCheck too punitive

**Mitigations**:
- Run sanity policies BEFORE training:
  - Random: should get ~0.20
  - Oracle (asks all critical Qs, perfect plan): should get ~0.95
  - Blank plan: should get 0.0
- If gap is small, retune weights and component logic before training

**Current status**: Oracle scores ~0.89 via `smoke_env.py` (FormatCheck=1.0, FieldMatch=1.0, InfoGain=1.0, Efficiency=0.5, Hallucination=0.75). That is below the ~0.95 target but well above the ~0.20 expected for a random policy, so the gap is healthy.

**Time to detect**: 10 min (sanity script)
**Time to fix**: 30-60 min
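
The sanity-policy check can be turned into a pass/fail gate. A sketch, assuming the three policy scores have already been computed; `check_separation` is a hypothetical helper and the `min_gap=0.5` threshold is an illustrative choice, not a value from the plan:

```python
def check_separation(scores, min_gap=0.5):
    """scores: {"random": float, "oracle": float, "blank": float}.

    Returns a list of problems; an empty list means the rubric separates policies.
    """
    problems = []
    if scores["blank"] > 0.0:
        problems.append("blank plan scored above 0.0")
    if scores["oracle"] - scores["random"] < min_gap:
        problems.append("oracle-random gap below min_gap")
    if not scores["blank"] <= scores["random"] <= scores["oracle"]:
        problems.append("expected blank <= random <= oracle")
    return problems


# Using the oracle score reported above and the target values for the other two:
print(check_separation({"random": 0.20, "oracle": 0.89, "blank": 0.0}))  # []
```

Running this before every training launch catches a rubric regression before it costs GPU time.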

## R8 — Profile generator produces unsolvable scenarios (LOW, MEDIUM) — ✅ MITIGATED

**Symptom**: Even oracle can't get high score on some scenarios.

**Causes**:
- Field vocabulary too sparse → user simulator returns wrong field
- Critical fields not always present
- Request template too vague to even hint at task type

**Mitigations**:
- Validate generator: 100 random scenarios → oracle scores them → all should be ≥0.7
- Add task_type hint to every request template (subtle, e.g. "dinner" → restaurant)
- Ensure FIELD_KEYWORDS covers all profile fields

**Fix applied**: `scenarios.py` now always includes `required_keys` in the profile for medium/hard difficulty. Hard range adjusted to (6,7) to match actual field pool sizes (max 7).

**Time to detect**: 5 min (sanity check)
**Time to fix**: 15-30 min
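
The generator validation and the keyword-coverage check can be sketched as below. `generate_scenario` and `oracle_score` are passed in as callables because their real signatures are not shown here; the `rng` argument and the `{field: [keywords]}` shape for `FIELD_KEYWORDS` are assumptions.

```python
import random


def validate_generator(generate_scenario, oracle_score, n=100, threshold=0.7, seed=0):
    """Return (index, score) for every scenario the oracle scores below threshold."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for i in range(n):
        scenario = generate_scenario(rng)
        score = oracle_score(scenario)
        if score < threshold:
            failures.append((i, score))
    return failures


def uncovered_fields(profile_fields, field_keywords):
    """Profile fields with no keywords, i.e. fields the user simulator cannot match."""
    return sorted(f for f in profile_fields if not field_keywords.get(f))
```

Both functions return lists, so an empty result is a pass and a non-empty one points at exactly which scenarios or fields to inspect.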

## R9 — One team member becomes unavailable (LOW, HIGH)

**Symptom**: Anurag or Kanan can't continue (illness, technical issues, lost device).

**Mitigations**:
- Both can git-push to both remotes
- Both have HF + GitHub credentials
- Both have Colab access
- Pair-program critical sections (env, rubric)

**Time to detect**: live
**Time to fix**: depends, but project should continue

## R10 — Last-minute organizational changes (LOW, VARIABLE)

**Symptom**: Submission form changes, deadline shifts, theme reinterpretations announced.

**Mitigations**:
- Monitor Discord every 2 hours
- Both team members on Discord notifications
- Have a Plan B for each deliverable (video OR blog, not both required)

## Fallback Plans (graceful degradation)

If we run out of time:

1. **Cut difficulty levels**: Ship only "medium" task — still scores well on Storytelling
2. **Cut task types**: Ship 3 of 5 task types instead of all 5
3. **Cut training**: Ship the model after SFT on synthetic data (trained with Unsloth) and skip GRPO. Worse story, but it still ships.
4. **Cut video**: Ship blog post only.
5. **Cut blog**: Ship video only.

The core ship is: **HF Space + Colab + plots + README**. Everything else is bonus.

## Risk Score Summary

| ID | Risk | L | I | Score |
|----|------|---|---|-------|
| R1 | Reward curve flat | H | H | 9 |
| R2 | Reward hacking | H | M | 6 |
| R3 | Colab timeout | M | M | 4 |
| R4 | HF Space build fail | M | H | 6 |
| R5 | Validator rejection | L | F | 5 |
| R6 | Training too slow | L | M | 2 |
| R7 | Rubric doesn't separate | L | H | 3 |
| R8 | Bad scenarios | L | M | 2 |
| R9 | Team member down | L | H | 3 |
| R10 | Org changes | L | V | 1 |

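Columns: L = likelihood, I = impact. Levels: L = low (1), M = medium (2), H = high (3), F = fatal (5), V = variable (1). Score = likelihood × impact.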
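
The scores in the table can be reproduced with a few lines. A sketch, assuming the numeric levels low=1, medium=2, high=3, with fatal=5 and variable=1 inferred from the R5 and R10 rows rather than stated in the plan:

```python
# Numeric levels; F and V are inferred from the table (1 x F = 5, 1 x V = 1).
LEVELS = {"L": 1, "M": 2, "H": 3, "F": 5, "V": 1}

# (id, likelihood, impact) copied from the summary table.
RISKS = [
    ("R1", "H", "H"), ("R2", "H", "M"), ("R3", "M", "M"), ("R4", "M", "H"),
    ("R5", "L", "F"), ("R6", "L", "M"), ("R7", "L", "H"), ("R8", "L", "M"),
    ("R9", "L", "H"), ("R10", "L", "V"),
]


def risk_score(likelihood, impact):
    """Score = likelihood level times impact level."""
    return LEVELS[likelihood] * LEVELS[impact]


def ranked(risks):
    """Risks sorted by descending score; ties keep their original order."""
    return sorted(risks, key=lambda r: -risk_score(r[1], r[2]))


print([r[0] for r in ranked(RISKS)[:3]])  # ['R1', 'R2', 'R4']
```

Keeping the register as data makes it cheap to re-rank when a likelihood or impact changes mid-build.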

**Top 3 to actively mitigate during build**: R1, R2, R4.