Anurag Agarwal
ClarifyRL: initial HF Space deploy
2414d31

08 - Hour-by-hour Timeline

48-hour sprint. Hard deadline: Apr 26, 5:00 PM IST.

Day 1 - Saturday, Apr 25

7:00–9:00 AM - Reporting & Opening (mandatory attendance)

  • 7:00 AM check-in at Scaler Electronic City campus
  • Breakfast (until 9 AM)
  • Opening ceremony: listen for Cursor credits, HF credits ($30?), and final theme clarifications

9:00–10:00 AM - Setup

  • Both team members:
    • Verify the Colab account works and a T4 is reachable
    • Verify the HF account and redeem the on-site credits
    • Run hf auth login + hf jobs hardware to confirm t4-small access works (per opening deck pp. 78–80)
    • Verify GitHub access
  • Anurag: create empty clarify-rl HF Space (CPU basic, public, Docker)
  • Kanan: create empty clarify-rl GitHub repo

10:00 AM–12:00 PM - Phase 1: Scaffold ✅ DONE

  • All scaffold files in clarify-rl/:
    • openenv.yaml, pyproject.toml, Dockerfile, __init__.py, models.py, client.py, server/__init__.py
  • ✅ Done. Move directly to Phase 2.

12:00–2:30 PM - Phase 2: Core environment (Anurag-led) ✅ DONE

  • server/scenarios.py: procedural scenario generation
  • server/user_simulator.py: rule-based Q→A
  • server/clarify_environment.py: MCPEnvironment subclass with 3 tools + step_async
  • server/app.py: FastAPI app via create_app(...)
  • Local smoke test: uvicorn server.app:app + all 3 smoke scripts pass
  • Lunch can be eaten while building (ladoos, biryani provided)
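The rule-based Q→A idea behind server/user_simulator.py can be sketched in a few lines. This is not the repo's code: the Scenario shape, the KEYWORDS table and answer() are hypothetical, assuming a scenario carries a handful of hidden facts that the agent has to ask about before planning.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical scenario: a vague request plus hidden facts the agent must elicit."""

    request: str
    hidden: dict = field(default_factory=dict)


# Hypothetical keyword table: question word -> hidden-field name.
KEYWORDS = {
    "budget": "budget",
    "cost": "budget",
    "when": "date",
    "date": "date",
    "where": "location",
    "location": "location",
}

FALLBACK = "Not sure, whatever you think is best."


def answer(scenario: Scenario, question: str) -> str:
    """Deterministic reply: the first keyword that maps to a known hidden field wins."""
    q = question.lower()
    for word, key in KEYWORDS.items():
        if word in q and key in scenario.hidden:
            return scenario.hidden[key]
    return FALLBACK  # vague or off-topic questions reveal nothing
```

Keeping the simulator rule-based means the same question always gets the same answer, so reward differences during training come from the policy rather than from simulator noise.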

2:30–4:00 PM - Phase 3: Rubrics + per-step grader (Anurag-led) ✅ DONE

  • server/rubrics.py: 5 Rubric subclasses, composed via Sequential + Gate + WeightedSum
  • server/grader.py: per-step shaping reward + plan parser
  • Wire rubric into ClarifyEnvironment.__init__
  • Verified: oracle policy → ~0.89, random → low, blank plan → 0
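A minimal sketch of the Gate + WeightedSum composition, written in plain Python rather than with OpenEnv's rubric classes, whose exact signatures are not reproduced here. The component functions (coverage, efficiency), the 0.7/0.3 weights and the trajectory dict are all hypothetical; Sequential is left out because this plan does not spell out its semantics. The gate mirrors the "blank plan → 0" check.

```python
from typing import Callable

Trajectory = dict  # hypothetical: {"hidden": {...}, "plan": str, "questions": [str, ...]}
Score = Callable[[Trajectory], float]


class Gate:
    """Score 0.0 unless `check` passes; otherwise defer to the inner rubric."""

    def __init__(self, check: Callable[[Trajectory], bool], inner: Score):
        self.check = check
        self.inner = inner

    def __call__(self, traj: Trajectory) -> float:
        return self.inner(traj) if self.check(traj) else 0.0


class WeightedSum:
    """Weighted average of component scores; weights are normalised to sum to 1."""

    def __init__(self, parts: list):
        total = sum(w for _, w in parts)
        self.parts = [(fn, w / total) for fn, w in parts]

    def __call__(self, traj: Trajectory) -> float:
        return sum(w * fn(traj) for fn, w in self.parts)


def coverage(traj: Trajectory) -> float:
    """Hypothetical component: fraction of hidden facts that made it into the plan."""
    if not traj["hidden"]:
        return 0.0
    plan = traj["plan"].lower()
    return sum(v.lower() in plan for v in traj["hidden"].values()) / len(traj["hidden"])


def efficiency(traj: Trajectory) -> float:
    """Hypothetical component: full credit up to 3 questions, minus 0.2 per extra one."""
    extra = max(0, len(traj["questions"]) - 3)
    return max(0.0, 1.0 - 0.2 * extra)


rubric = Gate(
    check=lambda t: bool(t["plan"].strip()),  # blank plan -> 0
    inner=WeightedSum([(coverage, 0.7), (efficiency, 0.3)]),
)
```

Putting the gate outermost means no combination of component scores can rescue an empty plan, which closes off the cheapest reward hack.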

4:00–5:30 PM - Phase 4: Deploy + baseline eval ← NEXT

  • Push to HF Space, verify public access from an incognito window
  • inference.py: baseline eval script (Qwen2.5-1.5B-Instruct via HF Inference API)
  • Generate scenarios/eval_held_out.json (frozen seeds 10000–10099, i.e. 100 scenarios)
  • Run baseline on those 100 held-out scenarios → save outputs/baseline.json
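Freezing the held-out set only works if every scenario is a pure function of its seed. A sketch under that assumption: generate_scenario() and write_held_out() are hypothetical stand-ins for whatever server/scenarios.py exposes, and the topic/budget fields are illustrative only.

```python
import json
import random
from pathlib import Path

# Hypothetical stand-in for server/scenarios.py: all that matters is that the
# output depends only on the seed.
TOPICS = ["dinner party", "weekend trip", "birthday gift", "team offsite"]
BUDGETS = ["2000 INR", "5000 INR", "10000 INR"]


def generate_scenario(seed: int) -> dict:
    rng = random.Random(seed)  # private RNG: no dependence on global random state
    return {
        "seed": seed,
        "topic": rng.choice(TOPICS),
        "hidden": {"budget": rng.choice(BUDGETS)},
    }


def write_held_out(path: Path, start: int = 10000, count: int = 100) -> list:
    """Freeze seeds start .. start+count-1 (10000-10099 by default) into a JSON file."""
    scenarios = [generate_scenario(seed) for seed in range(start, start + count)]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scenarios, indent=2))
    return scenarios
```

Calling write_held_out(Path("scenarios/eval_held_out.json")) twice produces the same file, so baseline.json and trained.json are scored on identical scenarios, provided training draws its seeds outside 10000–10099.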

5:30–7:00 PM - Phase 5: Training notebook (Kanan-led, in parallel)

While Anurag builds the env, Kanan forks openenv_wordle_grpo.ipynb (official TRL example, opening deck p. 73) and adapts it.

  • training/train_grpo.ipynb: fork of openenv_wordle_grpo.ipynb → swap the env client to ClarifyClient and the reward fn to ours
  • Verify on Colab T4 first; switch to HF Jobs t4-small once a smoke run passes
  • Mock rollouts against a dummy env until the real env is ready
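If the forked notebook scores completions through TRL's reward-function interface, "swap reward fn to ours" means matching that shape: GRPOTrainer calls each function in reward_funcs with prompts, completions and any extra dataset columns as keyword arguments, and expects one float per completion. A sketch assuming completions wrap the final answer in <plan>...</plan> tags and the dataset carries a hidden column of facts (both hypothetical); if the notebook instead takes rewards straight from the env rollout, only the signature idea carries over.

```python
import re

# Hypothetical output format: the model wraps its final answer in <plan>...</plan>.
PLAN_RE = re.compile(r"<plan>(.*?)</plan>", re.DOTALL)


def completion_text(completion) -> str:
    """Plain string for standard datasets; list of chat messages for conversational ones."""
    if isinstance(completion, str):
        return completion
    return completion[-1]["content"]


def clarify_reward(prompts, completions, hidden, **kwargs) -> list:
    """One float per completion, in TRL's reward-function shape.

    `hidden` is a hypothetical dataset column (one dict of facts per prompt) that the
    trainer forwards as a keyword argument; any other columns land in **kwargs.
    """
    rewards = []
    for completion, facts in zip(completions, hidden):
        match = PLAN_RE.search(completion_text(completion))
        plan = match.group(1).strip().lower() if match else ""
        if not plan or not facts:
            rewards.append(0.0)  # no parseable plan (or nothing to check) earns nothing
            continue
        hits = sum(str(value).lower() in plan for value in facts.values())
        rewards.append(hits / len(facts))
    return rewards
```

It would be wired in as reward_funcs=[clarify_reward]; during the mock phase the same function can be exercised against dummy completions before the real env exists.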

7:00–9:00 PM - Phase 6: First real training run

  • Dinner overlap (gulab jamun expected)
  • Run 100 GRPO steps (Colab T4 first, HF Jobs t4-small once stable)
  • Inspect the reward curve: it should trend up
  • Eyeball 4 random rollouts (sample-inspection cadence per 06-training-plan.md). Reward going up while generations degenerate = reward hacking: stop & fix
  • If flat: debug (likely rubric or rollout parsing)
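The two checks in this phase can be partly scripted. A sketch assuming per-step rewards and raw completion strings are already available as Python lists (the logging format is hypothetical). The unique-token ratio is only a crude flag for repetitive output; it does not replace actually reading the samples.

```python
import random
from statistics import mean


def reward_trend(rewards: list, window: int = 10) -> float:
    """Mean of the last `window` logged rewards minus mean of the first `window`."""
    if len(rewards) < 2 * window:
        raise ValueError("need at least 2 * window logged rewards")
    return mean(rewards[-window:]) - mean(rewards[:window])


def unique_token_ratio(text: str) -> float:
    """Crude degeneracy flag: repetitive generations reuse a handful of tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sample_for_inspection(completions: list, k: int = 4, seed: int = 0) -> list:
    """Pick k random completions (4 per check in this plan) and tag each with its ratio."""
    rng = random.Random(seed)
    picked = rng.sample(completions, min(k, len(completions)))
    return [(text, unique_token_ratio(text)) for text in picked]
```

A positive trend paired with low ratios on the sampled rollouts is the "reward up, generations degenerate" pattern worth stopping for.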

9:00–11:00 PM - Phase 7: Iterate

  • If first run looked good: continue with 300 more steps
  • If reward hacking observed: tighten rubric, restart
  • Generate first preliminary plots

11:00 PM–1:00 AM - Phase 8: Overnight prep

  • Start the LONG training run (600 steps)
  • One person monitors checkpoints, the other sleeps in shifts
  • Begin README.md draft (use 07-deployment.md as template)

Day 2 - Sunday, Apr 26

1:00–7:00 AM - Sleep shifts + monitoring

  • Long run completes (~90 min on T4) → if Colab kills the session, resume from the latest checkpoint
  • Both should get at least 4–5 hours of rest
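Resuming only helps if the checkpoints outlive the Colab VM, so output_dir is assumed to point somewhere persistent such as a mounted Drive folder. transformers' Trainer, which TRL's GRPOTrainer builds on, writes checkpoint-<step> folders by default; a stdlib sketch for finding the newest one (latest_checkpoint() is a hypothetical helper):

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumes the default transformers naming: <output_dir>/checkpoint-<global_step>
CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str | Path) -> Path | None:
    """Newest checkpoint folder by step number (not by name or mtime), or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    steps = []
    for child in root.iterdir():
        match = CKPT_RE.match(child.name)
        if match and child.is_dir():
            steps.append((int(match.group(1)), child))
    return max(steps)[1] if steps else None
```

Comparing step numbers matters because checkpoint-1000 sorts before checkpoint-300 as a string. The result can be passed as trainer.train(resume_from_checkpoint=str(ckpt)); passing True instead lets the Trainer find the latest checkpoint in output_dir itself, and transformers.trainer_utils.get_last_checkpoint performs the same lookup.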

7:00–9:00 AM - Phase 9: Final eval + plots

  • Run trained model on held-out 100 scenarios
  • Save outputs/trained.json
  • Generate ALL 5 plots: reward_curve, loss_curve, per_task_bars, per_component, eval_dist
  • Commit plots/*.png to repo
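The headline metric, the result table and the per_component plot all come down to comparing baseline.json with trained.json on the same footing. A sketch assuming each file is a JSON list of per-episode records with a total reward and a components dict; that schema is hypothetical and the real eval output may differ.

```python
from __future__ import annotations

import json
from collections import defaultdict
from pathlib import Path
from statistics import mean

# Hypothetical record schema, one per eval episode:
#   {"task": "...", "reward": 0.42, "components": {"coverage": 0.5, ...}}


def component_means(records: list[dict]) -> dict[str, float]:
    """Mean total reward plus the mean of every reward component seen."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        buckets["total"].append(rec["reward"])
        for name, value in rec.get("components", {}).items():
            buckets[name].append(value)
    return {name: mean(values) for name, values in buckets.items()}


def compare(baseline_path: str | Path, trained_path: str | Path) -> dict[str, tuple[float, float, float]]:
    """{metric: (baseline_mean, trained_mean, delta)} for metrics present in both files."""
    base = component_means(json.loads(Path(baseline_path).read_text()))
    trained = component_means(json.loads(Path(trained_path).read_text()))
    return {k: (base[k], trained[k], trained[k] - base[k]) for k in sorted(base.keys() & trained.keys())}
```

Usage would be compare("outputs/baseline.json", "outputs/trained.json"); only metrics present in both files are reported, so a renamed component shows up as missing rather than as a bogus delta.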

9:00–11:00 AM - Phase 10: Demo content (Kanan-led)

  • Record demo screen captures (baseline vs trained traces)
  • Edit a 2-min video (OBS → ffmpeg → upload to YouTube as unlisted)
  • OR write blog.md as an HF blog post (parallel option)
  • Ideally both: videos stand out in pitches

11:00 AM–1:00 PM - Phase 11: Polish + README

  • Finalize root README.md with:
    • Pitch + headline metric
    • Embedded plots with captions
    • Result table
    • Architecture diagram (mermaid or ASCII)
    • All 4 deliverable links
  • Polish blog.md if writing one
  • Final commit

1:00–3:00 PM - Phase 12: Final validation sweep

  • Run through docs/07-deployment.md validation checklist (every line)
  • Test HF Space from incognito
  • Re-run Colab notebook end-to-end on fresh runtime
  • Verify all README links open correctly
  • Fix any breakage
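Part of the link check can be scripted before the manual pass. A stdlib sketch: extract_urls() and check_url() are hypothetical helpers, and a 200 response does not prove a page is public (a login wall can also answer 200), so the incognito check still applies.

```python
from __future__ import annotations

import re
import urllib.error
import urllib.request

# Stops at whitespace and at characters that usually close a markdown link or HTML attribute.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def extract_urls(text: str) -> list[str]:
    """All http(s) URLs in the text, trailing punctuation trimmed, de-duplicated in order."""
    seen: dict[str, None] = {}
    for url in URL_RE.findall(text):
        seen.setdefault(url.rstrip(".,;:"), None)
    return list(seen)


def check_url(url: str, timeout: float = 10.0) -> int | None:
    """HTTP status after redirects, or None if the host could not be reached."""
    req = urllib.request.Request(url, headers={"User-Agent": "readme-link-check"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx: the link is broken or private
    except OSError:
        return None  # DNS failure, refused connection, timeout
```

Running check_url over extract_urls(Path("README.md").read_text()) and printing anything that is not 200 gives a short list to fix before the incognito pass.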

3:00–4:30 PM - Phase 13: Submit

  • Fill Google Form with all 4 URLs
  • Double-check each URL works from incognito
  • Confirm submission

4:30–5:00 PM - BUFFER (do not skip)

  • Re-verify submission record
  • Take screenshots of submission confirmation
  • Final coffee. Breathe.

5:00 PM - DEADLINE 🚨

No more changes accepted under any circumstances.

5:00 PM onward - Closing & networking

Team Split - Bhole Chature

| Phase | Anurag | Kanan |
|---|---|---|
| 1 | scaffold review | scaffold review |
| 2 | core env (clarify_environment.py) | scenarios.py + user_simulator.py |
| 3 | rubrics.py + grader.py | unit tests for env + rubric |
| 4 | HF Space deploy | inference.py + baseline eval |
| 5 | bug-fix env on issues | train_grpo.ipynb (Colab) |
| 6 | monitor first training run | inspect outputs, sample logs |
| 7 | iterate on env/rubric | iterate on training config |
| 8 | overnight checkpoint mgmt | sleep shift A |
| 9 | sleep shift B | final eval + plots |
| 10 | README + repo polish | demo video + blog.md |
| 11 | architecture writeup | submission link audit |
| 12 | final validation sweep | final validation sweep |
| 13 | hits submit | screenshot + confirm |

Both pair up on critical-risk items: training debug and the validation sweep.

Anti-bus-factor Rule

Both team members must know how to:

  • Push to HF Space + GitHub
  • Restart the Colab runtime
  • Run the env locally
  • Run the eval script

If one is stuck/AFK at submission time, the other should be able to ship.

Energy Management

  • Snack/coffee every 2 hours
  • 5-min walks each hour
  • No alcohol/Red Bull spirals
  • Hydration > caffeine