Space: DevanshuDon/exec-assist

DevanshuDon committed 27 days ago

Commit d73a92d (verified) · 1 parent: 7c9156d

Upload 3 files

Files changed (4)

.gitattributes +1 -0
README.md +249 -0
results.json +1167 -0
training_results.png +3 -0

.gitattributes ADDED

	@@ -0,0 +1 @@


+training_results.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,249 @@

+---
+title: ExecAssist
+emoji: 📧
+colorFrom: indigo
+colorTo: blue
+sdk: docker
+app_port: 7860
+pinned: false
+license: mit
+tags:
+  - openenv
+  - rl
+  - executive-assistant
+  - grpo
+  - trl
+---
+# ExecAssist — Executive Assistant Environment
+An OpenEnv environment where AI agents learn to manage email and calendar like a human executive assistant. Built for the **OpenEnv Hackathon (Apr 2026)** under Theme #3.2 — Personalized Tasks.
+**Live environment:** https://devanshudon-exec-assist.hf.space
+**Mini-blog:** [`blog_post.md`](./blog_post.md)
+**Training notebook:** [`train_colab.ipynb`](./train_colab.ipynb)
+---
+## 🏆 Headline result
+Trained `Qwen2.5-0.5B-Instruct` with TRL GRPO for 3 epochs (270 steps, ~90 min on free Colab T4):
+| Task   | Baseline (untrained 0.5B) | Trained (GRPO) | Improvement |
+|--------|---------------------------|----------------|-------------|
+| Easy   | 0.345                     | **0.995**      | **+188%**   |
+| Medium | 0.227                     | **0.745**      | **+228%**   |
+| Hard   | 0.249                     | **0.737**      | **+196%**   |
+After training, **9 of the 10 easy-task samples scored a perfect 1.0** and the tenth scored 0.95, which suggests the model learned the task structure rather than surface statistics.
+![Training results: bar chart shows trained model dramatically outperforms baseline across all three tasks; line plot shows reward climbing from ~0.2 to ~0.7 over 270 steps](./training_results.png)
+*Left: mean reward by task, baseline vs. trained (n=10 samples per task). Right: GRPO batch reward over 270 training steps — first-quartile mean 0.390, last-quartile mean 0.648.*
+---
+## Problem & motivation
+Every executive's morning inbox is the same problem on repeat: read incoming requests, write a polite reply, find a calendar slot that doesn't clash, propose alternatives if it does. It's not hard for a human — it's just fiddly. And LLMs are surprisingly bad at it because the task fuses **three separate skills**:
+1. Structured calendar reasoning (no double-booking, within working hours, sensible duration)
+2. Professional written tone (greeting, closing, polite framing, appropriate detail)
+3. Conflict resolution (recognize the conflict, propose 2–3 alternatives, explain professionally)
+ExecAssist is designed to teach all three in a single training loop. The reward function is a weighted blend of three independent graders plus four anti-reward-hacking penalties, making it hard to game by exploiting any single signal.
+---
+## Tasks
+| Task | Scenario | Objective | Reward weighting |
+|------|-----------|-------------|-------------------|
+| **Easy** | 1 email, clear availability | Draft polite reply + book meeting in open slot | 50% email + 50% scheduling |
+| **Medium** | 1 email, calendar conflict | Identify conflict + propose 2–3 alternatives + explain professionally | 30% email + 40% conflict + 30% scheduling |
+| **Hard** | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |
+All scores are deterministic and bounded to [0, 1].
+---
+## Environment design
+### Observation space
+```python
+{
+  "task": "easy" | "medium" | "hard",
+  "description": str,
+  "emails": [{"sender", "subject", "body", "priority", "timestamp"}, ...],
+  "calendar": {
+    "existing_meetings": [{"id", "participants", "start_time", "end_time", "subject", "priority"}, ...],
+    "working_hours": {"monday": "9-17", ...},
+    "executive_name": str
+  },
+  "contacts": {email: {"name", "email", "timezone", "title"}, ...},
+  "action_required": str
+}
+```
+### Action space
+```python
+{
+  "email_reply": str,
+  "calendar_action": "book" | "propose_alternatives" | "reschedule" | "decline",
+  "meeting_details": {
+    "participants": [str, ...],
+    "start_time": "ISO-8601",
+    "end_time": "ISO-8601",
+    "subject": str,
+    "location": str | None,
+    "proposed_alternatives": [...] | None
+  }
+}
+```
+### Reward function (multiple independent graders)
+| Component | Range | What it checks |
+|-----------|-------|----------------|
+| **Email quality** | 0–1 | Politeness markers, greeting/closing, sufficient detail (20+ words), professional tone, optional LLM-judge for nuance |
+| **Scheduling correctness** | 0–1 | No double-booking, within working hours, appropriate duration (15 min to 2 h), all participants included |
+| **Conflict resolution** | 0–1 | Recognizes conflicts, proposes 2–3 alternatives, explains professionally, prioritizes correctly |
+### Anti-reward-hacking penalties
+- Short email (`< 20` words): **−0.30**
+- Missing `meeting_details`: **−0.40**
+- Generic / templated phrasing: **−0.10**
+- Overly long email (`> 1500` chars): **−0.15**
+These penalties were added because GRPO tends to find shortcuts. During the first run, the model briefly collapsed to a single short, safe response; the penalties plus KL regularization fixed it.
+---
+## API endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/reset?task=easy\|medium\|hard` | POST | Start new episode, returns observation |
+| `/step` | POST | Submit action, returns observation/reward/done/info |
+| `/state` | GET | Current state |
+| `/tasks` | GET | List all tasks |
+| `/health` | GET | Health check |
+| `/metadata` | GET | Environment info |
+| `/schema` | GET | Action / observation / state schemas |
+Full interactive docs: https://devanshudon-exec-assist.hf.space/docs
+---
+## Setup & usage
+### Run the environment locally
+```bash
+git clone https://huggingface.co/spaces/DevanshuDon/exec-assist
+cd exec-assist
+pip install -r requirements.txt
+uvicorn server.app:app --port 8000
+# open http://127.0.0.1:8000/docs
+```
+### Reproduce the baseline
+```bash
+export APIBASEURL=https://openrouter.ai/api/v1
+export MODELNAME=nvidia/nemotron-3-super-120b-a12b:free
+export HFTOKEN=your-openrouter-key
+python inference.py
+```
+Expected output (structured `[START] / [STEP] / [END]` logs as required):
+```
+[START] task=easy env=exec-assist model=...
+[STEP] step=1 action=assistant(easy) reward=0.32 done=true error=null
+[END] success=false steps=1 score=0.315 rewards=0.32
+```
+### Train and evaluate the model
+Open `train_colab.ipynb` in Google Colab, set the runtime to T4 GPU, and choose Run all. Total time is ~50 min including evaluation. The notebook writes `training_results.png` and `results.json`.
+### Docker
+```bash
+docker build -t exec-assist .
+docker run -p 7860:7860 exec-assist
+```
+---
+## Training pipeline
+**Stack:** TRL `GRPOTrainer` + HuggingFace Transformers, Qwen2.5-0.5B-Instruct, free Colab T4.
+**Approach:** Pre-collect 90 scenarios from the deployed HF Space (30 each for easy, medium, and hard) into a `Dataset`, storing each scenario in a dataset column. The reward function receives the scenario as a keyword argument and scores completions deterministically, with no API calls during training. This decouples training from environment latency.
+**Hyperparameters (final, working config):**
+```python
+GRPOConfig(
+    learning_rate=1e-6,           # critical — 5e-6 caused collapse
+    per_device_train_batch_size=2,
+    gradient_accumulation_steps=4,
+    num_generations=8,            # diversity within group
+    num_train_epochs=3,
+    beta=0.1,                     # KL penalty — prevents mode collapse
+    fp16=False, bf16=False,       # fp32 for stable gradients
+    gradient_checkpointing=True,
+)
+```
+**The collapse and the fix.** The initial run (1 epoch, `lr=5e-6`, `beta=0`) collapsed: the trained model scored exactly 0.2 on every prompt regardless of input, having found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), reducing the learning rate to `1e-6`, and increasing `num_generations` from 4 to 8 produced the training curve shown above.
+**Anti-reward-hacking observations during training:** The policy tried to game several signals: outputting only the JSON without an email body (caught by the short-email penalty), proposing booking times outside working hours (caught by the scheduling check), and repeating the prompt back as a "reply" (caught by the generic-phrasing detector). Each penalty was triggered during early steps and disappeared as training progressed.
+---
+## Repository structure
+```
+exec-assist/
+├── server/
+│   ├── app.py            # FastAPI app + environment logic
+│   ├── models.py         # Pydantic Action/Observation/State models
+│   └── data.py           # Scenario generation, scoring functions, LLM judge
+├── client.py             # EnvClient wrapper (Gym-style)
+├── inference.py          # Baseline inference (required, structured logs)
+├── train_colab.ipynb     # GRPO training notebook
+├── training_results.png  # Training curves + baseline-vs-trained
+├── results.json          # Raw evaluation data + 270-step training log
+├── blog_post.md          # Mini-blog write-up
+├── openenv.yaml          # OpenEnv manifest
+├── Dockerfile            # Python 3.10, port 7860
+├── requirements.txt
+└── README.md             # This file
+```
+---
+## Compliance checklist
+- ✅ Built on **OpenEnv** (latest release, `openenv-core>=0.2.0`)
+- ✅ Real-world task simulation (not games or toys)
+- ✅ Full OpenEnv spec — typed Pydantic models for Action/Observation/State, `step()`/`reset()`/`state()` endpoints, `openenv.yaml` manifest
+- ✅ **3 tasks** with deterministic graders, scores in [0, 1], easy → medium → hard difficulty progression
+- ✅ Meaningful reward function with **partial-progress signal** + anti-reward-hacking penalties
+- ✅ **Baseline inference script** (`inference.py`) using OpenAI client, reads `APIBASEURL`/`MODELNAME`/`HFTOKEN`, structured `[START]/[STEP]/[END]` logs
+- ✅ **Training script** (TRL GRPO) with reproducible Colab notebook
+- ✅ **Real training evidence** — reward curves, baseline vs. trained, before/after numbers (above)
+- ✅ Deployed to **HuggingFace Space** with Docker
+- ✅ Working **Dockerfile** (Python 3.10), `docker build && docker run` works
+- ✅ README with environment description, action/observation spaces, setup, baseline scores
+---
+## Author
+**Devanshu** ([@DevanshuDon](https://huggingface.co/DevanshuDon)) — built for OpenEnv Hackathon, April 2026.

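The reward described in the README (three graders blended with per-task weights, minus four penalties) could be sketched roughly as follows. This is a minimal illustration, not the code in `server/data.py`: the name `blended_reward`, the `is_generic` flag, and the choice to subtract penalties from the weighted sum and then clamp to [0, 1] are all assumptions. Only the weights and penalty magnitudes are taken from the README.

```python
# Per-task grader weights, copied from the README's task table.
TASK_WEIGHTS = {
    "easy": {"email": 0.50, "scheduling": 0.50, "conflict": 0.00},
    "medium": {"email": 0.30, "scheduling": 0.30, "conflict": 0.40},
    "hard": {"email": 0.34, "scheduling": 0.33, "conflict": 0.33},
}

# Penalty magnitudes, copied from the README's penalty list.
SHORT_EMAIL_PENALTY = 0.30    # fewer than 20 words
MISSING_DETAILS_PENALTY = 0.40
GENERIC_PENALTY = 0.10
LONG_EMAIL_PENALTY = 0.15     # more than 1500 characters


def blended_reward(task, grader_scores, action, is_generic=False):
    """Hypothetical blend: weighted grader sum minus penalties, clamped to [0, 1].

    grader_scores maps "email" / "scheduling" / "conflict" to values in [0, 1].
    is_generic stands in for the generic-phrasing detector, which is not
    specified in the README and is therefore left to the caller.
    """
    weights = TASK_WEIGHTS[task]
    base = sum(w * grader_scores.get(name, 0.0) for name, w in weights.items())

    email = action.get("email_reply") or ""
    penalty = 0.0
    if len(email.split()) < 20:
        penalty += SHORT_EMAIL_PENALTY
    if len(email) > 1500:
        penalty += LONG_EMAIL_PENALTY
    if not action.get("meeting_details"):
        penalty += MISSING_DETAILS_PENALTY
    if is_generic:
        penalty += GENERIC_PENALTY

    # Assumption: penalties are subtractive and the result is clamped.
    return min(1.0, max(0.0, base - penalty))


if __name__ == "__main__":
    action = {
        "email_reply": " ".join(["word"] * 25),
        "meeting_details": {"subject": "Sync"},
    }
    scores = {"email": 1.0, "scheduling": 0.5, "conflict": 1.0}
    print(blended_reward("medium", scores, action))
```

Under these assumptions, an action with an empty reply and no `meeting_details` loses 0.70 even when every grader returns 1.0, which is the kind of shortcut the penalties are meant to make unprofitable.
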
results.json ADDED

	@@ -0,0 +1,1167 @@

+{
+  "baseline": {
+    "easy": [
+      0.2,
+      0.65,
+      0.65,
+      0.65,
+      0.65,
+      0.2,
+      0.2,
+      0.0,
+      0.04999999999999999,
+      0.2
+    ],
+    "medium": [
+      0.15714285714285714,
+      0.11428571428571427,
+      0.15714285714285714,
+      0.15714285714285714,
+      0.15714285714285714,
+      0.5471428571428572,
+      0.5471428571428572,
+      0.15714285714285714,
+      0.11428571428571427,
+      0.15714285714285714
+    ],
+    "hard": [
+      0.14785714285714285,
+      0.14785714285714285,
+      0.10071428571428576,
+      0.14785714285714285,
+      0.5498571428571429,
+      0.0,
+      0.14785714285714285,
+      0.14785714285714285,
+      0.5498571428571429,
+      0.5498571428571429
+    ]
+  },
+  "trained": {
+    "easy": [
+      1.0,
+      1.0,
+      1.0,
+      1.0,
+      1.0,
+      1.0,
+      1.0,
+      1.0,
+      1.0,
+      0.95
+    ],
+    "medium": [
+      0.7571428571428571,
+      0.6842857142857143,
+      0.7571428571428571,
+      0.7571428571428571,
+      0.8171428571428572,
+      0.7571428571428571,
+      0.7271428571428571,
+      0.7571428571428571,
+      0.7121428571428572,
+      0.7271428571428571
+    ],
+    "hard": [
+      0.6518571428571429,
+      0.6688571428571428,
+      0.7407142857142858,
+      0.7028571428571428,
+      0.8010000000000002,
+      0.7878571428571429,
+      0.7878571428571429,
+      0.7878571428571429,
+      0.7878571428571429,
+      0.6518571428571429
+    ]
+  },
+  "training_log": [
+    {
+      "step": 1,
+      "reward": 0.25410714745521545
+    },
+    {
+      "step": 2,
+      "reward": 0.036964286118745804
+    },
+    {
+      "step": 3,
+      "reward": 0.12232142686843872
+    },
+    {
+      "step": 4,
+      "reward": 0.14007142186164856
+    },
+    {
+      "step": 5,
+      "reward": 0.10798214375972748
+    },
+    {
+      "step": 6,
+      "reward": 0.26973217725753784
+    },
+    {
+      "step": 7,
+      "reward": 0.21267856657505035
+    },
+    {
+      "step": 8,
+      "reward": 0.2703035771846771
+    },
+    {
+      "step": 9,
+      "reward": 0.28862500190734863
+    },
+    {
+      "step": 10,
+      "reward": 0.4040178656578064
+    },
+    {
+      "step": 11,
+      "reward": 0.29374998807907104
+    },
+    {
+      "step": 12,
+      "reward": 0.3059999942779541
+    },
+    {
+      "step": 13,
+      "reward": 0.3656250238418579
+    },
+    {
+      "step": 14,
+      "reward": 0.40044641494750977
+    },
+    {
+      "step": 15,
+      "reward": 0.12955357134342194
+    },
+    {
+      "step": 16,
+      "reward": 0.408482164144516
+    },
+    {
+      "step": 17,
+      "reward": 0.46098214387893677
+    },
+    {
+      "step": 18,
+      "reward": 0.49732139706611633
+    },
+    {
+      "step": 19,
+      "reward": 0.4546428620815277
+    },
+    {
+      "step": 20,
+      "reward": 0.4602678418159485
+    },
+    {
+      "step": 21,
+      "reward": 0.4337499737739563
+    },
+    {
+      "step": 22,
+      "reward": 0.579464316368103
+    },
+    {
+      "step": 23,
+      "reward": 0.5285714268684387
+    },
+    {
+      "step": 24,
+      "reward": 0.27405357360839844
+    },
+    {
+      "step": 25,
+      "reward": 0.44980356097221375
+    },
+    {
+      "step": 26,
+      "reward": 0.5672321319580078
+    },
+    {
+      "step": 27,
+      "reward": 0.16444642841815948
+    },
+    {
+      "step": 28,
+      "reward": 0.4348214268684387
+    },
+    {
+      "step": 29,
+      "reward": 0.35455358028411865
+    },
+    {
+      "step": 30,
+      "reward": 0.40312498807907104
+    },
+    {
+      "step": 31,
+      "reward": 0.5915178060531616
+    },
+    {
+      "step": 32,
+      "reward": 0.42767855525016785
+    },
+    {
+      "step": 33,
+      "reward": 0.4596250057220459
+    },
+    {
+      "step": 34,
+      "reward": 0.357142835855484
+    },
+    {
+      "step": 35,
+      "reward": 0.34437501430511475
+    },
+    {
+      "step": 36,
+      "reward": 0.21607142686843872
+    },
+    {
+      "step": 37,
+      "reward": 0.23191072046756744
+    },
+    {
+      "step": 38,
+      "reward": 0.4008482098579407
+    },
+    {
+      "step": 39,
+      "reward": 0.08360714465379715
+    },
+    {
+      "step": 40,
+      "reward": 0.28391069173812866
+    },
+    {
+      "step": 41,
+      "reward": 0.39952677488327026
+    },
+    {
+      "step": 42,
+      "reward": 0.3128303587436676
+    },
+    {
+      "step": 43,
+      "reward": 0.5379464626312256
+    },
+    {
+      "step": 44,
+      "reward": 0.6946428418159485
+    },
+    {
+      "step": 45,
+      "reward": 0.48794642090797424
+    },
+    {
+      "step": 46,
+      "reward": 0.5474107265472412
+    },
+    {
+      "step": 47,
+      "reward": 0.3933393061161041
+    },
+    {
+      "step": 48,
+      "reward": 0.565625011920929
+    },
+    {
+      "step": 49,
+      "reward": 0.4785892963409424
+    },
+    {
+      "step": 50,
+      "reward": 0.46958038210868835
+    },
+    {
+      "step": 51,
+      "reward": 0.36991071701049805
+    },
+    {
+      "step": 52,
+      "reward": 0.2941964268684387
+    },
+    {
+      "step": 53,
+      "reward": 0.4853571653366089
+    },
+    {
+      "step": 54,
+      "reward": 0.23973214626312256
+    },
+    {
+      "step": 55,
+      "reward": 0.7227678298950195
+    },
+    {
+      "step": 56,
+      "reward": 0.4734821319580078
+    },
+    {
+      "step": 57,
+      "reward": 0.5602678656578064
+    },
+    {
+      "step": 58,
+      "reward": 0.5581071376800537
+    },
+    {
+      "step": 59,
+      "reward": 0.6915178298950195
+    },
+    {
+      "step": 60,
+      "reward": 0.328830361366272
+    },
+    {
+      "step": 61,
+      "reward": 0.3727678656578064
+    },
+    {
+      "step": 62,
+      "reward": 0.4290178418159485
+    },
+    {
+      "step": 63,
+      "reward": 0.5301785469055176
+    },
+    {
+      "step": 64,
+      "reward": 0.29794642329216003
+    },
+    {
+      "step": 65,
+      "reward": 0.5418839454650879
+    },
+    {
+      "step": 66,
+      "reward": 0.5446428656578064
+    },
+    {
+      "step": 67,
+      "reward": 0.30937498807907104
+    },
+    {
+      "step": 68,
+      "reward": 0.571696400642395
+    },
+    {
+      "step": 69,
+      "reward": 0.5544642806053162
+    },
+    {
+      "step": 70,
+      "reward": 0.6167857646942139
+    },
+    {
+      "step": 71,
+      "reward": 0.45669642090797424
+    },
+    {
+      "step": 72,
+      "reward": 0.31955355405807495
+    },
+    {
+      "step": 73,
+      "reward": 0.5181249976158142
+    },
+    {
+      "step": 74,
+      "reward": 0.4415178596973419
+    },
+    {
+      "step": 75,
+      "reward": 0.5451785326004028
+    },
+    {
+      "step": 76,
+      "reward": 0.3028392791748047
+    },
+    {
+      "step": 77,
+      "reward": 0.2091071605682373
+    },
+    {
+      "step": 78,
+      "reward": 0.536339282989502
+    },
+    {
+      "step": 79,
+      "reward": 0.2366071492433548
+    },
+    {
+      "step": 80,
+      "reward": 0.3268928527832031
+    },
+    {
+      "step": 81,
+      "reward": 0.5390892624855042
+    },
+    {
+      "step": 82,
+      "reward": 0.4825892746448517
+    },
+    {
+      "step": 83,
+      "reward": 0.46875
+    },
+    {
+      "step": 84,
+      "reward": 0.7821428775787354
+    },
+    {
+      "step": 85,
+      "reward": 0.4580357074737549
+    },
+    {
+      "step": 86,
+      "reward": 0.5209821462631226
+    },
+    {
+      "step": 87,
+      "reward": 0.4017857313156128
+    },
+    {
+      "step": 88,
+      "reward": 0.660178542137146
+    },
+    {
+      "step": 89,
+      "reward": 0.5458393096923828
+    },
+    {
+      "step": 90,
+      "reward": 0.7919642925262451
+    },
+    {
+      "step": 91,
+      "reward": 0.4300000071525574
+    },
+    {
+      "step": 92,
+      "reward": 0.501964271068573
+    },
+    {
+      "step": 93,
+      "reward": 0.6446428298950195
+    },
+    {
+      "step": 94,
+      "reward": 0.5094642639160156
+    },
+    {
+      "step": 95,
+      "reward": 0.5647678375244141
+    },
+    {
+      "step": 96,
+      "reward": 0.6352678537368774
+    },
+    {
+      "step": 97,
+      "reward": 0.5024999976158142
+    },
+    {
+      "step": 98,
+      "reward": 0.515874981880188
+    },
+    {
+      "step": 99,
+      "reward": 0.46294644474983215
+    },
+    {
+      "step": 100,
+      "reward": 0.8723214268684387
+    },
+    {
+      "step": 101,
+      "reward": 0.5212500095367432
+    },
+    {
+      "step": 102,
+      "reward": 0.671875
+    },
+    {
+      "step": 103,
+      "reward": 0.5864999890327454
+    },
+    {
+      "step": 104,
+      "reward": 0.6749999523162842
+    },
+    {
+      "step": 105,
+      "reward": 0.5629464387893677
+    },
+    {
+      "step": 106,
+      "reward": 0.5281071662902832
+    },
+    {
+      "step": 107,
+      "reward": 0.6936607360839844
+    },
+    {
+      "step": 108,
+      "reward": 0.6465713977813721
+    },
+    {
+      "step": 109,
+      "reward": 0.5022321343421936
+    },
+    {
+      "step": 110,
+      "reward": 0.5313928127288818
+    },
+    {
+      "step": 111,
+      "reward": 0.6238213777542114
+    },
+    {
+      "step": 112,
+      "reward": 0.6399999856948853
+    },
+    {
+      "step": 113,
+      "reward": 0.7440178394317627
+    },
+    {
+      "step": 114,
+      "reward": 0.5431250333786011
+    },
+    {
+      "step": 115,
+      "reward": 0.6102678775787354
+    },
+    {
+      "step": 116,
+      "reward": 0.6504464149475098
+    },
+    {
+      "step": 117,
+      "reward": 0.7581071257591248
+    },
+    {
+      "step": 118,
+      "reward": 0.6492946147918701
+    },
+    {
+      "step": 119,
+      "reward": 0.6843750476837158
+    },
+    {
+      "step": 120,
+      "reward": 0.5536428689956665
+    },
+    {
+      "step": 121,
+      "reward": 0.653249979019165
+    },
+    {
+      "step": 122,
+      "reward": 0.5297499895095825
+    },
+    {
+      "step": 123,
+      "reward": 0.6578571796417236
+    },
+    {
+      "step": 124,
+      "reward": 0.8348214626312256
+    },
+    {
+      "step": 125,
+      "reward": 0.349785715341568
+    },
+    {
+      "step": 126,
+      "reward": 0.7781250476837158
+    },
+    {
+      "step": 127,
+      "reward": 0.8968750238418579
+    },
+    {
+      "step": 128,
+      "reward": 0.8093750476837158
+    },
+    {
+      "step": 129,
+      "reward": 0.6137499809265137
+    },
+    {
+      "step": 130,
+      "reward": 0.8910714387893677
+    },
+    {
+      "step": 131,
+      "reward": 0.4497321546077728
+    },
+    {
+      "step": 132,
+      "reward": 0.43910714983940125
+    },
+    {
+      "step": 133,
+      "reward": 0.48475003242492676
+    },
+    {
+      "step": 134,
+      "reward": 0.90625
+    },
+    {
+      "step": 135,
+      "reward": 0.6499999761581421
+    },
+    {
+      "step": 136,
+      "reward": 0.4575803279876709
+    },
+    {
+      "step": 137,
+      "reward": 0.5043749809265137
+    },
+    {
+      "step": 138,
+      "reward": 0.5194821357727051
+    },
+    {
+      "step": 139,
+      "reward": 0.6681874990463257
+    },
+    {
+      "step": 140,
+      "reward": 0.6075000166893005
+    },
+    {
+      "step": 141,
+      "reward": 0.5226786136627197
+    },
+    {
+      "step": 142,
+      "reward": 0.544910728931427
+    },
+    {
+      "step": 143,
+      "reward": 0.3448214530944824
+    },
+    {
+      "step": 144,
+      "reward": 0.5093749761581421
+    },
+    {
+      "step": 145,
+      "reward": 0.5335178375244141
+    },
+    {
+      "step": 146,
+      "reward": 0.5901785492897034
+    },
+    {
+      "step": 147,
+      "reward": 0.6714285612106323
+    },
+    {
+      "step": 148,
+      "reward": 0.6086249351501465
+    },
+    {
+      "step": 149,
+      "reward": 0.4005535840988159
+    },
+    {
+      "step": 150,
+      "reward": 0.4816071391105652
+    },
+    {
+      "step": 151,
+      "reward": 0.5088571310043335
+    },
+    {
+      "step": 152,
+      "reward": 0.8410714268684387
+    },
+    {
+      "step": 153,
+      "reward": 0.661517858505249
+    },
+    {
+      "step": 154,
+      "reward": 0.4182142913341522
+    },
+    {
+      "step": 155,
+      "reward": 0.5462499856948853
+    },
+    {
+      "step": 156,
+      "reward": 0.5656249523162842
+    },
+    {
+      "step": 157,
+      "reward": 0.6449106931686401
+    },
+    {
+      "step": 158,
+      "reward": 0.8441964387893677
+    },
+    {
+      "step": 159,
+      "reward": 0.6388392448425293
+    },
+    {
+      "step": 160,
+      "reward": 0.3429464101791382
+    },
+    {
+      "step": 161,
+      "reward": 0.4982143044471741
+    },
+    {
+      "step": 162,
+      "reward": 0.4846428632736206
+    },
+    {
+      "step": 163,
+      "reward": 0.4471428394317627
+    },
+    {
+      "step": 164,
+      "reward": 0.6001249551773071
+    },
+    {
+      "step": 165,
+      "reward": 0.735714316368103
+    },
+    {
+      "step": 166,
+      "reward": 0.641964316368103
+    },
+    {
+      "step": 167,
+      "reward": 0.6100000143051147
+    },
+    {
+      "step": 168,
+      "reward": 0.7066963911056519
+    },
+    {
+      "step": 169,
+      "reward": 0.6348214149475098
+    },
+    {
+      "step": 170,
+      "reward": 0.5228928327560425
+    },
+    {
+      "step": 171,
+      "reward": 0.5739464163780212
+    },
+    {
+      "step": 172,
+      "reward": 0.6174107193946838
+    },
+    {
+      "step": 173,
+      "reward": 0.5413392782211304
+    },
+    {
+      "step": 174,
+      "reward": 0.5052499771118164
+    },
+    {
+      "step": 175,
+      "reward": 0.5122321248054504
+    },
+    {
+      "step": 176,
+      "reward": 0.2723214328289032
+    },
+    {
+      "step": 177,
+      "reward": 0.796875
+    },
+    {
+      "step": 178,
+      "reward": 0.5441964864730835
+    },
+    {
+      "step": 179,
+      "reward": 0.578125
+    },
+    {
+      "step": 180,
+      "reward": 0.5441964268684387
+    },
+    {
+      "step": 181,
+      "reward": 0.4543749988079071
+    },
+    {
+      "step": 182,
+      "reward": 0.31626784801483154
+    },
+    {
+      "step": 183,
+      "reward": 0.6285713911056519
+    },
+    {
+      "step": 184,
+      "reward": 0.6952678561210632
+    },
+    {
+      "step": 185,
+      "reward": 0.484375
+    },
+    {
+      "step": 186,
+      "reward": 0.5447321534156799
+    },
+    {
+      "step": 187,
+      "reward": 0.6228570938110352
+    },
+    {
+      "step": 188,
+      "reward": 0.5247321128845215
+    },
+    {
+      "step": 189,
+      "reward": 0.6542679071426392
+    },
+    {
+      "step": 190,
+      "reward": 0.5883928537368774
+    },
+    {
+      "step": 191,
+      "reward": 0.5099107027053833
+    },
+    {
+      "step": 192,
+      "reward": 0.49196428060531616
+    },
+    {
+      "step": 193,
+      "reward": 0.6783928871154785
+    },
+    {
+      "step": 194,
+      "reward": 0.8446428775787354
+    },
+    {
+      "step": 195,
+      "reward": 0.27032142877578735
+    },
+    {
+      "step": 196,
+      "reward": 0.5037678480148315
+    },
+    {
+      "step": 197,
+      "reward": 0.7468750476837158
+    },
+    {
+      "step": 198,
+      "reward": 0.5247321128845215
+    },
+    {
+      "step": 199,
+      "reward": 0.5624642968177795
+    },
+    {
+      "step": 200,
+      "reward": 0.7284821271896362
+    },
+    {
+      "step": 201,
+      "reward": 0.5191071033477783
+    },
+    {
+      "step": 202,
+      "reward": 0.6629464030265808
+    },
+    {
+      "step": 203,
+      "reward": 0.5663928985595703
+    },
+    {
+      "step": 204,
+      "reward": 0.5843750238418579
+    },
+    {
+      "step": 205,
+      "reward": 0.6022321581840515
+    },
+    {
+      "step": 206,
+      "reward": 0.40528571605682373
+    },
+    {
+      "step": 207,
+      "reward": 0.3661428689956665
+    },
+    {
+      "step": 208,
+      "reward": 0.5410714149475098
+    },
+    {
+      "step": 209,
+      "reward": 0.8133928775787354
+    },
+    {
+      "step": 210,
+      "reward": 0.6950892806053162
+    },
+    {
+      "step": 211,
+      "reward": 0.6952678561210632
+    },
+    {
+      "step": 212,
+      "reward": 0.6540178656578064
+    },
+    {
+      "step": 213,
+      "reward": 0.642464280128479
+    },
+    {
+      "step": 214,
+      "reward": 0.37055355310440063
+    },
+    {
+      "step": 215,
+      "reward": 0.7700892686843872
+    },
+    {
+      "step": 216,
+      "reward": 0.6158928871154785
+    },
+    {
+      "step": 217,
+      "reward": 0.8133928775787354
+    },
+    {
+      "step": 218,
+      "reward": 0.5236964225769043
+    },
+    {
+      "step": 219,
+      "reward": 0.5915178656578064
+    },
+    {
+      "step": 220,
+      "reward": 0.5193749666213989
+    },
+    {
+      "step": 221,
+      "reward": 0.956250011920929
+    },
+    {
+      "step": 222,
+      "reward": 0.8441964387893677
+    },
+    {
+      "step": 223,
+      "reward": 0.5497321486473083
+    },
+    {
+      "step": 224,
+      "reward": 0.6256250143051147
+    },
+    {
+      "step": 225,
+      "reward": 0.637946367263794
+    },
+    {
+      "step": 226,
+      "reward": 0.6213749647140503
+    },
+    {
+      "step": 227,
+      "reward": 0.7883928418159485
+    },
+    {
+      "step": 228,
+      "reward": 0.6973214149475098
+    },
+    {
+      "step": 229,
+      "reward": 0.84375
+    },
+    {
+      "step": 230,
+      "reward": 0.6660535335540771
+    },
+    {
+      "step": 231,
+      "reward": 0.9249999523162842
+    },
+    {
+      "step": 232,
+      "reward": 0.5973213911056519
+    },
+    {
+      "step": 233,
+      "reward": 0.5616071224212646
+    },
+    {
+      "step": 234,
+      "reward": 0.6325000524520874
+    },
+    {
+      "step": 235,
+      "reward": 0.5837500095367432
+    },
+    {
+      "step": 236,
+      "reward": 0.4354107081890106
+    },
+    {
+      "step": 237,
+      "reward": 0.8660714626312256
+    },
+    {
+      "step": 238,
+      "reward": 0.6324999928474426
+    },
+    {
+      "step": 239,
+      "reward": 0.5628928542137146
+    },
+    {
+      "step": 240,
+      "reward": 0.6083928346633911
+    },
+    {
+      "step": 241,
+      "reward": 0.42701786756515503
+    },
+    {
+      "step": 242,
+      "reward": 0.890625
+    },
+    {
+      "step": 243,
+      "reward": 0.8290178775787354
+    },
+    {
+      "step": 244,
+      "reward": 0.6953214406967163
+    },
+    {
+      "step": 245,
+      "reward": 0.5653928518295288
+    },
+    {
+      "step": 246,
+      "reward": 0.6873035430908203
+    },
+    {
+      "step": 247,
+      "reward": 0.5102678537368774
+    },
+    {
+      "step": 248,
+      "reward": 0.5462678670883179
+    },
+    {
+      "step": 249,
+      "reward": 0.9468749761581421
+    },
+    {
+      "step": 250,
+      "reward": 0.4848214387893677
+    },
+    {
+      "step": 251,
+      "reward": 0.7349107265472412
+    },
+    {
+      "step": 252,
+      "reward": 0.5615178346633911
+    },
+    {
+      "step": 253,
+      "reward": 0.859375
+    },
+    {
+      "step": 254,
+      "reward": 0.706250011920929
+    },
+    {
+      "step": 255,
+      "reward": 0.7140178680419922
+    },
+    {
+      "step": 256,
+      "reward": 0.42228570580482483
+    },
+    {
+      "step": 257,
+      "reward": 0.6294910907745361
+    },
+    {
+      "step": 258,
+      "reward": 0.5381250381469727
+    },
+    {
+      "step": 259,
+      "reward": 0.6176071166992188
+    },
+    {
+      "step": 260,
+      "reward": 0.7239285707473755
+    },
+    {
+      "step": 261,
+      "reward": 0.5584821701049805
+    },
+    {
+      "step": 262,
+      "reward": 0.6481249928474426
+    },
+    {
+      "step": 263,
+      "reward": 0.7474821209907532
+    },
+    {
+      "step": 264,
+      "reward": 0.7473214268684387
+    },
+    {
+      "step": 265,
+      "reward": 0.5647231936454773
+    },
+    {
+      "step": 266,
+      "reward": 0.6187499761581421
+    },
+    {
+      "step": 267,
+      "reward": 0.5382499694824219
+    },
+    {
+      "step": 268,
+      "reward": 0.6027321219444275
+    },
+    {
+      "step": 269,
+      "reward": 0.6174999475479126
+    },
+    {
+      "step": 270,
+      "reward": 0.8348214626312256
+    }
+  ],
+  "config": {
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "n_per_task": 30,
+    "num_generations": 8,
+    "epochs": 3,
+    "lr": 1e-06
+  }
+}

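The headline table in the README can be re-derived from the `baseline` and `trained` arrays in `results.json`. The sketch below copies those values inline from the diff above so it runs without the file. `summarize` is a hypothetical helper name, and computing the improvement from the three-decimal rounded means is an assumption about how the table was produced.

```python
from statistics import mean

# Evaluation scores copied verbatim from results.json (10 samples per task).
baseline = {
    "easy": [0.2, 0.65, 0.65, 0.65, 0.65, 0.2, 0.2, 0.0,
             0.04999999999999999, 0.2],
    "medium": [0.15714285714285714, 0.11428571428571427, 0.15714285714285714,
               0.15714285714285714, 0.15714285714285714, 0.5471428571428572,
               0.5471428571428572, 0.15714285714285714, 0.11428571428571427,
               0.15714285714285714],
    "hard": [0.14785714285714285, 0.14785714285714285, 0.10071428571428576,
             0.14785714285714285, 0.5498571428571429, 0.0,
             0.14785714285714285, 0.14785714285714285, 0.5498571428571429,
             0.5498571428571429],
}
trained = {
    "easy": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95],
    "medium": [0.7571428571428571, 0.6842857142857143, 0.7571428571428571,
               0.7571428571428571, 0.8171428571428572, 0.7571428571428571,
               0.7271428571428571, 0.7571428571428571, 0.7121428571428572,
               0.7271428571428571],
    "hard": [0.6518571428571429, 0.6688571428571428, 0.7407142857142858,
             0.7028571428571428, 0.8010000000000002, 0.7878571428571429,
             0.7878571428571429, 0.7878571428571429, 0.7878571428571429,
             0.6518571428571429],
}


def summarize(baseline_scores, trained_scores):
    """Return {task: (baseline_mean, trained_mean, improvement_percent)}.

    Means are rounded to three decimals first, and the improvement is then
    computed from those rounded means (an assumption about the README table).
    """
    rows = {}
    for task in ("easy", "medium", "hard"):
        b = round(mean(baseline_scores[task]), 3)
        t = round(mean(trained_scores[task]), 3)
        rows[task] = (b, t, round(100 * (t - b) / b))
    return rows


summary = summarize(baseline, trained)

if __name__ == "__main__":
    for task, (b, t, gain) in summary.items():
        print(f"{task:<6} baseline={b:.3f} trained={t:.3f} improvement=+{gain}%")
```

Computed this way, the means and percentages agree with the README table. If the improvement is instead taken from the unrounded means, the medium-task figure rounds to +229% rather than +228%.
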
training_results.png ADDED

Git LFS Details

SHA256: ccf993c0d7a261553a8f5cb86a1f2e0395a4d0d7ed7a691747071e47466aea57
Pointer size: 131 Bytes
Size of remote file: 162 kB
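
The 270 entries in `training_log` can be sanity-checked against the README's hyperparameters. The arithmetic below is a sketch that assumes the effective batch (`per_device_train_batch_size` × `gradient_accumulation_steps` × number of devices) counts completions rather than prompts, as in recent TRL releases, and that a single GPU was used. `expected_optimizer_steps` is a hypothetical helper, not part of TRL.

```python
import math


def expected_optimizer_steps(num_prompts, epochs, per_device_batch,
                             grad_accum, num_generations, num_devices=1):
    """Estimate GRPO optimizer steps, assuming the batch counts completions."""
    completions_per_step = per_device_batch * grad_accum * num_devices
    if completions_per_step % num_generations != 0:
        raise ValueError(
            "completions per step must be divisible by num_generations"
        )
    # Each prompt contributes num_generations completions to a step.
    prompts_per_step = completions_per_step // num_generations
    steps_per_epoch = math.ceil(num_prompts / prompts_per_step)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # 90 scenarios, 3 epochs, batch 2 x accumulation 4, 8 generations per prompt.
    print(expected_optimizer_steps(90, 3, 2, 4, 8))
```

With 2 × 4 = 8 completions per step and 8 generations per prompt, each step covers one prompt, so 90 prompts over 3 epochs give 270 steps, consistent with the log length.
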