DevanshuDon committed
Commit d73a92d · verified · 1 Parent(s): 7c9156d

Upload 3 files

Files changed (4)
  1. .gitattributes +1 -0
  2. README.md +249 -0
  3. results.json +1167 -0
  4. training_results.png +3 -0
.gitattributes ADDED
@@ -0,0 +1 @@
1
+ training_results.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,249 @@
1
+ ---
2
+ title: ExecAssist
3
+ emoji: 📧
4
+ colorFrom: indigo
5
+ colorTo: blue
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: false
9
+ license: mit
10
+ tags:
11
+ - openenv
12
+ - rl
13
+ - executive-assistant
14
+ - grpo
15
+ - trl
16
+ ---
17
+
18
+ # ExecAssist — Executive Assistant Environment
19
+
20
+ An OpenEnv environment where AI agents learn to manage email and calendar like a human executive assistant. Built for the **OpenEnv Hackathon (Apr 2026)** under Theme #3.2 — Personalized Tasks.
21
+
22
+ **Live environment:** https://devanshudon-exec-assist.hf.space
23
+ **Mini-blog:** [`blog_post.md`](./blog_post.md)
24
+ **Training notebook:** [`train_colab.ipynb`](./train_colab.ipynb)
25
+
26
+ ---
27
+
28
+ ## 🏆 Headline result
29
+
30
+ Trained `Qwen2.5-0.5B-Instruct` with TRL GRPO for 3 epochs (270 steps, ~90 min on free Colab T4):
31
+
32
+ | Task | Baseline (untrained 0.5B) | Trained (GRPO) | Improvement |
33
+ |--------|---------------------------|----------------|-------------|
34
+ | Easy | 0.345 | **0.995** | **+188%** |
35
+ | Medium | 0.227 | **0.745** | **+228%** |
36
+ | Hard | 0.249 | **0.737** | **+196%** |
37
+
38
+ After training, **9 out of 10 samples on the easy task scored a perfect 1.0** and the tenth scored 0.95, which suggests the model learned the task structure rather than a partial heuristic.
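The improvement column can be re-derived from the table's own means. The sketch below is illustrative only: `relative_improvement` is a hypothetical helper name, and the inputs are the rounded means from the table, not the raw scores.

```python
# Mean rewards copied from the headline table (baseline, trained).
MEANS = {
    "easy": (0.345, 0.995),
    "medium": (0.227, 0.745),
    "hard": (0.249, 0.737),
}


def relative_improvement(baseline: float, trained: float) -> int:
    """Percentage gain of trained over baseline, rounded to a whole percent."""
    return round(100 * (trained - baseline) / baseline)


improvements = {task: relative_improvement(b, t) for task, (b, t) in MEANS.items()}
for task, pct in improvements.items():
    print(f"{task}: +{pct}%")
```

Because the inputs are already rounded to three decimals, recomputing from the unrounded per-sample scores in `results.json` can shift a value by about one percentage point.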
39
+
40
+ ![Training results: bar chart shows trained model dramatically outperforms baseline across all three tasks; line plot shows reward climbing from ~0.2 to ~0.7 over 270 steps](./training_results.png)
41
+
42
+ *Left: mean reward by task, baseline vs. trained (n=10 samples per task). Right: GRPO batch reward over 270 training steps — first-quartile mean 0.390, last-quartile mean 0.648.*
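The quartile means in the caption can be recomputed from the `training_log` array in `results.json`. This is a sketch under one assumption: a quartile is taken as `floor(n / 4)` steps, which the README does not state. The function name is hypothetical.

```python
from statistics import mean


def quartile_means(rewards):
    """Return (mean of the first quarter, mean of the last quarter) of a reward series."""
    q = len(rewards) // 4  # assumption: quartile size is floor(n / 4)
    if q == 0:
        raise ValueError("need at least 4 rewards")
    return mean(rewards[:q]), mean(rewards[-q:])


# Usage against the committed file (left as comments so the block has no file dependency):
# import json
# with open("results.json") as fh:
#     log = json.load(fh)["training_log"]
# first, last = quartile_means([row["reward"] for row in log])
```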
43
+
44
+ ---
45
+
46
+ ## Problem & motivation
47
+
48
+ Every executive's morning inbox is the same problem on repeat: read incoming requests, write a polite reply, find a calendar slot that doesn't clash, and propose alternatives when the requested time does. It's not hard for a human, just fiddly. Yet LLMs are surprisingly bad at it because the task fuses **three separate skills**:
49
+
50
+ 1. Structured calendar reasoning (no double-booking, within working hours, sensible duration)
51
+ 2. Professional written tone (greeting, closing, polite framing, appropriate detail)
52
+ 3. Conflict resolution (recognize the conflict, propose 2–3 alternatives, explain professionally)
53
+
54
+ ExecAssist is designed to teach all three in a single training loop. The reward function is a weighted blend of three independent graders plus four anti-reward-hacking penalties, making it hard to game by exploiting any single signal.
55
+
56
+ ---
57
+
58
+ ## Tasks
59
+
60
+ | Task | Scenario | Description | Reward weighting |
61
+ |------|-----------|-------------|-------------------|
62
+ | **Easy** | 1 email, clear availability | Draft polite reply + book meeting in open slot | 50% email + 50% scheduling |
63
+ | **Medium** | 1 email, calendar conflict | Identify conflict + propose 2–3 alternatives + explain professionally | 30% email + 40% conflict + 30% scheduling |
64
+ | **Hard** | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |
65
+
66
+ All scores are deterministic and bounded to [0, 1].
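The per-task weighting in the table can be written as a small blend function. This is a sketch, not the environment's code: the weights come from the table, but the order of operations (subtract penalties after blending, then clip to [0, 1]) and the names `TASK_WEIGHTS` and `blend` are assumptions.

```python
# Weights copied from the Tasks table; the easy task has no conflict component.
TASK_WEIGHTS = {
    "easy": {"email": 0.50, "scheduling": 0.50, "conflict": 0.00},
    "medium": {"email": 0.30, "scheduling": 0.30, "conflict": 0.40},
    "hard": {"email": 0.34, "scheduling": 0.33, "conflict": 0.33},
}


def blend(task, component_scores, penalty=0.0):
    """Weighted sum of grader scores minus penalties, clipped to [0, 1].

    Assumption: penalties are subtracted from the blended score before clipping.
    """
    weights = TASK_WEIGHTS[task]
    raw = sum(weights[name] * component_scores.get(name, 0.0) for name in weights)
    return min(1.0, max(0.0, raw - penalty))


print(blend("medium", {"email": 0.8, "scheduling": 1.0, "conflict": 0.5}))
```

Clipping is what keeps the final score inside [0, 1] even when several penalties fire at once.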
67
+
68
+ ---
69
+
70
+ ## Environment design
71
+
72
+ ### Observation space
73
+
74
+ ```python
75
+ {
76
+ "task": "easy" | "medium" | "hard",
77
+ "description": str,
78
+ "emails": [{"sender", "subject", "body", "priority", "timestamp"}, ...],
79
+ "calendar": {
80
+ "existing_meetings": [{"id", "participants", "start_time", "end_time", "subject", "priority"}, ...],
81
+ "working_hours": {"monday": "9-17", ...},
82
+ "executive_name": str
83
+ },
84
+ "contacts": {email: {"name", "email", "timezone", "title"}, ...},
85
+ "action_required": str
86
+ }
87
+ ```
88
+
89
+ ### Action space
90
+
91
+ ```python
92
+ {
93
+ "email_reply": str,
94
+ "calendar_action": "book" | "propose_alternatives" | "reschedule" | "decline",
95
+ "meeting_details": {
96
+ "participants": [str, ...],
97
+ "start_time": "ISO-8601",
98
+ "end_time": "ISO-8601",
99
+ "subject": str,
100
+ "location": str | None,
101
+ "proposed_alternatives": [...] | None
102
+ }
103
+ }
104
+ ```
105
+
106
+ ### Reward function (multiple independent graders)
107
+
108
+ | Component | Range | What it checks |
109
+ |-----------|-------|----------------|
110
+ | **Email quality** | 0–1 | Politeness markers, greeting/closing, sufficient detail (20+ words), professional tone, optional LLM-judge for nuance |
111
+ | **Scheduling correctness** | 0–1 | No double-booking, within working hours, appropriate duration (15 min to 2 h), all participants included |
112
+ | **Conflict resolution** | 0–1 | Recognizes conflicts, proposes 2–3 alternatives, explains professionally, prioritizes correctly |
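Three of the scheduling checks in the table (duration, working hours, double-booking) can be sketched as below. The assumptions are mine: naive datetimes in a single timezone, a 9-to-17 working day as in the observation example, and a hypothetical function name `check_slot`. The participant check is omitted.

```python
from datetime import datetime, timedelta


def check_slot(start, end, existing, work_start=9, work_end=17):
    """Return the list of violated rules for a proposed meeting (empty list = valid).

    `existing` is a list of (start, end) datetime pairs for meetings already booked.
    """
    problems = []

    # Duration must be between 15 minutes and 2 hours, inclusive.
    if not timedelta(minutes=15) <= end - start <= timedelta(hours=2):
        problems.append("duration")

    # The meeting must start and end on the same day, inside working hours.
    same_day = start.date() == end.date()
    starts_ok = start.hour >= work_start
    ends_ok = end.hour < work_end or (
        end.hour == work_end and end.minute == 0 and end.second == 0
    )
    if not (same_day and starts_ok and ends_ok):
        problems.append("working_hours")

    # Two intervals overlap when each one starts before the other ends.
    if any(start < busy_end and busy_start < end for busy_start, busy_end in existing):
        problems.append("double_booking")

    return problems


booked = [(datetime(2026, 4, 6, 10, 0), datetime(2026, 4, 6, 11, 0))]
print(check_slot(datetime(2026, 4, 6, 10, 30), datetime(2026, 4, 6, 11, 30), booked))
```

Back-to-back meetings (one ending at 11:00, the next starting at 11:00) are treated as non-overlapping in this sketch.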
113
+
114
+ ### Anti-reward-hacking penalties
115
+
116
+ - Short email (`< 20` words): **−0.30**
117
+ - Missing `meeting_details`: **−0.40**
118
+ - Generic / templated phrasing: **−0.10**
119
+ - Overly long email (`> 1500` chars): **−0.15**
120
+
121
+ These were added because GRPO will find shortcuts. During the first run, the model collapsed to a single short, safe response; the penalties combined with KL regularization resolved it.
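The four penalties can be summed as in the sketch below. The thresholds and amounts come from the list above; everything else is an assumption. The README does not describe how generic phrasing is detected, so that signal is passed in as a flag, and `penalty_total` is a hypothetical name.

```python
def penalty_total(action, is_generic=False):
    """Sum the anti-reward-hacking penalties that apply to one action dict.

    `is_generic` stands in for the generic-phrasing detector, which is not
    specified in the README.
    """
    email = action.get("email_reply") or ""
    total = 0.0
    if len(email.split()) < 20:  # short email: fewer than 20 words
        total += 0.30
    if not action.get("meeting_details"):  # missing or empty meeting_details
        total += 0.40
    if is_generic:  # generic / templated phrasing
        total += 0.10
    if len(email) > 1500:  # overly long email: more than 1500 characters
        total += 0.15
    return total


print(penalty_total({"email_reply": "OK", "meeting_details": None}))
```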
122
+
123
+ ---
124
+
125
+ ## API endpoints
126
+
127
+ | Endpoint | Method | Description |
128
+ |----------|--------|-------------|
129
+ | `/reset?task=easy\|medium\|hard` | POST | Start new episode, returns observation |
130
+ | `/step` | POST | Submit action, returns observation/reward/done/info |
131
+ | `/state` | GET | Current state |
132
+ | `/tasks` | GET | List all tasks |
133
+ | `/health` | GET | Health check |
134
+ | `/metadata` | GET | Environment info |
135
+ | `/schema` | GET | Action / observation / state schemas |
136
+
137
+ Full interactive docs: https://devanshudon-exec-assist.hf.space/docs
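A minimal sketch of how a client might build the `/reset` and `/step` calls with the standard library. Several things here are assumptions: the local base URL, the helper names, and above all the `/step` body shape (the action posted directly as JSON). The interactive docs are the authoritative source for the request schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000"  # assumption: a locally running server


def reset_request(task, base_url=BASE_URL):
    """Build (but do not send) the POST request that starts a new episode."""
    return Request(f"{base_url}/reset?{urlencode({'task': task})}", data=b"", method="POST")


def step_request(action, base_url=BASE_URL):
    """Build (but do not send) the POST request that submits an action.

    Assumption: the action dict is the whole JSON body.
    """
    body = json.dumps(action).encode("utf-8")
    return Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example action following the action-space fields; all values are made up.
example_action = {
    "email_reply": "Hello Alice, thank you for reaching out. I have booked the slot below.",
    "calendar_action": "book",
    "meeting_details": {
        "participants": ["alice@example.com"],
        "start_time": "2026-04-06T10:00:00",
        "end_time": "2026-04-06T10:30:00",
        "subject": "Project sync",
        "location": None,
        "proposed_alternatives": None,
    },
}
```

Passing either request object to `urllib.request.urlopen` sends it; the sketch stops short of that so it runs without a server.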
138
+
139
+ ---
140
+
141
+ ## Setup & usage
142
+
143
+ ### Run the environment locally
144
+
145
+ ```bash
146
+ git clone https://huggingface.co/spaces/DevanshuDon/exec-assist
147
+ cd exec-assist
148
+ pip install -r requirements.txt
149
+ uvicorn server.app:app --port 8000
150
+ # open http://127.0.0.1:8000/docs
151
+ ```
152
+
153
+ ### Reproduce the baseline
154
+
155
+ ```bash
156
+ export APIBASEURL=https://openrouter.ai/api/v1
157
+ export MODELNAME=nvidia/nemotron-3-super-120b-a12b:free
158
+ export HFTOKEN=your-openrouter-key
159
+ python inference.py
160
+ ```
161
+
162
+ Expected output (structured `[START] / [STEP] / [END]` logs as required):
163
+ ```
164
+ [START] task=easy env=exec-assist model=...
165
+ [STEP] step=1 action=assistant(easy) reward=0.32 done=true error=null
166
+ [END] success=false steps=1 score=0.315 rewards=0.32
167
+ ```
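The structured log lines are simple enough to parse with a regular expression, which is handy when collecting scores across runs. This sketch assumes every field is a `key=value` pair with no spaces in the value, as in the sample above; `parse_log_line` is a hypothetical helper.

```python
import re

LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line):
    """Split a '[TAG] key=value ...' line into (tag, fields); return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    # Values are kept as strings; callers convert reward/score to float as needed.
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    return tag, fields


print(parse_log_line("[END] success=false steps=1 score=0.315 rewards=0.32"))
```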
168
+
169
+ ### Train and evaluate the model
170
+
171
+ Open `train_colab.ipynb` in Google Colab, set the runtime to T4 GPU, and choose Run all. Total time is ~50 min including evaluation. The notebook writes `training_results.png` and `results.json`.
172
+
173
+ ### Docker
174
+
175
+ ```bash
176
+ docker build -t exec-assist .
177
+ docker run -p 7860:7860 exec-assist
178
+ ```
179
+
180
+ ---
181
+
182
+ ## Training pipeline
183
+
184
+ **Stack:** TRL `GRPOTrainer` + HuggingFace Transformers, Qwen2.5-0.5B-Instruct, free Colab T4.
185
+
186
+ **Approach:** Pre-collect 90 scenarios from the deployed HF Space (30 each for easy, medium, and hard) into a `Dataset`, one scenario per row, with the scenario payload stored in its own column. The reward function receives that column as a keyword argument and scores each completion deterministically, with no API calls during training. This decouples training from environment latency.
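TRL passes extra dataset columns to a reward function as keyword arguments, one list entry per completion, which is what makes the offline approach work. The sketch below shows that shape only. The column name `scenario`, the helpers `extract_action` and `make_reward_fn`, and the injected `score_fn` are assumptions, not the notebook's actual code, and completions are assumed to be plain strings.

```python
import json


def extract_action(completion):
    """Parse the text between the first '{' and the last '}' as JSON; None on failure."""
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return None


def make_reward_fn(score_fn):
    """Wrap a deterministic scorer score_fn(action, scenario) as a GRPO-style reward function."""

    def reward_fn(completions, scenario, **kwargs):
        # `scenario` is the dataset column, aligned index-by-index with `completions`.
        rewards = []
        for completion, scen in zip(completions, scenario):
            action = extract_action(completion)
            # Unparseable output gets 0.0 here; the real grader may handle it differently.
            rewards.append(0.0 if action is None else float(score_fn(action, scen)))
        return rewards

    return reward_fn
```

A function built this way can be passed in the `reward_funcs` argument of `GRPOTrainer`; no HTTP call to the Space happens inside the training loop.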
187
+
188
+ **Hyperparameters (final, working config):**
189
+
190
+ ```python
191
+ GRPOConfig(
192
+ learning_rate=1e-6, # critical — 5e-6 caused collapse
193
+ per_device_train_batch_size=2,
194
+ gradient_accumulation_steps=4,
195
+ num_generations=8, # diversity within group
196
+ num_train_epochs=3,
197
+ beta=0.1, # KL penalty — prevents mode collapse
198
+ fp16=False, bf16=False, # fp32 for stable gradients
199
+ gradient_checkpointing=True,
200
+ )
201
+ ```
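The 270-step figure is consistent with this config if the batch size counts completions, not prompts. That holds for recent TRL releases as far as I know, but it is an assumption about the version used here. The function name is hypothetical.

```python
def optimizer_steps(num_prompts, epochs, num_generations,
                    per_device_batch, grad_accum, num_devices=1):
    """Optimizer steps when every prompt is sampled num_generations times per epoch.

    Assumption: per_device_batch counts completions, not prompts.
    """
    completions_per_step = per_device_batch * grad_accum * num_devices
    completions_per_epoch = num_prompts * num_generations
    return epochs * completions_per_epoch // completions_per_step


# 90 scenarios, 3 epochs, 8 generations, batch 2 x accumulation 4 on one T4.
print(optimizer_steps(90, 3, 8, 2, 4))  # 270
```

Under that reading each optimizer step covers exactly one prompt and its group of 8 completions, so every entry in the training log is the mean reward of a single group.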
202
+
203
+ **The collapse and the fix.** The initial run (1 epoch, lr=5e-6, beta=0) collapsed: the trained model scored exactly 0.2 on every prompt regardless of input, having found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), reducing the learning rate to 1e-6, and increasing `num_generations` from 4 to 8 produced the clean training curve shown above.
204
+
205
+ **Anti-reward-hacking observations during training:** The policy tried to game several signals: outputting only the JSON without an email body (caught by the short-email penalty), proposing booking times outside working hours (caught by the scheduling check), and repeating the prompt back as a "reply" (caught by the generic-phrasing detector). Each penalty fired during early steps and stopped firing as training progressed.
206
+
207
+ ---
208
+
209
+ ## Repository structure
210
+
211
+ ```
212
+ exec-assist/
213
+ ├── server/
214
+ │ ├── app.py # FastAPI app + environment logic
215
+ │ ├── models.py # Pydantic Action/Observation/State models
216
+ │ └── data.py # Scenario generation, scoring functions, LLM judge
217
+ ├── client.py # EnvClient wrapper (Gym-style)
218
+ ├── inference.py # Baseline inference (required, structured logs)
219
+ ├── train_colab.ipynb # GRPO training notebook
220
+ ├── training_results.png # Training curves + baseline-vs-trained
221
+ ├── results.json # Raw evaluation data + 270-step training log
222
+ ├── blog_post.md # Mini-blog write-up
223
+ ├── openenv.yaml # OpenEnv manifest
224
+ ├── Dockerfile # Python 3.10, port 7860
225
+ ├── requirements.txt
226
+ └── README.md # This file
227
+ ```
228
+
229
+ ---
230
+
231
+ ## Compliance checklist
232
+
233
+ - ✅ Built on **OpenEnv** (latest release, `openenv-core>=0.2.0`)
234
+ - ✅ Real-world task simulation (not games or toys)
235
+ - ✅ Full OpenEnv spec — typed Pydantic models for Action/Observation/State, `step()`/`reset()`/`state()` endpoints, `openenv.yaml` manifest
236
+ - ✅ **3 tasks** with deterministic graders, scores in [0, 1], easy → medium → hard difficulty progression
237
+ - ✅ Meaningful reward function with **partial-progress signal** + anti-reward-hacking penalties
238
+ - ✅ **Baseline inference script** (`inference.py`) using OpenAI client, reads `APIBASEURL`/`MODELNAME`/`HFTOKEN`, structured `[START]/[STEP]/[END]` logs
239
+ - ✅ **Training script** (TRL GRPO) with reproducible Colab notebook
240
+ - ✅ **Real training evidence** — reward curves, baseline vs. trained, before/after numbers (above)
241
+ - ✅ Deployed to **HuggingFace Space** with Docker
242
+ - ✅ Working **Dockerfile** (Python 3.10): `docker build` followed by `docker run` succeeds
243
+ - ✅ README with environment description, action/observation spaces, setup, baseline scores
244
+
245
+ ---
246
+
247
+ ## Author
248
+
249
+ **Devanshu** ([@DevanshuDon](https://huggingface.co/DevanshuDon)) — built for OpenEnv Hackathon, April 2026.
results.json ADDED
@@ -0,0 +1,1167 @@
1
+ {
2
+ "baseline": {
3
+ "easy": [
4
+ 0.2,
5
+ 0.65,
6
+ 0.65,
7
+ 0.65,
8
+ 0.65,
9
+ 0.2,
10
+ 0.2,
11
+ 0.0,
12
+ 0.04999999999999999,
13
+ 0.2
14
+ ],
15
+ "medium": [
16
+ 0.15714285714285714,
17
+ 0.11428571428571427,
18
+ 0.15714285714285714,
19
+ 0.15714285714285714,
20
+ 0.15714285714285714,
21
+ 0.5471428571428572,
22
+ 0.5471428571428572,
23
+ 0.15714285714285714,
24
+ 0.11428571428571427,
25
+ 0.15714285714285714
26
+ ],
27
+ "hard": [
28
+ 0.14785714285714285,
29
+ 0.14785714285714285,
30
+ 0.10071428571428576,
31
+ 0.14785714285714285,
32
+ 0.5498571428571429,
33
+ 0.0,
34
+ 0.14785714285714285,
35
+ 0.14785714285714285,
36
+ 0.5498571428571429,
37
+ 0.5498571428571429
38
+ ]
39
+ },
40
+ "trained": {
41
+ "easy": [
42
+ 1.0,
43
+ 1.0,
44
+ 1.0,
45
+ 1.0,
46
+ 1.0,
47
+ 1.0,
48
+ 1.0,
49
+ 1.0,
50
+ 1.0,
51
+ 0.95
52
+ ],
53
+ "medium": [
54
+ 0.7571428571428571,
55
+ 0.6842857142857143,
56
+ 0.7571428571428571,
57
+ 0.7571428571428571,
58
+ 0.8171428571428572,
59
+ 0.7571428571428571,
60
+ 0.7271428571428571,
61
+ 0.7571428571428571,
62
+ 0.7121428571428572,
63
+ 0.7271428571428571
64
+ ],
65
+ "hard": [
66
+ 0.6518571428571429,
67
+ 0.6688571428571428,
68
+ 0.7407142857142858,
69
+ 0.7028571428571428,
70
+ 0.8010000000000002,
71
+ 0.7878571428571429,
72
+ 0.7878571428571429,
73
+ 0.7878571428571429,
74
+ 0.7878571428571429,
75
+ 0.6518571428571429
76
+ ]
77
+ },
78
+ "training_log": [
79
+ {
80
+ "step": 1,
81
+ "reward": 0.25410714745521545
82
+ },
83
+ {
84
+ "step": 2,
85
+ "reward": 0.036964286118745804
86
+ },
87
+ {
88
+ "step": 3,
89
+ "reward": 0.12232142686843872
90
+ },
91
+ {
92
+ "step": 4,
93
+ "reward": 0.14007142186164856
94
+ },
95
+ {
96
+ "step": 5,
97
+ "reward": 0.10798214375972748
98
+ },
99
+ {
100
+ "step": 6,
101
+ "reward": 0.26973217725753784
102
+ },
103
+ {
104
+ "step": 7,
105
+ "reward": 0.21267856657505035
106
+ },
107
+ {
108
+ "step": 8,
109
+ "reward": 0.2703035771846771
110
+ },
111
+ {
112
+ "step": 9,
113
+ "reward": 0.28862500190734863
114
+ },
115
+ {
116
+ "step": 10,
117
+ "reward": 0.4040178656578064
118
+ },
119
+ {
120
+ "step": 11,
121
+ "reward": 0.29374998807907104
122
+ },
123
+ {
124
+ "step": 12,
125
+ "reward": 0.3059999942779541
126
+ },
127
+ {
128
+ "step": 13,
129
+ "reward": 0.3656250238418579
130
+ },
131
+ {
132
+ "step": 14,
133
+ "reward": 0.40044641494750977
134
+ },
135
+ {
136
+ "step": 15,
137
+ "reward": 0.12955357134342194
138
+ },
139
+ {
140
+ "step": 16,
141
+ "reward": 0.408482164144516
142
+ },
143
+ {
144
+ "step": 17,
145
+ "reward": 0.46098214387893677
146
+ },
147
+ {
148
+ "step": 18,
149
+ "reward": 0.49732139706611633
150
+ },
151
+ {
152
+ "step": 19,
153
+ "reward": 0.4546428620815277
154
+ },
155
+ {
156
+ "step": 20,
157
+ "reward": 0.4602678418159485
158
+ },
159
+ {
160
+ "step": 21,
161
+ "reward": 0.4337499737739563
162
+ },
163
+ {
164
+ "step": 22,
165
+ "reward": 0.579464316368103
166
+ },
167
+ {
168
+ "step": 23,
169
+ "reward": 0.5285714268684387
170
+ },
171
+ {
172
+ "step": 24,
173
+ "reward": 0.27405357360839844
174
+ },
175
+ {
176
+ "step": 25,
177
+ "reward": 0.44980356097221375
178
+ },
179
+ {
180
+ "step": 26,
181
+ "reward": 0.5672321319580078
182
+ },
183
+ {
184
+ "step": 27,
185
+ "reward": 0.16444642841815948
186
+ },
187
+ {
188
+ "step": 28,
189
+ "reward": 0.4348214268684387
190
+ },
191
+ {
192
+ "step": 29,
193
+ "reward": 0.35455358028411865
194
+ },
195
+ {
196
+ "step": 30,
197
+ "reward": 0.40312498807907104
198
+ },
199
+ {
200
+ "step": 31,
201
+ "reward": 0.5915178060531616
202
+ },
203
+ {
204
+ "step": 32,
205
+ "reward": 0.42767855525016785
206
+ },
207
+ {
208
+ "step": 33,
209
+ "reward": 0.4596250057220459
210
+ },
211
+ {
212
+ "step": 34,
213
+ "reward": 0.357142835855484
214
+ },
215
+ {
216
+ "step": 35,
217
+ "reward": 0.34437501430511475
218
+ },
219
+ {
220
+ "step": 36,
221
+ "reward": 0.21607142686843872
222
+ },
223
+ {
224
+ "step": 37,
225
+ "reward": 0.23191072046756744
226
+ },
227
+ {
228
+ "step": 38,
229
+ "reward": 0.4008482098579407
230
+ },
231
+ {
232
+ "step": 39,
233
+ "reward": 0.08360714465379715
234
+ },
235
+ {
236
+ "step": 40,
237
+ "reward": 0.28391069173812866
238
+ },
239
+ {
240
+ "step": 41,
241
+ "reward": 0.39952677488327026
242
+ },
243
+ {
244
+ "step": 42,
245
+ "reward": 0.3128303587436676
246
+ },
247
+ {
248
+ "step": 43,
249
+ "reward": 0.5379464626312256
250
+ },
251
+ {
252
+ "step": 44,
253
+ "reward": 0.6946428418159485
254
+ },
255
+ {
256
+ "step": 45,
257
+ "reward": 0.48794642090797424
258
+ },
259
+ {
260
+ "step": 46,
261
+ "reward": 0.5474107265472412
262
+ },
263
+ {
264
+ "step": 47,
265
+ "reward": 0.3933393061161041
266
+ },
267
+ {
268
+ "step": 48,
269
+ "reward": 0.565625011920929
270
+ },
271
+ {
272
+ "step": 49,
273
+ "reward": 0.4785892963409424
274
+ },
275
+ {
276
+ "step": 50,
277
+ "reward": 0.46958038210868835
278
+ },
279
+ {
280
+ "step": 51,
281
+ "reward": 0.36991071701049805
282
+ },
283
+ {
284
+ "step": 52,
285
+ "reward": 0.2941964268684387
286
+ },
287
+ {
288
+ "step": 53,
289
+ "reward": 0.4853571653366089
290
+ },
291
+ {
292
+ "step": 54,
293
+ "reward": 0.23973214626312256
294
+ },
295
+ {
296
+ "step": 55,
297
+ "reward": 0.7227678298950195
298
+ },
299
+ {
300
+ "step": 56,
301
+ "reward": 0.4734821319580078
302
+ },
303
+ {
304
+ "step": 57,
305
+ "reward": 0.5602678656578064
306
+ },
307
+ {
308
+ "step": 58,
309
+ "reward": 0.5581071376800537
310
+ },
311
+ {
312
+ "step": 59,
313
+ "reward": 0.6915178298950195
314
+ },
315
+ {
316
+ "step": 60,
317
+ "reward": 0.328830361366272
318
+ },
319
+ {
320
+ "step": 61,
321
+ "reward": 0.3727678656578064
322
+ },
323
+ {
324
+ "step": 62,
325
+ "reward": 0.4290178418159485
326
+ },
327
+ {
328
+ "step": 63,
329
+ "reward": 0.5301785469055176
330
+ },
331
+ {
332
+ "step": 64,
333
+ "reward": 0.29794642329216003
334
+ },
335
+ {
336
+ "step": 65,
337
+ "reward": 0.5418839454650879
338
+ },
339
+ {
340
+ "step": 66,
341
+ "reward": 0.5446428656578064
342
+ },
343
+ {
344
+ "step": 67,
345
+ "reward": 0.30937498807907104
346
+ },
347
+ {
348
+ "step": 68,
349
+ "reward": 0.571696400642395
350
+ },
351
+ {
352
+ "step": 69,
353
+ "reward": 0.5544642806053162
354
+ },
355
+ {
356
+ "step": 70,
357
+ "reward": 0.6167857646942139
358
+ },
359
+ {
360
+ "step": 71,
361
+ "reward": 0.45669642090797424
362
+ },
363
+ {
364
+ "step": 72,
365
+ "reward": 0.31955355405807495
366
+ },
367
+ {
368
+ "step": 73,
369
+ "reward": 0.5181249976158142
370
+ },
371
+ {
372
+ "step": 74,
373
+ "reward": 0.4415178596973419
374
+ },
375
+ {
376
+ "step": 75,
377
+ "reward": 0.5451785326004028
378
+ },
379
+ {
380
+ "step": 76,
381
+ "reward": 0.3028392791748047
382
+ },
383
+ {
384
+ "step": 77,
385
+ "reward": 0.2091071605682373
386
+ },
387
+ {
388
+ "step": 78,
389
+ "reward": 0.536339282989502
390
+ },
391
+ {
392
+ "step": 79,
393
+ "reward": 0.2366071492433548
394
+ },
395
+ {
396
+ "step": 80,
397
+ "reward": 0.3268928527832031
398
+ },
399
+ {
400
+ "step": 81,
401
+ "reward": 0.5390892624855042
402
+ },
403
+ {
404
+ "step": 82,
405
+ "reward": 0.4825892746448517
406
+ },
407
+ {
408
+ "step": 83,
409
+ "reward": 0.46875
410
+ },
411
+ {
412
+ "step": 84,
413
+ "reward": 0.7821428775787354
414
+ },
415
+ {
416
+ "step": 85,
417
+ "reward": 0.4580357074737549
418
+ },
419
+ {
420
+ "step": 86,
421
+ "reward": 0.5209821462631226
422
+ },
423
+ {
424
+ "step": 87,
425
+ "reward": 0.4017857313156128
426
+ },
427
+ {
428
+ "step": 88,
429
+ "reward": 0.660178542137146
430
+ },
431
+ {
432
+ "step": 89,
433
+ "reward": 0.5458393096923828
434
+ },
435
+ {
436
+ "step": 90,
437
+ "reward": 0.7919642925262451
438
+ },
439
+ {
440
+ "step": 91,
441
+ "reward": 0.4300000071525574
442
+ },
443
+ {
444
+ "step": 92,
445
+ "reward": 0.501964271068573
446
+ },
447
+ {
448
+ "step": 93,
449
+ "reward": 0.6446428298950195
450
+ },
451
+ {
452
+ "step": 94,
453
+ "reward": 0.5094642639160156
454
+ },
455
+ {
456
+ "step": 95,
457
+ "reward": 0.5647678375244141
458
+ },
459
+ {
460
+ "step": 96,
461
+ "reward": 0.6352678537368774
462
+ },
463
+ {
464
+ "step": 97,
465
+ "reward": 0.5024999976158142
466
+ },
467
+ {
468
+ "step": 98,
469
+ "reward": 0.515874981880188
470
+ },
471
+ {
472
+ "step": 99,
473
+ "reward": 0.46294644474983215
474
+ },
475
+ {
476
+ "step": 100,
477
+ "reward": 0.8723214268684387
478
+ },
479
+ {
480
+ "step": 101,
481
+ "reward": 0.5212500095367432
482
+ },
483
+ {
484
+ "step": 102,
485
+ "reward": 0.671875
486
+ },
487
+ {
488
+ "step": 103,
489
+ "reward": 0.5864999890327454
490
+ },
491
+ {
492
+ "step": 104,
493
+ "reward": 0.6749999523162842
494
+ },
495
+ {
496
+ "step": 105,
497
+ "reward": 0.5629464387893677
498
+ },
499
+ {
500
+ "step": 106,
501
+ "reward": 0.5281071662902832
502
+ },
503
+ {
504
+ "step": 107,
505
+ "reward": 0.6936607360839844
506
+ },
507
+ {
508
+ "step": 108,
509
+ "reward": 0.6465713977813721
510
+ },
511
+ {
512
+ "step": 109,
513
+ "reward": 0.5022321343421936
514
+ },
515
+ {
516
+ "step": 110,
517
+ "reward": 0.5313928127288818
518
+ },
519
+ {
520
+ "step": 111,
521
+ "reward": 0.6238213777542114
522
+ },
523
+ {
524
+ "step": 112,
525
+ "reward": 0.6399999856948853
526
+ },
527
+ {
528
+ "step": 113,
529
+ "reward": 0.7440178394317627
530
+ },
531
+ {
532
+ "step": 114,
533
+ "reward": 0.5431250333786011
534
+ },
535
+ {
536
+ "step": 115,
537
+ "reward": 0.6102678775787354
538
+ },
539
+ {
540
+ "step": 116,
541
+ "reward": 0.6504464149475098
542
+ },
543
+ {
544
+ "step": 117,
545
+ "reward": 0.7581071257591248
546
+ },
547
+ {
548
+ "step": 118,
549
+ "reward": 0.6492946147918701
550
+ },
551
+ {
552
+ "step": 119,
553
+ "reward": 0.6843750476837158
554
+ },
555
+ {
556
+ "step": 120,
557
+ "reward": 0.5536428689956665
558
+ },
559
+ {
560
+ "step": 121,
561
+ "reward": 0.653249979019165
562
+ },
563
+ {
564
+ "step": 122,
565
+ "reward": 0.5297499895095825
566
+ },
567
+ {
568
+ "step": 123,
569
+ "reward": 0.6578571796417236
570
+ },
571
+ {
572
+ "step": 124,
573
+ "reward": 0.8348214626312256
574
+ },
575
+ {
576
+ "step": 125,
577
+ "reward": 0.349785715341568
578
+ },
579
+ {
580
+ "step": 126,
581
+ "reward": 0.7781250476837158
582
+ },
583
+ {
584
+ "step": 127,
585
+ "reward": 0.8968750238418579
586
+ },
587
+ {
588
+ "step": 128,
589
+ "reward": 0.8093750476837158
590
+ },
591
+ {
592
+ "step": 129,
593
+ "reward": 0.6137499809265137
594
+ },
595
+ {
596
+ "step": 130,
597
+ "reward": 0.8910714387893677
598
+ },
599
+ {
600
+ "step": 131,
601
+ "reward": 0.4497321546077728
602
+ },
603
+ {
604
+ "step": 132,
605
+ "reward": 0.43910714983940125
606
+ },
607
+ {
608
+ "step": 133,
609
+ "reward": 0.48475003242492676
610
+ },
611
+ {
612
+ "step": 134,
613
+ "reward": 0.90625
614
+ },
615
+ {
616
+ "step": 135,
617
+ "reward": 0.6499999761581421
618
+ },
619
+ {
620
+ "step": 136,
621
+ "reward": 0.4575803279876709
622
+ },
623
+ {
624
+ "step": 137,
625
+ "reward": 0.5043749809265137
626
+ },
627
+ {
628
+ "step": 138,
629
+ "reward": 0.5194821357727051
630
+ },
631
+ {
632
+ "step": 139,
633
+ "reward": 0.6681874990463257
634
+ },
635
+ {
636
+ "step": 140,
637
+ "reward": 0.6075000166893005
638
+ },
639
+ {
640
+ "step": 141,
641
+ "reward": 0.5226786136627197
642
+ },
643
+ {
644
+ "step": 142,
645
+ "reward": 0.544910728931427
646
+ },
647
+ {
648
+ "step": 143,
649
+ "reward": 0.3448214530944824
650
+ },
651
+ {
652
+ "step": 144,
653
+ "reward": 0.5093749761581421
654
+ },
655
+ {
656
+ "step": 145,
657
+ "reward": 0.5335178375244141
658
+ },
659
+ {
660
+ "step": 146,
661
+ "reward": 0.5901785492897034
662
+ },
663
+ {
664
+ "step": 147,
665
+ "reward": 0.6714285612106323
666
+ },
667
+ {
668
+ "step": 148,
669
+ "reward": 0.6086249351501465
670
+ },
671
+ {
672
+ "step": 149,
673
+ "reward": 0.4005535840988159
674
+ },
675
+ {
676
+ "step": 150,
677
+ "reward": 0.4816071391105652
678
+ },
679
+ {
680
+ "step": 151,
681
+ "reward": 0.5088571310043335
682
+ },
683
+ {
684
+ "step": 152,
685
+ "reward": 0.8410714268684387
686
+ },
687
+ {
688
+ "step": 153,
689
+ "reward": 0.661517858505249
690
+ },
691
+ {
692
+ "step": 154,
693
+ "reward": 0.4182142913341522
694
+ },
695
+ {
696
+ "step": 155,
697
+ "reward": 0.5462499856948853
698
+ },
699
+ {
700
+ "step": 156,
701
+ "reward": 0.5656249523162842
702
+ },
703
+ {
704
+ "step": 157,
705
+ "reward": 0.6449106931686401
706
+ },
707
+ {
708
+ "step": 158,
709
+ "reward": 0.8441964387893677
710
+ },
711
+ {
712
+ "step": 159,
713
+ "reward": 0.6388392448425293
714
+ },
715
+ {
716
+ "step": 160,
717
+ "reward": 0.3429464101791382
718
+ },
719
+ {
720
+ "step": 161,
721
+ "reward": 0.4982143044471741
722
+ },
723
+ {
724
+ "step": 162,
725
+ "reward": 0.4846428632736206
726
+ },
727
+ {
728
+ "step": 163,
729
+ "reward": 0.4471428394317627
730
+ },
731
+ {
732
+ "step": 164,
733
+ "reward": 0.6001249551773071
734
+ },
735
+ {
736
+ "step": 165,
737
+ "reward": 0.735714316368103
738
+ },
739
+ {
740
+ "step": 166,
741
+ "reward": 0.641964316368103
742
+ },
743
+ {
744
+ "step": 167,
745
+ "reward": 0.6100000143051147
746
+ },
747
+ {
748
+ "step": 168,
749
+ "reward": 0.7066963911056519
750
+ },
751
+ {
752
+ "step": 169,
753
+ "reward": 0.6348214149475098
754
+ },
755
+ {
756
+ "step": 170,
757
+ "reward": 0.5228928327560425
758
+ },
759
+ {
760
+ "step": 171,
761
+ "reward": 0.5739464163780212
762
+ },
763
+ {
764
+ "step": 172,
765
+ "reward": 0.6174107193946838
766
+ },
767
+ {
768
+ "step": 173,
769
+ "reward": 0.5413392782211304
770
+ },
771
+ {
772
+ "step": 174,
773
+ "reward": 0.5052499771118164
774
+ },
775
+ {
776
+ "step": 175,
777
+ "reward": 0.5122321248054504
778
+ },
779
+ {
780
+ "step": 176,
781
+ "reward": 0.2723214328289032
782
+ },
783
+ {
784
+ "step": 177,
785
+ "reward": 0.796875
786
+ },
787
+ {
788
+ "step": 178,
789
+ "reward": 0.5441964864730835
790
+ },
791
+ {
792
+ "step": 179,
793
+ "reward": 0.578125
794
+ },
795
+ {
796
+ "step": 180,
797
+ "reward": 0.5441964268684387
798
+ },
799
+ {
800
+ "step": 181,
801
+ "reward": 0.4543749988079071
802
+ },
803
+ {
804
+ "step": 182,
805
+ "reward": 0.31626784801483154
806
+ },
807
+ {
808
+ "step": 183,
809
+ "reward": 0.6285713911056519
810
+ },
811
+ {
812
+ "step": 184,
813
+ "reward": 0.6952678561210632
814
+ },
815
+ {
816
+ "step": 185,
817
+ "reward": 0.484375
818
+ },
819
+ {
820
+ "step": 186,
821
+ "reward": 0.5447321534156799
822
+ },
823
+ {
824
+ "step": 187,
825
+ "reward": 0.6228570938110352
826
+ },
827
+ {
828
+ "step": 188,
829
+ "reward": 0.5247321128845215
830
+ },
831
+ {
832
+ "step": 189,
833
+ "reward": 0.6542679071426392
834
+ },
835
+ {
836
+ "step": 190,
837
+ "reward": 0.5883928537368774
838
+ },
839
+ {
840
+ "step": 191,
841
+ "reward": 0.5099107027053833
842
+ },
843
+ {
844
+ "step": 192,
845
+ "reward": 0.49196428060531616
846
+ },
847
+ {
848
+ "step": 193,
849
+ "reward": 0.6783928871154785
850
+ },
851
+ {
852
+ "step": 194,
853
+ "reward": 0.8446428775787354
854
+ },
855
+ {
856
+ "step": 195,
857
+ "reward": 0.27032142877578735
858
+ },
859
+ {
860
+ "step": 196,
861
+ "reward": 0.5037678480148315
862
+ },
863
+ {
864
+ "step": 197,
865
+ "reward": 0.7468750476837158
866
+ },
867
+ {
868
+ "step": 198,
869
+ "reward": 0.5247321128845215
870
+ },
871
+ {
872
+ "step": 199,
873
+ "reward": 0.5624642968177795
874
+ },
875
+ {
876
+ "step": 200,
877
+ "reward": 0.7284821271896362
878
+ },
879
+ {
880
+ "step": 201,
881
+ "reward": 0.5191071033477783
882
+ },
883
+ {
884
+ "step": 202,
885
+ "reward": 0.6629464030265808
886
+ },
887
+ {
888
+ "step": 203,
889
+ "reward": 0.5663928985595703
890
+ },
891
+ {
892
+ "step": 204,
893
+ "reward": 0.5843750238418579
894
+ },
895
+ {
896
+ "step": 205,
897
+ "reward": 0.6022321581840515
898
+ },
899
+ {
900
+ "step": 206,
901
+ "reward": 0.40528571605682373
902
+ },
903
+ {
904
+ "step": 207,
905
+ "reward": 0.3661428689956665
906
+ },
907
+ {
908
+ "step": 208,
909
+ "reward": 0.5410714149475098
910
+ },
911
+ {
912
+ "step": 209,
913
+ "reward": 0.8133928775787354
914
+ },
915
+ {
916
+ "step": 210,
917
+ "reward": 0.6950892806053162
918
+ },
919
+ {
920
+ "step": 211,
921
+ "reward": 0.6952678561210632
922
+ },
923
+ {
924
+ "step": 212,
925
+ "reward": 0.6540178656578064
926
+ },
927
+ {
928
+ "step": 213,
929
+ "reward": 0.642464280128479
930
+ },
931
+ {
932
+ "step": 214,
933
+ "reward": 0.37055355310440063
934
+ },
935
+ {
936
+ "step": 215,
937
+ "reward": 0.7700892686843872
938
+ },
939
+ {
940
+ "step": 216,
941
+ "reward": 0.6158928871154785
942
+ },
943
+ {
944
+ "step": 217,
945
+ "reward": 0.8133928775787354
946
+ },
947
+ {
948
+ "step": 218,
949
+ "reward": 0.5236964225769043
950
+ },
951
+ {
952
+ "step": 219,
953
+ "reward": 0.5915178656578064
954
+ },
955
+ {
956
+ "step": 220,
957
+ "reward": 0.5193749666213989
958
+ },
959
+ {
960
+ "step": 221,
961
+ "reward": 0.956250011920929
962
+ },
963
+ {
964
+ "step": 222,
965
+ "reward": 0.8441964387893677
966
+ },
967
+ {
968
+ "step": 223,
969
+ "reward": 0.5497321486473083
970
+ },
971
+ {
972
+ "step": 224,
973
+ "reward": 0.6256250143051147
974
+ },
975
+ {
976
+ "step": 225,
977
+ "reward": 0.637946367263794
978
+ },
979
+ {
980
+ "step": 226,
981
+ "reward": 0.6213749647140503
982
+ },
983
+ {
984
+ "step": 227,
985
+ "reward": 0.7883928418159485
986
+ },
987
+ {
988
+ "step": 228,
989
+ "reward": 0.6973214149475098
990
+ },
991
+ {
992
+ "step": 229,
993
+ "reward": 0.84375
994
+ },
995
+ {
996
+ "step": 230,
997
+ "reward": 0.6660535335540771
998
+ },
999
+ {
1000
+ "step": 231,
1001
+ "reward": 0.9249999523162842
1002
+ },
1003
+ {
1004
+ "step": 232,
1005
+ "reward": 0.5973213911056519
1006
+ },
1007
+ {
1008
+ "step": 233,
1009
+ "reward": 0.5616071224212646
1010
+ },
1011
+ {
1012
+ "step": 234,
1013
+ "reward": 0.6325000524520874
1014
+ },
1015
+ {
1016
+ "step": 235,
1017
+ "reward": 0.5837500095367432
1018
+ },
1019
+ {
1020
+ "step": 236,
1021
+ "reward": 0.4354107081890106
1022
+ },
1023
+ {
1024
+ "step": 237,
1025
+ "reward": 0.8660714626312256
1026
+ },
1027
+ {
1028
+ "step": 238,
1029
+ "reward": 0.6324999928474426
1030
+ },
1031
+ {
1032
+ "step": 239,
1033
+ "reward": 0.5628928542137146
1034
+ },
1035
+ {
1036
+ "step": 240,
1037
+ "reward": 0.6083928346633911
1038
+ },
1039
+ {
1040
+ "step": 241,
1041
+ "reward": 0.42701786756515503
1042
+ },
1043
+ {
1044
+ "step": 242,
1045
+ "reward": 0.890625
1046
+ },
1047
+ {
1048
+ "step": 243,
1049
+ "reward": 0.8290178775787354
1050
+ },
1051
+ {
1052
+ "step": 244,
1053
+ "reward": 0.6953214406967163
1054
+ },
1055
+ {
1056
+ "step": 245,
1057
+ "reward": 0.5653928518295288
1058
+ },
1059
+ {
1060
+ "step": 246,
1061
+ "reward": 0.6873035430908203
1062
+ },
1063
+ {
1064
+ "step": 247,
1065
+ "reward": 0.5102678537368774
1066
+ },
1067
+ {
1068
+ "step": 248,
1069
+ "reward": 0.5462678670883179
1070
+ },
1071
+ {
1072
+ "step": 249,
1073
+ "reward": 0.9468749761581421
1074
+ },
1075
+ {
1076
+ "step": 250,
1077
+ "reward": 0.4848214387893677
1078
+ },
1079
+ {
1080
+ "step": 251,
1081
+ "reward": 0.7349107265472412
1082
+ },
1083
+ {
1084
+ "step": 252,
1085
+ "reward": 0.5615178346633911
1086
+ },
1087
+ {
1088
+ "step": 253,
1089
+ "reward": 0.859375
1090
+ },
1091
+ {
1092
+ "step": 254,
1093
+ "reward": 0.706250011920929
1094
+ },
1095
+ {
1096
+ "step": 255,
1097
+ "reward": 0.7140178680419922
1098
+ },
1099
+ {
1100
+ "step": 256,
1101
+ "reward": 0.42228570580482483
1102
+ },
1103
+ {
1104
+ "step": 257,
1105
+ "reward": 0.6294910907745361
1106
+ },
1107
+ {
1108
+ "step": 258,
1109
+ "reward": 0.5381250381469727
1110
+ },
1111
+ {
1112
+ "step": 259,
1113
+ "reward": 0.6176071166992188
1114
+ },
1115
+ {
1116
+ "step": 260,
1117
+ "reward": 0.7239285707473755
1118
+ },
1119
+ {
1120
+ "step": 261,
1121
+ "reward": 0.5584821701049805
1122
+ },
1123
+ {
1124
+ "step": 262,
1125
+ "reward": 0.6481249928474426
1126
+ },
1127
+ {
1128
+ "step": 263,
1129
+ "reward": 0.7474821209907532
1130
+ },
1131
+ {
1132
+ "step": 264,
1133
+ "reward": 0.7473214268684387
1134
+ },
1135
+ {
1136
+ "step": 265,
1137
+ "reward": 0.5647231936454773
1138
+ },
1139
+ {
1140
+ "step": 266,
1141
+ "reward": 0.6187499761581421
1142
+ },
1143
+ {
1144
+ "step": 267,
1145
+ "reward": 0.5382499694824219
1146
+ },
1147
+ {
1148
+ "step": 268,
1149
+ "reward": 0.6027321219444275
1150
+ },
1151
+ {
1152
+ "step": 269,
1153
+ "reward": 0.6174999475479126
1154
+ },
1155
+ {
1156
+ "step": 270,
1157
+ "reward": 0.8348214626312256
1158
+ }
1159
+ ],
1160
+ "config": {
1161
+ "model": "Qwen/Qwen2.5-0.5B-Instruct",
1162
+ "n_per_task": 30,
1163
+ "num_generations": 8,
1164
+ "epochs": 3,
1165
+ "lr": 1e-06
1166
+ }
1167
+ }
training_results.png ADDED

Git LFS Details

  • SHA256: ccf993c0d7a261553a8f5cb86a1f2e0395a4d0d7ed7a691747071e47466aea57
  • Pointer size: 131 Bytes
  • Size of remote file: 162 kB