Update README.md
README.md CHANGED
@@ -200,6 +200,8 @@ docker run -p 7860:7860 exec-assist
**Approach:** Pre-collect 90 scenarios from the deployed HF Space (30 each across easy/medium/hard) into a `Dataset`, with each scenario stored as a column. The reward function receives the scenario as a kwarg and scores deterministically without API calls during training. This decouples training from environment latency.
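
The approach above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the scenario fields (`difficulty`, `required_terms`), the JSON-string storage, the keyword-matching score, and the names `build_rows` and `scenario_reward` are all assumptions made for the example. It also assumes completions arrive as plain strings.

```python
import json

TIERS = ("easy", "medium", "hard")


def build_rows(per_tier=30):
    """Build hypothetical pre-collected rows, one per scenario.

    Each row carries the scenario in its own "scenario" column (as a JSON
    string), so nothing has to be fetched from the environment later.
    """
    rows = []
    for tier in TIERS:
        for i in range(per_tier):
            scenario = {
                "id": f"{tier}-{i}",
                "difficulty": tier,
                # Hypothetical scoring spec; the real scenarios will differ.
                "required_terms": ["reschedule", "confirm"],
            }
            rows.append({
                "prompt": f"Handle scenario {tier}-{i}",
                "scenario": json.dumps(scenario),
            })
    return rows


def scenario_reward(completions, scenario, **kwargs):
    """Score each completion against its own scenario, with no API calls.

    `scenario` is the batch of values from the "scenario" column, aligned
    with `completions`. The score is the fraction of required terms found.
    """
    rewards = []
    for completion, raw in zip(completions, scenario):
        spec = json.loads(raw)
        text = completion.lower()
        hits = sum(term in text for term in spec["required_terms"])
        rewards.append(hits / len(spec["required_terms"]))
    return rewards
```

In a setup like this, the rows would typically be wrapped with `datasets.Dataset.from_list(rows)` and the function passed to TRL's `GRPOTrainer` via `reward_funcs`, which forwards extra dataset columns to reward functions as keyword arguments. Because the score depends only on the completion and the stored scenario, the same inputs always produce the same reward.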

+**A note on loss curves.** The plot above shows reward curves with moving averages, plus a reward-variance "convergence proxy" (lower variance indicates a stabilizing policy). GRPO's loss values were logged to TensorBoard during training but were not exported to the published `results.json` because of Colab session limits. For GRPO with verifiable rewards (RLVR), the reward signal is the primary training metric and is what the policy is directly optimized against. The variance plot therefore serves the same diagnostic purpose a loss curve would: it shows convergence over time.
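
One way such a moving average and reward-variance proxy could be computed is sketched below. The trailing-window definition, the window size, and the name `rolling_stats` are assumptions for illustration; the source does not say how the published plot was actually produced.

```python
from statistics import fmean, pvariance


def rolling_stats(rewards, window=10):
    """Return (moving_averages, rolling_variances) over a trailing window.

    For step i, both statistics are computed over the last `window`
    rewards up to and including step i (fewer at the start of the run).
    """
    means, variances = [], []
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        means.append(fmean(chunk))
        # Population variance of the window; it falls toward zero as
        # rewards settle, which is the "convergence proxy" reading.
        variances.append(pvariance(chunk))
    return means, variances
```

With a reward trace that oscillates early and then settles, the rolling variance drops as the window fills with similar values, which is the signal the note describes as a stabilizing policy.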
+
**Hyperparameters (final, working config):**
```python