DevanshuDon committed on
Commit
2cce177
·
verified ·
1 Parent(s): 5f21d0a

Update README.md

  1. README.md +2 -0
README.md CHANGED
@@ -200,6 +200,8 @@ docker run -p 7860:7860 exec-assist
200
 
201
  **Approach:** Pre-collect 90 scenarios from the deployed HF Space (30 each across easy/medium/hard) into a `Dataset`, with each scenario stored as a column. The reward function receives the scenario as a kwarg and scores deterministically without API calls during training. This decouples training from environment latency.
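The reward-function idea in the approach paragraph can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the scenario schema (`expected_keywords`), the keyword-overlap scoring rule, and the name `scenario_reward` are all hypothetical, and the calling convention (a list of completions plus a same-length `scenario` list passed as a keyword argument) is an assumption about how the trainer forwards dataset columns.

```python
from typing import Any

# Hypothetical pre-collected scenarios. The real columns gathered from the
# HF Space are not shown in this diff, so this schema is an assumption.
SCENARIOS = [
    {"difficulty": "easy", "expected_keywords": ["reschedule", "confirm"]},
    {"difficulty": "hard", "expected_keywords": ["decline", "delegate", "notify"]},
]


def scenario_reward(
    completions: list[str],
    scenario: list[dict[str, Any]],
    **kwargs: Any,
) -> list[float]:
    """Score each completion against its paired scenario.

    The scoring is a pure function of its inputs: no API calls and no
    randomness, so repeated calls return identical rewards.
    """
    rewards = []
    for text, sc in zip(completions, scenario):
        keywords = sc["expected_keywords"]
        lowered = text.lower()
        # Illustrative rule only: fraction of expected keywords mentioned.
        hits = sum(1 for kw in keywords if kw in lowered)
        rewards.append(hits / len(keywords) if keywords else 0.0)
    return rewards


rewards = scenario_reward(
    ["Please reschedule and confirm.", "I will decline."],
    scenario=SCENARIOS,
)
```

Because the scenario travels with each dataset row, the scorer never has to contact the deployed environment during training, which is the decoupling described above.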
202
 
203
+ **A note on loss curves.** The plot above shows reward curves with moving averages and a reward-variance "convergence proxy" (lower variance = stabilizing policy). GRPO's actual loss values were logged to TensorBoard during training but weren't exported to the published `results.json` due to Colab session limits — the reward signal is the primary training metric for GRPO with verifiable rewards (RLVR), and is what the policy is directly optimized against. The variance plot serves the same diagnostic purpose: showing convergence over time.
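For readers who want to reproduce the diagnostic described in this note, the sketch below computes a moving average and a rolling variance over a list of per-step rewards. It is an assumption-laden illustration: the function name `rolling_stats`, the window size, and the use of population variance are choices made here, not details taken from the project's plotting code or from `results.json`.

```python
from statistics import fmean, pvariance


def rolling_stats(rewards: list[float], window: int) -> tuple[list[float], list[float]]:
    """Return (moving averages, rolling variances) over full windows only."""
    if window < 1:
        raise ValueError("window must be at least 1")
    means: list[float] = []
    variances: list[float] = []
    for end in range(window, len(rewards) + 1):
        chunk = rewards[end - window:end]
        means.append(fmean(chunk))
        # Population variance of the window; values near zero suggest the
        # reward, and by proxy the policy, has stopped changing.
        variances.append(pvariance(chunk))
    return means, variances


means, variances = rolling_stats([0.0, 1.0, 1.0, 1.0], window=2)
# means == [0.5, 1.0, 1.0]; variances == [0.25, 0.0, 0.0]
```

A falling variance alone does not distinguish a converged policy from a collapsed one, so it is best read alongside the moving average.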
204
+
205
  **Hyperparameters (final, working config):**
206
 
207
  ```python