DevanshuDon committed 24 days ago

Commit 2cce177 (verified) · 1 parent: 5f21d0a

Update README.md

Files changed (1): README.md

@@ -200,6 +200,8 @@ docker run -p 7860:7860 exec-assist
 **Approach:** Pre-collect 90 scenarios from the deployed HF Space (30 each across easy/medium/hard) into a `Dataset`, with each scenario stored as a column. The reward function receives the scenario as a kwarg and scores deterministically without API calls during training. This decouples training from environment latency.
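As a rough illustration of the approach above, the sketch below shows a reward function that receives each scenario as a keyword argument and scores completions locally. The scenario fields (`difficulty`, `expected_action`), the substring scoring rule, and the name `scenario_reward` are all hypothetical; the README does not show the real schema or scorer. It also assumes completions arrive as plain strings and that the dataset column is named `scenario`.

```python
from typing import Any

# Hypothetical scenario records standing in for rows collected from the Space.
SCENARIOS = [
    {"difficulty": "easy", "expected_action": "reschedule"},
    {"difficulty": "hard", "expected_action": "decline"},
]


def scenario_reward(
    completions: list[str],
    scenario: list[dict[str, Any]],
    **kwargs: Any,
) -> list[float]:
    """Score each completion against its own scenario, with no network calls."""
    rewards = []
    for text, sc in zip(completions, scenario):
        # Illustrative rule only: 1.0 if the expected action is named, else 0.0.
        hit = sc["expected_action"] in text.lower()
        rewards.append(1.0 if hit else 0.0)
    return rewards


rewards = scenario_reward(
    ["I will reschedule the meeting.", "Accept the invite."],
    scenario=SCENARIOS,
)
```

Because the score depends only on the completion and the stored scenario, the same inputs always give the same reward, which is what lets training run without waiting on the environment.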
+**A note on loss curves.** The plot above shows reward curves with moving averages and a reward-variance "convergence proxy" (lower variance = stabilizing policy). GRPO's actual loss values were logged to TensorBoard during training but were not exported to the published `results.json` because of Colab session limits. For GRPO with verifiable rewards (RLVR), the reward signal is the primary training metric and is what the policy is directly optimized against. The variance plot serves the same diagnostic purpose as a loss curve: it shows convergence over time.
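For readers who want to reproduce this kind of diagnostic, here is a minimal sketch of a trailing moving average and a rolling reward variance. The reward values and the window size of 4 are made up for illustration; the window used for the published plot is not stated here.

```python
from statistics import mean, pvariance
from typing import Callable


def rolling(
    values: list[float],
    window: int,
    fn: Callable[[list[float]], float],
) -> list[float]:
    """Apply fn to every full trailing window of `values`."""
    return [
        fn(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Made-up per-step mean rewards, not taken from results.json.
step_rewards = [0.1, 0.5, 0.2, 0.6, 0.7, 0.65, 0.7, 0.68]

moving_avg = rolling(step_rewards, 4, mean)       # smoothed reward curve
reward_var = rolling(step_rewards, 4, pvariance)  # convergence proxy
```

A rising `moving_avg` together with a falling `reward_var` is the pattern the note describes as a stabilizing policy.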
 **Hyperparameters (final, working config):**
 ```python