Update BLOG.md
BLOG.md
CHANGED
|
@@ -476,116 +476,4 @@ On the training distribution: **yes, clearly.** On novel compounds: **not yet, b
|
|
| 476 |
|
| 477 |
Fork it. Run it. Beat it. Tell us where we got it wrong.
|
| 478 |
|
| 479 |
-
---
|
| 480 |
-
|
| 481 |
-
### Appendix A · Notation glossary
|
| 482 |
-
|
| 483 |
-
Every mathematical symbol used above, gathered for reference.
|
| 484 |
-
|
| 485 |
-
#### Greek letters
|
| 486 |
-
|
| 487 |
-
| Symbol | Reads as | Used for | Value(s) in this work |
|
| 488 |
-
| --- | --- | --- | --- |
|
| 489 |
-
| `α` | alpha | LoRA scaling coefficient: the `α/r` factor multiplies the low-rank update `B A`. **Not a learning rate.** | `α=16` (SFT), `α=32` (GRPO) |
|
| 490 |
-
| `β` | beta | KL-penalty coefficient in the GRPO loss; weights how strongly the policy is pulled toward the frozen reference. | `0.04` |
|
| 491 |
-
| `ε` | epsilon | (i) Numerical stabiliser added to `σ_R` when normalising advantages; (ii) PPO clip width, also written `clip`. | `1e-6`, `0.2` |
|
| 492 |
-
| `μ_R` | mu of R | Mean of the K within-group returns. | runtime |
|
| 493 |
-
| `σ_R` | sigma of R | Standard deviation of the K within-group returns. | runtime |
|
| 494 |
-
| `σ` | sigma | Generic standard deviation; used in plot error bars. | runtime |
|
| 495 |
-
| `τ` | tau | A full episode trajectory `(o_0, a_0, r_0, …, o_T, a_T, r_T)`. | runtime |
|
| 496 |
-
| `τ_1, τ_2` | tau-1, tau-2 | The Phase-1 / Phase-2 sub-trajectories of `τ`. | runtime |
|
| 497 |
-
| `Δ` | delta | Difference between two metrics (e.g. `Δ mean_final = RL − Base`). | reported per-row |
|
| 498 |
-
| `π` (`π_orch`) | pi | A policy. `π_orch` is the orchestrator's routing policy (see hierarchical-RL diagram). | learned |
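To make the `α/r` factor concrete, here is a minimal sketch in plain Python. The function name `lora_scale` and the constants are illustrative, not taken from the repo; the numbers are the `α` values in this table paired with the ranks listed under "Hyperparameters by name".

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Return the factor alpha / r that multiplies the low-rank update B A."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return alpha / rank


# Values quoted in this glossary (alpha from this table, r from the
# hyperparameter table). The names below are hypothetical.
SFT_SCALE = lora_scale(alpha=16, rank=32)   # 16 / 32
GRPO_SCALE = lora_scale(alpha=32, rank=16)  # 32 / 16

print(f"SFT scale:  {SFT_SCALE}")
print(f"GRPO scale: {GRPO_SCALE}")
```

Under these values the GRPO adapter's update is scaled four times more strongly than the SFT adapter's, which is independent of the learning rate.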
|
| 499 |
-
|
| 500 |
-
#### Reward / return symbols
|
| 501 |
-
|
| 502 |
-
| Symbol | Meaning |
|
| 503 |
-
| --- | --- |
|
| 504 |
-
| `R_i` | Group-relative return for rollout `i`: `R_i = terminal_reward_i + r_cross_i`. |
|
| 505 |
-
| `A_i` | GRPO advantage: `A_i = (R_i − μ_R) / (σ_R + ε)`. Standardised within the K-rollout group. |
|
| 506 |
-
| `r_code(...)` | Phase-2 grader score in `[0, 1]`: patch quality (file overlap + AST + syntax) or no-change detection. |
|
| 507 |
-
| `r_cross(τ)` | Counterfactual cross-phase reward, defined in §4. |
|
| 508 |
-
| `final` | Top-level grader output in `[0, 1]`: weighted sum of `p1_rca`, `p1_efficiency`, `patch_quality`, `no_change_detection`, `p2_efficiency`. |
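The definitions of `R_i` and `A_i` above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repo's code: the function name `group_advantages` is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption the glossary does not settle.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # numerical stabiliser from the Greek-letter table


def group_advantages(terminal_rewards, cross_rewards, eps=EPS):
    """Standardise returns within one K-rollout group.

    R_i = terminal_reward_i + r_cross_i
    A_i = (R_i - mu_R) / (sigma_R + eps)
    """
    if len(terminal_rewards) != len(cross_rewards):
        raise ValueError("one cross-phase reward is needed per rollout")
    returns = [t + c for t, c in zip(terminal_rewards, cross_rewards)]
    mu = fmean(returns)
    sigma = pstdev(returns)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in returns]


# A group of K = 4 rollouts with made-up rewards.
advantages = group_advantages([0.2, 0.4, 0.6, 0.8], [0.0, 0.0, 0.0, 0.0])
print(advantages)
```

When every rollout in a group earns the same return, `σ_R` is zero and all advantages come out as zero, so `ε` only guards against division by zero and the group contributes no policy-gradient signal.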
|
| 509 |
-
|
| 510 |
-
#### GRPO update symbols (per-token, per-segment)
|
| 511 |
-
|
| 512 |
-
| Symbol | Meaning |
|
| 513 |
-
| --- | --- |
|
| 514 |
-
| `plp` | Log-probability of an assistant token under the **policy** (current trainable model). |
|
| 515 |
-
| `rlp` | Log-probability of the same token under the **reference** model (frozen base). |
|
| 516 |
-
| `ratio` | `exp(plp − rlp)`, the importance-sampling ratio. |
|
| 517 |
-
| `unclipped`, `clipped` | `ratio · A_i` and `clamp(ratio, 1−ε, 1+ε) · A_i` respectively. |
|
| 518 |
-
| `pg_loss` | `−min(unclipped, clipped)`, the clipped surrogate (negated for minimisation). |
|
| 519 |
-
| `kl_loss` | `β · (rlp − plp)`, a per-token forward-KL approximation. |
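Read together, the rows above describe a per-token computation. The sketch below transcribes them into plain Python so the arithmetic can be checked by hand. It follows the glossary's formulas and sign conventions literally. Summing `pg_loss` and `kl_loss` and dividing by `n_tokens` is an assumption based on the `n_tokens` entry under "Hyperparameters by name", and the function names are hypothetical.

```python
import math

CLIP_EPS = 0.2  # PPO clip width from the Greek-letter table
BETA = 0.04     # KL-penalty coefficient from the Greek-letter table


def token_losses(plp, rlp, advantage, clip_eps=CLIP_EPS, beta=BETA):
    """Return (pg_loss, kl_loss) for one assistant token."""
    ratio = math.exp(plp - rlp)
    unclipped = ratio * advantage
    clamped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    clipped = clamped_ratio * advantage
    pg_loss = -min(unclipped, clipped)
    kl_loss = beta * (rlp - plp)  # sign convention exactly as in the glossary
    return pg_loss, kl_loss


def group_loss(tokens):
    """Average (pg_loss + kl_loss) over all tokens in a group.

    `tokens` is a list of (plp, rlp, advantage) triples; its length plays
    the role of n_tokens. Combining the two terms by addition is an assumption.
    """
    if not tokens:
        raise ValueError("a group must contain at least one token")
    total = sum(sum(token_losses(p, r, a)) for p, r, a in tokens)
    return total / len(tokens)


# Identical policy and reference log-probs give ratio = 1 and a zero KL term.
print(token_losses(plp=-1.0, rlp=-1.0, advantage=1.0))
```

With a positive advantage, a ratio above `1 + ε` is clipped, so the token's policy-gradient term stops growing once the policy has moved far enough from the reference.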
|
| 520 |
-
|
| 521 |
-
#### Hyperparameters by name
|
| 522 |
-
|
| 523 |
-
| Symbol | Meaning | Value |
|
| 524 |
-
| --- | --- | --- |
|
| 525 |
-
| `K` | GRPO group size (rollouts per prompt). | `4` |
|
| 526 |
-
| `r` | LoRA rank: width of the low-rank update. | `32` (SFT), `16` (GRPO) |
|
| 527 |
-
| `dropout` | Dropout on LoRA `A` activations. | `0.05` |
|
| 528 |
-
| `lr` | AdamW learning rate. | `2e-4` (SFT), `1e-5` (GRPO) |
|
| 529 |
-
| `max_steps` | Step budget per episode. | `40` |
|
| 530 |
-
| `n_tokens` | Total assistant tokens in a GRPO group (used as loss denominator). | runtime |
|
| 531 |
-
|
| 532 |
-
#### Stats / evaluation
|
| 533 |
-
|
| 534 |
-
| Term | Meaning |
|
| 535 |
-
| --- | --- |
|
| 536 |
-
| **CDF** | Empirical Cumulative Distribution Function of cumulative reward across rollouts (Figure 1). |
|
| 537 |
-
| **`stdev`** | Standard deviation; reported on `final` and as plot error bars (`σ at plateau`). |
|
| 538 |
-
| **`Pearson r`** | Linear correlation coefficient in `[−1, +1]`. Reported between Phase-2 *breadth* (number of unique files inspected) and `final` on Pool D; negative means narrowing search hurts on novel compounds. |
|
| 539 |
-
| **`ECE`** | **Expected Calibration Error.** Average gap between the agent's stated confidence and its empirical accuracy across confidence bins; lower is better. |
|
| 540 |
-
| **`stdev ≤ 0.15`** | Variance-gate threshold over a 64-sample window before Stage 4 opens. |
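Two of the entries above are short enough to sketch directly. The Python below is an illustration, not the evaluation code from the repo: the function names are hypothetical, the ECE uses ten equal-width bins weighted by bin size (a common choice that the glossary does not specify), and the variance gate assumes the window is the most recent 64 `final` scores with a population standard deviation.

```python
from statistics import pstdev


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted average of |mean confidence - accuracy| per bin."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one correctness flag per confidence value")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than past it.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def variance_gate_open(finals, window=64, threshold=0.15):
    """True once the last `window` final scores have stdev <= threshold."""
    if len(finals) < window:
        return False
    return pstdev(finals[-window:]) <= threshold


print(expected_calibration_error([0.9, 0.9], [True, False]))
print(variance_gate_open([0.5] * 64))
```

An agent that states 0.9 confidence but is right only half the time lands in a single bin with a gap of about 0.4, which is the whole ECE in that toy case.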
|
| 541 |
-
|
| 542 |
-
#### Misc symbols
|
| 543 |
-
|
| 544 |
-
| Symbol | Meaning |
|
| 545 |
-
| --- | --- |
|
| 546 |
-
| `→` | Process step or state transition (e.g. `Base → SFT → GRPO → Merge`). |
|
| 547 |
-
| `×` | Cartesian product / multiplication (e.g. `7 services × 10 actions`). |
|
| 548 |
-
| `·` | List separator in dense tables / captions; also dot product where unambiguous. |
|
| 549 |
-
| `≈` | Approximately equal. |
|
| 550 |
-
| `≤`, `≥` | At most / at least. |
|
| 551 |
-
| `▲ / ▼` | Increase / decrease in a Δ column (sign already encoded in the value). |
|
| 552 |
-
| `∅` | Null / empty context: a Phase-2 episode given no Phase-1 evidence. |
|
| 553 |
-
| `[a, b]` | Closed interval; e.g. component scores live in `[0, 1]`. |
|
| 554 |
-
|
| 555 |
-
---
|
| 556 |
-
|
| 557 |
-
### Appendix B · Diagram source files in this repo
|
| 558 |
-
|
| 559 |
-
All images live in **`./assets/`** at the repo root, the canonical HF Spaces convention. Paths in this blog use plain markdown image syntax (`![alt](./assets/...)`) so they render the same way in:
|
| 560 |
-
|
| 561 |
-
- the **HF Space README/blog** (relative-path resolution),
|
| 562 |
-
- the **Hugging Face blog** (`huggingface.co/blog/...`),
|
| 563 |
-
- a **GitHub mirror** (no path changes needed),
|
| 564 |
-
- and a **local Markdown preview**.
|
| 565 |
-
|
| 566 |
-
| File | Used in | Notes |
|
| 567 |
-
| --- | --- | --- |
|
| 568 |
-
| `./assets/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). |
|
| 569 |
-
| `./assets/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
|
| 570 |
-
| `./assets/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy: orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
|
| 571 |
-
| `./assets/cdf.png` | §7 Figure 1 | Reward CDF per source; drop your chart at this path. |
|
| 572 |
-
| `./assets/efficiency.png` | §7 Figure 2 | Efficiency curve; drop your chart at this path. |
|
| 573 |
-
| Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub and HF Space markdown. |
|
| 574 |
-
|
| 575 |
-
**To render the SVGs as PNGs (for Twitter / slide decks):**
|
| 576 |
-
|
| 577 |
-
```bash
|
| 578 |
-
# Either:
|
| 579 |
-
npx svgexport ./assets/pipeline.svg ./assets/pipeline.png 2x
|
| 580 |
-
# or:
|
| 581 |
-
rsvg-convert -z 2 -o ./assets/pipeline.png ./assets/pipeline.svg
|
| 582 |
-
```
|
| 583 |
-
|
| 584 |
-
**To replace the result figures**, drop your two charts at:
|
| 585 |
-
|
| 586 |
-
- `./assets/cdf.png`: Figure 1 (reward distribution per source)
|
| 587 |
-
- `./assets/efficiency.png`: Figure 2 (reward vs. steps)
|
| 588 |
-
|
| 589 |
-
The blog already links to those paths.
|
| 590 |
-
|
| 591 |
-
> **Why `./assets/...`?** HF Spaces resolve relative paths from the rendered file's directory. Putting `BLOG.md` at the repo root and all images under `./assets/` means every link works without ever rewriting a URL: no `https://huggingface.co/spaces/<owner>/<name>/resolve/main/...` boilerplate, no broken paths if the Space is forked.
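Because every image is referenced by a relative path, a small script can confirm that no link is broken before a push. The sketch below is a hypothetical helper, not something shipped in this repo: `missing_images` and its regular expression are illustrative, and it only handles plain `![alt](path)` image syntax (no reference-style links or HTML `<img>` tags).

```python
import re
from pathlib import Path

# Captures the target of a plain markdown image: ![alt](target "optional title")
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
# Anything with a URL scheme (https://, http://, ...) is not a local file.
URL_SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def missing_images(markdown_text, base_dir):
    """Return relative image targets that do not exist under base_dir."""
    base = Path(base_dir)
    missing = []
    for target in IMAGE_LINK.findall(markdown_text):
        if URL_SCHEME.match(target):
            continue  # absolute URL; nothing to check on disk
        if not (base / target).is_file():
            missing.append(target)
    return missing


if __name__ == "__main__":
    blog = Path("BLOG.md")  # assumed to sit at the repo root, as described above
    if blog.is_file():
        for target in missing_images(blog.read_text(encoding="utf-8"), blog.parent):
            print(f"missing image: {target}")
```

Run from the repo root, it prints one line per image path that does not resolve, which covers the case where `cdf.png` or `efficiency.png` has not been dropped in yet.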
|
|
|
|
| 476 |
|
| 477 |
Fork it. Run it. Beat it. Tell us where we got it wrong.
|
| 478 |
|
| 479 |
+
---
|