srinjoyd committed on
Commit 290a696 · verified · 1 Parent(s): eb1f7f2

Update BLOG.md

Files changed (1)
  1. BLOG.md +1 -113
BLOG.md CHANGED
@@ -476,116 +476,4 @@ On the training distribution: **yes, clearly.** On novel compounds: **not yet, b
476
 
477
  Fork it. Run it. Beat it. Tell us where we got it wrong.
478
 
479
- ---
480
-
481
- ### Appendix A · Notation glossary
482
-
483
- Every mathematical symbol used above, gathered for reference.
484
-
485
- #### Greek letters
486
-
487
- | Symbol | Reads as | Used for | Value(s) in this work |
488
- | --- | --- | --- | --- |
489
- | `α` | alpha | LoRA scaling coefficient: the `α/r` factor multiplies the low-rank update `B A`. **Not a learning rate.** | `α=16` (SFT), `α=32` (GRPO) |
490
- | `β` | beta | KL-penalty coefficient in the GRPO loss; weights how strongly the policy is pulled toward the frozen reference. | `0.04` |
491
- | `ε` | epsilon | (i) Numerical stabiliser added to `σ_R` when normalising advantages; (ii) PPO clip width, also written `clip`. | `1e-6`, `0.2` |
492
- | `μ_R` | mu of R | Mean of the K within-group returns. | runtime |
493
- | `σ_R` | sigma of R | Standard deviation of the K within-group returns. | runtime |
494
- | `σ` | sigma | Generic standard deviation; used in plot error bars. | runtime |
495
- | `τ` | tau | A full episode trajectory `(o_0, a_0, r_0, …, o_T, a_T, r_T)`. | runtime |
496
- | `τ_1, τ_2` | tau-1, tau-2 | The Phase-1 / Phase-2 sub-trajectories of `τ`. | runtime |
497
- | `Δ` | delta | Difference between two metrics (e.g. `Δ mean_final = RL − Base`). | reported per-row |
498
- | `π` (`π_orch`) | pi | A policy. `π_orch` is the orchestrator's routing policy (see hierarchical-RL diagram). | learned |
499
-
500
- #### Reward / return symbols
501
-
502
- | Symbol | Meaning |
503
- | --- | --- |
504
- | `R_i` | Group-relative return for rollout `i`: `R_i = terminal_reward_i + r_cross_i`. |
505
- | `A_i` | GRPO advantage: `A_i = (R_i − μ_R) / (σ_R + ε)`. Standardised within the K-rollout group. |
506
- | `r_code(...)` | Phase-2 grader score in `[0, 1]`: patch quality (file overlap + AST + syntax) or no-change detection. |
507
- | `r_cross(τ)` | Counterfactual cross-phase reward, defined in §4. |
508
- | `final` | Top-level grader output in `[0, 1]`: weighted sum of `p1_rca`, `p1_efficiency`, `patch_quality`, `no_change_detection`, `p2_efficiency`. |
509
-
510
- #### GRPO update symbols (per-token, per-segment)
511
-
512
- | Symbol | Meaning |
513
- | --- | --- |
514
- | `plp` | Log-probability of an assistant token under the **policy** (current trainable model). |
515
- | `rlp` | Log-probability of the same token under the **reference** model (frozen base). |
516
- | `ratio` | `exp(plp − rlp)`: importance-sampling ratio. |
517
- | `unclipped`, `clipped` | `ratio · A_i` and `clamp(ratio, 1−ε, 1+ε) · A_i` respectively. |
518
- | `pg_loss` | `−min(unclipped, clipped)`: clipped surrogate (negated for minimisation). |
519
- | `kl_loss` | `β · (rlp − plp)`: per-token forward-KL approximation. |
520
-
521
- #### Hyperparameters by name
522
-
523
- | Symbol | Meaning | Value |
524
- | --- | --- | --- |
525
- | `K` | GRPO group size (rollouts per prompt). | `4` |
526
- | `r` | LoRA rank: width of the low-rank update. | `32` (SFT), `16` (GRPO) |
527
- | `dropout` | Dropout on LoRA `A` activations. | `0.05` |
528
- | `lr` | AdamW learning rate. | `2e-4` (SFT), `1e-5` (GRPO) |
529
- | `max_steps` | Step budget per episode. | `40` |
530
- | `n_tokens` | Total assistant tokens in a GRPO group (used as loss denominator). | runtime |
531
-
532
- #### Stats / evaluation
533
-
534
- | Term | Meaning |
535
- | --- | --- |
536
- | **CDF** | Empirical Cumulative Distribution Function of cumulative reward across rollouts (Figure 1). |
537
- | **`stdev`** | Standard deviation; reported on `final` and as plot error bars (`σ at plateau`). |
538
- | **`Pearson r`** | Linear correlation coefficient in `[−1, +1]`. Reported between Phase-2 *breadth* (number of unique files inspected) and `final` on Pool D; negative means narrowing search hurts on novel compounds. |
539
- | **`ECE`** | **Expected Calibration Error.** Average gap between the agent's stated confidence and its empirical accuracy across confidence bins; lower is better. |
540
- | **`stdev ≤ 0.15`** | Variance-gate threshold over a 64-sample window before Stage 4 opens. |
541
-
542
- #### Misc symbols
543
-
544
- | Symbol | Meaning |
545
- | --- | --- |
546
- | `→` | Process step or state transition (e.g. `Base → SFT → GRPO → Merge`). |
547
- | `×` | Cartesian product / multiplication (e.g. `7 services × 10 actions`). |
548
- | `·` | List separator in dense tables / captions; also dot product where unambiguous. |
549
- | `≈` | Approximately equal. |
550
- | `≤`, `≥` | At most / at least. |
551
- | `▲ / ▼` | Increase / decrease in a Δ column (sign already encoded in the value). |
552
- | `∅` | Null / empty context: a Phase-2 episode given no Phase-1 evidence. |
553
- | `[a, b]` | Closed interval; e.g. component scores live in `[0, 1]`. |
554
-
555
- ---
556
-
557
- ### Appendix B · Diagram source files in this repo
558
-
559
- All images live in **`./assets/`** at the repo root, the canonical HF Spaces convention. Paths in this blog use plain markdown image syntax (`![alt](./assets/file.svg)`) so they render the same way in:
560
-
561
- - the **HF Space README/blog** (relative-path resolution),
562
- - the **Hugging Face blog** (`huggingface.co/blog/...`),
563
- - a **GitHub mirror** (no path changes needed),
564
- - and a **local Markdown preview**.
565
-
566
- | File | Used in | Notes |
567
- | --- | --- | --- |
568
- | `./assets/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). |
569
- | `./assets/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
570
- | `./assets/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy: orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
571
- | `./assets/cdf.png` | §7 Figure 1 | Reward CDF per source; drop your chart at this path. |
572
- | `./assets/efficiency.png` | §7 Figure 2 | Efficiency curve; drop your chart at this path. |
573
- | Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub and HF Space markdown. |
574
-
575
- **To render the SVGs as PNGs (for Twitter / slide decks):**
576
-
577
- ```bash
578
- # Either:
579
- npx svgexport ./assets/pipeline.svg ./assets/pipeline.png 2x
580
- # or:
581
- rsvg-convert -z 2 -o ./assets/pipeline.png ./assets/pipeline.svg
582
- ```
583
-
584
- **To replace the result figures**, drop your two charts at:
585
-
586
- - `./assets/cdf.png`: Figure 1 (reward distribution per source)
587
- - `./assets/efficiency.png`: Figure 2 (reward vs. steps)
588
-
589
- The blog already links to those paths.
590
-
591
- > **Why `./assets/...`?** HF Spaces resolve relative paths from the rendered file's directory. Putting `BLOG.md` at the repo root and all images under `./assets/` means every link works without ever rewriting a URL: no `https://huggingface.co/spaces/<owner>/<name>/resolve/main/...` boilerplate, no broken paths if the Space is forked.
 
476
 
477
  Fork it. Run it. Beat it. Tell us where we got it wrong.
478
 
479
+ ---
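
The glossary removed by this commit defines the GRPO advantage `A_i` and the per-token terms `ratio`, `pg_loss`, and `kl_loss`. The sketch below transcribes those definitions literally so they can be checked by hand. It is not the repository's training code: the function names `group_advantages` and `token_terms` are hypothetical, the use of the population standard deviation is an assumption (the glossary does not say which estimator is used), and how the two per-token terms are combined and normalised by `n_tokens` is left out because the glossary does not spell it out.

```python
import math
from statistics import mean, pstdev

BETA = 0.04      # KL-penalty coefficient (beta in the glossary)
CLIP_EPS = 0.2   # PPO clip width (epsilon, second meaning)
STAB_EPS = 1e-6  # stabiliser added to sigma_R (epsilon, first meaning)


def group_advantages(returns):
    """A_i = (R_i - mu_R) / (sigma_R + eps), standardised within one K-rollout group."""
    mu = mean(returns)
    # Assumption: population standard deviation; the source does not specify.
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + STAB_EPS) for r in returns]


def token_terms(plp, rlp, advantage):
    """Return (pg_loss, kl_loss) for one assistant token, as written in the glossary.

    plp: log-probability under the trainable policy.
    rlp: log-probability under the frozen reference model.
    """
    ratio = math.exp(plp - rlp)
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1 - CLIP_EPS), 1 + CLIP_EPS) * advantage
    pg_loss = -min(unclipped, clipped)   # negated for minimisation
    kl_loss = BETA * (rlp - plp)         # sign exactly as the glossary states it
    return pg_loss, kl_loss


if __name__ == "__main__":
    # K = 4 rollouts, matching the group size listed in the glossary.
    print(group_advantages([1.0, 0.0, 1.0, 0.0]))
    # A group with identical returns has sigma_R = 0; the stabiliser keeps
    # the division finite and every advantage comes out as 0.0.
    print(group_advantages([0.7, 0.7, 0.7, 0.7]))
    print(token_terms(plp=0.0, rlp=-1.0, advantage=1.0))
```

With a positive advantage and a ratio above `1 + ε`, the clipped branch wins the `min`, so the token contributes no more than `(1 + ε) · A_i` to the surrogate; with a negative advantage the unclipped branch is kept instead.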