Spaces: Meta-HF-hackathon/updated-policy

srinjoyd committed on Apr 26

Commit 290a696 (verified) · 1 Parent(s): eb1f7f2

Update BLOG.md

Files changed (1)

BLOG.md +1 -113

@@ -476,116 +476,4 @@ On the training distribution: **yes, clearly.** On novel compounds: **not yet, b
 Fork it. Run it. Beat it. Tell us where we got it wrong.
----
-### Appendix A · Notation glossary
-Every mathematical symbol used above, gathered for reference.
-#### Greek letters
-| Symbol | Reads as | Used for | Value(s) in this work |
-| --- | --- | --- | --- |
-| `α` | alpha | LoRA scaling coefficient — the `α/r` factor multiplies the low-rank update `B A`. **Not a learning rate.** | `α=16` (SFT), `α=32` (GRPO) |
-| `β` | beta | KL-penalty coefficient in the GRPO loss; weights how strongly the policy is pulled toward the frozen reference. | `0.04` |
-| `ε` | epsilon | (i) Numerical stabiliser added to `σ_R` when normalising advantages; (ii) PPO clip width — also written `clip`. | `1e-6`, `0.2` |
-| `μ_R` | mu of R | Mean of the K within-group returns. | runtime |
-| `σ_R` | sigma of R | Standard deviation of the K within-group returns. | runtime |
-| `σ` | sigma | Generic standard deviation; used in plot error bars. | runtime |
-| `τ` | tau | A full episode trajectory `(o_0, a_0, r_0, …, o_T, a_T, r_T)`. | runtime |
-| `τ_1, τ_2` | tau-1, tau-2 | The Phase-1 / Phase-2 sub-trajectories of `τ`. | runtime |
-| `Δ` | delta | Difference between two metrics (e.g. `Δ mean_final = RL − Base`). | reported per-row |
-| `π` (`π_orch`) | pi | A policy. `π_orch` is the orchestrator's routing policy (see hierarchical-RL diagram). | learned |
-#### Reward / return symbols
-| Symbol | Meaning |
-| --- | --- |
-| `R_i` | Group-relative return for rollout `i`: `R_i = terminal_reward_i + r_cross_i`. |
-| `A_i` | GRPO advantage: `A_i = (R_i − μ_R) / (σ_R + ε)`. Standardised within the K-rollout group. |
-| `r_code(...)` | Phase-2 grader score in `[0, 1]` — patch quality (file overlap + AST + syntax) or no-change detection. |
-| `r_cross(τ)` | Counterfactual cross-phase reward, defined in §4. |
-| `final` | Top-level grader output in `[0, 1]`: weighted sum of `p1_rca`, `p1_efficiency`, `patch_quality`, `no_change_detection`, `p2_efficiency`. |
-#### GRPO update symbols (per-token, per-segment)
-| Symbol | Meaning |
-| --- | --- |
-| `plp` | Log-probability of an assistant token under the **policy** (current trainable model). |
-| `rlp` | Log-probability of the same token under the **reference** model (frozen base). |
-| `ratio` | `exp(plp − rlp)` — importance-sampling ratio. |
-| `unclipped`, `clipped` | `ratio · A_i` and `clamp(ratio, 1−ε, 1+ε) · A_i` respectively. |
-| `pg_loss` | `−min(unclipped, clipped)` — clipped surrogate (negated for minimisation). |
-| `kl_loss` | `β · (rlp − plp)` — per-token forward-KL approximation. |
-#### Hyperparameters by name
-| Symbol | Meaning | Value |
-| --- | --- | --- |
-| `K` | GRPO group size (rollouts per prompt). | `4` |
-| `r` | LoRA rank — width of the low-rank update. | `32` (SFT), `16` (GRPO) |
-| `dropout` | Dropout on LoRA `A` activations. | `0.05` |
-| `lr` | AdamW learning rate. | `2e-4` (SFT), `1e-5` (GRPO) |
-| `max_steps` | Step budget per episode. | `40` |
-| `n_tokens` | Total assistant tokens in a GRPO group (used as loss denominator). | runtime |
-#### Stats / evaluation
-| Term | Meaning |
-| --- | --- |
-| **CDF** | Empirical Cumulative Distribution Function of cumulative reward across rollouts (Figure 1). |
-| **`stdev`** | Standard deviation; reported on `final` and as plot error bars (`σ at plateau`). |
-| **`Pearson r`** | Linear correlation coefficient in `[−1, +1]`. Reported between Phase-2 *breadth* (number of unique files inspected) and `final` on Pool D — negative means narrowing search hurts on novel compounds. |
-| **`ECE`** | **Expected Calibration Error.** Average gap between the agent's stated confidence and its empirical accuracy across confidence bins; lower is better. |
-| **`stdev ≤ 0.15`** | Variance-gate threshold over a 64-sample window before Stage 4 opens. |
-#### Misc symbols
-| Symbol | Meaning |
-| --- | --- |
-| `→` | Process step or state transition (e.g. `Base → SFT → GRPO → Merge`). |
-| `×` | Cartesian product / multiplication (e.g. `7 services × 10 actions`). |
-| `·` | List separator in dense tables / captions; also dot product where unambiguous. |
-| `≈` | Approximately equal. |
-| `≤`, `≥` | At most / at least. |
-| `▲ / ▼` | Increase / decrease in a Δ column (sign already encoded in the value). |
-| `∅` | Null / empty context — a Phase-2 episode given no Phase-1 evidence. |
-| `[a, b]` | Closed interval; e.g. component scores live in `[0, 1]`. |
----
-### Appendix B · Diagram source files in this repo
-All images live in **`./assets/`** at the repo root — the canonical HF Spaces convention. Paths in this blog use plain markdown image syntax (`![alt](./assets/file.svg)`) so they render the same way in:
-- the **HF Space README/blog** (relative-path resolution),
-- the **Hugging Face blog** (`huggingface.co/blog/...`),
-- a **GitHub mirror** (no path changes needed),
-- and a **local Markdown preview**.
-| File | Used in | Notes |
-| --- | --- | --- |
-| `./assets/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). |
-| `./assets/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
-| `./assets/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy — orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
-| `./assets/cdf.png` | §7 Figure 1 | Reward CDF per source — drop your chart at this path. |
-| `./assets/efficiency.png` | §7 Figure 2 | Efficiency curve — drop your chart at this path. |
-| Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub and HF Space markdown. |
-**To render the SVGs as PNGs (for Twitter / slide decks):**
-```bash
-# Either:
-npx svgexport ./assets/pipeline.svg ./assets/pipeline.png 2x
-# or:
-rsvg-convert -z 2 -o ./assets/pipeline.png ./assets/pipeline.svg
-```
-**To replace the result figures**, drop your two charts at:
-- `./assets/cdf.png` — Figure 1 (reward distribution per source)
-- `./assets/efficiency.png` — Figure 2 (reward vs. steps)
-The blog already links to those paths.
-> **Why `./assets/...`?** HF Spaces resolve relative paths from the rendered file's directory. Putting `BLOG.md` at the repo root and all images under `./assets/` means every link works without ever rewriting a URL — no `https://huggingface.co/spaces/<owner>/<name>/resolve/main/...` boilerplate, no broken paths if the Space is forked.


+---
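
For readers who want to check the GRPO rows of the removed glossary against something executable, here is a minimal Python sketch of those definitions. The function names `group_advantages` and `token_losses` are hypothetical, the use of the population standard deviation for `σ_R` is an assumption (the glossary does not say which estimator is used), and the defaults `ε=1e-6`, `clip=0.2`, `β=0.04` are the values listed in the glossary. It is not the repository's training code.

```python
import math
import statistics


def group_advantages(returns, eps=1e-6):
    """A_i = (R_i - mu_R) / (sigma_R + eps), standardised within one K-rollout group."""
    mu = statistics.fmean(returns)
    # Assumption: population standard deviation; the glossary does not specify.
    sigma = statistics.pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def token_losses(plp, rlp, advantage, clip=0.2, beta=0.04):
    """Per-token pg_loss and kl_loss as written in the glossary.

    plp: log-probability of the token under the policy.
    rlp: log-probability of the same token under the frozen reference.
    """
    ratio = math.exp(plp - rlp)
    unclipped = ratio * advantage
    clipped = max(1.0 - clip, min(ratio, 1.0 + clip)) * advantage
    pg_loss = -min(unclipped, clipped)  # clipped surrogate, negated for minimisation
    kl_loss = beta * (rlp - plp)        # glossary's per-token KL term
    return pg_loss, kl_loss


if __name__ == "__main__":
    # Hypothetical returns for a group of K = 4 rollouts.
    advantages = group_advantages([1.0, 0.0, 0.5, 0.5])
    print(advantages)
    print(token_losses(plp=-0.9, rlp=-1.0, advantage=advantages[0]))
```

The sketch returns the two loss terms separately because the glossary does not state how they are combined beyond noting that `n_tokens` is the loss denominator.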