Spaces: sanjay7676 / Team404_FORGE

sanjay7676 committed on Apr 26

Commit 2b35bd5 (1 parent: a2ac82d)

fix(api): Pydantic v2 step payload (exclude_none); env candidate_solutions; README Space API curl guide; Gradio API tab

Files changed (6)

BLOG.md +39 -34
MINI_BLOG.md +0 -47
README.md +30 -3
api_server.py +2 -1
app.py +9 -7
env.py +1 -1

BLOG.md CHANGED

@@ -1,47 +1,52 @@
-# 🛡️ FORGE-v4: Building the "Immune System" for AI Code Generation
-### The Silent Crisis in AI Coding
-We've all seen it: an AI writes a perfect "Quick Sort" in seconds. But what happens when you give that same code an array of 10,000 duplicate zeros? Or a list of mixed large negatives? Often, the AI's "perfect" code crashes, enters an infinite loop, or returns incorrect results.
-Standard benchmarks measure **capability**. We built **FORGE-v4** to measure **robustness**.
----
-## ⚔️ The Concept: Adversarial Red-Teaming
-FORGE-v4 isn't just a static test suite; it's a living environment. We implemented a **Red-vs-Blue** dynamic:
-- **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
-- **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
-As the Coder improves, the Breaker escalates. It progresses through **4 Tiers of difficulty**—from basic lists to extreme boundary values and stress tests. This tiered red-teaming ensures that the model isn't just memorizing common patterns, but actually hardening its logic.
----
-## 🧠 The Secret Sauce: CoachMemory
-One of the most innovative features of FORGE-v4 is the **CoachMemory feedback loop**.
-In most training environments, a model fails, gets a low reward, and moves on. In FORGE-v4, every failure is analyzed by the "Coach."
-*   Did the model fail on negatives?
-*   Did it time out on large arrays?
-*   Did it destroy duplicates?
-These insights are stored in persistent memory. In the next episode, the model reads these "lessons" and adapts its strategy. This mimics the human engineering process: **Mistake → Analysis → Correction.**
----
-## 📈 Results that Matter
-Our benchmarks show that while a baseline heuristic policy might have a high "average" pass rate (91%), it is easily broken by Tier 3 and Tier 4 attacks.
-Our **FORGE-v4 Model Policy** achieved:
-- **100% Pass Rate** across all adversarial tiers.
-- **+2.10 Reward Gain** over the baseline.
-- **Sustained Tier 4 Robustness**: It didn't just survive; it thrived under extreme pressure.
----
-## 🌍 Why This Matters
-As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
-**Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
----
-*Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*

+# FORGE-v4 Mini Blog: From Fragile Code to Adversarial Robustness
+## The story in one line
+FORGE-v4 trains a coding agent to survive adversarial edge cases by making it fight a breaker, learn from failures, and improve over repeated reward-driven episodes.
+## Why we built this
+Most coding models look good on clean examples and then fail on real inputs: negatives, duplicates, boundary values, and timeout-prone cases. We wanted an environment where failure is explicit, measurable, and useful for training.
+## The journey
+### Chapter 1: baseline confidence, hidden fragility
+We started with a defender that often passed easy tests but broke under stress tiers. That gave us a critical signal: average correctness is not robustness.
+### Chapter 2: breaker escalation
+We added a tiered breaker that progressively attacked blind spots. The environment moved from simple lists to harder adversarial distributions.
+### Chapter 3: memory as improvement engine
+CoachMemory converted repeated failure patterns into structured lessons. Instead of forgetting mistakes each episode, the loop made mistakes actionable.
+### Chapter 4: measurable training loop
+We used benchmark/compare runs to produce reward and pass-rate evidence, exported preference pairs, and connected that to a small-model-first adapter training path.
+## What changed after training cycles
+- Defender pass rate stabilized under tougher tiers.
+- Average defender reward improved versus baseline runs.
+- Breaker pressure remained high, but the defender failed less often on known edge patterns.
+## Evidence (committed outputs)
+### Reward trend
+![Reward curve](outputs/reward_curve.png)
+### Pass-rate trend
+![Pass rate curve](outputs/pass_rate.png)
+### Loss-like training signal
+![Loss curve](outputs/loss_curve.png)
+### Machine-readable benchmark summary
+- `outputs/final_report.json`
+## Deliverables
+- Hugging Face Space: https://huggingface.co/spaces/sanjay7676/Team404_FORGE
+- GitHub repository: https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2
+- **Docker image (public — anyone can pull)**
+  - **Docker Hub (browse tags):** https://hub.docker.com/r/sanjay767676/forge
+  - **Pull command:** `docker pull sanjay767676/forge:latest`
+  - **Registry image reference:** `docker.io/sanjay767676/forge:latest`
+- Colab notebook: https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb
+- Colab model + adapter training: https://colab.research.google.com/drive/1mKXjIX-eB2GSiebI-_n37KzVlN1NKCu8?usp=sharing
+- YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID
+## Why this matters
+FORGE-v4 is designed to train coding behavior that is verifiable, harder to reward-hack, and more resilient under adversarial conditions. That is the capability gap we think matters most for real LLM deployment.

MINI_BLOG.md DELETED

@@ -1,47 +0,0 @@
-# FORGE-v4 Mini Blog: From Fragile Code to Adversarial Robustness
-## The story in one line
-FORGE-v4 trains a coding agent to survive adversarial edge cases by making it fight a breaker, learn from failures, and improve over repeated reward-driven episodes.
-## Why we built this
-Most coding models look good on clean examples and then fail on real inputs: negatives, duplicates, boundary values, and timeout-prone cases. We wanted an environment where failure is explicit, measurable, and useful for training.
-## The journey
-### Chapter 1: baseline confidence, hidden fragility
-We started with a defender that often passed easy tests but broke under stress tiers. That gave us a critical signal: average correctness is not robustness.
-### Chapter 2: breaker escalation
-We added a tiered breaker that progressively attacked blind spots. The environment moved from simple lists to harder adversarial distributions.
-### Chapter 3: memory as improvement engine
-CoachMemory converted repeated failure patterns into structured lessons. Instead of forgetting mistakes each episode, the loop made mistakes actionable.
-### Chapter 4: measurable training loop
-We used benchmark/compare runs to produce reward and pass-rate evidence, exported preference pairs, and connected that to a small-model-first adapter training path.
-## What changed after training cycles
-- Defender pass rate stabilized under tougher tiers.
-- Average defender reward improved versus baseline runs.
-- Breaker pressure remained high, but the defender failed less often on known edge patterns.
-## Evidence (committed outputs)
-### Reward trend
-![Reward curve](outputs/reward_curve.png)
-### Pass-rate trend
-![Pass rate curve](outputs/pass_rate.png)
-### Loss-like training signal
-![Loss curve](outputs/loss_curve.png)
-### Machine-readable benchmark summary
-- `outputs/final_report.json`
-## Deliverables
-- Hugging Face Space: https://huggingface.co/spaces/sanjay7676/Team404_FORGE
-- GitHub repository: https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2
-- Colab notebook: https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb
-- YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID
-## Why this matters
-FORGE-v4 is designed to train coding behavior that is verifiable, harder to reward-hack, and more resilient under adversarial conditions. That is the capability gap we think matters most for real LLM deployment.

README.md CHANGED

@@ -20,6 +20,7 @@ suggested_hardware: cpu-basic
 [![Colab (Drive)](https://img.shields.io/badge/Training-Colab-orange)](https://colab.research.google.com/drive/1mKXjIX-eB2GSiebI-_n37KzVlN1NKCu8?usp=sharing)
 [![Colab (GitHub)](https://img.shields.io/badge/Colab-GitHub_sync-green)](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb)
 [![Adapter](https://img.shields.io/badge/HF-Adapter-blue)](https://huggingface.co/sanjay7676/forge-qwen-final)
 [![Hackathon Guide](https://img.shields.io/badge/Meta-OpenEnv%20Guide-0a66c2)](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?tab=t.0#bookmark=kix.2dz0x0nie3me)
 ### Judge quick links (all materials)
@@ -30,10 +31,12 @@ suggested_hardware: cpu-basic
 | **OpenEnv + TRL (framework docs)** | [Hugging Face TRL — OpenEnv integration](https://huggingface.co/docs/trl/openenv) |
 | **Hugging Face Space (submit this URL)** | [huggingface.co/spaces/sanjay7676/Team404_FORGE](https://huggingface.co/spaces/sanjay7676/Team404_FORGE) |
 | **Source code** | [github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2](https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2) |
-| **Mini-blog (writeup)** | [MINI_BLOG.md](MINI_BLOG.md) in repo |
 | **Training Colab (author Drive)** | [Colab notebook](https://colab.research.google.com/drive/1mKXjIX-eB2GSiebI-_n37KzVlN1NKCu8?usp=sharing) |
 | **Training Colab (synced from GitHub)** | [FORGE_Training_Colab.ipynb on Colab](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb) |
 | **Trained adapter** | [sanjay7676/forge-qwen-final](https://huggingface.co/sanjay7676/forge-qwen-final) |
 | **Command / security cheat sheet** | [guide.md](guide.md) |
 | **Video / slides** | YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID |
@@ -44,6 +47,25 @@ suggested_hardware: cpu-basic
 - For a stable demo on CPU, set Space secret **`CODE_PROVIDER_MODE=mock`** (or use **NIM** / **OpenRouter** keys so the router never loads local `custom_hf`). Loading **`Qwen2.5-Coder-1.5B` + LoRA** on free CPU is likely to **OOM or time out**.
 - Full training stack: install **[`requirements-train.txt`](requirements-train.txt)** on **Colab** or locally (see Quickstart).
 ### NOTE 1 — Non‑negotiable submission requirements (checklist)
 | # | Requirement | FORGE-v4 |
@@ -51,7 +73,7 @@ suggested_hardware: cpu-basic
 | 1 | **OpenEnv (latest):** build on the framework | **`openenv-core>=0.2.3`** in [`requirements.txt`](requirements.txt). Training extras in [`requirements-train.txt`](requirements-train.txt). Wrapper: [`env_openenv.py`](env_openenv.py). Core: [`env.py`](env.py). |
 | 2 | **Training:** Unsloth or TRL (or other RL stack) + **Colab** | [`train_unsloth.py`](train_unsloth.py) (Unsloth + TRL), [`train_colab.py`](train_colab.py), [`FORGE_Training_Colab.ipynb`](FORGE_Training_Colab.ipynb), Colab links in the table above. |
 | 3 | **Evidence of training:** loss + reward plots (real run) | Committed: [`outputs/reward_curve.png`](outputs/reward_curve.png), [`outputs/loss_curve.png`](outputs/loss_curve.png), [`outputs/pass_rate.png`](outputs/pass_rate.png), [`outputs/final_report.json`](outputs/final_report.json). |
-| 4 | **Writeup / video:** mini-blog on HF *or* &lt;2 min YouTube *etc.* | **[MINI_BLOG.md](MINI_BLOG.md)** linked here; add **public YouTube or slide URL** in the table row when published. |
 | 5 | **Hugging Face Space:** discoverable & runnable | **[Team404_FORGE](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)** — **use this URL in the submission form.** |
 | 6 | **README:** motivate, explain env, show results + **link Space + all materials** | This file. |
 | 7 | **No huge video files** on Hub | Only **URLs** to external video/slides (see table). |
@@ -77,7 +99,7 @@ suggested_hardware: cpu-basic
 ## Minimum submission checklist (summary)
-Same items as **NOTE 1** above: OpenEnv dependency + wrapper, Colab + training scripts, committed plots/JSON, writeup link, runnable Space URL, README hub — all linked from the **Judge quick links** table.
 ---
@@ -304,6 +326,11 @@ python train_unsloth.py --mode dpo
 Public image on **Docker Hub**: **`sanjay767676/forge`** (repository `forge` under user `sanjay767676`).
 ### Pull & run (no build — public image)
 ```bash

 [![Colab (Drive)](https://img.shields.io/badge/Training-Colab-orange)](https://colab.research.google.com/drive/1mKXjIX-eB2GSiebI-_n37KzVlN1NKCu8?usp=sharing)
 [![Colab (GitHub)](https://img.shields.io/badge/Colab-GitHub_sync-green)](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb)
 [![Adapter](https://img.shields.io/badge/HF-Adapter-blue)](https://huggingface.co/sanjay7676/forge-qwen-final)
+[![Docker Hub](https://img.shields.io/badge/Docker%20Hub-sanjay767676%2Fforge-2496ED?logo=docker&logoColor=white)](https://hub.docker.com/r/sanjay767676/forge)
 [![Hackathon Guide](https://img.shields.io/badge/Meta-OpenEnv%20Guide-0a66c2)](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?tab=t.0#bookmark=kix.2dz0x0nie3me)
 ### Judge quick links (all materials)
 | **OpenEnv + TRL (framework docs)** | [Hugging Face TRL — OpenEnv integration](https://huggingface.co/docs/trl/openenv) |
 | **Hugging Face Space (submit this URL)** | [huggingface.co/spaces/sanjay7676/Team404_FORGE](https://huggingface.co/spaces/sanjay7676/Team404_FORGE) |
 | **Source code** | [github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2](https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2) |
+| **Blog (writeup)** | [BLOG.md](BLOG.md) in repo |
 | **Training Colab (author Drive)** | [Colab notebook](https://colab.research.google.com/drive/1mKXjIX-eB2GSiebI-_n37KzVlN1NKCu8?usp=sharing) |
+| **Colab model + adapter training** | https://colab.research.google.com/drive/1mKXjIX-eB2GSiebI-_n37KzVlN1NKCu8?usp=sharing |
 | **Training Colab (synced from GitHub)** | [FORGE_Training_Colab.ipynb on Colab](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb) |
 | **Trained adapter** | [sanjay7676/forge-qwen-final](https://huggingface.co/sanjay7676/forge-qwen-final) |
+| **Docker image (public — anyone can pull)** | **Hub (tags, README):** [hub.docker.com/r/sanjay767676/forge](https://hub.docker.com/r/sanjay767676/forge) — **pull:** `docker pull sanjay767676/forge:latest` — **registry ref:** `docker.io/sanjay767676/forge:latest` |
 | **Command / security cheat sheet** | [guide.md](guide.md) |
 | **Video / slides** | YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID |
 - For a stable demo on CPU, set Space secret **`CODE_PROVIDER_MODE=mock`** (or use **NIM** / **OpenRouter** keys so the router never loads local `custom_hf`). Loading **`Qwen2.5-Coder-1.5B` + LoRA** on free CPU is likely to **OOM or time out**.
 - Full training stack: install **[`requirements-train.txt`](requirements-train.txt)** on **Colab** or locally (see Quickstart).
+### OpenEnv HTTP API on the Hugging Face Space
+The Space runs the same FastAPI routes as [`api_server.py`](api_server.py) on the **app root** (Gradio UI is at **`/ui`**; `/` redirects to `/ui`). There is **no `/start`** endpoint — begin an episode with **`POST /reset`**, then drive it with **`POST /step`**.
+1. **Base URL:** open the live Space, then use the **`*.hf.space`** host shown in the address bar (for this project it is typically **`https://sanjay7676-team404-forge.hf.space`**). If yours differs, copy it from the running app or from the Space **Embed** snippet.
+2. **Check liveness:** `curl -sS "https://sanjay7676-team404-forge.hf.space/health"`
+3. **New episode:** `curl -sS -X POST "https://sanjay7676-team404-forge.hf.space/reset" -H "Content-Type: application/json"`
+4. **Step** (JSON body must include `coder_code` and `coder_version`; omit `candidate_solutions` or send a JSON array of strings):
+```bash
+curl -sS -X POST "https://sanjay7676-team404-forge.hf.space/step" \
+  -H "Content-Type: application/json" \
+  -d "{\"coder_code\": \"def solution(arr):\\n    return sorted(list(arr))\", \"coder_version\": \"demo\"}"
+```
+5. **Observe state:** `curl -sS "https://sanjay7676-team404-forge.hf.space/state"`
+**Note:** The Space shares **one** in-memory environment across all visitors — concurrent `reset` / `step` calls can interleave. For isolated runs, use **Docker** or **local** `api_server.py` on port `8000`.
 ### NOTE 1 — Non‑negotiable submission requirements (checklist)
 | # | Requirement | FORGE-v4 |
 | 1 | **OpenEnv (latest):** build on the framework | **`openenv-core>=0.2.3`** in [`requirements.txt`](requirements.txt). Training extras in [`requirements-train.txt`](requirements-train.txt). Wrapper: [`env_openenv.py`](env_openenv.py). Core: [`env.py`](env.py). |
 | 2 | **Training:** Unsloth or TRL (or other RL stack) + **Colab** | [`train_unsloth.py`](train_unsloth.py) (Unsloth + TRL), [`train_colab.py`](train_colab.py), [`FORGE_Training_Colab.ipynb`](FORGE_Training_Colab.ipynb), Colab links in the table above. |
 | 3 | **Evidence of training:** loss + reward plots (real run) | Committed: [`outputs/reward_curve.png`](outputs/reward_curve.png), [`outputs/loss_curve.png`](outputs/loss_curve.png), [`outputs/pass_rate.png`](outputs/pass_rate.png), [`outputs/final_report.json`](outputs/final_report.json). |
+| 4 | **Writeup / video:** mini-blog on HF *or* &lt;2 min YouTube *etc.* | **[BLOG.md](BLOG.md)** linked here; add **public YouTube or slide URL** in the table row when published. |
 | 5 | **Hugging Face Space:** discoverable & runnable | **[Team404_FORGE](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)** — **use this URL in the submission form.** |
 | 6 | **README:** motivate, explain env, show results + **link Space + all materials** | This file. |
 | 7 | **No huge video files** on Hub | Only **URLs** to external video/slides (see table). |
 ## Minimum submission checklist (summary)
+Same items as **NOTE 1** above: OpenEnv dependency + wrapper, Colab + training scripts, committed plots/JSON, writeup link, runnable Space URL, **public Docker image** ([Hub](https://hub.docker.com/r/sanjay767676/forge) + `docker pull sanjay767676/forge:latest`), README hub — all linked from the **Judge quick links** table.
 ---
 Public image on **Docker Hub**: **`sanjay767676/forge`** (repository `forge` under user `sanjay767676`).
+| What | URL / reference |
+| :-- | :-- |
+| **Browse image (tags, description)** | [https://hub.docker.com/r/sanjay767676/forge](https://hub.docker.com/r/sanjay767676/forge) |
+| **Pull from CLI** | `docker pull sanjay767676/forge:latest` (same as `docker pull docker.io/sanjay767676/forge:latest`) |
 ### Pull & run (no build — public image)
 ```bash

api_server.py CHANGED

@@ -30,7 +30,8 @@ async def reset():
 async def step(action: Action):
     """Perform a step in the environment."""
     try:
-        result = env.step(action.model_dump())
         return result
     except RuntimeError as e:
         raise HTTPException(status_code=400, detail=str(e))

 async def step(action: Action):
     """Perform a step in the environment."""
     try:
+        # exclude_none: otherwise candidate_solutions=None breaks env (get() returns None, not default []).
+        result = env.step(action.model_dump(exclude_none=True))
         return result
     except RuntimeError as e:
         raise HTTPException(status_code=400, detail=str(e))
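
Note on the `exclude_none` change: the comment in the new hunk says a `candidate_solutions=None` entry defeats the `[]` default in `env.py`. The sketch below reproduces that behavior with the standard library only. `ActionSketch` and `dump` are hypothetical stand-ins for the real `Action` model and Pydantic's `model_dump`, which are not shown in this diff; the only assumption is that an optional field left unset is dumped as `None` unless `exclude_none=True` is passed.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ActionSketch:
    """Hypothetical stand-in for the API's Action model (not the real class)."""
    coder_code: str
    coder_version: str
    candidate_solutions: Optional[list] = None


def dump(action, exclude_none=False):
    """Stand-in for model_dump(): optionally drop fields whose value is None."""
    data = asdict(action)
    if exclude_none:
        data = {key: value for key, value in data.items() if value is not None}
    return data


action = ActionSketch(
    coder_code="def solution(arr):\n    return sorted(list(arr))",
    coder_version="demo",
)

full = dump(action)                        # key present, value None
trimmed = dump(action, exclude_none=True)  # key removed entirely

# Key present with value None: the default passed to get() is ignored.
print(full.get("candidate_solutions", []))     # None
# Key absent: get() falls back to the default.
print(trimmed.get("candidate_solutions", []))  # []
```

Dropping the key at the API boundary means the environment sees the same dict shape whether the caller omitted the field or the model filled it with `None`.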

app.py CHANGED

@@ -162,13 +162,15 @@ with gr.Blocks(theme=gr.themes.Soft()) as demo:
     with gr.Tab("3. API Endpoints"):
         gr.Markdown("""
         ### OpenEnv API Standard
-        FORGE-v4 exposes a FastAPI server (available at `:8000` when running locally) with the following endpoints:
-        - **`POST /reset`**: Initializes a new episode and returns the problem description.
-        - **`POST /step`**: Receives code candidates, evaluates them, and returns rewards/diagnostics.
-        - **`GET /state`**: Returns current environment status and memory summary.
-        These endpoints allow external agents to interface with FORGE-v4 programmatically.
         """)
     # Event handlers

     with gr.Tab("3. API Endpoints"):
         gr.Markdown("""
         ### OpenEnv API Standard
+        FORGE-v4 exposes a FastAPI server on the **same origin** as this UI: routes live at the **site root**, while Gradio is under **`/ui`**. Locally, `python api_server.py` serves on **`:8000`**; on this Space, use your **`*.hf.space`** base URL (no separate `/start` — use **`POST /reset`** then **`POST /step`**).
+        - **`GET /health`**: Liveness / version check.
+        - **`POST /reset`**: Starts a new episode and returns the initial state.
+        - **`POST /step`**: JSON body: `coder_code`, `coder_version`, optional `candidate_solutions` (array of strings). Returns rewards and updated state.
+        - **`GET /state`**: Current environment snapshot.
+        **Example (replace `BASE` with your Space `https://….hf.space` host):**
+        `curl -sS "$BASE/health"` → `curl -sS -X POST "$BASE/reset" -H "Content-Type: application/json"` → `curl -sS -X POST "$BASE/step" -H "Content-Type: application/json" -d '{"coder_code":"def solution(arr):\\n    return sorted(list(arr))","coder_version":"demo"}'`
         """)
     # Event handlers
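
Note on the API tab: the curl example above embeds Python source inside a JSON string, which needs manual escaping of newlines and quotes. A minimal sketch of the same `POST /step` call built from Python follows, letting `json.dumps` handle the escaping. The base URL is the one quoted in the README hunk and may differ for your Space; `build_step_request` and `send` are hypothetical helper names, and the assumption that the server replies with JSON comes from the endpoint descriptions, not from the server code itself.

```python
import json
import urllib.request

# Assumed base URL, copied from the README text in this commit; replace with your own host.
BASE = "https://sanjay7676-team404-forge.hf.space"


def build_step_request(base, coder_code, coder_version, candidate_solutions=None):
    """Build (but do not send) a POST /step request with a JSON body."""
    payload = {"coder_code": coder_code, "coder_version": coder_version}
    # Omit the key entirely when there are no candidates, as the README suggests.
    if candidate_solutions is not None:
        payload["candidate_solutions"] = list(candidate_solutions)
    return urllib.request.Request(
        url=base.rstrip("/") + "/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the response, assuming a JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_step_request(
    BASE,
    coder_code="def solution(arr):\n    return sorted(list(arr))",
    coder_version="demo",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `send(request)` performs the network call. Per the README, an episode should be started with `POST /reset` first, and the shared Space environment can interleave concurrent callers.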

env.py CHANGED

@@ -120,7 +120,7 @@ class FORGEEnv:
         coder_code = action.get("coder_code", "")
         coder_version = action.get("coder_version", "unknown")
-        candidate_solutions = action.get("candidate_solutions", [])
         if not isinstance(coder_code, str):
             raise TypeError("action['coder_code'] must be a string.")
         if not isinstance(coder_version, str):

         coder_code = action.get("coder_code", "")
         coder_version = action.get("coder_version", "unknown")
+        candidate_solutions = action.get("candidate_solutions") or []
         if not isinstance(coder_code, str):
             raise TypeError("action['coder_code'] must be a string.")
         if not isinstance(coder_version, str):
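
Note on the `or []` change: the new line treats a missing key and an explicit `None` the same way. A small sketch of that pattern in isolation is shown here. `read_candidates` is a hypothetical helper, and the list-of-strings check is an added assumption modeled on the existing `coder_code` and `coder_version` checks; this diff does not show whether `env.py` validates candidates.

```python
def read_candidates(action):
    """Hypothetical helper mirroring the `or []` pattern from the env.py hunk."""
    # Missing key -> None -> []; explicit None -> []; any other falsy value -> [] too.
    candidates = action.get("candidate_solutions") or []
    # The validation below is an assumption for illustration, not part of the commit.
    if not isinstance(candidates, list) or not all(isinstance(c, str) for c in candidates):
        raise TypeError("action['candidate_solutions'] must be a list of strings.")
    return candidates


print(read_candidates({}))                             # []
print(read_candidates({"candidate_solutions": None}))  # []
print(read_candidates({"candidate_solutions": ["def solution(arr): return sorted(arr)"]}))
```

One trade-off of `or []` is that it also maps other falsy inputs such as `""` or `0` to an empty list, so a malformed but falsy value passes silently, while a truthy non-list value is only caught if a type check like the one sketched here is present.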