sanjay7676 committed on
Commit
a2ac82d
·
1 Parent(s): 310dc9f

Add story blog updates and submission-ready docs links

Files changed (5)
  1. BLOG.md +47 -0
  2. Dockerfile +4 -2
  3. Dockerfile.train +28 -0
  4. MINI_BLOG.md +34 -34
  5. README.md +13 -2
BLOG.md ADDED
@@ -0,0 +1,47 @@
+ # 🛡️ FORGE-v4: Building the "Immune System" for AI Code Generation
+
+ ### The Silent Crisis in AI Coding
+ We've all seen it: an AI writes a perfect "Quick Sort" in seconds. But what happens when you give that same code an array of 10,000 duplicate zeros? Or a list of mixed large negatives? Often, the AI's "perfect" code crashes, enters an infinite loop, or returns incorrect results.
+
+ Standard benchmarks measure **capability**. We built **FORGE-v4** to measure **robustness**.
+
+ ---
+
+ ## ⚔️ The Concept: Adversarial Red-Teaming
+ FORGE-v4 isn't just a static test suite; it's a living environment. We implemented a **Red-vs-Blue** dynamic:
+ - **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
+ - **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
+
+ As the Coder improves, the Breaker escalates. It progresses through **4 Tiers of difficulty**, from basic lists to extreme boundary values and stress tests. This tiered red-teaming ensures that the model isn't just memorizing common patterns, but actually hardening its logic.
+
+ ---
+
+ ## 🧠 The Secret Sauce: CoachMemory
+ One of the most innovative features of FORGE-v4 is the **CoachMemory feedback loop**.
+
+ In most training environments, a model fails, gets a low reward, and moves on. In FORGE-v4, every failure is analyzed by the "Coach."
+ * Did the model fail on negatives?
+ * Did it time out on large arrays?
+ * Did it destroy duplicates?
+
+ These insights are stored in persistent memory. In the next episode, the model reads these "lessons" and adapts its strategy. This mimics the human engineering process: **Mistake → Analysis → Correction.**
+
+ ---
+
+ ## 📈 Results that Matter
+ Our benchmarks show that while a baseline heuristic policy might have a high "average" pass rate (91%), it is easily broken by Tier 3 and Tier 4 attacks.
+
+ Our **FORGE-v4 Model Policy** achieved:
+ - **100% Pass Rate** across all adversarial tiers.
+ - **+2.10 Reward Gain** over the baseline.
+ - **Sustained Tier 4 Robustness**: It didn't just survive; it thrived under extreme pressure.
+
+ ---
+
+ ## 🌍 Why This Matters
+ As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
+
+ **Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
+
+ ---
+ *Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*
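The Red-vs-Blue loop and the Coach questions in BLOG.md (negatives, large arrays, destroyed duplicates) can be illustrated with a small sketch. This is not the FORGE-v4 implementation, which this commit does not show: the tier contents, `run_breaker`, `classify_failure`, and the lesson labels are all hypothetical, and timeouts are left out for brevity.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical tier inputs; the real FORGE-v4 tiers are not part of this diff.
TIERS: Dict[int, List[List[int]]] = {
    1: [[3, 1, 2], []],                                   # basic lists
    2: [[5, -1, 5, -1], [0] * 50],                        # negatives, duplicates
    3: [[2**63 - 1, -(2**63), 0], [-10**12, 10**12, -10**12]],  # boundary values
    4: [[0] * 10_000, list(range(2_000, 0, -1))],         # stress inputs
}


def classify_failure(case: List[int], got: object) -> str:
    """Turn one wrong answer into a coarse 'lesson' label (hypothetical labels)."""
    if not isinstance(got, list):
        return "wrong_type"
    if Counter(got) != Counter(case):
        return "lost_or_changed_elements"  # e.g. duplicates were destroyed
    if any(x < 0 for x in case):
        return "misordered_with_negatives"
    return "misordered"


def run_breaker(sort_fn: Callable[[List[int]], List[int]]) -> Dict[int, List[str]]:
    """Run every tier against sort_fn and collect lessons per failing tier."""
    lessons: Dict[int, List[str]] = {}
    for tier, cases in TIERS.items():
        for case in cases:
            try:
                got = sort_fn(list(case))  # pass a copy so cases stay intact
            except Exception as exc:
                lessons.setdefault(tier, []).append(f"crash:{type(exc).__name__}")
                continue
            if got != sorted(case):
                lessons.setdefault(tier, []).append(classify_failure(case, got))
    return lessons


def dedup_sort(xs: List[int]) -> List[int]:
    """Fragile defender: correct on distinct values, drops duplicates."""
    return sorted(set(xs))


def naive_quicksort(xs: List[int]) -> List[int]:
    """Fragile defender: recursion depth grows linearly on degenerate inputs."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (naive_quicksort([x for x in rest if x < pivot]) + [pivot]
            + naive_quicksort([x for x in rest if x >= pivot]))


if __name__ == "__main__":
    for fn in (sorted, dedup_sort, naive_quicksort):
        print(fn.__name__, run_breaker(fn))
```

In this sketch both fragile defenders pass tier 1, which mirrors the post's point that a high average pass rate can hide failures that only appear on the harder tiers.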
Dockerfile CHANGED
@@ -13,9 +13,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
  curl \
  && rm -rf /var/lib/apt/lists/*

- COPY requirements.txt requirements-train.txt ./
+ # Slim image: Gradio + API + OpenEnv (no PyTorch). Builds in minutes; same stack as HF CPU Space.
+ # For training inside Docker: docker build -f Dockerfile.train -t forge:train .
+ COPY requirements.txt ./
  RUN pip install --upgrade pip && \
- pip install -r requirements.txt -r requirements-train.txt
+ pip install -r requirements.txt

  COPY . .

Dockerfile.train ADDED
@@ -0,0 +1,28 @@
+ # Full stack: PyTorch, TRL, PEFT, etc. Large downloads (often 30-90+ min on first build).
+ # Use only when you need training or local HF weights inside the image.
+ FROM python:3.11-slim
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+ PYTHONUNBUFFERED=1 \
+ PIP_NO_CACHE_DIR=1 \
+ CODE_PROVIDER_MODE=mock
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+ build-essential \
+ git \
+ curl \
+ && rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt requirements-train.txt ./
+ RUN pip install --upgrade pip && \
+ pip install -r requirements.txt -r requirements-train.txt
+
+ COPY . .
+
+ RUN mkdir -p data logs models outputs
+
+ EXPOSE 7860 8000
+
+ CMD ["python", "app.py"]
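Because the default `Dockerfile` now omits PyTorch while `Dockerfile.train` includes it, application code may need to detect which image it is running in. The following is a minimal sketch of such a guard, not code from this repository: only `CODE_PROVIDER_MODE` and its `mock` value appear in the commit, while the `local` and `api` mode names and both function names are assumptions made for illustration.

```python
import importlib.util
import os
from typing import Mapping, Optional


def training_stack_available() -> bool:
    """Return True when PyTorch is importable (likely the Dockerfile.train image)."""
    return importlib.util.find_spec("torch") is not None


def resolve_provider_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a provider mode, falling back to 'mock' when local weights cannot load.

    'local' is a hypothetical mode name meaning "load model weights in-process".
    """
    env = os.environ if env is None else env
    mode = env.get("CODE_PROVIDER_MODE", "mock")
    if mode == "local" and not training_stack_available():
        return "mock"  # slim image: no PyTorch, so degrade instead of crashing
    return mode


if __name__ == "__main__":
    print("torch available:", training_stack_available())
    print("provider mode:", resolve_provider_mode())
```

Using `importlib.util.find_spec` checks for the package without importing it, so the check stays cheap even in the full image.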
MINI_BLOG.md CHANGED
@@ -1,47 +1,47 @@
- # 🛡️ FORGE-v4: Building the "Immune System" for AI Code Generation
+ # FORGE-v4 Mini Blog: From Fragile Code to Adversarial Robustness

- ### The Silent Crisis in AI Coding
- We've all seen it: an AI writes a perfect "Quick Sort" in seconds. But what happens when you give that same code an array of 10,000 duplicate zeros? Or a list of mixed large negatives? Often, the AI's "perfect" code crashes, enters an infinite loop, or returns incorrect results.
+ ## The story in one line
+ FORGE-v4 trains a coding agent to survive adversarial edge cases by making it fight a breaker, learn from failures, and improve over repeated reward-driven episodes.

- Standard benchmarks measure **capability**. We built **FORGE-v4** to measure **robustness**.
+ ## Why we built this
+ Most coding models look good on clean examples and then fail on real inputs: negatives, duplicates, boundary values, and timeout-prone cases. We wanted an environment where failure is explicit, measurable, and useful for training.

- ---
+ ## The journey
+ ### Chapter 1: baseline confidence, hidden fragility
+ We started with a defender that often passed easy tests but broke under stress tiers. That gave us a critical signal: average correctness is not robustness.

- ## ⚔️ The Concept: Adversarial Red-Teaming
- FORGE-v4 isn't just a static test suite; it's a living environment. We implemented a **Red-vs-Blue** dynamic:
- - **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
- - **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
+ ### Chapter 2: breaker escalation
+ We added a tiered breaker that progressively attacked blind spots. The environment moved from simple lists to harder adversarial distributions.

- As the Coder improves, the Breaker escalates. It progresses through **4 Tiers of difficulty**, from basic lists to extreme boundary values and stress tests. This tiered red-teaming ensures that the model isn't just memorizing common patterns, but actually hardening its logic.
+ ### Chapter 3: memory as improvement engine
+ CoachMemory converted repeated failure patterns into structured lessons. Instead of forgetting mistakes each episode, the loop made mistakes actionable.

- ---
+ ### Chapter 4: measurable training loop
+ We used benchmark/compare runs to produce reward and pass-rate evidence, exported preference pairs, and connected that to a small-model-first adapter training path.

- ## 🧠 The Secret Sauce: CoachMemory
- One of the most innovative features of FORGE-v4 is the **CoachMemory feedback loop**.
+ ## What changed after training cycles
+ - Defender pass rate stabilized under tougher tiers.
+ - Average defender reward improved versus baseline runs.
+ - Breaker pressure remained high, but the defender failed less often on known edge patterns.

- In most training environments, a model fails, gets a low reward, and moves on. In FORGE-v4, every failure is analyzed by the "Coach."
- * Did the model fail on negatives?
- * Did it time out on large arrays?
- * Did it destroy duplicates?
+ ## Evidence (committed outputs)
+ ### Reward trend
+ ![Reward curve](outputs/reward_curve.png)

- These insights are stored in persistent memory. In the next episode, the model reads these "lessons" and adapts its strategy. This mimics the human engineering process: **Mistake → Analysis → Correction.**
+ ### Pass-rate trend
+ ![Pass rate curve](outputs/pass_rate.png)

- ---
+ ### Loss-like training signal
+ ![Loss curve](outputs/loss_curve.png)

- ## 📈 Results that Matter
- Our benchmarks show that while a baseline heuristic policy might have a high "average" pass rate (91%), it is easily broken by Tier 3 and Tier 4 attacks.
+ ### Machine-readable benchmark summary
+ - `outputs/final_report.json`

- Our **FORGE-v4 Model Policy** achieved:
- - **100% Pass Rate** across all adversarial tiers.
- - **+2.10 Reward Gain** over the baseline.
- - **Sustained Tier 4 Robustness**: It didn't just survive; it thrived under extreme pressure.
+ ## Deliverables
+ - Hugging Face Space: https://huggingface.co/spaces/sanjay7676/Team404_FORGE
+ - GitHub repository: https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2
+ - Colab notebook: https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb
+ - YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID

- ---
-
- ## 🌍 Why This Matters
- As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
-
- **Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
-
- ---
- *Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*
+ ## Why this matters
+ FORGE-v4 is designed to train coding behavior that is verifiable, harder to reward-hack, and more resilient under adversarial conditions. That is the capability gap we think matters most for real LLM deployment.
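Chapter 4 of the new MINI_BLOG.md mentions exporting preference pairs from benchmark runs. One way this could look is sketched below, under stated assumptions: the attempt record fields (`prompt`, `code`, `reward`) and both function names are hypothetical, and the `prompt`/`chosen`/`rejected` keys follow a convention commonly used for preference datasets rather than a schema confirmed by this commit.

```python
import json
from typing import Dict, Iterable, List


def build_preference_pairs(attempts: Iterable[Dict]) -> List[Dict]:
    """Pair the highest- and lowest-reward attempt for each prompt.

    Prompts whose attempts all share the same reward yield no pair,
    because there is no preference signal to learn from.
    """
    by_prompt: Dict[str, List[Dict]] = {}
    for attempt in attempts:
        by_prompt.setdefault(attempt["prompt"], []).append(attempt)

    pairs: List[Dict] = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda a: a["reward"])
        worst = min(group, key=lambda a: a["reward"])
        if best["reward"] > worst["reward"]:
            pairs.append(
                {"prompt": prompt, "chosen": best["code"], "rejected": worst["code"]}
            )
    return pairs


def to_jsonl(pairs: Iterable[Dict]) -> str:
    """Serialize pairs as JSON Lines, one pair per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


if __name__ == "__main__":
    # Hypothetical episode results for illustration only.
    attempts = [
        {"prompt": "sort a list", "code": "return sorted(xs)", "reward": 1.0},
        {"prompt": "sort a list", "code": "return sorted(set(xs))", "reward": -0.5},
        {"prompt": "reverse a list", "code": "return xs[::-1]", "reward": 1.0},
    ]
    print(to_jsonl(build_preference_pairs(attempts)))
```

Skipping prompts with no reward gap keeps uninformative pairs out of the exported file.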
README.md CHANGED
@@ -35,7 +35,7 @@ suggested_hardware: cpu-basic
  | **Training Colab (synced from GitHub)** | [FORGE_Training_Colab.ipynb on Colab](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb) |
  | **Trained adapter** | [sanjay7676/forge-qwen-final](https://huggingface.co/sanjay7676/forge-qwen-final) |
  | **Command / security cheat sheet** | [guide.md](guide.md) |
- | **Video / slides** | Optional. Current submission uses the mini-blog requirement via [MINI_BLOG.md](MINI_BLOG.md). |
+ | **Video / slides** | YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID |

  ### Hugging Face Space (CPU-only)

@@ -256,6 +256,10 @@ Deployment note: as of the latest verification, the Space URL is serving the Gra

  **Repository:** [https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2](https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2)

+ ## 15.1 Demo video placeholder
+
+ **YouTube (to publish before final submission):** https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID
+
  ---

  ## 16. Why judges should care
@@ -311,10 +315,17 @@ Anyone can `docker pull` a **public** image without logging in. `docker login` i

  ### Build locally, tag, and push to Docker Hub (one-time)

+ **Fast image (default `Dockerfile`):** only `requirements.txt`, no PyTorch. Usually **a few minutes**. Good for demo, Gradio, and `CODE_PROVIDER_MODE=mock` (or API-backed providers).
+
+ **Full training image:** [`Dockerfile.train`](Dockerfile.train) adds `requirements-train.txt` (PyTorch + CUDA wheels). Expect **tens of minutes to an hour+** on first build.
+
  ```bash
  cd /path/to/FORGE
  docker build -t forge:latest .

+ # Optional: image with PyTorch / TRL / PEFT for training inside the container
+ # docker build -f Dockerfile.train -t forge:train .
+
  # Log in (opens browser or prompts for password / access token)
  docker login -u sanjay767676

@@ -329,7 +340,7 @@ Use a [Docker Hub access token](https://docs.docker.com/docker-hub/access-tokens
  - **`forge-api`** → FastAPI OpenEnv server on `http://localhost:8000`
  - **`forge-ui`** → Gradio app on `http://localhost:7860`

- Files: [`Dockerfile`](Dockerfile), [`docker-compose.yml`](docker-compose.yml), [`.dockerignore`](.dockerignore)
+ Files: [`Dockerfile`](Dockerfile) (slim), [`Dockerfile.train`](Dockerfile.train) (full), [`docker-compose.yml`](docker-compose.yml), [`.dockerignore`](.dockerignore)

  Build and run from this repo:
