Commit a2ac82d
Parent(s): 310dc9f
Add story blog updates and submission-ready docs links
- BLOG.md +47 -0
- Dockerfile +4 -2
- Dockerfile.train +28 -0
- MINI_BLOG.md +34 -34
- README.md +13 -2
BLOG.md
ADDED
@@ -0,0 +1,47 @@
+# 🛡️ FORGE-v4: Building the "Immune System" for AI Code Generation
+
+### The Silent Crisis in AI Coding
+We've all seen it: an AI writes a perfect "Quick Sort" in seconds. But what happens when you give that same code an array of 10,000 duplicate zeros? Or a list of mixed large negatives? Often, the AI's "perfect" code crashes, enters an infinite loop, or returns incorrect results.
+
+Standard benchmarks measure **capability**. We built **FORGE-v4** to measure **robustness**.
+
+---
+
+## ⚔️ The Concept: Adversarial Red-Teaming
+FORGE-v4 isn't just a static test suite; it's a living environment. We implemented a **Red-vs-Blue** dynamic:
+- **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
+- **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
+
+As the Coder improves, the Breaker escalates. It progresses through **4 Tiers of difficulty**, from basic lists to extreme boundary values and stress tests. This tiered red-teaming ensures that the model isn't just memorizing common patterns, but actually hardening its logic.
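To make the tiered escalation described above more concrete, here is a minimal sketch of a breaker that generates sorting inputs per tier and reports the first input that breaks a candidate. The tier contents, sizes, and the names `make_tier_cases`, `find_break`, and `dedup_sort` are assumptions for illustration; the post only states that the tiers run from basic lists to extreme boundary values and stress tests.

```python
import random
from typing import Callable, List, Optional


def make_tier_cases(tier: int, seed: int = 0) -> List[List[int]]:
    """Return hypothetical sorting inputs for a difficulty tier (1-4)."""
    rng = random.Random(seed)
    if tier == 1:  # basic lists
        return [[3, 1, 2], [], [5]]
    if tier == 2:  # duplicates and negatives
        return [[0] * 50, [-3, -1, -2, -3], [2, -2, 2, -2]]
    if tier == 3:  # boundary values (64-bit extremes)
        big = 2**63 - 1
        return [[big, -big - 1, 0], [big] * 5 + [-big - 1] * 5]
    if tier == 4:  # stress tests
        return [[0] * 10_000, [rng.randint(-10**9, 10**9) for _ in range(10_000)]]
    raise ValueError(f"unknown tier: {tier}")


def find_break(candidate: Callable[[List[int]], List[int]], tier: int) -> Optional[List[int]]:
    """Return the first input in the tier that the candidate gets wrong, else None."""
    for case in make_tier_cases(tier):
        try:
            if candidate(list(case)) != sorted(case):
                return case
        except Exception:
            return case  # a crash counts as a break
    return None


def dedup_sort(xs: List[int]) -> List[int]:
    """Deliberately buggy defender: silently drops duplicates."""
    return sorted(set(xs))


# The buggy defender survives tier 1 and is first broken at tier 2.
first_broken = next((t for t in range(1, 5) if find_break(dedup_sort, t) is not None), None)
print(first_broken)  # 2
```

Using the built-in `sorted` as the oracle keeps the check simple; a real environment would likely also enforce a time limit per case, which this sketch leaves out.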
+
+---
+
+## 🧠 The Secret Sauce: CoachMemory
+One of the most innovative features of FORGE-v4 is the **CoachMemory feedback loop**.
+
+In most training environments, a model fails, gets a low reward, and moves on. In FORGE-v4, every failure is analyzed by the "Coach."
+* Did the model fail on negatives?
+* Did it time out on large arrays?
+* Did it destroy duplicates?
+
+These insights are stored in persistent memory. In the next episode, the model reads these "lessons" and adapts its strategy. This mimics the human engineering process: **Mistake → Analysis → Correction.**
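The failure-to-lesson loop described above could look roughly like the following. This is a toy stand-in, not the project's CoachMemory: the class name `LessonMemory`, the failure tags, the classification rules, and the JSON file layout are all assumptions, since the implementation is not part of this commit.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path
from typing import List


def classify_failure(case: List[int], timed_out: bool, output: List[int]) -> str:
    """Map one failed test to a coarse lesson tag (hypothetical categories)."""
    if timed_out:
        return "timeout_on_large_input"
    if len(output) < len(case):
        return "dropped_duplicates"
    if any(x < 0 for x in case):
        return "wrong_on_negatives"
    return "other"


class LessonMemory:
    """Toy persistent lesson store backed by a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Reload earlier lessons if the file already exists.
        self.counts = Counter(json.loads(path.read_text())) if path.exists() else Counter()

    def record(self, tag: str) -> None:
        self.counts[tag] += 1
        self.path.write_text(json.dumps(self.counts))

    def lessons(self, top_k: int = 3) -> List[str]:
        """Most frequent failure patterns, ready to prepend to the next prompt."""
        return [f"Seen {n}x: {tag}" for tag, n in self.counts.most_common(top_k)]


store = Path(tempfile.mkdtemp()) / "lessons.json"
memory = LessonMemory(store)
memory.record(classify_failure([0, 0, 0], timed_out=False, output=[0]))
memory.record(classify_failure([-2, -1], timed_out=False, output=[-1, -2]))
memory.record(classify_failure([0, 0], timed_out=False, output=[0]))

# A later episode reloads the lessons from disk.
next_episode = LessonMemory(store)
print(next_episode.lessons())
```

Storing counts rather than raw failures keeps the memory small and lets the most frequent patterns surface first.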
+
+---
+
+## Results that Matter
+Our benchmarks show that while a baseline heuristic policy might have a high "average" pass rate (91%), it is easily broken by Tier 3 and Tier 4 attacks.
+
+Our **FORGE-v4 Model Policy** achieved:
+- **100% Pass Rate** across all adversarial tiers.
+- **+2.10 Reward Gain** over the baseline.
+- **Sustained Tier 4 Robustness**: It didn't just survive; it thrived under extreme pressure.
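The point that a high average pass rate can hide failures on the hardest tiers is easy to see with a per-tier breakdown. The sketch below uses made-up records purely for illustration; they are not the project's benchmark data, and `pass_rates` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def pass_rates(results: List[Tuple[int, bool]]) -> Tuple[float, Dict[int, float]]:
    """Return (overall pass rate, pass rate per tier) from (tier, passed) records."""
    per_tier: Dict[int, List[bool]] = {}
    for tier, passed in results:
        per_tier.setdefault(tier, []).append(passed)
    overall = sum(passed for _, passed in results) / len(results)
    return overall, {t: sum(v) / len(v) for t, v in sorted(per_tier.items())}


# Hypothetical run: many easy cases, few hard ones (illustrative numbers only).
records = (
    [(1, True)] * 60
    + [(2, True)] * 30
    + [(3, True)] * 2
    + [(3, False)] * 4
    + [(4, False)] * 4
)
overall, by_tier = pass_rates(records)
print(f"overall={overall:.2f}")
for tier, rate in by_tier.items():
    print(f"tier {tier}: {rate:.2f}")
```

Because easy tiers dominate the sample, the overall rate stays high even though the tier 4 rate is zero, which is why reporting per tier is the more informative choice.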
+
+---
+
+## Why This Matters
+As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
+
+**Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
+
+---
+*Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*
Dockerfile
CHANGED
@@ -13,9 +13,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     curl \
     && rm -rf /var/lib/apt/lists/*

-
+# Slim image: Gradio + API + OpenEnv (no PyTorch). Builds in minutes - same stack as HF CPU Space.
+# For training inside Docker: docker build -f Dockerfile.train -t forge:train .
+COPY requirements.txt ./
 RUN pip install --upgrade pip && \
-    pip install -r requirements.txt
+    pip install -r requirements.txt

 COPY . .

Dockerfile.train
ADDED
@@ -0,0 +1,28 @@
+# Full stack: PyTorch, TRL, PEFT, etc. - large downloads (often 30-90+ min on first build).
+# Use only when you need training or local HF weights inside the image.
+FROM python:3.11-slim
+
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    CODE_PROVIDER_MODE=mock
+
+WORKDIR /app
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    git \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt requirements-train.txt ./
+RUN pip install --upgrade pip && \
+    pip install -r requirements.txt -r requirements-train.txt
+
+COPY . .
+
+RUN mkdir -p data logs models outputs
+
+EXPOSE 7860 8000
+
+CMD ["python", "app.py"]
MINI_BLOG.md
CHANGED
@@ -1,47 +1,47 @@
-#
+# FORGE-v4 Mini Blog: From Fragile Code to Adversarial Robustness

-##
-
+## The story in one line
+FORGE-v4 trains a coding agent to survive adversarial edge cases by making it fight a breaker, learn from failures, and improve over repeated reward-driven episodes.

-
+## Why we built this
+Most coding models look good on clean examples and then fail on real inputs: negatives, duplicates, boundary values, and timeout-prone cases. We wanted an environment where failure is explicit, measurable, and useful for training.

-
+## The journey
+### Chapter 1: baseline confidence, hidden fragility
+We started with a defender that often passed easy tests but broke under stress tiers. That gave us a critical signal: average correctness is not robustness.

-##
-
-- **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
-- **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
+### Chapter 2: breaker escalation
+We added a tiered breaker that progressively attacked blind spots. The environment moved from simple lists to harder adversarial distributions.

-
+### Chapter 3: memory as improvement engine
+CoachMemory converted repeated failure patterns into structured lessons. Instead of forgetting mistakes each episode, the loop made mistakes actionable.

-
+### Chapter 4: measurable training loop
+We used benchmark/compare runs to produce reward and pass-rate evidence, exported preference pairs, and connected that to a small-model-first adapter training path.

-##
-
+## What changed after training cycles
+- Defender pass rate stabilized under tougher tiers.
+- Average defender reward improved versus baseline runs.
+- Breaker pressure remained high, but the defender failed less often on known edge patterns.

-
-
-
-* Did it destroy duplicates?
+## Evidence (committed outputs)
+### Reward trend
+

-
+### Pass-rate trend
+

--
+### Loss-like training signal
+

-##
-
+### Machine-readable benchmark summary
+- `outputs/final_report.json`

-
--
--
--
+## Deliverables
+- Hugging Face Space: https://huggingface.co/spaces/sanjay7676/Team404_FORGE
+- GitHub repository: https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2
+- Colab notebook: https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb
+- YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID

-
-
-## Why This Matters
-As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
-
-**Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
-
----
-*Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*
+## Why this matters
+FORGE-v4 is designed to train coding behavior that is verifiable, harder to reward-hack, and more resilient under adversarial conditions. That is the capability gap we think matters most for real LLM deployment.
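Chapter 4 of the mini blog mentions exporting preference pairs from benchmark runs. A minimal sketch of what such an export might look like follows. The `prompt`/`chosen`/`rejected` field names follow a convention common in preference-tuning tools; whether FORGE-v4 uses this schema is an assumption, and `build_preference_pairs` and `to_jsonl` are hypothetical helpers.

```python
import json
from typing import Dict, List


def build_preference_pairs(prompt: str, attempts: List[Dict]) -> List[Dict[str, str]]:
    """Pair every passing attempt (chosen) with every failing one (rejected)."""
    passed = [a["code"] for a in attempts if a["passed"]]
    failed = [a["code"] for a in attempts if not a["passed"]]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in passed
        for bad in failed
    ]


def to_jsonl(pairs: List[Dict[str, str]]) -> str:
    """Serialize pairs as one JSON object per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


# Hypothetical attempts on the same task: one survives the breaker, one does not.
attempts = [
    {"code": "def solve(xs): return sorted(xs)", "passed": True},
    {"code": "def solve(xs): return sorted(set(xs))", "passed": False},
]
pairs = build_preference_pairs("Sort a list of integers.", attempts)
print(to_jsonl(pairs))
```

Pairing attempts on the same prompt means the only signal the trainer sees is the difference between robust and fragile code, not a difference in task.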
README.md
CHANGED
@@ -35,7 +35,7 @@ suggested_hardware: cpu-basic
 | **Training Colab (synced from GitHub)** | [FORGE_Training_Colab.ipynb on Colab](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb) |
 | **Trained adapter** | [sanjay7676/forge-qwen-final](https://huggingface.co/sanjay7676/forge-qwen-final) |
 | **Command / security cheat sheet** | [guide.md](guide.md) |
-| **Video / slides** |
+| **Video / slides** | YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID |

 ### Hugging Face Space (CPU-only)

@@ -256,6 +256,10 @@ Deployment note: as of the latest verification, the Space URL is serving the Gra

 **Repository:** [https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2](https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2)

+## 15.1 Demo video placeholder
+
+**YouTube (to publish before final submission):** https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID
+
 ---

 ## 16. Why judges should care
@@ -311,10 +315,17 @@ Anyone can `docker pull` a **public** image without logging in. `docker login` i

 ### Build locally, tag, and push to Docker Hub (one-time)

+**Fast image (default `Dockerfile`):** only `requirements.txt`, no PyTorch. Usually **a few minutes**. Good for demo, Gradio, and `CODE_PROVIDER_MODE=mock` (or API-backed providers).
+
+**Full training image:** [`Dockerfile.train`](Dockerfile.train) adds `requirements-train.txt` (PyTorch + CUDA wheels). Expect **tens of minutes to an hour+** on first build.
+
 ```bash
 cd /path/to/FORGE
 docker build -t forge:latest .

+# Optional: image with PyTorch / TRL / PEFT for training inside the container
+# docker build -f Dockerfile.train -t forge:train .
+
 # Log in (opens browser or prompts for password / access token)
 docker login -u sanjay767676

@@ -329,7 +340,7 @@ Use a [Docker Hub access token](https://docs.docker.com/docker-hub/access-tokens
 - **`forge-api`** - FastAPI OpenEnv server on `http://localhost:8000`
 - **`forge-ui`** - Gradio app on `http://localhost:7860`

-Files: [`Dockerfile`](Dockerfile), [`docker-compose.yml`](docker-compose.yml), [`.dockerignore`](.dockerignore)
+Files: [`Dockerfile`](Dockerfile) (slim), [`Dockerfile.train`](Dockerfile.train) (full), [`docker-compose.yml`](docker-compose.yml), [`.dockerignore`](.dockerignore)

 Build and run from this repo:
