Spaces: sanjay7676/Team404_FORGE

sanjay7676 committed on Apr 26

Commit a2ac82d (1 parent: 310dc9f)

Add story blog updates and submission-ready docs links

Files changed (5)

BLOG.md +47 -0
Dockerfile +4 -2
Dockerfile.train +28 -0
MINI_BLOG.md +34 -34
README.md +13 -2

BLOG.md ADDED

	@@ -0,0 +1,47 @@

+# 🛡️ FORGE-v4: Building the "Immune System" for AI Code Generation
+### The Silent Crisis in AI Coding
+We've all seen it: an AI writes a perfect "Quick Sort" in seconds. But what happens when you give that same code an array of 10,000 duplicate zeros? Or a list of mixed large negatives? Often, the AI's "perfect" code crashes, enters an infinite loop, or returns incorrect results.
+Standard benchmarks measure **capability**. We built **FORGE-v4** to measure **robustness**.
+---
+## ⚔️ The Concept: Adversarial Red-Teaming
+FORGE-v4 isn't just a static test suite; it's a living environment. We implemented a **Red-vs-Blue** dynamic:
+- **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
+- **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
+As the Coder improves, the Breaker escalates. It progresses through **4 Tiers of difficulty**—from basic lists to extreme boundary values and stress tests. This tiered red-teaming ensures that the model isn't just memorizing common patterns, but actually hardening its logic.
+---
+## 🧠 The Secret Sauce: CoachMemory
+One of the most innovative features of FORGE-v4 is the **CoachMemory feedback loop**.
+In most training environments, a model fails, gets a low reward, and moves on. In FORGE-v4, every failure is analyzed by the "Coach."
+*   Did the model fail on negatives?
+*   Did it time out on large arrays?
+*   Did it destroy duplicates?
+These insights are stored in persistent memory. In the next episode, the model reads these "lessons" and adapts its strategy. This mimics the human engineering process: **Mistake → Analysis → Correction.**
+---
+## 📈 Results that Matter
+Our benchmarks show that while a baseline heuristic policy might have a high "average" pass rate (91%), it is easily broken by Tier 3 and Tier 4 attacks.
+Our **FORGE-v4 Model Policy** achieved:
+- **100% Pass Rate** across all adversarial tiers.
+- **+2.10 Reward Gain** over the baseline.
+- **Sustained Tier 4 Robustness**: It didn't just survive; it thrived under extreme pressure.
+---
+## 🌍 Why This Matters
+As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
+**Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
+---
+*Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*
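
Note on BLOG.md: the tiered Red-vs-Blue loop it describes can be sketched as below. This is a minimal illustration, not code from the repository. The names `make_tier_cases`, `is_correct_sort`, `run_breaker`, and `dedup_sort` are hypothetical, and the inputs chosen for each tier are assumptions based only on the blog's wording (basic lists, duplicates and negatives, boundary values, stress tests). The sketch does not enforce time limits, so a candidate that loops forever would hang rather than be scored as a failure.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

IntList = List[int]


def make_tier_cases(tier: int, seed: int = 0) -> List[IntList]:
    """Hypothetical breaker: adversarial inputs for tiers 1 (easy) to 4 (stress)."""
    rng = random.Random(seed)
    if tier == 1:  # basic lists
        return [[3, 1, 2], [5, 4, 3, 2, 1], []]
    if tier == 2:  # duplicates and negatives
        return [[0] * 50, [-1, -1, 2, 2, -3], [rng.randint(-9, 9) for _ in range(30)]]
    if tier == 3:  # extreme boundary values
        big = 2**63 - 1
        return [[big, -big - 1, 0], [-big, big, -big, big], [1]]
    if tier == 4:  # stress tests
        return [[0] * 10_000, [rng.randint(-10**9, 10**9) for _ in range(10_000)]]
    raise ValueError(f"unknown tier: {tier}")


def is_correct_sort(original: IntList, result: IntList) -> bool:
    """A result is correct if it is ordered and keeps every element, duplicates included."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def run_breaker(sort_fn: Callable[[IntList], IntList], max_tier: int = 4) -> Dict[int, float]:
    """Return the pass rate per tier; crashes count as failures."""
    rates: Dict[int, float] = {}
    for tier in range(1, max_tier + 1):
        cases = make_tier_cases(tier)
        passed = 0
        for case in cases:
            try:
                passed += is_correct_sort(case, sort_fn(list(case)))
            except Exception:
                pass  # a crash (for example RecursionError) is a failed case
        rates[tier] = passed / len(cases)
    return rates


def dedup_sort(xs: IntList) -> IntList:
    """Deliberately fragile defender: orders the values but destroys duplicates."""
    return sorted(set(xs))
```

Calling `run_breaker(dedup_sort)` shows the pattern the blog warns about: the fragile defender passes every Tier 1 case and then loses ground once duplicates appear, while `run_breaker(sorted)` passes all tiers.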

Dockerfile CHANGED

@@ -13,9 +13,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     curl \
     && rm -rf /var/lib/apt/lists/*
-COPY requirements.txt requirements-train.txt ./
+# Slim image: Gradio + API + OpenEnv (no PyTorch). Builds in minutes — same stack as HF CPU Space.
+# For training inside Docker: docker build -f Dockerfile.train -t forge:train .
+COPY requirements.txt ./
 RUN pip install --upgrade pip && \
-    pip install -r requirements.txt -r requirements-train.txt
+    pip install -r requirements.txt
 COPY . .
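
Note on the slim image: because the default `Dockerfile` no longer installs `requirements-train.txt`, any code path that needs PyTorch, TRL, or PEFT may want to check for them before importing. The sketch below is one hedged way to do that, not the repository's implementation. The helper names `missing_training_packages` and `provider_mode` are hypothetical, the import names `torch`, `trl`, and `peft` are the usual ones for those libraries, and reading `CODE_PROVIDER_MODE` with a default of `mock` is an assumption about how the app consumes that variable.

```python
import importlib.util
import os
from typing import List, Sequence

# Import names for the training stack mentioned in Dockerfile.train (PyTorch, TRL, PEFT).
TRAINING_PACKAGES = ("torch", "trl", "peft")


def missing_training_packages(packages: Sequence[str] = TRAINING_PACKAGES) -> List[str]:
    """Return the top-level packages that are not importable in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def provider_mode(default: str = "mock") -> str:
    """Read CODE_PROVIDER_MODE; falling back to 'mock' is an assumption, not confirmed."""
    return os.environ.get("CODE_PROVIDER_MODE", default)
```

In the slim image, `missing_training_packages()` would be expected to list the training packages, which a caller could use to skip training features instead of failing on import.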

Dockerfile.train ADDED

	@@ -0,0 +1,28 @@

+# Full stack: PyTorch, TRL, PEFT, etc. — large downloads (often 30–90+ min on first build).
+# Use only when you need training or local HF weights inside the image.
+FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    CODE_PROVIDER_MODE=mock
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    git \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+COPY requirements.txt requirements-train.txt ./
+RUN pip install --upgrade pip && \
+    pip install -r requirements.txt -r requirements-train.txt
+COPY . .
+RUN mkdir -p data logs models outputs
+EXPOSE 7860 8000
+CMD ["python", "app.py"]
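
Note on the exposed ports: `Dockerfile.train` exposes 7860 and 8000, which the README maps to the Gradio app (`forge-ui`) and the FastAPI OpenEnv server (`forge-api`). A small reachability check can confirm that a running container is actually listening. The sketch below uses only the standard library; the `SERVICES` mapping restates the README's ports, and the function names are hypothetical.

```python
import socket
from typing import Dict

# Service names and ports as listed in the README.
SERVICES = {"forge-api": 8000, "forge-ui": 7860}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(host: str = "localhost", services: Dict[str, int] = SERVICES) -> Dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}
```

This only checks that something accepts TCP connections on each port. It does not verify that the responder is the expected service.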

MINI_BLOG.md CHANGED

@@ -1,47 +1,47 @@
-# 🛡️ FORGE-v4: Building the "Immune System" for AI Code Generation
-### The Silent Crisis in AI Coding
-We've all seen it: an AI writes a perfect "Quick Sort" in seconds. But what happens when you give that same code an array of 10,000 duplicate zeros? Or a list of mixed large negatives? Often, the AI's "perfect" code crashes, enters an infinite loop, or returns incorrect results.
-Standard benchmarks measure **capability**. We built **FORGE-v4** to measure **robustness**.
----
-## ⚔️ The Concept: Adversarial Red-Teaming
-FORGE-v4 isn't just a static test suite; it's a living environment. We implemented a **Red-vs-Blue** dynamic:
-- **The Defender (Blue Team)**: Our Coder agent tries to solve sorting tasks correctly.
-- **The Adversary (Red Team)**: Our Breaker agent actively searches for the Coder's "blind spots."
-As the Coder improves, the Breaker escalates. It progresses through **4 Tiers of difficulty**—from basic lists to extreme boundary values and stress tests. This tiered red-teaming ensures that the model isn't just memorizing common patterns, but actually hardening its logic.
----
-## 🧠 The Secret Sauce: CoachMemory
-One of the most innovative features of FORGE-v4 is the **CoachMemory feedback loop**.
-In most training environments, a model fails, gets a low reward, and moves on. In FORGE-v4, every failure is analyzed by the "Coach."
-*   Did the model fail on negatives?
-*   Did it time out on large arrays?
-*   Did it destroy duplicates?
-These insights are stored in persistent memory. In the next episode, the model reads these "lessons" and adapts its strategy. This mimics the human engineering process: **Mistake → Analysis → Correction.**
----
-## 📈 Results that Matter
-Our benchmarks show that while a baseline heuristic policy might have a high "average" pass rate (91%), it is easily broken by Tier 3 and Tier 4 attacks.
-Our **FORGE-v4 Model Policy** achieved:
-- **100% Pass Rate** across all adversarial tiers.
-- **+2.10 Reward Gain** over the baseline.
-- **Sustained Tier 4 Robustness**: It didn't just survive; it thrived under extreme pressure.
----
-## 🌍 Why This Matters
-As AI agents move from "writing scripts" to "building infrastructure," robustness is no longer optional. FORGE-v4 provides the framework to ensure that the code powering our world is not just smart, but **unbreakable**.
-**Try the demo:** [Hugging Face Space](https://huggingface.co/spaces/sanjay7676/Team404_FORGE)
----
-*Created with ❤️ for the Meta OpenEnv Hackathon by Team 404.*

+# FORGE-v4 Mini Blog: From Fragile Code to Adversarial Robustness
+## The story in one line
+FORGE-v4 trains a coding agent to survive adversarial edge cases by making it fight a breaker, learn from failures, and improve over repeated reward-driven episodes.
+## Why we built this
+Most coding models look good on clean examples and then fail on real inputs: negatives, duplicates, boundary values, and timeout-prone cases. We wanted an environment where failure is explicit, measurable, and useful for training.
+## The journey
+### Chapter 1: baseline confidence, hidden fragility
+We started with a defender that often passed easy tests but broke under stress tiers. That gave us a critical signal: average correctness is not robustness.
+### Chapter 2: breaker escalation
+We added a tiered breaker that progressively attacked blind spots. The environment moved from simple lists to harder adversarial distributions.
+### Chapter 3: memory as improvement engine
+CoachMemory converted repeated failure patterns into structured lessons. Instead of forgetting mistakes each episode, the loop made mistakes actionable.
+### Chapter 4: measurable training loop
+We used benchmark/compare runs to produce reward and pass-rate evidence, exported preference pairs, and connected that to a small-model-first adapter training path.
+## What changed after training cycles
+- Defender pass rate stabilized under tougher tiers.
+- Average defender reward improved versus baseline runs.
+- Breaker pressure remained high, but the defender failed less often on known edge patterns.
+## Evidence (committed outputs)
+### Reward trend
+![Reward curve](outputs/reward_curve.png)
+### Pass-rate trend
+![Pass rate curve](outputs/pass_rate.png)
+### Loss-like training signal
+![Loss curve](outputs/loss_curve.png)
+### Machine-readable benchmark summary
+- `outputs/final_report.json`
+## Deliverables
+- Hugging Face Space: https://huggingface.co/spaces/sanjay7676/Team404_FORGE
+- GitHub repository: https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2
+- Colab notebook: https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb
+- YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID
+## Why this matters
+FORGE-v4 is designed to train coding behavior that is verifiable, harder to reward-hack, and more resilient under adversarial conditions. That is the capability gap we think matters most for real LLM deployment.
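
Note on MINI_BLOG.md, Chapter 3: the idea of turning repeated failure patterns into structured lessons can be sketched as follows. This is not the repository's `CoachMemory`. The class `CoachMemorySketch`, the function `classify_failure`, the tag names, the lesson wording, and the JSON file format are all assumptions made for illustration, based on the failure questions listed in the blog (negatives, timeouts, destroyed duplicates).

```python
import json
from collections import Counter
from pathlib import Path
from typing import List

# Hypothetical lesson text keyed by failure tag.
LESSONS = {
    "negatives": "Handle negative values explicitly.",
    "duplicates": "Preserve duplicate elements in the output.",
    "timeout": "Avoid worst-case blowups on large inputs.",
    "other": "Re-check the basic ordering logic.",
}


def classify_failure(case: List[int], timed_out: bool = False) -> List[str]:
    """Tag a failing input with the edge-case patterns it contains."""
    tags = []
    if any(x < 0 for x in case):
        tags.append("negatives")
    if len(set(case)) < len(case):
        tags.append("duplicates")
    if timed_out:
        tags.append("timeout")
    return tags or ["other"]


class CoachMemorySketch:
    """Hypothetical stand-in for CoachMemory: persists failure counts between episodes."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.counts: Counter = Counter()
        if self.path.exists():
            self.counts.update(json.loads(self.path.read_text()))

    def record_failure(self, case: List[int], timed_out: bool = False) -> List[str]:
        """Classify a failure, update the counts, and write them to disk."""
        tags = classify_failure(case, timed_out)
        self.counts.update(tags)
        self.path.write_text(json.dumps(self.counts))
        return tags

    def lessons(self, top_k: int = 3) -> List[str]:
        """Return lesson text for the most frequent failure patterns."""
        return [LESSONS[tag] for tag, _ in self.counts.most_common(top_k)]
```

A new episode would construct `CoachMemorySketch` with the same path and call `lessons()` to retrieve the most frequent patterns, for example to prepend them to the defender's prompt. How the real loop feeds lessons back to the model is not stated in the source.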

README.md CHANGED

@@ -35,7 +35,7 @@ suggested_hardware: cpu-basic
 | **Training Colab (synced from GitHub)** | [FORGE_Training_Colab.ipynb on Colab](https://colab.research.google.com/github/Sanjay767676/Meta-x-Scaler-Team404--Round2/blob/main/FORGE_Training_Colab.ipynb) |
 | **Trained adapter** | [sanjay7676/forge-qwen-final](https://huggingface.co/sanjay7676/forge-qwen-final) |
 | **Command / security cheat sheet** | [guide.md](guide.md) |
-| **Video / slides** | Optional. Current submission uses the mini-blog requirement via [MINI_BLOG.md](MINI_BLOG.md). |
+| **Video / slides** | YouTube demo placeholder: https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID |
 ### Hugging Face Space (CPU-only)
@@ -256,6 +256,10 @@ Deployment note: as of the latest verification, the Space URL is serving the Gra
 **Repository:** [https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2](https://github.com/Sanjay767676/Meta-x-Scaler-Team404--Round2)
+## 15.1 Demo video placeholder
+**YouTube (to publish before final submission):** https://youtube.com/watch?v=YOUR_DEMO_VIDEO_ID
 ---
 ## 16. Why judges should care
@@ -311,10 +315,17 @@ Anyone can `docker pull` a **public** image without logging in. `docker login` i
 ### Build locally, tag, and push to Docker Hub (one-time)
+**Fast image (default `Dockerfile`):** only `requirements.txt` — no PyTorch. Usually **a few minutes**. Good for demo, Gradio, and `CODE_PROVIDER_MODE=mock` (or API-backed providers).
+**Full training image:** [`Dockerfile.train`](Dockerfile.train) adds `requirements-train.txt` (PyTorch + CUDA wheels). Expect **tens of minutes to an hour+** on first build.
 ```bash
 cd /path/to/FORGE
 docker build -t forge:latest .
+# Optional: image with PyTorch / TRL / PEFT for training inside the container
+# docker build -f Dockerfile.train -t forge:train .
 # Log in (opens browser or prompts for password / access token)
 docker login -u sanjay767676
@@ -329,7 +340,7 @@ Use a [Docker Hub access token](https://docs.docker.com/docker-hub/access-tokens
 - **`forge-api`** → FastAPI OpenEnv server on `http://localhost:8000`
 - **`forge-ui`** → Gradio app on `http://localhost:7860`
-Files: [`Dockerfile`](Dockerfile), [`docker-compose.yml`](docker-compose.yml), [`.dockerignore`](.dockerignore)
+Files: [`Dockerfile`](Dockerfile) (slim), [`Dockerfile.train`](Dockerfile.train) (full), [`docker-compose.yml`](docker-compose.yml), [`.dockerignore`](.dockerignore)
 Build and run from this repo: