Spaces: sanjay7676/Team404_FORGE

ksanjuma1234 committed on Apr 25

Commit

fc01d79

1 Parent(s): 8337c4b

Add a new adversarial code generation environment for reinforcement learning

Create the FORGE-v4 Python environment, including core components for agent interaction, code execution, reward calculation, and memory storage.

Replit-Commit-Author: Agent
Replit-Commit-Session-Id: a7518b1f-70c7-4487-82d2-42195935723e
Replit-Commit-Checkpoint-Type: full_checkpoint
Replit-Commit-Event-Id: dbf3c097-a076-4e9c-b916-ee3775367bd1
Replit-Helium-Checkpoint-Created: true

Files changed (16)

.replit +9 -1
FORGE-v4/.gitignore +37 -0
FORGE-v4/README.md +193 -0
FORGE-v4/app.py +95 -0
FORGE-v4/config.py +52 -0
FORGE-v4/env.py +175 -0
FORGE-v4/logs/.gitkeep +0 -0
FORGE-v4/memory.py +135 -0
FORGE-v4/models/.gitkeep +0 -0
FORGE-v4/outputs/.gitkeep +0 -0
FORGE-v4/requirements.txt +40 -0
FORGE-v4/rewards.py +116 -0
FORGE-v4/sandbox.py +108 -0
FORGE-v4/tasks.py +102 -0
FORGE-v4/trainer.py +158 -0
attached_assets/Pasted-Create-a-Python-project-named-FORGE-v4-Build-the-comple_1777105563327.txt +108 -0

.replit CHANGED

@@ -1,4 +1,4 @@
-modules = ["nodejs-24"]
+modules = ["nodejs-24", "python-3.11"]
 [deployment]
 router = "application"
@@ -18,3 +18,11 @@ expertMode = true
 [postMerge]
 path = "scripts/post-merge.sh"
 timeoutMs = 20000
+[[ports]]
+localPort = 8080
+externalPort = 8080
+[[ports]]
+localPort = 8081
+externalPort = 80

FORGE-v4/.gitignore ADDED

	@@ -0,0 +1,37 @@

+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.egg-info/
+dist/
+build/
+.eggs/
+*.whl
+# Virtual environments
+venv/
+.venv/
+env/
+# Data / runtime outputs (keep directories, ignore contents)
+data/*.json
+logs/*.log
+logs/*.jsonl
+models/*
+!models/.gitkeep
+outputs/*
+!outputs/.gitkeep
+# Jupyter / Colab
+.ipynb_checkpoints/
+*.ipynb
+# IDE
+.vscode/
+.idea/
+# OS
+.DS_Store
+Thumbs.db

FORGE-v4/README.md ADDED

	@@ -0,0 +1,193 @@

+# FORGE-v4
+**Adversarial Code Generation Environment for Reinforcement Learning**
+A hackathon project built on an **OpenEnv-style** reinforcement learning framework where two competing agents — a Coder and a Breaker — are trained adversarially on Python sorting tasks.
+---
+## Overview
+FORGE-v4 pits two agents against each other:
+| Agent | Role |
+|-------|------|
+| **Coder** | Writes Python code to solve integer array sorting tasks |
+| **Breaker** | Generates adversarial test cases to expose flaws in the Coder's solution |
+Each episode the Coder earns rewards for passing hidden tests; the Breaker earns rewards for breaking the Coder's solution. A **Coach Memory** module accumulates lessons learned across episodes to guide future training.
+The skeleton is designed to be **drop-in ready for TRL / Unsloth fine-tuning** and **Hugging Face deployment**.
+---
+## Architecture
+```
+┌─────────────────────────────────────────────────┐
+│                   FORGEEnv (env.py)              │
+│                                                  │
+│  ┌──────────────┐        ┌──────────────────┐   │
+│  │  Coder Agent  │        │  Breaker Agent    │   │
+│  │  (policy fn) │        │  (policy fn)      │   │
+│  └──────┬───────┘        └────────┬──────────┘   │
+│         │ code (str)              │ test cases    │
+│         ▼                         ▼               │
+│  ┌──────────────────────────────────────────┐    │
+│  │           Sandbox (sandbox.py)           │    │
+│  │  subprocess · timeout · pass/fail/error  │    │
+│  └──────────────────┬───────────────────────┘    │
+│                     │ results                     │
+│                     ▼                             │
+│  ┌──────────────────────────────────────────┐    │
+│  │         Rewards (rewards.py)             │    │
+│  │  coder_reward() · breaker_reward()       │    │
+│  └──────────────────┬───────────────────────┘    │
+│                     │                             │
+│                     ▼                             │
+│  ┌──────────────────────────────────────────┐    │
+│  │       Coach Memory (memory.py)           │    │
+│  │  JSON-backed · lessons · summary()       │    │
+│  └──────────────────────────────────────────┘    │
+└─────────────────────────────────────────────────┘
+```
+---
+## File Structure
+```
+FORGE-v4/
+├── app.py           # CLI entry point — runs one demo episode
+├── env.py           # FORGEEnv: reset() / step() / get_state()
+├── tasks.py         # Task generator + hidden test sampler
+├── rewards.py       # coder_reward() and breaker_reward()
+├── sandbox.py       # Safe subprocess code execution with timeout
+├── memory.py        # CoachMemory: JSON-backed lessons store
+├── trainer.py       # Training loop + TRL/Unsloth hook placeholders
+├── config.py        # All constants (timeout, rewards, tier thresholds)
+├── requirements.txt # Dependencies
+├── README.md        # This file
+├── data/            # coach_memory.json (auto-created)
+├── logs/            # Episode logs
+├── models/          # Saved model checkpoints
+└── outputs/         # Generated code outputs
+```
+---
+## How to Run
+### 1. Install dependencies
+```bash
+pip install -r requirements.txt
+```
+> **Note:** The core skeleton has minimal dependencies. ML packages (TRL, Unsloth, PyTorch) are commented out in `requirements.txt` — uncomment them when adding LLM training.
+### 2. Run a demo episode
+```bash
+python app.py
+```
+This runs a single episode with placeholder Coder and Breaker policies (the Coder always uses `sorted()`, the Breaker sends fixed edge cases). You should see per-step reward output and a coach memory summary.
+### 3. Optional: override step count
+```bash
+python app.py --steps 3
+```
+---
+## Configuration
+Edit `config.py` to adjust environment constants:
+| Constant | Default | Description |
+|----------|---------|-------------|
+| `SANDBOX_TIMEOUT_SECONDS` | `5` | Max execution time per code run |
+| `MAX_ARRAY_SIZE` | `20` | Largest generated array |
+| `NUM_HIDDEN_TESTS` | `5` | Hidden test cases per task |
+| `CODER_PASS_REWARD` | `1.0` | Reward per passing test |
+| `BREAKER_BREAK_REWARD` | `1.0` | Reward per test that breaks coder |
+| `MAX_EPISODES` | `100` | Default training episode count |
+---
+## Extending with LLM Agents
+Replace the placeholder policies in `trainer.py`:
+```python
+# trainer.py
+def my_coder_policy(state: dict) -> str:
+    prompt = state["task_prompt"]
+    # Call your LLM here (TRL model, OpenAI API, Unsloth, etc.)
+    return generated_code
+def my_breaker_policy(state: dict) -> list[dict]:
+    prompt = state["task_prompt"]
+    # Call your adversarial LLM here
+    return [{"input": arr} for arr in generated_arrays]
+```
+Then run:
+```python
+from trainer import train
+summary = train(
+    coder_policy=my_coder_policy,
+    breaker_policy=my_breaker_policy,
+    num_episodes=50,
+)
+```
+---
+## TRL / Unsloth Integration (Future)
+Hook points are prepared in `trainer.py`:
+- `_on_episode_end()` — plug in `PPOTrainer.step()` or `GRPOTrainer` updates
+- `_on_step_end()` — plug in per-step reward logging (W&B, TensorBoard)
+```python
+# Example (uncomment in trainer.py after installing TRL):
+# from trl import PPOTrainer, PPOConfig
+# trainer = PPOTrainer(config=PPOConfig(...), model=model, ...)
+# trainer.step(queries, responses, rewards)
+```
+---
+## Google Colab
+1. Clone or upload the project to Colab.
+2. Install Unsloth:
+   ```
+   !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+   ```
+3. Mount Drive and set `MEMORY_FILE` / `MODELS_DIR` in `config.py` to paths under `/content/drive/MyDrive/`.
+4. Run `python app.py` or import and call `train()` directly.
+---
+## Hugging Face Deployment
+After training, push your model with:
+```python
+model.push_to_hub("your-username/forge-v4-coder")
+tokenizer.push_to_hub("your-username/forge-v4-coder")
+```
+The repo structure (`models/`, `outputs/`) maps directly to HF Hub conventions.
+---
+## License
+MIT
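
Reviewer note on the README: it describes the demo's placeholder policies (the Coder always uses `sorted()`, the Breaker sends fixed edge cases) and the policy signatures, but `trainer.py` is not shown in this part of the diff. The sketch below is therefore only a guess at what such policies could look like. The function names `sketch_coder_policy` and `sketch_breaker_policy` and the specific edge cases are assumptions, not code from the commit.

```python
# Hypothetical placeholder policies matching the signatures shown in the README.
# Names and edge cases are illustrative assumptions, not taken from trainer.py.

def sketch_coder_policy(state: dict) -> str:
    """Return Python source that defines solution(arr), ignoring the state."""
    return "def solution(arr):\n    return sorted(arr)\n"


def sketch_breaker_policy(state: dict) -> list[dict]:
    """Return a fixed list of edge-case inputs in the {"input": [...]} format."""
    edge_cases = [[], [1], [2, 2, 2], [3, 2, 1], [-5, 0, 5]]
    return [{"input": arr} for arr in edge_cases]


# Build the action dict in the shape FORGEEnv.step() documents.
action = {
    "coder_code": sketch_coder_policy({}),
    "breaker_tests": sketch_breaker_policy({}),
}
```

A Breaker that only sends inputs a `sorted()`-based Coder handles correctly would, under the reward constants in `config.py`, collect the per-test penalty on every test, which is the expected outcome for a placeholder.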

FORGE-v4/app.py ADDED

	@@ -0,0 +1,95 @@

+# app.py
+# Main runner script for FORGE-v4.
+# Runs a minimal CLI demo of one sample episode.
+import sys
+import json
+from env import FORGEEnv
+from memory import CoachMemory
+from trainer import default_coder_policy, default_breaker_policy
+from config import STEPS_PER_EPISODE
+def run_demo_episode() -> None:
+    """
+    Execute a single demo episode and print the results to stdout.
+    """
+    print("=" * 60)
+    print("  FORGE-v4  |  Adversarial Code Generation Environment")
+    print("=" * 60)
+    # Initialise coach memory and environment
+    memory = CoachMemory()
+    env = FORGEEnv(memory=memory)
+    # Reset to start the episode
+    state = env.reset()
+    print(f"\n[Episode {state['episode']}]  Task prompt:\n")
+    print(state["task_prompt"])
+    print()
+    for step in range(1, STEPS_PER_EPISODE + 1):
+        print(f"── Step {step}/{STEPS_PER_EPISODE} " + "─" * 40)
+        # Agents produce their actions (placeholder policies for the demo)
+        coder_code    = default_coder_policy(state)
+        breaker_tests = default_breaker_policy(state)
+        action = {
+            "coder_code":    coder_code,
+            "breaker_tests": breaker_tests,
+        }
+        result = env.step(action)
+        cr = result["coder_reward"]
+        br = result["breaker_reward"]
+        print(
+            f"  Coder   → pass_rate: {cr['pass_rate']:.2f}  "
+            f"| passes: {cr['pass_count']}  "
+            f"| fails: {cr['fail_count']}  "
+            f"| errors: {cr['error_count']}  "
+            f"| reward: {cr['total_reward']:+.2f}"
+        )
+        print(
+            f"  Breaker → break_rate: {br['break_rate']:.2f}  "
+            f"| breaks: {br['breaks']}  "
+            f"| passes: {br['passes']}  "
+            f"| reward: {br['total_reward']:+.2f}"
+        )
+        if result["done"]:
+            break
+    print("\n" + "=" * 60)
+    print("  Episode complete.  Coach memory summary:")
+    print(json.dumps(memory.summary(), indent=2))
+    print("=" * 60)
+def main() -> None:
+    """Entry point — parse minimal CLI args and run."""
+    args = sys.argv[1:]
+    if "--help" in args or "-h" in args:
+        print("Usage: python app.py [--steps N]")
+        print("  --steps N   Override STEPS_PER_EPISODE for this run (default: from config.py)")
+        sys.exit(0)
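+    # Optional: override step count via CLI
+    if "--steps" in args:
+        idx = args.index("--steps")
+        try:
+            steps = int(args[idx + 1])
+        except (IndexError, ValueError):
+            print("Error: --steps requires an integer argument.")
+            sys.exit(1)
+        # `from config import STEPS_PER_EPISODE` copies the value into each
+        # importing module, so rebind it here and in env as well as in config.
+        import config
+        import env
+        global STEPS_PER_EPISODE
+        config.STEPS_PER_EPISODE = env.STEPS_PER_EPISODE = STEPS_PER_EPISODE = steps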
+    run_demo_episode()
+if __name__ == "__main__":
+    main()
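
Reviewer note on `--steps`: both `app.py` and `env.py` use `from config import STEPS_PER_EPISODE`, so each module holds its own binding of the value. The sketch below uses a throwaway in-memory module named `fake_config` (a hypothetical stand-in, not part of the project) to show why assigning to the module attribute afterwards does not change a name that was already imported.

```python
import sys
import types

# Build a stand-in for config.py in memory (hypothetical module name).
fake_config = types.ModuleType("fake_config")
fake_config.STEPS_PER_EPISODE = 10
sys.modules["fake_config"] = fake_config

# Same import style as app.py and env.py: binds the current value to a new name.
from fake_config import STEPS_PER_EPISODE

# Rebinding the attribute on the module object afterwards...
fake_config.STEPS_PER_EPISODE = 3

# ...is visible through the module, but not through the earlier from-import.
snapshot = STEPS_PER_EPISODE            # value captured at import time
live = fake_config.STEPS_PER_EPISODE    # value read through the module

print(f"from-import name: {snapshot}, module attribute: {live}")
```

Reading the constant as `config.STEPS_PER_EPISODE` at the point of use, or passing the step count as an argument, are two common ways to make a runtime override take effect everywhere.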

FORGE-v4/config.py ADDED

	@@ -0,0 +1,52 @@

+# config.py
+# Central configuration constants for FORGE-v4
+# ──────────────────────────────────────────────
+# Sandbox settings
+# ──────────────────────────────────────────────
+SANDBOX_TIMEOUT_SECONDS = 5          # Max time allowed for code execution
+SANDBOX_MAX_OUTPUT_CHARS = 4096      # Truncate stdout/stderr beyond this length
+# ──────────────────────────────────────────────
+# Task / environment settings
+# ──────────────────────────────────────────────
+MAX_ARRAY_SIZE = 20                  # Max length of generated integer arrays
+MIN_ARRAY_SIZE = 3                   # Min length of generated integer arrays
+ARRAY_VALUE_RANGE = (-100, 100)      # (min, max) integers in generated arrays
+NUM_HIDDEN_TESTS = 5                 # Number of hidden test cases per task
+# ──────────────────────────────────────────────
+# Reward settings
+# ──────────────────────────────────────────────
+# Coder reward weights
+CODER_PASS_REWARD = 1.0              # Reward per passing hidden test
+CODER_FAIL_PENALTY = -0.5            # Penalty per failing hidden test
+CODER_ERROR_PENALTY = -1.0           # Penalty when code raises an error
+# Breaker reward weights
+BREAKER_BREAK_REWARD = 1.0           # Reward when breaker's test breaks coder
+BREAKER_FAIL_PENALTY = -0.3          # Penalty when breaker's test does NOT break coder
+# ──────────────────────────────────────────────
+# Tier thresholds (coder skill levels)
+# ──────────────────────────────────────────────
+TIER_THRESHOLDS = {
+    "novice":       (0.0,  0.4),     # pass-rate range [low, high)
+    "intermediate": (0.4,  0.7),
+    "advanced":     (0.7,  0.9),
+    "expert":       (0.9,  1.01),
+}
+# ──────────────────────────────────────────────
+# Memory / logging
+# ──────────────────────────────────────────────
+MEMORY_FILE = "data/coach_memory.json"   # Persistent memory path
+LOG_DIR = "logs/"                        # Directory for episode logs
+MODELS_DIR = "models/"                   # Saved model checkpoints
+OUTPUTS_DIR = "outputs/"                 # Generated code outputs
+# ──────────────────────────────────────────────
+# Training placeholders
+# ──────────────────────────────────────────────
+MAX_EPISODES = 100                   # Default training episode count
+STEPS_PER_EPISODE = 10               # Steps per episode
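
Reviewer note on `TIER_THRESHOLDS`: the visible files define the tier ranges but do not show a lookup. A minimal sketch of how a pass rate could be mapped to a tier, assuming the half-open `[low, high)` intervals stated in the comment, follows. The helper name `tier_for` is hypothetical.

```python
# Copied from config.py in this commit; each range is [low, high).
TIER_THRESHOLDS = {
    "novice":       (0.0, 0.4),
    "intermediate": (0.4, 0.7),
    "advanced":     (0.7, 0.9),
    "expert":       (0.9, 1.01),
}


def tier_for(pass_rate: float) -> str:
    """Return the tier whose [low, high) range contains pass_rate (hypothetical helper)."""
    for name, (low, high) in TIER_THRESHOLDS.items():
        if low <= pass_rate < high:
            return name
    raise ValueError(f"pass_rate {pass_rate!r} is outside every tier range")


tiers = [tier_for(r) for r in (0.0, 0.4, 0.7, 0.9, 1.0)]
```

The upper bound of `1.01` on the expert tier is what lets a perfect pass rate of `1.0` fall inside a half-open interval.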

FORGE-v4/env.py ADDED

	@@ -0,0 +1,175 @@

+# env.py
+# Main OpenEnv-style reinforcement learning environment for FORGE-v4.
+# Manages the interaction between the Coder Agent, Breaker Agent, and Sandbox.
+from typing import Any
+from tasks import generate_task, generate_breaker_task
+from sandbox import run_code_against_tests
+from rewards import coder_reward, breaker_reward
+from memory import CoachMemory
+from config import STEPS_PER_EPISODE
+class FORGEEnv:
+    """
+    Two-agent adversarial environment for code generation tasks.
+    Agents:
+        - Coder:   writes Python code to solve array-sorting tasks.
+        - Breaker: generates adversarial test cases to break the Coder's solution.
+    Episode flow:
+        1. reset()           → returns the initial task state
+        2. step(action)      × STEPS_PER_EPISODE steps
+        3. Rewards assigned to both agents at each step
+    Action format:
+        {
+            "coder_code":        str | None,   # Python source defining solution(arr)
+            "breaker_tests":     list | None,  # List of {"input": [...]} dicts
+        }
+    """
+    def __init__(self, memory: CoachMemory | None = None):
+        self.memory = memory or CoachMemory()
+        self.episode: int = 0
+        self.step_count: int = 0
+        self.current_task: dict[str, Any] = {}
+        self.done: bool = True
+        self._last_coder_code: str = ""
+        self._last_coder_pass_rate: float = 0.0
+    # ──────────────────────────────────────────────
+    # Core env methods
+    # ──────────────────────────────────────────────
+    def reset(self) -> dict[str, Any]:
+        """
+        Start a new episode.
+        Returns:
+            Initial state dict containing the task prompt and public example.
+        """
+        self.episode += 1
+        self.step_count = 0
+        self.done = False
+        self._last_coder_code = ""
+        self._last_coder_pass_rate = 0.0
+        self.current_task = generate_task()
+        state = self.get_state()
+        return state
+    def step(self, action: dict[str, Any]) -> dict[str, Any]:
+        """
+        Advance the environment by one step.
+        Args:
+            action: dict with optional keys:
+                "coder_code"    – Python source defining solution(arr)
+                "breaker_tests" – list of {"input": [...]} dicts
+        Returns:
+            {
+                "state":          current env state,
+                "coder_reward":   coder reward info dict,
+                "breaker_reward": breaker reward info dict,
+                "done":           bool (True when episode ends),
+                "info":           extra diagnostics,
+            }
+        """
+        if self.done:
+            raise RuntimeError("Episode is done. Call reset() before step().")
+        self.step_count += 1
+        coder_code    = action.get("coder_code", "")
+        breaker_tests = action.get("breaker_tests", [])
+        # ── Evaluate Coder ────────────────────────────────────────────────
+        coder_info = self._evaluate_coder(coder_code)
+        # ── Evaluate Breaker ──────────────────────────────────────────────
+        breaker_info = self._evaluate_breaker(coder_code, breaker_tests, coder_info)
+        # ── Log to Coach Memory ───────────────────────────────────────────
+        self.memory.add_lesson(
+            episode=self.episode,
+            agent="env",
+            observation=(
+                f"Step {self.step_count}: "
+                f"coder pass_rate={coder_info['pass_rate']:.2f}, "
+                f"breaker break_rate={breaker_info['break_rate']:.2f}"
+            ),
+            coder_reward=coder_info["total_reward"],
+            breaker_reward=breaker_info["total_reward"],
+            extra={
+                "step": self.step_count,
+                "coder_pass_rate": coder_info["pass_rate"],
+                "breaker_break_rate": breaker_info["break_rate"],
+            },
+        )
+        # ── Check done ────────────────────────────────────────────────────
+        if self.step_count >= STEPS_PER_EPISODE:
+            self.done = True
+        return {
+            "state":          self.get_state(),
+            "coder_reward":   coder_info,
+            "breaker_reward": breaker_info,
+            "done":           self.done,
+            "info": {
+                "episode":    self.episode,
+                "step":       self.step_count,
+            },
+        }
+    def get_state(self) -> dict[str, Any]:
+        """
+        Return the current observable state of the environment.
+        """
+        return {
+            "episode":        self.episode,
+            "step":           self.step_count,
+            "done":           self.done,
+            "task_prompt":    self.current_task.get("prompt", ""),
+            "public_example": self.current_task.get("public_example", {}),
+            "last_pass_rate": self._last_coder_pass_rate,
+        }
+    # ──────────────────────────────────────────────
+    # Private helpers
+    # ──────────────────────────────────────────────
+    def _evaluate_coder(self, code: str) -> dict[str, Any]:
+        """Run the coder's code against hidden tests and compute reward."""
+        hidden_tests = self.current_task.get("hidden_tests", [])
+        if not code or not hidden_tests:
+            # No code submitted, or the task has no hidden tests: score every test as an error
+            dummy_results = [{"status": "error"} for _ in hidden_tests or [{}]]
+            info = coder_reward(dummy_results)
+        else:
+            results = run_code_against_tests(code, hidden_tests)
+            info = coder_reward(results)
+        # Cache for Breaker quality multiplier
+        self._last_coder_code = code
+        self._last_coder_pass_rate = info["pass_rate"]
+        return info
+    def _evaluate_breaker(
+        self,
+        coder_code: str,
+        breaker_tests: list[dict[str, Any]],
+        coder_info: dict[str, Any],
+    ) -> dict[str, Any]:
+        """Run the coder's code against the breaker's adversarial tests."""
+        if not coder_code or not breaker_tests:
+            # No submission from one of the agents
+            dummy = [{"status": "pass"} for _ in breaker_tests or [{}]]
+            return breaker_reward(dummy, coder_base_pass_rate=coder_info["pass_rate"])
+        results = run_code_against_tests(coder_code, breaker_tests)
+        return breaker_reward(results, coder_base_pass_rate=coder_info["pass_rate"])
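
Reviewer note on the episode contract: `FORGEEnv` starts with `done = True`, `reset()` clears it, and `step()` raises `RuntimeError` once the step budget is used up. The sketch below reproduces only that control flow in a stand-in class so it can be run without the rest of the project. `TinyEnv` and `run_episode` are hypothetical names, and the stand-in does no sandboxing or reward calculation.

```python
from typing import Any


class TinyEnv:
    """Stand-in with the same reset()/step()/done contract as FORGEEnv (no scoring)."""

    def __init__(self, steps_per_episode: int = 2):
        self.steps_per_episode = steps_per_episode
        self.step_count = 0
        self.done = True  # like FORGEEnv, unusable until reset() is called

    def reset(self) -> dict[str, Any]:
        self.step_count = 0
        self.done = False
        return {"step": self.step_count, "done": self.done}

    def step(self, action: dict[str, Any]) -> dict[str, Any]:
        if self.done:
            raise RuntimeError("Episode is done. Call reset() before step().")
        self.step_count += 1
        self.done = self.step_count >= self.steps_per_episode
        return {"state": {"step": self.step_count, "done": self.done}, "done": self.done}


def run_episode(env: TinyEnv) -> int:
    """Drive one episode to completion and return the number of steps taken."""
    env.reset()
    steps = 0
    while True:
        action = {
            "coder_code": "def solution(arr):\n    return sorted(arr)\n",
            "breaker_tests": [{"input": [3, 1, 2]}],
        }
        result = env.step(action)
        steps += 1
        if result["done"]:
            return steps
```

A caller that loops on `result["done"]` rather than on a fixed count stays correct if the step budget changes.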

FORGE-v4/logs/.gitkeep ADDED

File without changes

FORGE-v4/memory.py ADDED

	@@ -0,0 +1,135 @@

+# memory.py
+# Coach Memory system for FORGE-v4.
+# Stores lessons learned across episodes in a JSON file.
+import json
+import os
+from datetime import datetime
+from typing import Any
+from config import MEMORY_FILE
+class CoachMemory:
+    """
+    Persistent memory that accumulates lessons learned across training episodes.
+    Lessons are stored as a list of dicts in a JSON file and loaded on startup.
+    """
+    def __init__(self, filepath: str = MEMORY_FILE):
+        self.filepath = filepath
+        self.lessons: list[dict[str, Any]] = []
+        self._ensure_data_dir()
+        self.load()
+    # ──────────────────────────────────────────────
+    # Public API
+    # ──────────────────────────────────────────────
+    def add_lesson(
+        self,
+        episode: int,
+        agent: str,
+        observation: str,
+        coder_reward: float,
+        breaker_reward: float,
+        extra: dict[str, Any] | None = None,
+    ) -> None:
+        """
+        Record a lesson from one episode step.
+        Args:
+            episode:        Episode index.
+            agent:          "coder" | "breaker" | "env".
+            observation:    Human-readable description of what happened.
+            coder_reward:   Total coder reward for this step.
+            breaker_reward: Total breaker reward for this step.
+            extra:          Optional additional metadata.
+        """
+        lesson = {
+            "timestamp":      datetime.utcnow().isoformat(),
+            "episode":        episode,
+            "agent":          agent,
+            "observation":    observation,
+            "coder_reward":   coder_reward,
+            "breaker_reward": breaker_reward,
+        }
+        if extra:
+            lesson["extra"] = extra
+        self.lessons.append(lesson)
+        self.save()
+    def get_lessons(self, agent: str | None = None, last_n: int | None = None) -> list[dict[str, Any]]:
+        """
+        Retrieve stored lessons, optionally filtered by agent and/or limited to the last N.
+        Args:
+            agent:  Filter to a specific agent ("coder", "breaker", "env"), or None for all.
+            last_n: Return only the last N lessons if provided.
+        Returns:
+            List of lesson dicts.
+        """
+        result = self.lessons
+        if agent is not None:
+            result = [l for l in result if l.get("agent") == agent]
+        if last_n is not None:
+            result = result[-last_n:]
+        return result
+    def summary(self) -> dict[str, Any]:
+        """
+        Return a high-level summary of stored lessons.
+        """
+        if not self.lessons:
+            return {"total_lessons": 0, "episodes_seen": 0}
+        episodes = {l["episode"] for l in self.lessons}
+        coder_rewards = [l["coder_reward"] for l in self.lessons]
+        breaker_rewards = [l["breaker_reward"] for l in self.lessons]
+        return {
+            "total_lessons":      len(self.lessons),
+            "episodes_seen":      len(episodes),
+            "avg_coder_reward":   round(sum(coder_rewards) / len(coder_rewards), 4),
+            "avg_breaker_reward": round(sum(breaker_rewards) / len(breaker_rewards), 4),
+        }
+    def clear(self) -> None:
+        """
+        Wipe all stored lessons (use with caution).
+        """
+        self.lessons = []
+        self.save()
+    # ──────────────────────────────────────────────
+    # Persistence helpers
+    # ──────────────────────────────────────────────
+    def save(self) -> None:
+        """Persist lessons to JSON file."""
+        with open(self.filepath, "w", encoding="utf-8") as f:
+            json.dump(self.lessons, f, indent=2)
+    def load(self) -> None:
+        """Load lessons from JSON file if it exists."""
+        if os.path.exists(self.filepath):
+            try:
+                with open(self.filepath, "r", encoding="utf-8") as f:
+                    self.lessons = json.load(f)
+            except (json.JSONDecodeError, IOError):
+                # Start fresh if file is corrupted
+                self.lessons = []
+        else:
+            self.lessons = []
+    # ──────────────────────────────────────────────
+    # Internal helpers
+    # ──────────────────────────────────────────────
+    def _ensure_data_dir(self) -> None:
+        """Create the directory for the memory file if it doesn't exist."""
+        directory = os.path.dirname(self.filepath)
+        if directory:
+            os.makedirs(directory, exist_ok=True)
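
Reviewer note on persistence: `save()` rewrites the whole JSON file on every `add_lesson()` call, and `load()` falls back to an empty list if the file fails to parse, so an interrupted write could lose earlier lessons. One possible hardening, not part of this commit, is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The helper name `atomic_write_json` is hypothetical.

```python
import json
import os
import tempfile
from typing import Any


def atomic_write_json(path: str, payload: Any) -> None:
    """Write payload as JSON to path via a temp file plus os.replace (hypothetical helper)."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
        os.replace(tmp_path, path)  # swaps the complete file into place
    except BaseException:
        # Clean up the partial temp file and leave any existing target untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach a reader sees either the previous complete file or the new complete file, at the cost of one extra file creation per save.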

FORGE-v4/models/.gitkeep ADDED

File without changes

FORGE-v4/outputs/.gitkeep ADDED

File without changes

FORGE-v4/requirements.txt ADDED

	@@ -0,0 +1,40 @@

+# FORGE-v4 requirements
+# Core environment — no heavy ML deps needed to run the skeleton
+# Uncomment TRL / Unsloth blocks when adding LLM training.
+# ──────────────────────────────────────────────
+# Standard library extensions
+# ──────────────────────────────────────────────
+tqdm>=4.66.0          # Progress bars for training loops
+# ──────────────────────────────────────────────
+# Data / logging utilities
+# ──────────────────────────────────────────────
+numpy>=1.26.0         # Array math utilities
+pandas>=2.2.0         # Episode log analysis
+# ──────────────────────────────────────────────
+# Experiment tracking (optional but recommended)
+# ──────────────────────────────────────────────
+# wandb>=0.17.0       # Weights & Biases — uncomment to enable
+# ──────────────────────────────────────────────
+# LLM / RL training (future integration)
+# ──────────────────────────────────────────────
+# torch>=2.3.0
+# transformers>=4.41.0
+# trl>=0.9.0          # TRL PPO / GRPO trainer
+# datasets>=2.19.0    # Hugging Face Datasets
+# accelerate>=0.30.0  # Multi-GPU / mixed precision
+# ──────────────────────────────────────────────
+# Unsloth (Google Colab / fast fine-tuning)
+# ──────────────────────────────────────────────
+# unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
+# Install separately in Colab:
+#   !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+# ──────────────────────────────────────────────
+# Hugging Face Hub (model push / pull)
+# ──────────────────────────────────────────────
+# huggingface_hub>=0.23.0

FORGE-v4/rewards.py ADDED

	@@ -0,0 +1,116 @@

+# rewards.py
+# Reward functions for the Coder Agent and the Breaker Agent in FORGE-v4.
+from typing import Any
+from config import (
+    CODER_PASS_REWARD,
+    CODER_FAIL_PENALTY,
+    CODER_ERROR_PENALTY,
+    BREAKER_BREAK_REWARD,
+    BREAKER_FAIL_PENALTY,
+)
+def coder_reward(test_results: list[dict[str, Any]]) -> dict[str, Any]:
+    """
+    Compute the Coder agent's reward from sandbox test results.
+    Args:
+        test_results: list of result dicts from sandbox.run_code_against_tests().
+            Each dict has a "status" key: "pass" | "fail" | "error" | "timeout".
+    Returns:
+        {
+            "total_reward": float,
+            "pass_count":   int,
+            "fail_count":   int,
+            "error_count":  int,
+            "pass_rate":    float,   # fraction of tests passed
+            "breakdown":    list of per-test reward floats,
+        }
+    """
+    breakdown = []
+    pass_count = fail_count = error_count = 0
+    for r in test_results:
+        status = r.get("status", "error")
+        if status == "pass":
+            breakdown.append(CODER_PASS_REWARD)
+            pass_count += 1
+        elif status in ("error", "timeout"):
+            breakdown.append(CODER_ERROR_PENALTY)
+            error_count += 1
+        else:  # "fail"
+            breakdown.append(CODER_FAIL_PENALTY)
+            fail_count += 1
+    total = sum(breakdown)
+    n = len(test_results)
+    pass_rate = pass_count / n if n > 0 else 0.0
+    return {
+        "total_reward": round(total, 4),
+        "pass_count":   pass_count,
+        "fail_count":   fail_count,
+        "error_count":  error_count,
+        "pass_rate":    round(pass_rate, 4),
+        "breakdown":    breakdown,
+    }
+def breaker_reward(
+    adversarial_results: list[dict[str, Any]],
+    coder_base_pass_rate: float,
+) -> dict[str, Any]:
+    """
+    Compute the Breaker agent's reward.
+    The Breaker earns credit for tests that break the coder (non-pass outcomes).
+    It is penalised for tests that the coder still passes, because those tests
+    are not adversarial enough.
+    Args:
+        adversarial_results: results when the coder's code is run against the
+                             Breaker's adversarial test cases.
+        coder_base_pass_rate: the coder's pass-rate on the standard hidden tests
+                              (used to scale the Breaker's reward — breaking a
+                              strong coder is worth more).
+    Returns:
+        {
+            "total_reward": float,
+            "breaks":       int,   # number of tests that broke the coder
+            "passes":       int,   # number of tests the coder still passed
+            "break_rate":   float,
+            "breakdown":    list of per-test reward floats,
+        }
+    """
+    breakdown = []
+    breaks = passes = 0
+    # A higher-quality coder means a bigger multiplier for breaking them
+    quality_multiplier = max(1.0, 1.0 + coder_base_pass_rate)
+    for r in adversarial_results:
+        status = r.get("status", "error")
+        if status != "pass":
+            # Breaker successfully broke the coder
+            reward = BREAKER_BREAK_REWARD * quality_multiplier
+            breakdown.append(round(reward, 4))
+            breaks += 1
+        else:
+            # Coder survived — penalise the Breaker
+            breakdown.append(BREAKER_FAIL_PENALTY)
+            passes += 1
+    total = sum(breakdown)
+    n = len(adversarial_results)
+    break_rate = breaks / n if n > 0 else 0.0
+    return {
+        "total_reward": round(total, 4),
+        "breaks":       breaks,
+        "passes":       passes,
+        "break_rate":   round(break_rate, 4),
+        "breakdown":    breakdown,
+    }
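
Reviewer note on the reward arithmetic: a worked example may help when checking the constants. The sketch below re-implements only the totals from `coder_reward()` and `breaker_reward()` with the default values from `config.py`. The function names `coder_total` and `breaker_total` and the sample status lists are illustrative assumptions.

```python
# Default constants copied from config.py in this commit.
CODER_PASS_REWARD = 1.0
CODER_FAIL_PENALTY = -0.5
CODER_ERROR_PENALTY = -1.0
BREAKER_BREAK_REWARD = 1.0
BREAKER_FAIL_PENALTY = -0.3


def coder_total(statuses: list[str]) -> float:
    """Sum per-test coder rewards; "error" and "timeout" share the error penalty."""
    total = 0.0
    for status in statuses:
        if status == "pass":
            total += CODER_PASS_REWARD
        elif status in ("error", "timeout"):
            total += CODER_ERROR_PENALTY
        else:  # anything else is scored as a failure, as in coder_reward()
            total += CODER_FAIL_PENALTY
    return round(total, 4)


def breaker_total(statuses: list[str], coder_base_pass_rate: float) -> float:
    """Sum per-test breaker rewards, scaling breaks by the coder's base pass rate."""
    multiplier = max(1.0, 1.0 + coder_base_pass_rate)
    total = 0.0
    for status in statuses:
        if status != "pass":
            total += round(BREAKER_BREAK_REWARD * multiplier, 4)
        else:
            total += BREAKER_FAIL_PENALTY
    return round(total, 4)


# Coder: 1.0 + 1.0 - 0.5 - 1.0 - 1.0
coder_example = coder_total(["pass", "pass", "fail", "error", "timeout"])

# Breaker against a coder with base pass rate 0.8: multiplier 1.8, so 1.8 - 0.3 + 1.8
breaker_example = breaker_total(["fail", "pass", "error"], coder_base_pass_rate=0.8)
```

Because the multiplier is at least 1.0 and at most 2.0 for pass rates in `[0, 1]`, a single break against a perfect coder is worth 2.0 with the default constants.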

FORGE-v4/sandbox.py ADDED

	@@ -0,0 +1,108 @@

+# sandbox.py
+# Execute agent-generated Python code in a separate subprocess with a timeout.
+# Note: the subprocess is not otherwise restricted (no filesystem or network isolation).
+# Returns structured pass/fail/error results with timeout handling.
+import subprocess
+import sys
+import textwrap
+import json
+import os
+from typing import Any
+from config import SANDBOX_TIMEOUT_SECONDS, SANDBOX_MAX_OUTPUT_CHARS
+def run_code(code: str, test_input: list[int]) -> dict[str, Any]:
+    """
+    Execute agent-generated code against a single test input.
+    Args:
+        code: Python source code that defines a `solution(arr)` function.
+        test_input: The integer list to pass to `solution`.
+    Returns:
+        {
+            "status":   "pass" | "fail" | "error" | "timeout",
+            "output":   the value returned by solution(arr), or None,
+            "expected": sorted(test_input),
+            "error_msg": exception string if status is "error", else "",
+        }
+    """
+    expected = sorted(test_input)
+    # Build a self-contained runner script
+    runner = textwrap.dedent(f"""
+import json, sys
+{code}
+test_input = {test_input!r}
+expected   = {expected!r}
+try:
+    result = solution(test_input)
+    if result == expected:
+        print(json.dumps({{"status": "pass", "output": result, "expected": expected, "error_msg": ""}}))
+    else:
+        print(json.dumps({{"status": "fail", "output": result, "expected": expected, "error_msg": ""}}))
+except Exception as exc:
+    print(json.dumps({{"status": "error", "output": None, "expected": expected, "error_msg": str(exc)}}))
+""")
+    try:
+        proc = subprocess.run(
+            [sys.executable, "-c", runner],
+            capture_output=True,
+            text=True,
+            timeout=SANDBOX_TIMEOUT_SECONDS,
+        )
+        raw = proc.stdout.strip()
+        if raw:
+            # Parse only the last stdout line, so prints from agent code
+            # cannot corrupt the JSON result emitted by the runner.
+            raw = raw.splitlines()[-1]
+            result = json.loads(raw)
+        else:
+            # No stdout — treat stderr as the error message
+            err = proc.stderr.strip()[:SANDBOX_MAX_OUTPUT_CHARS]
+            result = {
+                "status": "error",
+                "output": None,
+                "expected": expected,
+                "error_msg": err or "No output produced.",
+            }
+    except subprocess.TimeoutExpired:
+        result = {
+            "status": "timeout",
+            "output": None,
+            "expected": expected,
+            "error_msg": f"Code exceeded {SANDBOX_TIMEOUT_SECONDS}s timeout.",
+        }
+    except json.JSONDecodeError as exc:
+        result = {
+            "status": "error",
+            "output": None,
+            "expected": expected,
+            "error_msg": f"JSON decode error: {exc}  raw='{raw}'",
+        }
+    return result
+def run_code_against_tests(code: str, tests: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    """
+    Run agent code against a list of test cases.
+    Args:
+        code: Python source defining `solution(arr)`.
+        tests: list of {"input": [...], "expected_output": [...]} dicts.
+    Returns:
+        List of result dicts, one per test.
+    """
+    results = []
+    for test in tests:
+        result = run_code(code, test["input"])
+        results.append(result)
+    return results
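
The subprocess-with-timeout pattern used by `run_code` can be tried in isolation. The following is a minimal sketch, not the project's implementation: `run_solution` is a hypothetical stand-in that takes the timeout as a parameter instead of reading `config.py`, and it reads only the last stdout line so that prints from the submitted code do not break JSON parsing.

```python
import json
import subprocess
import sys
from typing import Any


def run_solution(code: str, test_input: list[int], timeout: float = 2.0) -> dict[str, Any]:
    """Hypothetical stand-in for sandbox.run_code with an explicit timeout."""
    expected = sorted(test_input)
    # The runner defines solution(), calls it, and prints the result as JSON.
    runner = (
        f"{code}\n"
        "import json\n"
        f"result = solution({test_input!r})\n"
        "print(json.dumps(result))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", runner],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": None, "expected": expected, "error_msg": "timeout"}

    lines = proc.stdout.strip().splitlines()
    if proc.returncode != 0 or not lines:
        # The last stderr line is usually the exception type and message.
        err_lines = proc.stderr.strip().splitlines()
        err = err_lines[-1] if err_lines else "No output produced."
        return {"status": "error", "output": None, "expected": expected, "error_msg": err}

    try:
        # Only the last line is the runner's JSON; earlier lines may be agent prints.
        output = json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        return {"status": "error", "output": None, "expected": expected, "error_msg": str(exc)}

    status = "pass" if output == expected else "fail"
    return {"status": status, "output": output, "expected": expected, "error_msg": ""}


if __name__ == "__main__":
    good = "def solution(arr):\n    return sorted(arr)\n"
    print(run_solution(good, [3, 1, 2])["status"])
```

A process boundary and a timeout stop runaway loops, but they do not restrict file or network access, so untrusted code would still need a stronger isolation layer.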

FORGE-v4/tasks.py ADDED

	@@ -0,0 +1,102 @@

+# tasks.py
+# Generates integer array sorting tasks and hidden test cases for FORGE-v4.
+import random
+from typing import Any
+from config import MAX_ARRAY_SIZE, MIN_ARRAY_SIZE, ARRAY_VALUE_RANGE, NUM_HIDDEN_TESTS
+def generate_task() -> dict[str, Any]:
+    """
+    Generate a single sorting task.
+    Returns a dict with:
+        - prompt: natural-language task description
+        - public_example: one visible {"input", "expected_output"} dict
+        - hidden_tests: list of {"input", "expected_output"} dicts kept secret from agents
+    """
+    size = random.randint(MIN_ARRAY_SIZE, MAX_ARRAY_SIZE)
+    arr = [random.randint(*ARRAY_VALUE_RANGE) for _ in range(size)]
+    public_example = {
+        "input": arr,
+        "expected_output": sorted(arr),
+    }
+    hidden_tests = _generate_hidden_tests(NUM_HIDDEN_TESTS)
+    task = {
+        "prompt": (
+            "Write a Python function named `solution(arr)` that takes a list of integers "
+            "and returns a new list sorted in ascending order. "
+            "Do not use `arr.sort()` in-place — return a new sorted list.\n\n"
+            f"Example:\n  Input:  {arr}\n  Output: {sorted(arr)}"
+        ),
+        "public_example": public_example,
+        "hidden_tests": hidden_tests,
+    }
+    return task
+def _generate_hidden_tests(n: int) -> list[dict[str, Any]]:
+    """
+    Generate hidden tests: n - 3 random arrays plus three edge-case arrays (at least 3 tests total).
+    """
+    tests = []
+    # Standard random arrays
+    for _ in range(n - 3):
+        size = random.randint(MIN_ARRAY_SIZE, MAX_ARRAY_SIZE)
+        arr = [random.randint(*ARRAY_VALUE_RANGE) for _ in range(size)]
+        tests.append({"input": arr, "expected_output": sorted(arr)})
+    # Edge case: already-sorted array
+    arr = sorted([random.randint(*ARRAY_VALUE_RANGE) for _ in range(5)])
+    tests.append({"input": arr, "expected_output": sorted(arr)})
+    # Edge case: reverse-sorted array
+    arr = sorted([random.randint(*ARRAY_VALUE_RANGE) for _ in range(5)], reverse=True)
+    tests.append({"input": arr, "expected_output": sorted(arr)})
+    # Edge case: single element
+    arr = [random.randint(*ARRAY_VALUE_RANGE)]
+    tests.append({"input": arr, "expected_output": sorted(arr)})
+    return tests
+def generate_breaker_task(original_task: dict[str, Any]) -> dict[str, Any]:
+    """
+    Build an adversarial task for the Breaker agent (`original_task` is currently unused).
+    The Breaker is asked to produce arrays that are likely to break a naive solution.
+    Returns a dict with the adversarial prompt and a list of candidate adversarial tests.
+    """
+    adversarial_candidates = [
+        # All identical elements
+        [0] * random.randint(3, 8),
+        # All negative values
+        [random.randint(-100, -1) for _ in range(random.randint(3, 8))],
+        # Large array
+        [random.randint(*ARRAY_VALUE_RANGE) for _ in range(MAX_ARRAY_SIZE)],
+        # Duplicate-heavy array
+        [random.choice([1, 2, 3]) for _ in range(random.randint(4, 10))],
+        # Mixed positive/negative with duplicates
+        [random.randint(-5, 5) for _ in range(random.randint(4, 12))],
+    ]
+    adversarial_tests = [
+        {"input": arr, "expected_output": sorted(arr)}
+        for arr in adversarial_candidates
+    ]
+    breaker_task = {
+        "prompt": (
+            "You are the Breaker agent. Generate adversarial integer arrays that are "
+            "likely to expose flaws in a naive sorting implementation. "
+            "Focus on edge cases: duplicates, negatives, large inputs, already-sorted, "
+            "reverse-sorted, and single-element arrays."
+        ),
+        "adversarial_tests": adversarial_tests,
+    }
+    return breaker_task
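
To see why duplicate-heavy arrays are useful adversarial candidates, the sketch below checks a deliberately buggy solution against test dicts in the same `{"input", "expected_output"}` shape that `tasks.py` produces. The names `make_test`, `naive_solution`, and `failing_tests` are hypothetical and exist only for this illustration.

```python
from typing import Any, Callable


def make_test(arr: list[int]) -> dict[str, Any]:
    """Build a test dict in the same shape tasks.py uses."""
    return {"input": arr, "expected_output": sorted(arr)}


def naive_solution(arr: list[int]) -> list[int]:
    # Buggy on purpose: set() silently drops duplicate values.
    return sorted(set(arr))


def failing_tests(
    solution: Callable[[list[int]], list[int]],
    tests: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return the tests whose expected output the solution does not reproduce."""
    return [t for t in tests if solution(list(t["input"])) != t["expected_output"]]


tests = [
    make_test([0, 0, 0, 0]),      # all identical elements
    make_test([-5, -1, -3]),      # all negative, no duplicates
    make_test([2, 1, 2, 3, 1]),   # duplicate-heavy
]

if __name__ == "__main__":
    for t in failing_tests(naive_solution, tests):
        print("broken by:", t["input"])
```

The all-negative array passes here, while both arrays containing duplicates expose the bug.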

FORGE-v4/trainer.py ADDED

	@@ -0,0 +1,158 @@

+# trainer.py
+# Placeholder training loop hooks for FORGE-v4.
+# Ready for future TRL / Unsloth / Hugging Face integration.
+from typing import Any, Callable
+from env import FORGEEnv
+from memory import CoachMemory
+from config import MAX_EPISODES, STEPS_PER_EPISODE
+# ──────────────────────────────────────────────
+# Placeholder agent policy functions
+# ──────────────────────────────────────────────
+def default_coder_policy(state: dict[str, Any]) -> str:
+    """
+    Placeholder Coder policy.
+    In production this will call a fine-tuned LLM (e.g. via TRL/Unsloth) to
+    generate Python code from the task prompt.
+    Currently returns a trivial reference solution so the environment runs.
+    """
+    # TODO: Replace with LLM inference call
+    return "def solution(arr):\n    return sorted(arr)\n"
+def default_breaker_policy(state: dict[str, Any]) -> list[dict[str, Any]]:
+    """
+    Placeholder Breaker policy.
+    In production this will call a fine-tuned adversarial LLM to generate
+    adversarial test cases from the task prompt.
+    Currently returns a fixed set of edge-case test inputs.
+    """
+    # TODO: Replace with adversarial LLM inference call
+    return [
+        {"input": [],                             "expected_output": []},
+        {"input": [1],                            "expected_output": [1]},
+        {"input": [3, 1, 2],                      "expected_output": [1, 2, 3]},
+        {"input": [-5, -1, -3],                   "expected_output": [-5, -3, -1]},
+        {"input": [0, 0, 0, 0],                   "expected_output": [0, 0, 0, 0]},
+    ]
+# ──────────────────────────────────────────────
+# Core training loop
+# ──────────────────────────────────────────────
+def train(
+    coder_policy: Callable[[dict[str, Any]], str] = default_coder_policy,
+    breaker_policy: Callable[[dict[str, Any]], list[dict[str, Any]]] = default_breaker_policy,
+    num_episodes: int = MAX_EPISODES,
+    verbose: bool = True,
+) -> dict[str, Any]:
+    """
+    Run the FORGE-v4 training loop.
+    Args:
+        coder_policy:   Callable(state) → Python source string.
+        breaker_policy: Callable(state) → list of test-case dicts.
+        num_episodes:   Number of training episodes to run.
+        verbose:        Print per-episode summaries when True.
+    Returns:
+        Training summary dict with per-episode reward histories.
+    """
+    memory = CoachMemory()
+    env = FORGEEnv(memory=memory)
+    episode_history: list[dict[str, Any]] = []
+    for ep in range(1, num_episodes + 1):
+        state = env.reset()
+        episode_coder_rewards   = []
+        episode_breaker_rewards = []
+        for _ in range(STEPS_PER_EPISODE):
+            # ── Agent decisions ────────────────────────────────────────────
+            coder_code    = coder_policy(state)
+            breaker_tests = breaker_policy(state)
+            action = {
+                "coder_code":    coder_code,
+                "breaker_tests": breaker_tests,
+            }
+            # ── Environment step ───────────────────────────────────────────
+            result = env.step(action)
+            state  = result["state"]
+            episode_coder_rewards.append(result["coder_reward"]["total_reward"])
+            episode_breaker_rewards.append(result["breaker_reward"]["total_reward"])
+            if result["done"]:
+                break
+        # ── Episode summary ────────────────────────────────────────────────
+        avg_cr = round(sum(episode_coder_rewards)   / max(len(episode_coder_rewards),   1), 4)
+        avg_br = round(sum(episode_breaker_rewards) / max(len(episode_breaker_rewards), 1), 4)
+        ep_summary = {
+            "episode":              ep,
+            "avg_coder_reward":     avg_cr,
+            "avg_breaker_reward":   avg_br,
+            "steps":                env.step_count,
+        }
+        episode_history.append(ep_summary)
+        if verbose:
+            print(
+                f"[Episode {ep:>4}/{num_episodes}]  "
+                f"Coder avg reward: {avg_cr:+.4f}  |  "
+                f"Breaker avg reward: {avg_br:+.4f}"
+            )
+        # ── TRL / Unsloth hook placeholders ───────────────────────────────
+        _on_episode_end(ep, ep_summary, memory)
+    training_summary = {
+        "total_episodes":      num_episodes,
+        "episode_history":     episode_history,
+        "memory_summary":      memory.summary(),
+    }
+    return training_summary
+# ──────────────────────────────────────────────
+# Hook placeholders for future RL framework integration
+# ──────────────────────────────────────────────
+def _on_episode_end(
+    episode: int,
+    summary: dict[str, Any],
+    memory: CoachMemory,
+) -> None:
+    """
+    Called at the end of every episode.
+    TODO: Plug in TRL PPOTrainer / Unsloth model updates here.
+    E.g.:
+        trainer.step(queries, responses, rewards)
+        model.save_pretrained(f"models/checkpoint-ep{episode}")
+    """
+    pass  # placeholder
+def _on_step_end(
+    step: int,
+    result: dict[str, Any],
+) -> None:
+    """
+    Intended to run after every environment step; train() does not call it yet.
+    TODO: Plug in per-step reward logging (e.g. W&B, TensorBoard) here.
+    """
+    pass  # placeholder
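
One way the `_on_step_end` placeholder might be filled in is to flatten each step result into a JSON line before handing it to a logger. This is a sketch under assumptions: `format_step_log` is a hypothetical helper, and it relies only on the result keys that `train()` already reads (`coder_reward`, `breaker_reward`, `done`).

```python
import json
from typing import Any


def format_step_log(step: int, result: dict[str, Any]) -> str:
    """Flatten one env.step() result into a single JSON line (hypothetical helper)."""
    record = {
        "step": step,
        "coder_reward": result["coder_reward"]["total_reward"],
        "breaker_reward": result["breaker_reward"]["total_reward"],
        "done": bool(result.get("done", False)),
    }
    return json.dumps(record)


if __name__ == "__main__":
    # Example result with the nested reward shape used in train().
    sample = {
        "coder_reward": {"total_reward": 0.5},
        "breaker_reward": {"total_reward": -0.25},
        "done": False,
    }
    print(format_step_log(1, sample))
```

Lines in this form could be appended to a `.jsonl` file under `logs/`, which the project's `.gitignore` already excludes.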

attached_assets/Pasted-Create-a-Python-project-named-FORGE-v4-Build-the-comple_1777105563327.txt ADDED

	@@ -0,0 +1,108 @@

+Create a Python project named **FORGE-v4**.
+Build the complete project skeleton with this exact structure:
+FORGE-v4/
+│── app.py
+│── env.py
+│── tasks.py
+│── rewards.py
+│── sandbox.py
+│── memory.py
+│── trainer.py
+│── config.py
+│── requirements.txt
+│── README.md
+│── data/
+│── logs/
+│── models/
+│── outputs/
+Project Purpose:
+FORGE-v4 is a hackathon project based on an OpenEnv-style reinforcement learning environment where:
+1. A Coder Agent writes Python code to solve integer array sorting tasks.
+2. A Breaker Agent creates adversarial test cases to break the solution.
+3. A Sandbox safely runs generated code.
+4. Rewards are assigned to both agents.
+5. Coach Memory stores lessons learned across episodes.
+Generate clean, modular starter code for all files.
+Required file responsibilities:
+1. app.py
+* Main runner script
+* Minimal CLI demo
+* Runs one sample episode
+2. env.py
+* Main environment class
+* Include methods:
+  * reset()
+  * step(action)
+  * get_state()
+3. tasks.py
+* Generate integer array sorting tasks
+* Sample hidden test cases
+4. rewards.py
+* Functions for coder_reward()
+* Functions for breaker_reward()
+5. sandbox.py
+* Safely execute generated Python code
+* Include timeout handling
+* Return pass/fail/error info
+6. memory.py
+* Coach memory system
+* Store lessons learned in JSON/list format
+* Load/save memory helpers
+7. trainer.py
+* Placeholder training loop hooks
+* Future TRL / Unsloth integration ready
+8. config.py
+* Store constants such as:
+  * timeout seconds
+  * max array size
+  * tier thresholds
+  * reward weights
+9. requirements.txt
+   Include useful starter dependencies only.
+10. README.md
+    Professional first draft including:
+* Project overview
+* Architecture
+* File structure
+* How to run
+Important Rules:
+* Generate WORKING starter code, not empty files.
+* Use Python best practices.
+* Add comments throughout code.
+* Keep modular design.
+* Keep ready for future Google Colab training and Hugging Face deployment.
+* No frontend needed now.
+* Focus only on backend environment skeleton.
+After generating files, ensure the project runs successfully with:
+python app.py
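
The `reset()` / `step(action)` / `get_state()` interface requested for `env.py` can be illustrated with a toy environment. This is a simplified sketch, not the project's `FORGEEnv`: the class `TinySortEnv` is hypothetical, and its action is a proposed sorted list rather than generated code, so no sandbox is needed.

```python
import random
from typing import Any, Optional


class TinySortEnv:
    """Toy environment exposing reset(), step(action), and get_state()."""

    def __init__(self, max_steps: int = 3, seed: Optional[int] = None) -> None:
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.step_count = 0
        self.task: list[int] = []

    def reset(self) -> dict[str, Any]:
        """Start a new episode with a fresh random array."""
        self.step_count = 0
        self.task = [self.rng.randint(-10, 10) for _ in range(5)]
        return self.get_state()

    def get_state(self) -> dict[str, Any]:
        """Return a copy of the observable state."""
        return {"task": list(self.task), "step_count": self.step_count}

    def step(self, action: list[int]) -> dict[str, Any]:
        """Score a proposed sorted list; the episode ends on success or at max_steps."""
        self.step_count += 1
        reward = 1.0 if action == sorted(self.task) else 0.0
        done = reward == 1.0 or self.step_count >= self.max_steps
        return {"state": self.get_state(), "reward": reward, "done": done}


if __name__ == "__main__":
    env = TinySortEnv(seed=0)
    state = env.reset()
    print(env.step(sorted(state["task"])))
```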