Commit fc01d79 · committed by ksanjuma1234
Parent(s): 8337c4b
Add a new adversarial code generation environment for reinforcement learning
Create the FORGE-v4 Python environment, including core components for agent interaction, code execution, reward calculation, and memory storage.
Replit-Commit-Author: Agent
Replit-Commit-Session-Id: a7518b1f-70c7-4487-82d2-42195935723e
Replit-Commit-Checkpoint-Type: full_checkpoint
Replit-Commit-Event-Id: dbf3c097-a076-4e9c-b916-ee3775367bd1
Replit-Helium-Checkpoint-Created: true
- .replit +9 -1
- FORGE-v4/.gitignore +37 -0
- FORGE-v4/README.md +193 -0
- FORGE-v4/app.py +95 -0
- FORGE-v4/config.py +52 -0
- FORGE-v4/env.py +175 -0
- FORGE-v4/logs/.gitkeep +0 -0
- FORGE-v4/memory.py +135 -0
- FORGE-v4/models/.gitkeep +0 -0
- FORGE-v4/outputs/.gitkeep +0 -0
- FORGE-v4/requirements.txt +40 -0
- FORGE-v4/rewards.py +116 -0
- FORGE-v4/sandbox.py +108 -0
- FORGE-v4/tasks.py +102 -0
- FORGE-v4/trainer.py +158 -0
- attached_assets/Pasted-Create-a-Python-project-named-FORGE-v4-Build-the-comple_1777105563327.txt +108 -0
.replit
CHANGED
@@ -1,4 +1,4 @@
-modules = ["nodejs-24"]
+modules = ["nodejs-24", "python-3.11"]
 
 [deployment]
 router = "application"
@@ -18,3 +18,11 @@ expertMode = true
 [postMerge]
 path = "scripts/post-merge.sh"
 timeoutMs = 20000
+
+[[ports]]
+localPort = 8080
+externalPort = 8080
+
+[[ports]]
+localPort = 8081
+externalPort = 80
FORGE-v4/.gitignore
ADDED
@@ -0,0 +1,37 @@
+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.egg-info/
+dist/
+build/
+.eggs/
+*.whl
+
+# Virtual environments
+venv/
+.venv/
+env/
+
+# Data / runtime outputs (keep directories, ignore contents)
+data/*.json
+logs/*.log
+logs/*.jsonl
+models/*
+!models/.gitkeep
+outputs/*
+!outputs/.gitkeep
+
+# Jupyter / Colab
+.ipynb_checkpoints/
+*.ipynb
+
+# IDE
+.vscode/
+.idea/
+
+# OS
+.DS_Store
+Thumbs.db
FORGE-v4/README.md
ADDED
@@ -0,0 +1,193 @@
+# FORGE-v4
+
+**Adversarial Code Generation Environment for Reinforcement Learning**
+
+A hackathon project built on an **OpenEnv-style** reinforcement learning framework where two competing agents, a Coder and a Breaker, are trained adversarially on Python sorting tasks.
+
+---
+
+## Overview
+
+FORGE-v4 pits two agents against each other:
+
+| Agent | Role |
+|-------|------|
+| **Coder** | Writes Python code to solve integer array sorting tasks |
+| **Breaker** | Generates adversarial test cases to expose flaws in the Coder's solution |
+
+Each episode the Coder earns rewards for passing hidden tests; the Breaker earns rewards for breaking the Coder's solution. A **Coach Memory** module accumulates lessons learned across episodes to guide future training.
+
+The skeleton is designed to be **drop-in ready for TRL / Unsloth fine-tuning** and **Hugging Face deployment**.
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────┐
+│                FORGEEnv (env.py)                │
+│                                                 │
+│  ┌──────────────┐     ┌──────────────────┐      │
+│  │ Coder Agent  │     │  Breaker Agent   │      │
+│  │ (policy fn)  │     │  (policy fn)     │      │
+│  └──────┬───────┘     └────────┬─────────┘      │
+│         │ code (str)           │ test cases     │
+│         ▼                      ▼                │
+│  ┌──────────────────────────────────────────┐   │
+│  │           Sandbox (sandbox.py)           │   │
+│  │  subprocess · timeout · pass/fail/error  │   │
+│  └──────────────────┬───────────────────────┘   │
+│                     │ results                   │
+│                     ▼                           │
+│  ┌──────────────────────────────────────────┐   │
+│  │           Rewards (rewards.py)           │   │
+│  │  coder_reward() · breaker_reward()       │   │
+│  └──────────────────┬───────────────────────┘   │
+│                     │                           │
+│                     ▼                           │
+│  ┌──────────────────────────────────────────┐   │
+│  │         Coach Memory (memory.py)         │   │
+│  │  JSON-backed · lessons · summary()       │   │
+│  └──────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────┘
+```
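The flow in the diagram (Coder code plus test inputs go into a subprocess with a timeout, and each run comes back as pass, fail, or error) can be sketched roughly as follows. This is an illustration only, not the contents of `sandbox.py`: the helper names `run_in_subprocess` and `score_step`, the JSON-over-stdout harness, and the use of `sorted()` as the reference answer are all assumptions made for the example.

```python
import json
import subprocess
import sys


def run_in_subprocess(code: str, arr: list[int], timeout: float = 5.0) -> str:
    """Run solution(arr) from `code` in a child interpreter and classify the outcome."""
    # Hypothetical harness: append a call to solution() and print the result as JSON.
    harness = code + "\nimport json\nprint(json.dumps(solution(" + json.dumps(arr) + ")))\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", harness],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "error"
    if proc.returncode != 0:
        return "error"
    try:
        output = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return "error"
    return "pass" if output == sorted(arr) else "fail"


def score_step(code: str, tests: list[dict]) -> dict:
    """Run every test input and summarise the statuses."""
    statuses = [run_in_subprocess(code, t["input"]) for t in tests]
    pass_rate = statuses.count("pass") / len(tests) if tests else 0.0
    return {"statuses": statuses, "pass_rate": pass_rate}


coder_code = "def solution(arr):\n    return sorted(arr)\n"
breaker_tests = [{"input": []}, {"input": [3, 1, 2]}, {"input": [5, 5, -1]}]
report = score_step(coder_code, breaker_tests)
print(report)
```

A plain subprocess with a timeout limits runaway execution but is not a security boundary; untrusted model output would need stronger isolation than this sketch provides.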
+
+---
+
+## File Structure
+
+```
+FORGE-v4/
+├── app.py              # CLI entry point: runs one demo episode
+├── env.py              # FORGEEnv: reset() / step() / get_state()
+├── tasks.py            # Task generator + hidden test sampler
+├── rewards.py          # coder_reward() and breaker_reward()
+├── sandbox.py          # Safe subprocess code execution with timeout
+├── memory.py           # CoachMemory: JSON-backed lessons store
+├── trainer.py          # Training loop + TRL/Unsloth hook placeholders
+├── config.py           # All constants (timeout, rewards, tier thresholds)
+├── requirements.txt    # Dependencies
+├── README.md           # This file
+├── data/               # coach_memory.json (auto-created)
+├── logs/               # Episode logs
+├── models/             # Saved model checkpoints
+└── outputs/            # Generated code outputs
+```
+
+---
+
+## How to Run
+
+### 1. Install dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+> **Note:** The core skeleton has minimal dependencies. ML packages (TRL, Unsloth, PyTorch) are commented out in `requirements.txt`; uncomment them when adding LLM training.
+
+### 2. Run a demo episode
+
+```bash
+python app.py
+```
+
+This runs a single episode with placeholder Coder and Breaker policies (the Coder always uses `sorted()`, the Breaker sends fixed edge cases). You should see per-step reward output and a coach memory summary.
+
+### 3. Optional: override step count
+
+```bash
+python app.py --steps 3
+```
+
+---
+
+## Configuration
+
+Edit `config.py` to adjust environment constants:
+
+| Constant | Default | Description |
+|----------|---------|-------------|
+| `SANDBOX_TIMEOUT_SECONDS` | `5` | Max execution time per code run |
+| `MAX_ARRAY_SIZE` | `20` | Largest generated array |
+| `NUM_HIDDEN_TESTS` | `5` | Hidden test cases per task |
+| `CODER_PASS_REWARD` | `1.0` | Reward per passing test |
+| `BREAKER_BREAK_REWARD` | `1.0` | Reward per test that breaks coder |
+| `MAX_EPISODES` | `100` | Default training episode count |
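To make the reward constants concrete, here is a small worked example. `rewards.py` is not shown in this part of the commit, so the purely additive scoring below is an assumption, as are the helper names `coder_total` and `breaker_total` and the choice to count both "fail" and "error" as a break. The real `breaker_reward()` also takes a `coder_base_pass_rate` argument, which this sketch ignores. The weights are the defaults from `config.py`.

```python
# Default weights copied from config.py.
CODER_PASS_REWARD = 1.0
CODER_FAIL_PENALTY = -0.5
CODER_ERROR_PENALTY = -1.0
BREAKER_BREAK_REWARD = 1.0
BREAKER_FAIL_PENALTY = -0.3


def coder_total(statuses: list[str]) -> float:
    """Assumed scoring: sum one weight per hidden-test status."""
    weights = {
        "pass": CODER_PASS_REWARD,
        "fail": CODER_FAIL_PENALTY,
        "error": CODER_ERROR_PENALTY,
    }
    return sum(weights[s] for s in statuses)


def breaker_total(statuses: list[str]) -> float:
    """Assumed scoring: any non-passing Coder result counts as a break."""
    return sum(
        BREAKER_FAIL_PENALTY if s == "pass" else BREAKER_BREAK_REWARD
        for s in statuses
    )


# Five hidden tests: three pass, one fails, one errors.
coder_score = coder_total(["pass", "pass", "pass", "fail", "error"])
# Three Breaker tests: the Coder survives one and breaks on two.
breaker_score = breaker_total(["pass", "fail", "error"])
print(coder_score, breaker_score)
```

Under these assumptions the Coder's total is 3 × 1.0 − 0.5 − 1.0 and the Breaker's is 2 × 1.0 − 0.3, which shows how a single error costs the Coder as much as a passing test earns.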
+
+---
+
+## Extending with LLM Agents
+
+Replace the placeholder policies in `trainer.py`:
+
+```python
+# trainer.py
+def my_coder_policy(state: dict) -> str:
+    prompt = state["task_prompt"]
+    # Call your LLM here (TRL model, OpenAI API, Unsloth, etc.)
+    return generated_code
+
+def my_breaker_policy(state: dict) -> list[dict]:
+    prompt = state["task_prompt"]
+    # Call your adversarial LLM here
+    return [{"input": arr} for arr in generated_arrays]
+```
+
+Then run:
+
+```python
+from trainer import train
+summary = train(
+    coder_policy=my_coder_policy,
+    breaker_policy=my_breaker_policy,
+    num_episodes=50,
+)
+```
+
+---
+
+## TRL / Unsloth Integration (Future)
+
+Hook points are prepared in `trainer.py`:
+
+- `_on_episode_end()`: plug in `PPOTrainer.step()` or `GRPOTrainer` updates
+- `_on_step_end()`: plug in per-step reward logging (W&B, TensorBoard)
+
+```python
+# Example (uncomment in trainer.py after installing TRL):
+# from trl import PPOTrainer, PPOConfig
+# trainer = PPOTrainer(config=PPOConfig(...), model=model, ...)
+# trainer.step(queries, responses, rewards)
+```
+
+---
+
+## Google Colab
+
+1. Clone or upload the project to Colab.
+2. Install Unsloth:
+   ```
+   !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+   ```
+3. Mount Drive and set `MEMORY_FILE` / `MODELS_DIR` in `config.py` to paths under `/content/drive/MyDrive/`.
+4. Run `python app.py` or import and call `train()` directly.
+
+---
+
+## Hugging Face Deployment
+
+After training, push your model with:
+
+```python
+model.push_to_hub("your-username/forge-v4-coder")
+tokenizer.push_to_hub("your-username/forge-v4-coder")
+```
+
+The repo structure (`models/`, `outputs/`) maps directly to HF Hub conventions.
+
+---
+
+## License
+
+MIT
FORGE-v4/app.py
ADDED
@@ -0,0 +1,95 @@
+# app.py
+# Main runner script for FORGE-v4.
+# Runs a minimal CLI demo of one sample episode.
+
+import sys
+import json
+from env import FORGEEnv
+from memory import CoachMemory
+from trainer import default_coder_policy, default_breaker_policy
+from config import STEPS_PER_EPISODE
+
+
+def run_demo_episode() -> None:
+    """
+    Execute a single demo episode and print the results to stdout.
+    """
+    print("=" * 60)
+    print(" FORGE-v4 | Adversarial Code Generation Environment")
+    print("=" * 60)
+
+    # Initialise coach memory and environment
+    memory = CoachMemory()
+    env = FORGEEnv(memory=memory)
+
+    # Reset to start the episode
+    state = env.reset()
+
+    print(f"\n[Episode {state['episode']}] Task prompt:\n")
+    print(state["task_prompt"])
+    print()
+
+    for step in range(1, STEPS_PER_EPISODE + 1):
+        print(f"── Step {step}/{STEPS_PER_EPISODE} " + "─" * 40)
+
+        # Agents produce their actions (placeholder policies for the demo)
+        coder_code = default_coder_policy(state)
+        breaker_tests = default_breaker_policy(state)
+
+        action = {
+            "coder_code": coder_code,
+            "breaker_tests": breaker_tests,
+        }
+
+        result = env.step(action)
+        # Carry the updated state forward so the next policy call is not stale
+        state = result["state"]
+
+        cr = result["coder_reward"]
+        br = result["breaker_reward"]
+
+        print(
+            f" Coder   | pass_rate: {cr['pass_rate']:.2f} "
+            f"| passes: {cr['pass_count']} "
+            f"| fails: {cr['fail_count']} "
+            f"| errors: {cr['error_count']} "
+            f"| reward: {cr['total_reward']:+.2f}"
+        )
+        print(
+            f" Breaker | break_rate: {br['break_rate']:.2f} "
+            f"| breaks: {br['breaks']} "
+            f"| passes: {br['passes']} "
+            f"| reward: {br['total_reward']:+.2f}"
+        )
+
+        if result["done"]:
+            break
+
+    print("\n" + "=" * 60)
+    print(" Episode complete. Coach memory summary:")
+    print(json.dumps(memory.summary(), indent=2))
+    print("=" * 60)
+
+
+def main() -> None:
+    """Entry point: parse minimal CLI args and run."""
+    # Declared up front so the --steps override can rebind this module's copy.
+    global STEPS_PER_EPISODE
+    args = sys.argv[1:]
+
+    if "--help" in args or "-h" in args:
+        print("Usage: python app.py [--steps N]")
+        print(" --steps N Override STEPS_PER_EPISODE for this run (default: from config.py)")
+        sys.exit(0)
+
+    # Optional: override step count via CLI
+    if "--steps" in args:
+        idx = args.index("--steps")
+        try:
+            steps = int(args[idx + 1])
+        except (IndexError, ValueError):
+            print("Error: --steps requires an integer argument.")
+            sys.exit(1)
+
+        # `from config import STEPS_PER_EPISODE` copies the value at import
+        # time, so setting it on `config` alone would not reach app.py or
+        # env.py. Rebind it in every module that already imported it.
+        import config
+        import env as env_module
+        config.STEPS_PER_EPISODE = steps
+        env_module.STEPS_PER_EPISODE = steps
+        STEPS_PER_EPISODE = steps
+
+    run_demo_episode()
+
+
+if __name__ == "__main__":
+    main()
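A note on the `--steps` override: `from config import STEPS_PER_EPISODE` binds a new name in the importing module to the value the constant has at import time, so reassigning the attribute on the `config` module later does not change names that were already imported. The snippet below demonstrates that behaviour with a stand-in module object built via `types.ModuleType` (used here only so the demo is self-contained; it is not the real `config.py`).

```python
import types

# Stand-in for config.py with its default value.
config = types.ModuleType("config")
config.STEPS_PER_EPISODE = 10

# What `from config import STEPS_PER_EPISODE` effectively does in another module:
STEPS_PER_EPISODE = config.STEPS_PER_EPISODE

# A later override on the module object...
config.STEPS_PER_EPISODE = 3

# ...is visible through the module attribute but not through the imported name.
print(config.STEPS_PER_EPISODE)  # 3
print(STEPS_PER_EPISODE)         # 10
```

Reading the value as `config.STEPS_PER_EPISODE` at the point of use, instead of importing the name, is one common way to let a runtime override take effect everywhere.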
FORGE-v4/config.py
ADDED
@@ -0,0 +1,52 @@
+# config.py
+# Central configuration constants for FORGE-v4
+
+# ──────────────────────────────────────────────
+# Sandbox settings
+# ──────────────────────────────────────────────
+SANDBOX_TIMEOUT_SECONDS = 5       # Max time allowed for code execution
+SANDBOX_MAX_OUTPUT_CHARS = 4096   # Truncate stdout/stderr beyond this length
+
+# ──────────────────────────────────────────────
+# Task / environment settings
+# ──────────────────────────────────────────────
+MAX_ARRAY_SIZE = 20               # Max length of generated integer arrays
+MIN_ARRAY_SIZE = 3                # Min length of generated integer arrays
+ARRAY_VALUE_RANGE = (-100, 100)   # (min, max) integers in generated arrays
+NUM_HIDDEN_TESTS = 5              # Number of hidden test cases per task
+
+# ──────────────────────────────────────────────
+# Reward settings
+# ──────────────────────────────────────────────
+# Coder reward weights
+CODER_PASS_REWARD = 1.0           # Reward per passing hidden test
+CODER_FAIL_PENALTY = -0.5         # Penalty per failing hidden test
+CODER_ERROR_PENALTY = -1.0        # Penalty when code raises an error
+
+# Breaker reward weights
+BREAKER_BREAK_REWARD = 1.0        # Reward when breaker's test breaks coder
+BREAKER_FAIL_PENALTY = -0.3       # Penalty when breaker's test does NOT break coder
+
+# ──────────────────────────────────────────────
+# Tier thresholds (coder skill levels)
+# ──────────────────────────────────────────────
+TIER_THRESHOLDS = {
+    "novice":       (0.0, 0.4),   # pass-rate range [low, high)
+    "intermediate": (0.4, 0.7),
+    "advanced":     (0.7, 0.9),
+    "expert":       (0.9, 1.01),
+}
+
+# ──────────────────────────────────────────────
+# Memory / logging
+# ──────────────────────────────────────────────
+MEMORY_FILE = "data/coach_memory.json"   # Persistent memory path
+LOG_DIR = "logs/"                        # Directory for episode logs
+MODELS_DIR = "models/"                   # Saved model checkpoints
+OUTPUTS_DIR = "outputs/"                 # Generated code outputs
+
+# ──────────────────────────────────────────────
+# Training placeholders
+# ──────────────────────────────────────────────
+MAX_EPISODES = 100                # Default training episode count
+STEPS_PER_EPISODE = 10            # Steps per episode
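The half-open `[low, high)` ranges in `TIER_THRESHOLDS` can be resolved with a simple lookup. The helper name `tier_for` below is hypothetical (no such function appears in the files shown here); the dictionary is copied from `config.py`. The `1.01` upper bound on "expert" is what lets a perfect pass rate of `1.0` fall inside a half-open interval.

```python
TIER_THRESHOLDS = {
    "novice": (0.0, 0.4),
    "intermediate": (0.4, 0.7),
    "advanced": (0.7, 0.9),
    "expert": (0.9, 1.01),
}


def tier_for(pass_rate: float) -> str:
    """Return the tier whose [low, high) range contains pass_rate."""
    for name, (low, high) in TIER_THRESHOLDS.items():
        if low <= pass_rate < high:
            return name
    raise ValueError(f"pass_rate {pass_rate!r} is outside every tier range")


for rate in (0.0, 0.4, 0.85, 1.0):
    print(rate, tier_for(rate))
```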
FORGE-v4/env.py
ADDED
@@ -0,0 +1,175 @@
+# env.py
+# Main OpenEnv-style reinforcement learning environment for FORGE-v4.
+# Manages the interaction between the Coder Agent, Breaker Agent, and Sandbox.
+
+from typing import Any
+from tasks import generate_task, generate_breaker_task
+from sandbox import run_code_against_tests
+from rewards import coder_reward, breaker_reward
+from memory import CoachMemory
+from config import STEPS_PER_EPISODE
+
+
+class FORGEEnv:
+    """
+    Two-agent adversarial environment for code generation tasks.
+
+    Agents:
+        - Coder: writes Python code to solve array-sorting tasks.
+        - Breaker: generates adversarial test cases to break the Coder's solution.
+
+    Episode flow:
+        1. reset() -> returns the initial task state
+        2. step(action) × STEPS_PER_EPISODE steps
+        3. Rewards assigned to both agents at each step
+
+    Action format:
+        {
+            "coder_code": str | None,      # Python source defining solution(arr)
+            "breaker_tests": list | None,  # List of {"input": [...]} dicts
+        }
+    """
+
+    def __init__(self, memory: CoachMemory | None = None):
+        self.memory = memory or CoachMemory()
+        self.episode: int = 0
+        self.step_count: int = 0
+        self.current_task: dict[str, Any] = {}
+        self.done: bool = True
+        self._last_coder_code: str = ""
+        self._last_coder_pass_rate: float = 0.0
+
+    # ──────────────────────────────────────────────
+    # Core env methods
+    # ──────────────────────────────────────────────
+
+    def reset(self) -> dict[str, Any]:
+        """
+        Start a new episode.
+
+        Returns:
+            Initial state dict containing the task prompt and public example.
+        """
+        self.episode += 1
+        self.step_count = 0
+        self.done = False
+        self._last_coder_code = ""
+        self._last_coder_pass_rate = 0.0
+
+        self.current_task = generate_task()
+
+        state = self.get_state()
+        return state
+
+    def step(self, action: dict[str, Any]) -> dict[str, Any]:
+        """
+        Advance the environment by one step.
+
+        Args:
+            action: dict with optional keys:
+                "coder_code": Python source defining solution(arr)
+                "breaker_tests": list of {"input": [...]} dicts
+
+        Returns:
+            {
+                "state": current env state,
+                "coder_reward": coder reward info dict,
+                "breaker_reward": breaker reward info dict,
+                "done": bool (True when episode ends),
+                "info": extra diagnostics,
+            }
+        """
+        if self.done:
+            raise RuntimeError("Episode is done. Call reset() before step().")
+
+        self.step_count += 1
+        coder_code = action.get("coder_code", "")
+        breaker_tests = action.get("breaker_tests", [])
+
+        # ── Evaluate Coder ────────────────────────────
+        coder_info = self._evaluate_coder(coder_code)
+
+        # ── Evaluate Breaker ──────────────────────────
+        breaker_info = self._evaluate_breaker(coder_code, breaker_tests, coder_info)
+
+        # ── Log to Coach Memory ───────────────────────
+        self.memory.add_lesson(
+            episode=self.episode,
+            agent="env",
+            observation=(
+                f"Step {self.step_count}: "
+                f"coder pass_rate={coder_info['pass_rate']:.2f}, "
+                f"breaker break_rate={breaker_info['break_rate']:.2f}"
+            ),
+            coder_reward=coder_info["total_reward"],
+            breaker_reward=breaker_info["total_reward"],
+            extra={
+                "step": self.step_count,
+                "coder_pass_rate": coder_info["pass_rate"],
+                "breaker_break_rate": breaker_info["break_rate"],
+            },
+        )
+
+        # ── Check done ────────────────────────────────
+        if self.step_count >= STEPS_PER_EPISODE:
+            self.done = True
+
+        return {
+            "state": self.get_state(),
+            "coder_reward": coder_info,
+            "breaker_reward": breaker_info,
+            "done": self.done,
+            "info": {
+                "episode": self.episode,
+                "step": self.step_count,
+            },
+        }
+
+    def get_state(self) -> dict[str, Any]:
+        """
+        Return the current observable state of the environment.
+        """
+        return {
+            "episode": self.episode,
+            "step": self.step_count,
+            "done": self.done,
+            "task_prompt": self.current_task.get("prompt", ""),
+            "public_example": self.current_task.get("public_example", {}),
+            "last_pass_rate": self._last_coder_pass_rate,
+        }
+
+    # ──────────────────────────────────────────────
+    # Private helpers
+    # ──────────────────────────────────────────────
+
+    def _evaluate_coder(self, code: str) -> dict[str, Any]:
+        """Run the coder's code against hidden tests and compute reward."""
+        hidden_tests = self.current_task.get("hidden_tests", [])
+
+        if not code or not hidden_tests:
+            # No code submitted: max penalty
+            dummy_results = [{"status": "error"} for _ in hidden_tests or [{}]]
+            info = coder_reward(dummy_results)
+        else:
+            results = run_code_against_tests(code, hidden_tests)
+            info = coder_reward(results)
+
+        # Cache for Breaker quality multiplier
+        self._last_coder_code = code
+        self._last_coder_pass_rate = info["pass_rate"]
+        return info
+
+    def _evaluate_breaker(
+        self,
+        coder_code: str,
+        breaker_tests: list[dict[str, Any]],
+        coder_info: dict[str, Any],
+    ) -> dict[str, Any]:
+        """Run the coder's code against the breaker's adversarial tests."""
+        if not coder_code or not breaker_tests:
+            # No submission from one of the agents
+            dummy = [{"status": "pass"} for _ in breaker_tests or [{}]]
+            return breaker_reward(dummy, coder_base_pass_rate=coder_info["pass_rate"])
+
+        results = run_code_against_tests(coder_code, breaker_tests)
+        return breaker_reward(results, coder_base_pass_rate=coder_info["pass_rate"])
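Both private helpers build placeholder results with the expression `for _ in tests or [{}]`. In a comprehension the iterable is parsed as `tests or [{}]`, so an empty or missing test list still yields exactly one placeholder entry, which keeps the reward functions from receiving an empty result list. The helper name `dummy_results` below is hypothetical and exists only to isolate that idiom.

```python
def dummy_results(tests, status: str) -> list[dict]:
    """One placeholder result per test, or a single one if there are no tests."""
    # Parsed as: for _ in (tests or [{}])
    return [{"status": status} for _ in tests or [{}]]


print(dummy_results([], "error"))                              # one entry
print(dummy_results(None, "pass"))                             # one entry
print(dummy_results([{"input": [1]}, {"input": [2]}], "error"))  # two entries
```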
FORGE-v4/logs/.gitkeep
ADDED
File without changes
FORGE-v4/memory.py
ADDED
@@ -0,0 +1,135 @@
+# memory.py
+# Coach Memory system for FORGE-v4.
+# Stores lessons learned across episodes in a JSON file.
+
+import json
+import os
+from datetime import datetime
+from typing import Any
+from config import MEMORY_FILE
+
+
+class CoachMemory:
+    """
+    Persistent memory that accumulates lessons learned across training episodes.
+
+    Lessons are stored as a list of dicts in a JSON file and loaded on startup.
+    """
+
+    def __init__(self, filepath: str = MEMORY_FILE):
+        self.filepath = filepath
+        self.lessons: list[dict[str, Any]] = []
+        self._ensure_data_dir()
+        self.load()
+
+    # ──────────────────────────────────────────────
+    # Public API
+    # ──────────────────────────────────────────────
+
+    def add_lesson(
+        self,
+        episode: int,
+        agent: str,
+        observation: str,
+        coder_reward: float,
+        breaker_reward: float,
+        extra: dict[str, Any] | None = None,
+    ) -> None:
+        """
+        Record a lesson from one episode step.
+
+        Args:
+            episode: Episode index.
+            agent: "coder" | "breaker" | "env".
+            observation: Human-readable description of what happened.
+            coder_reward: Total coder reward for this step.
+            breaker_reward: Total breaker reward for this step.
+            extra: Optional additional metadata.
+        """
+        lesson = {
+            "timestamp": datetime.utcnow().isoformat(),
+            "episode": episode,
+            "agent": agent,
+            "observation": observation,
+            "coder_reward": coder_reward,
+            "breaker_reward": breaker_reward,
+        }
+        if extra:
+            lesson["extra"] = extra
+
+        self.lessons.append(lesson)
+        self.save()
+
+    def get_lessons(self, agent: str | None = None, last_n: int | None = None) -> list[dict[str, Any]]:
+        """
+        Retrieve stored lessons, optionally filtered by agent and/or limited to the last N.
+
+        Args:
+            agent: Filter to a specific agent ("coder", "breaker", "env"), or None for all.
+            last_n: Return only the last N lessons if provided.
+
+        Returns:
+            List of lesson dicts.
+        """
+        result = self.lessons
+        if agent is not None:
+            result = [l for l in result if l.get("agent") == agent]
+        if last_n is not None:
+            # Guard last_n <= 0: result[-0:] would return the whole list.
+            result = result[-last_n:] if last_n > 0 else []
+        return result
+
| 81 |
+
def summary(self) -> dict[str, Any]:
|
| 82 |
+
"""
|
| 83 |
+
Return a high-level summary of stored lessons.
|
| 84 |
+
"""
|
| 85 |
+
if not self.lessons:
|
| 86 |
+
return {"total_lessons": 0, "episodes_seen": 0}
|
| 87 |
+
|
| 88 |
+
episodes = {l["episode"] for l in self.lessons}
|
| 89 |
+
coder_rewards = [l["coder_reward"] for l in self.lessons]
|
| 90 |
+
breaker_rewards = [l["breaker_reward"] for l in self.lessons]
|
| 91 |
+
|
| 92 |
+
return {
|
| 93 |
+
"total_lessons": len(self.lessons),
|
| 94 |
+
"episodes_seen": len(episodes),
|
| 95 |
+
"avg_coder_reward": round(sum(coder_rewards) / len(coder_rewards), 4),
|
| 96 |
+
"avg_breaker_reward": round(sum(breaker_rewards) / len(breaker_rewards), 4),
|
| 97 |
+
}
|
| 98 |
+
|
| 99 |
+
def clear(self) -> None:
|
| 100 |
+
"""
|
| 101 |
+
Wipe all stored lessons (use with caution).
|
| 102 |
+
"""
|
| 103 |
+
self.lessons = []
|
| 104 |
+
self.save()
|
| 105 |
+
|
| 106 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββ
|
| 107 |
+
# Persistence helpers
|
| 108 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββ
|
| 109 |
+
|
| 110 |
+
def save(self) -> None:
|
| 111 |
+
"""Persist lessons to JSON file."""
|
| 112 |
+
with open(self.filepath, "w", encoding="utf-8") as f:
|
| 113 |
+
json.dump(self.lessons, f, indent=2)
|
| 114 |
+
|
| 115 |
+
def load(self) -> None:
|
| 116 |
+
"""Load lessons from JSON file if it exists."""
|
| 117 |
+
if os.path.exists(self.filepath):
|
| 118 |
+
try:
|
| 119 |
+
with open(self.filepath, "r", encoding="utf-8") as f:
|
| 120 |
+
self.lessons = json.load(f)
|
| 121 |
+
except (json.JSONDecodeError, IOError):
|
| 122 |
+
# Start fresh if file is corrupted
|
| 123 |
+
self.lessons = []
|
| 124 |
+
else:
|
| 125 |
+
self.lessons = []
|
| 126 |
+
|
| 127 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββ
|
| 128 |
+
# Internal helpers
|
| 129 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββ
|
| 130 |
+
|
| 131 |
+
def _ensure_data_dir(self) -> None:
|
| 132 |
+
"""Create the directory for the memory file if it doesn't exist."""
|
| 133 |
+
directory = os.path.dirname(self.filepath)
|
| 134 |
+
if directory:
|
| 135 |
+
os.makedirs(directory, exist_ok=True)
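The aggregation in `summary()` can be tried in isolation. The sketch below is a standalone approximation, not the class from the commit: `summarize` is a hypothetical free function, and the sample lessons are invented for illustration. Only the keys `episode`, `coder_reward`, and `breaker_reward` come from the diff above.

```python
from typing import Any


def summarize(lessons: list[dict[str, Any]]) -> dict[str, Any]:
    """Mirror CoachMemory.summary() on a plain list of lesson dicts."""
    if not lessons:
        # The empty case returns no average keys at all.
        return {"total_lessons": 0, "episodes_seen": 0}

    episodes = {lesson["episode"] for lesson in lessons}
    coder = [lesson["coder_reward"] for lesson in lessons]
    breaker = [lesson["breaker_reward"] for lesson in lessons]
    return {
        "total_lessons": len(lessons),
        "episodes_seen": len(episodes),
        "avg_coder_reward": round(sum(coder) / len(coder), 4),
        "avg_breaker_reward": round(sum(breaker) / len(breaker), 4),
    }


# Invented sample data: two lessons from episode 1, one from episode 2.
sample = [
    {"episode": 1, "coder_reward": 2.0, "breaker_reward": -1.0},
    {"episode": 1, "coder_reward": 1.0, "breaker_reward": 0.5},
    {"episode": 2, "coder_reward": 3.0, "breaker_reward": -0.5},
]
print(summarize(sample))
print(summarize([]))
```

Because the empty case omits `avg_coder_reward` and `avg_breaker_reward`, a caller that reads the averages is probably safer using `dict.get` with a default.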
|
FORGE-v4/models/.gitkeep
ADDED
|
File without changes
|
FORGE-v4/outputs/.gitkeep
ADDED
|
File without changes
|
FORGE-v4/requirements.txt
ADDED
|
@@ -0,0 +1,40 @@
+# FORGE-v4 requirements
+# Core environment: no heavy ML deps needed to run the skeleton.
+# Uncomment TRL / Unsloth blocks when adding LLM training.
+
+# ──────────────────────────────────────────────
+# Standard library extensions
+# ──────────────────────────────────────────────
+tqdm>=4.66.0          # Progress bars for training loops
+
+# ──────────────────────────────────────────────
+# Data / logging utilities
+# ──────────────────────────────────────────────
+numpy>=1.26.0         # Array math utilities
+pandas>=2.2.0         # Episode log analysis
+
+# ──────────────────────────────────────────────
+# Experiment tracking (optional but recommended)
+# ──────────────────────────────────────────────
+# wandb>=0.17.0       # Weights & Biases; uncomment to enable
+
+# ──────────────────────────────────────────────
+# LLM / RL training (future integration)
+# ──────────────────────────────────────────────
+# torch>=2.3.0
+# transformers>=4.41.0
+# trl>=0.9.0          # TRL PPO / GRPO trainer
+# datasets>=2.19.0    # Hugging Face Datasets
+# accelerate>=0.30.0  # Multi-GPU / mixed precision
+
+# ──────────────────────────────────────────────
+# Unsloth (Google Colab / fast fine-tuning)
+# ──────────────────────────────────────────────
+# unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
+# Install separately in Colab:
+#   !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+
+# ──────────────────────────────────────────────
+# Hugging Face Hub (model push / pull)
+# ──────────────────────────────────────────────
+# huggingface_hub>=0.23.0
|
FORGE-v4/rewards.py
ADDED
|
@@ -0,0 +1,116 @@
+# rewards.py
+# Reward functions for the Coder Agent and the Breaker Agent in FORGE-v4.
+
+from typing import Any
+from config import (
+    CODER_PASS_REWARD,
+    CODER_FAIL_PENALTY,
+    CODER_ERROR_PENALTY,
+    BREAKER_BREAK_REWARD,
+    BREAKER_FAIL_PENALTY,
+)
+
+
+def coder_reward(test_results: list[dict[str, Any]]) -> dict[str, Any]:
+    """
+    Compute the Coder agent's reward from sandbox test results.
+
+    Args:
+        test_results: list of result dicts from sandbox.run_code_against_tests().
+                      Each dict has a "status" key: "pass" | "fail" | "error" | "timeout".
+
+    Returns:
+        {
+            "total_reward": float,
+            "pass_count": int,
+            "fail_count": int,
+            "error_count": int,
+            "pass_rate": float,  # fraction of tests passed
+            "breakdown": list of per-test reward floats,
+        }
+    """
+    breakdown = []
+    pass_count = fail_count = error_count = 0
+
+    for r in test_results:
+        status = r.get("status", "error")
+        if status == "pass":
+            breakdown.append(CODER_PASS_REWARD)
+            pass_count += 1
+        elif status in ("error", "timeout"):
+            breakdown.append(CODER_ERROR_PENALTY)
+            error_count += 1
+        else:  # "fail"
+            breakdown.append(CODER_FAIL_PENALTY)
+            fail_count += 1
+
+    total = sum(breakdown)
+    n = len(test_results)
+    pass_rate = pass_count / n if n > 0 else 0.0
+
+    return {
+        "total_reward": round(total, 4),
+        "pass_count": pass_count,
+        "fail_count": fail_count,
+        "error_count": error_count,
+        "pass_rate": round(pass_rate, 4),
+        "breakdown": breakdown,
+    }
+
+
+def breaker_reward(
+    adversarial_results: list[dict[str, Any]],
+    coder_base_pass_rate: float,
+) -> dict[str, Any]:
+    """
+    Compute the Breaker agent's reward.
+
+    The Breaker earns credit for tests that break the coder (non-pass outcomes).
+    It is penalised for tests that the coder still passes, because those tests
+    are not adversarial enough.
+
+    Args:
+        adversarial_results: results when the coder's code is run against the
+                             Breaker's adversarial test cases.
+        coder_base_pass_rate: the coder's pass-rate on the standard hidden tests
+                              (used to scale the Breaker's reward: breaking a
+                              strong coder is worth more).
+
+    Returns:
+        {
+            "total_reward": float,
+            "breaks": int,  # number of tests that broke the coder
+            "passes": int,  # number of tests the coder still passed
+            "break_rate": float,
+            "breakdown": list of per-test reward floats,
+        }
+    """
+    breakdown = []
+    breaks = passes = 0
+
+    # A higher-quality coder means a bigger multiplier for breaking them
+    quality_multiplier = max(1.0, 1.0 + coder_base_pass_rate)
+
+    for r in adversarial_results:
+        status = r.get("status", "error")
+        if status != "pass":
+            # Breaker successfully broke the coder
+            reward = BREAKER_BREAK_REWARD * quality_multiplier
+            breakdown.append(round(reward, 4))
+            breaks += 1
+        else:
+            # Coder survived: penalise the Breaker
+            breakdown.append(BREAKER_FAIL_PENALTY)
+            passes += 1
+
+    total = sum(breakdown)
+    n = len(adversarial_results)
+    break_rate = breaks / n if n > 0 else 0.0
+
+    return {
+        "total_reward": round(total, 4),
+        "breaks": breaks,
+        "passes": passes,
+        "break_rate": round(break_rate, 4),
+        "breakdown": breakdown,
+    }
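A small worked example may make the reward arithmetic easier to follow. This is a standalone sketch, not the module itself. The five constant values are assumptions chosen for round numbers, since config.py is not part of this hunk. `coder_total` and `breaker_total` are hypothetical helpers that reproduce only the `total_reward` computation.

```python
# Assumed values for illustration only; the real ones live in config.py.
CODER_PASS_REWARD = 1.0
CODER_FAIL_PENALTY = -0.5
CODER_ERROR_PENALTY = -1.0
BREAKER_BREAK_REWARD = 1.0
BREAKER_FAIL_PENALTY = -0.5


def coder_total(statuses: list[str]) -> float:
    """Sum per-test rewards the same way coder_reward() does."""
    total = 0.0
    for status in statuses:
        if status == "pass":
            total += CODER_PASS_REWARD
        elif status in ("error", "timeout"):
            total += CODER_ERROR_PENALTY
        else:  # "fail"
            total += CODER_FAIL_PENALTY
    return round(total, 4)


def breaker_total(statuses: list[str], coder_base_pass_rate: float) -> float:
    """Sum per-test rewards the same way breaker_reward() does."""
    # The multiplier ranges from 1.0 (coder passes nothing) to 2.0 (coder passes all).
    multiplier = max(1.0, 1.0 + coder_base_pass_rate)
    total = 0.0
    for status in statuses:
        if status != "pass":
            total += round(BREAKER_BREAK_REWARD * multiplier, 4)
        else:
            total += BREAKER_FAIL_PENALTY
    return round(total, 4)


hidden = ["pass", "pass", "fail", "timeout"]
adversarial = ["pass", "fail", "error"]
base_pass_rate = hidden.count("pass") / len(hidden)  # 2 of 4 passed -> 0.5

print(coder_total(hidden))                         # 1.0 + 1.0 - 0.5 - 1.0 = 0.5
print(breaker_total(adversarial, base_pass_rate))  # -0.5 + 1.5 + 1.5 = 2.5
```

With these assumed constants, a timeout costs the Coder twice as much as a wrong answer, and every break against a coder with a 0.5 pass rate earns the Breaker 1.5 instead of 1.0.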
|
FORGE-v4/sandbox.py
ADDED
|
@@ -0,0 +1,108 @@
+# sandbox.py
+# Execute agent-generated Python code in a separate subprocess with a timeout.
+# This gives process isolation only; no filesystem or network limits are applied.
+# Returns structured pass/fail/error results with timeout handling.
+
+import subprocess
+import sys
+import textwrap
+import json
+import os
+from typing import Any
+from config import SANDBOX_TIMEOUT_SECONDS, SANDBOX_MAX_OUTPUT_CHARS
+
+
+def run_code(code: str, test_input: list[int]) -> dict[str, Any]:
+    """
+    Execute agent-generated code against a single test input.
+
+    Args:
+        code: Python source code that defines a `solution(arr)` function.
+        test_input: The integer list to pass to `solution`.
+
+    Returns:
+        {
+            "status": "pass" | "fail" | "error" | "timeout",
+            "output": the value returned by solution(arr), or None,
+            "expected": sorted(test_input),
+            "error_msg": exception string if status is "error", else "",
+        }
+    """
+    expected = sorted(test_input)
+
+    # Build a self-contained runner script. The harness is dedented on its own
+    # and the agent code is spliced in afterwards, so indentation inside `code`
+    # cannot change what textwrap.dedent strips.
+    harness = textwrap.dedent(f"""
+        test_input = {test_input!r}
+        expected = {expected!r}
+        try:
+            result = solution(test_input)
+            if result == expected:
+                print(json.dumps({{"status": "pass", "output": result, "expected": expected, "error_msg": ""}}))
+            else:
+                print(json.dumps({{"status": "fail", "output": result, "expected": expected, "error_msg": ""}}))
+        except Exception as exc:
+            print(json.dumps({{"status": "error", "output": None, "expected": expected, "error_msg": str(exc)}}))
+    """)
+    runner = f"import json, sys\n\n{code}\n{harness}"
+
+    try:
+        proc = subprocess.run(
+            [sys.executable, "-c", runner],
+            capture_output=True,
+            text=True,
+            timeout=SANDBOX_TIMEOUT_SECONDS,
+        )
+        raw = proc.stdout.strip()
+
+        # Truncate excessive output
+        if len(raw) > SANDBOX_MAX_OUTPUT_CHARS:
+            raw = raw[:SANDBOX_MAX_OUTPUT_CHARS]
+
+        if raw:
+            result = json.loads(raw)
+        else:
+            # No stdout: treat stderr as the error message
+            err = proc.stderr.strip()[:SANDBOX_MAX_OUTPUT_CHARS]
+            result = {
+                "status": "error",
+                "output": None,
+                "expected": expected,
+                "error_msg": err or "No output produced.",
+            }
+
+    except subprocess.TimeoutExpired:
+        result = {
+            "status": "timeout",
+            "output": None,
+            "expected": expected,
+            "error_msg": f"Code exceeded {SANDBOX_TIMEOUT_SECONDS}s timeout.",
+        }
+    except json.JSONDecodeError as exc:
+        result = {
+            "status": "error",
+            "output": None,
+            "expected": expected,
+            "error_msg": f"JSON decode error: {exc} raw='{raw}'",
+        }
+
+    return result
+
+
+def run_code_against_tests(code: str, tests: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    """
+    Run agent code against a list of test cases.
+
+    Args:
+        code: Python source defining `solution(arr)`.
+        tests: list of {"input": [...], "expected_output": [...]} dicts.
+
+    Returns:
+        List of result dicts, one per test.
+    """
+    results = []
+    for test in tests:
+        # run_code recomputes the expected value with sorted(); the test's
+        # "expected_output" field is not read here.
+        result = run_code(code, test["input"])
+        results.append(result)
+    return results
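One detail worth checking when agent code is templated into a runner script is how `textwrap.dedent` behaves. It strips only the whitespace common to every non-blank line, so multi-line agent code interpolated into an indented template can end up with inconsistent indentation. The sketch below illustrates the pitfall under that assumption about template layout. `build_runner`, `naive_runner`, and `compiles` are hypothetical helpers, and the harness is reduced to a single call.

```python
import textwrap

AGENT_CODE = "def solution(arr):\n    return sorted(arr)\n"

HARNESS_TEMPLATE = """
    result = solution([3, 1, 2])
"""


def build_runner(code: str, harness_template: str) -> str:
    """Dedent the harness alone, then prepend the untouched agent code."""
    harness = textwrap.dedent(harness_template)
    return f"{code}\n{harness}"


def naive_runner(code: str) -> str:
    """Interpolate the agent code into an indented template, then dedent everything."""
    return textwrap.dedent(f"""
        {code}
        result = solution([3, 1, 2])
    """)


def compiles(source: str) -> bool:
    """Return True if the source parses as a Python module."""
    try:
        compile(source, "<runner>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(compiles(build_runner(AGENT_CODE, HARNESS_TEMPLATE)))  # True
print(compiles(naive_runner(AGENT_CODE)))                    # False
```

In the naive variant only the first line of the agent code receives the template's indentation. The common margin then shrinks to that of the function body, and the dedented script starts with an unexpectedly indented `def`.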
|
FORGE-v4/tasks.py
ADDED
|
@@ -0,0 +1,102 @@
+# tasks.py
+# Generates integer array sorting tasks and hidden test cases for FORGE-v4.
+
+import random
+from typing import Any
+from config import MAX_ARRAY_SIZE, MIN_ARRAY_SIZE, ARRAY_VALUE_RANGE, NUM_HIDDEN_TESTS
+
+
+def generate_task() -> dict[str, Any]:
+    """
+    Generate a single sorting task.
+
+    Returns a dict with:
+      - prompt: natural-language task description
+      - public_example: one visible (input, expected_output) pair
+      - hidden_tests: list of (input, expected_output) pairs kept secret from agents
+    """
+    size = random.randint(MIN_ARRAY_SIZE, MAX_ARRAY_SIZE)
+    arr = [random.randint(*ARRAY_VALUE_RANGE) for _ in range(size)]
+
+    public_example = {
+        "input": arr,
+        "expected_output": sorted(arr),
+    }
+
+    hidden_tests = _generate_hidden_tests(NUM_HIDDEN_TESTS)
+
+    task = {
+        "prompt": (
+            "Write a Python function named `solution(arr)` that takes a list of integers "
+            "and returns a new list sorted in ascending order. "
+            "Do not use `arr.sort()` in-place; return a new sorted list.\n\n"
+            f"Example:\n Input: {arr}\n Output: {sorted(arr)}"
+        ),
+        "public_example": public_example,
+        "hidden_tests": hidden_tests,
+    }
+    return task
+
+
+def _generate_hidden_tests(n: int) -> list[dict[str, Any]]:
+    """
+    Generate n hidden test cases including edge-case variants.
+    """
+    tests = []
+
+    # Standard random arrays
+    for _ in range(n - 3):
+        size = random.randint(MIN_ARRAY_SIZE, MAX_ARRAY_SIZE)
+        arr = [random.randint(*ARRAY_VALUE_RANGE) for _ in range(size)]
+        tests.append({"input": arr, "expected_output": sorted(arr)})
+
+    # Edge case: already-sorted array
+    arr = sorted([random.randint(*ARRAY_VALUE_RANGE) for _ in range(5)])
+    tests.append({"input": arr, "expected_output": sorted(arr)})
+
+    # Edge case: reverse-sorted array
+    arr = sorted([random.randint(*ARRAY_VALUE_RANGE) for _ in range(5)], reverse=True)
+    tests.append({"input": arr, "expected_output": sorted(arr)})
+
+    # Edge case: single element
+    arr = [random.randint(*ARRAY_VALUE_RANGE)]
+    tests.append({"input": arr, "expected_output": sorted(arr)})
+
+    return tests
+
+
+def generate_breaker_task(original_task: dict[str, Any]) -> dict[str, Any]:
+    """
+    Given an existing task, produce adversarial test cases for the Breaker agent.
+
+    The Breaker is asked to produce arrays that are likely to break a naive solution.
+    Returns a dict with the adversarial prompt and a set of candidate adversarial arrays.
+    """
+    adversarial_candidates = [
+        # All identical elements
+        [0] * random.randint(3, 8),
+        # All negative values
+        [random.randint(-100, -1) for _ in range(random.randint(3, 8))],
+        # Large array
+        [random.randint(*ARRAY_VALUE_RANGE) for _ in range(MAX_ARRAY_SIZE)],
+        # Duplicate-heavy array
+        [random.choice([1, 2, 3]) for _ in range(random.randint(4, 10))],
+        # Mixed positive/negative with duplicates
+        [random.randint(-5, 5) for _ in range(random.randint(4, 12))],
+    ]
+
+    adversarial_tests = [
+        {"input": arr, "expected_output": sorted(arr)}
+        for arr in adversarial_candidates
+    ]
+
+    breaker_task = {
+        "prompt": (
+            "You are the Breaker agent. Generate adversarial integer arrays that are "
+            "likely to expose flaws in a naive sorting implementation. "
+            "Focus on edge cases: duplicates, negatives, large inputs, already-sorted, "
+            "reverse-sorted, and single-element arrays."
+        ),
+        "adversarial_tests": adversarial_tests,
+    }
+    return breaker_task
|
FORGE-v4/trainer.py
ADDED
|
@@ -0,0 +1,158 @@
+# trainer.py
+# Placeholder training loop hooks for FORGE-v4.
+# Ready for future TRL / Unsloth / Hugging Face integration.
+
+from typing import Any, Callable
+from env import FORGEEnv
+from memory import CoachMemory
+from config import MAX_EPISODES, STEPS_PER_EPISODE
+
+
+# ──────────────────────────────────────────────
+# Placeholder agent policy functions
+# ──────────────────────────────────────────────
+
+def default_coder_policy(state: dict[str, Any]) -> str:
+    """
+    Placeholder Coder policy.
+
+    In production this will call a fine-tuned LLM (e.g. via TRL/Unsloth) to
+    generate Python code from the task prompt.
+
+    Currently returns a trivial reference solution so the environment runs.
+    """
+    # TODO: Replace with LLM inference call
+    return "def solution(arr):\n    return sorted(arr)\n"
+
+
+def default_breaker_policy(state: dict[str, Any]) -> list[dict[str, Any]]:
+    """
+    Placeholder Breaker policy.
+
+    In production this will call a fine-tuned adversarial LLM to generate
+    adversarial test cases from the task prompt.
+
+    Currently returns a fixed set of edge-case test inputs.
+    """
+    # TODO: Replace with adversarial LLM inference call
+    return [
+        {"input": [], "expected_output": []},
+        {"input": [1], "expected_output": [1]},
+        {"input": [3, 1, 2], "expected_output": [1, 2, 3]},
+        {"input": [-5, -1, -3], "expected_output": [-5, -3, -1]},
+        {"input": [0, 0, 0, 0], "expected_output": [0, 0, 0, 0]},
+    ]
+
+
+# ──────────────────────────────────────────────
+# Core training loop
+# ──────────────────────────────────────────────
+
+def train(
+    coder_policy: Callable[[dict[str, Any]], str] = default_coder_policy,
+    breaker_policy: Callable[[dict[str, Any]], list[dict[str, Any]]] = default_breaker_policy,
+    num_episodes: int = MAX_EPISODES,
+    verbose: bool = True,
+) -> dict[str, Any]:
+    """
+    Run the FORGE-v4 training loop.
+
+    Args:
+        coder_policy: Callable(state) -> Python source string.
+        breaker_policy: Callable(state) -> list of test-case dicts.
+        num_episodes: Number of training episodes to run.
+        verbose: Print per-episode summaries when True.
+
+    Returns:
+        Training summary dict with per-episode reward histories.
+    """
+    memory = CoachMemory()
+    env = FORGEEnv(memory=memory)
+
+    episode_history: list[dict[str, Any]] = []
+
+    for ep in range(1, num_episodes + 1):
+        state = env.reset()
+        episode_coder_rewards = []
+        episode_breaker_rewards = []
+
+        for _ in range(STEPS_PER_EPISODE):
+            # ── Agent decisions ────────────────────────────
+            coder_code = coder_policy(state)
+            breaker_tests = breaker_policy(state)
+
+            action = {
+                "coder_code": coder_code,
+                "breaker_tests": breaker_tests,
+            }
+
+            # ── Environment step ───────────────────────────
+            result = env.step(action)
+            state = result["state"]
+
+            episode_coder_rewards.append(result["coder_reward"]["total_reward"])
+            episode_breaker_rewards.append(result["breaker_reward"]["total_reward"])
+
+            if result["done"]:
+                break
+
+        # ── Episode summary ────────────────────────────────
+        avg_cr = round(sum(episode_coder_rewards) / len(episode_coder_rewards), 4)
+        avg_br = round(sum(episode_breaker_rewards) / len(episode_breaker_rewards), 4)
+
+        ep_summary = {
+            "episode": ep,
+            "avg_coder_reward": avg_cr,
+            "avg_breaker_reward": avg_br,
+            "steps": env.step_count,
+        }
+        episode_history.append(ep_summary)
+
+        if verbose:
+            print(
+                f"[Episode {ep:>4}/{num_episodes}] "
+                f"Coder avg reward: {avg_cr:+.4f} | "
+                f"Breaker avg reward: {avg_br:+.4f}"
+            )
+
+        # ── TRL / Unsloth hook placeholders ────────────────
+        _on_episode_end(ep, ep_summary, memory)
+
+    training_summary = {
+        "total_episodes": num_episodes,
+        "episode_history": episode_history,
+        "memory_summary": memory.summary(),
+    }
+    return training_summary
+
+
+# ──────────────────────────────────────────────
+# Hook placeholders for future RL framework integration
+# ──────────────────────────────────────────────
+
+def _on_episode_end(
+    episode: int,
+    summary: dict[str, Any],
+    memory: CoachMemory,
+) -> None:
+    """
+    Called at the end of every episode.
+
+    TODO: Plug in TRL PPOTrainer / Unsloth model updates here.
+    E.g.:
+        trainer.step(queries, responses, rewards)
+        model.save_pretrained(f"models/checkpoint-ep{episode}")
+    """
+    pass  # placeholder
+
+
+def _on_step_end(
+    step: int,
+    result: dict[str, Any],
+) -> None:
+    """
+    Called after every environment step.
+
+    TODO: Plug in per-step reward logging (e.g. W&B, TensorBoard) here.
+    """
+    pass  # placeholder
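Since `train()` accepts any callable with the `coder_policy` signature, a replacement policy only has to return source text that defines `solution(arr)`. The sketch below shows one hypothetical policy that emits an insertion sort, plus a quick in-process check of the emitted source. `insertion_sort_policy` and `load_solution` are illustrative names, not part of the commit. The check uses `exec` directly, which is acceptable only for code you wrote yourself; the commit runs agent code in a subprocess instead.

```python
from typing import Any, Callable


def insertion_sort_policy(state: dict[str, Any]) -> str:
    """Hypothetical Coder policy: ignores the state and returns an insertion sort."""
    return (
        "def solution(arr):\n"
        "    out = []\n"
        "    for x in arr:\n"
        "        i = len(out)\n"
        "        while i > 0 and out[i - 1] > x:\n"
        "            i -= 1\n"
        "        out.insert(i, x)\n"
        "    return out\n"
    )


def load_solution(source: str) -> Callable[[list[int]], list[int]]:
    """Execute trusted source text and return the `solution` function it defines."""
    namespace: dict[str, Any] = {}
    exec(source, namespace)
    return namespace["solution"]


solution = load_solution(insertion_sort_policy({}))
original = [3, 1, 2]
print(solution(original))  # [1, 2, 3]
print(original)            # [3, 1, 2]; the input list is left unmodified
```

Under the signature shown in the diff, such a policy would be passed as `train(coder_policy=insertion_sort_policy)`. That call needs env.py and config.py from the same commit, so it is not exercised here.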
|
attached_assets/Pasted-Create-a-Python-project-named-FORGE-v4-Build-the-comple_1777105563327.txt
ADDED
|
@@ -0,0 +1,108 @@
+Create a Python project named **FORGE-v4**.
+
+Build the complete project skeleton with this exact structure:
+
+FORGE-v4/
+├── app.py
+├── env.py
+├── tasks.py
+├── rewards.py
+├── sandbox.py
+├── memory.py
+├── trainer.py
+├── config.py
+├── requirements.txt
+├── README.md
+├── data/
+├── logs/
+├── models/
+└── outputs/
+
+Project Purpose:
+FORGE-v4 is a hackathon project based on an OpenEnv-style reinforcement learning environment where:
+
+1. A Coder Agent writes Python code to solve integer array sorting tasks.
+2. A Breaker Agent creates adversarial test cases to break the solution.
+3. A Sandbox safely runs generated code.
+4. Rewards are assigned to both agents.
+5. Coach Memory stores lessons learned across episodes.
+
+Generate clean, modular starter code for all files.
+
+Required file responsibilities:
+
+1. app.py
+
+* Main runner script
+* Minimal CLI demo
+* Runs one sample episode
+
+2. env.py
+
+* Main environment class
+* Include methods:
+
+  * reset()
+  * step(action)
+  * get_state()
+
+3. tasks.py
+
+* Generate integer array sorting tasks
+* Sample hidden test cases
+
+4. rewards.py
+
+* Functions for coder_reward()
+* Functions for breaker_reward()
+
+5. sandbox.py
+
+* Safely execute Python-generated code
+* Include timeout handling
+* Return pass/fail/error info
+
+6. memory.py
+
+* Coach memory system
+* Store lessons learned in JSON/list format
+* Load/save memory helpers
+
+7. trainer.py
+
+* Placeholder training loop hooks
+* Future TRL / Unsloth integration ready
+
+8. config.py
+
+* Store constants such as:
+
+  * timeout seconds
+  * max array size
+  * tier thresholds
+  * reward weights
+
+9. requirements.txt
+Include useful starter dependencies only.
+
+10. README.md
+Professional first draft including:
+
+* Project overview
+* Architecture
+* File structure
+* How to run
+
+Important Rules:
+
+* Generate WORKING starter code, not empty files.
+* Use Python best practices.
+* Add comments throughout code.
+* Keep modular design.
+* Keep ready for future Google Colab training and Hugging Face deployment.
+* No frontend needed now.
+* Focus only on backend environment skeleton.
+
+After generating files, ensure the project runs successfully with:
+
+python app.py
|