ksanjuma1234 committed · Commit fc01d79 · 1 Parent(s): 8337c4b

Add a new adversarial code generation environment for reinforcement learning

Create the FORGE-v4 Python environment, including core components for agent interaction, code execution, reward calculation, and memory storage.

Replit-Commit-Author: Agent
Replit-Commit-Session-Id: a7518b1f-70c7-4487-82d2-42195935723e
Replit-Commit-Checkpoint-Type: full_checkpoint
Replit-Commit-Event-Id: dbf3c097-a076-4e9c-b916-ee3775367bd1
Replit-Helium-Checkpoint-Created: true

.replit CHANGED
@@ -1,4 +1,4 @@
- modules = ["nodejs-24"]
+ modules = ["nodejs-24", "python-3.11"]
 
  [deployment]
  router = "application"
@@ -18,3 +18,11 @@ expertMode = true
  [postMerge]
  path = "scripts/post-merge.sh"
  timeoutMs = 20000
+
+ [[ports]]
+ localPort = 8080
+ externalPort = 8080
+
+ [[ports]]
+ localPort = 8081
+ externalPort = 80
FORGE-v4/.gitignore ADDED
@@ -0,0 +1,37 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.pyd
+ .Python
+ *.egg-info/
+ dist/
+ build/
+ .eggs/
+ *.whl
+
+ # Virtual environments
+ venv/
+ .venv/
+ env/
+
+ # Data / runtime outputs (keep directories, ignore contents)
+ data/*.json
+ logs/*.log
+ logs/*.jsonl
+ models/*
+ !models/.gitkeep
+ outputs/*
+ !outputs/.gitkeep
+
+ # Jupyter / Colab
+ .ipynb_checkpoints/
+ *.ipynb
+
+ # IDE
+ .vscode/
+ .idea/
+
+ # OS
+ .DS_Store
+ Thumbs.db
FORGE-v4/README.md ADDED
@@ -0,0 +1,193 @@
+ # FORGE-v4
+
+ **Adversarial Code Generation Environment for Reinforcement Learning**
+
+ A hackathon project built on an **OpenEnv-style** reinforcement learning framework in which two competing agents, a Coder and a Breaker, are trained adversarially on Python sorting tasks.
+
+ ---
+
+ ## Overview
+
+ FORGE-v4 pits two agents against each other:
+
+ | Agent | Role |
+ |-------|------|
+ | **Coder** | Writes Python code to solve integer array sorting tasks |
+ | **Breaker** | Generates adversarial test cases to expose flaws in the Coder's solution |
+
+ In each episode, the Coder earns rewards for passing hidden tests and the Breaker earns rewards for breaking the Coder's solution. A **Coach Memory** module accumulates lessons learned across episodes to guide future training.
+
+ The skeleton is designed to be **drop-in ready for TRL / Unsloth fine-tuning** and **Hugging Face deployment**.
+
+ ---
+
+ ## Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────┐
+ │                FORGEEnv (env.py)                │
+ │                                                 │
+ │  ┌──────────────┐     ┌──────────────────┐      │
+ │  │ Coder Agent  │     │  Breaker Agent   │      │
+ │  │ (policy fn)  │     │   (policy fn)    │      │
+ │  └──────┬───────┘     └────────┬─────────┘      │
+ │         │ code (str)           │ test cases     │
+ │         ▼                      ▼                │
+ │  ┌──────────────────────────────────────────┐   │
+ │  │           Sandbox (sandbox.py)           │   │
+ │  │  subprocess · timeout · pass/fail/error  │   │
+ │  └──────────────────┬───────────────────────┘   │
+ │                     │ results                   │
+ │                     ▼                           │
+ │  ┌──────────────────────────────────────────┐   │
+ │  │           Rewards (rewards.py)           │   │
+ │  │    coder_reward() · breaker_reward()     │   │
+ │  └──────────────────┬───────────────────────┘   │
+ │                     │                           │
+ │                     ▼                           │
+ │  ┌──────────────────────────────────────────┐   │
+ │  │         Coach Memory (memory.py)         │   │
+ │  │    JSON-backed · lessons · summary()     │   │
+ │  └──────────────────────────────────────────┘   │
+ └─────────────────────────────────────────────────┘
+ ```
+
+ ---
+
+ ## File Structure
+
+ ```
+ FORGE-v4/
+ ├── app.py            # CLI entry point: runs one demo episode
+ ├── env.py            # FORGEEnv: reset() / step() / get_state()
+ ├── tasks.py          # Task generator + hidden test sampler
+ ├── rewards.py        # coder_reward() and breaker_reward()
+ ├── sandbox.py        # Safe subprocess code execution with timeout
+ ├── memory.py         # CoachMemory: JSON-backed lessons store
+ ├── trainer.py        # Training loop + TRL/Unsloth hook placeholders
+ ├── config.py         # All constants (timeout, rewards, tier thresholds)
+ ├── requirements.txt  # Dependencies
+ ├── README.md         # This file
+ ├── data/             # coach_memory.json (auto-created)
+ ├── logs/             # Episode logs
+ ├── models/           # Saved model checkpoints
+ └── outputs/          # Generated code outputs
+ ```
+
+ ---
+
+ ## How to Run
+
+ ### 1. Install dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ > **Note:** The core skeleton has minimal dependencies. ML packages (TRL, Unsloth, PyTorch) are commented out in `requirements.txt`; uncomment them when adding LLM training.
+
+ ### 2. Run a demo episode
+
+ ```bash
+ python app.py
+ ```
+
+ This runs a single episode with placeholder Coder and Breaker policies (the Coder always uses `sorted()`, the Breaker sends fixed edge cases). You should see per-step reward output and a coach memory summary.
+
+ ### 3. Optional: override step count
+
+ ```bash
+ python app.py --steps 3
+ ```
+
+ ---
+
+ ## Configuration
+
+ Edit `config.py` to adjust environment constants:
+
+ | Constant | Default | Description |
+ |----------|---------|-------------|
+ | `SANDBOX_TIMEOUT_SECONDS` | `5` | Max execution time per code run, in seconds |
+ | `MAX_ARRAY_SIZE` | `20` | Largest generated array |
+ | `NUM_HIDDEN_TESTS` | `5` | Hidden test cases per task |
+ | `CODER_PASS_REWARD` | `1.0` | Reward per passing test |
+ | `BREAKER_BREAK_REWARD` | `1.0` | Reward per test that breaks coder |
+ | `MAX_EPISODES` | `100` | Default training episode count |
+
+ ---
+
+ ## Extending with LLM Agents
+
+ Replace the placeholder policies in `trainer.py`:
+
+ ```python
+ # trainer.py
+ def my_coder_policy(state: dict) -> str:
+     prompt = state["task_prompt"]
+     # Call your LLM here (TRL model, OpenAI API, Unsloth, etc.) with `prompt`.
+     # Placeholder value so the snippet runs before an LLM is wired in:
+     generated_code = "def solution(arr):\n    return sorted(arr)\n"
+     return generated_code
+
+ def my_breaker_policy(state: dict) -> list[dict]:
+     prompt = state["task_prompt"]
+     # Call your adversarial LLM here with `prompt`.
+     # Placeholder arrays so the snippet runs before an LLM is wired in:
+     generated_arrays = [[], [1], [2, 2, 1], [-5, 0, 5]]
+     return [{"input": arr} for arr in generated_arrays]
+ ```
+
+ Then run:
+
+ ```python
+ from trainer import train
+ summary = train(
+     coder_policy=my_coder_policy,
+     breaker_policy=my_breaker_policy,
+     num_episodes=50,
+ )
+ ```
+
+ ---
+
+ ## TRL / Unsloth Integration (Future)
+
+ Hook points are prepared in `trainer.py`:
+
+ - `_on_episode_end()`: plug in `PPOTrainer.step()` or `GRPOTrainer` updates
+ - `_on_step_end()`: plug in per-step reward logging (W&B, TensorBoard)
+
+ ```python
+ # Example (uncomment in trainer.py after installing TRL):
+ # from trl import PPOTrainer, PPOConfig
+ # trainer = PPOTrainer(config=PPOConfig(...), model=model, ...)
+ # trainer.step(queries, responses, rewards)
+ ```
+
+ ---
+
+ ## Google Colab
+
+ 1. Clone or upload the project to Colab.
+ 2. Install Unsloth:
+    ```
+    !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+    ```
+ 3. Mount Drive and set `MEMORY_FILE` / `MODELS_DIR` in `config.py` to paths under `/content/drive/MyDrive/`.
+ 4. Run `python app.py` or import and call `train()` directly.
+
+ ---
+
+ ## Hugging Face Deployment
+
+ After training, push your model with:
+
+ ```python
+ model.push_to_hub("your-username/forge-v4-coder")
+ tokenizer.push_to_hub("your-username/forge-v4-coder")
+ ```
+
+ The repo structure (`models/`, `outputs/`) maps directly to HF Hub conventions.
+
+ ---
+
+ ## License
+
+ MIT
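
The README describes the demo's placeholder policies only in words (the Coder always uses `sorted()`, the Breaker sends fixed edge cases). The sketch below shows what such a pair might look like under the action format documented in `env.py`. The names `sketch_coder_policy` and `sketch_breaker_policy` and the specific edge cases are assumptions for illustration; the actual defaults live in `trainer.py`, which is not shown here.

```python
# Hypothetical stand-ins for the placeholder policies described in the README.
# The real defaults are defined in trainer.py and may differ.


def sketch_coder_policy(state: dict) -> str:
    """Return Python source defining solution(arr); the state is ignored."""
    return "def solution(arr):\n    return sorted(arr)\n"


def sketch_breaker_policy(state: dict) -> list[dict]:
    """Return a fixed batch of edge-case inputs as {"input": [...]} dicts."""
    edge_cases = [[], [7], [2, 2, 2], [3, 2, 1], [-1, 0, -1]]
    return [{"input": arr} for arr in edge_cases]


if __name__ == "__main__":
    state = {"task_prompt": "Sort the array in ascending order."}
    namespace: dict = {}
    # Define solution() in a scratch namespace, then try it on each test input.
    exec(sketch_coder_policy(state), namespace)
    for test in sketch_breaker_policy(state):
        print(test["input"], "->", namespace["solution"](test["input"]))
```

Both functions take the state dict and return exactly the shapes the environment's `step()` expects under `"coder_code"` and `"breaker_tests"`.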
FORGE-v4/app.py ADDED
@@ -0,0 +1,95 @@
+ # app.py
+ # Main runner script for FORGE-v4.
+ # Runs a minimal CLI demo of one sample episode.
+
+ import sys
+ import json
+ from env import FORGEEnv
+ from memory import CoachMemory
+ from trainer import default_coder_policy, default_breaker_policy
+ from config import STEPS_PER_EPISODE
+
+
+ def run_demo_episode() -> None:
+     """
+     Execute a single demo episode and print the results to stdout.
+     """
+     print("=" * 60)
+     print(" FORGE-v4 | Adversarial Code Generation Environment")
+     print("=" * 60)
+
+     # Initialise coach memory and environment
+     memory = CoachMemory()
+     env = FORGEEnv(memory=memory)
+
+     # Reset to start the episode
+     state = env.reset()
+
+     print(f"\n[Episode {state['episode']}] Task prompt:\n")
+     print(state["task_prompt"])
+     print()
+
+     for step in range(1, STEPS_PER_EPISODE + 1):
+         print(f"── Step {step}/{STEPS_PER_EPISODE} " + "─" * 40)
+
+         # Agents produce their actions (placeholder policies for the demo)
+         coder_code = default_coder_policy(state)
+         breaker_tests = default_breaker_policy(state)
+
+         action = {
+             "coder_code": coder_code,
+             "breaker_tests": breaker_tests,
+         }
+
+         result = env.step(action)
+
+         cr = result["coder_reward"]
+         br = result["breaker_reward"]
+
+         print(
+             f" Coder → pass_rate: {cr['pass_rate']:.2f} "
+             f"| passes: {cr['pass_count']} "
+             f"| fails: {cr['fail_count']} "
+             f"| errors: {cr['error_count']} "
+             f"| reward: {cr['total_reward']:+.2f}"
+         )
+         print(
+             f" Breaker → break_rate: {br['break_rate']:.2f} "
+             f"| breaks: {br['breaks']} "
+             f"| passes: {br['passes']} "
+             f"| reward: {br['total_reward']:+.2f}"
+         )
+
+         if result["done"]:
+             break
+
+     print("\n" + "=" * 60)
+     print(" Episode complete. Coach memory summary:")
+     print(json.dumps(memory.summary(), indent=2))
+     print("=" * 60)
+
+
+ def main() -> None:
+     """Entry point: parse minimal CLI args and run."""
+     global STEPS_PER_EPISODE
+     args = sys.argv[1:]
+
+     if "--help" in args or "-h" in args:
+         print("Usage: python app.py [--steps N]")
+         print("  --steps N   Override STEPS_PER_EPISODE for this run (default: from config.py)")
+         sys.exit(0)
+
+     # Optional: override step count via CLI
+     if "--steps" in args:
+         idx = args.index("--steps")
+         try:
+             steps = int(args[idx + 1])
+         except (IndexError, ValueError):
+             print("Error: --steps requires an integer argument.")
+             sys.exit(1)
+         # `from config import STEPS_PER_EPISODE` copies the value at import
+         # time, so rebind the name in every module that already imported it.
+         import config
+         import env
+         config.STEPS_PER_EPISODE = env.STEPS_PER_EPISODE = STEPS_PER_EPISODE = steps
+
+     run_demo_episode()
+
+
+ if __name__ == "__main__":
+     main()
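
The hand-rolled `--steps` handling above could also be expressed with the standard-library `argparse` module, which generates `--help` text and integer validation automatically. The following is a minimal sketch rather than part of the commit; `parse_steps` and its `default_steps` parameter are hypothetical names introduced for illustration.

```python
import argparse


def parse_steps(argv: list[str], default_steps: int) -> int:
    """Parse an optional --steps N flag, falling back to default_steps."""
    parser = argparse.ArgumentParser(prog="app.py")
    parser.add_argument(
        "--steps",
        type=int,
        default=default_steps,
        help="number of steps to run in the demo episode",
    )
    args = parser.parse_args(argv)
    if args.steps < 1:
        # parser.error() prints a usage message and exits with status 2.
        parser.error("--steps must be a positive integer")
    return args.steps


if __name__ == "__main__":
    print(parse_steps(["--steps", "3"], default_steps=10))
    print(parse_steps([], default_steps=10))
```

Returning the parsed value and passing it explicitly to the episode runner avoids depending on module-level constants that were bound at import time.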
FORGE-v4/config.py ADDED
@@ -0,0 +1,52 @@
+ # config.py
+ # Central configuration constants for FORGE-v4
+
+ # ──────────────────────────────────────────────
+ # Sandbox settings
+ # ──────────────────────────────────────────────
+ SANDBOX_TIMEOUT_SECONDS = 5       # Max time allowed for code execution
+ SANDBOX_MAX_OUTPUT_CHARS = 4096   # Truncate stdout/stderr beyond this length
+
+ # ──────────────────────────────────────────────
+ # Task / environment settings
+ # ──────────────────────────────────────────────
+ MAX_ARRAY_SIZE = 20               # Max length of generated integer arrays
+ MIN_ARRAY_SIZE = 3                # Min length of generated integer arrays
+ ARRAY_VALUE_RANGE = (-100, 100)   # (min, max) integers in generated arrays
+ NUM_HIDDEN_TESTS = 5              # Number of hidden test cases per task
+
+ # ──────────────────────────────────────────────
+ # Reward settings
+ # ──────────────────────────────────────────────
+ # Coder reward weights
+ CODER_PASS_REWARD = 1.0           # Reward per passing hidden test
+ CODER_FAIL_PENALTY = -0.5         # Penalty per failing hidden test
+ CODER_ERROR_PENALTY = -1.0        # Penalty when code raises an error
+
+ # Breaker reward weights
+ BREAKER_BREAK_REWARD = 1.0        # Reward when breaker's test breaks coder
+ BREAKER_FAIL_PENALTY = -0.3       # Penalty when breaker's test does NOT break coder
+
+ # ──────────────────────────────────────────────
+ # Tier thresholds (coder skill levels)
+ # ──────────────────────────────────────────────
+ TIER_THRESHOLDS = {
+     "novice": (0.0, 0.4),         # pass-rate range [low, high)
+     "intermediate": (0.4, 0.7),
+     "advanced": (0.7, 0.9),
+     "expert": (0.9, 1.01),
+ }
+
+ # ──────────────────────────────────────────────
+ # Memory / logging
+ # ──────────────────────────────────────────────
+ MEMORY_FILE = "data/coach_memory.json"   # Persistent memory path
+ LOG_DIR = "logs/"                         # Directory for episode logs
+ MODELS_DIR = "models/"                    # Saved model checkpoints
+ OUTPUTS_DIR = "outputs/"                  # Generated code outputs
+
+ # ──────────────────────────────────────────────
+ # Training placeholders
+ # ──────────────────────────────────────────────
+ MAX_EPISODES = 100                # Default training episode count
+ STEPS_PER_EPISODE = 10            # Steps per episode
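
`TIER_THRESHOLDS` stores half-open `[low, high)` ranges, and the `expert` upper bound of `1.01` exists so that a perfect pass rate of `1.0` still falls inside a range. None of the files shown here consume the table, so the lookup below is only a sketch of how it could be read; `tier_for_pass_rate` is a hypothetical helper, and the table is copied inline to keep the example self-contained.

```python
# Copied from config.py so the example runs on its own.
TIER_THRESHOLDS = {
    "novice": (0.0, 0.4),
    "intermediate": (0.4, 0.7),
    "advanced": (0.7, 0.9),
    "expert": (0.9, 1.01),
}


def tier_for_pass_rate(pass_rate: float) -> str:
    """Return the tier whose half-open range [low, high) contains pass_rate."""
    for name, (low, high) in TIER_THRESHOLDS.items():
        if low <= pass_rate < high:
            return name
    raise ValueError(f"pass_rate {pass_rate!r} is outside every tier range")


if __name__ == "__main__":
    for rate in (0.0, 0.4, 0.85, 1.0):
        print(rate, tier_for_pass_rate(rate))
```

Because each range excludes its upper bound, a boundary value such as `0.4` belongs to the higher tier (`intermediate`), not the lower one.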
FORGE-v4/env.py ADDED
@@ -0,0 +1,175 @@
+ # env.py
+ # Main OpenEnv-style reinforcement learning environment for FORGE-v4.
+ # Manages the interaction between the Coder Agent, Breaker Agent, and Sandbox.
+
+ from typing import Any
+ from tasks import generate_task, generate_breaker_task
+ from sandbox import run_code_against_tests
+ from rewards import coder_reward, breaker_reward
+ from memory import CoachMemory
+ from config import STEPS_PER_EPISODE
+
+
+ class FORGEEnv:
+     """
+     Two-agent adversarial environment for code generation tasks.
+
+     Agents:
+       - Coder: writes Python code to solve array-sorting tasks.
+       - Breaker: generates adversarial test cases to break the Coder's solution.
+
+     Episode flow:
+       1. reset() → returns the initial task state
+       2. step(action) × STEPS_PER_EPISODE steps
+       3. Rewards assigned to both agents at each step
+
+     Action format:
+       {
+         "coder_code": str | None,      # Python source defining solution(arr)
+         "breaker_tests": list | None,  # List of {"input": [...]} dicts
+       }
+     """
+
+     def __init__(self, memory: CoachMemory | None = None):
+         self.memory = memory or CoachMemory()
+         self.episode: int = 0
+         self.step_count: int = 0
+         self.current_task: dict[str, Any] = {}
+         self.done: bool = True
+         self._last_coder_code: str = ""
+         self._last_coder_pass_rate: float = 0.0
+
+     # ──────────────────────────────────────────────
+     # Core env methods
+     # ──────────────────────────────────────────────
+
+     def reset(self) -> dict[str, Any]:
+         """
+         Start a new episode.
+
+         Returns:
+             Initial state dict containing the task prompt and public example.
+         """
+         self.episode += 1
+         self.step_count = 0
+         self.done = False
+         self._last_coder_code = ""
+         self._last_coder_pass_rate = 0.0
+
+         self.current_task = generate_task()
+
+         state = self.get_state()
+         return state
+
+     def step(self, action: dict[str, Any]) -> dict[str, Any]:
+         """
+         Advance the environment by one step.
+
+         Args:
+             action: dict with optional keys:
+                 "coder_code": Python source defining solution(arr)
+                 "breaker_tests": list of {"input": [...]} dicts
+
+         Returns:
+             {
+                 "state": current env state,
+                 "coder_reward": coder reward info dict,
+                 "breaker_reward": breaker reward info dict,
+                 "done": bool (True when episode ends),
+                 "info": extra diagnostics,
+             }
+         """
+         if self.done:
+             raise RuntimeError("Episode is done. Call reset() before step().")
+
+         self.step_count += 1
+         coder_code = action.get("coder_code", "")
+         breaker_tests = action.get("breaker_tests", [])
+
+         # ── Evaluate Coder ────────────────────────────────────────────────
+         coder_info = self._evaluate_coder(coder_code)
+
+         # ── Evaluate Breaker ──────────────────────────────────────────────
+         breaker_info = self._evaluate_breaker(coder_code, breaker_tests, coder_info)
+
+         # ── Log to Coach Memory ───────────────────────────────────────────
+         self.memory.add_lesson(
+             episode=self.episode,
+             agent="env",
+             observation=(
+                 f"Step {self.step_count}: "
+                 f"coder pass_rate={coder_info['pass_rate']:.2f}, "
+                 f"breaker break_rate={breaker_info['break_rate']:.2f}"
+             ),
+             coder_reward=coder_info["total_reward"],
+             breaker_reward=breaker_info["total_reward"],
+             extra={
+                 "step": self.step_count,
+                 "coder_pass_rate": coder_info["pass_rate"],
+                 "breaker_break_rate": breaker_info["break_rate"],
+             },
+         )
+
+         # ── Check done ────────────────────────────────────────────────────
+         if self.step_count >= STEPS_PER_EPISODE:
+             self.done = True
+
+         return {
+             "state": self.get_state(),
+             "coder_reward": coder_info,
+             "breaker_reward": breaker_info,
+             "done": self.done,
+             "info": {
+                 "episode": self.episode,
+                 "step": self.step_count,
+             },
+         }
+
+     def get_state(self) -> dict[str, Any]:
+         """
+         Return the current observable state of the environment.
+         """
+         return {
+             "episode": self.episode,
+             "step": self.step_count,
+             "done": self.done,
+             "task_prompt": self.current_task.get("prompt", ""),
+             "public_example": self.current_task.get("public_example", {}),
+             "last_pass_rate": self._last_coder_pass_rate,
+         }
+
+     # ──────────────────────────────────────────────
+     # Private helpers
+     # ──────────────────────────────────────────────
+
+     def _evaluate_coder(self, code: str) -> dict[str, Any]:
+         """Run the coder's code against hidden tests and compute reward."""
+         hidden_tests = self.current_task.get("hidden_tests", [])
+
+         if not code or not hidden_tests:
+             # No code submitted (or no hidden tests available): score every
+             # test as an error, which is the maximum penalty.
+             dummy_results = [{"status": "error"} for _ in hidden_tests or [{}]]
+             info = coder_reward(dummy_results)
+         else:
+             results = run_code_against_tests(code, hidden_tests)
+             info = coder_reward(results)
+
+         # Cache for Breaker quality multiplier
+         self._last_coder_code = code
+         self._last_coder_pass_rate = info["pass_rate"]
+         return info
+
+     def _evaluate_breaker(
+         self,
+         coder_code: str,
+         breaker_tests: list[dict[str, Any]],
+         coder_info: dict[str, Any],
+     ) -> dict[str, Any]:
+         """Run the coder's code against the breaker's adversarial tests."""
+         if not coder_code or not breaker_tests:
+             # No submission from one of the agents: score every test as a
+             # pass, so the Breaker earns no break credit.
+             dummy = [{"status": "pass"} for _ in breaker_tests or [{}]]
+             return breaker_reward(dummy, coder_base_pass_rate=coder_info["pass_rate"])
+
+         results = run_code_against_tests(coder_code, breaker_tests)
+         return breaker_reward(results, coder_base_pass_rate=coder_info["pass_rate"])
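
The `reset()` / `step()` contract above (a `RuntimeError` when stepping a finished episode, and `done` turning true once the step budget is reached) can be exercised without the sandbox or task generator. The stub below is a sketch under that assumption: `MiniEnv` and `run_episode` are hypothetical names, and the stub reproduces only the episode bookkeeping, not rewards or memory logging.

```python
class MiniEnv:
    """Hypothetical stand-in that mimics only the reset()/step() bookkeeping."""

    def __init__(self, steps_per_episode: int = 3):
        self.steps_per_episode = steps_per_episode
        self.step_count = 0
        self.done = True  # step() is invalid until reset() has been called

    def reset(self) -> dict:
        self.step_count = 0
        self.done = False
        return {"step": self.step_count, "done": self.done}

    def step(self, action: dict) -> dict:
        if self.done:
            raise RuntimeError("Episode is done. Call reset() before step().")
        self.step_count += 1
        self.done = self.step_count >= self.steps_per_episode
        return {"state": {"step": self.step_count}, "done": self.done}


def run_episode(env: MiniEnv) -> int:
    """Drive one episode to completion and return the number of steps taken."""
    env.reset()
    steps = 0
    while True:
        result = env.step({"coder_code": None, "breaker_tests": None})
        steps += 1
        if result["done"]:
            return steps


if __name__ == "__main__":
    print(run_episode(MiniEnv(steps_per_episode=3)))
```

A driver loop written against `result["done"]` rather than a fixed `range()` stays correct even if the environment's step budget changes.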
FORGE-v4/logs/.gitkeep ADDED
File without changes
FORGE-v4/memory.py ADDED
@@ -0,0 +1,135 @@
+ # memory.py
+ # Coach Memory system for FORGE-v4.
+ # Stores lessons learned across episodes in a JSON file.
+
+ import json
+ import os
+ from datetime import datetime
+ from typing import Any
+ from config import MEMORY_FILE
+
+
+ class CoachMemory:
+     """
+     Persistent memory that accumulates lessons learned across training episodes.
+
+     Lessons are stored as a list of dicts in a JSON file and loaded on startup.
+     """
+
+     def __init__(self, filepath: str = MEMORY_FILE):
+         self.filepath = filepath
+         self.lessons: list[dict[str, Any]] = []
+         self._ensure_data_dir()
+         self.load()
+
+     # ──────────────────────────────────────────────
+     # Public API
+     # ──────────────────────────────────────────────
+
+     def add_lesson(
+         self,
+         episode: int,
+         agent: str,
+         observation: str,
+         coder_reward: float,
+         breaker_reward: float,
+         extra: dict[str, Any] | None = None,
+     ) -> None:
+         """
+         Record a lesson from one episode step.
+
+         Args:
+             episode: Episode index.
+             agent: "coder" | "breaker" | "env".
+             observation: Human-readable description of what happened.
+             coder_reward: Total coder reward for this step.
+             breaker_reward: Total breaker reward for this step.
+             extra: Optional additional metadata.
+         """
+         lesson = {
+             "timestamp": datetime.utcnow().isoformat(),
+             "episode": episode,
+             "agent": agent,
+             "observation": observation,
+             "coder_reward": coder_reward,
+             "breaker_reward": breaker_reward,
+         }
+         if extra:
+             lesson["extra"] = extra
+
+         self.lessons.append(lesson)
+         self.save()
+
+     def get_lessons(self, agent: str | None = None, last_n: int | None = None) -> list[dict[str, Any]]:
+         """
+         Retrieve stored lessons, optionally filtered by agent and/or limited to the last N.
+
+         Args:
+             agent: Filter to a specific agent ("coder", "breaker", "env"), or None for all.
+             last_n: Return only the last N lessons if provided.
+
+         Returns:
+             List of lesson dicts.
+         """
+         result = self.lessons
+         if agent is not None:
+             result = [lesson for lesson in result if lesson.get("agent") == agent]
+         if last_n is not None:
+             # result[-0:] would return the whole list, so handle N <= 0 explicitly.
+             result = result[-last_n:] if last_n > 0 else []
+         return result
+
+     def summary(self) -> dict[str, Any]:
+         """
+         Return a high-level summary of stored lessons.
+         """
+         if not self.lessons:
+             return {"total_lessons": 0, "episodes_seen": 0}
+
+         episodes = {lesson["episode"] for lesson in self.lessons}
+         coder_rewards = [lesson["coder_reward"] for lesson in self.lessons]
+         breaker_rewards = [lesson["breaker_reward"] for lesson in self.lessons]
+
+         return {
+             "total_lessons": len(self.lessons),
+             "episodes_seen": len(episodes),
+             "avg_coder_reward": round(sum(coder_rewards) / len(coder_rewards), 4),
+             "avg_breaker_reward": round(sum(breaker_rewards) / len(breaker_rewards), 4),
+         }
+
+     def clear(self) -> None:
+         """
+         Wipe all stored lessons (use with caution).
+         """
+         self.lessons = []
+         self.save()
+
+     # ──────────────────────────────────────────────
+     # Persistence helpers
+     # ──────────────────────────────────────────────
+
+     def save(self) -> None:
+         """Persist lessons to JSON file."""
+         with open(self.filepath, "w", encoding="utf-8") as f:
+             json.dump(self.lessons, f, indent=2)
+
+     def load(self) -> None:
+         """Load lessons from JSON file if it exists."""
+         if os.path.exists(self.filepath):
+             try:
+                 with open(self.filepath, "r", encoding="utf-8") as f:
+                     self.lessons = json.load(f)
+             except (json.JSONDecodeError, IOError):
+                 # Start fresh if file is corrupted
+                 self.lessons = []
+         else:
+             self.lessons = []
+
+     # ──────────────────────────────────────────────
+     # Internal helpers
+     # ──────────────────────────────────────────────
+
+     def _ensure_data_dir(self) -> None:
+         """Create the directory for the memory file if it doesn't exist."""
+         directory = os.path.dirname(self.filepath)
+         if directory:
+             os.makedirs(directory, exist_ok=True)
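
`get_lessons` limits its result to the last N entries with a negative slice. One Python detail matters for that pattern: `-0` is just `0`, so `items[-0:]` is the whole list rather than an empty one. The sketch below contrasts a bare slice with a guarded one; `last_n_naive` and `last_n_guarded` are hypothetical helper names used only for this illustration.

```python
def last_n_naive(items: list, n: int) -> list:
    """Bare negative slice: n == 0 yields the whole list, not an empty one."""
    return items[-n:]


def last_n_guarded(items: list, n: int) -> list:
    """Return the last n items, treating n <= 0 as a request for nothing."""
    return items[-n:] if n > 0 else []


if __name__ == "__main__":
    lessons = ["a", "b", "c"]
    print(last_n_naive(lessons, 0))
    print(last_n_guarded(lessons, 0))
    print(last_n_guarded(lessons, 2))
```

Asking for more items than exist is safe in both versions, because slicing clamps to the list bounds.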
FORGE-v4/models/.gitkeep ADDED
File without changes
FORGE-v4/outputs/.gitkeep ADDED
File without changes
FORGE-v4/requirements.txt ADDED
@@ -0,0 +1,40 @@
+ # FORGE-v4 requirements
+ # Core environment: no heavy ML deps needed to run the skeleton
+ # Uncomment TRL / Unsloth blocks when adding LLM training.
+
+ # ──────────────────────────────────────────────
+ # Standard library extensions
+ # ──────────────────────────────────────────────
+ tqdm>=4.66.0 # Progress bars for training loops
+
+ # ──────────────────────────────────────────────
+ # Data / logging utilities
+ # ──────────────────────────────────────────────
+ numpy>=1.26.0 # Array math utilities
+ pandas>=2.2.0 # Episode log analysis
+
+ # ──────────────────────────────────────────────
+ # Experiment tracking (optional but recommended)
+ # ──────────────────────────────────────────────
+ # wandb>=0.17.0 # Weights & Biases; uncomment to enable
+
+ # ──────────────────────────────────────────────
+ # LLM / RL training (future integration)
+ # ──────────────────────────────────────────────
+ # torch>=2.3.0
+ # transformers>=4.41.0
+ # trl>=0.9.0 # TRL PPO / GRPO trainer
+ # datasets>=2.19.0 # Hugging Face Datasets
+ # accelerate>=0.30.0 # Multi-GPU / mixed precision
+
+ # ──────────────────────────────────────────────
+ # Unsloth (Google Colab / fast fine-tuning)
+ # ──────────────────────────────────────────────
+ # unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
+ # Install separately in Colab:
+ # !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+
+ # ──────────────────────────────────────────────
+ # Hugging Face Hub (model push / pull)
+ # ──────────────────────────────────────────────
+ # huggingface_hub>=0.23.0
FORGE-v4/rewards.py ADDED
@@ -0,0 +1,116 @@
+ # rewards.py
+ # Reward functions for the Coder Agent and the Breaker Agent in FORGE-v4.
+
+ from typing import Any
+ from config import (
+     CODER_PASS_REWARD,
+     CODER_FAIL_PENALTY,
+     CODER_ERROR_PENALTY,
+     BREAKER_BREAK_REWARD,
+     BREAKER_FAIL_PENALTY,
+ )
+
+
+ def coder_reward(test_results: list[dict[str, Any]]) -> dict[str, Any]:
+     """
+     Compute the Coder agent's reward from sandbox test results.
+
+     Args:
+         test_results: list of result dicts from sandbox.run_code_against_tests().
+             Each dict has a "status" key: "pass" | "fail" | "error" | "timeout".
+
+     Returns:
+         {
+             "total_reward": float,
+             "pass_count": int,
+             "fail_count": int,
+             "error_count": int,
+             "pass_rate": float,  # fraction of tests passed
+             "breakdown": list of per-test reward floats,
+         }
+     """
+     breakdown = []
+     pass_count = fail_count = error_count = 0
+
+     for r in test_results:
+         status = r.get("status", "error")
+         if status == "pass":
+             breakdown.append(CODER_PASS_REWARD)
+             pass_count += 1
+         elif status in ("error", "timeout"):
+             breakdown.append(CODER_ERROR_PENALTY)
+             error_count += 1
+         else:  # "fail"
+             breakdown.append(CODER_FAIL_PENALTY)
+             fail_count += 1
+
+     total = sum(breakdown)
+     n = len(test_results)
+     pass_rate = pass_count / n if n > 0 else 0.0
+
+     return {
+         "total_reward": round(total, 4),
+         "pass_count": pass_count,
+         "fail_count": fail_count,
+         "error_count": error_count,
+         "pass_rate": round(pass_rate, 4),
+         "breakdown": breakdown,
+     }
+
+
+ def breaker_reward(
+     adversarial_results: list[dict[str, Any]],
+     coder_base_pass_rate: float,
+ ) -> dict[str, Any]:
+     """
+     Compute the Breaker agent's reward.
+
+     The Breaker earns credit for tests that break the coder (non-pass outcomes).
+     It is penalised for tests that the coder still passes, because those tests
+     are not adversarial enough.
+
+     Args:
+         adversarial_results: results when the coder's code is run against the
+             Breaker's adversarial test cases.
+         coder_base_pass_rate: the coder's pass-rate on the standard hidden tests
+             (used to scale the Breaker's reward: breaking a
+             strong coder is worth more).
+
+     Returns:
+         {
+             "total_reward": float,
+             "breaks": int,  # number of tests that broke the coder
+             "passes": int,  # number of tests the coder still passed
+             "break_rate": float,
+             "breakdown": list of per-test reward floats,
+         }
+     """
+     breakdown = []
+     breaks = passes = 0
+
+     # A higher-quality coder means a bigger multiplier for breaking them
+     quality_multiplier = max(1.0, 1.0 + coder_base_pass_rate)
+
+     for r in adversarial_results:
+         status = r.get("status", "error")
+         if status != "pass":
+             # Breaker successfully broke the coder
+             reward = BREAKER_BREAK_REWARD * quality_multiplier
+             breakdown.append(round(reward, 4))
+             breaks += 1
+         else:
+             # Coder survived: penalise the Breaker
+             breakdown.append(BREAKER_FAIL_PENALTY)
+             passes += 1
+
+     total = sum(breakdown)
+     n = len(adversarial_results)
+     break_rate = breaks / n if n > 0 else 0.0
+
+     return {
+         "total_reward": round(total, 4),
+         "breaks": breaks,
+         "passes": passes,
+         "break_rate": round(break_rate, 4),
+         "breakdown": breakdown,
+     }
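
To make the reward arithmetic concrete, the sketch below recomputes both totals for one hypothetical step using the default constants from `config.py` (copied inline). The helper names `coder_total` and `breaker_total` and the example counts are assumptions for illustration; they condense the per-test loops above into closed-form sums.

```python
# Default reward constants, copied from config.py so the example is standalone.
CODER_PASS_REWARD = 1.0
CODER_FAIL_PENALTY = -0.5
CODER_ERROR_PENALTY = -1.0
BREAKER_BREAK_REWARD = 1.0
BREAKER_FAIL_PENALTY = -0.3


def coder_total(passes: int, fails: int, errors: int) -> float:
    """Sum of per-test coder rewards for the given outcome counts."""
    return (
        passes * CODER_PASS_REWARD
        + fails * CODER_FAIL_PENALTY
        + errors * CODER_ERROR_PENALTY
    )


def breaker_total(breaks: int, passes: int, coder_pass_rate: float) -> float:
    """Sum of per-test breaker rewards, scaled by the coder's base pass rate."""
    quality_multiplier = max(1.0, 1.0 + coder_pass_rate)
    return round(
        breaks * BREAKER_BREAK_REWARD * quality_multiplier
        + passes * BREAKER_FAIL_PENALTY,
        4,
    )


if __name__ == "__main__":
    # Coder: 3 passes, 1 fail, 1 error on 5 hidden tests (pass rate 0.6).
    print(coder_total(passes=3, fails=1, errors=1))
    # Breaker: 2 breaks and 1 surviving test against that coder.
    print(breaker_total(breaks=2, passes=1, coder_pass_rate=0.6))
```

With these counts the coder total is 3(1.0) + 1(-0.5) + 1(-1.0) = 1.5, and the breaker total is 2(1.0)(1.6) + 1(-0.3) = 2.9, since a pass rate of 0.6 gives a quality multiplier of 1.6.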
FORGE-v4/sandbox.py ADDED
@@ -0,0 +1,108 @@
+ # sandbox.py
+ # Safely execute agent-generated Python code in a restricted subprocess.
+ # Returns structured pass/fail/error results with timeout handling.
+
+ import subprocess
+ import sys
+ import textwrap
+ import json
+ import os
+ from typing import Any
+ from config import SANDBOX_TIMEOUT_SECONDS, SANDBOX_MAX_OUTPUT_CHARS
+
+
+ def run_code(code: str, test_input: list[int]) -> dict[str, Any]:
+     """
+     Execute agent-generated code against a single test input.
+
+     Args:
+         code: Python source code that defines a `solution(arr)` function.
+         test_input: The integer list to pass to `solution`.
+
+     Returns:
+         {
+             "status": "pass" | "fail" | "error" | "timeout",
+             "output": the value returned by solution(arr), or None,
+             "expected": sorted(test_input),
+             "error_msg": exception string if status is "error", else "",
+         }
+     """
+     expected = sorted(test_input)
+
+     # Build a self-contained runner script. The template lines sit at column 0
+     # so that interpolating multi-line agent code leaves no common indentation
+     # for textwrap.dedent to strip unevenly.
+     runner = textwrap.dedent(f"""
+ import json, sys
+
+ {code}
+
+ test_input = {test_input!r}
+ expected = {expected!r}
+ try:
+     result = solution(test_input)
+     if result == expected:
+         print(json.dumps({{"status": "pass", "output": result, "expected": expected, "error_msg": ""}}))
+     else:
+         print(json.dumps({{"status": "fail", "output": result, "expected": expected, "error_msg": ""}}))
+ except Exception as exc:
+     print(json.dumps({{"status": "error", "output": None, "expected": expected, "error_msg": str(exc)}}))
+ """)
+
+     try:
+         proc = subprocess.run(
+             [sys.executable, "-c", runner],
+             capture_output=True,
+             text=True,
+             timeout=SANDBOX_TIMEOUT_SECONDS,
+         )
+         raw = proc.stdout.strip()
+
+         # Truncate excessive output
+         if len(raw) > SANDBOX_MAX_OUTPUT_CHARS:
+             raw = raw[:SANDBOX_MAX_OUTPUT_CHARS]
+
+         if raw:
+             result = json.loads(raw)
+         else:
+             # No stdout: treat stderr as the error message
+             err = proc.stderr.strip()[:SANDBOX_MAX_OUTPUT_CHARS]
+             result = {
+                 "status": "error",
+                 "output": None,
+                 "expected": expected,
+                 "error_msg": err or "No output produced.",
+             }
+
+     except subprocess.TimeoutExpired:
+         result = {
+             "status": "timeout",
+             "output": None,
+             "expected": expected,
+             "error_msg": f"Code exceeded {SANDBOX_TIMEOUT_SECONDS}s timeout.",
81
+ }
82
+ except json.JSONDecodeError as exc:
83
+ result = {
84
+ "status": "error",
85
+ "output": None,
86
+ "expected": expected,
87
+ "error_msg": f"JSON decode error: {exc} raw='{raw}'",
88
+ }
89
+
90
+ return result
91
+
92
+
93
+ def run_code_against_tests(code: str, tests: list[dict[str, Any]]) -> list[dict[str, Any]]:
94
+ """
95
+ Run agent code against a list of test cases.
96
+
97
+ Args:
98
+ code: Python source defining `solution(arr)`.
99
+ tests: list of {"input": [...], "expected_output": [...]} dicts.
100
+
101
+ Returns:
102
+ List of result dicts, one per test.
103
+ """
104
+ results = []
105
+ for test in tests:
106
+ result = run_code(code, test["input"])
107
+ results.append(result)
108
+ return results
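Review note: interpolating multi-line agent code into an indented template and then calling `textwrap.dedent` is fragile, because the interpolated lines carry their own indentation and change the common margin that `dedent` strips. The sketch below shows one way to avoid this, assuming the agent code can simply be prepended at column zero and only the harness is dedented. `build_runner` and `run_sketch` are hypothetical names, and reading only the last stdout line is an added assumption so that stray `print` calls in agent code do not break JSON parsing.

```python
import json
import subprocess
import sys
import textwrap


def build_runner(code: str, test_input: list[int]) -> str:
    """Assemble a runner script; agent code is kept at column zero."""
    harness = textwrap.dedent(f"""
        import json
        test_input = {test_input!r}
        expected = sorted(test_input)
        try:
            result = solution(list(test_input))
            status = "pass" if result == expected else "fail"
            print(json.dumps({{"status": status, "output": result}}))
        except Exception as exc:
            print(json.dumps({{"status": "error", "output": None, "error_msg": str(exc)}}))
    """)
    return code + "\n" + harness


def run_sketch(code: str, test_input: list[int], timeout: float = 5.0) -> dict:
    """Run the assembled script in a child interpreter with a timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", build_runner(code, test_input)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": None}
    lines = proc.stdout.strip().splitlines()
    if not lines:
        # Nothing on stdout, e.g. a syntax error in the agent code
        return {"status": "error", "output": None, "error_msg": proc.stderr.strip()[-200:]}
    try:
        # Only the final line is expected to be the harness result
        return json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        return {"status": "error", "output": None, "error_msg": f"bad JSON: {exc}"}


good = "def solution(arr):\n    return sorted(arr)\n"
print(run_sketch(good, [3, 1, 2]))
```

This is still only process separation plus a timeout; it does not restrict file, network, or memory access.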
FORGE-v4/tasks.py ADDED
@@ -0,0 +1,102 @@
+ # tasks.py
+ # Generates integer array sorting tasks and hidden test cases for FORGE-v4.
+
+ import random
+ from typing import Any
+ from config import MAX_ARRAY_SIZE, MIN_ARRAY_SIZE, ARRAY_VALUE_RANGE, NUM_HIDDEN_TESTS
+
+
+ def generate_task() -> dict[str, Any]:
+     """
+     Generate a single sorting task.
+
+     Returns a dict with:
+         - prompt: natural-language task description
+         - public_example: one visible (input, expected_output) pair
+         - hidden_tests: list of (input, expected_output) pairs kept secret from agents
+     """
+     size = random.randint(MIN_ARRAY_SIZE, MAX_ARRAY_SIZE)
+     arr = [random.randint(*ARRAY_VALUE_RANGE) for _ in range(size)]
+
+     public_example = {
+         "input": arr,
+         "expected_output": sorted(arr),
+     }
+
+     hidden_tests = _generate_hidden_tests(NUM_HIDDEN_TESTS)
+
+     task = {
+         "prompt": (
+             "Write a Python function named `solution(arr)` that takes a list of integers "
+             "and returns a new list sorted in ascending order. "
+             "Do not use `arr.sort()` in-place; return a new sorted list.\n\n"
+             f"Example:\n Input: {arr}\n Output: {sorted(arr)}"
+         ),
+         "public_example": public_example,
+         "hidden_tests": hidden_tests,
+     }
+     return task
+
+
+ def _generate_hidden_tests(n: int) -> list[dict[str, Any]]:
+     """
+     Generate n hidden test cases including edge-case variants.
+     """
+     tests = []
+
+     # Standard random arrays
+     for _ in range(n - 3):
+         size = random.randint(MIN_ARRAY_SIZE, MAX_ARRAY_SIZE)
+         arr = [random.randint(*ARRAY_VALUE_RANGE) for _ in range(size)]
+         tests.append({"input": arr, "expected_output": sorted(arr)})
+
+     # Edge case: already-sorted array
+     arr = sorted([random.randint(*ARRAY_VALUE_RANGE) for _ in range(5)])
+     tests.append({"input": arr, "expected_output": sorted(arr)})
+
+     # Edge case: reverse-sorted array
+     arr = sorted([random.randint(*ARRAY_VALUE_RANGE) for _ in range(5)], reverse=True)
+     tests.append({"input": arr, "expected_output": sorted(arr)})
+
+     # Edge case: single element
+     arr = [random.randint(*ARRAY_VALUE_RANGE)]
+     tests.append({"input": arr, "expected_output": sorted(arr)})
+
+     return tests
+
+
+ def generate_breaker_task(original_task: dict[str, Any]) -> dict[str, Any]:
+     """
+     Given an existing task, produce adversarial test cases for the Breaker agent.
+
+     The Breaker is asked to produce arrays that are likely to break a naive solution.
+     Returns a dict with the adversarial prompt and a set of candidate adversarial arrays.
+     """
+     adversarial_candidates = [
+         # All identical elements
+         [0] * random.randint(3, 8),
+         # All negative values
+         [random.randint(-100, -1) for _ in range(random.randint(3, 8))],
+         # Large array
+         [random.randint(*ARRAY_VALUE_RANGE) for _ in range(MAX_ARRAY_SIZE)],
+         # Duplicate-heavy array
+         [random.choice([1, 2, 3]) for _ in range(random.randint(4, 10))],
+         # Mixed positive/negative with duplicates
+         [random.randint(-5, 5) for _ in range(random.randint(4, 12))],
+     ]
+
+     adversarial_tests = [
+         {"input": arr, "expected_output": sorted(arr)}
+         for arr in adversarial_candidates
+     ]
+
+     breaker_task = {
+         "prompt": (
+             "You are the Breaker agent. Generate adversarial integer arrays that are "
+             "likely to expose flaws in a naive sorting implementation. "
+             "Focus on edge cases: duplicates, negatives, large inputs, already-sorted, "
+             "reverse-sorted, and single-element arrays."
+         ),
+         "adversarial_tests": adversarial_tests,
+     }
+     return breaker_task
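Review note: test cases here are plain `{"input": ..., "expected_output": ...}` dicts, and once a learned Breaker produces them there is no guarantee they are well formed. A small validator could filter them before they reach the sandbox. The sketch below is one possible shape, assuming malformed cases should be dropped silently; `validate_tests` is a hypothetical helper and is not part of this commit.

```python
from typing import Any


def validate_tests(tests: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only test cases whose input is a list of ints and whose
    expected_output equals the sorted input."""
    valid = []
    for t in tests:
        arr = t.get("input")
        if not isinstance(arr, list):
            continue
        # bool is a subclass of int in Python, so exclude it explicitly
        if not all(isinstance(x, int) and not isinstance(x, bool) for x in arr):
            continue
        if t.get("expected_output") != sorted(arr):
            continue
        valid.append(t)
    return valid


candidates = [
    {"input": [3, 1, 2], "expected_output": [1, 2, 3]},   # well formed
    {"input": [2, 1], "expected_output": [2, 1]},          # wrong expectation
    {"input": "not a list", "expected_output": []},        # wrong type
    {"input": [], "expected_output": []},                  # empty array is valid
]
kept = validate_tests(candidates)
print(len(kept))
```

Whether a rejected case should also cost the Breaker reward is a design decision left open here.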
FORGE-v4/trainer.py ADDED
@@ -0,0 +1,158 @@
+ # trainer.py
+ # Placeholder training loop hooks for FORGE-v4.
+ # Ready for future TRL / Unsloth / Hugging Face integration.
+
+ from typing import Any, Callable
+ from env import FORGEEnv
+ from memory import CoachMemory
+ from config import MAX_EPISODES, STEPS_PER_EPISODE
+
+
+ # ──────────────────────────────────────────────
+ # Placeholder agent policy functions
+ # ──────────────────────────────────────────────
+
+ def default_coder_policy(state: dict[str, Any]) -> str:
+     """
+     Placeholder Coder policy.
+
+     In production this will call a fine-tuned LLM (e.g. via TRL/Unsloth) to
+     generate Python code from the task prompt.
+
+     Currently returns a trivial reference solution so the environment runs.
+     """
+     # TODO: Replace with LLM inference call
+     return "def solution(arr):\n    return sorted(arr)\n"
+
+
+ def default_breaker_policy(state: dict[str, Any]) -> list[dict[str, Any]]:
+     """
+     Placeholder Breaker policy.
+
+     In production this will call a fine-tuned adversarial LLM to generate
+     adversarial test cases from the task prompt.
+
+     Currently returns a fixed set of edge-case test inputs.
+     """
+     # TODO: Replace with adversarial LLM inference call
+     return [
+         {"input": [], "expected_output": []},
+         {"input": [1], "expected_output": [1]},
+         {"input": [3, 1, 2], "expected_output": [1, 2, 3]},
+         {"input": [-5, -1, -3], "expected_output": [-5, -3, -1]},
+         {"input": [0, 0, 0, 0], "expected_output": [0, 0, 0, 0]},
+     ]
+
+
+ # ──────────────────────────────────────────────
+ # Core training loop
+ # ──────────────────────────────────────────────
+
+ def train(
+     coder_policy: Callable[[dict[str, Any]], str] = default_coder_policy,
+     breaker_policy: Callable[[dict[str, Any]], list[dict[str, Any]]] = default_breaker_policy,
+     num_episodes: int = MAX_EPISODES,
+     verbose: bool = True,
+ ) -> dict[str, Any]:
+     """
+     Run the FORGE-v4 training loop.
+
+     Args:
+         coder_policy: Callable(state) → Python source string.
+         breaker_policy: Callable(state) → list of test-case dicts.
+         num_episodes: Number of training episodes to run.
+         verbose: Print per-episode summaries when True.
+
+     Returns:
+         Training summary dict with per-episode reward histories.
+     """
+     memory = CoachMemory()
+     env = FORGEEnv(memory=memory)
+
+     episode_history: list[dict[str, Any]] = []
+
+     for ep in range(1, num_episodes + 1):
+         state = env.reset()
+         episode_coder_rewards = []
+         episode_breaker_rewards = []
+
+         for _ in range(STEPS_PER_EPISODE):
+             # ── Agent decisions ────────────────────────────────────────────
+             coder_code = coder_policy(state)
+             breaker_tests = breaker_policy(state)
+
+             action = {
+                 "coder_code": coder_code,
+                 "breaker_tests": breaker_tests,
+             }
+
+             # ── Environment step ───────────────────────────────────────────
+             result = env.step(action)
+             state = result["state"]
+
+             episode_coder_rewards.append(result["coder_reward"]["total_reward"])
+             episode_breaker_rewards.append(result["breaker_reward"]["total_reward"])
+
+             if result["done"]:
+                 break
+
+         # ── Episode summary ────────────────────────────────────────────────
+         avg_cr = round(sum(episode_coder_rewards) / len(episode_coder_rewards), 4)
+         avg_br = round(sum(episode_breaker_rewards) / len(episode_breaker_rewards), 4)
+
+         ep_summary = {
+             "episode": ep,
+             "avg_coder_reward": avg_cr,
+             "avg_breaker_reward": avg_br,
+             "steps": env.step_count,
+         }
+         episode_history.append(ep_summary)
+
+         if verbose:
+             print(
+                 f"[Episode {ep:>4}/{num_episodes}] "
+                 f"Coder avg reward: {avg_cr:+.4f} | "
+                 f"Breaker avg reward: {avg_br:+.4f}"
+             )
+
+         # ── TRL / Unsloth hook placeholders ───────────────────────────────
+         _on_episode_end(ep, ep_summary, memory)
+
+     training_summary = {
+         "total_episodes": num_episodes,
+         "episode_history": episode_history,
+         "memory_summary": memory.summary(),
+     }
+     return training_summary
+
+
+ # ──────────────────────────────────────────────
+ # Hook placeholders for future RL framework integration
+ # ──────────────────────────────────────────────
+
+ def _on_episode_end(
+     episode: int,
+     summary: dict[str, Any],
+     memory: CoachMemory,
+ ) -> None:
+     """
+     Called at the end of every episode.
+
+     TODO: Plug in TRL PPOTrainer / Unsloth model updates here.
+     E.g.:
+         trainer.step(queries, responses, rewards)
+         model.save_pretrained(f"models/checkpoint-ep{episode}")
+     """
+     pass  # placeholder
+
+
+ def _on_step_end(
+     step: int,
+     result: dict[str, Any],
+ ) -> None:
+     """
+     Called after every environment step.
+
+     TODO: Plug in per-step reward logging (e.g. W&B, TensorBoard) here.
+     """
+     pass  # placeholder
attached_assets/Pasted-Create-a-Python-project-named-FORGE-v4-Build-the-comple_1777105563327.txt ADDED
@@ -0,0 +1,108 @@
+ Create a Python project named **FORGE-v4**.
+
+ Build the complete project skeleton with this exact structure:
+
+ FORGE-v4/
+ │── app.py
+ │── env.py
+ │── tasks.py
+ │── rewards.py
+ │── sandbox.py
+ │── memory.py
+ │── trainer.py
+ │── config.py
+ │── requirements.txt
+ │── README.md
+ │── data/
+ │── logs/
+ │── models/
+ │── outputs/
+
+ Project Purpose:
+ FORGE-v4 is a hackathon project based on an OpenEnv-style reinforcement learning environment where:
+
+ 1. A Coder Agent writes Python code to solve integer array sorting tasks.
+ 2. A Breaker Agent creates adversarial test cases to break the solution.
+ 3. A Sandbox safely runs generated code.
+ 4. Rewards are assigned to both agents.
+ 5. Coach Memory stores lessons learned across episodes.
+
+ Generate clean, modular starter code for all files.
+
+ Required file responsibilities:
+
+ 1. app.py
+
+ * Main runner script
+ * Minimal CLI demo
+ * Runs one sample episode
+
+ 2. env.py
+
+ * Main environment class
+ * Include methods:
+
+   * reset()
+   * step(action)
+   * get_state()
+
+ 3. tasks.py
+
+ * Generate integer array sorting tasks
+ * Sample hidden test cases
+
+ 4. rewards.py
+
+ * Functions for coder_reward()
+ * Functions for breaker_reward()
+
+ 5. sandbox.py
+
+ * Safely execute Python-generated code
+ * Include timeout handling
+ * Return pass/fail/error info
+
+ 6. memory.py
+
+ * Coach memory system
+ * Store lessons learned in JSON/list format
+ * Load/save memory helpers
+
+ 7. trainer.py
+
+ * Placeholder training loop hooks
+ * Future TRL / Unsloth integration ready
+
+ 8. config.py
+
+ * Store constants such as:
+
+   * timeout seconds
+   * max array size
+   * tier thresholds
+   * reward weights
+
+ 9. requirements.txt
+ Include useful starter dependencies only.
+
+ 10. README.md
+ Professional first draft including:
+
+ * Project overview
+ * Architecture
+ * File structure
+ * How to run
+
+ Important Rules:
+
+ * Generate WORKING starter code, not empty files.
+ * Use Python best practices.
+ * Add comments throughout code.
+ * Keep modular design.
+ * Keep ready for future Google Colab training and Hugging Face deployment.
+ * No frontend needed now.
+ * Focus only on backend environment skeleton.
+
+ After generating files, ensure the project runs successfully with:
+
+ python app.py