---
title: ML Training Optimizer
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
base_path: /web
---
# ML Training Optimizer: OpenEnv Environment
An OpenEnv environment where AI agents learn to optimize the training of real ML/DL models by tuning hyperparameters, selecting optimizers, managing learning rate schedules, and applying regularization techniques.
**This environment trains REAL PyTorch models on CPU**, not simulations. The agent observes actual training curves, loss values, and validation metrics from real forward/backward passes, then decides what to change next.
This models a real workflow ML practitioners perform every day: hyperparameter tuning under limited compute, noisy validation curves, and real overfitting risk.
## Motivation
ML practitioners spend enormous time on hyperparameter tuning. This environment recreates that workflow:
- Agent observes training metrics (loss, accuracy, convergence signals)
- Agent decides what to change (optimizer, LR, regularization, etc.)
- Agent runs more training and iterates
The small dataset subsets (5k–10k samples) make overfitting a **real, tangible problem** the agent must address, exactly like the low-data regimes practitioners face daily.
## Vision & Scalability
The long-term vision for this environment is to **teach AI agents to monitor and optimize the training of large-scale models on distributed systems**: multi-GPU clusters, sharded data pipelines, and fault-tolerant training loops. In production ML, human engineers spend significant time babysitting training runs: watching for loss spikes, adjusting learning rates, restarting from checkpoints, and rebalancing resources across nodes. An agent that masters these skills could dramatically accelerate the development cycle of foundation models.
To fit within current compute constraints (and the OpenEnv specification), the environment currently operates on small models trainable on standard CPUs. However, the core abstractions (observing training curves, adjusting hyperparameters mid-run, detecting convergence/divergence, and deciding when to stop) are **identical to those required at scale**. An agent that learns effective optimization strategies here can transfer those skills to larger, distributed settings as the environment scales up.
## Tasks
### Task 1: MNIST Digit Classifier (Easy)
- **Model**: 2-layer MLP (~100k params)
- **Dataset**: MNIST 5k subset (4k train / 1k val)
- **Budget**: 100 epochs
- **Goal**: Maximize validation accuracy (target ≥ 96%)
- **Grading**: Linear scale 88%–97.5% → score 0.0–1.0
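The linear grading for Task 1 can be sketched as a clamped interpolation. This is a minimal illustration rather than the environment's actual grader: the function name `linear_score` and the clamping of accuracies outside the 88%–97.5% range are assumptions.

```python
def linear_score(val_accuracy: float, low: float = 0.88, high: float = 0.975) -> float:
    """Map validation accuracy linearly from [low, high] onto [0.0, 1.0]."""
    raw = (val_accuracy - low) / (high - low)
    # Assumption: accuracies outside the range are clamped to 0.0 or 1.0.
    return max(0.0, min(1.0, raw))


# Score for a run that exactly hits the 96% target.
print(round(linear_score(0.96), 3))
```

Under these assumptions, reaching the 96% target corresponds to a score of roughly 0.84, so the last 1.5 points of accuracy are worth the remaining 0.16.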
### Task 2: Fashion Item Classifier (Medium)
- **Model**: Small CNN (~200k params)
- **Dataset**: FashionMNIST 8k subset (6.5k train / 1.5k val)
- **Budget**: 80 epochs
- **Goal**: Maximize accuracy while keeping overfitting gap < 5%
- **Grading**: 60% accuracy score + 40% generalization score
### Task 3: CIFAR-10 Under Budget (Hard)
- **Model**: Deeper CNN (~500k params)
- **Dataset**: CIFAR-10 10k subset (8k train / 2k val)
- **Budget**: 60 epochs
- **Goal**: Maximize accuracy under tight budget
- **Grading**: 50% accuracy + 30% efficiency + 20% stability
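The weighted gradings of Tasks 2 and 3 can be read as a weighted sum of component scores. The sketch below only shows how the weights combine; how each component (generalization, efficiency, stability) is computed is not specified here, so the component values are made-up inputs and `weighted_score` is a hypothetical helper.

```python
def weighted_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Combine component scores in [0, 1] using weights that sum to 1."""
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * components[name] for name in weights)


# Weights come from the task descriptions; the component values are invented.
MEDIUM_WEIGHTS = {"accuracy": 0.6, "generalization": 0.4}
HARD_WEIGHTS = {"accuracy": 0.5, "efficiency": 0.3, "stability": 0.2}

medium = weighted_score({"accuracy": 0.8, "generalization": 0.5}, MEDIUM_WEIGHTS)
hard = weighted_score({"accuracy": 0.6, "efficiency": 0.5, "stability": 1.0}, HARD_WEIGHTS)
print(medium, hard)
```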
## Action Space (MCP Tools)
| Tool | Parameters | Description |
|---|---|---|
| `configure_training` | optimizer, learning_rate, batch_size, weight_decay, dropout, lr_schedule, warmup_epochs, augmentation, augmentation_strength | Set/update training config |
| `run_epochs` | num_epochs (1–20) | Run N epochs of real PyTorch training |
| `adjust_learning_rate` | new_lr | Change LR mid-training |
| `toggle_augmentation` | enabled, strength | Toggle data augmentation |
| `get_training_status` | (none) | Query current metrics |
| `submit_model` | (none) | Submit for final grading |
### Configuration Options
**Optimizers**: `sgd` (with momentum=0.9), `adam`, `adamw`
**LR Schedules**: `constant`, `step` (decay by 0.1 every T/3 epochs), `cosine` (cosine annealing), `warmup_cosine` (linear warmup + cosine)
**Regularization**: `weight_decay` (L2), `dropout` (0.0–0.5), `augmentation` (random transforms)
**Batch Sizes**: 32, 64, 128, 256
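To make the four LR schedules concrete, here is a rough sketch of the learning rate each one would produce per epoch. It is not taken from `trainer.py`: the function `lr_at_epoch` is hypothetical, and it assumes that T is the task's epoch budget, that "decay by 0.1" means multiplying the LR by 0.1, and that cosine annealing decays towards zero.

```python
import math


def lr_at_epoch(epoch: int, total_epochs: int, base_lr: float,
                schedule: str = "constant", warmup_epochs: int = 0) -> float:
    """Return the learning rate for a 0-indexed epoch under the named schedule."""
    if schedule == "constant":
        return base_lr
    if schedule == "step":
        # Multiply by 0.1 after every third of the budget (T/3 epochs).
        step_size = max(1, total_epochs // 3)
        return base_lr * 0.1 ** (epoch // step_size)
    if schedule == "cosine":
        # Half-cosine from base_lr at epoch 0 down to 0 at epoch T.
        return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
    if schedule == "warmup_cosine":
        if epoch < warmup_epochs:
            # Linear ramp from base_lr / warmup_epochs up to base_lr.
            return base_lr * (epoch + 1) / warmup_epochs
        decay_epochs = max(1, total_epochs - warmup_epochs)
        progress = (epoch - warmup_epochs) / decay_epochs
        return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


# Print the step schedule at the start of each third of a 60-epoch budget.
for e in (0, 20, 40):
    print(e, lr_at_epoch(e, 60, 0.1, "step"))
```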
## Observation Space
After each action, the agent receives:
```json
{
  "current_epoch": 30,
  "max_epochs": 100,
  "remaining_budget": 70,
  "train_loss": 0.342,
  "val_loss": 0.401,
  "train_accuracy": 0.891,
  "val_accuracy": 0.864,
  "best_val_accuracy": 0.871,
  "best_val_epoch": 25,
  "loss_history_last_10": [0.45, 0.43, ...],
  "val_loss_history_last_10": [0.52, 0.49, ...],
  "convergence_signal": "improving",
  "is_diverged": false
}
```
**Convergence signals**: `not_started`, `warming_up`, `improving`, `plateaued`, `overfitting`, `stalling`, `diverged`
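As an illustration of how an agent might act on these observations, the sketch below maps one observation to a tool call. It is a hand-written heuristic, not the baseline's policy: the function `suggest_next_action`, the learning-rate values, and the `strength` value are all assumptions chosen for the example.

```python
from typing import Any


def suggest_next_action(obs: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Map one observation to a (tool_name, arguments) pair. Illustrative only."""
    signal = obs["convergence_signal"]
    remaining = obs["remaining_budget"]

    if remaining <= 0:
        return "submit_model", {}
    if signal == "not_started":
        # Starting values copied from the Python client example in this README.
        return "configure_training", {
            "optimizer": "adam", "learning_rate": 0.001, "batch_size": 64,
        }
    if obs["is_diverged"] or signal == "diverged":
        # Assumption: retry with a much smaller learning rate after divergence.
        return "adjust_learning_rate", {"new_lr": 1e-4}
    if signal == "overfitting":
        # Assumption: `strength` is a float; 0.5 is a placeholder value.
        return "toggle_augmentation", {"enabled": True, "strength": 0.5}
    if signal in ("plateaued", "stalling"):
        # Assumption: a fixed lower LR, since the observation does not report the current one.
        return "adjust_learning_rate", {"new_lr": 3e-4}
    # "warming_up" and "improving": keep training, at most 10 epochs per call.
    return "run_epochs", {"num_epochs": min(10, remaining)}


obs = {"convergence_signal": "improving", "remaining_budget": 70, "is_diverged": False}
print(suggest_next_action(obs))
```

A real driver would also need to remember which adjustments it has already made, since this function looks at a single observation in isolation.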
## Reward Function
Rewards are given at every step, not only at the end:
- **Progress reward**: +0.3 × accuracy improvement above previous best
- **Convergence reward**: +0.05 for decreasing validation loss
- **Divergence penalty**: −0.2 if training diverges
- **Overfitting penalty**: −0.05 × excess when gap > 8%
- **Submission bonus**: Final grader score (0.0–1.0) added on submit
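The per-step terms above can be sketched as a single function. This is an interpretation, not the environment's code: `step_reward` is a hypothetical name, and it assumes that the gap is train accuracy minus validation accuracy and that "excess" is the part of the gap beyond 0.08. The submission bonus is left out because it is only added on submit.

```python
def step_reward(val_accuracy: float, best_val_accuracy: float,
                val_loss: float, prev_val_loss: float,
                train_accuracy: float, is_diverged: bool) -> float:
    """Sum the per-step reward terms for one training step."""
    reward = 0.0
    # Progress: +0.3 x improvement over the previous best validation accuracy.
    reward += 0.3 * max(0.0, val_accuracy - best_val_accuracy)
    # Convergence: flat +0.05 when validation loss went down.
    if val_loss < prev_val_loss:
        reward += 0.05
    # Divergence: flat -0.2.
    if is_diverged:
        reward -= 0.2
    # Overfitting (assumed definition): penalize the gap beyond 8 percentage points.
    gap = train_accuracy - val_accuracy
    if gap > 0.08:
        reward -= 0.05 * (gap - 0.08)
    return reward


# Validation accuracy improves from 0.85 to 0.90 while validation loss falls.
print(step_reward(0.90, 0.85, 0.40, 0.50, 0.92, False))
```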
## Setup & Usage
### Install
```bash
uv sync
```
### Run the server locally
```bash
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Run with Docker
```bash
docker build -f server/Dockerfile -t ml-trainer-env .
docker run -p 8000:8000 ml-trainer-env
```
### Run the baseline inference
This baseline uses the OpenAI API by default. With the default `LLM_RPM_LIMIT=5`, it spaces requests to stay under free-tier quotas and uses a small, quota-aware decision budget per task.
Recommended `.env`:
```bash
OPENAI_API_KEY=sk-proj-...
MODEL_NAME=gpt-4o-mini
ENV_URL=http://localhost:8000
LLM_RPM_LIMIT=5
LLM_MAX_RETRIES=3
LLM_REASONING_EFFORT=minimal
LLM_MAX_STEPS_EASY=5
LLM_MAX_STEPS_MEDIUM=6
LLM_MAX_STEPS_HARD=7
```
Then run:
```bash
export ENV_URL=http://localhost:8000
uv run inference.py
```
The script uses the OpenAI Python client against the official OpenAI API by default. You can also point it at other OpenAI-compatible providers (such as OpenRouter or Gemini) by setting the corresponding `API_BASE_URL`, `OPENROUTER_API_KEY`, or `GEMINI_API_KEY` in your `.env`.
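One simple way to space requests under an RPM limit is to enforce a minimum interval of 60 / RPM seconds between request starts. The sketch below illustrates that idea only; the scheduler in `inference.py` may work differently, and the class `RequestSpacer` is a hypothetical name. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class RequestSpacer:
    """Space calls so that at most `rpm` requests start per minute."""

    def __init__(self, rpm: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request may start; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_start is not None:
            delay = max(0.0, self._last_start + self.min_interval - now)
        if delay > 0:
            self._sleep(delay)
        # Record when this request is allowed to start.
        self._last_start = now + delay
        return delay


spacer = RequestSpacer(rpm=5)
print(spacer.min_interval)  # seconds between requests at LLM_RPM_LIMIT=5
```

Calling `spacer.wait()` before each LLM request would then keep the request rate at or below the configured limit.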
### Required environment variables
| Variable | Required | Purpose |
|---|---|---|
| `API_BASE_URL` | no (defaults to OpenAI) | LLM API endpoint |
| `MODEL_NAME` | yes | Model identifier (default: gpt-4o-mini) |
| `ENV_URL` | yes | URL of the running OpenEnv environment |
| `OPENAI_API_KEY` | yes | Auth for OpenAI (preferred) |
| `OPENROUTER_API_KEY` or `GEMINI_API_KEY` | yes | Fallback Auth for alternative providers |
| `HF_TOKEN` | needed for HF deployment workflows | Hugging Face auth token |
### Optional inference tuning variables
| Variable | Default | Purpose |
|---|---|---|
| `LLM_RPM_LIMIT` | `5` | Hard request cap used by the scheduler |
| `LLM_MAX_RETRIES` | `3` | Rate-limit retries per model request |
| `LLM_REASONING_EFFORT` | `minimal` | Gemini reasoning effort |
| `LLM_MAX_STEPS_EASY` | `5` | Max model decisions for `easy_mnist` |
| `LLM_MAX_STEPS_MEDIUM` | `6` | Max model decisions for `medium_fashion` |
| `LLM_MAX_STEPS_HARD` | `7` | Max model decisions for `hard_cifar` |
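For reference, the tuning variables and defaults in the table could be read along these lines. This is a sketch, not the parsing code in `inference.py`; `InferenceSettings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class InferenceSettings:
    rpm_limit: int
    max_retries: int
    reasoning_effort: str
    max_steps: dict[str, int]  # keyed by task id


def load_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read the tuning variables, falling back to the documented defaults."""
    return InferenceSettings(
        rpm_limit=int(env.get("LLM_RPM_LIMIT", "5")),
        max_retries=int(env.get("LLM_MAX_RETRIES", "3")),
        reasoning_effort=env.get("LLM_REASONING_EFFORT", "minimal"),
        max_steps={
            "easy_mnist": int(env.get("LLM_MAX_STEPS_EASY", "5")),
            "medium_fashion": int(env.get("LLM_MAX_STEPS_MEDIUM", "6")),
            "hard_cifar": int(env.get("LLM_MAX_STEPS_HARD", "7")),
        },
    )


# With an empty mapping, every field takes its documented default.
print(load_settings({}))
```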
### Interact via Python client
```python
from ml_trainer_env import MLTrainerEnv

with MLTrainerEnv(base_url="http://localhost:8000") as env:
    env.reset(task_id="easy_mnist")
    tools = env.list_tools()

    result = env.call_tool(
        "configure_training",
        optimizer="adam", learning_rate=0.001, batch_size=64,
    )
    result = env.call_tool("run_epochs", num_epochs=10)
    print(result)  # Real training metrics!

    result = env.call_tool("submit_model")
    print(result)  # Final score
```
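Because `run_epochs` accepts at most 20 epochs per call, spending a larger budget takes several calls. The helper below is a small sketch of that loop; `train_in_chunks` and the `ToolEnv` protocol are hypothetical, and the only thing assumed about the client is the `call_tool(name, **kwargs)` method used in the example above.

```python
from typing import Any, Protocol


class ToolEnv(Protocol):
    def call_tool(self, name: str, **kwargs: Any) -> Any: ...


def train_in_chunks(env: ToolEnv, total_epochs: int, chunk: int = 10) -> list[Any]:
    """Call `run_epochs` repeatedly until `total_epochs` have been requested."""
    if not 1 <= chunk <= 20:
        raise ValueError("run_epochs accepts between 1 and 20 epochs per call")
    results = []
    remaining = total_epochs
    while remaining > 0:
        n = min(chunk, remaining)
        # Each result is returned as-is; its structure is not assumed here.
        results.append(env.call_tool("run_epochs", num_epochs=n))
        remaining -= n
    return results
```

Inside the `with` block above, `train_in_chunks(env, total_epochs=30)` would issue three `run_epochs` calls of 10 epochs each, leaving room to inspect the metrics between calls.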
## Baseline Scores
| Task | Expected Score Range | Notes |
|---|---|---|
| easy_mnist | 0.6 – 0.9 | Most models solve this well |
| medium_fashion | 0.4 – 0.7 | Requires regularization awareness |
| hard_cifar | 0.2 – 0.5 | Genuinely challenging under budget |
Scores are reported as expected ranges rather than fixed values: seeds and data subsets are deterministic, but every score comes from an actual training run rather than a precomputed result.
## Architecture
```
openenv-hack/
├── __init__.py                      # Package exports
├── models.py                        # Pydantic Action/Observation models
├── client.py                        # MCPToolClient subclass
├── openenv.yaml                     # OpenEnv manifest
├── pyproject.toml                   # Dependencies
├── inference.py                     # Baseline inference script
├── README.md
├── server/
│   ├── app.py                       # FastAPI server
│   ├── ml_trainer_environment.py    # MCPEnvironment with tools
│   ├── trainer.py                   # Real PyTorch training engine
│   ├── models_nn.py                 # Neural network architectures
│   ├── datasets.py                  # Dataset loading & subsetting
│   ├── tasks.py                     # Task definitions & graders
│   └── Dockerfile
└── outputs/
    ├── logs/
    └── evals/
```
## Technical Details
- **Real training**: Actual PyTorch forward/backward passes on CPU
- **Deterministic**: `torch.manual_seed()` ensures reproducible results
- **Constrained**: `torch.set_num_threads(2)` matches 2 vCPU limit
- **Fast**: ~0.5–3 s per epoch depending on task
- **Pre-cached**: Datasets downloaded at Docker build time
- **Quota-aware baseline**: `inference.py` is optimized for low-RPM Gemini quotas and uses function calling with compact state summaries