samrat-rm committed on
Commit 8f1e681 · 1 Parent(s): 25fff92

feat: update the readme.md

Files changed (1):
  1. README.md +101 -205

README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: Whydiditfail Environment Server
- emoji: 📝
  colorFrom: red
  colorTo: indigo
  sdk: docker
@@ -11,245 +11,141 @@ tags:
  - openenv
  ---

- # Whydiditfail Environment

- A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.

- ## Quick Start

- The simplest way to use the Whydiditfail environment is through the `WhydiditfailEnv` class:

- ```python
- from WhyDidItFail import WhyDidItFailAction, WhydiditfailEnv
-
- try:
-     # Create environment from Docker image
-     WhyDidItFailenv = WhydiditfailEnv.from_docker_image("WhyDidItFail-env:latest")
-
-     # Reset
-     result = WhyDidItFailenv.reset()
-     print(f"Reset: {result.observation.echoed_message}")
-
-     # Send multiple messages
-     messages = ["Hello, World!", "Testing echo", "Final message"]
-
-     for msg in messages:
-         result = WhyDidItFailenv.step(WhyDidItFailAction(message=msg))
-         print(f"Sent: '{msg}'")
-         print(f"  → Echoed: '{result.observation.echoed_message}'")
-         print(f"  → Length: {result.observation.message_length}")
-         print(f"  → Reward: {result.reward}")
-
- finally:
-     # Always clean up
-     WhyDidItFailenv.close()
- ```
-
- That's it! The `WhydiditfailEnv.from_docker_image()` method handles:
- - Starting the Docker container
- - Waiting for the server to be ready
- - Connecting to the environment
- - Container cleanup when you call `close()`

- ## Building the Docker Image

- Before using the environment, you need to build the Docker image:

- ```bash
- # From project root
- docker build -t WhyDidItFail-env:latest -f server/Dockerfile .
- ```

- ## Deploying to Hugging Face Spaces

- You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:

- ```bash
- # From the environment directory (where openenv.yaml is located)
- openenv push
-
- # Or specify options
- openenv push --namespace my-org --private
- ```
-
- The `openenv push` command will:
- 1. Validate that the directory is an OpenEnv environment (it checks for `openenv.yaml`)
- 2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
- 3. Upload to Hugging Face (ensuring you are logged in)

- ### Prerequisites

- - Authentication with Hugging Face: the command will prompt for login if you are not already authenticated

- ### Options

- - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to the current directory)
- - `--repo-id`, `-r`: Repository ID in the format `username/repo-name` (defaults to `username/env-name` from `openenv.yaml`)
- - `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile `FROM`)
- - `--private`: Deploy the Space as private (default: public)
-
- ### Examples

- ```bash
- # Push to your personal namespace (defaults to username/env-name from openenv.yaml)
- openenv push
-
- # Push to a specific repository
- openenv push --repo-id my-org/my-env
-
- # Push with a custom base image
- openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
-
- # Push as a private Space
- openenv push --private
-
- # Combine options
- openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
- ```

- After deployment, your Space will be available at:
- `https://huggingface.co/spaces/<repo-id>`

- The deployed Space includes:
- - **Web Interface** at `/web` - Interactive UI for exploring the environment
- - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
- - **Health Check** at `/health` - Container health monitoring
- - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
-
- ## Environment Details

- ### Action
- **WhyDidItFailAction**: Contains a single field
- - `message` (str) - The message to echo back

- ### Observation
- **WhyDidItFailObservation**: Contains the echo response and metadata
- - `echoed_message` (str) - The message echoed back
- - `message_length` (int) - Length of the message
- - `reward` (float) - Reward based on message length (length × 0.1)
- - `done` (bool) - Always False for the echo environment
- - `metadata` (dict) - Additional info like step count

- ### Reward
- The reward is calculated as: `message_length × 0.1`
- - "Hi" → reward: 0.2
- - "Hello, World!" → reward: 1.3
- - Empty message → reward: 0.0
-
- ## Advanced Usage

- ### Connecting to an Existing Server

- If you already have a Whydiditfail environment server running, you can connect to it directly:

- ```python
- from WhyDidItFail import WhyDidItFailAction, WhydiditfailEnv
-
- # Connect to an existing server
- WhyDidItFailenv = WhydiditfailEnv(base_url="<ENV_HTTP_URL_HERE>")
-
- # Use as normal
- result = WhyDidItFailenv.reset()
- result = WhyDidItFailenv.step(WhyDidItFailAction(message="Hello!"))
- ```

- Note: When connecting to an existing server, `WhyDidItFailenv.close()` will NOT stop the server.
-
- ### Using the Context Manager

- The client supports context manager usage for automatic connection management:

- ```python
- from WhyDidItFail import WhyDidItFailAction, WhydiditfailEnv
-
- # Connect with the context manager (auto-connects and auto-closes)
- with WhydiditfailEnv(base_url="http://localhost:8000") as env:
-     result = env.reset()
-     print(f"Reset: {result.observation.echoed_message}")
-     # Multiple steps with low latency
-     for msg in ["Hello", "World", "!"]:
-         result = env.step(WhyDidItFailAction(message=msg))
-         print(f"Echoed: {result.observation.echoed_message}")
- ```

- The client uses WebSocket connections for:
- - **Lower latency**: No HTTP connection overhead per request
- - **Persistent session**: Server maintains your environment state
- - **Efficient for episodes**: Better for many sequential steps

- ### Concurrent WebSocket Sessions

- The server supports multiple concurrent WebSocket connections. To enable this,
- modify `server/app.py` to use factory mode:

- ```python
- # In server/app.py - use factory mode for concurrent sessions
- app = create_app(
-     WhydiditfailEnvironment,  # Pass the class, not an instance
-     WhyDidItFailAction,
-     WhyDidItFailObservation,
-     max_concurrent_envs=4,  # Allow 4 concurrent sessions
- )
- ```
-
- Then multiple clients can connect simultaneously:

- ```python
- from WhyDidItFail import WhyDidItFailAction, WhydiditfailEnv
- from concurrent.futures import ThreadPoolExecutor
-
- def run_episode(client_id: int):
-     with WhydiditfailEnv(base_url="http://localhost:8000") as env:
-         result = env.reset()
-         for i in range(10):
-             result = env.step(WhyDidItFailAction(message=f"Client {client_id}, step {i}"))
-         return client_id, result.observation.message_length
-
- # Run 4 episodes concurrently
- with ThreadPoolExecutor(max_workers=4) as executor:
-     results = list(executor.map(run_episode, range(4)))
- ```

- ## Development & Testing

- ### Direct Environment Testing

- Test the environment logic directly without starting the HTTP server:
 
  ```bash
- # From the server directory
- python3 server/WhyDidItFail_environment.py
- ```

- This verifies that:
- - Environment resets correctly
- - Step executes actions properly
- - State tracking works
- - Rewards are calculated correctly

- ### Running Locally

- Run the server locally for development:

  ```bash
- uvicorn server.app:app --reload
  ```

  ## Project Structure

  ```
  WhyDidItFail/
- ├── .dockerignore                # Docker build exclusions
- ├── __init__.py                  # Module exports
- ├── README.md                    # This file
- ├── openenv.yaml                 # OpenEnv manifest
- ├── pyproject.toml               # Project metadata and dependencies
- ├── uv.lock                      # Locked dependencies (generated)
- ├── client.py                    # WhydiditfailEnv client
- ├── models.py                    # Action and Observation models
  └── server/
-     ├── __init__.py                  # Server module exports
-     ├── WhyDidItFail_environment.py  # Core environment logic
-     ├── app.py                       # FastAPI application (HTTP + WebSocket endpoints)
-     └── Dockerfile                   # Container image definition
  ```

  ---
+ title: WhyDidItFail Environment Server
+ emoji: 🔍
  colorFrom: red
  colorTo: indigo
  sdk: docker

  - openenv
  ---

+ # WhyDidItFail — ML Training Failure Diagnosis Environment

+ An OpenEnv environment where an AI agent must diagnose why a machine learning training run failed. The agent inspects logs, configs, and gradient statistics to identify the root cause and suggest a fix.

+ ## Overview

+ Real ML engineers spend significant time debugging failed training runs. This environment simulates that workflow: the agent operates under partial observability (it must decide what to inspect) and must reason sequentially from evidence to diagnosis.

+ **12 realistic failure modes** across 3 difficulty tiers:
+ - **Easy**: identify the failure from training logs only (loss/accuracy curves)
+ - **Medium**: identify the failure from logs + hyperparameter config
+ - **Hard**: identify the failure from logs + config + gradient norm data, and provide a concrete fix
 
+ ## Failure Modes

+ | Category | Failure Modes |
+ |---|---|
+ | Optimization | exploding gradients, vanishing gradients, learning rate too high/low |
+ | Regularization | overfitting, missing regularization |
+ | Architecture | dying relu, bad weight initialization |
+ | Configuration | optimizer misconfiguration, batch size too small, lr scheduler misconfiguration |

+ ## Action Space

+ | Action | Description |
+ |---|---|
+ | `inspect_logs` | View training/validation loss and accuracy curves by epoch |
+ | `inspect_config` | View the hyperparameter config (lr, optimizer, batch size, dropout, etc.) |
+ | `inspect_gradients` | View gradient norm statistics by layer and epoch |
+ | `submit_diagnosis` | Submit the final diagnosis with a label, suggested fix, and reasoning |
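The four actions above can be exercised with a short agent loop. The sketch below is illustrative only: it mocks the server-side dispatch with plain dicts, and the `mock_step` helper, its `action_type` argument, and the sample data are assumptions rather than the environment's actual API (the real logic lives in `server/WhyDidItFail_environment.py`).

```python
# Hypothetical sketch of the action space; field names and data are assumed.
MOCK_SCENARIO = {
    "logs": {"train_loss": [2.3, 1.1, 0.4], "val_loss": [2.2, 1.5, 1.9]},
    "config": {"lr": 0.1, "optimizer": "sgd", "batch_size": 4},
    "gradients": {"layer1_norms": [0.9, 5.0, 80.0]},
}

def mock_step(action_type, payload=None):
    """Dispatch one action against the mock scenario, echoing the server's shape."""
    if action_type == "inspect_logs":
        return {"visible_data": MOCK_SCENARIO["logs"], "done": False}
    if action_type == "inspect_config":
        return {"visible_data": MOCK_SCENARIO["config"], "done": False}
    if action_type == "inspect_gradients":
        return {"visible_data": MOCK_SCENARIO["gradients"], "done": False}
    if action_type == "submit_diagnosis":
        # A diagnosis ends the episode; the payload carries label, fix, reasoning.
        return {"visible_data": payload or {}, "done": True}
    raise ValueError(f"unknown action: {action_type}")

logs = mock_step("inspect_logs")
final = mock_step("submit_diagnosis",
                  {"label": "exploding gradients", "fix": "lower the learning rate"})
```

An agent would typically chain inspections until it has enough evidence, then submit.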
 
+ ## Observation Space

+ Each step returns a `WhyDidItFailObservation` with:
+ - `task_description` — the current task objective
+ - `visible_data` — data returned by the last inspect action (JSON)
+ - `feedback` — a partial-progress hint (e.g. which sources still need inspection)
+ - `steps_taken` — step counter
+ - `reward` — step-level reward
+ - `done` — episode termination flag
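For illustration, the fields above can be modeled with stdlib dataclasses. The real `WhyDidItFailObservation` in `models.py` is a Pydantic model, so the class name and defaults here are assumptions about its shape only.

```python
from dataclasses import dataclass, field

@dataclass
class ObservationSketch:
    """Shape-only sketch of WhyDidItFailObservation (the real model is Pydantic)."""
    task_description: str                             # current task objective
    visible_data: dict = field(default_factory=dict)  # last inspect result (JSON)
    feedback: str = ""                                # partial-progress hint
    steps_taken: int = 0                              # step counter
    reward: float = 0.0                               # step-level reward
    done: bool = False                                # episode termination flag

obs = ObservationSketch(
    task_description="Identify the failure mode from training logs only",
)
```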

+ ## Reward Function

+ Rewards are provided throughout the episode, not only at completion:

+ | Component | Weight | Signal |
+ |---|---|---|
+ | Diagnosis score | 0.70 | Correct failure-mode label (exact match = 0.40 base; fuzzy = 0.10 per category keyword) |
+ | Evidence score | 0.15 | Inspected the required sources; penalizes missing or irrelevant inspections |
+ | Efficiency score | 0.15 | Minimal steps to diagnosis; decays with wasted actions |
+ | Fix bonus | +0.15 | Keyword match on the suggested fix (total capped at 1.0) |

+ Step-level rewards during inspection: +0.10 / +0.07 / +0.05 for each required source discovered (decaying). Re-inspection: −0.05. Irrelevant inspection: −0.03.
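As a worked example of the weighting above, the components might be combined as follows. This is only a sketch of the table's arithmetic; `combine_score` is a hypothetical helper name, and the authoritative grader is `server/graders.py`.

```python
def combine_score(diagnosis, evidence, efficiency, fix_bonus=0.0):
    """Weight the component scores (each in [0, 1]) per the table above.
    The fix bonus can push the sum past 1.0, so the total is capped."""
    total = 0.70 * diagnosis + 0.15 * evidence + 0.15 * efficiency + fix_bonus
    return min(total, 1.0)

# A perfect episode with the fix bonus hits the 1.0 cap.
perfect = combine_score(1.0, 1.0, 1.0, fix_bonus=0.15)

# A correct label found inefficiently still earns most of the score.
sloppy = combine_score(1.0, 1.0, 0.2)
```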
 
+ ## Tasks

+ ### Task 1 — Easy (`task_easy`)
+ - **Objective**: Identify the failure mode from training logs only
+ - **Required sources**: `logs`
+ - **Max steps**: 10
+ - **Failure modes**: exploding gradients, learning rate too high, overfitting, underfitting

+ ### Task 2 — Medium (`task_medium`)
+ - **Objective**: Identify the failure mode from logs + hyperparameter config
+ - **Required sources**: `logs`, `config`
+ - **Max steps**: 15
+ - **Failure modes**: learning rate too low, missing regularization, batch size too small, optimizer misconfiguration

+ ### Task 3 — Hard (`task_hard`)
+ - **Objective**: Identify the failure mode from logs + config + gradients, and provide a concrete fix
+ - **Required sources**: `logs`, `config`, `gradients`
+ - **Max steps**: 20
+ - **Failure modes**: vanishing gradients, dying relu, bad weight initialization, lr scheduler misconfiguration

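The tier definitions above can be mirrored in a small lookup table. `TASKS` and `evidence_complete` are hypothetical names used for illustration; `openenv.yaml` holds the authoritative task specs.

```python
# Hypothetical summary of the three tiers; openenv.yaml is authoritative.
TASKS = {
    "task_easy":   {"required_sources": {"logs"}, "max_steps": 10},
    "task_medium": {"required_sources": {"logs", "config"}, "max_steps": 15},
    "task_hard":   {"required_sources": {"logs", "config", "gradients"}, "max_steps": 20},
}

def evidence_complete(task_id, inspected):
    """True once every required source for the task has been inspected."""
    return TASKS[task_id]["required_sources"] <= set(inspected)

ready = evidence_complete("task_medium", {"logs", "config"})
```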
+ ## Baseline Performance (Qwen/Qwen2.5-72B-Instruct)

+ | Task | Avg Score | Pass Rate |
+ |---|---|---|
+ | Easy | ~0.85 | ~80% |
+ | Medium | ~0.92 | ~100% |
+ | Hard | ~0.93 | ~100% |

+ ## Setup

+ ### Environment Variables

+ | Variable | Default | Required |
+ |---|---|---|
+ | `HF_TOKEN` | — | Yes |
+ | `API_BASE_URL` | `https://router.huggingface.co/v1` | No |
+ | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | No |
+ | `SERVER_URL` | `http://localhost:8000` | No |

+ ### Running Locally

  ```bash
+ # Install dependencies
+ uv sync
+
+ # Start the environment server
+ uvicorn server.app:app --reload
+
+ # Run inference (in another terminal)
+ HF_TOKEN=your_token uv run python inference.py
+ ```

+ ### Docker

  ```bash
+ docker build -t whydiditfail-env:latest .
+ docker run -p 8000:8000 whydiditfail-env:latest
  ```

  ## Project Structure

  ```
  WhyDidItFail/
+ ├── inference.py                 # Baseline inference script
+ ├── client.py                    # WhyDidItFailEnv client (WebSocket)
+ ├── models.py                    # Action and Observation Pydantic models
+ ├── openenv.yaml                 # OpenEnv manifest
+ ├── Dockerfile                   # Container image
  └── server/
+     ├── WhyDidItFail_environment.py  # Core environment logic (step/reset/state)
+     ├── app.py                       # FastAPI server (HTTP + WebSocket)
+     ├── scenarios.py                 # 12 scenario definitions
+     ├── graders.py                   # Programmatic grader
+     └── llm_judge.py                 # LLM-based reasoning quality judge
  ```

+ ## OpenEnv Spec Compliance

+ - Typed `Action` / `Observation` Pydantic models ✓
+ - `step(action)` → `(observation, reward, done, info)` ✓
+ - `reset()` → initial observation ✓
+ - `state()` → current state ✓
+ - `openenv.yaml` with 3 tasks and grader definitions ✓
+ - Passes `openenv validate` ✓