# Architecture

## System Overview

```
Agent (inference.py)
    │
    │  POST /reset, POST /step
    ▼
FastAPI Server (app.py)
    │
    │  reset(), step()
    ▼
MLOpsEnvironment (mlops_environment.py)
    │
    ├── ArtifactGenerator (artifact_generator.py)
    │   ├── BUG_CATALOGUE: 9 bug specs across 3 tiers
    │   └── Procedural generation: config, logs, stats, code, eval, model card
    │
    ├── Sanity Check Engine (artifact_generator.py)
    │   └── 8 computed diagnostics grounded in generated artifacts
    │
    ├── Grader (_handle_submit)
    │   └── 4-component scoring: category + file + field + fix
    │
    └── Models (models.py)
        └── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta
```

## Data Flow

### Episode Lifecycle

```
1. reset(task_id, seed)
   ├── Random(seed) selects bug from task pool
   ├── ArtifactGenerator creates 6 consistent artifacts with planted fault
   └── Returns: MLOpsObservation with task description + artifact metadata

2. step(action) × N
   ├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
   ├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
   ├── query_artifact → return specific field via dot notation
   └── submit_diagnosis → grade against ground truth (terminal)

3. Grading (_handle_submit)
   ├── Compare 4 components against BugSpec ground truth
   ├── Apply hard task penalty if score < 0.70
   └── Return: score ∈ (0.01, 0.99), breakdown, ground truth
```
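
The `query_artifact` action returns a specific field via dot notation. A minimal sketch of such a lookup is shown below; the function name `query_by_path` and the sample `config` content are hypothetical, and the real handler may differ (for example in how it reports missing fields).

```python
from typing import Any


def query_by_path(artifact: dict[str, Any], path: str) -> Any:
    """Walk a nested dict following a dot-separated path such as 'optimizer.lr'."""
    current: Any = artifact
    for key in path.split("."):
        # Stop as soon as the path leaves the dict structure or a key is missing.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"no field {path!r} (failed at {key!r})")
        current = current[key]
    return current


# Hypothetical artifact content, for illustration only.
config = {"optimizer": {"name": "adam", "lr": 0.001}, "epochs": 10}
learning_rate = query_by_path(config, "optimizer.lr")
```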

### Determinism Guarantees

- `random.Random(seed)` for bug selection and artifact variation
- `np.random.RandomState(seed)` for numeric distributions
- No external state, no network calls during generation
- Same (task_id, seed) always produces identical episode
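
The guarantee above can be illustrated with a seeded, instance-local generator. This is a sketch under assumptions: `BUG_POOL`, `make_episode`, and the task ID are hypothetical, and only the standard-library `random.Random` is shown. Per the list above, numeric distributions would be seeded the same way through `np.random.RandomState(seed)`.

```python
import random

# Hypothetical bug pool; the real pool comes from BUG_CATALOGUE.
BUG_POOL = ["bug_a", "bug_b", "bug_c"]


def make_episode(task_id: str, seed: int) -> dict:
    """Build a toy episode whose content depends only on (task_id, seed)."""
    rng = random.Random(seed)  # instance-local RNG, no shared global state
    bug = rng.choice(BUG_POOL)
    loss_curve = [round(rng.uniform(0.1, 2.0), 4) for _ in range(5)]
    return {"task_id": task_id, "bug": bug, "loss_curve": loss_curve}


# Two resets with the same (task_id, seed) yield identical episodes.
first = make_episode("task_1", seed=42)
second = make_episode("task_1", seed=42)
assert first == second
```

Using a dedicated `Random` instance rather than the module-level functions keeps generation independent of any other code that touches the global generator.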

## Component Responsibilities

### app.py: API Layer
- FastAPI server on port 7860
- REST endpoints: `/reset`, `/step`, `/state`, `/health`, `/tasks`
- WebSocket endpoint: `/ws` for streaming interaction
- Stateless request handling; delegates to MLOpsEnvironment

### mlops_environment.py: Core Logic
- Episode state management (step count, artifacts read, score)
- Action routing to handlers
- Grading logic with 4-component scoring
- `grade_task()` standalone grader for OpenEnv validation

### artifact_generator.py: Content Generation
- `BugSpec` dataclass: category, file, field, gold_fix, difficulty
- `BUG_CATALOGUE`: 9 bug specifications
- `ArtifactGenerator`: produces 6 artifacts per episode
- `run_sanity_check()`: 8 computed diagnostic checks
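
For orientation, a `BugSpec` with the five fields listed above might look like the sketch below. The field types, the `frozen=True` choice, and every value in `example` are assumptions for illustration; the actual dataclass in `artifact_generator.py` may carry more fields or different types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugSpec:
    """Ground truth for one planted fault (types are assumed, not confirmed)."""
    category: str    # kind of fault
    file: str        # artifact in which the fault is planted
    field: str       # specific field or location within that artifact
    gold_fix: str    # reference fix used by the grader
    difficulty: str  # tier of the task


# Hypothetical entry, not taken from the real BUG_CATALOGUE.
example = BugSpec(
    category="data_leakage",
    file="preprocessing.py",
    field="scaler",
    gold_fix="fit the scaler on the training split only",
    difficulty="hard",
)
```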

### models.py: Data Models
- `MLOpsAction`: 8 action types with typed parameters
- `MLOpsObservation`: full agent observation per step
- `MLOpsState`: internal state for debugging/RL harness
- `ArtifactMeta`: artifact metadata (name, description, size hint)

### inference.py: Baseline Agent
- LLM-powered agent using Gemini via OpenAI-compatible API
- Investigation phase: reads artifacts, runs sanity checks
- Diagnosis phase: submits structured diagnosis
- Fallback logic for unparseable LLM output
- Rate limiting with exponential backoff
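
Rate limiting with exponential backoff can be sketched as follows. The helper `call_with_backoff` and its parameters are hypothetical and are not taken from `inference.py`. A real agent would likely catch only the provider's rate-limit error rather than every exception, and might add jitter to the delay.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,      # assumed default; must be >= 1
    base_delay: float = 1.0,    # assumed default, in seconds
    max_delay: float = 30.0,    # assumed cap, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, doubling the wait after each failure up to max_delay."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.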

### client.py: Client Library
- `MLOpsDebugEnv`: async httpx client
- `SyncMLOpsDebugEnv`: synchronous wrapper
- Context manager support for connection lifecycle

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | API info |
| GET | `/health` | Health check |
| GET | `/tasks` | List available tasks |
| POST | `/reset` | Start new episode |
| POST | `/step` | Execute action |
| GET | `/state` | Current episode state |
| GET | `/openenv/state` | OpenEnv framework state |
| WS | `/ws` | WebSocket interface |
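
As an illustration of how the POST endpoints might be called, the sketch below builds requests with the standard library only. The host `localhost`, the task ID, and the `action_type` key are assumptions; `task_id` and `seed` follow the `reset(task_id, seed)` signature described earlier, but the exact JSON schema is defined by `models.py` and may differ.

```python
import json
from urllib import request

# Port 7860 comes from this document; the host is an assumption.
BASE_URL = "http://localhost:7860"


def build_post(path: str, payload: dict) -> request.Request:
    """Build a JSON POST request for the given endpoint path."""
    return request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payloads; field names other than task_id and seed are assumed.
reset_req = build_post("/reset", {"task_id": "task_1", "seed": 42})
step_req = build_post("/step", {"action_type": "run_sanity_check"})

# To send against a running server:
# with request.urlopen(reset_req) as resp:
#     observation = json.load(resp)
```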

## Reward Architecture

The reward function has two layers:

**Per-step (dense):** Encourages systematic investigation
- New artifact read: +0.02 (explore broadly)
- Duplicate read: -0.02 (don't brute force)
- New sanity check: +0.01 (use diagnostics)
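
The per-step values above can be expressed as a small function. This is a sketch, not the environment's code: the function name, the `read_config` style action names, and the choice to give repeated sanity checks and other actions a reward of zero are assumptions (the source only specifies the three values listed).

```python
READ_NEW = 0.02         # first read of an artifact
READ_DUPLICATE = -0.02  # repeated read of the same artifact
CHECK_NEW = 0.01        # first run of a given sanity check


def step_reward(action_type: str, target: str,
                seen_reads: set, seen_checks: set) -> float:
    """Return the dense reward for one step and update the seen-sets."""
    if action_type.startswith("read_"):
        if action_type in seen_reads:
            return READ_DUPLICATE
        seen_reads.add(action_type)
        return READ_NEW
    if action_type == "run_sanity_check":
        if target in seen_checks:
            return 0.0  # assumption: repeated checks are neutral
        seen_checks.add(target)
        return CHECK_NEW
    return 0.0  # assumption: other actions carry no dense reward


reads, checks = set(), set()
rewards = [
    step_reward("read_config", "", reads, checks),             # new read
    step_reward("read_config", "", reads, checks),             # duplicate read
    step_reward("run_sanity_check", "check_a", reads, checks), # new check
]
```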

**Terminal (graded):** Evaluates diagnosis quality
- 4 independent components sum to max 1.0
- Keyword/substring matching (no LLM judge)
- Hard task asymmetric penalty (1.5x on missed components)

This two-layer design means that an agent that investigates thoroughly but diagnoses incorrectly still earns per-step rewards, while an agent that submits immediately with a lucky guess earns the terminal reward but misses the exploration bonuses.
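
The terminal layer can be sketched as follows. Several details here are assumptions rather than facts from the source: the four components are weighted equally at 0.25, the hard-task penalty is read as "subtract 1.5 times the missed weight when the raw score is below 0.70", and the final score is clamped to the range 0.01 to 0.99. The sample `truth` values are hypothetical.

```python
COMPONENTS = ("category", "file", "field", "fix")
WEIGHT = 0.25  # assumption: equal weights summing to 1.0


def grade(diagnosis: dict, truth: dict, hard: bool) -> float:
    """Score a diagnosis by case-insensitive substring match per component."""
    earned = 0.0
    for name in COMPONENTS:
        expected = str(truth[name]).lower()
        submitted = str(diagnosis.get(name, "")).lower()
        if expected and expected in submitted:
            earned += WEIGHT
    score = earned
    if hard and score < 0.70:
        # Assumed reading of the asymmetric penalty: missed weight counts 1.5x.
        missed = 1.0 - earned
        score = 1.0 - 1.5 * missed
    return min(0.99, max(0.01, score))  # assumed clamp bounds


# Hypothetical ground truth and a half-correct diagnosis.
truth = {"category": "data_leakage", "file": "preprocessing.py",
         "field": "scaler", "fix": "fit on train"}
partial = {"category": "Data_Leakage issue", "file": "preprocessing.py"}
```

Under these assumptions, a half-correct diagnosis scores lower on a hard task than on an easier one, which is the asymmetry the penalty is meant to create.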