name: mlops-debug-env
version: "1.0.0"
description: >
  MLOps Pipeline Debugger: an AI agent acts as a senior ML engineer
  investigating a broken training run. The environment procedurally generates
  realistic training artifacts (logs, configs, preprocessing code, eval results)
  with one planted fault. The agent must systematically investigate and submit
  a structured diagnosis. Three tasks: config error (easy) -> data leakage (medium)
  -> silent evaluation bug (hard). All graders are fully deterministic.
author: Code Clashers
license: MIT
tags: [openenv, rl, mlops, debugging, machine-learning, agents, pytorch]
grading:
  type: deterministic
  judge: none
  method: keyword_and_substring_matching
  reproducible: true
tasks:
  - id: easy
    name: Config Error Diagnosis
    difficulty: easy
    max_steps: 20
    bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
    reward_range: [0.01, 0.99]
    description: >
      Diagnose a training failure caused by a hyperparameter misconfiguration.
      Symptoms are visible in training logs (loss explosion, oscillation, trivial overfitting).
  - id: medium
    name: Data Leakage Detection
    difficulty: medium
    max_steps: 30
    bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
    reward_range: [0.01, 0.99]
    description: >
      Identify data leakage in the preprocessing pipeline. Val accuracy is suspiciously
      high from epoch 1, but test performance tells a different story. Requires correlating
      logs, eval results, and preprocessing code.
  - id: hard
    name: Silent Evaluation Bug
    difficulty: hard
    max_steps: 40
    bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
    reward_range: [0.01, 0.99]
    asymmetric_penalty: true
    penalty_multiplier: 1.5
    description: >
      Find a silent bug in the evaluation pipeline. Training logs look completely normal.
      No errors, no warnings. Only a val/test metric gap reveals the issue. Requires
      reasoning about what is absent rather than what is present.
action_space:
  type: discrete_structured
  actions:
    - read_config
    - read_logs
    - check_dataset_stats
    - inspect_preprocessing
    - read_eval_results
    - run_sanity_check
    - query_artifact
    - submit_diagnosis
  sanity_check_types:
    - label_consistency
    - data_leakage
    - gradient_norms
    - class_balance
    - feature_statistics
    - encoder_version_match
    - loss_trajectory
    - metric_gap_analysis
observation_space:
  type: structured_text
  fields:
    - task_id
    - task_description
    - run_id
    - run_summary
    - available_artifacts
    - artifacts_read
    - last_action_result
    - step_count
    - max_steps
    - done
    - messages
reward:
  type: dense_and_terminal
  per_step:
    new_artifact_read: +0.02
    duplicate_read: -0.02
    new_sanity_check: +0.01
  terminal:
    failure_category: +0.15
    root_cause_file: +0.25
    root_cause_field: +0.30
    proposed_fix: +0.30
    hard_task_penalty: "if score < 0.70, additional 0.5x on missed components"
api:
  reset: POST /reset
  step: POST /step
  state: GET /state
  health: GET /health
  tasks: GET /tasks
  openenv_state: GET /openenv/state
  websocket: /ws
runtime:
  port: 7860
  workers: 1
  framework: fastapi
  python: "3.11"
  container: docker