name: mlops-debug-env
version: "1.0.0"
description: >
  MLOps Pipeline Debugger: an AI agent acts as a senior ML engineer
  investigating a broken training run. The environment procedurally generates
  realistic training artifacts (logs, configs, preprocessing code, eval results)
  with one planted fault. The agent must systematically investigate and submit
  a structured diagnosis. Three tasks: config error (easy) -> data leakage (medium)
  -> silent evaluation bug (hard). All graders are fully deterministic.
author: Code Clashers
license: MIT
tags: [openenv, rl, mlops, debugging, machine-learning, agents, pytorch]
grading:
  type: deterministic
  judge: none
  method: keyword_and_substring_matching
  reproducible: true
tasks:
  - id: easy
    name: Config Error Diagnosis
    difficulty: easy
    max_steps: 20
    bug_pool: [exploding_lr, wrong_optimizer, batch_size_overflow]
    reward_range: [0.01, 0.99]
    description: >
      Diagnose a training failure caused by a hyperparameter misconfiguration.
      Symptoms are visible in training logs (loss explosion, oscillation, trivial overfitting).
  - id: medium
    name: Data Leakage Detection
    difficulty: medium
    max_steps: 30
    bug_pool: [data_leakage_scaler, data_leakage_overlap, wrong_split_ratio]
    reward_range: [0.01, 0.99]
    description: >
      Identify data leakage in the preprocessing pipeline. Val accuracy is suspiciously
      high from epoch 1, but test performance tells a different story. Requires correlating
      logs, eval results, and preprocessing code.
  - id: hard
    name: Silent Evaluation Bug
    difficulty: hard
    max_steps: 40
    bug_pool: [label_encoder_mismatch, silent_metric_swap, tokenizer_version_drift]
    reward_range: [0.01, 0.99]
    asymmetric_penalty: true
    penalty_multiplier: 1.5
    description: >
      Find a silent bug in the evaluation pipeline. Training logs look completely normal.
      No errors, no warnings. Only a val/test metric gap reveals the issue. Requires
      reasoning about what is absent rather than what is present.
action_space:
  type: discrete_structured
  actions:
    - read_config
    - read_logs
    - check_dataset_stats
    - inspect_preprocessing
    - read_eval_results
    - run_sanity_check
    - query_artifact
    - submit_diagnosis
  sanity_check_types:
    - label_consistency
    - data_leakage
    - gradient_norms
    - class_balance
    - feature_statistics
    - encoder_version_match
    - loss_trajectory
    - metric_gap_analysis
observation_space:
  type: structured_text
  fields:
    - task_id
    - task_description
    - run_id
    - run_summary
    - available_artifacts
    - artifacts_read
    - last_action_result
    - step_count
    - max_steps
    - done
    - messages
reward:
  type: dense_and_terminal
  per_step:
    new_artifact_read: +0.02
    duplicate_read: -0.02
    new_sanity_check: +0.01
  terminal:
    failure_category: +0.15
    root_cause_file: +0.25
    root_cause_field: +0.30
    proposed_fix: +0.30
  hard_task_penalty: "if score < 0.70, additional 0.5x on missed components"
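# Worked check of the terminal weights (arithmetic only, not part of the schema):
#   0.15 + 0.25 + 0.30 + 0.30 = 1.00
# Assuming each component is awarded in full when it matches, a diagnosis that
# gets all four right earns the full terminal weight before any penalty.
# This file does not state how that total and the per-step bonuses are combined
# and mapped into each task's reward_range of [0.01, 0.99]. Clamping by the
# server is one plausible reading, but that is an assumption.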
api:
  reset: POST /reset
  step: POST /step
  state: GET /state
  health: GET /health
  tasks: GET /tasks
  openenv_state: GET /openenv/state
  websocket: /ws
runtime:
  port: 7860
  workers: 1
  framework: fastapi
  python: "3.11"
  container: docker