	# Architecture

	## System Overview

	```
	Agent (inference.py)
	│
	│ POST /reset, POST /step
	▼
	FastAPI Server (app.py)
	│
	│ reset(), step()
	▼
	MLOpsEnvironment (mlops_environment.py)
	│
	├── ArtifactGenerator (artifact_generator.py)
	│   ├── BUG_CATALOGUE: 9 bug specs across 3 tiers
	│   └── Procedural generation: config, logs, stats, code, eval, model card
	│
	├── Sanity Check Engine (artifact_generator.py)
	│   └── 8 computed diagnostics grounded in generated artifacts
	│
	├── Grader (_handle_submit)
	│   └── 4-component scoring: category + file + field + fix
	│
	└── Models (models.py)
	    └── MLOpsAction, MLOpsObservation, MLOpsState, ArtifactMeta
	```

	## Data Flow

	### Episode Lifecycle

	```
	1. reset(task_id, seed)
	   ├── Random(seed) selects bug from task pool
	   ├── ArtifactGenerator creates 6 consistent artifacts with planted fault
	   └── Returns: MLOpsObservation with task description + artifact metadata

	2. step(action) × N
	   ├── read_* actions → return artifact content (reward: +0.02 new, -0.02 duplicate)
	   ├── run_sanity_check → compute diagnostic from artifacts (reward: +0.01 new)
	   ├── query_artifact → return specific field via dot notation
	   └── submit_diagnosis → grade against ground truth (terminal)

	3. Grading (_handle_submit)
	   ├── Compare 4 components against BugSpec ground truth
	   ├── Apply hard task penalty if score < 0.70
	   └── Return: score ∈ (0.01, 0.99), breakdown, ground truth
	```
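
The `query_artifact` step above returns a field "via dot notation". A minimal sketch of how such a lookup could work on a nested dictionary follows. The helper name `query_field` and the sample `config` contents are hypothetical; the actual handler in `mlops_environment.py` may differ.

```python
from typing import Any


def query_field(artifact: dict, path: str) -> Any:
    """Resolve a dotted path such as 'optimizer.lr' inside a nested dict."""
    current: Any = artifact
    for key in path.split("."):
        # Stop as soon as the path leaves the dict structure or a key is missing.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"field not found: {path!r}")
        current = current[key]
    return current


# Hypothetical config artifact, for illustration only.
config = {"optimizer": {"name": "adam", "lr": 0.001}, "batch_size": 32}
lr = query_field(config, "optimizer.lr")
batch_size = query_field(config, "batch_size")
```

Raising `KeyError` for an unknown path is one possible choice; an environment could equally return an error observation so the episode continues.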

	### Determinism Guarantees

	- `random.Random(seed)` for bug selection and artifact variation
	- `np.random.RandomState(seed)` for numeric distributions
	- No external state, no network calls during generation
	- Same (task_id, seed) always produces identical episode
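
The guarantees above rest on seeding a local generator rather than the global one. The sketch below illustrates the idea with the standard library only. `TASK_POOL`, `select_bug`, and `sample_variation` are hypothetical names, and the numeric side (`np.random.RandomState(seed)`) is not shown.

```python
import random
from typing import List

# Hypothetical task pool; the real bug identifiers are not listed in this document.
TASK_POOL = ["bug_a", "bug_b", "bug_c"]


def select_bug(seed: int) -> str:
    """Pick a bug with a generator that depends only on the seed."""
    rng = random.Random(seed)  # local generator, independent of the global random state
    return rng.choice(TASK_POOL)


def sample_variation(seed: int, n: int = 3) -> List[float]:
    """Draw n values; the same seed always yields the same list."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in range(n)]


first = select_bug(42)
random.seed(0)   # disturbing the global generator...
random.random()
second = select_bug(42)  # ...does not change the seeded selection
```

Because each call builds its own `random.Random(seed)`, unrelated code that touches the global `random` module cannot change the episode.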

	## Component Responsibilities

	### app.py — API Layer
	- FastAPI server on port 7860
	- REST endpoints: `/reset`, `/step`, `/state`, `/health`, `/tasks`
	- WebSocket endpoint: `/ws` for streaming interaction
	- Request handlers keep no state of their own; they delegate to MLOpsEnvironment, which manages episode state

	### mlops_environment.py — Core Logic
	- Episode state management (step count, artifacts read, score)
	- Action routing to handlers
	- Grading logic with 4-component scoring
	- `grade_task()` standalone grader for OpenEnv validation

	### artifact_generator.py — Content Generation
	- `BugSpec` dataclass: category, file, field, gold_fix, difficulty
	- `BUG_CATALOGUE`: 9 bug specifications
	- `ArtifactGenerator`: produces 6 artifacts per episode
	- `run_sanity_check()`: 8 computed diagnostic checks
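
As an illustration of the `BugSpec` shape listed above, here is a minimal sketch. The field names come from this document. The field types, the catalogue key, the sample values, and the helper `bugs_for` are assumptions and not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BugSpec:
    """Ground truth for one planted fault (field types are assumed)."""
    category: str
    file: str
    field: str
    gold_fix: str
    difficulty: str


# Hypothetical catalogue entry, for illustration only.
EXAMPLE_CATALOGUE: Dict[str, BugSpec] = {
    "example_bug": BugSpec(
        category="config_error",
        file="config.yaml",
        field="optimizer.lr",
        gold_fix="lower the learning rate",
        difficulty="easy",
    ),
}


def bugs_for(difficulty: str) -> List[str]:
    """Return catalogue keys whose spec matches the given difficulty tier."""
    return [key for key, spec in EXAMPLE_CATALOGUE.items()
            if spec.difficulty == difficulty]
```

Freezing the dataclass keeps the ground truth immutable once an episode starts, which fits the determinism goals described earlier.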

	### models.py — Data Models
	- `MLOpsAction`: 8 action types with typed parameters
	- `MLOpsObservation`: full agent observation per step
	- `MLOpsState`: internal state for debugging/RL harness
	- `ArtifactMeta`: artifact metadata (name, description, size hint)

	### inference.py — Baseline Agent
	- LLM-powered agent using Gemini via OpenAI-compatible API
	- Investigation phase: reads artifacts, runs sanity checks
	- Diagnosis phase: submits structured diagnosis
	- Fallback logic for unparseable LLM output
	- Rate limiting with exponential backoff
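
The last bullet mentions rate limiting with exponential backoff. The sketch below shows the general technique, not the code in `inference.py`. The function name, the retry limits, the 10% jitter, and the choice to retry on any exception are all assumptions.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      max_retries: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, doubling the wait after each failure, up to max_retries attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # assumption: every error is treated as retryable
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


# Demo with a fake call that fails twice; delays are recorded instead of slept.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays: List[float] = []
result = call_with_backoff(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.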

	### client.py — Client Library
	- `MLOpsDebugEnv`: async httpx client
	- `SyncMLOpsDebugEnv`: synchronous wrapper
	- Context manager support for connection lifecycle

	## API Endpoints

	| Method | Path | Description |
	|--------|------|-------------|
	| GET | `/` | API info |
	| GET | `/health` | Health check |
	| GET | `/tasks` | List available tasks |
	| POST | `/reset` | Start new episode |
	| POST | `/step` | Execute action |
	| GET | `/state` | Current episode state |
	| GET | `/openenv/state` | OpenEnv framework state |
	| WS | `/ws` | WebSocket interface |
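
To make the two POST endpoints concrete, the sketch below builds (but does not send) requests with the standard library. The port 7860 and the `task_id`/`seed` parameters come from this document. The host, the JSON body layout, the `action_type` key, and the `"easy"` task id are assumptions; check `models.py` for the real schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port from this document; host is an assumption


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request object without sending it."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payloads: the field names are illustrative, not the confirmed schema.
reset_req = build_post("/reset", {"task_id": "easy", "seed": 42})
step_req = build_post("/step", {"action_type": "run_sanity_check"})

# Sending would look like: urllib.request.urlopen(reset_req), with the server running.
```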

	## Reward Architecture

	The reward function has two layers:

	Per-step (dense): Encourages systematic investigation
	- New artifact read: +0.02 (explore broadly)
	- Duplicate read: -0.02 (don't brute force)
	- New sanity check: +0.01 (use diagnostics)

	Terminal (graded): Evaluates diagnosis quality
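	- Four independent components (category, file, field, fix) sum to at most 1.0; the returned score is bounded to (0.01, 0.99)
	- Keyword/substring matching against the ground truth (no LLM judge)
	- Asymmetric penalty on hard tasks (1.5x on missed components), applied when the score is below 0.70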

	This two-layer design means an agent that investigates thoroughly but diagnoses wrong still earns per-step rewards, while an agent that submits immediately with a lucky guess earns terminal reward but misses exploration bonuses.
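
The two layers can be sketched as follows. The per-step values (+0.02, -0.02, +0.01), the 0.70 threshold, and the 1.5x factor come from this document. Several details are assumptions: equal weights of 0.25 per component, repeated sanity checks earning nothing, treating the (0.01, 0.99) bound as a clamp, and the exact reading of "1.5x on missed components".

```python
from typing import Dict, Set

COMPONENTS = ("category", "file", "field", "fix")
WEIGHT = 0.25  # ASSUMPTION: equal weights; the document only says the sum is at most 1.0


def step_reward(kind: str, name: str, seen: Set[str]) -> float:
    """Dense reward for one investigation step; records the step in `seen`."""
    key = f"{kind}:{name}"
    is_new = key not in seen
    seen.add(key)
    if kind == "read":
        return 0.02 if is_new else -0.02
    if kind == "sanity_check":
        return 0.01 if is_new else 0.0  # ASSUMPTION: repeats earn nothing
    return 0.0


def terminal_score(matched: Dict[str, bool], hard: bool) -> float:
    """Graded reward for a submitted diagnosis."""
    score = sum(WEIGHT for c in COMPONENTS if matched.get(c, False))
    if hard and score < 0.70:
        # ASSUMPTION: missed weight counts 1.5x against the score on hard tasks.
        score = 1.0 - 1.5 * (1.0 - score)
    return min(0.99, max(0.01, score))  # ASSUMPTION: bounds applied as a clamp


seen: Set[str] = set()
explore = [step_reward("read", "config", seen),
           step_reward("read", "config", seen),       # duplicate read
           step_reward("sanity_check", "loss", seen)]
half_right_hard = terminal_score({"category": True, "file": True}, hard=True)
all_right = terminal_score(dict.fromkeys(COMPONENTS, True), hard=True)
```

Under these assumptions, matching two of four components scores 0.5 on a regular task but 0.25 on a hard one, while a fully correct diagnosis is capped at 0.99.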