spec_version: 1
name: flaky_sleuth
type: space
runtime: fastapi
app: server.app:app
port: 8000
version: 0.1.0
description: >
  An RL environment where an LLM agent investigates flaky tests in Python repositories.
  The agent uses tool-like actions to read files, search code, and run tests, then submits
  a terminal verdict for classification, root-cause detection, or fix proposal.
action_type: FlakySleuthAction
observation_type: FlakySleuthObservation
reward_range: (0.001, 0.999)
episode_max_steps: 20
baseline_script: inference.py
tasks:
  - id: task1_classify
    name: Flaky vs Stable Classification
    difficulty: easy
    description: Classify the target test as flaky or stable.
  - id: task2_root_cause
    name: Root Cause Category Identification
    difficulty: medium
    description: Predict flaky-test root-cause category (OD, NOD, TD, TZD, NIO, ID, etc.).
  - id: task3_fix_proposal
    name: Fix Proposal
    difficulty: hard
    description: Propose a concrete fix as unified diff for a known flaky test.
infra:
  vcpu: 2
  memory_gb: 8
  max_inference_minutes: 20
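
The spec above writes `reward_range` as `(0.001, 0.999)`, which a YAML loader would most likely return as a plain string rather than a pair of numbers. The sketch below shows one way a client might interpret that value together with `episode_max_steps`. It is a minimal illustration under stated assumptions, not the environment's actual code: the helper names (`parse_reward_range`, `clamp_reward`, `steps_remaining`) are hypothetical, and the spec does not say whether rewards are clamped into the range or merely expected to fall within it.

```python
import ast

# Values copied from the spec above.
REWARD_RANGE_TEXT = "(0.001, 0.999)"
EPISODE_MAX_STEPS = 20
TASK_DIFFICULTY = {
    "task1_classify": "easy",
    "task2_root_cause": "medium",
    "task3_fix_proposal": "hard",
}


def parse_reward_range(text):
    """Turn the string form '(low, high)' into a (low, high) tuple of floats."""
    low, high = ast.literal_eval(text)
    low, high = float(low), float(high)
    if not low < high:
        raise ValueError(f"invalid reward range: {text!r}")
    return low, high


def clamp_reward(raw, bounds):
    """Assumption: out-of-range rewards are clipped to the declared bounds."""
    low, high = bounds
    return min(max(raw, low), high)


def steps_remaining(steps_taken, max_steps=EPISODE_MAX_STEPS):
    """Number of actions left before the episode step limit is reached."""
    return max(max_steps - steps_taken, 0)


if __name__ == "__main__":
    bounds = parse_reward_range(REWARD_RANGE_TEXT)
    print(clamp_reward(1.0, bounds))
    print(steps_remaining(18))
    print(sorted(TASK_DIFFICULTY))
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are accepted when the range string is parsed.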