Upload folder using huggingface_hub
- README.md +82 -0
- server/rag_debug_env_environment.py +19 -1
README.md
CHANGED
@@ -12,6 +12,88 @@ base_path: /web

 RAGDebugEnv is an OpenEnv-compatible environment for training and evaluating agents that debug broken retrieval pipelines.

+---
+
+## Using the Playground
+
+The playground lets you manually interact with the environment through the web UI, which is useful for understanding the task before writing an agent.
+
+### Workflow
+
+1. **Reset**: Click **Reset** to start a new episode. A task is randomly assigned (1, 2, or 3). The response shows the initial observation: pipeline config, per-query retrieval results, and quality metrics.
+2. **Step**: Fill in **Action Type** and **Params**, then click **Step** to apply an action to the pipeline. The response shows the updated observation and the reward signal.
+3. **Get state**: Click **Get state** at any time to inspect the full server-side state, including the injected faults (hidden from the agent during normal operation).
+4. **Repeat**: Keep stepping until `done: true` appears in the response, or until you are ready to submit.
+
+---
+
+### Action Reference
+
+| Action Type | Params (JSON) | Notes |
+|---|---|---|
+| `adjust_chunk_size` | `{"value": 256}` | int, 64–2048 |
+| `adjust_chunk_overlap` | `{"value": 32}` | int, 0–500; must be < chunk\_size |
+| `adjust_threshold` | `{"value": 0.5}` | float, 0.0–1.0 |
+| `adjust_top_k` | `{"value": 15}` | int, 1–50 |
+| `swap_embedding_model` | `{"model": "medical"}` | `"general"` / `"medical"` / `"legal"` / `"code"` |
+| `toggle_reranking` | `{"enabled": true}` | bool |
+| `adjust_context_limit` | `{"value": 8192}` | int, 512–16384 |
+| `rewrite_query` | `{"query_id": 0, "strategy": "expand"}` | strategy: `"expand"` / `"rephrase"` / `"decompose"` |
+| `submit` | `{}` | Ends the episode. Returns a bonus if the coverage threshold is met. |
+
+Enter **Action Type** as a plain string (e.g. `adjust_threshold`) and **Params** as a JSON object (e.g. `{"value": 0.45}`).
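Review note: the ranges in the Action Reference table can be checked on the client side before an action is sent. The following is a minimal sketch, not part of this commit; `validate_action` and its optional `chunk_size` argument are hypothetical names, and the rule that `submit` takes only empty params is an assumption drawn from the `{}` shown in the table.

```python
from typing import Any, Dict, List, Optional

# Integer ranges copied from the Action Reference table.
INT_RANGES = {
    "adjust_chunk_size": (64, 2048),
    "adjust_chunk_overlap": (0, 500),
    "adjust_top_k": (1, 50),
    "adjust_context_limit": (512, 16384),
}
EMBEDDING_MODELS = {"general", "medical", "legal", "code"}
REWRITE_STRATEGIES = {"expand", "rephrase", "decompose"}


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_action(
    action_type: str,
    params: Dict[str, Any],
    chunk_size: Optional[int] = None,  # hypothetical: current chunk size, if known
) -> List[str]:
    """Return a list of problems; an empty list means the action looks valid."""
    errors: List[str] = []
    if action_type in INT_RANGES:
        low, high = INT_RANGES[action_type]
        value = params.get("value")
        if not _is_int(value) or not low <= value <= high:
            errors.append(f"value must be an int in [{low}, {high}]")
        elif (
            action_type == "adjust_chunk_overlap"
            and chunk_size is not None
            and value >= chunk_size
        ):
            errors.append("value must be < chunk_size")
    elif action_type == "adjust_threshold":
        value = params.get("value")
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            errors.append("value must be a float in [0.0, 1.0]")
    elif action_type == "swap_embedding_model":
        if params.get("model") not in EMBEDDING_MODELS:
            errors.append(f"model must be one of {sorted(EMBEDDING_MODELS)}")
    elif action_type == "toggle_reranking":
        if not isinstance(params.get("enabled"), bool):
            errors.append("enabled must be a bool")
    elif action_type == "rewrite_query":
        if not _is_int(params.get("query_id")):
            errors.append("query_id must be an int")
        if params.get("strategy") not in REWRITE_STRATEGIES:
            errors.append(f"strategy must be one of {sorted(REWRITE_STRATEGIES)}")
    elif action_type == "submit":
        if params:  # assumption: submit takes no parameters
            errors.append("submit expects empty params")
    else:
        errors.append(f"unknown action type: {action_type}")
    return errors
```

Returning a list of messages rather than raising lets a UI or agent report every problem with an action at once.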
+
+---
+
+### Reading the Observation
+
+After each step, the Raw JSON response contains:
+
+```
+pipeline_config: current knob values (what the agent changed)
+query_results: per-query retrieved chunks, coverage score, precision score
+metrics: mean_coverage, mean_precision, n_empty_retrievals, n_context_overflows
+corpus_stats: domain, n_chunks, n_queries, has_near_duplicates
+steps_taken / max_steps: step budget (max 10)
+task_id / task_description: which task is running
+done: true when the episode has ended
+```
+
+`mean_coverage` is the primary signal. A value below ~0.4 means something is broken. Use `n_empty_retrievals` and `n_context_overflows` as secondary indicators to diagnose the cause.
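Review note: the reading order described here (primary signal first, then the secondary counters) could be expressed as a small helper. This is only a sketch under assumptions: `diagnose` is a hypothetical name, the empty-retrieval hint mirrors the example session in this README, and the context-overflow hint is a guess that the README does not state.

```python
from typing import Dict, List


def diagnose(metrics: Dict[str, float]) -> List[str]:
    """Turn the `metrics` part of an observation into human-readable hints."""
    hints: List[str] = []
    # Primary signal: the README treats coverage below ~0.4 as broken.
    if metrics.get("mean_coverage", 0.0) < 0.4:
        hints.append("mean_coverage below ~0.4: the pipeline is likely broken")
    # Secondary signal, as in the README's example session.
    if metrics.get("n_empty_retrievals", 0) > 0:
        hints.append("empty retrievals: threshold may be too high or top-k too small")
    # Assumption (not stated in the README): overflows point at the context limit.
    if metrics.get("n_context_overflows", 0) > 0:
        hints.append("context overflows: context limit may be too small")
    return hints


# Metrics taken from the first response in the example session.
print(diagnose({"mean_coverage": 0.21, "n_empty_retrievals": 3, "n_context_overflows": 0}))
```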
+
+---
+
+### Tasks
+
+| Task | Domain | Difficulty | Success threshold |
+|---|---|---|---|
+| 1 | Software (Python docs) | Easy: 1–2 config faults | mean\_coverage ≥ 0.72 |
+| 2 | Climate (IPCC reports) | Medium: compound faults | mean\_coverage ≥ 0.65 |
+| 3 | Medical (MedRAG textbooks) | Hard: wrong embedding model + multi-hop | mean\_coverage ≥ 0.60 |
+
+The injected faults are hidden from the observation. Use the metrics to infer them, then fix the config. Click **Get state** to reveal the faults for debugging or learning.
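Review note: an agent deciding when to `submit` might compare the current `mean_coverage` with the per-task thresholds from the table. A minimal sketch follows; `SUCCESS_THRESHOLDS` and `meets_threshold` are hypothetical names, and the values are copied from the table rather than read from the server.

```python
# Success thresholds per task id, copied from the Tasks table.
SUCCESS_THRESHOLDS = {1: 0.72, 2: 0.65, 3: 0.60}


def meets_threshold(task_id: int, mean_coverage: float) -> bool:
    """Return True when mean_coverage reaches the task's success threshold."""
    return mean_coverage >= SUCCESS_THRESHOLDS[task_id]


# With the numbers from the example session (Task 1):
# 0.54 is still short of 0.72, while 0.76 clears it.
print(meets_threshold(1, 0.54), meets_threshold(1, 0.76))
```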
+
+---
+
+### Example session (Task 1)
+
+```
+Reset
+→ mean_coverage: 0.21, n_empty_retrievals: 3
+(threshold too high or top-k too small)
+
+Step: adjust_threshold {"value": 0.25}
+→ mean_coverage: 0.54, n_empty_retrievals: 0
+
+Step: adjust_top_k {"value": 20}
+→ mean_coverage: 0.76
+
+Step: submit {}
+→ done: true, terminal_bonus applied
+```
+
+---
+
 The environment simulates retrieval behavior with precomputed similarity matrices so each step is fast, while still exposing realistic debugging actions such as threshold tuning, top-k tuning, embedding model swaps, reranking toggles, and query rewrites.

 ## Current Status
server/rag_debug_env_environment.py
CHANGED
@@ -30,7 +30,7 @@ from uuid import uuid4

 import numpy as np
 from openenv.core.env_server.interfaces import Environment
-from openenv.core.env_server.types import State
+from openenv.core.env_server.types import EnvironmentMetadata, State

 from server.constants import (
     _TASK_DOMAIN,
@@ -264,6 +264,24 @@ class RAGDebugEnvironment(Environment):
     def state(self) -> State:
         return self._state

+    def get_metadata(self) -> EnvironmentMetadata:
+        readme_path = Path(__file__).parent.parent / "README.md"
+        readme_content: Optional[str] = None
+        if readme_path.exists():
+            raw = readme_path.read_text(encoding="utf-8")
+            # Strip YAML frontmatter (--- ... ---) so the UI renders clean Markdown
+            if raw.startswith("---"):
+                end = raw.find("---", 3)
+                if end != -1:
+                    raw = raw[end + 3:].lstrip("\n")
+            readme_content = raw
+        return EnvironmentMetadata(
+            name="RAGDebugEnv",
+            description="Debug broken RAG pipelines by tuning config and swapping embedding models.",
+            readme_content=readme_content,
+            version="1.0.0",
+        )
+
     # ------------------------------------------------------------------
     # Action routing
     # ------------------------------------------------------------------
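Review note: `raw.find("---", 3)` returns the first `---` found anywhere after the opening delimiter, so a frontmatter value that happens to contain three hyphens would end the strip early. A line-based variant is sketched below as one possible alternative, not as the committed implementation; `strip_frontmatter` is a hypothetical helper name.

```python
def strip_frontmatter(text: str) -> str:
    """Remove a leading YAML frontmatter block whose delimiters are lines equal to '---'."""
    lines = text.splitlines(keepends=True)
    # Frontmatter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return text
    # Look for the closing delimiter as a whole line, not as a substring.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "".join(lines[index + 1:]).lstrip("\n")
    # No closing delimiter found: leave the text untouched.
    return text


sample = "---\ntitle: a---b\nbase_path: /web\n---\n\n# RAGDebugEnv\n\n---\n\nBody\n"
print(strip_frontmatter(sample))
```

Matching whole lines also leaves later `---` horizontal rules in the README body intact, since only the first closing delimiter is consumed.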