vankap-grover committed on
Commit b401c21 · verified
1 Parent(s): b2de027

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +82 -0
  2. server/rag_debug_env_environment.py +19 -1
README.md CHANGED
@@ -12,6 +12,88 @@ base_path: /web
 
  RAGDebugEnv is an OpenEnv-compatible environment for training and evaluating agents that debug broken retrieval pipelines.
 
+ ---
+
+ ## Using the Playground
+
+ The playground lets you manually interact with the environment through the web UI, which is useful for understanding the task before writing an agent.
+
+ ### Workflow
+
+ 1. **Reset**: Click **Reset** to start a new episode. A task is randomly assigned (1, 2, or 3). The response shows the initial observation: pipeline config, per-query retrieval results, and quality metrics.
+ 2. **Step**: Fill in **Action Type** and **Params**, then click **Step** to apply an action to the pipeline. The response shows the updated observation and the reward signal.
+ 3. **Get state**: Click **Get state** at any time to inspect the full server-side state, including the injected faults (hidden from the agent during normal operation).
+ 4. **Repeat**: Keep stepping until `done: true` appears in the response, or until you are ready to submit.
+
+ ---
+
+ ### Action Reference
+
+ | Action Type | Params (JSON) | Notes |
+ |---|---|---|
+ | `adjust_chunk_size` | `{"value": 256}` | int, 64–2048 |
+ | `adjust_chunk_overlap` | `{"value": 32}` | int, 0–500; must be < chunk\_size |
+ | `adjust_threshold` | `{"value": 0.5}` | float, 0.0–1.0 |
+ | `adjust_top_k` | `{"value": 15}` | int, 1–50 |
+ | `swap_embedding_model` | `{"model": "medical"}` | `"general"` / `"medical"` / `"legal"` / `"code"` |
+ | `toggle_reranking` | `{"enabled": true}` | bool |
+ | `adjust_context_limit` | `{"value": 8192}` | int, 512–16384 |
+ | `rewrite_query` | `{"query_id": 0, "strategy": "expand"}` | strategy: `"expand"` / `"rephrase"` / `"decompose"` |
+ | `submit` | `{}` | Ends the episode. Returns a bonus if coverage threshold met. |
+
+ Enter **Action Type** as a plain string (e.g. `adjust_threshold`) and **Params** as a JSON object (e.g. `{"value": 0.45}`).
+
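The ranges in the Action Reference table can be checked on the client before a step is sent. The following is a minimal sketch and is not part of this commit: `validate_action`, `NUMERIC_RANGES`, and the error strings are hypothetical names, and whether the server enforces exactly these limits is an assumption.

```python
from typing import Any, Dict, Optional

# Numeric ranges copied from the Action Reference table: (type, min, max).
NUMERIC_RANGES = {
    "adjust_chunk_size": (int, 64, 2048),
    "adjust_chunk_overlap": (int, 0, 500),
    "adjust_threshold": (float, 0.0, 1.0),
    "adjust_top_k": (int, 1, 50),
    "adjust_context_limit": (int, 512, 16384),
}
EMBEDDING_MODELS = {"general", "medical", "legal", "code"}
REWRITE_STRATEGIES = {"expand", "rephrase", "decompose"}


def validate_action(action_type: str, params: Dict[str, Any],
                    chunk_size: Optional[int] = None) -> Optional[str]:
    """Return None if the action looks valid, else a short error message."""
    if action_type in NUMERIC_RANGES:
        kind, low, high = NUMERIC_RANGES[action_type]
        value = params.get("value")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return "value must be a number"
        if kind is int and not isinstance(value, int):
            return "value must be an int"
        if not low <= value <= high:
            return f"value must be in [{low}, {high}]"
        # The overlap rule needs the current chunk size (hypothetical argument).
        if (action_type == "adjust_chunk_overlap" and chunk_size is not None
                and value >= chunk_size):
            return "overlap must be smaller than chunk_size"
        return None
    if action_type == "swap_embedding_model":
        return None if params.get("model") in EMBEDDING_MODELS else "unknown model"
    if action_type == "toggle_reranking":
        return None if isinstance(params.get("enabled"), bool) else "enabled must be a bool"
    if action_type == "rewrite_query":
        query_id = params.get("query_id")
        if isinstance(query_id, bool) or not isinstance(query_id, int):
            return "query_id must be an int"
        return None if params.get("strategy") in REWRITE_STRATEGIES else "unknown strategy"
    if action_type == "submit":
        return None
    return f"unknown action type: {action_type}"
```

For example, `validate_action("adjust_threshold", {"value": 0.45})` returns `None`, while an out-of-range `adjust_top_k` value returns a message naming the allowed interval.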
+ ---
+
+ ### Reading the Observation
+
+ After each step the Raw JSON response contains:
+
+ ```
+ pipeline_config: current knob values (what the agent changed)
+ query_results: per-query retrieved chunks, coverage score, precision score
+ metrics: mean_coverage, mean_precision, n_empty_retrievals, n_context_overflows
+ corpus_stats: domain, n_chunks, n_queries, has_near_duplicates
+ steps_taken / max_steps: step budget (max 10)
+ task_id / task_description: which task is running
+ done: true when the episode has ended
+ ```
+
+ `mean_coverage` is the primary signal. A value below ~0.4 means something is broken. Diagnose from `n_empty_retrievals` and `n_context_overflows` as secondary indicators.
+
+ ---
+
+ ### Tasks
+
+ | Task | Domain | Difficulty | Success threshold |
+ |---|---|---|---|
+ | 1 | Software (Python docs) | Easy: 1–2 config faults | mean\_coverage ≥ 0.72 |
+ | 2 | Climate (IPCC reports) | Medium: compound faults | mean\_coverage ≥ 0.65 |
+ | 3 | Medical (MedRAG textbooks) | Hard: wrong embedding model + multi-hop | mean\_coverage ≥ 0.60 |
+
+ The injected faults are hidden from the observation. Use the metrics to infer them, then fix the config. Click **Get state** to reveal faults for debugging or learning.
+
+ ---
+
+ ### Example session (Task 1)
+
+ ```
+ Reset
+ → mean_coverage: 0.21, n_empty_retrievals: 3
+ (threshold too high or top-k too small)
+
+ Step: adjust_threshold {"value": 0.25}
+ → mean_coverage: 0.54, n_empty_retrievals: 0
+
+ Step: adjust_top_k {"value": 20}
+ → mean_coverage: 0.76
+
+ Step: submit {}
+ → done: true, terminal_bonus applied
+ ```
+
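The reasoning in the example session can be written as a small rule-based policy. This is a sketch under stated assumptions, not the environment's own logic: `suggest_action` is a hypothetical helper, and the observation layout (nested `metrics` and `pipeline_config` dicts with `threshold`, `top_k`, and `context_limit` keys) is assumed from the field list in the README rather than confirmed by this diff. The step sizes (0.25, +10, x2) are arbitrary illustrative choices.

```python
from typing import Any, Dict, Tuple

# Success thresholds copied from the Tasks table.
SUCCESS_THRESHOLD = {1: 0.72, 2: 0.65, 3: 0.60}


def suggest_action(obs: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Pick a next (action_type, params) pair from one observation dict."""
    metrics = obs["metrics"]            # assumed key layout
    config = obs["pipeline_config"]     # assumed key layout
    if metrics["mean_coverage"] >= SUCCESS_THRESHOLD[obs["task_id"]]:
        return "submit", {}
    if metrics["n_empty_retrievals"] > 0:
        # Some queries retrieve nothing: loosen the similarity cut-off.
        new_threshold = max(0.0, round(config["threshold"] - 0.25, 2))
        return "adjust_threshold", {"value": new_threshold}
    if metrics["n_context_overflows"] > 0:
        # Retrieved text does not fit: widen the context limit (cap 16384).
        new_limit = min(16384, config["context_limit"] * 2)
        return "adjust_context_limit", {"value": new_limit}
    if config["top_k"] >= 50:
        # Nothing left for this simple policy to try.
        return "submit", {}
    # Otherwise retrieve more candidates per query (cap 50).
    return "adjust_top_k", {"value": min(50, config["top_k"] + 10)}
```

Fed the three observations from the session above (with an assumed starting config of `threshold` 0.5 and `top_k` 10), this policy proposes the same sequence: lower the threshold, raise top-k, then submit. It does not cover embedding model swaps or query rewrites, which Tasks 2 and 3 are likely to need.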
+ ---
+
  The environment simulates retrieval behavior with precomputed similarity matrices so each step is fast, while still exposing realistic debugging actions such as threshold tuning, top-k tuning, embedding model swaps, reranking toggles, and query rewrites.
 
  ## Current Status
server/rag_debug_env_environment.py CHANGED
@@ -30,7 +30,7 @@ from uuid import uuid4
 
  import numpy as np
  from openenv.core.env_server.interfaces import Environment
- from openenv.core.env_server.types import State
+ from openenv.core.env_server.types import EnvironmentMetadata, State
 
  from server.constants import (
      _TASK_DOMAIN,
@@ -264,6 +264,24 @@ class RAGDebugEnvironment(Environment):
      def state(self) -> State:
          return self._state
 
+     def get_metadata(self) -> EnvironmentMetadata:
+         readme_path = Path(__file__).parent.parent / "README.md"
+         readme_content: Optional[str] = None
+         if readme_path.exists():
+             raw = readme_path.read_text(encoding="utf-8")
+             # Strip YAML frontmatter (--- ... ---) so the UI renders clean Markdown
+             if raw.startswith("---"):
+                 end = raw.find("---", 3)
+                 if end != -1:
+                     raw = raw[end + 3:].lstrip("\n")
+             readme_content = raw
+         return EnvironmentMetadata(
+             name="RAGDebugEnv",
+             description="Debug broken RAG pipelines by tuning config and swapping embedding models.",
+             readme_content=readme_content,
+             version="1.0.0",
+         )
+
      # ------------------------------------------------------------------
      # Action routing
      # ------------------------------------------------------------------
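The frontmatter stripping in `get_metadata` can be exercised on its own. The sketch below lifts the same `startswith` / `find` logic into a standalone function so it can be tested without the environment. `strip_frontmatter` and the sample string are hypothetical and not part of this commit.

```python
def strip_frontmatter(raw: str) -> str:
    """Drop a leading '--- ... ---' YAML frontmatter block, if one is present."""
    if raw.startswith("---"):
        # Search for the closing delimiter after the opening three dashes.
        end = raw.find("---", 3)
        if end != -1:
            return raw[end + 3:].lstrip("\n")
    # No frontmatter, or no closing delimiter: return the text unchanged.
    return raw


sample = "---\ntitle: RAGDebugEnv\nbase_path: /web\n---\n\n# Heading\n"
cleaned = strip_frontmatter(sample)
```

One property of this approach worth knowing: `find` matches the first `---` anywhere after the opening delimiter, so a frontmatter value that itself contains three dashes would end the block early. A line-based scan would avoid that, at the cost of a few more lines.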