---
title: RAGDebugEnv
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
tags:
- openenv
base_path: /web
---
# RAGDebugEnv
> **An OpenEnv-compatible reinforcement learning environment for training AI agents to diagnose and repair broken Retrieval-Augmented Generation pipelines.**
---
## Why This Environment Matters
Retrieval-Augmented Generation has become the architectural backbone of production AI: enterprise search, legal document analysis, clinical decision support, customer support automation, and code assistance all depend on it. But RAG pipelines break silently and often. Retrieval quality degrades without throwing exceptions. Engineers spend hours with no obvious place to look: Is the chunk size wrong? Is the embedding model mismatched to the domain? Is the similarity threshold too aggressive? Is the context window overflowing?
This is not a solved problem. There is no benchmark, no training signal, and no agent that can systematically diagnose and fix a broken RAG pipeline.
**RAGDebugEnv fills that gap.**
It simulates the exact failure modes that engineers encounter in production (wrong embedding models, misconfigured thresholds, duplicate flooding, context overflows) and wraps them in a structured RL environment where agents receive dense, per-step rewards for systematically diagnosing and fixing the pipeline. An agent trained here learns what a senior ML engineer learns over years: how retrieval quality signals map to root causes, and what interventions fix them.
The environment is designed for the research community as a reusable benchmark with:
- **Three task difficulties** spanning software, climate, and medical domains
- **Nine distinct fault types** with mathematically grounded injection mechanisms
- **~1ms per step** via precomputed similarity matrices, enabling thousands of training episodes per minute
- **Full OpenEnv compatibility** for drop-in use with any compliant agent framework
---
## Architecture Overview
```
Agent (inference.py)
    │
    └─► client.py  (OpenEnv EnvClient, HTTP/WebSocket)
            │
            └─► server/app.py  (FastAPI + OpenEnv HTTP server)
                    │
                    └─► RAGDebugEnvironment  (server/rag_debug_env_environment.py)
                            │
                            ├── Precomputed S_true matrices  (corpora/<domain>/S_true_*.npy)
                            │       shape: (n_queries, n_chunks), dtype float32
                            │       one matrix per embedding model × domain
                            │
                            ├── Fault injection math  (server/fault_math.py)
                            │       apply_faults(S_true, config, faults) → S_faulted
                            │
                            └── Ground truth R* sets  (corpora/<domain>/ground_truth.json)
                                    {query_id: [chunk_id, ...]}  built by cross-encoder in Stage 6
```
**Key design principle:** The environment never calls a live vector database. Every `step()` executes in ~1ms because fault injection is pure matrix arithmetic. This is what makes RL feasible: an agent can run tens of thousands of training episodes without waiting for real embedding inference.
---
## How the Corpora Were Built: The 6-Stage Pipeline
The environment's credibility rests on real documents, real embeddings, and real relevance labels. Building that took a six-stage pipeline.
### Stage 1: Document Loading (`corpora/stages/s1_load.py`)
Raw documents are sourced from authoritative, domain-appropriate sources for each of the three task domains:
**Software domain (Task 1):**
- Python 3 official documentation (text archive, `tutorial/`, `library/`, `reference/`, `howto/` sections)
- HuggingFace documentation dataset (`m-ric/huggingface_doc`)
- Target: 50 documents, 300-5,000 words each
**Climate domain (Task 2):**
- Wikipedia articles on climate topics via the Wikipedia REST API: 55 carefully chosen articles spanning core climate science (greenhouse gases, ocean acidification, permafrost), policy (Paris Agreement, carbon tax, emissions trading), energy (solar, wind, nuclear, carbon capture), and ecosystem impacts
- Wikipedia was chosen over alternatives like `climate_fever` (evidence passages of 1-3 sentences, too short) and arXiv papers (deprecated loading scripts)
**Medical domain (Task 3):**
- MedRAG/textbooks dataset: actual medical textbook chapters from Harrison's Principles, Robbins Pathology, Pathoma, Gray's Anatomy, Pharmacology by Katzung, and others. Ten consecutive passages (~1,300 words total) are aggregated per document to produce expository prose rather than isolated exam vignettes
- Wikipedia medical articles (50 articles) as a supplement, covering diseases, treatments, anatomy, and pharmacology
- Medical textbooks were specifically chosen to expose the `WRONG_EMBEDDING_MODEL` fault: dense clinical vocabulary (receptor subtypes, metabolic pathways, cytokines) degrades severely under general-purpose embedding models, making the fault visually obvious from score distributions
Documents pass quality filters: minimum 300 words, maximum 5,000 words, at least 50% alphabetic characters.
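The three filters can be expressed as a single predicate. The following is a minimal sketch, not the code in `s1_load.py`: the function name `passes_quality_filters` is hypothetical, and it assumes the alphabetic share is measured over all characters including whitespace.

```python
def passes_quality_filters(text, min_words=300, max_words=5000, min_alpha_ratio=0.5):
    """Return True if a document passes the three Stage 1 filters."""
    n_words = len(text.split())
    if not (min_words <= n_words <= max_words):
        return False
    # Assumption: the alphabetic share is measured over all characters,
    # whitespace included; the real script may define it differently.
    n_alpha = sum(ch.isalpha() for ch in text)
    return n_alpha / len(text) >= min_alpha_ratio


good_doc = "retrieval " * 400    # 400 words, 90% letters
short_doc = "retrieval " * 10    # only 10 words
numeric_doc = "12345 " * 400     # 400 words, no letters

print(passes_quality_filters(good_doc))     # True
print(passes_quality_filters(short_doc))    # False
print(passes_quality_filters(numeric_doc))  # False
```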
### Stage 2: Token-Level Chunking (`corpora/stages/s2_chunk.py`)
Documents are split using tiktoken's `cl100k_base` encoding, the tokenizer used by OpenAI's embedding models such as `text-embedding-ada-002`. Token-level chunking (rather than word- or character-level) keeps chunk lengths predictable in tokens, which makes it much less likely that a chunk overflows a model's context window by accident.
```
chunk_size    = 512 tokens   (canonical; S_true is computed against this)
chunk_overlap = 50 tokens    (sliding window stride = 462 tokens)
min_chunk     = 100 tokens   (tail chunks shorter than this are dropped)
```
This produces the corpora used at runtime:
- **Software:** 271 chunks
- **Climate:** 612 chunks
- **Medical:** 359 chunks
The canonical chunk_size and overlap are fixed at corpus-build time. The environment's fault injection then *simulates* what happens when an agent changes these parameters, rather than re-chunking actual documents (which would take minutes per step).
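The sliding-window scheme with the canonical Stage 2 settings can be sketched as follows. This is an illustration rather than the contents of `s2_chunk.py`: `chunk_tokens` is a hypothetical name, and plain integers stand in for the token ids that `tiktoken.get_encoding("cl100k_base").encode(text)` would return.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int = 512,
                 chunk_overlap: int = 50, min_chunk: int = 100) -> List[List[int]]:
    """Split a token sequence into overlapping fixed-size windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - chunk_overlap          # 462 with the canonical settings
    chunks = []
    for start in range(0, len(tokens), stride):
        window = list(tokens[start:start + chunk_size])
        if len(window) >= min_chunk:             # drop short tail chunks
            chunks.append(window)
    return chunks


doc_tokens = list(range(1100))                   # stand-in for real token ids
chunks = chunk_tokens(doc_tokens)
print([len(c) for c in chunks])                  # [512, 512, 176]
```

Consecutive chunks share their last and first 50 tokens, and a 1,000-token document would yield only two chunks because its 76-token tail falls below `min_chunk`.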
### Stage 3: Synthetic Query Generation (`corpora/stages/s3_queries.py`)
Queries are generated by GPT-4o-mini from seed chunks, then filtered by a cross-encoder. The process:
1. **Select 25 seed chunks** per domain, preferring chunks that come from diverse source documents, end at a sentence boundary, and have high alphabetic density
2. **Generate 2 queries per chunk** via GPT-4o-mini:
   - `DIRECT`: a specific question the chunk alone completely answers (the answer is explicitly in the text)
   - `PARTIAL`: a question where the chunk provides essential but incomplete context
3. **Filter with cross-encoder** (`cross-encoder/ms-marco-MiniLM-L-6-v2`): only queries where the seed chunk scores ≥ 0.50 relevance are kept. This ensures every query genuinely retrieves something meaningful from its seed chunk, preventing spurious ground truth labels.
Queries are designed to sound natural ("How does insulin resistance develop?", not "According to the passage...").
### Stage 4: Multi-Hop Query Construction (`corpora/stages/s4_multihop.py`)
Multi-hop queries are built for the medical domain only (Task 3). These are questions that *cannot* be answered by either chunk alone but *are* specifically answered when both are read together, which is exactly what tests cross-chunk reasoning.
The algorithm:
1. **Embed all medical chunks** with the medical embedding model in-memory
2. **Compute chunk-to-chunk cosine similarity** across the full chunk matrix
3. **Find candidate pairs** satisfying: similarity in [0.85, 0.97] (similar enough to be clinically related, but not so similar that one chunk already answers the other), from different source books, with a minimum index gap of 15, restricted to mechanism-dense books (Pathoma, Pharmacology/Katzung, Immunology/Janeway, Biochemistry/Lippincott, Physiology/Levy, Pathology/Robbins, Cell Biology/Alberts), sharing at least 1 mechanism term
4. **Prompt GPT-4o-mini** to generate a bridging question for each candidate pair, with explicit instructions that the question must be unanswerable from either chunk alone
5. **Validate with cross-encoder**: both chunks must score above threshold for the generated question
6. **Target: 5-8 multi-hop queries** for the medical domain
Multi-hop queries use 30 specific mechanism terms as eligibility filters (receptor, cytokine, kinase, apoptosis, transporter, etc.) to ensure the selected pairs reflect genuine clinical relationships rather than surface vocabulary overlap.
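The geometric part of the candidate search (steps 2 and 3) can be sketched with NumPy. This is a simplified illustration under stated assumptions: `find_candidate_pairs` is a hypothetical name, the toy embeddings are synthetic, and the book whitelist and shared-mechanism-term filters are omitted.

```python
import numpy as np


def find_candidate_pairs(embeddings, books, sim_range=(0.85, 0.97), min_index_gap=15):
    """Return (i, j, similarity) for chunk pairs that pass the geometric filters."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T                          # chunk-to-chunk cosine similarity
    lo, hi = sim_range
    pairs = []
    for i in range(len(books)):
        for j in range(i + min_index_gap, len(books)):
            if books[i] != books[j] and lo <= sim[i, j] <= hi:
                pairs.append((i, j, float(sim[i, j])))
    return pairs


# Toy data: 20 mutually orthogonal chunks, then chunk 16 is tilted toward chunk 0.
embeddings = np.eye(20)
embeddings[16] = 0.0
embeddings[16, 0] = 0.9
embeddings[16, 16] = np.sqrt(1.0 - 0.9 ** 2)
books = ["Pathoma"] * 10 + ["Katzung"] * 10

pairs = find_candidate_pairs(embeddings, books)
print(pairs)  # a single pair, chunks 0 and 16, with similarity close to 0.9
```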
### Stage 5: Embedding & Similarity Matrix Computation (`corpora/stages/s5_embed.py`)
This is the stage that makes fast RL possible. For each domain, all chunks and queries are embedded with all four models, and cosine similarity is computed between every (query, chunk) pair:
| Model Key | HuggingFace Model | Domain Fit |
|-----------|-------------------|------------|
| `general` | `sentence-transformers/all-MiniLM-L6-v2` | Everyday text |
| `medical` | `NeuML/pubmedbert-base-embeddings` | Biomedical retrieval |
| `legal` | `nlpaueb/legal-bert-base-uncased` | Legal corpora |
| `code` | `sentence-transformers/multi-qa-mpnet-base-dot-v1` | Retrieval-tuned |
Each domain gets four `.npy` files:
```
corpora/<domain>/S_true_general.npy shape: (n_queries, n_chunks), float32
corpora/<domain>/S_true_medical.npy
corpora/<domain>/S_true_legal.npy
corpora/<domain>/S_true_code.npy
```
These are the canonical ground-truth similarity matrices. Every episode loads a subset of rows (the 5 sampled queries) from the appropriate matrix. Fault injection then transforms this subset mathematically, so no re-embedding ever happens at runtime.
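As a rough sketch of what this stage produces, the snippet below builds a cosine similarity matrix from two embedding arrays and then samples five query rows for an episode. Random vectors stand in for real model outputs, and the query count (50) and embedding width (384) are illustrative assumptions, not values taken from the corpora.

```python
import numpy as np


def cosine_similarity_matrix(query_emb, chunk_emb):
    """Cosine similarity between every (query, chunk) pair, stored as float32."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = chunk_emb / np.linalg.norm(chunk_emb, axis=1, keepdims=True)
    return (q @ c.T).astype(np.float32)


rng = np.random.default_rng(0)
query_emb = rng.normal(size=(50, 384))    # hypothetical: 50 queries
chunk_emb = rng.normal(size=(271, 384))   # 271 chunks, as in the software corpus

S_true = cosine_similarity_matrix(query_emb, chunk_emb)

# An episode works on a 5-row slice of the precomputed matrix.
episode_rows = rng.choice(S_true.shape[0], size=5, replace=False)
S_episode = S_true[episode_rows]
print(S_true.shape, S_episode.shape, S_true.dtype)
```

In the real pipeline the matrix would be written once with `np.save` and read back with `np.load` at environment start-up.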
### Stage 6: Cross-Encoder Ground Truth Labeling (`corpora/stages/s6_grade.py`)
The ground-truth relevance set R* for each query is built using `BAAI/bge-reranker-v2-m3`, a high-quality cross-encoder reranker that scores every (query, chunk) pair. A chunk is considered relevant if its cross-encoder score exceeds 0.70.
Seed chunks are always included in R* regardless of cross-encoder score (they were explicitly chosen as the answer source for each query). For multi-hop queries, both seed chunks are always included.
After labeling, calibration stats are printed:
- Mean R* size 1-4 for direct queries (target)
- Mean R* size 2-5 for multi-hop queries (target)
- Thresholds can be adjusted and the stage re-run if R* sets are too sparse or too large
The output, `ground_truth.json`, is the oracle against which every agent action is scored.
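The labeling rule (score above 0.70, seed chunks always kept) is small enough to sketch directly. The function name and the example scores below are hypothetical; in the real stage the scores would come from the `BAAI/bge-reranker-v2-m3` reranker.

```python
def build_relevance_set(ce_scores, seed_chunk_ids, threshold=0.70):
    """Build R* for one query from cross-encoder scores keyed by chunk id."""
    relevant = {cid for cid, score in ce_scores.items() if score > threshold}
    relevant.update(seed_chunk_ids)   # seeds are kept regardless of their score
    return sorted(relevant)


# Hypothetical scores for one query; chunk 12 is the seed but scores low.
ce_scores = {3: 0.91, 7: 0.70, 12: 0.42, 20: 0.88}
r_star = build_relevance_set(ce_scores, seed_chunk_ids=[12])
print(r_star)  # [3, 12, 20]
```

Chunk 7 is excluded because its score does not strictly exceed the threshold, while the seed chunk 12 survives despite scoring 0.42.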
---
## How Fast Real-Time RAG Simulation Works
The core innovation of this environment is the **precomputed similarity matrix + fault injection** approach. Here is the full reasoning chain:
**The problem with simulating real RAG for RL:**
A real RAG pipeline change (say, increasing chunk size) requires re-chunking documents, re-embedding all chunks, rebuilding the vector index, and running retrieval queries. That takes 30-120 seconds per configuration change. An RL agent that trains for 100,000 steps would need roughly 35 to 140 days of wall-clock time.
**The insight:**
The similarity score between a query and a chunk is a continuous function of the embedding model and chunking parameters. Rather than recomputing it exactly for every configuration, we can:
1. Precompute exact similarity scores for one canonical configuration (S_true)
2. Simulate the *effect* of configuration changes as mathematical transformations on those scores
This is `S_faulted = apply_faults(S_true, config, active_faults)`.
Each step executes as:
1. Agent takes an action (e.g., `adjust_threshold(value=0.15)`)
2. `_apply_action()` updates `PipelineConfig`
3. If the action affects similarity scores (chunk_size, chunk_overlap, context_limit, reranking, model swap), `_recompute_S_faulted()` calls `apply_faults()` (pure numpy, ~0.1ms)
4. `_simulate_retrieval()` applies the threshold filter and top-k selection to S_faulted (another ~0.1ms)
5. Metrics are computed, reward is calculated, observation is returned
Total per-step time: **~1ms**. This enables thousands of training episodes per minute.
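Step 4 above, the threshold filter plus top-k selection, can be sketched for a single query row. This is a pure-Python stand-in for the environment's NumPy code: `simulate_retrieval` is a hypothetical name, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def simulate_retrieval(scores, threshold, top_k):
    """Return the retrieved chunk ids for one query, best score first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Taking the top_k best and then filtering by threshold gives the same
    # result as filtering first, because the top_k entries are the highest.
    return [i for i in ranked[:top_k] if scores[i] >= threshold]


row = [0.10, 0.45, 0.32, 0.80, 0.29]   # one row of S_faulted (made-up values)

print(simulate_retrieval(row, threshold=0.30, top_k=2))    # [3, 1]
print(simulate_retrieval(row, threshold=0.30, top_k=10))   # [3, 1, 2]
print(simulate_retrieval(row, threshold=0.90, top_k=10))   # []
```

The last call shows how an overly strict threshold produces an empty retrieval, the pattern counted by `n_empty_retrievals`.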
The noise arrays (`noise[FaultType.CHUNK_TOO_SMALL]`, etc.) are **pre-generated at `reset()` time** and reused across all recomputations within the same episode. This guarantees that the stochastic elements of fault math are deterministic within an episode, so the agent sees a consistent world as it tunes the pipeline, while each episode is seeded differently for training diversity.
---
## The Three Task Types
| | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| **Domain** | Software (Python docs + HF docs) | Climate (Wikipedia) | Medical (textbooks + Wikipedia) |
| **Difficulty** | Easy | Medium | Hard |
| **Fault Complexity** | 1-2 config faults | Compound config faults | Wrong embedding model + config faults |
| **Episode Queries** | 5 (all direct) | 5 (all direct) | 5 (3 direct + 2 multi-hop) |
| **Success Condition** | `task_score ≥ 0.75` | `task_score ≥ 0.75` | `task_score ≥ 0.70` AND `multi_hop_coverage > 0.60` |
| **Possible Faults** | CHUNK_TOO_LARGE + NO_RERANKING, THRESHOLD_TOO_HIGH, TOP_K_TOO_SMALL, CHUNK_TOO_LARGE | THRESHOLD_TOO_LOW + DUPLICATE_FLOODING, TOP_K_TOO_SMALL + CONTEXT_OVERFLOW, DUPLICATE_FLOODING, CONTEXT_OVERFLOW | WRONG_EMBEDDING_MODEL + CHUNK_TOO_LARGE + THRESHOLD_TOO_HIGH (always) |
**Task Score Formula:**
Tasks 1 & 2 reward efficiency (completing the fix in fewer steps):
```
task_score = 0.60 × mean_coverage + 0.25 × mean_precision + 0.15 × efficiency
efficiency = 1.0 - (steps_taken / max_steps)
```
Task 3 drops the efficiency bonus and adds multi-hop coverage as a first-class signal:
```
task_score = 0.55 × mean_coverage + 0.25 × mean_precision + 0.20 × multi_hop_coverage
```
Task 3 explicitly does not reward efficiency because finding the wrong-embedding-model fault often requires multiple diagnostic steps (checking if the score distribution is compressed, trying a model swap and observing the change).
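A worked example of both formulas and the success conditions from the table above, as a small sketch. The function names are hypothetical, and `max_steps=10` in the example call is an illustrative value, not the environment's actual step budget.

```python
def task_score(task_id, mean_coverage, mean_precision,
               steps_taken=0, max_steps=1, multi_hop_coverage=None):
    """Score an episode using the task-specific weighting."""
    if task_id in (1, 2):
        efficiency = 1.0 - steps_taken / max_steps
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if multi_hop_coverage is None:
        raise ValueError("Task 3 requires multi_hop_coverage")
    return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage


def is_success(task_id, score, multi_hop_coverage=None):
    """Apply the per-task success condition."""
    if task_id in (1, 2):
        return score >= 0.75
    return score >= 0.70 and multi_hop_coverage is not None and multi_hop_coverage > 0.60


# Task 1: 0.60*0.9 + 0.25*0.6 + 0.15*(1 - 4/10) = 0.54 + 0.15 + 0.09 = 0.78
score_t1 = task_score(1, mean_coverage=0.9, mean_precision=0.6, steps_taken=4, max_steps=10)

# Task 3: 0.55*0.8 + 0.25*0.5 + 0.20*0.7 = 0.44 + 0.125 + 0.14 = 0.705
score_t3 = task_score(3, mean_coverage=0.8, mean_precision=0.5, multi_hop_coverage=0.7)

print(round(score_t1, 3), is_success(1, score_t1))
print(round(score_t3, 3), is_success(3, score_t3, multi_hop_coverage=0.7))
```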
**Baseline Scores:**
| Task | Random Actions | Zero-Shot Heuristic | Zero-Shot LLM (Qwen2.5-72B) | Target RL Score |
|------|---------------|---------------------|------------------------------|-----------------|
| 1: Software | ~0.15 | ~0.50 | ~0.72 | **> 0.85** |
| 2: Climate | ~0.10 | ~0.45 | ~0.65 | **> 0.80** |
| 3: Medical | ~0.05 | ~0.35 | ~0.55 | **> 0.75** |
Task 3's low zero-shot score reflects the difficulty of identifying the WRONG_EMBEDDING_MODEL fault without systematic experimentation, which is exactly the kind of structured exploration that RL can learn to do reliably.
---
## The Nine Fault Types
Every fault is a mathematically grounded transformation of the S_true matrix. Faults are injected at `reset()` time and remain active throughout the episode; the agent must diagnose and compensate for them through configuration changes.
### 1. `CHUNK_TOO_LARGE`: Score Smearing via Box Filter
**What it simulates:** When chunk_size is too large, a single chunk spans multiple concepts. The embedding averages over all of them and becomes a blurred representation of the document region rather than a focused semantic unit. Retrieval scores get smeared: a chunk that is highly relevant gets diluted by the irrelevant content around it.
**Injection math:**
```python
from scipy.ndimage import uniform_filter1d

filter_size = max(1, round(4 * config_chunk_size / 512))
S = uniform_filter1d(S, size=filter_size, axis=1, mode="nearest")
```
A 1D box filter along the chunk axis smears scores toward neighbors. A larger chunk_size gives a larger filter and therefore more smearing; at chunk_size 128 the filter size is 1 and the smearing disappears. The agent can undo this by reducing chunk_size.
**Fix:** `adjust_chunk_size` to a smaller value (128-256 range typically works)
---
### 2. `CHUNK_TOO_SMALL`: Gaussian Noise from Unstable Embeddings
**What it simulates:** When chunk_size is very small (64-128 tokens), chunks often end mid-sentence. Embedding models trained on full sentences produce unreliable representations for sentence fragments, and the resulting high variance in the embedding space manifests as score noise.
**Injection math:**
```python
overlap_reduction = min(0.5, config_chunk_overlap / 1000.0)
sigma = 0.15 * min(1.0, 512.0 / max(config_chunk_size, 64)) * (1.0 - overlap_reduction)
S = S + sigma * noise[CHUNK_TOO_SMALL]
```
Additive Gaussian noise whose magnitude is capped at 0.15 for chunk_size up to 512 and falls off inversely with chunk_size above that. Crucially, higher chunk_overlap *reduces* the noise (by up to 50% at an overlap of 500): more overlap means more context at chunk boundaries, stabilizing the embeddings. The overlap's effect is computed dynamically every time `_recompute_S_faulted()` is called (this was a deliberate design decision requiring careful implementation).
**Fix:** `adjust_chunk_size` upward, and `adjust_chunk_overlap` to reduce boundary instability
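Evaluating the sigma formula at a few settings shows how the two levers interact. This is only a numeric illustration of the expression above; the helper name `chunk_too_small_sigma` is hypothetical.

```python
def chunk_too_small_sigma(chunk_size, chunk_overlap):
    """Noise scale from the CHUNK_TOO_SMALL injection formula."""
    overlap_reduction = min(0.5, chunk_overlap / 1000.0)
    return 0.15 * min(1.0, 512.0 / max(chunk_size, 64)) * (1.0 - overlap_reduction)


for size, overlap in [(64, 0), (512, 50), (1024, 50), (1024, 500)]:
    print(size, overlap, round(chunk_too_small_sigma(size, overlap), 4))
# (64, 0)     -> 0.15     no reduction at all
# (512, 50)   -> 0.1425   overlap alone trims 5%
# (1024, 50)  -> 0.0713   size above 512 halves the noise
# (1024, 500) -> 0.0375   both levers combined
```

Under this formula the chunk-size term only starts to help once chunk_size exceeds 512, so for smaller sizes the overlap is the only lever that lowers the noise.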
---
### 3. `THRESHOLD_TOO_HIGH`: Score Deflation
**What it simulates:** When the corpus uses a domain-specific vocabulary that the embedding model handles poorly (e.g., medical terminology with a general model), all cosine similarity scores are systematically compressed into a low range. Relevant chunks that would normally score 0.7-0.8 now score roughly 0.39-0.44, so a threshold tuned for the original score range filters out most of them.
**Injection math:**
```python
S = S * 0.55
```
Multiplicative deflation of all scores to 55% of their original value (a 45% reduction). After deflation, no chunk exceeds ~0.55 similarity, so a default threshold of 0.3 starts cutting into relevant chunks. The agent observes empty or near-empty retrievals and must lower the threshold to compensate.
**Fix:** `adjust_threshold` to a much lower value (0.05-0.15 range)
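A small numeric illustration of why lowering the threshold compensates for the deflation. The scores are made-up values for three relevant chunks, and `surviving` is a hypothetical helper.

```python
DEFLATION = 0.55  # multiplicative factor applied by THRESHOLD_TOO_HIGH


def surviving(scores, threshold, deflate=True):
    """Scores that remain at or above the threshold, optionally after deflation."""
    factor = DEFLATION if deflate else 1.0
    return [s * factor for s in scores if s * factor >= threshold]


relevant_scores = [0.72, 0.78, 0.50]   # hypothetical pre-fault scores

print(len(surviving(relevant_scores, 0.30, deflate=False)))  # 3: all pass without the fault
print(len(surviving(relevant_scores, 0.30)))                 # 2: 0.50 deflates to 0.275
print(len(surviving(relevant_scores, 0.10)))                 # 3: a lower threshold recovers it
```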
---
### 4. `THRESHOLD_TOO_LOW`: Noise Cluttering Retrieval
**What it simulates:** When the similarity threshold is too permissive, irrelevant chunks score high by chance and flood the retrieved set. This tanks precision without necessarily affecting coverage.
**Injection math:**
```python
S = S + 0.10 * noise[THRESHOLD_TOO_LOW]
```
Additive Gaussian noise lifts irrelevant chunks into the retrieval band. The agent observes high recall but poor precision and many retrieved chunks, and must raise the threshold.
**Fix:** `adjust_threshold` upward, `toggle_reranking` to filter noise
---
### 5. `TOP_K_TOO_SMALL`: Score Compression
**What it simulates:** Some retrieval backends only return a small number of candidates (top-k=2 or 3), and when the relevant chunks are not all in that tiny set, coverage collapses. The score compression simulates the effect of having so few candidates that the ranking becomes unreliable: relative differences between scores are small, making threshold-based filtering less effective.
**Injection math:**
```python
compress = 0.24 if not config_use_reranking else 0.65
S = 0.5 + (S - 0.5) * compress
```
Score range is compressed toward 0.5. Without reranking, compression is severe (24% of original spread); with reranking, a cross-encoder partially restores the true ranking signal (65% of original spread). The agent starts with `top_k` set to 2 or 3 when this fault is active, so relevant chunks are simply missed.
**Fix:** `adjust_top_k` to a larger value (15-25), `toggle_reranking`
---
### 6. `DUPLICATE_FLOODING`: Boosted Duplicate Chunks
**What it simulates:** Near-duplicate documents in the corpus (e.g., the same article published in multiple places) crowd the top-k retrieved set. Relevant chunks are displaced by high-scoring duplicates that provide no additional information.
**Injection math:**
```python
import numpy as np

boost = 0.08 if config_use_reranking else 0.20
S[:, dupe_ids] = np.minimum(S[:, dupe_ids] + boost, 1.0)  # clip boosted scores at 1.0
```
A random 14% of chunks are designated as duplicates (selected at `reset()` time). Their scores are boosted by 0.20 (or 0.08 with reranking active). The agent observes high top-k utilization but low precision: many retrieved chunks are near-duplicates of each other.
**Fix:** `toggle_reranking` (cross-encoder sharply reduces the boost effect), `adjust_top_k`
---
### 7. `CONTEXT_OVERFLOW`: Tail Column Zeroing
**What it simulates:** When the context window limit is too small, chunks beyond the cutoff cannot be included in the LLM context window even if retrieved. The environment simulates this by zeroing out similarity scores for chunks whose index falls beyond the context cutoff.
**Injection math:**
```python
cutoff = max(1, int(n_chunks * config_context_limit / 16384))
S[:, cutoff:] = 0.0
```
Chunks beyond the cutoff position are effectively invisible to retrieval because their scores are zeroed. The cutoff scales proportionally to `context_window_limit`, so increasing the context limit moves the cutoff further out, restoring access to more chunks.
**Fix:** `adjust_context_limit` to a larger value (8192-16384 range)
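Plugging the climate corpus size (612 chunks) into the cutoff formula gives a feel for how much of the corpus each context limit leaves visible. The helper name `overflow_cutoff` is hypothetical; the arithmetic follows the injection formula above.

```python
def overflow_cutoff(n_chunks, context_limit, max_limit=16384):
    """Index of the first chunk column that CONTEXT_OVERFLOW zeroes out."""
    return max(1, int(n_chunks * context_limit / max_limit))


for limit in (512, 4096, 8192, 16384):
    print(limit, overflow_cutoff(612, limit))
# 512   -> 19
# 4096  -> 153   (the default limit leaves a quarter of the chunks visible)
# 8192  -> 306
# 16384 -> 612   (no chunks are zeroed)
```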
---
### 8. `NO_RERANKING`: Additive Noise Without Cross-Encoder
**What it simulates:** Without a cross-encoder reranker, the retrieval relies solely on approximate vector similarity, which is inherently noisier than the full query-document attention a cross-encoder uses. Minor score perturbations can cause relevant chunks to fall just below the threshold.
**Injection math:**
```python
if not config_use_reranking:
    S = S + 0.10 * noise[NO_RERANKING]
```
Mild additive Gaussian noise is applied only when reranking is disabled. Enabling reranking skips this fault entirely and additionally blends faulted scores back toward S_true (a 35% blending toward the pre-fault matrix, simulating the cross-encoder's ability to recover true relevance).
**Fix:** `toggle_reranking(enabled=true)`
---
### 9. `WRONG_EMBEDDING_MODEL`: Fundamentally Wrong Score Distribution
**What it simulates:** Using a general-purpose embedding model on a specialized domain (e.g., `all-MiniLM-L6-v2` on medical textbook content) produces systematically poor retrieval. The model's vocabulary and training distribution do not align with the domain: clinical terms like "receptor", "cytokine", and "metabolic pathway" are underrepresented in its training data, resulting in compressed, unreliable similarity scores.
**Injection mechanism:** Unlike other faults, WRONG_EMBEDDING_MODEL is **implicit**: Task 3 starts with `embedding_model=LEGAL` (the LEGAL model for medical data, the most mismatched combination). The `_recompute_S_faulted()` method selects which S_true matrix to load:
```python
if WRONG_EMBEDDING_MODEL in active_faults:
    model_key = _MODEL_FILE[self._active_model]  # uses whatever model is currently active
else:
    model_key = "general"  # locked to general on tasks 1 & 2
The LEGAL model's S_true matrix on medical text has a very different score distribution (compressed range, lower mean relevance for truly relevant chunks) compared to the MEDICAL model. The agent observes this as compressed retrieval scores with low variance (`std < 0.05`) across per-query results, a diagnostic hint the system explicitly surfaces.
**Fix:** `swap_embedding_model(model="medical")` or `swap_embedding_model(model="general")`
---
### Cross-Encoder Reranking Simulation
When `use_reranking=True`, `apply_faults()` blends the faulted matrix back toward the pre-fault scores:
```python
if config_use_reranking:
    rerank_alpha = 0.35
    S = (1.0 - rerank_alpha) * S + rerank_alpha * S_clean
```
This is a principled simulation of a cross-encoder's effect: it operates on the full query-document pair (seeing both together rather than separately encoded), which partially undoes noise-based corruption and restores compressed score spreads.
---
## What the Agent Observes
Every `step()` returns a `RAGDebugObservation` with:
### `pipeline_config`: Current Configuration
The full set of tunable parameters:
```
chunk_size            int    64-2048                      current: 512
chunk_overlap         int    0-500                        current: 50
similarity_threshold  float  0.0-1.0                      current: 0.30
top_k                 int    1-50                         current: 10
embedding_model       enum   general|medical|legal|code
use_reranking         bool                                current: false
context_window_limit  int    512-16384                    current: 4096
```
### `query_results`: Per-Query Retrieval Results
For each of the 5 episode queries:
```
query_id             int
query_text           str
retrieved_chunk_ids  List[int]
retrieval_scores     List[float]
n_retrieved          int
coverage_score       float        |R_agent ∩ R*| / |R*|
precision_score      float        |R_agent ∩ R*| / |R_agent|
is_multi_hop         bool
```
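The two per-query scores are plain set ratios. A minimal sketch follows; the function name is hypothetical, and returning 0.0 when a set is empty is an assumption about how the environment handles that edge case.

```python
def coverage_and_precision(retrieved_chunk_ids, relevant_chunk_ids):
    """Compute |R_agent ∩ R*| / |R*| and |R_agent ∩ R*| / |R_agent|."""
    retrieved, relevant = set(retrieved_chunk_ids), set(relevant_chunk_ids)
    hits = len(retrieved & relevant)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision


coverage, precision = coverage_and_precision([4, 9, 17, 23], [9, 23, 31])
print(round(coverage, 3), precision)  # 0.667 0.5
```

Here two of the three relevant chunks were retrieved (coverage 2/3), and two of the four retrieved chunks were relevant (precision 0.5).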
### `metrics`: Aggregate Quality
```
mean_coverage float mean of per-query coverage_score
mean_precision float mean of per-query precision_score
mean_recall float numerically equal to mean_coverage (tracked separately for clarity)
n_empty_retrievals int queries where n_retrieved == 0
n_context_overflows int queries where token sum exceeds context_window_limit
multi_hop_coverage float? mean coverage on multi-hop queries only (None for Tasks 1 & 2)
```
### `diagnostic_hints`: Context-Aware Hints
The environment generates up to 3 natural-language hints based on the current metric pattern:
- "N queries have empty retrievals: lower threshold or increase top_k"
- "Score variance is low (std < 0.05): possible wrong embedding model"
- "Context overflow detected: increase context_window_limit"
- "Coverage low but precision decent: top_k may be too small"
### `reward_components`: Named Reward Breakdown
The full component decomposition of the last step's reward, useful for debugging agent behavior and understanding which aspects of the pipeline the reward function is tracking.
### `last_action_error`: Invalid Action Feedback
If the agent attempted an invalid configuration (e.g., `chunk_overlap >= chunk_size`, out-of-range values, unknown embedding model name), this field contains the validation error message. The agent is expected to learn to avoid such errors.
### Intentional omissions
The `injected_faults` list is **never exposed** in the observation. The agent must infer the fault type purely from the metric signatures; that inference IS the task.
---
## The Agent's Action Space
| Action | Parameters | Valid Range | Effect |
|--------|-----------|-------------|--------|
| `adjust_chunk_size` | `{"value": int}` | 64-2048 | Changes chunk size; modulates CHUNK_TOO_LARGE smearing severity |
| `adjust_chunk_overlap` | `{"value": int}` | 0-500 | Changes overlap; reduces CHUNK_TOO_SMALL noise at boundaries |
| `adjust_threshold` | `{"value": float}` | 0.0-1.0 | Threshold filter for retrieved chunks |
| `adjust_top_k` | `{"value": int}` | 1-50 | Number of candidates to retrieve per query |
| `swap_embedding_model` | `{"model": str}` | general/medical/legal/code | Switches the active embedding model |
| `toggle_reranking` | `{"enabled": bool}` | n/a | Enables/disables cross-encoder reranking simulation |
| `adjust_context_limit` | `{"value": int}` | 512-16384 | Shifts the context overflow cutoff |
| `rewrite_query` | `{"query_id": int, "strategy": str}` | strategy: "rephrase" | Boosts a specific query's scores toward R* by +0.20 |
| `submit` | `{}` | n/a | Ends the episode and triggers grading |
**Important constraint:** `chunk_overlap` must be strictly less than `chunk_size`. The environment validates this via Pydantic's model validator and returns a `last_action_error` if violated; the config is not updated in that case.
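The reject-and-keep-old-config behavior can be sketched without Pydantic. The snippet below swaps in a frozen standard-library dataclass purely for illustration; `PipelineConfigSketch`, `validate`, and `apply_update` are hypothetical names, and only two of the seven parameters are modeled.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50

    def validate(self) -> Optional[str]:
        """Return an error message, or None if the config is valid."""
        if not 64 <= self.chunk_size <= 2048:
            return "chunk_size must be in [64, 2048]"
        if not 0 <= self.chunk_overlap <= 500:
            return "chunk_overlap must be in [0, 500]"
        if self.chunk_overlap >= self.chunk_size:
            return "chunk_overlap must be strictly less than chunk_size"
        return None


def apply_update(config, **changes) -> Tuple[PipelineConfigSketch, Optional[str]]:
    """Apply changes only if the resulting config is valid."""
    candidate = replace(config, **changes)
    error = candidate.validate()
    if error is not None:
        return config, error          # keep the old config, report the error
    return candidate, None


config, error = apply_update(PipelineConfigSketch(), chunk_size=64)
print(config.chunk_size, error)       # 64 None

config, error = apply_update(config, chunk_overlap=64)
print(config.chunk_overlap, error)    # 50 chunk_overlap must be strictly less than chunk_size
```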
**REWRITE_QUERY mechanics:** Internally, the environment adds a persistent +0.20 boost overlay to the similarity scores of R* chunks for the specified query. This simulates query expansion or reformulation improving recall for a specific question.
---
## How the Reward Function Works
All rewards are bounded to **[0.0, 1.0]**. The reward design deliberately avoids sparse terminal-only rewards: every step provides a learning signal tied to the current state of the pipeline.
### Non-Terminal Step Reward Components
**`progress_reward`: Absolute Quality Level Signal (range: [0.10, 0.65])**
```python
quality_target = 0.75 # (0.70 for task 3)
current_quality = quality_score(new_metrics)
progress = min(1.0, current_quality / quality_target)
progress_reward = 0.10 + 0.55 * progress
```
This is the primary signal. Even a terrible state (quality ≈ 0) receives 0.10 rather than 0.00, which ensures a gradient everywhere. At the success threshold, progress_reward reaches 0.65. This ensures the full reward range is utilized across the episode rather than being concentrated in the last few steps.
**`delta_bonus`: Direction Signal (range: [-0.15, +0.15])**
```python
q_delta = current_quality - prev_quality
delta_bonus = clip(q_delta * 2.0, -0.15, +0.15)
```
Distinguishes an improving step from a no-op at the same quality level. A large positive improvement gives +0.15; a large regression gives -0.15. The ×2.0 amplification and ±0.15 cap ensure individual steps cannot dominate the reward signal.
**`empty_retrieval_signal`: Bidirectional Empty Retrieval Feedback (range: [-0.06, +0.06])**
```python
empty_change = prev.n_empty_retrievals - new.n_empty_retrievals
empty_retrieval_signal = clip(empty_change / n_queries, -1.0, +1.0) * 0.06
```
Positive when empty retrievals decrease (queries are now returning results). Negative when empty retrievals increase. Normalized by the total number of queries.
**`overflow_signal`: Bidirectional Overflow Feedback (range: [-0.04, +0.04])**
Identical structure to `empty_retrieval_signal`, tracking context overflows instead.
**`step_cost`: Efficiency Pressure (fixed: -0.01)**
A small fixed cost per step that encourages the agent to solve tasks efficiently rather than taking unnecessary actions.
**`redundancy_penalty` (-0.04 if triggered)**
Applied when the agent takes the same action type consecutively. Discourages uninformative repeat actions like adjusting the threshold twice in a row.
**`invalid_action_penalty` (-0.05 if triggered)**
Applied when an action violates configuration constraints (out of range, `chunk_overlap` ≥ `chunk_size`, etc.).
### Reward Combination
```python
raw = sum(all_components.values())
value = clip(raw, 0.0, 1.0)
```
Typical step reward ranges:
- Terrible state, no improvement: **≈ 0.09**
- Mid quality, no change: **≈ 0.42**
- At success threshold: **≈ 0.64**
- Large improvement step: **up to 0.89**
- Large regression with penalties: **clipped to 0.00**
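The typical values above can be reproduced with a short sketch that sums the components described in this section. The function signature and the `n_queries=5` default are assumptions for illustration, and the overflow term is normalized the same way as the empty-retrieval term because the text describes the two as structurally identical.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def step_reward(
    quality: float,
    prev_quality: float,
    empty_change: int = 0,       # previous empty retrievals minus new
    overflow_change: int = 0,    # previous overflows minus new
    n_queries: int = 5,
    quality_target: float = 0.75,
    redundant: bool = False,
    invalid: bool = False,
) -> float:
    progress = min(1.0, quality / quality_target)
    components = {
        "progress_reward": 0.10 + 0.55 * progress,
        "delta_bonus": clip((quality - prev_quality) * 2.0, -0.15, 0.15),
        "empty_retrieval_signal": clip(empty_change / n_queries, -1.0, 1.0) * 0.06,
        "overflow_signal": clip(overflow_change / n_queries, -1.0, 1.0) * 0.04,
        "step_cost": -0.01,
        "redundancy_penalty": -0.04 if redundant else 0.0,
        "invalid_action_penalty": -0.05 if invalid else 0.0,
    }
    return clip(sum(components.values()), 0.0, 1.0)


print(round(step_reward(0.0, 0.0), 2))      # 0.09: terrible state, no change
print(round(step_reward(0.75, 0.75), 2))    # 0.64: at the success threshold
print(round(step_reward(0.75, 0.0, empty_change=5, overflow_change=5), 2))  # 0.89
print(step_reward(0.0, 0.75, redundant=True, invalid=True))  # 0.0 after clipping
```

The last call sums to -0.15 before clipping, which is why large regressions with penalties land exactly on the lower bound.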
### Terminal Reward (SUBMIT)
When the agent calls `submit`, the environment computes the final task score and returns a terminal reward in its own zone, clearly separable from non-terminal rewards:
```python
if check_success(metrics, task_score):
    terminal_value = clip(0.7 + 0.3 * task_score, 0.7, 1.0)  # SUCCESS: [0.7, 1.0]
else:
    terminal_value = clip(0.2 * task_score, 0.0, 0.2)        # FAILURE: [0.0, 0.2]
```
Success and failure zones are disjoint, so an agent cannot confuse a strong terminal failure with a weak terminal success. Submitting immediately on an unimproved episode typically yields a terminal reward of ~0.04–0.08.
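A small runnable version of the two zones makes the gap concrete. Passing `success` as a boolean is an assumption of this sketch; in the environment the flag comes from `check_success(metrics, task_score)`. The 0.2 to 0.4 task-score range used below is simply what the ~0.04–0.08 figure implies under the failure formula.

```python
def terminal_reward(task_score: float, success: bool) -> float:
    """Map a task score into the success zone [0.7, 1.0] or failure zone [0.0, 0.2]."""
    if success:
        return min(1.0, max(0.7, 0.7 + 0.3 * task_score))
    return min(0.2, max(0.0, 0.2 * task_score))


# The best possible failure is still far below the worst possible success.
assert terminal_reward(1.0, success=False) == 0.2
assert terminal_reward(0.0, success=True) == 0.7

# Immediate submission: a failing task score of 0.2 to 0.4 maps to 0.04 to 0.08.
for score in (0.2, 0.3, 0.4):
    print(score, round(terminal_reward(score, success=False), 2))
```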
---
## How `reset()` Ensures a Clean Episode
`reset(seed, task_id)` is a comprehensive initialization that makes every episode reproducible and independent:
```
1. Reset bookkeeping
   _done = False, _prev_action_type = None, _last_action_error = None
   _state = State(episode_id=new_uuid, step_count=0)
2. Seed the RNG
   rng = np.random.default_rng(seed)
   All randomness in the episode flows from this single seed.
3. Load domain corpus
   chunks, queries, ground_truth, corpus_stats, s_true matrices
4. Sample 5 episode queries
   Task 3: 3 direct + 2 multi-hop
   Tasks 1 & 2: 5 direct queries
5. Slice S_true to episode query rows
   _s_true_episode[model_name] = s_true_full[ep_query_row_indices, :]
6. Pre-generate noise arrays (unit normal, deterministic from seed)
   _noise[CHUNK_TOO_SMALL] shape: (n_queries, n_chunks)
   _noise[THRESHOLD_TOO_LOW] shape: (n_queries, n_chunks)
   _noise[NO_RERANKING] shape: (n_queries, n_chunks)
   _dupe_ids = random 14% of chunk indices (for DUPLICATE_FLOODING)
7. Initialize config
   PipelineConfig() with defaults
   Task 3: set embedding_model=LEGAL (the wrong model for medical data)
   Randomize top_k (5–8) and threshold (0.34–0.48) from seed
8. Fault-specific config nudges
   TOP_K_TOO_SMALL active → top_k = rng.integers(2, 4)
   DUPLICATE_FLOODING active → top_k = rng.integers(4, 8)
9. Sample and inject faults
   Task 1: one of 4 possible fault configurations (randomly selected)
   Task 2: one of 4 possible compound fault configurations
   Task 3: always [WRONG_EMBEDDING_MODEL, CHUNK_TOO_LARGE, THRESHOLD_TOO_HIGH]
10. Calibrate initial difficulty
    Compute initial metrics; if coverage already exceeds task cap,
    nudge threshold upward and reduce top_k to ensure the episode
    starts in an improvable state.
11. Initial S_faulted computation
    _recompute_S_faulted() → _S_faulted = apply_faults(S_true, config, faults)
12. Return initial observation
    Computes query results, metrics, diagnostic hints, and returns
    the full RAGDebugObservation for the agent's first step.
```
The noise pre-generation in step 6 is particularly important: stochastic fault components are stable within an episode (the agent sees a consistent world) but vary across episodes (the agent must generalize). The `seed` parameter makes any specific episode fully reproducible for debugging.
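The stable-within, varying-across property can be demonstrated with a minimal sketch. To stay dependency-free it uses the standard library's `random.Random` rather than the NumPy generator the environment uses, and `pregenerate_noise` is a hypothetical helper name; only the fault names and the `(n_queries, n_chunks)` shape come from the steps above.

```python
import random
from typing import Dict, List

FAULTS = ("CHUNK_TOO_SMALL", "THRESHOLD_TOO_LOW", "NO_RERANKING")


def pregenerate_noise(seed: int, n_queries: int, n_chunks: int) -> Dict[str, List[List[float]]]:
    """Draw one unit-normal (n_queries x n_chunks) array per stochastic fault."""
    rng = random.Random(seed)  # single source of randomness for the episode
    return {
        fault: [[rng.gauss(0.0, 1.0) for _ in range(n_chunks)] for _ in range(n_queries)]
        for fault in FAULTS
    }


a = pregenerate_noise(seed=7, n_queries=5, n_chunks=271)
b = pregenerate_noise(seed=7, n_queries=5, n_chunks=271)
c = pregenerate_noise(seed=8, n_queries=5, n_chunks=271)

assert a == b   # same seed: identical noise, so the episode can be replayed
assert a != c   # different seed: a different episode
```

Because the arrays are drawn once at reset, recomputing the faulted scores after each action reuses the same noise instead of resampling it.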
---
## GRPO Training Scaffold (`outputs/train_grpo.py`)
The repository includes a working Group Relative Policy Optimization (GRPO) scaffold for training agents, with a model served through an OpenAI-compatible API acting as the policy.
GRPO is well-suited to this environment because:
- Episodes provide a dense scalar reward at every step (not just terminal)
- The environment runs at ~1ms/step, enabling many rollouts per batch
- GRPO requires no value network; it normalizes rewards within the group
The core normalization:
```python
def grpo_normalize(rollouts):
    # Group statistics over each rollout's total reward
    rewards = [r.total_reward for r in rollouts]
    mean_r = sum(rewards) / len(rewards)
    variance = sum((r - mean_r)**2 for r in rewards) / len(rewards)
    std_r = variance ** 0.5
    # The epsilon keeps the division finite when all rewards are equal
    for rollout in rollouts:
        rollout.normalized_reward = (rollout.total_reward - mean_r) / (std_r + 1e-8)
```
Rollouts better than the group average get positive normalized rewards; those worse get negative. This relative baseline is what makes GRPO work without a learned value function.
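A short usage sketch shows what the normalization produces for one group. The `Rollout` dataclass here is a hypothetical minimal stand-in carrying only the two fields the function touches; the scaffold's real rollout type presumably holds more.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical minimal rollout record for this example."""
    total_reward: float
    normalized_reward: float = 0.0


def grpo_normalize(rollouts: List[Rollout]) -> None:
    rewards = [r.total_reward for r in rollouts]
    mean_r = sum(rewards) / len(rewards)
    variance = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    std_r = variance ** 0.5
    for rollout in rollouts:
        rollout.normalized_reward = (rollout.total_reward - mean_r) / (std_r + 1e-8)


group = [Rollout(2.0), Rollout(3.0), Rollout(3.0), Rollout(4.0)]
grpo_normalize(group)
print([round(r.normalized_reward, 3) for r in group])

# A group with identical rewards carries no preference signal: every
# advantage is zero, and the epsilon avoids a division by zero.
flat = [Rollout(1.0) for _ in range(4)]
grpo_normalize(flat)
assert all(r.normalized_reward == 0.0 for r in flat)
```

In the first group the mean is 3.0, so the 2.0 rollout gets a negative value, the two 3.0 rollouts get zero, and the 4.0 rollout gets a positive value of the same magnitude.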
Training data is saved to `outputs/grpo_data.jsonl` in a format directly compatible with TRL's `GRPOTrainer` for gradient-based fine-tuning of smaller models (e.g., Qwen2.5-1.5B).
```bash
python outputs/train_grpo.py --task 1 --batches 3 --group-size 4
python outputs/train_grpo.py --task all --batches 2 --group-size 4
```
---
## Setup & Running
### Environment Variables
```bash
export API_BASE_URL="https://router.huggingface.co/v1" # LLM endpoint
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" # Model identifier
export HF_TOKEN="your-hugging-face-token" # API key
```
### Install & Run
```bash
# Install dependencies
uv sync
# Start the environment server
uvicorn server.app:app --host 0.0.0.0 --port 7860
# Run inference (in a separate terminal)
python inference.py
# Run GRPO training scaffold
python outputs/train_grpo.py --task 1 --batches 3 --group-size 4
# Run the test suite
uv run python -m pytest tests/ -v
```
### Docker
```bash
docker build -t rag_debug_env .
docker run -p 7860:7860 -e HF_TOKEN=$HF_TOKEN rag_debug_env
```
### Validate OpenEnv Spec
```bash
openenv validate
```
### Rebuild the Corpus (optional)
The prebuilt corpus artifacts are included in the repository. To rebuild from scratch:
```bash
export OPENAI_API_KEY="your-openai-key" # required for Stage 3 (query gen) and Stage 4 (multi-hop)
python -m corpora.build_corpus --domain all
```
This runs all 6 stages for all domains sequentially. Individual domains can be rebuilt with `--domain software`, `--domain climate`, or `--domain medical`. Each stage caches its output, so stages can be re-run incrementally with `--force-reload`.
---
## Repository Layout
```
rag_debug_env/
│
├── inference.py                        # Competition inference script ([START]/[STEP]/[END] logging)
├── client.py                           # OpenEnv client (WebSocket-based)
├── models.py                           # Typed Pydantic models (Action, Observation, Config, etc.)
├── openenv.yaml                        # OpenEnv manifest
├── pyproject.toml                      # Package metadata and uv dependencies
├── Dockerfile                          # Multi-stage Docker build for HF Spaces
│
├── server/
│   ├── app.py                          # FastAPI application (create_app with OpenEnv)
│   ├── rag_debug_env_environment.py    # Core environment: reset(), step(), reward, fault routing
│   ├── constants.py                    # Task definitions, fault sets, thresholds
│   ├── fault_math.py                   # Pure fault injection math: apply_faults(S_true) → S_faulted
│   └── corpus.py                       # Corpus loader with synthetic fallback
│
├── corpora/
│   ├── build_corpus.py                 # Orchestrates all 6 stages
│   ├── stages/
│   │   ├── s1_load.py                  # Stage 1: Load raw documents
│   │   ├── s2_chunk.py                 # Stage 2: Token-level chunking
│   │   ├── s3_queries.py               # Stage 3: GPT-4o-mini query generation + cross-encoder filter
│   │   ├── s4_multihop.py              # Stage 4: Multi-hop query construction (medical only)
│   │   ├── s5_embed.py                 # Stage 5: Embed all 4 models, save S_true matrices
│   │   └── s6_grade.py                 # Stage 6: Cross-encoder R* labeling → ground_truth.json
│   ├── software/
│   │   ├── chunks.json                 # 271 chunks
│   │   ├── queries.json                # 48 queries
│   │   ├── ground_truth.json           # R* sets for all queries
│   │   ├── S_true_general.npy          # (48, 271) float32
│   │   ├── S_true_medical.npy
│   │   ├── S_true_legal.npy
│   │   └── S_true_code.npy
│   ├── climate/                        # 612 chunks, 44 queries
│   └── medical/                        # 359 chunks, 44 queries (6 multi-hop)
│
├── outputs/
│   ├── eval_agent.py                   # GPT-4o-mini zero-shot evaluation agent
│   └── train_grpo.py                   # GRPO training scaffold (full rollout + normalization)
│
├── tests/
│   ├── test_fault_math.py              # 13 tests: per-fault transformation correctness
│   ├── test_reward.py                  # Reward bounds, terminal rewards, component correctness
│   ├── test_environment.py             # Episode lifecycle, action routing, bug-fix verification
│   └── test_stdout_format.py           # [START]/[STEP]/[END] format compliance
│
└── docs/
    ├── ARCHITECTURE.md                 # Detailed architecture documentation
    └── MODELS_REFERENCE.md             # Embedding model details
```
---
## Example Agent Interaction
```
[START] task=task_1 env=rag_debug_env model=Qwen/Qwen2.5-72B-Instruct
Observation:
  Config: chunk_size=512 overlap=50 threshold=0.42 top_k=6 model=general reranking=false
  Metrics: coverage=0.340 precision=0.280 empty=2 overflow=0
  Hints: "2 queries have empty retrievals; lower threshold or increase top_k"
Step 1: adjust_threshold(value=0.15)
  Metrics: coverage=0.620 precision=0.450 empty=0 overflow=0 reward=0.52
Step 2: toggle_reranking(enabled=true)
  Metrics: coverage=0.720 precision=0.580 empty=0 overflow=0 reward=0.60
Step 3: adjust_top_k(value=15)
  Metrics: coverage=0.840 precision=0.610 empty=0 overflow=0 reward=0.66
Step 4: submit()
  task_score = 0.60×0.840 + 0.25×0.610 + 0.15×(1 - 4/10) = 0.504 + 0.153 + 0.090 = 0.747
  SUCCESS → terminal_reward = 0.7 + 0.3×0.747 = 0.924
[END] success=true steps=4 score=0.747 rewards=0.52,0.60,0.66,0.92
```
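The arithmetic in the final step of the trace can be checked with a few lines. The weights are the ones shown in the trace for `task_1`; reading the `4/10` term as steps used over a step budget of 10 is an assumption, as are the function and parameter names.

```python
def task_score(coverage: float, precision: float, steps: int, max_steps: int = 10) -> float:
    """Weighted score as written out in the trace: coverage, precision, efficiency."""
    efficiency = 1.0 - steps / max_steps   # assumed meaning of the (1 - 4/10) term
    return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency


score = task_score(coverage=0.840, precision=0.610, steps=4)
terminal = 0.7 + 0.3 * score               # success branch of the terminal reward

assert abs(score - 0.7465) < 1e-9          # shown as 0.747 in the trace
assert abs(terminal - 0.92395) < 1e-9      # shown as 0.924 in the trace
```

The unrounded score is 0.7465; the trace rounds each term before summing, which is how it arrives at 0.747.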
---
## Test Coverage
122 tests across 4 files verify correctness of every layer:
| File | Tests | What it covers |
|------|-------|----------------|
| `test_fault_math.py` | 13 | Per-fault transformation correctness, bounds, non-mutation |
| `test_reward.py` | ~50 | Reward bounds (fuzz tested), terminal ranges, component logic, task score formulas |
| `test_environment.py` | ~45 | Episode lifecycle, all 9 action types, ADJUST_CHUNK_OVERLAP bug fix verification |
| `test_stdout_format.py` | ~25 | [START]/[STEP]/[END] field names, ordering, numeric precision |
```bash
uv run python -m pytest tests/ -v
# 122 passed in 1.96s
```