# Architecture
## Runtime Topology
```text
Agent / baseline script
-> client.RAGDebugEnv (openenv.core.EnvClient)
-> WebSocket/HTTP to FastAPI app (server/app.py)
-> RAGDebugEnvironment (server/rag_debug_env_environment.py)
-> Corpus artifacts (corpora/<domain>/*)
```
Server construction uses `openenv.core.env_server.http_server.create_app`:
- Environment class: `RagDebugEnvironment` aliasing `RAGDebugEnvironment`
- Action schema: `RAGDebugAction`
- Observation schema: `RAGDebugObservation`
- `env_name="rag_debug_env"`
- `max_concurrent_envs=1` in `server/app.py`
## Core Simulation Contract
The environment does not call a live vector database during episodes.
Episode-time retrieval is simulated from precomputed matrices:
- `S_true_{general,medical,legal,code}.npy`: query-chunk cosine matrices
- `ground_truth.json`: relevant chunk IDs (`R*`) per query
At reset:
1. Load one domain corpus (one of `software`, `climate`, or `medical`)
2. Sample episode queries (5 total per task)
3. Slice full `S_true` matrices down to episode query rows
4. Sample injected faults
5. Build `S_faulted` via `server/fault_math.py`
6. Return initial `RAGDebugObservation`
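Steps 2–3 above can be sketched with numpy; `slice_episode_matrix` and its seeding scheme are hypothetical illustrations, not the environment's actual helpers:

```python
import numpy as np

def slice_episode_matrix(S_true: np.ndarray, n_queries: int, seed: int):
    """Sample episode query rows without replacement, then slice the full
    query-chunk cosine matrix down to just those rows."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(S_true.shape[0], size=n_queries, replace=False)
    return rows, S_true[rows]
```

The faulted matrix is then built from this episode-sized slice, so per-step recomputation stays cheap.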
At step:
1. Apply action to config/model/rewrite overlay
2. Recompute `S_faulted` when required
3. Simulate retrieval (`top_k` selection, then threshold filtering)
4. Compute per-query coverage/precision and aggregate metrics
5. Compute dense reward (or terminal submit reward)
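Steps 3–4 can be illustrated with a minimal numpy sketch; the function names are hypothetical, and the real environment's tie-breaking and metric aggregation may differ:

```python
import numpy as np

def simulate_retrieval(scores: np.ndarray, top_k: int, threshold: float) -> list[int]:
    """Take the top_k highest-scoring chunks, then drop any below the threshold."""
    top = np.argsort(scores)[::-1][:top_k]
    return [int(i) for i in top if scores[i] >= threshold]

def coverage_precision(retrieved: list[int], relevant: set[int]) -> tuple[float, float]:
    """Per-query coverage (recall over relevant chunks) and precision."""
    hits = len(set(retrieved) & relevant)
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision
```

Per-query values would then be averaged over the 5 episode queries to produce the aggregate metrics.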
## Task Configuration
Values below are sourced from `server/constants.py` and `server/rag_debug_env_environment.py`.
### Shared limits
- Episode queries: 5 (`_N_EPISODE_QUERIES` for all tasks)
- Max steps: 10 (`_MAX_STEPS`)
### Task 1 (software)
- Domain: `software`
- Faults sampled from:
- `[chunk_too_large, no_reranking]`
- `[threshold_too_high]`
- `[top_k_too_small]`
- `[chunk_too_large]`
- Success check on submit: `task_score >= 0.75`
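The fault sampling above amounts to a uniform draw over the candidate sets; `TASK1_FAULT_SETS` and `sample_task1_faults` are hypothetical names, not identifiers from `server/rag_debug_env_environment.py`:

```python
import random

# Candidate fault sets for Task 1, as listed above.
TASK1_FAULT_SETS = [
    ["chunk_too_large", "no_reranking"],
    ["threshold_too_high"],
    ["top_k_too_small"],
    ["chunk_too_large"],
]

def sample_task1_faults(rng: random.Random) -> list[str]:
    """Pick one candidate fault set uniformly at random at reset."""
    return rng.choice(TASK1_FAULT_SETS)
```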
### Task 2 (climate)
- Domain: `climate`
- Faults sampled from:
- `[threshold_too_low, duplicate_flooding]`
- `[top_k_too_small, context_overflow]`
- `[duplicate_flooding]`
- `[context_overflow]`
- Success check on submit: `task_score >= 0.75`
### Task 3 (medical)
- Domain: `medical`
- Fixed fault set:
- `wrong_embedding_model`
- `chunk_too_large`
- `threshold_too_high`
- Initial active model is `legal` (intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit:
- `task_score >= 0.70`
- `multi_hop_coverage > 0.60`
## Reward and Scoring
All rewards are in **[0.0, 1.0]**. Non-terminal steps span **[0.0, ~0.89]**
based on absolute quality progress toward the success threshold.
Dense step reward (`_compute_reward`):
- `progress_reward`: `0.10 + 0.55 × min(1, quality_score / quality_target)` → [0.10, 0.65]
  Absolute quality-level signal using `_quality_score` (the task_score formula minus efficiency).
  Ensures the full reward range is utilised across the episode: low-quality states
  get low rewards, high-quality states get high rewards.
- `delta_bonus`: `clip(Δquality × 2.0, −0.15, +0.15)`
  Direction signal that distinguishes an improving step from a no-op at the same level.
- `empty_retrieval_signal`: bidirectional, weight ×0.06 (rewards fixing empties too)
- `overflow_signal`: bidirectional, weight ×0.04 (rewards fixing overflows too)
- `step_cost = -0.01`
- `redundancy_penalty = -0.04` for the same action type twice in a row
Submit reward (`_apply_action`):
- Success: `0.7 + 0.3 × task_score` → [0.7, 1.0]
- Failure: `0.2 × task_score` → [0.0, 0.2]
Task score (`_compute_task_score`):
- Task 1/2: `0.60 × coverage + 0.25 × precision + 0.15 × efficiency`
- Task 3: `0.55 × coverage + 0.25 × precision + 0.20 × multi_hop_coverage`
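The scoring formulas above condense into a small sketch; `task_score` and `submit_reward` are hypothetical stand-ins for `_compute_task_score` and the submit branch of `_apply_action`:

```python
def task_score(task_id, coverage, precision, efficiency=0.0, multi_hop_coverage=0.0):
    """Weighted task score, per the formulas above (hypothetical signature)."""
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage

def submit_reward(score, success):
    """Terminal reward on submit: high band [0.7, 1.0] on success,
    low band [0.0, 0.2] on failure."""
    return 0.7 + 0.3 * score if success else 0.2 * score
```

The disjoint bands guarantee that any successful submit outranks any failed one.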
## Fault Math (Implemented)
All transformations are in `server/fault_math.py`.
- `CHUNK_TOO_LARGE`: 1D uniform filter along chunk axis; severity scales with `chunk_size`
- `CHUNK_TOO_SMALL`: gaussian noise scaled by small chunk size, mitigated by overlap
- `THRESHOLD_TOO_LOW`: additive gaussian noise
- `THRESHOLD_TOO_HIGH`: multiplicative score deflation (`* 0.55`)
- `TOP_K_TOO_SMALL`: score compression toward 0.5; less severe if reranking enabled
- `DUPLICATE_FLOODING`: boosts random duplicate columns; reduced if reranking enabled
- `CONTEXT_OVERFLOW`: zeroes tail columns based on `context_window_limit`
- `NO_RERANKING`: additive noise only when reranking is off
- `WRONG_EMBEDDING_MODEL`: implicit by selecting wrong matrix (not a direct transform)
- **Cross-encoder reranking blend**: after all faults, if `use_reranking=True`,
  blends faulted scores back toward pre-fault scores (alpha=0.35). Simulates a
  cross-encoder partially recovering the true relevance signal. Non-monotonic for
  noise-based faults (it changes rank order); restores score spread for compression faults.
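Two of the transforms above, sketched with numpy. The 0.55 deflation factor and alpha=0.35 come from this document, but whether alpha weights the pre-fault or the faulted matrix is an assumption here; the function names are hypothetical:

```python
import numpy as np

def threshold_too_high(S: np.ndarray) -> np.ndarray:
    """Multiplicative score deflation, pushing scores below the retrieval threshold."""
    return S * 0.55

def reranking_blend(S_faulted: np.ndarray, S_true: np.ndarray,
                    alpha: float = 0.35) -> np.ndarray:
    """Blend faulted scores back toward pre-fault scores.
    Assumption: alpha weights the pre-fault (true) matrix."""
    return alpha * S_true + (1.0 - alpha) * S_faulted
```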
## Determinism and Fallbacks
- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
- If required corpus files are missing, `server/corpus.py` falls back to synthetic data and emits warnings.
- Synthetic fallback is for smoke testing only, not for real training/evaluation.
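The reset-time noise caching can be sketched as follows; `NoiseCache` is a hypothetical illustration, not the class used in `server/fault_math.py`:

```python
import numpy as np

class NoiseCache:
    """Sample noise once (at reset) and reuse it on every recomputation,
    so intra-episode fault math is deterministic."""

    def __init__(self, shape: tuple[int, ...], seed: int):
        self._noise = np.random.default_rng(seed).normal(0.0, 0.05, size=shape)

    def apply(self, S: np.ndarray) -> np.ndarray:
        """Add the cached noise; repeated calls give identical results."""
        return S + self._noise
```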