# Architecture

## Runtime Topology

```text
Agent / baseline script
-> client.RAGDebugEnv (openenv.core.EnvClient)
-> WebSocket/HTTP to FastAPI app (server/app.py)
-> RAGDebugEnvironment (server/rag_debug_env_environment.py)
-> Corpus artifacts (corpora/<domain>/*)
```

Server construction uses `openenv.core.env_server.http_server.create_app`:

- Environment class: `RagDebugEnvironment` aliasing `RAGDebugEnvironment`
- Action schema: `RAGDebugAction`
- Observation schema: `RAGDebugObservation`
- `env_name="rag_debug_env"`
- `max_concurrent_envs=1` in `server/app.py`
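For orientation, a minimal client-side sketch of the topology above. It assumes a Gym-style `reset()`/`step()` interface on `RAGDebugEnv`, a `base_url` constructor argument, and that `RAGDebugAction` takes an `action_type` plus parameters; none of those names are confirmed by this document:

```python
# Hypothetical usage sketch -- constructor arguments and RAGDebugAction fields are assumptions.
from client import RAGDebugEnv, RAGDebugAction

env = RAGDebugEnv(base_url="http://localhost:8000")   # FastAPI app from server/app.py

obs = env.reset()                                                    # initial RAGDebugObservation
obs = env.step(RAGDebugAction(action_type="set_top_k", top_k=10))    # one config change
obs = env.step(RAGDebugAction(action_type="submit"))                 # terminal submit
```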
## Core Simulation Contract

The environment does not call a live vector database during episodes.
Episode-time retrieval is simulated from precomputed matrices:

- `S_true_{general,medical,legal,code}.npy`: query×chunk cosine-similarity matrices, one per embedding model
- `ground_truth.json`: relevant chunk IDs (`R*`) per query

At reset:

1. Load one domain corpus (`software`, `climate`, `medical`)
2. Sample episode queries (5 per episode, for all tasks)
3. Slice the full `S_true` matrices down to the sampled query rows
4. Sample injected faults
5. Build `S_faulted` via `server/fault_math.py`
6. Return the initial `RAGDebugObservation`
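A self-contained sketch of steps 2–5, assuming the matrices live under `corpora/<domain>/`. Only the `threshold_too_high` transform (multiplicative deflation, described under "Fault Math") is shown; the real transforms live in `server/fault_math.py`:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Steps 2-3: sample 5 episode queries and slice their rows out of a full S_true matrix.
S_true_full = np.load("corpora/software/S_true_general.npy")   # shape: (n_queries, n_chunks)
episode_idx = rng.choice(S_true_full.shape[0], size=5, replace=False)
S_true = S_true_full[episode_idx]                              # shape: (5, n_chunks)

# Steps 4-5: sample one fault set and build S_faulted from it.
faults = ["threshold_too_high"]        # e.g. one of Task 1's candidate fault sets
S_faulted = S_true.copy()
if "threshold_too_high" in faults:
    S_faulted = S_faulted * 0.55       # multiplicative score deflation
```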
At step:

1. Apply the action to the config/model/rewrite overlay
2. Recompute `S_faulted` when required
3. Simulate retrieval (take the `top_k` highest-scoring chunks, then apply the similarity threshold)
4. Compute per-query coverage/precision and aggregate metrics
5. Compute the dense reward (or the terminal submit reward)
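A sketch of steps 3–4 for a single query row, assuming retrieval means top-k by score followed by a threshold cut; the environment's exact ordering and tie-breaking may differ:

```python
import numpy as np

def simulate_retrieval(scores: np.ndarray, top_k: int, threshold: float) -> np.ndarray:
    """Return retrieved chunk indices: top_k by score, then thresholded."""
    top = np.argsort(scores)[::-1][:top_k]      # highest-scoring chunks first
    return top[scores[top] >= threshold]        # drop those below the threshold

def coverage_precision(retrieved: np.ndarray, relevant: set[int]) -> tuple[float, float]:
    """Per-query coverage (recall over relevant chunks) and precision."""
    hits = sum(1 for c in retrieved if c in relevant)
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if len(retrieved) else 0.0
    return coverage, precision

# Toy scores for one episode query.
scores = np.array([0.81, 0.42, 0.77, 0.30, 0.65])
retrieved = simulate_retrieval(scores, top_k=3, threshold=0.5)   # -> indices [0, 2, 4]
print(coverage_precision(retrieved, relevant={0, 2}))            # -> (1.0, 0.666...)
```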
## Task Configuration

Values below are sourced from `server/constants.py` and `server/rag_debug_env_environment.py`.

### Shared limits

- Episode queries: 5 (`_N_EPISODE_QUERIES` for all tasks)
- Max steps: 10 (`_MAX_STEPS`)
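For reference, the shared limits would look roughly like this in `server/constants.py`; only the two values stated above are shown, any surrounding constants are omitted:

```python
# server/constants.py (excerpt-style sketch)
_N_EPISODE_QUERIES = 5   # queries sampled per episode, for every task
_MAX_STEPS = 10          # hard cap on steps per episode
```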
### Task 1 (software)

- Domain: `software`
- Fault set sampled from:
  - `[chunk_too_large, no_reranking]`
  - `[threshold_too_high]`
  - `[top_k_too_small]`
  - `[chunk_too_large]`
- Success check on submit: `task_score >= 0.75`

### Task 2 (climate)

- Domain: `climate`
- Fault set sampled from:
  - `[threshold_too_low, duplicate_flooding]`
  - `[top_k_too_small, context_overflow]`
  - `[duplicate_flooding]`
  - `[context_overflow]`
- Success check on submit: `task_score >= 0.75`

### Task 3 (medical)

- Domain: `medical`
- Fixed fault set:
  - `wrong_embedding_model`
  - `chunk_too_large`
  - `threshold_too_high`
- Initial active model is `legal` (an intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit:
  - `task_score >= 0.70`
  - `multi_hop_coverage > 0.60`
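A sketch of how the submit-time success checks above could be expressed. The thresholds come from the task definitions; the function shape and argument names are illustrative, not the environment's actual code:

```python
def is_success(task: int, task_score: float, multi_hop_coverage: float = 0.0) -> bool:
    """Submit-time success criteria per task, using the thresholds listed above."""
    if task in (1, 2):
        return task_score >= 0.75
    # Task 3 additionally requires multi-hop coverage above 0.60.
    return task_score >= 0.70 and multi_hop_coverage > 0.60

print(is_success(1, task_score=0.78))                           # True
print(is_success(3, task_score=0.72, multi_hop_coverage=0.55))  # False: multi-hop too low
```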
## Reward and Scoring

All rewards are in **[0.0, 1.0]**. Non-terminal steps span **[0.0, ~0.89]**,
based on absolute quality progress toward the success threshold.

Dense step reward (`_compute_reward`):

- `progress_reward`: `0.10 + 0.55 × min(1, quality_score / quality_target)` → [0.10, 0.65]
  An absolute quality-level signal using `_quality_score` (the task-score formula without the efficiency term).
  It ensures the full reward range is used across the episode: low-quality states
  get low rewards, high-quality states get high rewards.
- `delta_bonus`: `clip(Δquality × 2.0, −0.15, +0.15)`
  A direction signal that distinguishes an improving step from a no-op at the same quality level.
- `empty_retrieval_signal`: bidirectional, weight ×0.06 (also rewards fixing empty retrievals)
- `overflow_signal`: bidirectional, weight ×0.04 (also rewards fixing overflows)
- `step_cost = -0.01`
- `redundancy_penalty = -0.04` for using the same action type twice in a row
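A minimal sketch assembling the dense step reward from the components above. The coefficients and clipping come from this section; the function shape and argument names are assumptions:

```python
import numpy as np

def compute_step_reward(
    quality_score: float,
    prev_quality_score: float,
    quality_target: float,
    empty_signal: float,       # bidirectional: positive when empty retrievals are fixed
    overflow_signal: float,    # bidirectional: positive when overflows are fixed
    repeated_action: bool,
) -> float:
    """Dense (non-terminal) reward assembled from the components listed above."""
    progress_reward = 0.10 + 0.55 * min(1.0, quality_score / quality_target)
    delta_bonus = float(np.clip((quality_score - prev_quality_score) * 2.0, -0.15, 0.15))
    reward = progress_reward + delta_bonus
    reward += 0.06 * empty_signal + 0.04 * overflow_signal
    reward += -0.01                    # step_cost
    if repeated_action:
        reward += -0.04                # redundancy_penalty
    return reward
```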
Submit reward (`_apply_action`):

- Success: `0.7 + 0.3 × task_score` → [0.7, 1.0]
- Failure: `0.2 × task_score` → [0.0, 0.2]

Task score (`_compute_task_score`):

- Task 1/2: `0.60 × coverage + 0.25 × precision + 0.15 × efficiency`
- Task 3: `0.55 × coverage + 0.25 × precision + 0.20 × multi_hop_coverage`
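The terminal path in one sketch, combining the task-score weights with the submit reward split; the weights and ranges are from the formulas above, and the success flag would come from the per-task checks listed earlier:

```python
def task_score(task: int, coverage: float, precision: float,
               efficiency: float = 0.0, multi_hop_coverage: float = 0.0) -> float:
    """Weighted task score as listed above."""
    if task in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage

def submit_reward(score: float, success: bool) -> float:
    """Terminal reward: [0.7, 1.0] on success, [0.0, 0.2] on failure."""
    return 0.7 + 0.3 * score if success else 0.2 * score

# e.g. Task 1 with coverage 0.9, precision 0.8, efficiency 0.7 -> score 0.845 -> reward 0.9535
s = task_score(1, coverage=0.9, precision=0.8, efficiency=0.7)
print(s, submit_reward(s, success=s >= 0.75))
```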
## Fault Math (Implemented)

All transformations are in `server/fault_math.py`.

- `CHUNK_TOO_LARGE`: 1D uniform filter along the chunk axis; severity scales with `chunk_size`
- `CHUNK_TOO_SMALL`: Gaussian noise scaled by the small chunk size, mitigated by overlap
- `THRESHOLD_TOO_LOW`: additive Gaussian noise
- `THRESHOLD_TOO_HIGH`: multiplicative score deflation (`* 0.55`)
- `TOP_K_TOO_SMALL`: score compression toward 0.5; less severe if reranking is enabled
- `DUPLICATE_FLOODING`: boosts random duplicate columns; reduced if reranking is enabled
- `CONTEXT_OVERFLOW`: zeroes tail columns based on `context_window_limit`
- `NO_RERANKING`: additive noise, applied only when reranking is off
- `WRONG_EMBEDDING_MODEL`: implicit, realized by selecting the wrong matrix (not a direct transform)
- **Cross-encoder reranking blend**: after all faults, if `use_reranking=True`,
  blends the faulted scores back toward the pre-fault scores (alpha=0.35). This simulates a
  cross-encoder partially recovering the true relevance signal. It is non-monotonic for
  noise-based faults (it changes rank order) and restores score spread for compression faults.
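A sketch of two of the transforms plus the reranking blend, matching the descriptions above. The blend direction (alpha weighting the pre-fault scores) and the filter-width heuristic are assumptions beyond what this section states:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def chunk_too_large(S: np.ndarray, chunk_size: int) -> np.ndarray:
    """1D uniform filter along the chunk axis; wider window for larger chunk_size."""
    window = max(1, chunk_size // 200)          # severity scaling is an assumption
    return uniform_filter1d(S, size=window, axis=1)

def context_overflow(S: np.ndarray, context_window_limit: int) -> np.ndarray:
    """Zero out tail columns past the context window limit."""
    out = S.copy()
    out[:, context_window_limit:] = 0.0
    return out

def reranking_blend(S_faulted: np.ndarray, S_true: np.ndarray, alpha: float = 0.35) -> np.ndarray:
    """Blend faulted scores back toward the pre-fault scores when use_reranking=True."""
    return (1.0 - alpha) * S_faulted + alpha * S_true
```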
## Determinism and Fallbacks

- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation, so intra-episode behavior is deterministic.
- If required corpus files are missing, `server/corpus.py` falls back to synthetic data and emits warnings.
- The synthetic fallback is intended for smoke testing only, not for real training or evaluation.
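A sketch of the determinism pattern from the first bullet: per-episode noise and duplicate indices are drawn once at reset, cached, and reused whenever `S_faulted` is recomputed. Class and attribute names here are illustrative:

```python
import numpy as np

class EpisodeState:
    """Caches per-episode randomness so recomputing S_faulted stays deterministic."""

    def __init__(self, S_true: np.ndarray, seed: int) -> None:
        rng = np.random.default_rng(seed)
        # Sampled once at reset, then reused on every recomputation.
        self.noise = rng.normal(0.0, 0.05, size=S_true.shape)
        self.duplicate_cols = rng.choice(S_true.shape[1], size=3, replace=False)
        self.S_true = S_true

    def recompute_faulted(self, threshold_too_low: bool) -> np.ndarray:
        S = self.S_true.copy()
        if threshold_too_low:
            S = S + self.noise    # same noise array every time within the episode
        return S
```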