# Architecture

## Runtime Topology

```text
Agent / baseline script
  -> client.RAGDebugEnv (openenv.core.EnvClient)
  -> WebSocket/HTTP to FastAPI app (server/app.py)
  -> RAGDebugEnvironment (server/rag_debug_env_environment.py)
  -> Corpus artifacts (corpora//*)
```

Server construction uses `openenv.core.env_server.http_server.create_app`:

- Environment class: `RagDebugEnvironment` aliasing `RAGDebugEnvironment`
- Action schema: `RAGDebugAction`
- Observation schema: `RAGDebugObservation`
- `env_name="rag_debug_env"`
- `max_concurrent_envs=1` in `server/app.py`

## Core Simulation Contract

The environment does not call a live vector database during episodes. Episode-time retrieval is simulated from precomputed matrices:

- `S_true_{general,medical,legal,code}.npy`: query-chunk cosine matrices
- `ground_truth.json`: relevant chunk IDs (`R*`) per query

At reset:

1. Load one domain corpus (`software`, `climate`, or `medical`)
2. Sample episode queries (5 total per task)
3. Slice the full `S_true` matrices down to the episode query rows
4. Sample injected faults
5. Build `S_faulted` via `server/fault_math.py`
6. Return the initial `RAGDebugObservation`

At step:

1. Apply the action to the config/model/rewrite overlay
2. Recompute `S_faulted` when required
3. Simulate retrieval (`top_k` selection, then threshold)
4. Compute per-query coverage/precision and aggregate metrics
5. Compute the dense reward (or the terminal submit reward)

## Task Configuration

Values below are sourced from `server/constants.py` and `server/rag_debug_env_environment.py`.
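The step-time retrieval and per-query metrics can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes the score matrix is a NumPy array with one row per episode query, and the function names (`simulate_retrieval`, `coverage_precision`) are hypothetical.

```python
import numpy as np

def simulate_retrieval(s_faulted: np.ndarray, top_k: int,
                       threshold: float) -> list[list[int]]:
    """For each episode query row, take the top_k highest-scoring chunks,
    then drop any whose faulted score falls below the threshold."""
    retrieved = []
    for row in s_faulted:
        top = np.argsort(row)[::-1][:top_k]            # top_k chunk indices by score
        kept = [int(i) for i in top if row[i] >= threshold]
        retrieved.append(kept)
    return retrieved

def coverage_precision(retrieved: list[int],
                       relevant: set[int]) -> tuple[float, float]:
    """Per-query coverage (fraction of relevant chunks retrieved)
    and precision (fraction of retrieved chunks that are relevant)."""
    hits = len(set(retrieved) & relevant)
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision
```

Because scoring happens against a precomputed matrix, varying `top_k` or the threshold is just a cheap re-slice, which is what makes recomputation on every step affordable.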
### Shared limits

- Episode queries: 5 (`_N_EPISODE_QUERIES` for all tasks)
- Max steps: 10 (`_MAX_STEPS`)

### Task 1 (software)

- Domain: `software`
- Faults sampled from:
  - `[chunk_too_large, no_reranking]`
  - `[threshold_too_high]`
  - `[top_k_too_small]`
  - `[chunk_too_large]`
- Success check on submit: `task_score >= 0.75`

### Task 2 (climate)

- Domain: `climate`
- Faults sampled from:
  - `[threshold_too_low, duplicate_flooding]`
  - `[top_k_too_small, context_overflow]`
  - `[duplicate_flooding]`
  - `[context_overflow]`
- Success check on submit: `task_score >= 0.75`

### Task 3 (medical)

- Domain: `medical`
- Fixed fault set:
  - `wrong_embedding_model`
  - `chunk_too_large`
  - `threshold_too_high`
- Initial active model is `legal` (an intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit:
  - `task_score >= 0.70`
  - `multi_hop_coverage > 0.60`

## Reward and Scoring

All rewards lie in **[0.0, 1.0]**. Non-terminal steps span **[0.0, ~0.89]**, based on absolute quality progress toward the success threshold.

Dense step reward (`_compute_reward`):

- `progress_reward`: `0.10 + 0.55 × min(1, quality_score / quality_target)` → [0.10, 0.65]
  An absolute quality-level signal using `_quality_score` (the task-score formula minus the efficiency term). It ensures the full reward range is utilised across the episode: low-quality states get low rewards, high-quality states get high rewards.
- `delta_bonus`: `clip(Δquality × 2.0, −0.15, +0.15)`
  A direction signal that distinguishes an improving step from a no-op at the same quality level.
- `empty_retrieval_signal`: bidirectional, weight ×0.06 (also rewards fixing empty retrievals)
- `overflow_signal`: bidirectional, weight ×0.04 (also rewards fixing overflows)
- `step_cost = -0.01`
- `redundancy_penalty = -0.04` for the same action type twice in a row

Submit reward (`_apply_action`):

- Success: `0.7 + 0.3 × task_score` → [0.7, 1.0]
- Failure: `0.2 × task_score` → [0.0, 0.2]

Task score (`_compute_task_score`):

- Tasks 1/2: `0.60*coverage + 0.25*precision + 0.15*efficiency`
- Task 3: `0.55*coverage + 0.25*precision + 0.20*multi_hop_coverage`

## Fault Math (Implemented)

All transformations live in `server/fault_math.py`.

- `CHUNK_TOO_LARGE`: 1D uniform filter along the chunk axis; severity scales with `chunk_size`
- `CHUNK_TOO_SMALL`: Gaussian noise scaled by the small chunk size, mitigated by overlap
- `THRESHOLD_TOO_LOW`: additive Gaussian noise
- `THRESHOLD_TOO_HIGH`: multiplicative score deflation (`* 0.55`)
- `TOP_K_TOO_SMALL`: score compression toward 0.5; less severe when reranking is enabled
- `DUPLICATE_FLOODING`: boosts random duplicate columns; reduced when reranking is enabled
- `CONTEXT_OVERFLOW`: zeroes tail columns based on `context_window_limit`
- `NO_RERANKING`: additive noise, applied only when reranking is off
- `WRONG_EMBEDDING_MODEL`: implicit, realized by selecting the wrong matrix (not a direct transform)
- **Cross-encoder reranking blend**: after all faults, if `use_reranking=True`, blends the faulted scores back toward the pre-fault scores (`alpha=0.35`). This simulates a cross-encoder partially recovering the true relevance signal. It is non-monotonic for noise-based faults (it changes rank order) and restores score spread for compression faults.

## Determinism and Fallbacks

- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation, keeping intra-episode behavior deterministic.
- If required corpus files are missing, `server/corpus.py` falls back to synthetic data and emits warnings.
- The synthetic fallback is for smoke testing only, not for real training or evaluation.
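The submit reward and task-score formulas can be written out directly. The weights below are taken verbatim from the reward section; the function names are hypothetical and this is only a sketch of the arithmetic, not the server code.

```python
def task_score_basic(coverage: float, precision: float, efficiency: float) -> float:
    """Task 1/2 score: 0.60*coverage + 0.25*precision + 0.15*efficiency."""
    return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency

def task_score_multihop(coverage: float, precision: float,
                        multi_hop_coverage: float) -> float:
    """Task 3 score: 0.55*coverage + 0.25*precision + 0.20*multi_hop_coverage."""
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage

def submit_reward(task_score: float, success: bool) -> float:
    """Success maps task_score into [0.7, 1.0]; failure into [0.0, 0.2].
    The two bands never overlap, so any successful submit outranks any failure."""
    return 0.7 + 0.3 * task_score if success else 0.2 * task_score
```

The disjoint success/failure bands mean the terminal reward always dominates the dense per-step range, so an agent cannot profit from farming step rewards instead of submitting a fixed pipeline.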
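The cross-encoder reranking blend can be sketched as a convex combination of the faulted and pre-fault matrices. This is an illustrative guess at the shape of the operation, assuming `alpha=0.35` weights the pre-fault scores (the text does not pin down which side alpha weights); the function name is hypothetical.

```python
import numpy as np

def apply_reranking_blend(s_faulted: np.ndarray, s_true: np.ndarray,
                          use_reranking: bool, alpha: float = 0.35) -> np.ndarray:
    """If reranking is enabled, pull faulted scores partway back toward the
    pre-fault (true) scores, partially recovering the relevance signal.
    With reranking off, the faulted scores pass through unchanged."""
    if not use_reranking:
        return s_faulted
    return (1.0 - alpha) * s_faulted + alpha * s_true
```

A convex blend like this explains the behaviors noted above: mixing in the true matrix can reorder chunks that noise had shuffled (non-monotonic for noise faults) and re-widens score gaps that compression faults had flattened.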