# Architecture
## Runtime Topology
```text
Agent / baseline script
-> client.RAGDebugEnv (openenv.core.EnvClient)
-> WebSocket/HTTP to FastAPI app (server/app.py)
-> RAGDebugEnvironment (server/rag_debug_env_environment.py)
-> Corpus artifacts (corpora/<domain>/*)
```
Server construction uses `openenv.core.env_server.http_server.create_app`:
- Environment class: `RagDebugEnvironment` (an alias of `RAGDebugEnvironment`)
- Action schema: `RAGDebugAction`
- Observation schema: `RAGDebugObservation`
- `env_name="rag_debug_env"`
- `max_concurrent_envs=1` in `server/app.py`
## Core Simulation Contract
The environment does not call a live vector database during episodes.
Episode-time retrieval is simulated from precomputed matrices:
- `S_true_{general,medical,legal,code}.npy`: query-chunk cosine similarity matrices, one per embedding model
- `ground_truth.json`: relevant chunk IDs (`R*`) per query
At reset:
1. Load one domain corpus (`software`, `climate`, or `medical`)
2. Sample episode queries (5 total per task)
3. Slice full `S_true` matrices down to episode query rows
4. Sample injected faults
5. Build `S_faulted` via `server/fault_math.py`
6. Return initial `RAGDebugObservation`
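
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration using plain lists rather than the project's arrays; `sample_episode_rows` and its `seed` parameter are hypothetical names, not the environment's actual API.

```python
import random

N_EPISODE_QUERIES = 5  # mirrors the documented per-episode query count


def sample_episode_rows(s_true, n_queries=N_EPISODE_QUERIES, seed=None):
    """Pick episode queries and slice the full matrix down to their rows.

    `s_true` is a list of rows (one per query), each a list of chunk scores.
    """
    rng = random.Random(seed)
    # Sample distinct query indices, then sort them for stable row order.
    query_ids = sorted(rng.sample(range(len(s_true)), n_queries))
    # Copy rows so later fault transforms cannot mutate the full matrix.
    episode = [list(s_true[q]) for q in query_ids]
    return query_ids, episode


# Toy 20-query x 4-chunk matrix standing in for a precomputed S_true file.
full = [[round(0.01 * (q + c), 2) for c in range(4)] for q in range(20)]
query_ids, s_episode = sample_episode_rows(full, seed=0)
```

Slicing once at reset keeps every later recomputation of `S_faulted` confined to the five episode rows.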
At step:
1. Apply action to config/model/rewrite overlay
2. Recompute `S_faulted` when required
3. Simulate retrieval (`top_k` then threshold)
4. Compute per-query coverage/precision and aggregate metrics
5. Compute dense reward (or terminal submit reward)
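
Steps 3 and 4 above can be illustrated with a small sketch. It assumes coverage is the fraction of relevant chunks retrieved and precision is the fraction of retrieved chunks that are relevant, and that the threshold comparison is inclusive; the function names are hypothetical, and the environment's own definitions may differ.

```python
def simulate_retrieval(scores, top_k, threshold):
    """Keep the top_k highest-scoring chunks, then drop those below threshold."""
    # Rank chunk indices by descending score; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    top = ranked[:top_k]
    # Assumption: a chunk survives when its score is >= threshold.
    return [j for j in top if scores[j] >= threshold]


def coverage_precision(retrieved, relevant):
    """Return (coverage, precision) for one query."""
    relevant = set(relevant)
    hits = sum(1 for j in retrieved if j in relevant)
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision


# One query row of faulted scores; chunks 0 and 3 are the relevant set.
row = [0.82, 0.10, 0.64, 0.55, 0.30]
retrieved = simulate_retrieval(row, top_k=3, threshold=0.6)
cov, prec = coverage_precision(retrieved, relevant=[0, 3])
```

In this example chunk 3 makes the top-k cut but falls below the threshold, so both coverage and precision come out at 0.5.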
## Task Configuration
Values below are sourced from `server/constants.py` and `server/rag_debug_env_environment.py`.
### Shared limits
- Episode queries: 5 (`_N_EPISODE_QUERIES` for all tasks)
- Max steps: 10 (`_MAX_STEPS`)
### Task 1 (software)
- Domain: `software`
- Faults sampled from:
  - `[chunk_too_large, no_reranking]`
  - `[threshold_too_high]`
  - `[top_k_too_small]`
  - `[chunk_too_large]`
- Success check on submit: `task_score >= 0.75`
### Task 2 (climate)
- Domain: `climate`
- Faults sampled from:
  - `[threshold_too_low, duplicate_flooding]`
  - `[top_k_too_small, context_overflow]`
  - `[duplicate_flooding]`
  - `[context_overflow]`
- Success check on submit: `task_score >= 0.75`
### Task 3 (medical)
- Domain: `medical`
- Fixed fault set:
  - `wrong_embedding_model`
  - `chunk_too_large`
  - `threshold_too_high`
- Initial active model is `legal` (intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit:
  - `task_score >= 0.70`
  - `multi_hop_coverage > 0.60`
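
The submit-time success checks for the three tasks can be summarized in one small function. This is a sketch built only from the thresholds listed above; `is_success` and its integer `task_id` argument are hypothetical names.

```python
def is_success(task_id, task_score, multi_hop_coverage=0.0):
    """Return True when a submission meets the documented thresholds."""
    if task_id in (1, 2):
        return task_score >= 0.75
    if task_id == 3:
        # Task 3 also requires strictly more than 0.60 multi-hop coverage.
        return task_score >= 0.70 and multi_hop_coverage > 0.60
    raise ValueError(f"unknown task_id: {task_id}")
```

Note the strict inequality on `multi_hop_coverage`: a Task 3 submission at exactly 0.60 does not pass.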
## Reward and Scoring
All rewards are in **[0.0, 1.0]**. Non-terminal steps span **[0.0, ~0.89]**
based on absolute quality progress toward the success threshold.
Dense step reward (`_compute_reward`):
- `progress_reward`: `0.10 + 0.55 × min(1, quality_score / quality_target)` → [0.10, 0.65]
  Absolute quality level signal using `_quality_score` (task_score formula minus efficiency).
  Ensures the full reward range is utilised across the episode: low-quality states
  get low rewards, high-quality states get high rewards.
- `delta_bonus`: `clip(Δquality × 2.0, -0.15, +0.15)`
  Direction signal that distinguishes an improving step from a no-op at the same level.
- `empty_retrieval_signal`: bidirectional, weight ×0.06 (rewards fixing empties too)
- `overflow_signal`: bidirectional, weight ×0.04 (rewards fixing overflows too)
- `step_cost = -0.01`
- `redundancy_penalty = -0.04` for same action type twice in a row
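
The components above combine roughly as in the sketch below. It is not the body of `_compute_reward`: the parameter names are hypothetical, the two bidirectional signals are assumed to arrive as values in [-1, 1], and the final clip to [0, 1] is an assumption inferred from the stated reward range.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))


def dense_reward(quality, prev_quality, quality_target,
                 empty_signal=0.0, overflow_signal=0.0, repeated_action=False):
    """Combine the documented reward terms for one non-terminal step."""
    # Absolute progress toward the success threshold: [0.10, 0.65].
    progress = 0.10 + 0.55 * min(1.0, quality / quality_target)
    # Direction signal: bounded bonus or penalty for the change in quality.
    delta_bonus = clip((quality - prev_quality) * 2.0, -0.15, 0.15)
    reward = progress + delta_bonus
    # Assumption: each signal is in [-1, 1], positive when the issue is fixed.
    reward += 0.06 * empty_signal
    reward += 0.04 * overflow_signal
    reward += -0.01  # step_cost
    if repeated_action:
        reward += -0.04  # redundancy_penalty
    # Assumption: the result is clipped into the documented [0, 1] range.
    return clip(reward, 0.0, 1.0)


best = dense_reward(0.75, 0.60, 0.75, empty_signal=1.0, overflow_signal=1.0)
worst = dense_reward(0.0, 0.50, 0.75, empty_signal=-1.0, overflow_signal=-1.0,
                     repeated_action=True)
```

With every term at its maximum, `best` is about 0.89 (0.65 + 0.15 + 0.06 + 0.04 - 0.01), matching the stated upper bound for non-terminal steps. In the worst case the raw sum is negative and clips to 0.0.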
Submit reward (`_apply_action`):
- Success: `0.7 + 0.3 × task_score` → [0.7, 1.0]
- Failure: `0.2 × task_score` → [0.0, 0.2]
Task score (`_compute_task_score`):
- Task 1/2: `0.60*coverage + 0.25*precision + 0.15*efficiency`
- Task 3: `0.55*coverage + 0.25*precision + 0.20*multi_hop_coverage`
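
The task score and submit reward formulas above translate directly into code. The sketch below uses hypothetical function names and takes the success flag as an input rather than recomputing it.

```python
def task_score(task_id, coverage, precision,
               efficiency=0.0, multi_hop_coverage=0.0):
    """Weighted task score using the documented coefficients."""
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    if task_id == 3:
        return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage
    raise ValueError(f"unknown task_id: {task_id}")


def submit_reward(score, success):
    """Terminal reward: [0.7, 1.0] on success, [0.0, 0.2] on failure."""
    if success:
        return 0.7 + 0.3 * score
    return 0.2 * score


# Task 3 example: the score lands just under the 0.70 threshold.
score = task_score(3, coverage=0.8, precision=0.6, multi_hop_coverage=0.5)
reward = submit_reward(score, success=False)
```

Here `score` is about 0.69 (0.44 + 0.15 + 0.10), so the submission fails and earns about 0.138. The gap between the failure ceiling of 0.2 and the success floor of 0.7 makes a premature submit costly.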
## Fault Math (Implemented)
All transformations are in `server/fault_math.py`.
- `CHUNK_TOO_LARGE`: 1D uniform filter along chunk axis; severity scales with `chunk_size`
- `CHUNK_TOO_SMALL`: gaussian noise scaled by small chunk size, mitigated by overlap
- `THRESHOLD_TOO_LOW`: additive gaussian noise
- `THRESHOLD_TOO_HIGH`: multiplicative score deflation (`* 0.55`)
- `TOP_K_TOO_SMALL`: score compression toward 0.5; less severe if reranking enabled
- `DUPLICATE_FLOODING`: boosts random duplicate columns; reduced if reranking enabled
- `CONTEXT_OVERFLOW`: zeroes tail columns based on `context_window_limit`
- `NO_RERANKING`: additive noise only when reranking is off
- `WRONG_EMBEDDING_MODEL`: implicit by selecting wrong matrix (not a direct transform)
- **Cross-encoder reranking blend**: after all faults, if `use_reranking=True`,
  blends faulted scores back toward pre-fault scores (alpha=0.35). This simulates a
  cross-encoder partially recovering the true relevance signal. The blend is
  non-monotonic for noise-based faults (it can change rank order) and restores
  score spread for compression faults.
## Determinism and Fallbacks
- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
- If required corpus files are missing, `server/corpus.py` falls back to synthetic data and emits warnings.
- Synthetic fallback is for smoke testing only, not for real training/evaluation.
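
The sample-once, reuse-on-recompute pattern described in the first bullet can be sketched as follows. `EpisodeNoise` is a hypothetical class for illustration; the environment's own storage and noise model may differ.

```python
import random


class EpisodeNoise:
    """Holds noise drawn once at reset so recomputation is deterministic."""

    def __init__(self, n_rows, n_cols, seed):
        rng = random.Random(seed)
        # Unit-variance gaussian noise, sampled exactly once per episode.
        self.noise = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)]
                      for _ in range(n_rows)]

    def apply(self, s_true, sigma):
        """Add the stored noise scaled by sigma; no new random draws."""
        return [[s + sigma * n for s, n in zip(s_row, n_row)]
                for s_row, n_row in zip(s_true, self.noise)]


s_true = [[0.8, 0.2, 0.5], [0.1, 0.9, 0.4]]
noise = EpisodeNoise(n_rows=2, n_cols=3, seed=42)
first = noise.apply(s_true, sigma=0.1)
second = noise.apply(s_true, sigma=0.1)  # identical to `first`
louder = noise.apply(s_true, sigma=0.2)  # same noise pattern, doubled scale
```

Because only the scale changes between recomputations, an action that alters fault severity moves the scores smoothly instead of reshuffling them at random.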