# Architecture

## Runtime Topology

```text
Agent / baseline script
  -> client.RAGDebugEnv (openenv.core.EnvClient)
  -> WebSocket/HTTP to FastAPI app (server/app.py)
  -> RAGDebugEnvironment (server/rag_debug_env_environment.py)
  -> Corpus artifacts (corpora/<domain>/*)
```

Server construction uses `openenv.core.env_server.http_server.create_app`:

- Environment class: `RagDebugEnvironment` (an alias of `RAGDebugEnvironment`)
- Action schema: `RAGDebugAction`
- Observation schema: `RAGDebugObservation`
- `env_name="rag_debug_env"`
- `max_concurrent_envs=1` in `server/app.py`

## Core Simulation Contract

The environment does not call a live vector database during episodes.

Episode-time retrieval is simulated from precomputed matrices:

- `S_true_{general,medical,legal,code}.npy`: query-chunk cosine-similarity matrices, one per embedding model
- `ground_truth.json`: relevant chunk IDs (`R*`) per query

At reset:

1. Load one domain corpus (`software`, `climate`, or `medical`)
2. Sample episode queries (5 per episode, for every task)
3. Slice full `S_true` matrices down to episode query rows
4. Sample injected faults
5. Build `S_faulted` via `server/fault_math.py`
6. Return initial `RAGDebugObservation`
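
Steps 2 and 3 above can be pictured with a minimal sketch. It assumes each `S_true` matrix has shape `(n_queries, n_chunks)` and that episode queries are drawn without replacement; `slice_episode_rows` and its arguments are hypothetical names for illustration, not the functions in the server code.

```python
import numpy as np

def slice_episode_rows(s_true_by_model, n_queries, n_episode_queries, rng):
    """Pick episode query rows and slice every model's matrix down to them.

    Assumption: rows index queries and columns index chunks.
    """
    query_idx = np.sort(rng.choice(n_queries, size=n_episode_queries, replace=False))
    sliced = {name: matrix[query_idx] for name, matrix in s_true_by_model.items()}
    return query_idx, sliced

rng = np.random.default_rng(0)
# Stand-in matrices; the real ones come from the S_true_*.npy files.
full = {"general": rng.random((20, 50)), "legal": rng.random((20, 50))}
idx, episode = slice_episode_rows(full, n_queries=20, n_episode_queries=5, rng=rng)
print(idx, episode["general"].shape)
```

Slicing every model's matrix with the same row indices keeps the episode queries aligned when the active embedding model changes mid-episode.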

At step:

1. Apply action to config/model/rewrite overlay
2. Recompute `S_faulted` when required
3. Simulate retrieval (`top_k` then threshold)
4. Compute per-query coverage/precision and aggregate metrics
5. Compute dense reward (or terminal submit reward)
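
Steps 3 and 4 above can be sketched for a single query as follows. This is an illustration under stated assumptions: the threshold comparison is taken to be inclusive (`>=`), and `simulate_retrieval` and `coverage_precision` are hypothetical helper names rather than the environment's own.

```python
import numpy as np

def simulate_retrieval(scores, top_k, threshold):
    """Take the top_k chunks by score, then drop any below the threshold."""
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [int(i) for i in order if scores[i] >= threshold]

def coverage_precision(retrieved, relevant):
    """Coverage = share of relevant chunks retrieved; precision = share of retrieved chunks that are relevant."""
    hits = len(set(retrieved) & set(relevant))
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.65])  # one row of S_faulted
retrieved = simulate_retrieval(scores, top_k=3, threshold=0.5)
cov, prec = coverage_precision(retrieved, relevant={0, 3})
print(retrieved, cov, prec)
```

Applying `top_k` before the threshold means a high threshold can leave fewer than `top_k` results, including none at all, which is the empty-retrieval case the reward tracks.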

## Task Configuration

Values below are sourced from `server/constants.py` and `server/rag_debug_env_environment.py`.

### Shared limits

- Episode queries: 5 (`_N_EPISODE_QUERIES` for all tasks)
- Max steps: 10 (`_MAX_STEPS`)

### Task 1 (software)

- Domain: `software`
- Faults sampled from:
  - `[chunk_too_large, no_reranking]`
  - `[threshold_too_high]`
  - `[top_k_too_small]`
  - `[chunk_too_large]`
- Success check on submit: `task_score >= 0.75`

### Task 2 (climate)

- Domain: `climate`
- Faults sampled from:
  - `[threshold_too_low, duplicate_flooding]`
  - `[top_k_too_small, context_overflow]`
  - `[duplicate_flooding]`
  - `[context_overflow]`
- Success check on submit: `task_score >= 0.75`

### Task 3 (medical)

- Domain: `medical`
- Fixed fault set:
  - `wrong_embedding_model`
  - `chunk_too_large`
  - `threshold_too_high`
- Initial active model is `legal` (intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit:
  - `task_score >= 0.70`
  - `multi_hop_coverage > 0.60`

## Reward and Scoring

All rewards are in **[0.0, 1.0]**. Non-terminal steps span **[0.0, ~0.89]**,
based on absolute quality progress toward the success threshold.

Dense step reward (`_compute_reward`):

- `progress_reward`: `0.10 + 0.55 × min(1, quality_score / quality_target)` → [0.10, 0.65]
  Absolute quality level signal using `_quality_score` (task_score formula minus efficiency).
  Ensures the full reward range is utilised across the episode — low-quality states
  get low rewards, high-quality states get high rewards.
- `delta_bonus`: `clip(Δquality × 2.0, −0.15, +0.15)`
  Direction signal that distinguishes an improving step from a no-op at the same level.
- `empty_retrieval_signal`: bidirectional, weight ×0.06 (rewards fixing empties too)
- `overflow_signal`: bidirectional, weight ×0.04 (rewards fixing overflows too)
- `step_cost = -0.01`
- `redundancy_penalty = -0.04` for same action type twice in a row
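
The components above can be combined as in the sketch below. It assumes the terms are summed, that the two bidirectional signals lie in [-1, 1], and that the result is clipped to [0, 1]; the function name and argument names are hypothetical, and the real `_compute_reward` may differ in detail.

```python
def dense_reward(quality, prev_quality, quality_target,
                 empty_signal=0.0, overflow_signal=0.0, repeated_action=False):
    """Illustrative dense step reward built from the documented components."""
    # Absolute level: 0.10 at zero quality, 0.65 at or above the target.
    progress = 0.10 + 0.55 * min(1.0, quality / quality_target)
    # Direction: doubled quality change, clipped to +/-0.15.
    delta = max(-0.15, min(0.15, (quality - prev_quality) * 2.0))
    reward = progress + delta + 0.06 * empty_signal + 0.04 * overflow_signal
    reward -= 0.01  # step_cost
    if repeated_action:
        reward -= 0.04  # redundancy_penalty
    # Assumption: the final value is clipped into [0, 1].
    return max(0.0, min(1.0, reward))

best = dense_reward(0.75, 0.50, 0.75, empty_signal=1.0, overflow_signal=1.0)
worst = dense_reward(0.0, 0.50, 0.75, empty_signal=-1.0, overflow_signal=-1.0,
                     repeated_action=True)
print(round(best, 2), worst)
```

Under these assumptions the best case sums to 0.65 + 0.15 + 0.06 + 0.04 - 0.01 = 0.89, which matches the stated upper end of the non-terminal range.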

Submit reward (`_apply_action`):

- Success: `0.7 + 0.3 × task_score` → [0.7, 1.0]
- Failure: `0.2 × task_score` → [0.0, 0.2]

Task score (`_compute_task_score`):

- Task 1/2: `0.60*coverage + 0.25*precision + 0.15*efficiency`
- Task 3: `0.55*coverage + 0.25*precision + 0.20*multi_hop_coverage`
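
As a worked example, the task-score weights, the per-task success checks, and the submit reward fit together roughly as follows. The function names are hypothetical, and `efficiency` is passed in as a number because its definition is not given here.

```python
def task_score(task_id, coverage, precision, efficiency=0.0, multi_hop_coverage=0.0):
    """Weighted score using the documented per-task weights."""
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage

def is_success(task_id, score, multi_hop_coverage=0.0):
    """Success check applied on submit."""
    if task_id in (1, 2):
        return score >= 0.75
    return score >= 0.70 and multi_hop_coverage > 0.60

def submit_reward(score, success):
    """Terminal reward: [0.7, 1.0] on success, [0.0, 0.2] on failure."""
    return 0.7 + 0.3 * score if success else 0.2 * score

s1 = task_score(1, coverage=0.9, precision=0.8, efficiency=0.5)
r1 = submit_reward(s1, is_success(1, s1))

s3 = task_score(3, coverage=0.8, precision=0.8, multi_hop_coverage=0.5)
r3 = submit_reward(s3, is_success(3, s3, multi_hop_coverage=0.5))
print(round(s1, 4), round(r1, 4), round(s3, 4), round(r3, 4))
```

The Task 3 case shows why the second condition matters: a score of 0.74 clears the 0.70 bar, but multi-hop coverage of 0.5 fails the check, so the submit is scored as a failure.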

## Fault Math (Implemented)

All transformations are in `server/fault_math.py`.

- `CHUNK_TOO_LARGE`: 1D uniform filter along chunk axis; severity scales with `chunk_size`
- `CHUNK_TOO_SMALL`: gaussian noise scaled by small chunk size, mitigated by overlap
- `THRESHOLD_TOO_LOW`: additive gaussian noise
- `THRESHOLD_TOO_HIGH`: multiplicative score deflation (`* 0.55`)
- `TOP_K_TOO_SMALL`: score compression toward 0.5; less severe if reranking enabled
- `DUPLICATE_FLOODING`: boosts random duplicate columns; reduced if reranking enabled
- `CONTEXT_OVERFLOW`: zeroes tail columns based on `context_window_limit`
- `NO_RERANKING`: additive noise only when reranking is off
- `WRONG_EMBEDDING_MODEL`: implicit by selecting wrong matrix (not a direct transform)
- **Cross-encoder reranking blend**: after all faults, if `use_reranking=True`,
  blends faulted scores back toward pre-fault scores (alpha=0.35). This simulates a
  cross-encoder partially recovering the true relevance signal. The blend is
  non-monotonic for noise-based faults (it can change rank order) and restores
  score spread for compression faults.
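
Three of these transforms are simple enough to sketch directly. The deflation factor (0.55) and blend alpha (0.35) come from the list above; the compression `strength` parameter and the choice to weight the pre-fault scores by alpha are assumptions made for illustration, so check `server/fault_math.py` for the actual forms.

```python
import numpy as np

def threshold_too_high(s):
    """Multiplicative score deflation."""
    return s * 0.55

def compress_toward_half(s, strength):
    """Pull scores toward 0.5; strength=0 is a no-op, strength=1 collapses to 0.5.

    `strength` is a hypothetical parameter, not a documented one.
    """
    return 0.5 + (s - 0.5) * (1.0 - strength)

def rerank_blend(s_faulted, s_pre_fault, alpha=0.35):
    """Move faulted scores part of the way back toward pre-fault scores.

    Assumption: alpha is the weight on the pre-fault scores.
    """
    return (1.0 - alpha) * s_faulted + alpha * s_pre_fault

s_true = np.array([[0.9, 0.5, 0.1]])
deflated = threshold_too_high(s_true)
compressed = compress_toward_half(s_true, strength=0.5)
blended = rerank_blend(deflated, s_true)
print(deflated, compressed, blended)
```

In this sketch the blended scores always land between the faulted and pre-fault values, which is how reranking partially undoes deflation or compression without removing the fault entirely.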

## Determinism and Fallbacks

- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
- If required corpus files are missing, `server/corpus.py` falls back to synthetic data and emits warnings.
- Synthetic fallback is for smoke testing only, not for real training/evaluation.
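
The "sample once at reset, reuse on recomputation" pattern for noise can be sketched as below. `EpisodeNoise` is a hypothetical class, and the additive Gaussian form with a `sigma` scale is an assumption used only to show why recomputation stays deterministic within an episode.

```python
import numpy as np

class EpisodeNoise:
    """Holds a noise array drawn once per episode (illustrative only)."""

    def __init__(self, shape, seed):
        rng = np.random.default_rng(seed)
        # Drawn once at reset; never resampled during the episode.
        self.noise = rng.standard_normal(shape)

    def apply(self, s_true, sigma):
        """Recompute faulted scores from the stored noise."""
        return s_true + sigma * self.noise

ep = EpisodeNoise((5, 8), seed=7)
s = np.full((5, 8), 0.5)
a = ep.apply(s, sigma=0.1)
b = ep.apply(s, sigma=0.1)   # same config, same result
c = ep.apply(s, sigma=0.2)   # changed config, same noise pattern scaled
print(np.array_equal(a, b))
```

Because only the scale changes between recomputations, an agent that reverts a configuration change sees exactly the scores it saw before, rather than a fresh random draw.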