# Architecture
## Runtime Topology
```text
Agent / baseline script
-> client.RAGDebugEnv (openenv.core.EnvClient)
-> WebSocket/HTTP to FastAPI app (server/app.py)
-> RAGDebugEnvironment (server/rag_debug_env_environment.py)
-> Corpus artifacts (corpora/<domain>/*)
```
Server construction uses `openenv.core.env_server.http_server.create_app`:
- Environment class: `RagDebugEnvironment` aliasing `RAGDebugEnvironment`
- Action schema: `RAGDebugAction`
- Observation schema: `RAGDebugObservation`
- `env_name="rag_debug_env"`
- `max_concurrent_envs=1` in `server/app.py`
## Core Simulation Contract
The environment does not call a live vector database during episodes.
Episode-time retrieval is simulated from precomputed matrices:
- `S_true_{general,medical,legal,code}.npy`: query-chunk cosine matrices
- `ground_truth.json`: relevant chunk IDs (`R*`) per query
At reset:
1. Load one domain corpus (one of `software`, `climate`, or `medical`)
2. Sample episode queries (5 total per task)
3. Slice full `S_true` matrices down to episode query rows
4. Sample injected faults
5. Build `S_faulted` via `server/fault_math.py`
6. Return initial `RAGDebugObservation`
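Steps 2–3 above can be sketched with numpy; `slice_episode_matrix` and its seeding scheme are hypothetical illustrations, not the environment's actual helpers:

```python
import numpy as np

def slice_episode_matrix(S_true: np.ndarray, n_queries: int, seed: int):
    """Sample episode query rows without replacement, then slice the full
    query-chunk cosine matrix down to just those rows."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(S_true.shape[0], size=n_queries, replace=False)
    return rows, S_true[rows]
```

The faulted matrix is then built from this episode-sized slice, so per-step recomputation stays cheap.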
At step:
1. Apply action to config/model/rewrite overlay
2. Recompute `S_faulted` when required
3. Simulate retrieval (`top_k` selection, then threshold filtering)
4. Compute per-query coverage/precision and aggregate metrics
5. Compute dense reward (or terminal submit reward)
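Steps 3–4 can be illustrated with a minimal numpy sketch; the function names are hypothetical, and the real environment's tie-breaking and metric aggregation may differ:

```python
import numpy as np

def simulate_retrieval(scores: np.ndarray, top_k: int, threshold: float) -> list[int]:
    """Take the top_k highest-scoring chunks, then drop any below the threshold."""
    top = np.argsort(scores)[::-1][:top_k]
    return [int(i) for i in top if scores[i] >= threshold]

def coverage_precision(retrieved: list[int], relevant: set[int]) -> tuple[float, float]:
    """Per-query coverage (recall over relevant chunks) and precision."""
    hits = len(set(retrieved) & relevant)
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision
```

Per-query values would then be averaged over the 5 episode queries to produce the aggregate metrics.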
## Task Configuration
Values below are sourced from `server/constants.py` and `server/rag_debug_env_environment.py`.
### Shared limits
- Episode queries: 5 (`_N_EPISODE_QUERIES` for all tasks)
- Max steps: 10 (`_MAX_STEPS`)
### Task 1 (software)
- Domain: `software`
- Faults sampled from:
- `[chunk_too_large, no_reranking]`
- `[threshold_too_high]`
- `[top_k_too_small]`
- `[chunk_too_large]`
- Success check on submit: `task_score >= 0.75`
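The fault sampling above amounts to a uniform draw over the candidate sets; `TASK1_FAULT_SETS` and `sample_task1_faults` are hypothetical names, not identifiers from `server/rag_debug_env_environment.py`:

```python
import random

# Candidate fault sets for Task 1, as listed above.
TASK1_FAULT_SETS = [
    ["chunk_too_large", "no_reranking"],
    ["threshold_too_high"],
    ["top_k_too_small"],
    ["chunk_too_large"],
]

def sample_task1_faults(rng: random.Random) -> list[str]:
    """Pick one candidate fault set uniformly at random at reset."""
    return rng.choice(TASK1_FAULT_SETS)
```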
### Task 2 (climate)
- Domain: `climate`
- Faults sampled from:
- `[threshold_too_low, duplicate_flooding]`
- `[top_k_too_small, context_overflow]`
- `[duplicate_flooding]`
- `[context_overflow]`
- Success check on submit: `task_score >= 0.75`
### Task 3 (medical)
- Domain: `medical`
- Fixed fault set:
- `wrong_embedding_model`
- `chunk_too_large`
- `threshold_too_high`
- Initial active model is `legal` (intentional mismatch)
- Query sampling forces up to 2 multi-hop queries per episode
- Success check on submit:
- `task_score >= 0.70`
- `multi_hop_coverage > 0.60`
## Reward and Scoring
All rewards are in **[0.0, 1.0]**. Non-terminal steps span **[0.0, ~0.89]**
based on absolute quality progress toward the success threshold.
Dense step reward (`_compute_reward`):
- `progress_reward`: `0.10 + 0.55 × min(1, quality_score / quality_target)` → [0.10, 0.65]
  Absolute quality-level signal using `_quality_score` (the task_score formula minus efficiency).
  Ensures the full reward range is utilised across the episode: low-quality states
  get low rewards, high-quality states get high rewards.
- `delta_bonus`: `clip(Δquality × 2.0, −0.15, +0.15)`
  Direction signal that distinguishes an improving step from a no-op at the same level.
- `empty_retrieval_signal`: bidirectional, weight ×0.06 (rewards fixing empties too)
- `overflow_signal`: bidirectional, weight ×0.04 (rewards fixing overflows too)
- `step_cost = -0.01`
- `redundancy_penalty = -0.04` for the same action type twice in a row
Submit reward (`_apply_action`):
- Success: `0.7 + 0.3 × task_score` → [0.7, 1.0]
- Failure: `0.2 × task_score` → [0.0, 0.2]
Task score (`_compute_task_score`):
- Task 1/2: `0.60 × coverage + 0.25 × precision + 0.15 × efficiency`
- Task 3: `0.55 × coverage + 0.25 × precision + 0.20 × multi_hop_coverage`
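The scoring formulas above condense into a small sketch; `task_score` and `submit_reward` are hypothetical stand-ins for `_compute_task_score` and the submit branch of `_apply_action`:

```python
def task_score(task_id, coverage, precision, efficiency=0.0, multi_hop_coverage=0.0):
    """Weighted task score, per the formulas above (hypothetical signature)."""
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage

def submit_reward(score, success):
    """Terminal reward on submit: high band [0.7, 1.0] on success,
    low band [0.0, 0.2] on failure."""
    return 0.7 + 0.3 * score if success else 0.2 * score
```

The disjoint bands guarantee that any successful submit outranks any failed one.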
## Fault Math (Implemented)
All transformations are in `server/fault_math.py`.
- `CHUNK_TOO_LARGE`: 1D uniform filter along chunk axis; severity scales with `chunk_size`
- `CHUNK_TOO_SMALL`: gaussian noise scaled by small chunk size, mitigated by overlap
- `THRESHOLD_TOO_LOW`: additive gaussian noise
- `THRESHOLD_TOO_HIGH`: multiplicative score deflation (`* 0.55`)
- `TOP_K_TOO_SMALL`: score compression toward 0.5; less severe if reranking enabled
- `DUPLICATE_FLOODING`: boosts random duplicate columns; reduced if reranking enabled
- `CONTEXT_OVERFLOW`: zeroes tail columns based on `context_window_limit`
- `NO_RERANKING`: additive noise only when reranking is off
- `WRONG_EMBEDDING_MODEL`: implicit by selecting wrong matrix (not a direct transform)
- **Cross-encoder reranking blend**: after all faults, if `use_reranking=True`,
  blends faulted scores back toward pre-fault scores (alpha=0.35). Simulates a
  cross-encoder partially recovering the true relevance signal. Non-monotonic for
  noise-based faults (it changes rank order); restores score spread for compression faults.
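Two of the transforms above, sketched with numpy. The 0.55 deflation factor and alpha=0.35 come from this document, but whether alpha weights the pre-fault or the faulted matrix is an assumption here; the function names are hypothetical:

```python
import numpy as np

def threshold_too_high(S: np.ndarray) -> np.ndarray:
    """Multiplicative score deflation, pushing scores below the retrieval threshold."""
    return S * 0.55

def reranking_blend(S_faulted: np.ndarray, S_true: np.ndarray,
                    alpha: float = 0.35) -> np.ndarray:
    """Blend faulted scores back toward pre-fault scores.
    Assumption: alpha weights the pre-fault (true) matrix."""
    return alpha * S_true + (1.0 - alpha) * S_faulted
```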
## Determinism and Fallbacks
- Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
- If required corpus files are missing, `server/corpus.py` falls back to synthetic data and emits warnings.
- Synthetic fallback is for smoke testing only, not for real training/evaluation.
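The reset-time noise caching can be sketched as follows; `NoiseCache` is a hypothetical illustration, not the class used in `server/fault_math.py`:

```python
import numpy as np

class NoiseCache:
    """Sample noise once (at reset) and reuse it on every recomputation,
    so intra-episode fault math is deterministic."""

    def __init__(self, shape: tuple[int, ...], seed: int):
        self._noise = np.random.default_rng(seed).normal(0.0, 0.05, size=shape)

    def apply(self, S: np.ndarray) -> np.ndarray:
        """Add the cached noise; repeated calls give identical results."""
        return S + self._noise
```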