
Architecture

Runtime Topology

Agent / baseline script
  -> client.RAGDebugEnv (openenv.core.EnvClient)
  -> WebSocket/HTTP to FastAPI app (server/app.py)
  -> RAGDebugEnvironment (server/rag_debug_env_environment.py)
  -> Corpus artifacts (corpora/<domain>/*)

Server construction uses openenv.core.env_server.http_server.create_app:

  • Environment class: RagDebugEnvironment aliasing RAGDebugEnvironment
  • Action schema: RAGDebugAction
  • Observation schema: RAGDebugObservation
  • env_name="rag_debug_env"
  • max_concurrent_envs=1 in server/app.py

Core Simulation Contract

The environment does not call a live vector database during episodes.

Episode-time retrieval is simulated from precomputed matrices:

  • S_true_{general,medical,legal,code}.npy: query-chunk cosine similarity matrices, one per embedding model
  • ground_truth.json: relevant chunk IDs (R*) per query

At reset:

  1. Load one domain corpus (software, climate, medical)
  2. Sample episode queries (5 total per task)
  3. Slice full S_true matrices down to episode query rows
  4. Sample injected faults
  5. Build S_faulted via server/fault_math.py
  6. Return initial RAGDebugObservation

At step:

  1. Apply action to config/model/rewrite overlay
  2. Recompute S_faulted when required
  3. Simulate retrieval (top_k then threshold)
  4. Compute per-query coverage/precision and aggregate metrics
  5. Compute dense reward (or terminal submit reward)
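Steps 3–4 can be sketched as a minimal top-k-then-threshold retrieval followed by coverage/precision; the function names here are hypothetical, not the actual server API.

```python
import numpy as np

def simulate_retrieval(scores, top_k, threshold):
    """Keep the top_k chunks by score, then drop any below the threshold."""
    order = np.argsort(scores)[::-1][:top_k]
    return [int(i) for i in order if scores[i] >= threshold]

def coverage_precision(retrieved, relevant):
    """Per-query coverage (recall over R*) and precision."""
    retrieved, relevant = set(retrieved), set(relevant)
    hit = retrieved & relevant
    coverage = len(hit) / len(relevant) if relevant else 0.0
    precision = len(hit) / len(retrieved) if retrieved else 0.0
    return coverage, precision

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
retrieved = simulate_retrieval(scores, top_k=3, threshold=0.5)
cov, prec = coverage_precision(retrieved, relevant={0, 2, 3})
```

Note the ordering matters: thresholding happens after the top-k cut, so a too-high threshold can empty the retrieval even when top_k is adequate.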

Task Configuration

Values below are sourced from server/constants.py and server/rag_debug_env_environment.py.

Shared limits

  • Episode queries: 5 (_N_EPISODE_QUERIES for all tasks)
  • Max steps: 10 (_MAX_STEPS)

Task 1 (software)

  • Domain: software
  • Faults sampled from:
    • [chunk_too_large, no_reranking]
    • [threshold_too_high]
    • [top_k_too_small]
    • [chunk_too_large]
  • Success check on submit: task_score >= 0.75

Task 2 (climate)

  • Domain: climate
  • Faults sampled from:
    • [threshold_too_low, duplicate_flooding]
    • [top_k_too_small, context_overflow]
    • [duplicate_flooding]
    • [context_overflow]
  • Success check on submit: task_score >= 0.75

Task 3 (medical)

  • Domain: medical
  • Fixed fault set:
    • wrong_embedding_model
    • chunk_too_large
    • threshold_too_high
  • Initial active embedding model is legal (an intentional mismatch with the medical domain)
  • Query sampling forces up to 2 multi-hop queries per episode
  • Success check on submit:
    • task_score >= 0.70
    • multi_hop_coverage > 0.60
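Task 3's submit check requires both conditions to hold; a minimal sketch with a hypothetical helper name:

```python
def task3_success(task_score, multi_hop_coverage):
    # Both conditions from the task spec must hold on submit.
    # Note the multi-hop check is strict (>), the score check is not (>=).
    return task_score >= 0.70 and multi_hop_coverage > 0.60
```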

Reward and Scoring

All rewards lie in [0.0, 1.0]. Non-terminal step rewards span roughly [0.0, 0.89], driven mainly by absolute quality progress toward the success threshold.

Dense step reward (_compute_reward):

  • progress_reward: 0.10 + 0.55 × min(1, quality_score / quality_target) → [0.10, 0.65]. An absolute quality-level signal using _quality_score (the task_score formula minus the efficiency term). It ensures the full reward range is used across the episode: low-quality states get low rewards, high-quality states get high rewards.
  • delta_bonus: clip(Δquality × 2.0, −0.15, +0.15). A direction signal that distinguishes an improving step from a no-op at the same quality level.
  • empty_retrieval_signal: bidirectional, weight ×0.06 (rewards fixing empties too)
  • overflow_signal: bidirectional, weight ×0.04 (rewards fixing overflows too)
  • step_cost = -0.01
  • redundancy_penalty = -0.04 for same action type twice in a row
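The components above can be recombined into a sketch of the dense reward. The function name, the signal arguments, and the final clip to [0.0, 1.0] (inferred from the stated reward range) are assumptions, not the actual _compute_reward implementation.

```python
import numpy as np

def dense_step_reward(quality, prev_quality, quality_target,
                      empty_signal=0.0, overflow_signal=0.0,
                      redundant=False):
    """Illustrative recombination of the dense-reward components."""
    progress = 0.10 + 0.55 * min(1.0, quality / quality_target)
    delta_bonus = float(np.clip((quality - prev_quality) * 2.0, -0.15, 0.15))
    reward = (progress + delta_bonus
              + 0.06 * empty_signal      # bidirectional: fixing empties rewards
              + 0.04 * overflow_signal   # bidirectional: fixing overflows rewards
              - 0.01)                    # step_cost
    if redundant:
        reward -= 0.04                   # same action type twice in a row
    return float(np.clip(reward, 0.0, 1.0))
```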

Submit reward (_apply_action):

  • Success: 0.7 + 0.3 × task_score → [0.7, 1.0]
  • Failure: 0.2 × task_score → [0.0, 0.2]
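As a one-line sketch (hypothetical helper name):

```python
def submit_reward(task_score, success):
    # Success lands in [0.7, 1.0]; failure in [0.0, 0.2].
    return 0.7 + 0.3 * task_score if success else 0.2 * task_score
```

The gap between the two bands (0.2 to 0.7) means any successful submit strictly dominates any failed one.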

Task score (_compute_task_score):

  • Task 1/2: 0.60 × coverage + 0.25 × precision + 0.15 × efficiency
  • Task 3: 0.55 × coverage + 0.25 × precision + 0.20 × multi_hop_coverage
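The two weightings as a sketch (hypothetical signature; efficiency and multi_hop_coverage default to 0 for brevity):

```python
def task_score(task_id, coverage, precision, efficiency=0.0,
               multi_hop_coverage=0.0):
    # Tasks 1/2 weight efficiency; task 3 swaps it for multi-hop coverage
    # and shifts weight from coverage to that term.
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage
```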

Fault Math (Implemented)

All transformations are in server/fault_math.py.

  • CHUNK_TOO_LARGE: 1D uniform filter along chunk axis; severity scales with chunk_size
  • CHUNK_TOO_SMALL: Gaussian noise scaled by the small chunk size, mitigated by overlap
  • THRESHOLD_TOO_LOW: additive Gaussian noise
  • THRESHOLD_TOO_HIGH: multiplicative score deflation (× 0.55)
  • TOP_K_TOO_SMALL: score compression toward 0.5; less severe if reranking enabled
  • DUPLICATE_FLOODING: boosts random duplicate columns; reduced if reranking enabled
  • CONTEXT_OVERFLOW: zeroes tail columns based on context_window_limit
  • NO_RERANKING: additive noise only when reranking is off
  • WRONG_EMBEDDING_MODEL: implicit by selecting wrong matrix (not a direct transform)
  • Cross-encoder reranking blend: after all faults, if use_reranking=True, blends faulted scores back toward pre-fault scores (alpha=0.35). Simulates a cross-encoder partially recovering true relevance signal. Non-monotonic for noise-based faults (changes rank order), restores score spread for compression faults.
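Two of the transforms above, sketched on a toy matrix. How alpha is applied in the real blend is an assumption here (alpha weighting the pre-fault scores); the function names are illustrative.

```python
import numpy as np

def threshold_too_high(S):
    # Multiplicative score deflation, as described above.
    return S * 0.55

def reranking_blend(S_faulted, S_true, use_reranking, alpha=0.35):
    # When reranking is on, blend faulted scores back toward the
    # pre-fault matrix; alpha is assumed to weight the true scores.
    if not use_reranking:
        return S_faulted
    return (1.0 - alpha) * S_faulted + alpha * S_true

S_true = np.array([[0.9, 0.4, 0.1]])
S_faulted = threshold_too_high(S_true)
S_blended = reranking_blend(S_faulted, S_true, use_reranking=True)
```

For a purely multiplicative fault like this one, the blend restores score spread without changing rank order; for noise-based faults the blend can reorder chunks, which is why it is non-monotonic.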

Determinism and Fallbacks

  • Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
  • If required corpus files are missing, server/corpus.py falls back to synthetic data and emits warnings.
  • Synthetic fallback is for smoke testing only, not for real training/evaluation.
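The noise-caching pattern behind the first bullet can be sketched as follows (illustrative class and noise scale, not the actual implementation):

```python
import numpy as np

class EpisodeNoise:
    """Sample noise once at reset; reuse it on every recomputation."""

    def __init__(self, seed, shape):
        rng = np.random.default_rng(seed)
        self.noise = rng.normal(0.0, 0.05, size=shape)  # cached at reset

    def apply(self, S):
        # Recomputation reuses the cached array, so repeated calls
        # within an episode are fully deterministic.
        return S + self.noise

S = np.full((2, 3), 0.5)
ep = EpisodeNoise(seed=42, shape=S.shape)
```

Because the noise is a fixed array rather than a fresh draw, toggling a config value back and forth returns the environment to an identical state.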