
Architecture

Runtime Topology

Agent / baseline script
  -> client.RAGDebugEnv (openenv.core.EnvClient)
  -> WebSocket/HTTP to FastAPI app (server/app.py)
  -> RAGDebugEnvironment (server/rag_debug_env_environment.py)
  -> Corpus artifacts (corpora/<domain>/*)

Server construction uses openenv.core.env_server.http_server.create_app:

  • Environment class: RagDebugEnvironment aliasing RAGDebugEnvironment
  • Action schema: RAGDebugAction
  • Observation schema: RAGDebugObservation
  • env_name="rag_debug_env"
  • max_concurrent_envs=1 in server/app.py

Core Simulation Contract

The environment does not call a live vector database during episodes.

Episode-time retrieval is simulated from precomputed matrices:

  • S_true_{general,medical,legal,code}.npy: query-chunk cosine similarity matrices, one per embedding model
  • ground_truth.json: relevant chunk IDs (R*) per query

At reset:

  1. Load one domain corpus (software, climate, medical)
  2. Sample episode queries (5 total per task)
  3. Slice full S_true matrices down to episode query rows
  4. Sample injected faults
  5. Build S_faulted via server/fault_math.py
  6. Return initial RAGDebugObservation

At step:

  1. Apply action to config/model/rewrite overlay
  2. Recompute S_faulted when required
  3. Simulate retrieval (top_k then threshold)
  4. Compute per-query coverage/precision and aggregate metrics
  5. Compute dense reward (or terminal submit reward)
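Steps 3–4 can be sketched as a minimal top-k-then-threshold retrieval followed by coverage/precision; the function names here are hypothetical, not the actual server API.

```python
import numpy as np

def simulate_retrieval(scores, top_k, threshold):
    """Keep the top_k chunks by score, then drop any below the threshold."""
    order = np.argsort(scores)[::-1][:top_k]
    return [int(i) for i in order if scores[i] >= threshold]

def coverage_precision(retrieved, relevant):
    """Per-query coverage (recall over R*) and precision."""
    retrieved, relevant = set(retrieved), set(relevant)
    hit = retrieved & relevant
    coverage = len(hit) / len(relevant) if relevant else 0.0
    precision = len(hit) / len(retrieved) if retrieved else 0.0
    return coverage, precision

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
retrieved = simulate_retrieval(scores, top_k=3, threshold=0.5)
cov, prec = coverage_precision(retrieved, relevant={0, 2, 3})
```

Note the ordering matters: thresholding happens after the top-k cut, so a too-high threshold can empty the retrieval even when top_k is adequate.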

Task Configuration

Values below are sourced from server/constants.py and server/rag_debug_env_environment.py.

Shared limits

  • Episode queries: 5 (_N_EPISODE_QUERIES for all tasks)
  • Max steps: 10 (_MAX_STEPS)

Task 1 (software)

  • Domain: software
  • Faults sampled from:
    • [chunk_too_large, no_reranking]
    • [threshold_too_high]
    • [top_k_too_small]
    • [chunk_too_large]
  • Success check on submit: task_score >= 0.75

Task 2 (climate)

  • Domain: climate
  • Faults sampled from:
    • [threshold_too_low, duplicate_flooding]
    • [top_k_too_small, context_overflow]
    • [duplicate_flooding]
    • [context_overflow]
  • Success check on submit: task_score >= 0.75

Task 3 (medical)

  • Domain: medical
  • Fixed fault set:
    • wrong_embedding_model
    • chunk_too_large
    • threshold_too_high
  • Initial active embedding model is legal (an intentional mismatch with the medical domain)
  • Query sampling forces up to 2 multi-hop queries per episode
  • Success check on submit:
    • task_score >= 0.70
    • multi_hop_coverage > 0.60
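Task 3's submit check requires both conditions to hold; a minimal sketch with a hypothetical helper name:

```python
def task3_success(task_score, multi_hop_coverage):
    # Both conditions from the task spec must hold on submit.
    # Note the multi-hop check is strict (>), the score check is not (>=).
    return task_score >= 0.70 and multi_hop_coverage > 0.60
```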

Reward and Scoring

All rewards lie in [0.0, 1.0]. Non-terminal step rewards span roughly [0.0, 0.89], driven mainly by absolute quality progress toward the success threshold.

Dense step reward (_compute_reward):

  • progress_reward: 0.10 + 0.55 × min(1, quality_score / quality_target) → [0.10, 0.65]. An absolute quality-level signal using _quality_score (the task_score formula minus the efficiency term). It ensures the full reward range is used across the episode: low-quality states get low rewards, high-quality states get high rewards.
  • delta_bonus: clip(Δquality × 2.0, −0.15, +0.15). A direction signal that distinguishes an improving step from a no-op at the same quality level.
  • empty_retrieval_signal: bidirectional, weight ×0.06 (rewards fixing empties too)
  • overflow_signal: bidirectional, weight ×0.04 (rewards fixing overflows too)
  • step_cost = -0.01
  • redundancy_penalty = -0.04 for same action type twice in a row
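The components above can be recombined into a sketch of the dense reward. The function name, the signal arguments, and the final clip to [0.0, 1.0] (inferred from the stated reward range) are assumptions, not the actual _compute_reward implementation.

```python
import numpy as np

def dense_step_reward(quality, prev_quality, quality_target,
                      empty_signal=0.0, overflow_signal=0.0,
                      redundant=False):
    """Illustrative recombination of the dense-reward components."""
    progress = 0.10 + 0.55 * min(1.0, quality / quality_target)
    delta_bonus = float(np.clip((quality - prev_quality) * 2.0, -0.15, 0.15))
    reward = (progress + delta_bonus
              + 0.06 * empty_signal      # bidirectional: fixing empties rewards
              + 0.04 * overflow_signal   # bidirectional: fixing overflows rewards
              - 0.01)                    # step_cost
    if redundant:
        reward -= 0.04                   # same action type twice in a row
    return float(np.clip(reward, 0.0, 1.0))
```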

Submit reward (_apply_action):

  • Success: 0.7 + 0.3 × task_score → [0.7, 1.0]
  • Failure: 0.2 × task_score → [0.0, 0.2]
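As a one-line sketch (hypothetical helper name):

```python
def submit_reward(task_score, success):
    # Success lands in [0.7, 1.0]; failure in [0.0, 0.2].
    return 0.7 + 0.3 * task_score if success else 0.2 * task_score
```

The gap between the two bands (0.2 to 0.7) means any successful submit strictly dominates any failed one.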

Task score (_compute_task_score):

  • Task 1/2: 0.60 × coverage + 0.25 × precision + 0.15 × efficiency
  • Task 3: 0.55 × coverage + 0.25 × precision + 0.20 × multi_hop_coverage
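The two weightings as a sketch (hypothetical signature; efficiency and multi_hop_coverage default to 0 for brevity):

```python
def task_score(task_id, coverage, precision, efficiency=0.0,
               multi_hop_coverage=0.0):
    # Tasks 1/2 weight efficiency; task 3 swaps it for multi-hop coverage
    # and shifts weight from coverage to that term.
    if task_id in (1, 2):
        return 0.60 * coverage + 0.25 * precision + 0.15 * efficiency
    return 0.55 * coverage + 0.25 * precision + 0.20 * multi_hop_coverage
```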

Fault Math (Implemented)

All transformations are in server/fault_math.py.

  • CHUNK_TOO_LARGE: 1D uniform filter along chunk axis; severity scales with chunk_size
  • CHUNK_TOO_SMALL: Gaussian noise scaled by the small chunk size, mitigated by overlap
  • THRESHOLD_TOO_LOW: additive Gaussian noise
  • THRESHOLD_TOO_HIGH: multiplicative score deflation (× 0.55)
  • TOP_K_TOO_SMALL: score compression toward 0.5; less severe if reranking enabled
  • DUPLICATE_FLOODING: boosts random duplicate columns; reduced if reranking enabled
  • CONTEXT_OVERFLOW: zeroes tail columns based on context_window_limit
  • NO_RERANKING: additive noise only when reranking is off
  • WRONG_EMBEDDING_MODEL: implicit by selecting wrong matrix (not a direct transform)
  • Cross-encoder reranking blend: after all faults, if use_reranking=True, blends faulted scores back toward pre-fault scores (alpha=0.35). Simulates a cross-encoder partially recovering true relevance signal. Non-monotonic for noise-based faults (changes rank order), restores score spread for compression faults.
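Two of the transforms above, sketched on a toy matrix. How alpha is applied in the real blend is an assumption here (alpha weighting the pre-fault scores); the function names are illustrative.

```python
import numpy as np

def threshold_too_high(S):
    # Multiplicative score deflation, as described above.
    return S * 0.55

def reranking_blend(S_faulted, S_true, use_reranking, alpha=0.35):
    # When reranking is on, blend faulted scores back toward the
    # pre-fault matrix; alpha is assumed to weight the true scores.
    if not use_reranking:
        return S_faulted
    return (1.0 - alpha) * S_faulted + alpha * S_true

S_true = np.array([[0.9, 0.4, 0.1]])
S_faulted = threshold_too_high(S_true)
S_blended = reranking_blend(S_faulted, S_true, use_reranking=True)
```

For a purely multiplicative fault like this one, the blend restores score spread without changing rank order; for noise-based faults the blend can reorder chunks, which is why it is non-monotonic.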

Determinism and Fallbacks

  • Noise arrays and duplicate indices are sampled once at reset and reused during recomputation for deterministic intra-episode behavior.
  • If required corpus files are missing, server/corpus.py falls back to synthetic data and emits warnings.
  • Synthetic fallback is for smoke testing only, not for real training/evaluation.
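The noise-caching pattern behind the first bullet can be sketched as follows (illustrative class and noise scale, not the actual implementation):

```python
import numpy as np

class EpisodeNoise:
    """Sample noise once at reset; reuse it on every recomputation."""

    def __init__(self, seed, shape):
        rng = np.random.default_rng(seed)
        self.noise = rng.normal(0.0, 0.05, size=shape)  # cached at reset

    def apply(self, S):
        # Recomputation reuses the cached array, so repeated calls
        # within an episode are fully deterministic.
        return S + self.noise

S = np.full((2, 3), 0.5)
ep = EpisodeNoise(seed=42, shape=S.shape)
```

Because the noise is a fixed array rather than a fresh draw, toggling a config value back and forth returns the environment to an identical state.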