Models Reference

Field-by-field reference for the Pydantic models in models.py.

Enums

EmbeddingModel

  • GENERAL -> sentence-transformers/all-MiniLM-L6-v2
  • MEDICAL -> NeuML/pubmedbert-base-embeddings
  • LEGAL -> nlpaueb/legal-bert-base-uncased
  • CODE -> sentence-transformers/multi-qa-mpnet-base-dot-v1
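
The mapping above can be pictured as a string-valued enum. This is a minimal sketch that assumes each member's value is the model ID shown; the actual definition in models.py may differ, and `resolve_model_id` is a hypothetical helper added only for illustration.

```python
from enum import Enum


class EmbeddingModel(str, Enum):
    """Sketch only: member values are assumed to be the model IDs listed above."""

    GENERAL = "sentence-transformers/all-MiniLM-L6-v2"
    MEDICAL = "NeuML/pubmedbert-base-embeddings"
    LEGAL = "nlpaueb/legal-bert-base-uncased"
    CODE = "sentence-transformers/multi-qa-mpnet-base-dot-v1"


def resolve_model_id(name: str) -> str:
    """Hypothetical helper: look up a member by name, ignoring case."""
    return EmbeddingModel[name.upper()].value
```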

Domain

  • SOFTWARE
  • CLIMATE
  • MEDICAL

ActionType

  • ADJUST_CHUNK_SIZE (params.value: int)
  • ADJUST_CHUNK_OVERLAP (params.value: int)
  • ADJUST_THRESHOLD (params.value: float)
  • ADJUST_TOP_K (params.value: int)
  • SWAP_EMBEDDING_MODEL (params.model: str)
  • TOGGLE_RERANKING (params.enabled: bool)
  • ADJUST_CONTEXT_LIMIT (params.value: int)
  • REWRITE_QUERY (params.query_id: int; an optional strategy param is also accepted from callers)
  • SUBMIT (no params)
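
To make the per-action parameter shapes concrete, the sketch below transcribes the list above into a lookup table and checks a params dict against it. `EXPECTED_PARAMS` and `check_params` are hypothetical names, not part of models.py, and the optional strategy for REWRITE_QUERY is left unchecked because its type is not documented here.

```python
from typing import Any

# Required params per action type, transcribed from the list above.
EXPECTED_PARAMS: dict[str, dict[str, type]] = {
    "ADJUST_CHUNK_SIZE": {"value": int},
    "ADJUST_CHUNK_OVERLAP": {"value": int},
    "ADJUST_THRESHOLD": {"value": float},
    "ADJUST_TOP_K": {"value": int},
    "SWAP_EMBEDDING_MODEL": {"model": str},
    "TOGGLE_RERANKING": {"enabled": bool},
    "ADJUST_CONTEXT_LIMIT": {"value": int},
    "REWRITE_QUERY": {"query_id": int},
    "SUBMIT": {},
}


def check_params(action_type: str, params: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the params look valid."""
    problems = []
    for key, expected in EXPECTED_PARAMS[action_type].items():
        if key not in params:
            problems.append(f"missing params.{key}")
            continue
        value = params[key]
        # bool is a subclass of int in Python, so reject it for numeric params.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        # Accept an int where a float is expected (e.g. a threshold of 1).
        accepted = (int, float) if expected is float else (expected,)
        if is_stray_bool or not isinstance(value, accepted):
            problems.append(f"params.{key} should be {expected.__name__}")
    return problems
```

For example, `check_params("TOGGLE_RERANKING", {})` reports the missing `enabled` key, while `check_params("SUBMIT", {})` returns an empty list.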

FaultType

  • CHUNK_TOO_LARGE
  • CHUNK_TOO_SMALL
  • THRESHOLD_TOO_LOW
  • THRESHOLD_TOO_HIGH
  • TOP_K_TOO_SMALL
  • CONTEXT_OVERFLOW
  • DUPLICATE_FLOODING
  • WRONG_EMBEDDING_MODEL
  • NO_RERANKING

Tier 1 OpenEnv Interface Models

RAGDebugAction(Action)

Fields:

  • action_type: ActionType
  • params: dict[str, Any] = {}

Notes:

  • params accepts either a dict or a JSON-encoded string of a dict; a validator coerces the string form into a dict
  • the server's step routing in _apply_action consumes this model
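
The coercion described in the notes can be sketched with the standard library alone. This is an illustration under assumptions, not the validator from models.py: `coerce_params` is a hypothetical name, and the error behaviour for non-object JSON or unsupported types is a guess.

```python
import json
from typing import Any


def coerce_params(raw: Any) -> dict[str, Any]:
    """Accept a dict as-is, or decode a JSON string that holds an object."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
        parsed = json.loads(raw)
        if not isinstance(parsed, dict):
            raise ValueError("params JSON must decode to an object")
        return parsed
    raise TypeError(f"params must be a dict or JSON string, got {type(raw).__name__}")
```

In a Pydantic model, logic like this would typically run in a "before" field validator so that the field still ends up typed as a dict.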

RAGDebugObservation(Observation)

Fields:

  • pipeline_config: PipelineConfig
  • query_results: list[QueryResult]
  • metrics: QualityMetrics
  • corpus_stats: CorpusStats
  • steps_taken: int
  • max_steps: int
  • task_id: int
  • task_description: str
  • done: bool
  • last_action_error: str | None
  • diagnostic_hints: list[str]
  • reward_components: dict[str, float]

Design note:

  • Injected faults are intentionally omitted from the observation and remain internal.

Tier 2 Internal Models

PipelineConfig

Defaults and bounds:

  • chunk_size: int = 512 (64..2048)
  • chunk_overlap: int = 50 (0..500)
  • similarity_threshold: float = 0.3 (0.0..1.0)
  • top_k: int = 10 (1..50)
  • embedding_model: EmbeddingModel = GENERAL
  • use_reranking: bool = False
  • context_window_limit: int = 4096 (512..16384)

Validation:

  • chunk_overlap < chunk_size (model validator)
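
The defaults, bounds, and cross-field rule above can be mirrored in a plain dataclass. The sketch below assumes the listed ranges are inclusive and uses a string in place of the EmbeddingModel enum; `PipelineConfigSketch` and `BOUNDS` are hypothetical names, and the real model is a Pydantic class.

```python
from dataclasses import dataclass

# Inclusive (low, high) bounds, transcribed from the list above.
BOUNDS = {
    "chunk_size": (64, 2048),
    "chunk_overlap": (0, 500),
    "similarity_threshold": (0.0, 1.0),
    "top_k": (1, 50),
    "context_window_limit": (512, 16384),
}


@dataclass
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50
    similarity_threshold: float = 0.3
    top_k: int = 10
    embedding_model: str = "GENERAL"  # stand-in for the EmbeddingModel enum
    use_reranking: bool = False
    context_window_limit: int = 4096

    def __post_init__(self) -> None:
        # Per-field range checks.
        for name, (low, high) in BOUNDS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        # Cross-field rule: the overlap must be strictly smaller than the chunk.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
```

Note that the cross-field rule can reject values that pass their individual bounds, for example `chunk_size=64` with `chunk_overlap=64`.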

QueryResult

Per-query retrieval result:

  • query_id: int
  • query_text: str
  • retrieved_chunk_ids: list[int]
  • retrieval_scores: list[float]
  • n_retrieved: int
  • coverage_score: float (0..1)
  • precision_score: float (0..1)
  • is_multi_hop: bool = False

Metric definitions:

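  • coverage = |R_agent ∩ R*| / |R*|
  • precision = |R_agent ∩ R*| / |R_agent|

Here R_agent is the set of chunks retrieved for the query under the current pipeline configuration, and R* is the reference set of relevant chunks for that query.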
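
The two ratios can be computed directly from sets of chunk IDs. In this sketch, returning 0.0 when a denominator is empty is an assumption; the source does not say how the environment handles empty sets, and `coverage_and_precision` is a hypothetical name.

```python
def coverage_and_precision(
    retrieved: set[int], relevant: set[int]
) -> tuple[float, float]:
    """Return (coverage, precision) for one query."""
    hits = len(retrieved & relevant)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision
```

With `retrieved = {1, 2, 3, 4}` and `relevant = {2, 4, 6}`, two chunks overlap, so coverage is 2/3 and precision is 2/4.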

QualityMetrics

Aggregate metrics:

  • mean_coverage: float
  • mean_precision: float
  • mean_recall: float
  • n_empty_retrievals: int
  • n_context_overflows: int
  • multi_hop_coverage: float | None

CorpusStats

Static corpus metadata:

  • domain: Domain
  • n_documents: int
  • n_chunks: int
  • avg_chunk_tokens: int
  • has_near_duplicates: bool
  • n_queries: int
  • n_multi_hop_queries: int

Reward

  • value: float
  • components: dict[str, float]

Component names emitted by environment reward logic:

  • progress_reward
  • delta_bonus
  • empty_retrieval_signal
  • overflow_signal
  • step_cost
  • redundancy_penalty
  • invalid_action_penalty (only when the last action had invalid params)

Terminal submit components:

  • terminal_success (successful submit)
  • terminal_failure (unsuccessful submit)

Terminal submit rewards are handled directly in action routing:

  • success: 0.7 + 0.3 * task_score (clipped to [0.7, 1.0])
  • failure: 0.2 * task_score (clipped to [0.0, 0.2])
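
The two terminal formulas, including their clipping ranges, can be written as a single function. `terminal_reward` is a hypothetical name for illustration; the environment applies this logic inside its action routing.

```python
def terminal_reward(task_score: float, success: bool) -> float:
    """Reward for a SUBMIT action, following the formulas above."""
    if success:
        # 0.7 + 0.3 * task_score, clipped to [0.7, 1.0]
        return min(1.0, max(0.7, 0.7 + 0.3 * task_score))
    # 0.2 * task_score, clipped to [0.0, 0.2]
    return min(0.2, max(0.0, 0.2 * task_score))
```

Because the failure branch tops out at 0.2 and the success branch starts at 0.7, any successful submit earns more than any failed one.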

FaultConfig

Internal fault descriptor:

  • fault_type: FaultType
  • params: dict[str, Any] = {}
  • description: str = ""

InternalState

Server-side state:

  • injected_faults: list[FaultConfig]
  • episode_seed: int
  • action_history: list[RAGDebugAction]
  • reward_history: list[float]

Properties:

  • total_reward
  • fault_names

EpisodeResult

Post-episode summary model (not currently exposed via a custom app endpoint):

  • task_id: int
  • task_score: float (0..1)
  • success: bool
  • n_steps: int
  • total_reward: float
  • final_metrics: QualityMetrics
  • fault_names: list[str]
  • action_history: list[RAGDebugAction]

Runtime Scoring Rules (Environment)

From server/rag_debug_env_environment.py:

Task score:

  • Tasks 1 and 2:
    • 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)
  • Task 3:
    • 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage

Success checks:

  • Task 1: task_score >= 0.75
  • Task 2: task_score >= 0.75
  • Task 3: task_score >= 0.70 and multi_hop_coverage > 0.60
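
The scoring and success rules above can be sketched as two small functions. The function names are hypothetical, and treating a missing multi_hop_coverage as 0.0 for Task 3 is an assumption; the environment code in server/rag_debug_env_environment.py is the authority.

```python
from typing import Optional


def task_score(
    task_id: int,
    mean_coverage: float,
    mean_precision: float,
    n_steps: int,
    max_steps: int,
    multi_hop_coverage: Optional[float] = None,
) -> float:
    """Weighted task score following the formulas above."""
    if task_id in (1, 2):
        efficiency = 1 - n_steps / max_steps
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if task_id == 3:
        # Assumption: a missing multi-hop coverage counts as 0.0.
        multi_hop = multi_hop_coverage or 0.0
        return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop
    raise ValueError(f"unknown task_id: {task_id}")


def is_success(
    task_id: int, score: float, multi_hop_coverage: Optional[float] = None
) -> bool:
    """Apply the per-task success thresholds."""
    if task_id in (1, 2):
        return score >= 0.75
    if task_id == 3:
        # Note the strict inequality on multi-hop coverage.
        return score >= 0.70 and (multi_hop_coverage or 0.0) > 0.60
    raise ValueError(f"unknown task_id: {task_id}")
```

For instance, a Task 1 episode with mean_coverage 0.9, mean_precision 0.8, and 5 of 10 steps used scores 0.54 + 0.20 + 0.075 = 0.815, which clears the 0.75 threshold.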