# Models Reference
Field-by-field reference for the Pydantic models in `models.py`.
## Enums
### `EmbeddingModel`
- `GENERAL` -> `sentence-transformers/all-MiniLM-L6-v2`
- `MEDICAL` -> `NeuML/pubmedbert-base-embeddings`
- `LEGAL` -> `nlpaueb/legal-bert-base-uncased`
- `CODE` -> `sentence-transformers/multi-qa-mpnet-base-dot-v1`
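
One way to picture this mapping is as a string-valued enum whose members carry the Hugging Face model IDs listed above. The sketch below is illustrative only: the class name `EmbeddingModelSketch` is hypothetical, and whether `models.py` stores the model ID as the enum value or in a separate lookup table is an assumption.

```python
from enum import Enum


class EmbeddingModelSketch(str, Enum):
    """Hypothetical stand-in for EmbeddingModel; values are the IDs listed above."""

    GENERAL = "sentence-transformers/all-MiniLM-L6-v2"
    MEDICAL = "NeuML/pubmedbert-base-embeddings"
    LEGAL = "nlpaueb/legal-bert-base-uncased"
    CODE = "sentence-transformers/multi-qa-mpnet-base-dot-v1"


# Members can be looked up by name or by model ID.
by_name = EmbeddingModelSketch["MEDICAL"]
by_value = EmbeddingModelSketch("NeuML/pubmedbert-base-embeddings")
```

Both lookups resolve to the same member, so code that receives either form can normalize to the enum before use.
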
### `Domain`
- `SOFTWARE`
- `CLIMATE`
- `MEDICAL`
### `ActionType`
- `ADJUST_CHUNK_SIZE` (`params.value: int`)
- `ADJUST_CHUNK_OVERLAP` (`params.value: int`)
- `ADJUST_THRESHOLD` (`params.value: float`)
- `ADJUST_TOP_K` (`params.value: int`)
- `SWAP_EMBEDDING_MODEL` (`params.model: str`)
- `TOGGLE_RERANKING` (`params.enabled: bool`)
- `ADJUST_CONTEXT_LIMIT` (`params.value: int`)
- `REWRITE_QUERY` (`params.query_id: int`, optional `strategy` accepted by callers)
- `SUBMIT` (no params)
### `FaultType`
- `CHUNK_TOO_LARGE`
- `CHUNK_TOO_SMALL`
- `THRESHOLD_TOO_LOW`
- `THRESHOLD_TOO_HIGH`
- `TOP_K_TOO_SMALL`
- `CONTEXT_OVERFLOW`
- `DUPLICATE_FLOODING`
- `WRONG_EMBEDDING_MODEL`
- `NO_RERANKING`
## Tier 1 OpenEnv Interface Models
### `RAGDebugAction(Action)`
Fields:
- `action_type: ActionType`
- `params: dict[str, Any] = {}`
Notes:
- `params` accepts either a dict or a JSON-encoded string of a dict; a validator coerces the string form into a dict.
- The server's step routing in `_apply_action` consumes this model.
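
The coercion described in the notes can be sketched with the standard library alone. This is a minimal illustration, not the validator from `models.py`: the function name `coerce_params` and the exact error handling are assumptions.

```python
import json
from typing import Any


def coerce_params(raw: Any) -> dict[str, Any]:
    """Return params as a dict, accepting a dict or a JSON-encoded dict.

    Hypothetical helper; the real model does this inside a Pydantic validator.
    """
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        parsed = json.loads(raw)  # raises a ValueError subclass on malformed JSON
        if not isinstance(parsed, dict):
            raise ValueError("params JSON must decode to an object")
        return parsed
    raise TypeError(f"params must be a dict or JSON string, got {type(raw).__name__}")


# Both forms yield the same params for an ADJUST_CHUNK_SIZE action.
from_dict = coerce_params({"value": 256})
from_json = coerce_params('{"value": 256}')
```

Accepting the string form is convenient for clients that serialize the whole action as text, since the server sees a plain dict either way.
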
### `RAGDebugObservation(Observation)`
Fields:
- `pipeline_config: PipelineConfig`
- `query_results: list[QueryResult]`
- `metrics: QualityMetrics`
- `corpus_stats: CorpusStats`
- `steps_taken: int`
- `max_steps: int`
- `task_id: int`
- `task_description: str`
- `done: bool`
- `last_action_error: str | None`
- `diagnostic_hints: list[str]`
- `reward_components: dict[str, float]`
Design note:
- Injected faults are intentionally omitted from the observation and remain internal to the server (see `InternalState`).
## Tier 2 Internal Models
### `PipelineConfig`
Defaults and bounds:
- `chunk_size: int = 512` (`64..2048`)
- `chunk_overlap: int = 50` (`0..500`)
- `similarity_threshold: float = 0.3` (`0.0..1.0`)
- `top_k: int = 10` (`1..50`)
- `embedding_model: EmbeddingModel = GENERAL`
- `use_reranking: bool = False`
- `context_window_limit: int = 4096` (`512..16384`)
Validation:
- `chunk_overlap < chunk_size` (model validator)
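
The defaults, bounds, and cross-field check above can be mirrored in a small standard-library sketch. It is not the Pydantic model itself: the class name `PipelineConfigSketch` is hypothetical, the bounds are assumed to be inclusive, and `embedding_model` is simplified to a plain string.

```python
from dataclasses import dataclass

# (low, high) bounds copied from the list above; inclusivity is an assumption.
BOUNDS = {
    "chunk_size": (64, 2048),
    "chunk_overlap": (0, 500),
    "similarity_threshold": (0.0, 1.0),
    "top_k": (1, 50),
    "context_window_limit": (512, 16384),
}


@dataclass
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50
    similarity_threshold: float = 0.3
    top_k: int = 10
    embedding_model: str = "GENERAL"  # simplified; the real field is an enum
    use_reranking: bool = False
    context_window_limit: int = 4096

    def __post_init__(self) -> None:
        # Per-field range checks.
        for name, (low, high) in BOUNDS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside {low}..{high}")
        # Cross-field rule: overlap must be strictly smaller than chunk size.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")


default_config = PipelineConfigSketch()
```

Note that a value can pass its own range check and still fail the cross-field rule, for example `chunk_size=64` with `chunk_overlap=64`.
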
### `QueryResult`
Per-query retrieval result:
- `query_id: int`
- `query_text: str`
- `retrieved_chunk_ids: list[int]`
- `retrieval_scores: list[float]`
- `n_retrieved: int`
- `coverage_score: float` (`0..1`)
- `precision_score: float` (`0..1`)
- `is_multi_hop: bool = False`
Metric definitions:
- coverage = `|R_agent ∩ R*| / |R*|`
- precision = `|R_agent ∩ R*| / |R_agent|`
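
The two formulas can be computed with set operations. The sketch below reads `R_agent` as the set of chunk IDs the pipeline retrieved and `R*` as the set of ground-truth relevant chunk IDs; that reading, the function name, and the choice to return `0.0` for an empty denominator are assumptions rather than documented behavior.

```python
from typing import Iterable


def coverage_and_precision(
    retrieved_ids: Iterable[int], relevant_ids: Iterable[int]
) -> tuple[float, float]:
    """Compute (coverage, precision) for one query from chunk ID collections."""
    retrieved = set(retrieved_ids)   # R_agent
    relevant = set(relevant_ids)     # R*
    hits = len(retrieved & relevant)
    # Assumption: an empty denominator yields 0.0 instead of raising.
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return coverage, precision


# Four chunks retrieved, three relevant, two in common.
coverage, precision = coverage_and_precision([1, 2, 3, 4], [2, 4, 6])
```

Here two of the three relevant chunks were found (coverage 2/3) and two of the four retrieved chunks were relevant (precision 1/2).
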
### `QualityMetrics`
Aggregate metrics:
- `mean_coverage: float`
- `mean_precision: float`
- `mean_recall: float`
- `n_empty_retrievals: int`
- `n_context_overflows: int`
- `multi_hop_coverage: float | None`
### `CorpusStats`
Static corpus metadata:
- `domain: Domain`
- `n_documents: int`
- `n_chunks: int`
- `avg_chunk_tokens: int`
- `has_near_duplicates: bool`
- `n_queries: int`
- `n_multi_hop_queries: int`
### `Reward`
- `value: float`
- `components: dict[str, float]`
Component names emitted by environment reward logic:
- `progress_reward`
- `delta_bonus`
- `empty_retrieval_signal`
- `overflow_signal`
- `step_cost`
- `redundancy_penalty`
- `invalid_action_penalty` (only when the last action had invalid params)
Terminal submit components:
- `terminal_success` (successful submit)
- `terminal_failure` (unsuccessful submit)
Terminal submit rewards are handled directly in action routing:
- success: `0.7 + 0.3 * task_score` (clipped to `[0.7, 1.0]`)
- failure: `0.2 * task_score` (clipped to `[0.0, 0.2]`)
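
The two terminal formulas and their clipping ranges can be written as a single function. This is a sketch of the arithmetic only; the function name `terminal_submit_reward` is hypothetical and does not refer to the environment's actual routing code.

```python
def terminal_submit_reward(task_score: float, success: bool) -> float:
    """Reward for a SUBMIT action, following the formulas listed above."""
    if success:
        raw = 0.7 + 0.3 * task_score
        return min(max(raw, 0.7), 1.0)   # clip to [0.7, 1.0]
    raw = 0.2 * task_score
    return min(max(raw, 0.0), 0.2)       # clip to [0.0, 0.2]


success_reward = terminal_submit_reward(0.8, success=True)
failure_reward = terminal_submit_reward(0.5, success=False)
```

Because the success range starts at 0.7 and the failure range ends at 0.2, any successful submit earns more than any failed one regardless of `task_score`.
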
### `FaultConfig`
Internal fault descriptor:
- `fault_type: FaultType`
- `params: dict[str, Any] = {}`
- `description: str = ""`
### `InternalState`
Server-side state:
- `injected_faults: list[FaultConfig]`
- `episode_seed: int`
- `action_history: list[RAGDebugAction]`
- `reward_history: list[float]`
Properties:
- `total_reward`
- `fault_names`
### `EpisodeResult`
Post-episode summary model (not currently exposed via a custom app endpoint):
- `task_id: int`
- `task_score: float` (`0..1`)
- `success: bool`
- `n_steps: int`
- `total_reward: float`
- `final_metrics: QualityMetrics`
- `fault_names: list[str]`
- `action_history: list[RAGDebugAction]`
## Runtime Scoring Rules (Environment)
From `server/rag_debug_env_environment.py`:
Task score:
- Tasks 1 and 2: `0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)`
- Task 3: `0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage`
Success checks:
- Task 1: `task_score >= 0.75`
- Task 2: `task_score >= 0.75`
- Task 3: `task_score >= 0.70` and `multi_hop_coverage > 0.60`
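
The scoring and success rules above can be sketched as two plain functions. The function names and signatures are hypothetical, and treating a missing `multi_hop_coverage` as `0.0` is an assumption; the weights and thresholds are copied from the lists above.

```python
from typing import Optional


def task_score(
    task_id: int,
    mean_coverage: float,
    mean_precision: float,
    n_steps: int = 0,
    max_steps: int = 1,
    multi_hop_coverage: Optional[float] = None,
) -> float:
    """Weighted task score; Tasks 1 and 2 reward efficiency, Task 3 multi-hop coverage."""
    if task_id in (1, 2):
        efficiency = 1 - n_steps / max_steps
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if task_id == 3:
        mh = multi_hop_coverage or 0.0  # assumption: None counts as 0.0
        return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * mh
    raise ValueError(f"unknown task_id {task_id}")


def is_success(
    task_id: int, score: float, multi_hop_coverage: Optional[float] = None
) -> bool:
    """Apply the per-task success thresholds."""
    if task_id in (1, 2):
        return score >= 0.75
    if task_id == 3:
        # Note the strict inequality on multi-hop coverage.
        return score >= 0.70 and (multi_hop_coverage or 0.0) > 0.60
    raise ValueError(f"unknown task_id {task_id}")


score_task1 = task_score(1, 0.9, 0.8, n_steps=4, max_steps=10)
score_task3 = task_score(3, 0.8, 0.6, multi_hop_coverage=0.6)
```

With these inputs, Task 1 scores about 0.83 and passes, while Task 3 scores about 0.71 but fails because `multi_hop_coverage` of exactly 0.60 does not satisfy the strict `> 0.60` check.
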