| # Models Reference | |
| Field-by-field reference for the Pydantic models in `models.py`. | |
| ## Enums | |
| ### `EmbeddingModel` | |
| - `GENERAL` -> `sentence-transformers/all-MiniLM-L6-v2` | |
| - `MEDICAL` -> `NeuML/pubmedbert-base-embeddings` | |
| - `LEGAL` -> `nlpaueb/legal-bert-base-uncased` | |
| - `CODE` -> `sentence-transformers/multi-qa-mpnet-base-dot-v1` | |
| ### `Domain` | |
| - `SOFTWARE` | |
| - `CLIMATE` | |
| - `MEDICAL` | |
| ### `ActionType` | |
| - `ADJUST_CHUNK_SIZE` (`params.value: int`) | |
| - `ADJUST_CHUNK_OVERLAP` (`params.value: int`) | |
| - `ADJUST_THRESHOLD` (`params.value: float`) | |
| - `ADJUST_TOP_K` (`params.value: int`) | |
| - `SWAP_EMBEDDING_MODEL` (`params.model: str`) | |
| - `TOGGLE_RERANKING` (`params.enabled: bool`) | |
| - `ADJUST_CONTEXT_LIMIT` (`params.value: int`) | |
| - `REWRITE_QUERY` (`params.query_id: int`, optional `strategy` accepted by callers) | |
| - `SUBMIT` (no params) | |
| ### `FaultType` | |
| - `CHUNK_TOO_LARGE` | |
| - `CHUNK_TOO_SMALL` | |
| - `THRESHOLD_TOO_LOW` | |
| - `THRESHOLD_TOO_HIGH` | |
| - `TOP_K_TOO_SMALL` | |
| - `CONTEXT_OVERFLOW` | |
| - `DUPLICATE_FLOODING` | |
| - `WRONG_EMBEDDING_MODEL` | |
| - `NO_RERANKING` | |
| ## Tier 1 OpenEnv Interface Models | |
| ### `RAGDebugAction(Action)` | |
| Fields: | |
| - `action_type: ActionType` | |
| - `params: dict[str, Any] = {}` | |
| Notes: | |
| - `params` accepts either a dict or a JSON-encoded string of a dict; a validator coerces the string form into a dict | |
| - consumed by the server's step routing in `_apply_action` | |
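The coercion described in the notes above can be approximated with the standard library alone. This is a minimal sketch, not the validator from `models.py`: `coerce_params` is a hypothetical helper name, and rejecting non-dict JSON with `ValueError` is an assumption about the intended behaviour.

```python
import json
from typing import Any


def coerce_params(raw: Any) -> dict[str, Any]:
    """Accept a dict or a JSON string that encodes a dict; return a dict.

    Hypothetical stand-in for the Pydantic validator on `params`.
    """
    if isinstance(raw, str):
        # json.JSONDecodeError is a subclass of ValueError, so malformed
        # input surfaces as a ValueError as well.
        raw = json.loads(raw)
    if not isinstance(raw, dict):
        raise ValueError(f"params must be a dict, got {type(raw).__name__}")
    return raw


# Example payloads for two of the action types listed under `ActionType`.
adjust_top_k = {
    "action_type": "ADJUST_TOP_K",
    "params": coerce_params('{"value": 20}'),  # JSON-stringified form
}
submit = {"action_type": "SUBMIT", "params": coerce_params({})}  # no params
```

Both forms end up as plain dicts, so the routing code only ever has to handle one shape.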
| ### `RAGDebugObservation(Observation)` | |
| Fields: | |
| - `pipeline_config: PipelineConfig` | |
| - `query_results: list[QueryResult]` | |
| - `metrics: QualityMetrics` | |
| - `corpus_stats: CorpusStats` | |
| - `steps_taken: int` | |
| - `max_steps: int` | |
| - `task_id: int` | |
| - `task_description: str` | |
| - `done: bool` | |
| - `last_action_error: str | None` | |
| - `diagnostic_hints: list[str]` | |
| - `reward_components: dict[str, float]` | |
| Design note: | |
| - injected faults are intentionally omitted from the observation; they stay in the server-side `InternalState`. | |
| ## Tier 2 Internal Models | |
| ### `PipelineConfig` | |
| Defaults and bounds: | |
| - `chunk_size: int = 512` (`64..2048`) | |
| - `chunk_overlap: int = 50` (`0..500`) | |
| - `similarity_threshold: float = 0.3` (`0.0..1.0`) | |
| - `top_k: int = 10` (`1..50`) | |
| - `embedding_model: EmbeddingModel = GENERAL` | |
| - `use_reranking: bool = False` | |
| - `context_window_limit: int = 4096` (`512..16384`) | |
| Validation: | |
| - `chunk_overlap < chunk_size` (model validator) | |
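To make the defaults, bounds, and cross-field rule above concrete, here is a dependency-free sketch using a dataclass. It is not the Pydantic model itself: `PipelineConfigSketch` and `BOUNDS` are hypothetical names, `embedding_model` is simplified to a string, and treating both ends of each range as inclusive is an assumption.

```python
from dataclasses import dataclass

# (min, max) bounds copied from the list above; inclusive ends are assumed.
BOUNDS = {
    "chunk_size": (64, 2048),
    "chunk_overlap": (0, 500),
    "similarity_threshold": (0.0, 1.0),
    "top_k": (1, 50),
    "context_window_limit": (512, 16384),
}


@dataclass
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50
    similarity_threshold: float = 0.3
    top_k: int = 10
    embedding_model: str = "GENERAL"  # simplified; the real field is an enum
    use_reranking: bool = False
    context_window_limit: int = 4096

    def __post_init__(self) -> None:
        # Per-field range checks.
        for name, (low, high) in BOUNDS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside {low}..{high}")
        # Cross-field rule: the overlap must be strictly smaller than the chunk.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
```

Note that a configuration can satisfy every per-field bound and still be rejected: `chunk_size=64` with `chunk_overlap=64` passes both ranges but fails the cross-field check.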
| ### `QueryResult` | |
| Per-query retrieval result: | |
| - `query_id: int` | |
| - `query_text: str` | |
| - `retrieved_chunk_ids: list[int]` | |
| - `retrieval_scores: list[float]` | |
| - `n_retrieved: int` | |
| - `coverage_score: float` (`0..1`) | |
| - `precision_score: float` (`0..1`) | |
| - `is_multi_hop: bool = False` | |
| Metric definitions: | |
| - coverage = `|R_agent ∩ R*| / |R*|` | |
| - precision = `|R_agent ∩ R*| / |R_agent|` | |
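The two set formulas above translate directly into code. In this sketch `retrieved` plays the role of `R_agent` and `relevant` the role of `R*`; the function names are illustrative, and two choices are assumptions rather than documented behaviour: duplicate chunk IDs are collapsed before counting, and an empty denominator yields `0.0`.

```python
def coverage(retrieved: list[int], relevant: set[int]) -> float:
    """|R_agent ∩ R*| / |R*|: share of the relevant chunks that were retrieved."""
    if not relevant:
        return 0.0  # assumption: no ground-truth chunks means zero coverage
    return len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved: list[int], relevant: set[int]) -> float:
    """|R_agent ∩ R*| / |R_agent|: share of the retrieved chunks that are relevant."""
    unique = set(retrieved)
    if not unique:
        return 0.0  # assumption: an empty retrieval scores zero precision
    return len(unique & relevant) / len(unique)


retrieved_ids = [3, 7, 9, 12]
relevant_ids = {3, 9, 21}
print(coverage(retrieved_ids, relevant_ids))   # 2 of 3 relevant chunks found
print(precision(retrieved_ids, relevant_ids))  # 2 of 4 retrieved chunks relevant -> 0.5
```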
| ### `QualityMetrics` | |
| Aggregate metrics: | |
| - `mean_coverage: float` | |
| - `mean_precision: float` | |
| - `mean_recall: float` | |
| - `n_empty_retrievals: int` | |
| - `n_context_overflows: int` | |
| - `multi_hop_coverage: float | None` | |
| ### `CorpusStats` | |
| Static corpus metadata: | |
| - `domain: Domain` | |
| - `n_documents: int` | |
| - `n_chunks: int` | |
| - `avg_chunk_tokens: int` | |
| - `has_near_duplicates: bool` | |
| - `n_queries: int` | |
| - `n_multi_hop_queries: int` | |
| ### `Reward` | |
| - `value: float` | |
| - `components: dict[str, float]` | |
| Component names emitted by environment reward logic: | |
| - `progress_reward` | |
| - `delta_bonus` | |
| - `empty_retrieval_signal` | |
| - `overflow_signal` | |
| - `step_cost` | |
| - `redundancy_penalty` | |
| - `invalid_action_penalty` (only when the last action had invalid params) | |
| Terminal submit components: | |
| - `terminal_success` (successful submit) | |
| - `terminal_failure` (unsuccessful submit) | |
| Terminal submit rewards are handled directly in action routing: | |
| - success: `0.7 + 0.3 * task_score` (clipped to `[0.7, 1.0]`) | |
| - failure: `0.2 * task_score` (clipped to `[0.0, 0.2]`) | |
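As a worked illustration of the two terminal formulas, the sketch below applies each one together with its clipping range. `terminal_reward` is a hypothetical function name; the constants are the ones listed above.

```python
def terminal_reward(task_score: float, success: bool) -> float:
    """Reward for a SUBMIT action, clipped to the documented range."""
    if success:
        # 0.7 + 0.3 * task_score, clipped to [0.7, 1.0]
        return min(max(0.7 + 0.3 * task_score, 0.7), 1.0)
    # 0.2 * task_score, clipped to [0.0, 0.2]
    return min(max(0.2 * task_score, 0.0), 0.2)
```

With these ranges a failed submit can never earn more than `0.2`, while a successful one never earns less than `0.7`, so the two outcomes stay clearly separated regardless of `task_score`.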
| ### `FaultConfig` | |
| Internal fault descriptor: | |
| - `fault_type: FaultType` | |
| - `params: dict[str, Any] = {}` | |
| - `description: str = ""` | |
| ### `InternalState` | |
| Server-side state: | |
| - `injected_faults: list[FaultConfig]` | |
| - `episode_seed: int` | |
| - `action_history: list[RAGDebugAction]` | |
| - `reward_history: list[float]` | |
| Properties: | |
| - `total_reward` | |
| - `fault_names` | |
| ### `EpisodeResult` | |
| Post-episode summary model (not currently exposed through a custom app endpoint): | |
| - `task_id: int` | |
| - `task_score: float` (`0..1`) | |
| - `success: bool` | |
| - `n_steps: int` | |
| - `total_reward: float` | |
| - `final_metrics: QualityMetrics` | |
| - `fault_names: list[str]` | |
| - `action_history: list[RAGDebugAction]` | |
| ## Runtime Scoring Rules (Environment) | |
| From `server/rag_debug_env_environment.py`: | |
| Task score: | |
| - Tasks 1 and 2: `0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)` | |
| - Task 3: `0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage` | |
| Success checks: | |
| - Task 1: `task_score >= 0.75` | |
| - Task 2: `task_score >= 0.75` | |
| - Task 3: `task_score >= 0.70` and `multi_hop_coverage > 0.60` | |
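The scoring and success rules above can be sketched as two small functions. This is an illustration under stated assumptions, not the code in `server/rag_debug_env_environment.py`: the function names are hypothetical, and treating a missing `multi_hop_coverage` (`None`) as `0.0` is an assumption.

```python
from typing import Optional


def task_score(
    task_id: int,
    mean_coverage: float,
    mean_precision: float,
    n_steps: int,
    max_steps: int,
    multi_hop_coverage: Optional[float] = None,
) -> float:
    if task_id in (1, 2):
        efficiency = 1 - n_steps / max_steps  # fewer steps score higher
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if task_id == 3:
        multi_hop = multi_hop_coverage or 0.0  # assumption: None counts as 0.0
        return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop
    raise ValueError(f"unknown task_id: {task_id}")


def is_success(
    task_id: int, score: float, multi_hop_coverage: Optional[float] = None
) -> bool:
    if task_id in (1, 2):
        return score >= 0.75
    if task_id == 3:
        # Both conditions must hold; the multi-hop check is a strict inequality.
        return score >= 0.70 and (multi_hop_coverage or 0.0) > 0.60
    raise ValueError(f"unknown task_id: {task_id}")
```

For example, a Task 3 episode with `mean_coverage=0.8`, `mean_precision=0.7`, and `multi_hop_coverage=0.60` scores about `0.735`, which clears the `0.70` threshold, yet it still fails because the multi-hop condition requires a value strictly greater than `0.60`.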