# Models Reference

Field-by-field reference for the Pydantic models in `models.py`.

## Enums

### `EmbeddingModel`

- `GENERAL` -> `sentence-transformers/all-MiniLM-L6-v2`
- `MEDICAL` -> `NeuML/pubmedbert-base-embeddings`
- `LEGAL` -> `nlpaueb/legal-bert-base-uncased`
- `CODE` -> `sentence-transformers/multi-qa-mpnet-base-dot-v1`

### `Domain`

- `SOFTWARE`
- `CLIMATE`
- `MEDICAL`

### `ActionType`

- `ADJUST_CHUNK_SIZE` (`params.value: int`)
- `ADJUST_CHUNK_OVERLAP` (`params.value: int`)
- `ADJUST_THRESHOLD` (`params.value: float`)
- `ADJUST_TOP_K` (`params.value: int`)
- `SWAP_EMBEDDING_MODEL` (`params.model: str`)
- `TOGGLE_RERANKING` (`params.enabled: bool`)
- `ADJUST_CONTEXT_LIMIT` (`params.value: int`)
- `REWRITE_QUERY` (`params.query_id: int`; callers may also pass an optional `strategy`)
- `SUBMIT` (no params)
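
The parameter table above can be sketched as a lookup plus a small checker. This is a minimal illustration, not the validation in `models.py`: the string keys stand in for `ActionType` members (their serialized values are not shown in this reference), and `REQUIRED_PARAMS`, `_matches`, and `param_problems` are hypothetical names.

```python
from typing import Any

# Parameter keys and types as listed above. String keys stand in for
# ActionType members; this mapping is an illustrative assumption.
REQUIRED_PARAMS: dict[str, dict[str, type]] = {
    "ADJUST_CHUNK_SIZE": {"value": int},
    "ADJUST_CHUNK_OVERLAP": {"value": int},
    "ADJUST_THRESHOLD": {"value": float},
    "ADJUST_TOP_K": {"value": int},
    "SWAP_EMBEDDING_MODEL": {"model": str},
    "TOGGLE_RERANKING": {"enabled": bool},
    "ADJUST_CONTEXT_LIMIT": {"value": int},
    "REWRITE_QUERY": {"query_id": int},
    "SUBMIT": {},
}


def _matches(value: Any, expected: type) -> bool:
    """Type check that does not let bool pass as a number."""
    if expected is bool:
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is a subclass of int, so reject it explicitly
    if expected is float:
        return isinstance(value, (int, float))  # assumption: ints are fine for floats
    return isinstance(value, expected)


def param_problems(action_type: str, params: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the params look valid."""
    problems = []
    for key, expected in REQUIRED_PARAMS[action_type].items():
        if key not in params:
            problems.append(f"missing '{key}'")
        elif not _matches(params[key], expected):
            problems.append(f"'{key}' should be {expected.__name__}")
    return problems
```

For example, `param_problems("TOGGLE_RERANKING", {})` reports the missing `enabled` key, while `param_problems("SUBMIT", {})` returns an empty list.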

### `FaultType`

- `CHUNK_TOO_LARGE`
- `CHUNK_TOO_SMALL`
- `THRESHOLD_TOO_LOW`
- `THRESHOLD_TOO_HIGH`
- `TOP_K_TOO_SMALL`
- `CONTEXT_OVERFLOW`
- `DUPLICATE_FLOODING`
- `WRONG_EMBEDDING_MODEL`
- `NO_RERANKING`

## Tier 1 OpenEnv Interface Models

### `RAGDebugAction(Action)`

Fields:

- `action_type: ActionType`
- `params: dict[str, Any] = {}`

Notes:

- `params` accepts a dict or a JSON-encoded string containing a dict; a validator coerces the string form into a dict
- the model is consumed by the server's step routing in `_apply_action`
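
The coercion described in the notes might look roughly like the following. It is a standalone sketch using only the standard library, not the actual Pydantic validator; `coerce_params` is a hypothetical name, and rejecting non-object JSON is an assumption.

```python
import json
from typing import Any


def coerce_params(raw: Any) -> dict[str, Any]:
    """Accept a dict, or a JSON string that decodes to a dict."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        parsed = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
        if not isinstance(parsed, dict):
            raise ValueError("params JSON must decode to an object")
        return parsed
    raise TypeError(f"params must be a dict or JSON string, got {type(raw).__name__}")
```

In a Pydantic model, logic like this would typically sit in a "before"-mode field validator so that it runs ahead of type validation.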

### `RAGDebugObservation(Observation)`

Fields:

- `pipeline_config: PipelineConfig`
- `query_results: list[QueryResult]`
- `metrics: QualityMetrics`
- `corpus_stats: CorpusStats`
- `steps_taken: int`
- `max_steps: int`
- `task_id: int`
- `task_description: str`
- `done: bool`
- `last_action_error: str | None`
- `diagnostic_hints: list[str]`
- `reward_components: dict[str, float]`

Design note:

- injected faults are intentionally omitted from the observation and remain internal to the server.

## Tier 2 Internal Models

### `PipelineConfig`

Defaults and bounds:

- `chunk_size: int = 512` (`64..2048`)
- `chunk_overlap: int = 50` (`0..500`)
- `similarity_threshold: float = 0.3` (`0.0..1.0`)
- `top_k: int = 10` (`1..50`)
- `embedding_model: EmbeddingModel = GENERAL`
- `use_reranking: bool = False`
- `context_window_limit: int = 4096` (`512..16384`)

Validation:

- `chunk_overlap` must be strictly less than `chunk_size` (enforced by a model validator)
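
The defaults, bounds, and cross-field rule above can be mirrored in a plain dataclass. This is a sketch only: the real model is a Pydantic class, `PipelineConfigSketch` and `BOUNDS` are hypothetical names, the bounds are assumed to be inclusive, and `embedding_model` is simplified to a string.

```python
from dataclasses import dataclass

# Bounds copied from the list above; inclusive endpoints are an assumption.
BOUNDS = {
    "chunk_size": (64, 2048),
    "chunk_overlap": (0, 500),
    "similarity_threshold": (0.0, 1.0),
    "top_k": (1, 50),
    "context_window_limit": (512, 16384),
}


@dataclass
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50
    similarity_threshold: float = 0.3
    top_k: int = 10
    embedding_model: str = "GENERAL"  # stands in for EmbeddingModel.GENERAL
    use_reranking: bool = False
    context_window_limit: int = 4096

    def __post_init__(self) -> None:
        # Per-field range checks.
        for name, (low, high) in BOUNDS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside {low}..{high}")
        # Cross-field rule: overlap must be strictly smaller than the chunk.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
```

Note that the cross-field rule can fail even when both fields are individually in range, for example `chunk_size=64` with `chunk_overlap=64`.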

### `QueryResult`

Per-query retrieval result:

- `query_id: int`
- `query_text: str`
- `retrieved_chunk_ids: list[int]`
- `retrieval_scores: list[float]`
- `n_retrieved: int`
- `coverage_score: float` (`0..1`)
- `precision_score: float` (`0..1`)
- `is_multi_hop: bool = False`

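Metric definitions (reading `R_agent` as the set of chunks retrieved for the query and `R*` as the reference set of relevant chunks):

- coverage = `|R_agent ∩ R*| / |R*|`
- precision = `|R_agent ∩ R*| / |R_agent|`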
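
As a worked illustration of the two formulas, the sketch below computes both scores from chunk-id collections. `coverage_and_precision` is a hypothetical helper, and returning `0.0` when a denominator is empty is an assumption, since the reference does not say how those cases are handled.

```python
def coverage_and_precision(
    retrieved: list[int], relevant: set[int]
) -> tuple[float, float]:
    """Return (coverage, precision) for one query."""
    retrieved_set = set(retrieved)          # R_agent, deduplicated
    hits = len(retrieved_set & relevant)    # |R_agent ∩ R*|
    # Assumption: an empty denominator yields 0.0 rather than an error.
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return coverage, precision
```

With `retrieved=[1, 2, 3, 4]` and `relevant={2, 4, 6}`, two of the three relevant chunks are found (coverage 2/3) and two of the four retrieved chunks are relevant (precision 0.5).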

### `QualityMetrics`

Aggregate metrics:

- `mean_coverage: float`
- `mean_precision: float`
- `mean_recall: float`
- `n_empty_retrievals: int`
- `n_context_overflows: int`
- `multi_hop_coverage: float | None`

### `CorpusStats`

Static corpus metadata:

- `domain: Domain`
- `n_documents: int`
- `n_chunks: int`
- `avg_chunk_tokens: int`
- `has_near_duplicates: bool`
- `n_queries: int`
- `n_multi_hop_queries: int`

### `Reward`

- `value: float`
- `components: dict[str, float]`

Component names emitted by environment reward logic:

- `progress_reward`
- `delta_bonus`
- `empty_retrieval_signal`
- `overflow_signal`
- `step_cost`
- `redundancy_penalty`
- `invalid_action_penalty` (only when the last action had invalid params)

Terminal submit components:

- `terminal_success` (successful submit)
- `terminal_failure` (unsuccessful submit)

Terminal submit rewards are handled directly in action routing:

- success: `0.7 + 0.3 * task_score` (clipped to `[0.7, 1.0]`)
- failure: `0.2 * task_score` (clipped to `[0.0, 0.2]`)
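
The two terminal formulas can be written as one small function. This is a sketch of the arithmetic only; `terminal_submit_reward` is a hypothetical name and the environment's routing code may structure it differently.

```python
def terminal_submit_reward(task_score: float, success: bool) -> float:
    """Terminal reward for SUBMIT, clipped to the documented ranges."""
    if success:
        # success: 0.7 + 0.3 * task_score, clipped to [0.7, 1.0]
        return min(max(0.7 + 0.3 * task_score, 0.7), 1.0)
    # failure: 0.2 * task_score, clipped to [0.0, 0.2]
    return min(max(0.2 * task_score, 0.0), 0.2)
```

Because the ranges do not overlap, any successful submit earns more than any failed one, regardless of `task_score`.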

### `FaultConfig`

Internal fault descriptor:

- `fault_type: FaultType`
- `params: dict[str, Any] = {}`
- `description: str = ""`

### `InternalState`

Server-side state:

- `injected_faults: list[FaultConfig]`
- `episode_seed: int`
- `action_history: list[RAGDebugAction]`
- `reward_history: list[float]`

Properties:

- `total_reward`
- `fault_names`
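
The reference lists the two properties without defining them. One plausible reading, shown here purely as an assumption, is that `total_reward` sums `reward_history` and `fault_names` lists the type of each injected fault. `FaultSketch` and `InternalStateSketch` are hypothetical stand-ins that omit the other fields.

```python
from dataclasses import dataclass, field


@dataclass
class FaultSketch:
    fault_type: str  # stands in for a FaultType member


@dataclass
class InternalStateSketch:
    injected_faults: list[FaultSketch] = field(default_factory=list)
    reward_history: list[float] = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        # Assumed definition: sum of all per-step rewards so far.
        return sum(self.reward_history, 0.0)

    @property
    def fault_names(self) -> list[str]:
        # Assumed definition: one name per injected fault, in order.
        return [fault.fault_type for fault in self.injected_faults]
```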

### `EpisodeResult`

Post-episode summary model (not currently exposed through a custom app endpoint):

- `task_id: int`
- `task_score: float` (`0..1`)
- `success: bool`
- `n_steps: int`
- `total_reward: float`
- `final_metrics: QualityMetrics`
- `fault_names: list[str]`
- `action_history: list[RAGDebugAction]`

## Runtime Scoring Rules (Environment)

From `server/rag_debug_env_environment.py`:

Task score:

- Tasks 1 and 2:
  - `0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)`
- Task 3:
  - `0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage`
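
The two task-score formulas above can be sketched as a single function. `task_score` here is a hypothetical helper, not the code in `server/rag_debug_env_environment.py`; treating a missing `multi_hop_coverage` as `0.0` for Task 3 is an assumption.

```python
from typing import Optional


def task_score(
    task_id: int,
    mean_coverage: float,
    mean_precision: float,
    n_steps: int,
    max_steps: int,
    multi_hop_coverage: Optional[float] = None,
) -> float:
    """Weighted task score; weights sum to 1.0 in both branches."""
    if task_id in (1, 2):
        efficiency = 1 - n_steps / max_steps  # fewer steps scores higher
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if task_id == 3:
        # Assumption: None (no multi-hop data) contributes 0.0.
        hop = multi_hop_coverage if multi_hop_coverage is not None else 0.0
        return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * hop
    raise ValueError(f"unknown task_id: {task_id}")
```

For example, Task 1 with coverage 0.8, precision 0.6, and 5 of 10 steps used gives 0.48 + 0.15 + 0.075 = 0.705. Task 3 drops the step-efficiency term in favor of multi-hop coverage.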

Success checks:

- Task 1: `task_score >= 0.75`
- Task 2: `task_score >= 0.75`
- Task 3: `task_score >= 0.70` and `multi_hop_coverage > 0.60`
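
The success thresholds can be expressed as a small predicate. `is_success` is a hypothetical name, and treating a missing `multi_hop_coverage` as failing the Task 3 check is an assumption.

```python
from typing import Optional


def is_success(
    task_id: int, task_score: float, multi_hop_coverage: Optional[float] = None
) -> bool:
    """Apply the per-task success thresholds listed above."""
    if task_id in (1, 2):
        return task_score >= 0.75
    if task_id == 3:
        # Note the strict inequality on multi-hop coverage.
        hop = multi_hop_coverage if multi_hop_coverage is not None else 0.0
        return task_score >= 0.70 and hop > 0.60
    raise ValueError(f"unknown task_id: {task_id}")
```

The Task 3 check is a conjunction, so a high overall score does not compensate for multi-hop coverage at or below 0.60.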