	# Models Reference

	Field-by-field reference for the Pydantic models in `models.py`.

	## Enums

	### `EmbeddingModel`

	- `GENERAL` -> `sentence-transformers/all-MiniLM-L6-v2`
	- `MEDICAL` -> `NeuML/pubmedbert-base-embeddings`
	- `LEGAL` -> `nlpaueb/legal-bert-base-uncased`
	- `CODE` -> `sentence-transformers/multi-qa-mpnet-base-dot-v1`
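
The mapping above can be sketched as a string-valued enum. This is a minimal illustration using only the standard library, not the definition in `models.py`: the `str` mixin and the `resolve_model` helper are assumptions added for the example.

```python
from enum import Enum


class EmbeddingModel(str, Enum):
    """Sketch of the enum; member values are the model IDs listed above."""

    GENERAL = "sentence-transformers/all-MiniLM-L6-v2"
    MEDICAL = "NeuML/pubmedbert-base-embeddings"
    LEGAL = "nlpaueb/legal-bert-base-uncased"
    CODE = "sentence-transformers/multi-qa-mpnet-base-dot-v1"


def resolve_model(name_or_id: str) -> EmbeddingModel:
    """Hypothetical helper: accept a member name ("medical") or a full model ID."""
    try:
        # Lookup by member name, case-insensitively.
        return EmbeddingModel[name_or_id.upper()]
    except KeyError:
        # Fall back to lookup by value; raises ValueError for unknown IDs.
        return EmbeddingModel(name_or_id)
```

A helper like this would let a `SWAP_EMBEDDING_MODEL` action accept either spelling in `params.model`; whether the real environment does so is not stated here.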

	### `Domain`

	- `SOFTWARE`
	- `CLIMATE`
	- `MEDICAL`

	### `ActionType`

	- `ADJUST_CHUNK_SIZE` (`params.value: int`)
	- `ADJUST_CHUNK_OVERLAP` (`params.value: int`)
	- `ADJUST_THRESHOLD` (`params.value: float`)
	- `ADJUST_TOP_K` (`params.value: int`)
	- `SWAP_EMBEDDING_MODEL` (`params.model: str`)
	- `TOGGLE_RERANKING` (`params.enabled: bool`)
	- `ADJUST_CONTEXT_LIMIT` (`params.value: int`)
	- `REWRITE_QUERY` (`params.query_id: int`, optional `strategy` accepted by callers)
	- `SUBMIT` (no params)
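
As a rough illustration of the parameter shapes listed above, the following sketch checks a `params` dict against the expected key and type for each action. The table is transcribed from the list; the `check_params` function, its error strings, and the choice to accept an `int` where a `float` is expected are all assumptions, not the environment's actual validation.

```python
from typing import Any, Optional, Tuple

# Expected parameter name and type per action, transcribed from the list above.
# None means the action takes no parameters.
REQUIRED_PARAMS: dict = {
    "ADJUST_CHUNK_SIZE": ("value", int),
    "ADJUST_CHUNK_OVERLAP": ("value", int),
    "ADJUST_THRESHOLD": ("value", float),
    "ADJUST_TOP_K": ("value", int),
    "SWAP_EMBEDDING_MODEL": ("model", str),
    "TOGGLE_RERANKING": ("enabled", bool),
    "ADJUST_CONTEXT_LIMIT": ("value", int),
    "REWRITE_QUERY": ("query_id", int),
    "SUBMIT": None,
}


def check_params(action_type: str, params: dict) -> Optional[str]:
    """Return an error message, or None when params match the expected shape."""
    if action_type not in REQUIRED_PARAMS:
        return f"unknown action type: {action_type}"
    spec: Optional[Tuple[str, type]] = REQUIRED_PARAMS[action_type]
    if spec is None:
        return None
    key, expected = spec
    if key not in params:
        return f"missing param: {key}"
    value: Any = params[key]
    # bool is a subclass of int in Python, so reject it for numeric params.
    if isinstance(value, bool) and expected is not bool:
        return f"param {key} must be {expected.__name__}, got bool"
    # Assumption: an int is acceptable where a float is expected.
    if expected is float and isinstance(value, int):
        return None
    if not isinstance(value, expected):
        return f"param {key} must be {expected.__name__}"
    return None
```

Extra keys such as an optional `strategy` on `REWRITE_QUERY` are ignored by this sketch.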

	### `FaultType`

	- `CHUNK_TOO_LARGE`
	- `CHUNK_TOO_SMALL`
	- `THRESHOLD_TOO_LOW`
	- `THRESHOLD_TOO_HIGH`
	- `TOP_K_TOO_SMALL`
	- `CONTEXT_OVERFLOW`
	- `DUPLICATE_FLOODING`
	- `WRONG_EMBEDDING_MODEL`
	- `NO_RERANKING`

	## Tier 1 OpenEnv Interface Models

	### `RAGDebugAction(Action)`

	Fields:

	- `action_type: ActionType`
	- `params: dict[str, Any] = {}`

	Notes:

	- `params` accepts either a dict or a JSON-encoded string of a dict; a validator coerces the string form into a dict
	- consumed by the server's step routing in `_apply_action`
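
The coercion described in the notes could look roughly like the function below. It is a standard-library sketch under assumptions: the name `coerce_params` is hypothetical, and treating `None` or an empty string as an empty dict is a choice made for the example. In Pydantic v2, logic of this kind would typically live in a `@field_validator("params", mode="before")` classmethod; the actual validator in `models.py` may differ.

```python
import json
from typing import Any


def coerce_params(raw: Any) -> dict:
    """Accept a dict or a JSON-encoded dict and always return a dict."""
    if raw is None:
        return {}  # assumption: missing params mean "no params"
    if isinstance(raw, str):
        # json.loads raises json.JSONDecodeError (a ValueError) on bad input.
        raw = json.loads(raw) if raw.strip() else {}
    if not isinstance(raw, dict):
        raise ValueError(f"params must be a dict, got {type(raw).__name__}")
    return raw
```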

	### `RAGDebugObservation(Observation)`

	Fields:

	- `pipeline_config: PipelineConfig`
	- `query_results: list[QueryResult]`
	- `metrics: QualityMetrics`
	- `corpus_stats: CorpusStats`
	- `steps_taken: int`
	- `max_steps: int`
	- `task_id: int`
	- `task_description: str`
	- `done: bool`
	- `last_action_error: str | None`
	- `diagnostic_hints: list[str]`
	- `reward_components: dict[str, float]`

	Design note:

	- injected faults are intentionally omitted from the observation; they are kept only in the server-side `InternalState`, so the agent has to infer them from metrics and hints.

	## Tier 2 Internal Models

	### `PipelineConfig`

	Defaults and bounds:

	- `chunk_size: int = 512` (`64..2048`)
	- `chunk_overlap: int = 50` (`0..500`)
	- `similarity_threshold: float = 0.3` (`0.0..1.0`)
	- `top_k: int = 10` (`1..50`)
	- `embedding_model: EmbeddingModel = GENERAL`
	- `use_reranking: bool = False`
	- `context_window_limit: int = 4096` (`512..16384`)

	Validation:

	- `chunk_overlap < chunk_size` (model validator)
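
To make the defaults, bounds, and cross-field rule concrete, here is a plain-dataclass sketch. It is not the Pydantic model itself: the class name `PipelineConfigSketch` is hypothetical, `embedding_model` is simplified to a string, and the ranges are assumed to be inclusive at both ends.

```python
from dataclasses import dataclass

# (low, high) bounds transcribed from the list above; assumed inclusive.
BOUNDS = {
    "chunk_size": (64, 2048),
    "chunk_overlap": (0, 500),
    "similarity_threshold": (0.0, 1.0),
    "top_k": (1, 50),
    "context_window_limit": (512, 16384),
}


@dataclass
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50
    similarity_threshold: float = 0.3
    top_k: int = 10
    embedding_model: str = "GENERAL"  # simplified stand-in for EmbeddingModel
    use_reranking: bool = False
    context_window_limit: int = 4096

    def __post_init__(self) -> None:
        # Per-field range checks.
        for name, (low, high) in BOUNDS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside [{low}, {high}]")
        # Cross-field rule: overlap must be strictly smaller than chunk size.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
```

Note that the cross-field rule can reject values that pass their individual bounds, for example `chunk_size=64` with `chunk_overlap=64`.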

	### `QueryResult`

	Per-query retrieval result:

	- `query_id: int`
	- `query_text: str`
	- `retrieved_chunk_ids: list[int]`
	- `retrieval_scores: list[float]`
	- `n_retrieved: int`
	- `coverage_score: float` (`0..1`)
	- `precision_score: float` (`0..1`)
	- `is_multi_hop: bool = False`

	Metric definitions:

	- coverage = `|R_agent ∩ R| / |R|`
	- precision = `|R_agent ∩ R*| / |R_agent|`
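
The two definitions reduce to set arithmetic over chunk IDs. The sketch below treats the reference set in both formulas as a single set of relevant chunk IDs, which is an assumption, and returns `0.0` when a denominator would be zero, which is also a choice made for the example rather than documented behavior.

```python
def coverage(retrieved: set, relevant: set) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0  # assumption: no relevant chunks means zero coverage
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0  # assumption: an empty retrieval scores zero
    return len(retrieved & relevant) / len(retrieved)


# Example: 4 chunks retrieved, 3 relevant, 2 in common.
retrieved_ids = {1, 2, 3, 4}
relevant_ids = {2, 4, 6}
example_coverage = coverage(retrieved_ids, relevant_ids)    # 2 / 3
example_precision = precision(retrieved_ids, relevant_ids)  # 2 / 4
```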

	### `QualityMetrics`

	Aggregate metrics:

	- `mean_coverage: float`
	- `mean_precision: float`
	- `mean_recall: float`
	- `n_empty_retrievals: int`
	- `n_context_overflows: int`
	- `multi_hop_coverage: float | None`

	### `CorpusStats`

	Static corpus metadata:

	- `domain: Domain`
	- `n_documents: int`
	- `n_chunks: int`
	- `avg_chunk_tokens: int`
	- `has_near_duplicates: bool`
	- `n_queries: int`
	- `n_multi_hop_queries: int`

	### `Reward`

	- `value: float`
	- `components: dict[str, float]`

	Component names emitted by environment reward logic:

	- `progress_reward`
	- `delta_bonus`
	- `empty_retrieval_signal`
	- `overflow_signal`
	- `step_cost`
	- `redundancy_penalty`
	- `invalid_action_penalty` (only when the last action had invalid params)

	Terminal submit components:

	- `terminal_success` (successful submit)
	- `terminal_failure` (unsuccessful submit)

	Terminal submit rewards are handled directly in action routing:

	- success: `0.7 + 0.3 * task_score` (clipped to `[0.7, 1.0]`)
	- failure: `0.2 * task_score` (clipped to `[0.0, 0.2]`)
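
The two terminal formulas and their clipping ranges can be written as a single function. This is a sketch of the arithmetic only; the function name `terminal_reward` is hypothetical and the real routing code may be structured differently.

```python
def terminal_reward(task_score: float, success: bool) -> float:
    """Reward for a SUBMIT action, following the two formulas above."""
    if success:
        # 0.7 + 0.3 * task_score, clipped to [0.7, 1.0]
        return min(1.0, max(0.7, 0.7 + 0.3 * task_score))
    # 0.2 * task_score, clipped to [0.0, 0.2]
    return min(0.2, max(0.0, 0.2 * task_score))
```

With `task_score` in `[0, 1]` the clipping never triggers, so any successful submit earns at least `0.7` and any failed submit earns at most `0.2`; the clip only guards against out-of-range scores.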

	### `FaultConfig`

	Internal fault descriptor:

	- `fault_type: FaultType`
	- `params: dict[str, Any] = {}`
	- `description: str = ""`

	### `InternalState`

	Server-side state:

	- `injected_faults: list[FaultConfig]`
	- `episode_seed: int`
	- `action_history: list[RAGDebugAction]`
	- `reward_history: list[float]`

	Properties:

	- `total_reward`
	- `fault_names`

	### `EpisodeResult`

	Post-episode summary model (not currently exposed through a custom app endpoint):

	- `task_id: int`
	- `task_score: float` (`0..1`)
	- `success: bool`
	- `n_steps: int`
	- `total_reward: float`
	- `final_metrics: QualityMetrics`
	- `fault_names: list[str]`
	- `action_history: list[RAGDebugAction]`

	## Runtime Scoring Rules (Environment)

	From `server/rag_debug_env_environment.py`:

	Task score:

	- Tasks 1 and 2:
	  - `0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)`
	- Task 3:
	  - `0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage`
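
The task-score formulas can be sketched as one function keyed on the task ID. The function name and signature are hypothetical, and treating a missing `multi_hop_coverage` as `0.0` for Task 3 is an assumption; the environment file is the authority on both.

```python
from typing import Optional


def task_score(
    task_id: int,
    mean_coverage: float,
    mean_precision: float,
    n_steps: int,
    max_steps: int,
    multi_hop_coverage: Optional[float] = None,
) -> float:
    """Weighted task score following the formulas above."""
    if task_id in (1, 2):
        # Fewer steps used means a larger efficiency term.
        efficiency = 1 - n_steps / max_steps
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if task_id == 3:
        # Assumption: a missing multi-hop value contributes nothing.
        multi_hop = multi_hop_coverage if multi_hop_coverage is not None else 0.0
        return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop
    raise ValueError(f"unknown task_id: {task_id}")
```

For instance, Task 1 with coverage `0.8`, precision `0.6`, and 4 of 10 steps used gives `0.48 + 0.15 + 0.09 = 0.72`. Task 3 has no step-efficiency term, so its score does not depend on the step count.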

	Success checks:

	- Task 1: `task_score >= 0.75`
	- Task 2: `task_score >= 0.75`
	- Task 3: `task_score >= 0.70` and `multi_hop_coverage > 0.60`
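
The success thresholds translate directly into a predicate. This is a sketch with a hypothetical name, `is_success`; it assumes a missing `multi_hop_coverage` fails the Task 3 check.

```python
from typing import Optional


def is_success(
    task_id: int,
    task_score: float,
    multi_hop_coverage: Optional[float] = None,
) -> bool:
    """Apply the per-task success thresholds listed above."""
    if task_id in (1, 2):
        return task_score >= 0.75
    if task_id == 3:
        # Both conditions must hold; the multi-hop bound is strict (> 0.60).
        return (
            task_score >= 0.70
            and multi_hop_coverage is not None
            and multi_hop_coverage > 0.60
        )
    raise ValueError(f"unknown task_id: {task_id}")
```

The Task 3 comparison on `multi_hop_coverage` is strict, so a value of exactly `0.60` does not pass even when the task score does.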