Models Reference
Field-by-field reference for the Pydantic models in models.py.
Enums
EmbeddingModel
- GENERAL -> sentence-transformers/all-MiniLM-L6-v2
- MEDICAL -> NeuML/pubmedbert-base-embeddings
- LEGAL -> nlpaueb/legal-bert-base-uncased
- CODE -> sentence-transformers/multi-qa-mpnet-base-dot-v1
Domain
- SOFTWARE
- CLIMATE
- MEDICAL
ActionType
- ADJUST_CHUNK_SIZE (params.value: int)
- ADJUST_CHUNK_OVERLAP (params.value: int)
- ADJUST_THRESHOLD (params.value: float)
- ADJUST_TOP_K (params.value: int)
- SWAP_EMBEDDING_MODEL (params.model: str)
- TOGGLE_RERANKING (params.enabled: bool)
- ADJUST_CONTEXT_LIMIT (params.value: int)
- REWRITE_QUERY (params.query_id: int; optional strategy accepted by callers)
- SUBMIT (no params)
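As an illustration of how the per-action parameters above fit together, the following sketch builds a JSON action payload and checks that the documented parameter names are present. It is not code from models.py: `REQUIRED_PARAMS` and `build_action` are hypothetical names, and because this reference does not list the serialized enum values, the member names are used as placeholders for `action_type`.

```python
import json

# Required parameter names per action type, copied from the ActionType list.
# Hypothetical lookup table; not part of models.py.
REQUIRED_PARAMS = {
    "ADJUST_CHUNK_SIZE": ("value",),
    "ADJUST_CHUNK_OVERLAP": ("value",),
    "ADJUST_THRESHOLD": ("value",),
    "ADJUST_TOP_K": ("value",),
    "SWAP_EMBEDDING_MODEL": ("model",),
    "TOGGLE_RERANKING": ("enabled",),
    "ADJUST_CONTEXT_LIMIT": ("value",),
    "REWRITE_QUERY": ("query_id",),
    "SUBMIT": (),
}


def build_action(action_type: str, **params) -> str:
    """Return a JSON payload with action_type and params (assumed wire shape)."""
    if action_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action type: {action_type}")
    missing = [name for name in REQUIRED_PARAMS[action_type] if name not in params]
    if missing:
        raise ValueError(f"{action_type} is missing params: {missing}")
    # Extra keys (for example an optional strategy) are passed through unchanged.
    return json.dumps({"action_type": action_type, "params": params})
```

For example, `build_action("ADJUST_TOP_K", value=5)` produces a payload whose `params` is `{"value": 5}`, while `build_action("SWAP_EMBEDDING_MODEL")` raises because `model` is missing.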
FaultType
- CHUNK_TOO_LARGE
- CHUNK_TOO_SMALL
- THRESHOLD_TOO_LOW
- THRESHOLD_TOO_HIGH
- TOP_K_TOO_SMALL
- CONTEXT_OVERFLOW
- DUPLICATE_FLOODING
- WRONG_EMBEDDING_MODEL
- NO_RERANKING
Tier 1 OpenEnv Interface Models
RAGDebugAction(Action)
Fields:
- action_type: ActionType
- params: dict[str, Any] = {}
Notes:
- params accepts dict or JSON-stringified dict (validator coercion)
- used by server step routing in _apply_action
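The coercion mentioned in the notes can be pictured with a small stand-alone function. This is only a sketch: `coerce_params` is a hypothetical name, the real logic lives in a Pydantic validator in models.py, and how that validator treats malformed input is an assumption here.

```python
import json
from typing import Any


def coerce_params(value: Any) -> dict[str, Any]:
    """Return params as a dict, accepting a dict or a JSON-encoded dict.

    Hypothetical helper illustrating the idea; not taken from models.py.
    """
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON.
        parsed = json.loads(value)
        if not isinstance(parsed, dict):
            raise ValueError("params JSON must decode to an object")
        return parsed
    raise TypeError(f"params must be a dict or JSON string, got {type(value).__name__}")
```

With this shape, `{"value": 256}` and the string `'{"value": 256}'` yield the same dict, which is the behavior the note describes.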
RAGDebugObservation(Observation)
Fields:
- pipeline_config: PipelineConfig
- query_results: list[QueryResult]
- metrics: QualityMetrics
- corpus_stats: CorpusStats
- steps_taken: int
- max_steps: int
- task_id: int
- task_description: str
- done: bool
- last_action_error: str | None
- diagnostic_hints: list[str]
- reward_components: dict[str, float]
Design note:
- injected faults are intentionally omitted from observation and remain internal.
Tier 2 Internal Models
PipelineConfig
Defaults and bounds:
- chunk_size: int = 512 (64..2048)
- chunk_overlap: int = 50 (0..500)
- similarity_threshold: float = 0.3 (0.0..1.0)
- top_k: int = 10 (1..50)
- embedding_model: EmbeddingModel = GENERAL
- use_reranking: bool = False
- context_window_limit: int = 4096 (512..16384)
Validation:
- chunk_overlap < chunk_size (model validator)
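The defaults, bounds, and cross-field rule can be mirrored in a dependency-free sketch. The real PipelineConfig is a Pydantic model; the dataclass below (`PipelineConfigSketch`, a hypothetical name) only imitates its documented constraints, assumes the bounds are inclusive, and omits embedding_model for brevity.

```python
from dataclasses import dataclass

# (low, high) bounds copied from the documented defaults and bounds.
# Inclusive endpoints are an assumption.
BOUNDS = {
    "chunk_size": (64, 2048),
    "chunk_overlap": (0, 500),
    "similarity_threshold": (0.0, 1.0),
    "top_k": (1, 50),
    "context_window_limit": (512, 16384),
}


@dataclass
class PipelineConfigSketch:
    chunk_size: int = 512
    chunk_overlap: int = 50
    similarity_threshold: float = 0.3
    top_k: int = 10
    use_reranking: bool = False
    context_window_limit: int = 4096

    def __post_init__(self) -> None:
        # Per-field range checks.
        for name, (low, high) in BOUNDS.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} outside [{low}, {high}]")
        # Cross-field rule: overlap must be strictly smaller than chunk size.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
```

Note that a chunk_size of 64 with a chunk_overlap of 64 passes both range checks but fails the cross-field rule, which is why the rule has to be checked on the whole model rather than per field.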
QueryResult
Per-query retrieval result:
- query_id: int
- query_text: str
- retrieved_chunk_ids: list[int]
- retrieval_scores: list[float]
- n_retrieved: int
- coverage_score: float (0..1)
- precision_score: float (0..1)
- is_multi_hop: bool = False
Metric definitions:
- coverage = |R_agent ∩ R*| / |R*|
- precision = |R_agent ∩ R*| / |R_agent|
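The two ratios can be computed directly from set intersections. The sketch below assumes R_agent is the set of retrieved chunk IDs and R* is the set of ground-truth relevant chunk IDs; the function name and the choice to return 0.0 for an empty denominator are assumptions, not behavior confirmed by the source.

```python
def coverage_and_precision(retrieved: list[int], relevant: set[int]) -> tuple[float, float]:
    """Compute (coverage, precision) for one query.

    coverage  = |retrieved ∩ relevant| / |relevant|
    precision = |retrieved ∩ relevant| / |retrieved|
    """
    retrieved_set = set(retrieved)  # duplicates in the retrieval list count once
    hits = len(retrieved_set & relevant)
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    coverage = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return coverage, precision
```

For instance, retrieving chunks [1, 2, 3, 4] when the relevant set is {2, 4, 6} gives two hits, so coverage is 2/3 and precision is 2/4.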
QualityMetrics
Aggregate metrics:
- mean_coverage: float
- mean_precision: float
- mean_recall: float
- n_empty_retrievals: int
- n_context_overflows: int
- multi_hop_coverage: float | None
CorpusStats
Static corpus metadata:
- domain: Domain
- n_documents: int
- n_chunks: int
- avg_chunk_tokens: int
- has_near_duplicates: bool
- n_queries: int
- n_multi_hop_queries: int
Reward
- value: float
- components: dict[str, float]
Component names emitted by environment reward logic:
- progress_reward
- delta_bonus
- empty_retrieval_signal
- overflow_signal
- step_cost
- redundancy_penalty
- invalid_action_penalty (only when the last action had invalid params)
Terminal submit components:
- terminal_success (successful submit)
- terminal_failure (unsuccessful submit)
Terminal submit rewards are handled directly in action routing:
- success: 0.7 + 0.3 * task_score (clipped to [0.7, 1.0])
- failure: 0.2 * task_score (clipped to [0.0, 0.2])
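The two terminal formulas translate into a few lines of arithmetic. This is a sketch of the stated rules only; `terminal_reward` is a hypothetical name, and the actual routing code may structure the computation differently.

```python
def terminal_reward(task_score: float, success: bool) -> float:
    """Reward for a SUBMIT action, following the documented formulas."""
    if success:
        raw, low, high = 0.7 + 0.3 * task_score, 0.7, 1.0
    else:
        raw, low, high = 0.2 * task_score, 0.0, 0.2
    # Clip the raw value into [low, high].
    return min(max(raw, low), high)
```

Because the success range starts at 0.7 and the failure range ends at 0.2, any successful submit is rewarded more than any failed one, regardless of task_score.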
FaultConfig
Internal fault descriptor:
- fault_type: FaultType
- params: dict[str, Any] = {}
- description: str = ""
InternalState
Server-side state:
- injected_faults: list[FaultConfig]
- episode_seed: int
- action_history: list[RAGDebugAction]
- reward_history: list[float]
Properties:
- total_reward
- fault_names
EpisodeResult
Post-episode summary model (not currently exposed via custom app endpoint):
- task_id: int
- task_score: float (0..1)
- success: bool
- n_steps: int
- total_reward: float
- final_metrics: QualityMetrics
- fault_names: list[str]
- action_history: list[RAGDebugAction]
Runtime Scoring Rules (Environment)
From server/rag_debug_env_environment.py:
Task score:
- Task 1 and 2: 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)
- Task 3: 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage
Success checks:
- Task 1: task_score >= 0.75
- Task 2: task_score >= 0.75
- Task 3: task_score >= 0.70 and multi_hop_coverage > 0.60
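The scoring and success rules above can be combined into a short sketch. The function names are hypothetical, and treating a missing multi_hop_coverage (None) as 0.0 is an assumption; the environment file may handle that case differently.

```python
from typing import Optional


def task_score(task_id: int, mean_coverage: float, mean_precision: float,
               n_steps: int, max_steps: int,
               multi_hop_coverage: Optional[float] = None) -> float:
    """Weighted task score following the documented formulas."""
    if task_id in (1, 2):
        efficiency = 1 - n_steps / max_steps  # fewer steps used means a higher score
        return 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * efficiency
    if task_id == 3:
        multi_hop = multi_hop_coverage or 0.0  # assumption: None counts as 0.0
        return 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop
    raise ValueError(f"unknown task_id: {task_id}")


def is_success(task_id: int, score: float,
               multi_hop_coverage: Optional[float] = None) -> bool:
    """Apply the per-task success thresholds."""
    if task_id in (1, 2):
        return score >= 0.75
    if task_id == 3:
        # Note the strict inequality on multi_hop_coverage.
        return score >= 0.70 and (multi_hop_coverage or 0.0) > 0.60
    raise ValueError(f"unknown task_id: {task_id}")
```

As a worked case, Task 1 with mean_coverage 0.9, mean_precision 0.8, and 4 of 10 steps used scores 0.54 + 0.20 + 0.09 = 0.83, which clears the 0.75 threshold. For Task 3, a multi_hop_coverage of exactly 0.60 fails the check even when the score is high enough.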