Spaces:

vankap-grover
/

rag_debug_env

Sleeping

ADJUST_CHUNK_SIZE (params.value: int)
ADJUST_CHUNK_OVERLAP (params.value: int)
ADJUST_THRESHOLD (params.value: float)
ADJUST_TOP_K (params.value: int)
SWAP_EMBEDDING_MODEL (params.model: str)
TOGGLE_RERANKING (params.enabled: bool)
ADJUST_CONTEXT_LIMIT (params.value: int)
REWRITE_QUERY (params.query_id: int, optional strategy accepted by callers)
SUBMIT (no params)

`FaultType`

CHUNK_TOO_LARGE
CHUNK_TOO_SMALL
THRESHOLD_TOO_LOW
THRESHOLD_TOO_HIGH
TOP_K_TOO_SMALL
CONTEXT_OVERFLOW
DUPLICATE_FLOODING
WRONG_EMBEDDING_MODEL
NO_RERANKING

Tier 1 OpenEnv Interface Models

`RAGDebugAction(Action)`

Fields:

action_type: ActionType
params: dict[str, Any] = {}

Notes:

params accepts dict or JSON-stringified dict (validator coercion)
used by server step routing in _apply_action

`RAGDebugObservation(Observation)`

Fields:

pipeline_config: PipelineConfig
query_results: list[QueryResult]
metrics: QualityMetrics
corpus_stats: CorpusStats
steps_taken: int
max_steps: int
task_id: int
task_description: str
done: bool
last_action_error: str | None
diagnostic_hints: list[str]
reward_components: dict[str, float]

Design note:

injected faults are intentionally omitted from observation and remain internal.

Tier 2 Internal Models

`PipelineConfig`

Defaults and bounds:

chunk_size: int = 512 (64..2048)
chunk_overlap: int = 50 (0..500)
similarity_threshold: float = 0.3 (0.0..1.0)
top_k: int = 10 (1..50)
embedding_model: EmbeddingModel = GENERAL
use_reranking: bool = False
context_window_limit: int = 4096 (512..16384)

Validation:

chunk_overlap < chunk_size (model validator)

`QueryResult`

Per-query retrieval result:

query_id: int
query_text: str
retrieved_chunk_ids: list[int]
retrieval_scores: list[float]
n_retrieved: int
coverage_score: float (0..1)
precision_score: float (0..1)
is_multi_hop: bool = False

Metric definitions:

coverage = |R_agent ∩ R*| / |R*|
precision = |R_agent ∩ R*| / |R_agent|

`QualityMetrics`

Aggregate metrics:

mean_coverage: float
mean_precision: float
mean_recall: float
n_empty_retrievals: int
n_context_overflows: int
multi_hop_coverage: float | None

`CorpusStats`

Static corpus metadata:

domain: Domain
n_documents: int
n_chunks: int
avg_chunk_tokens: int
has_near_duplicates: bool
n_queries: int
n_multi_hop_queries: int

`Reward`

value: float
components: dict[str, float]

Component names emitted by environment reward logic:

progress_reward
delta_bonus
empty_retrieval_signal
overflow_signal
step_cost
redundancy_penalty
invalid_action_penalty (only when the last action had invalid params)

Terminal submit components:

terminal_success (successful submit)
terminal_failure (unsuccessful submit)

Terminal submit rewards are handled directly in action routing:

success: 0.7 + 0.3 * task_score (clipped to [0.7, 1.0])
failure: 0.2 * task_score (clipped to [0.0, 0.2])

`FaultConfig`

Internal fault descriptor:

fault_type: FaultType
params: dict[str, Any] = {}
description: str = ""

`InternalState`

Server-side state:

injected_faults: list[FaultConfig]
episode_seed: int
action_history: list[RAGDebugAction]
reward_history: list[float]

Properties:

total_reward
fault_names

`EpisodeResult`

Post-episode summary model (not currently exposed via custom app endpoint):

task_id: int
task_score: float (0..1)
success: bool
n_steps: int
total_reward: float
final_metrics: QualityMetrics
fault_names: list[str]
action_history: list[RAGDebugAction]

Runtime Scoring Rules (Environment)

From server/rag_debug_env_environment.py:

Task score:

Task 1 and 2:
- 0.60 * mean_coverage + 0.25 * mean_precision + 0.15 * (1 - n_steps/max_steps)
Task 3:
- 0.55 * mean_coverage + 0.25 * mean_precision + 0.20 * multi_hop_coverage

Success checks:

Task 1: task_score >= 0.75
Task 2: task_score >= 0.75
Task 3: task_score >= 0.70 and multi_hop_coverage > 0.60