How it works
Three primitives. Nine reward signals. One goal: no hallucinations.
reset()
Sample a question + context document from one of 38 curated datasets, stratified by difficulty tier.
step(answer)
Submit your answer with confidence and a source quote. Receive a dense reward signal across all 9 components.
grade()
Aggregate episode rewards into a task score. Track accuracy, hallucination rate, and skill rating over time.
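The reset/step/grade loop above can be sketched in a few lines. This is an illustrative client-side sketch, not the environment's actual SDK: `StubEnv`, its sample question, and its one-component grader are stand-ins for the live HTTP API, while the field names (`question`, `context`, `difficulty`, `answer`, `confidence`, `source_quote`) mirror the shapes described above.

```python
# Illustrative sketch of the reset/step loop. StubEnv is a hypothetical
# stand-in for the live HTTP API; field names mirror the shapes above.

class StubEnv:
    def reset(self):
        # The real environment samples from 38 curated datasets by tier.
        return {
            "episode_id": "ep-0",
            "question": "What year was the dam completed?",
            "context": "The dam was completed in 1936.",
            "difficulty": "beginner",
        }

    def step(self, action):
        # The real grader returns all 9 reward components; this stub
        # checks only that the quoted source appears in the context.
        grounded = action["source_quote"] in "The dam was completed in 1936."
        return {"reward": 1.0 if grounded else 0.0, "done": True}

def run_episode(env):
    obs = env.reset()
    action = {
        "answer": "1936",
        "confidence": 0.9,
        "source_quote": "completed in 1936",
    }
    result = env.step(action)
    return result["reward"]

print(run_episode(StubEnv()))  # -> 1.0 (quote is grounded in the context)
```

Against the live environment, the same loop runs over HTTP: POST /reset, POST /step, then POST /grader with the collected rewards.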
9-Component Reward System
Every answer is graded on factual correctness, source grounding, citation accuracy, confidence calibration, semantic consistency, hallucination detection, ROUGE-L, BERTScore, and AlignScore. Each component is weighted and combined into a single scalar reward in [0, 1]. Confident wrong answers are penalized harder than uncertain ones.
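One way to combine the nine components is a weighted sum with a calibration penalty. The sketch below is illustrative only: the component names are paraphrased from the list above, and the weights and penalty rule are assumptions, not the environment's internal values.

```python
# Illustrative weighted combination of the 9 reward components.
# Weights are invented for the example; the environment's real weights differ.
WEIGHTS = {
    "correctness": 0.25, "grounding": 0.15, "citation": 0.10,
    "calibration": 0.10, "consistency": 0.10, "hallucination_detection": 0.10,
    "rouge_l": 0.05, "bertscore": 0.10, "alignscore": 0.05,
}

def combine(components, confidence):
    """Weighted sum clamped to [0, 1]; confident wrong answers lose more."""
    score = sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
    if components["correctness"] < 0.5:
        # Scale the penalty with stated confidence: being confidently
        # wrong costs more than being uncertainly wrong.
        score *= (1.0 - 0.5 * confidence)
    return max(0.0, min(1.0, score))

perfect = {k: 1.0 for k in WEIGHTS}
wrong = {**perfect, "correctness": 0.0}
print(combine(wrong, 0.9) < combine(wrong, 0.1))  # True: confidence hurts when wrong
```

The asymmetry in the penalty is what makes calibration matter: the same wrong answer scores lower the more confidently it is asserted.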
Curriculum Progression
Episodes advance from Beginner (single-hop factual QA with unambiguous ground-truth) through Intermediate (multi-hop synthesis across multiple context sentences) to Advanced (adversarial prompts where confident refusals score best). The environment tracks a live skill rating and adjusts difficulty sampling accordingly.
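The rating-driven difficulty sampling could work along these lines. Everything concrete here is an assumption for illustration: the Elo-style update rule, the per-tier anchor ratings, and the proximity-weighted sampler are not the environment's documented internals.

```python
import random

# Illustrative Elo-style skill tracker driving difficulty sampling.
# Tier anchor ratings and the update rule are assumptions, not the
# environment's documented internals.
TIERS = ["beginner", "intermediate", "advanced"]
TIER_RATING = {"beginner": 1000, "intermediate": 1200, "advanced": 1400}

def update_rating(rating, tier, reward, k=32):
    """Move the skill rating toward tiers the agent beats, Elo-style."""
    expected = 1 / (1 + 10 ** ((TIER_RATING[tier] - rating) / 400))
    return rating + k * (reward - expected)

def sample_tier(rating, rng=random):
    """Sample tiers with weight inversely proportional to rating distance."""
    weights = [1 / (1 + abs(TIER_RATING[t] - rating) / 100) for t in TIERS]
    return rng.choices(TIERS, weights=weights, k=1)[0]

rating = 1000.0
for _ in range(20):
    tier = sample_tier(rating)
    rating = update_rating(rating, tier, reward=1.0)  # pretend every episode succeeds
print(round(rating))  # rating climbs, so harder tiers get sampled more often
```

A streak of high rewards pushes the rating up, which shifts the sampling weights toward Intermediate and Advanced episodes; a slump pulls it back down.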
Task Tiers
Three progressively harder tasks drawn from 38 datasets with 1M+ examples.
Factual Grounding
Beginner (~450K examples). Answer straightforward factual questions from a short context passage. Single-hop retrieval with unambiguous ground truth. The grader rewards precise citation and heavily penalizes adding information not found in the context.
Multi-Hop Synthesis
Intermediate (~380K examples). Synthesize evidence from multiple context sentences to reach an answer. Requires connecting disparate facts without fabricating bridge claims. AlignScore and BERTScore are weighted more heavily at this tier.
Adversarial Resistance
Advanced (~210K examples). Resist adversarial prompts designed to elicit hallucinations. Many questions are deliberately unanswerable; a confident refusal scores better than a fabricated, plausible-sounding answer.
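On an unanswerable Advanced question, the two competing action shapes look like this. The field names (`answer`, `confidence`, `source_quote`) follow the step() action described above; the refusal convention shown here (an explicit "not enough information" answer with an empty source quote) and the sample answers are illustrative assumptions.

```python
import json

# Illustrative step() payloads for the Advanced tier. The refusal
# convention (empty source_quote, explicit "not enough information")
# is an assumption, not the documented action schema.
fabricated = {
    "answer": "The treaty was signed in 1887.",  # hypothetical fact, not in context
    "confidence": 0.95,
    "source_quote": "",  # no supporting quote exists
}
refusal = {
    "answer": "The context does not contain enough information to answer this.",
    "confidence": 0.9,
    "source_quote": "",
}
print(json.dumps(refusal, indent=2))
```

On a question with no ground-truth answer, the grader rewards the second payload over the first: a confident, explicit refusal beats a plausible fabrication.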
API Reference
RESTful JSON API. All endpoints accept and return application/json. No auth required.
| Method | Endpoint | Description |
|---|---|---|
| POST | /reset | Start episode — returns question, context, difficulty, episode_id |
| POST | /step | Submit answer with confidence + source_quote, receive reward breakdown |
| GET | /state | Current episode metadata — accuracy, hallucination_rate, skill_rating |
| GET | /tasks | List all 3 tasks with action schema |
| POST | /grader | Score a completed episode (0.0 – 1.0) from rewards + infos |
| POST | /baseline | Run heuristic baseline across all 3 tasks |
| GET | /metadata | Environment name, version, license |
| GET | /schema | Full JSON schemas for action, observation, state |
| GET | /health | Health check — returns {"status":"healthy"} |
| POST | /mcp | JSON-RPC 2.0 tool discovery for MCP clients |
| GET | /leaderboard | Ranked leaderboard by avg_reward |
| POST | /leaderboard/submit | Submit model results for ranking |
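A minimal session against the endpoints above might look like the following. This assumes a locally running instance; the base URL is a placeholder, and any request fields beyond the documented `answer`, `confidence`, and `source_quote` are not shown.

```shell
# Assumed local deployment; replace with your instance's base URL.
BASE=http://localhost:8000

# 1. Start an episode (returns question, context, difficulty, episode_id).
curl -s -X POST "$BASE/reset" -H 'Content-Type: application/json' -d '{}'

# 2. Submit an answer with confidence and a source quote.
curl -s -X POST "$BASE/step" -H 'Content-Type: application/json' \
  -d '{"answer": "1936", "confidence": 0.9, "source_quote": "completed in 1936"}'

# 3. Check accuracy, hallucination rate, and skill rating.
curl -s "$BASE/state"
```

GET /schema returns the full JSON schemas if you need the exact action and observation shapes.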
Quick Start
Three commands to run your first episode.
Interactive Playground
Reset an episode, read the context, craft your answer, and see the live reward breakdown.