---
license: mit
language:
- en
- hi
- sa
pipeline_tag: reinforcement-learning
---

# IndicScriptureQA: OpenEnv Environment

**Semantic structure and factual grounding evaluation for low-resource Indic languages.**

Most LLM benchmarks for Hindi, Sanskrit, and other Indic languages test surface-level factual recall: did the model get the right answer? This environment goes further. It evaluates whether an agent can produce answers that are not only **factually correct** but also **semantically well-formed**: logically ordered, terminologically precise, structurally complete, and coherently written. These are the qualities that separate a genuinely useful answer from one that merely contains the right words in the wrong shape.

The domain is Indic scriptural knowledge (Vedas, Upanishads, Ramayana, Mahabharata, Bhagavad Gita, Puranas), chosen because it stresses every axis at once. Factual precision matters (misattributing a verse to the wrong text is a hallucination), but so does structural literacy: knowing that an explanation of Rta should distinguish its natural-law and moral-law dimensions, that the Samudra Manthan narrative has a specific dramatic arc, or that "nishkama karma" is the correct term, not the English gloss "selfless action."

## The problem with low-resource language evaluation

LLMs fail on low-resource languages in ways that pure accuracy metrics miss:

**Terminology collapse.** Models substitute English glosses for domain-specific terms: writing "cosmic order" instead of "Rta", "meditation" instead of "dhyana", "duty" instead of "svadharma." This strips cultural and semantic precision even when the underlying fact is technically correct.

**Structural incoherence.** Answers about complex topics arrive as bags of loosely related facts instead of logically sequenced arguments. An explanation of the six Darshanas that jumbles founders with commentators, or a Dashavatara account that breaks chronological ordering, fails structurally even if every individual claim is true.

**Completeness gaps.** Models cover one dimension of a multi-faceted concept and call it done: describing dharma only as "duty" without addressing its subtlety (sukshma), context-dependence, or the rajadharma/apaddharma/moksha-dharma triad that the Mahabharata actually teaches.

**Misconception propagation.** Some errors are so common in training data that models reproduce them confidently: Shankaracharya "founding" Vedanta (he was a commentator, not the founder), or Indra "maintaining" Rta (that's Varuna). These need active detection and penalisation, not just factual comparison.

This environment provides a structured RL benchmark for training and evaluating agents that address **all four failure modes simultaneously**.

---

## How it works

An agent receives a question and a pre-generated answer that may be flawed along any combination of axes: factual errors, poor structure, missing terminology, wrong ordering, incomplete coverage. The agent interacts with the environment through a fixed action space to iteratively improve the answer before finalising it.
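
The interaction is the usual OpenEnv reset/step loop over HTTP. The sketch below is illustrative only: the JSON field names (`action_type`, `payload`, `reward`, `done`) are assumptions made for this example, and the real request/response schema lives in `main.py` and `models.py`.

```python
# Minimal episode sketch against a locally running server (see Setup below).
# The payload/response field names here are assumptions, not the confirmed schema.
import requests

BASE = "http://localhost:8000"

# Start an episode.
obs = requests.post(f"{BASE}/reset", json={}).json()

# Scripted mini-policy: pull one source passage, then accept the current answer.
for action in ({"action_type": "RETRIEVE", "payload": "guardian of Rta"},
               {"action_type": "ACCEPT", "payload": None}):
    result = requests.post(f"{BASE}/step", json=action).json()
    print(result.get("reward"), result.get("done"))
    if result.get("done"):
        break
```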

## Action Space

| Action        | Payload                                       | Effect                                                              |
|---------------|-----------------------------------------------|---------------------------------------------------------------------|
| `RETRIEVE`    | Optional query string                         | Surfaces the next available source passage                          |
| `EDIT`        | New answer text                               | Rewrites to fix factual errors and improve content                  |
| `RESTRUCTURE` | Reorganised answer text                       | Reorganises flow, ordering, and terminology without changing facts  |
| `CITE`        | Citation string (e.g. `"Bhagavad Gita 2.47"`) | Attaches a citation                                                 |
| `ACCEPT`      | –                                             | Accepts answer as final (terminal)                                  |
| `REJECT`      | –                                             | Rejects the answer entirely (terminal)                              |

The distinction between `EDIT` and `RESTRUCTURE` is deliberate. EDIT changes what the answer says. RESTRUCTURE changes how it says it: reordering paragraphs, inserting transitions, swapping an English gloss for the correct Sanskrit term, expanding a single sentence into the three conceptual aspects the topic requires. The grader scores them differently: RESTRUCTURE is penalised if it destroys factual content, and EDIT is measured on both factual and structural improvement.
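
For example, against a flawed answer about Rta, the two actions might carry payloads like these (the action shape is assumed, as in the loop sketch above): EDIT repairs the fact, while RESTRUCTURE keeps the facts but restores the precise Sanskrit term and the intended flow.

```python
# Illustrative payloads only; the accepted action schema is defined in models.py.
edit_action = {
    "action_type": "EDIT",
    "payload": "Varuna, not Indra, is the guardian of Rta in the Vedic hymns.",
}

restructure_action = {
    "action_type": "RESTRUCTURE",
    "payload": "The Gita's teaching is nishkama karma: action performed without attachment to its fruits.",
}
```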

## Observation Space

| Field                | Type        | Description                                                   |
|----------------------|-------------|---------------------------------------------------------------|
| `question`           | `str`       | The question being answered                                   |
| `current_answer`     | `str`       | Current (possibly flawed) answer                              |
| `retrieved_passages` | `list[str]` | Source passages retrieved so far                              |
| `current_citations`  | `list[str]` | Citations attached so far                                     |
| `steps_remaining`    | `int`       | Steps left in the episode                                     |
| `task_name`          | `str`       | Active task identifier                                        |
| `feedback`           | `str?`      | Feedback from the last action (includes structural breakdown) |
| `structural_hints`   | `list[str]` | Non-spoiler hints about expected answer structure             |

`structural_hints` are the agent's window into what the grader expects structurally: things like "Use the Sanskrit term for selfless action", "Cover scriptural, ritual, AND mathematical dimensions", or "Follow narrative arc: setup → churning → crisis → treasures → resolution." They don't reveal the answer but guide the agent toward well-formed output.
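
Concretely, an observation might look like the dictionary below. Only the field names and types come from the table above; the values are invented for illustration.

```python
# Invented example values; field names/types follow the observation table.
observation = {
    "question": "What is Rta and who upholds it in the Vedas?",
    "current_answer": "Rta is cosmic order. Indra maintains it.",  # deliberately flawed
    "retrieved_passages": [],
    "current_citations": [],
    "steps_remaining": 8,
    "task_name": "correct-and-cite",
    "feedback": None,
    "structural_hints": [
        "Use the Sanskrit term Rta, not only the English gloss",
        "Distinguish the natural-law and moral-law dimensions",
    ],
}
```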

## Tasks

| Task                | Difficulty | Max Steps | Focus |
|---------------------|------------|-----------|-------|
| `verify-factual`    | Easy       | 5         | Can the agent distinguish a correct answer from a wrong one, accounting for both factual accuracy and structural adequacy? |
| `correct-and-cite`  | Medium     | 8         | Given a partially correct answer with missing citations and poor structure, can the agent retrieve sources, fix gaps, add terminology, and reorganise? |
| `fix-hallucination` | Hard       | 12        | Can the agent detect subtle hallucinations woven into plausible text while simultaneously fixing structural problems: wrong concept ordering, banned misconception terms, incomplete aspect coverage? |

Each task has 5 scenarios covering the Vedas, Upanishads, Ramayana, Mahabharata, Bhagavad Gita, and Puranas. Every scenario carries both factual ground truth and a `StructuralMeta` specification defining required terms, required sections, expected ordering, and banned misconception markers.
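
The task and scenario are presumably selected when the episode is reset. The parameter names below (`task_name`, `scenario`) are an assumption inferred from the observation fields and the baseline note, not a confirmed part of the `/reset` contract; `main.py` and `tasks.py` are authoritative.

```python
import requests

# Assumed reset parameters; the actual keys accepted by /reset may differ.
obs = requests.post(
    "http://localhost:8000/reset",
    json={"task_name": "fix-hallucination", "scenario": 0},
).json()
```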

## Reward Structure

The final score blends **factual quality** and **structural quality** into a single score in **[0.0, 1.0]**.

### Terminal reward (on ACCEPT)

| Component          | Max  | What it measures                               |
|--------------------|------|------------------------------------------------|
| Factual similarity | 0.90 | Token-F1 between final answer and ground truth |
| Citation recall    | 0.30 | Fraction of expected citations matched         |
| Structural quality | 0.70 | Composite of 4 axes (see below)                |
| Efficiency bonus   | 0.20 | Reward for finishing in fewer steps            |

### Structural quality composite (0.70 max)

| Axis             | Weight | What it catches                                                                          |
|------------------|--------|------------------------------------------------------------------------------------------|
| **Terminology**  | 0.30   | Are the correct Sanskrit/domain terms present? Are banned misconception markers absent?   |
| **Completeness** | 0.25   | Does the answer cover all required conceptual aspects of the topic?                       |
| **Ordering**     | 0.25   | Do concepts appear in the expected logical/narrative sequence?                            |
| **Coherence**    | 0.20   | Transition quality, sentence-structure uniformity, multi-sentence flow                    |

All four axes are computed without ML dependencies (token matching, keyword heuristics, positional analysis, and discourse-marker detection), so the environment runs on minimal hardware (2 vCPU, 8 GB RAM).
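
As a rough sketch of how the composite might fold together, using the weights from the table above (the helper below is invented for illustration; the actual computation lives in `rewards.py`):

```python
# Sketch of the structural composite using the table's weights.
# Axis scores are assumed to lie in [0, 1]; rewards.py holds the real logic.
STRUCTURAL_WEIGHTS = {
    "terminology": 0.30,
    "completeness": 0.25,
    "ordering": 0.25,
    "coherence": 0.20,
}
STRUCTURAL_MAX = 0.70  # cap on the structural contribution to the terminal reward


def structural_quality(axis_scores: dict[str, float]) -> float:
    """Weighted average of the four axes, scaled to the 0.70 cap."""
    weighted = sum(STRUCTURAL_WEIGHTS[axis] * axis_scores.get(axis, 0.0)
                   for axis in STRUCTURAL_WEIGHTS)
    return STRUCTURAL_MAX * weighted


# Strong terminology and ordering, weaker completeness and coherence:
print(structural_quality({"terminology": 0.9, "completeness": 0.5,
                          "ordering": 0.8, "coherence": 0.6}))  # ≈ 0.50
```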

### Per-step shaping

| Action        | Good outcome          | Bad outcome                  |
|---------------|-----------------------|------------------------------|
| `RETRIEVE`    | +0.05 (useful)        | −0.15 (redundant, >3 times)  |
| `EDIT`        | +0.20 + quality delta | −0.20 (degradation)          |
| `RESTRUCTURE` | +0.25 + struct delta  | −0.25 (destroyed facts)      |
| `CITE`        | +0.15 (correct)       | −0.05 (wrong)                |

Step-level rewards blend factual and structural deltas (60/40 for EDIT, structure-dominant for RESTRUCTURE), giving the agent a continuous signal throughout the episode rather than only at termination.
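
For an EDIT step, the 60/40 blend presumably amounts to something like the sketch below; the function is illustrative and ignores the fixed bonuses and penalties from the table, which `rewards.py` applies on top.

```python
# Illustrative 60/40 blend of quality deltas for an EDIT step.
def edit_step_delta(factual_delta: float, structural_delta: float) -> float:
    """Blend the change in factual and structural quality after an EDIT."""
    return 0.6 * factual_delta + 0.4 * structural_delta


# An edit that improves facts substantially and structure slightly:
print(edit_step_delta(factual_delta=0.3, structural_delta=0.1))  # 0.22
```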

---

## Setup

### Server (Docker)

```bash
docker build -t indic-scripture-qa .
docker run -p 8000:8000 indic-scripture-qa
```

Verify: `curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'`

### Inference

```bash
pip install -r requirements.txt

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
export PING_URL="http://localhost:8000"

python inference.py
```

### Validate

```bash
pip install openenv-core
openenv validate
```

## Baseline Scores

| Task                | Score |
|---------------------|-------|
| `verify-factual`    | ~0.40 |
| `correct-and-cite`  | ~0.30 |
| `fix-hallucination` | ~0.22 |
| **Average**         | ~0.31 |

*(Qwen2.5-72B-Instruct, temperature=0.4, scenario 0, structural eval enabled)*

## Project Structure

```
├── openenv.yaml             # OpenEnv metadata
├── Dockerfile               # Server container
├── main.py                  # FastAPI server (reset/step/state)
├── environment.py           # Core env logic
├── models.py                # Typed Pydantic models + StructuralMeta
├── tasks.py                 # Task definitions, scenarios, structural metadata
├── rewards.py               # Factual + structural reward computation
├── inference.py             # Baseline inference script
├── requirements.txt         # Client deps
└── requirements-server.txt  # Server deps
```

## License

MIT