AI & ML interests

Verified STEM reasoning data for frontier AI labs. Indian curriculum, RLVR-ready: JEE/NEET benchmarks, multimodal QA, and annotated tables, with open models post-trained on them.

Organization Card

Nalandadata

Verified, curriculum-aligned Indian STEM data for frontier AI labs

Training · Post-training · Evaluation, across reasoning, multimodal understanding, and document intelligence.

🔗 License our data  ·  📨 Contact / request access


Nalandadata builds high-quality, curriculum-aligned data sourced from S. Chand, India's largest academic textbook publisher, spanning all subjects, grade levels, and major Indic languages alongside English. Textbook content is structured, expert-authored, and verified, which makes it valuable far beyond education: its reasoning chains, scientific diagrams, structured tables, and multilingual content transfer directly to general-purpose model training and evaluation.

📦 Products

Datasets

| Dataset | What it is |
| --- | --- |
| NalandaJEENEETBench | 116,831 JEE & NEET questions with verified answers + worked solutions. RLVR-ready ground truth. |
| nalanda-image-qa | 22,000+ scientific image Q&A pairs from NCERT diagrams (physics, chemistry, biology). |
| DrishtiTable | 1,421 annotated tables for document AI / table structure recognition, with a full TEDS benchmark + leaderboard. |

Models

| Model | Result |
| --- | --- |
| nalanda-qwen-7b-grpo | Qwen-7B + GRPO on NalandaJEENEETBench: +6.3pp (vs −16pp for naive SFT); verified answers make RLVR work. |
| nalanda-image-vl | Multimodal diagram understanding: +9.3pp over zero-shot. |
| DrishtiTable-Qwen2.5-VL-7B | Table recognition at 83.2% TEDS, beating GPT-4o on our benchmark. |
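
For readers unfamiliar with the metric, TEDS (Tree-Edit-Distance-based Similarity) is usually defined as below. Predicted and reference tables are treated as HTML trees T_a and T_b, EditDist is the tree edit distance between them, and |T| is the number of nodes in a tree. This is the commonly cited definition. Whether the DrishtiTable leaderboard scores structure only or structure plus cell content is not stated here, so that detail is left open.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|,\, |T_b|\right)}
```

A score of 1 means the two trees are identical, so 83.2% corresponds to a mean TEDS of 0.832 over the evaluated tables.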

Benchmark & demos

✅ Why it works

  • Verified ground truth → every JEE/NEET item has a checkable answer, enabling RLVR / GRPO pipelines that actually improve capability.
  • Expert-authored, structured source → reasoning chains, diagrams, and tables, not scraped web noise.
  • Multilingual, curriculum-aligned → English + major Indic languages across all grade levels.

๐Ÿค Trusted by

Partner & customer logos coming soon. Want to be featured? Get in touch.

📜 Licensing & access

We license datasets for AI training, post-training, and evaluation.