Nalanda Image VL
Project Overview
What is this project?
Modern AI models can look at a photo and tell you what's in it. But show them a physics circuit diagram, a chemistry molecular structure, or a geometry construction, and they often get confused. These aren't everyday photos -- they're specialized scientific diagrams with their own visual language. A resistor symbol in a circuit, a benzene ring in organic chemistry, or an angle bisector in geometry all carry precise meaning that general-purpose AI models weren't specifically trained to understand.
Nalanda Image VL is our attempt to fix this. We took a powerful vision-language AI model (Meta's LLaMA-3.2-Vision, which can understand both text and images) and taught it to be better at science by training it on thousands of real science questions that come with diagrams, figures, and visual explanations.
Think of it like this: the base AI model is a smart student who has read a lot of textbooks but hasn't practiced many diagram-based science problems. We gave it intensive practice with over 19,000 science questions across Physics, Chemistry, Biology, and Mathematics -- each paired with the kind of diagrams and figures students actually encounter in exams and classrooms.
What is the dataset?
Our training dataset contains 22,679 science question-answer pairs drawn from a pool of over 180,000 questions. We specifically selected questions that involve images, because that's where AI models struggle the most. The dataset covers four core subjects:
- Physics (7,394 questions) -- circuit diagrams, optics ray diagrams, force/free-body diagrams, wave patterns, experimental setups
- Mathematics (6,411 questions) -- geometric constructions, graphs of functions, coordinate geometry figures, trigonometry diagrams
- Chemistry (4,875 questions) -- molecular structures, crystal lattice diagrams, periodic table references, reaction mechanism illustrations
- Biology (3,993 questions) -- cell diagrams, anatomical illustrations, ecosystem charts, genetics pedigree diagrams
What makes this dataset unique is that it contains three types of images:
- Question images (10,144) -- the diagram or figure that comes with the question itself (e.g., "Look at this circuit diagram and find the current")
- Solution images (17,829) -- step-by-step visual explanations showing how to solve the problem (e.g., annotated diagrams showing the working)
- Option images (1,404) -- when the multiple-choice answers are themselves images (e.g., "Which of these molecular structures is correct?")
Each question also comes with a detailed chain-of-thought answer -- not just "the answer is B," but a full explanation of the reasoning steps. This teaches the AI model how to think, not just what to answer.
A representative sample of 1,000 entries from this dataset is publicly available at Nalandadata/nalanda-image-qa.
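For a quick look at the data format, the public sample can be inspected with the Hugging Face datasets library. This is a minimal sketch; the split name is an assumption, so print the first record to see the actual column names rather than relying on any specific schema.

```python
from datasets import load_dataset

# Load the public 1,000-entry sample (the "train" split name is an assumption)
sample = load_dataset("Nalandadata/nalanda-image-qa", split="train")

print(sample)     # number of rows and column names
print(sample[0])  # one question-answer record, including any image fields
```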
What are we trying to achieve?
The goal is straightforward: make AI better at understanding and answering science questions that involve diagrams and figures.
Today's best AI models were trained on billions of natural images (photos, screenshots, artwork), but scientific diagrams are a completely different visual domain. A circuit schematic doesn't look like a cat photo. A molecular orbital diagram doesn't look like a selfie. We hypothesized that giving the AI model focused practice on science-specific visuals would significantly improve its performance -- and the results confirmed this.
What did we achieve?
We fine-tuned the model and compared it against the original base model (before fine-tuning) on 162 test questions it had never seen before:
| | Before Training | After Training | Improvement |
|---|---|---|---|
| Overall accuracy | 38.3% | 47.5% | +9.3 percentage points |
| Mathematics | 14.7% | 38.2% | +23.5 pp (biggest gain) |
| Biology | 37.8% | 51.4% | +13.5 pp |
| Physics | 38.5% | 44.2% | +5.8 pp |
| Chemistry | 59.0% | 56.4% | -2.6 pp |
The most striking result is in Mathematics, where accuracy jumped from 14.7% to 38.2% -- more than two and a half times the baseline. This makes sense: math questions are heavily diagram-dependent (geometry proofs, graph interpretations, coordinate problems), so the model benefited enormously from seeing thousands of annotated math diagrams during training.
Biology saw the second-largest gain (+13.5 pp), likely because biological diagrams (cell structures, ecosystem flow charts, anatomy) are visually distinct from anything in the model's original training data.
Chemistry showed a small decrease (-2.6 pp), which is an area for future improvement -- possibly through better data balancing or chemistry-specific training strategies.
How does it work (in simple terms)?
Instead of retraining the entire 11-billion-parameter model from scratch (which would be extremely expensive), we used a technique called LoRA (Low-Rank Adaptation). This is like adding a small, specialized "lens" on top of the existing model. The original model stays frozen, and the LoRA adapter (this repository -- about 500MB) learns the science-specific adjustments. To use the model, you load the original LLaMA-3.2-Vision model and then attach our adapter on top.
The training was done on a single NVIDIA A100-80GB GPU using Unsloth for acceleration, and the entire pipeline ran on Modal serverless infrastructure.
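To make the infrastructure setup concrete, a serverless training job on Modal can be declared roughly as in the sketch below. This is a hypothetical illustration, not the project's actual pipeline: the app name, installed packages, timeout, and the body of train() are placeholders.

```python
import modal

# Hypothetical Modal app for the fine-tuning job described above
app = modal.App("nalanda-image-vl-training")

# Container image with the training dependencies (package list is an assumption)
image = (
    modal.Image.debian_slim()
    .pip_install("unsloth", "transformers", "peft", "datasets", "bitsandbytes")
)

@app.function(gpu="A100-80GB", image=image, timeout=60 * 60 * 12)
def train():
    # The Unsloth QLoRA training loop would run here on the remote A100
    ...

@app.local_entrypoint()
def main():
    train.remote()
```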
Who is this for?
- Researchers exploring domain-specific fine-tuning of vision-language models
- EdTech developers building AI-powered science tutoring tools
- Students and educators interested in AI-assisted science learning
- AI practitioners looking for a practical example of multimodal LoRA fine-tuning
Technical Details
Model Configuration
| Parameter | Value |
|---|---|
| Base Model | meta-llama/Llama-3.2-11B-Vision-Instruct |
| Fine-tuning Method | QLoRA (4-bit NF4) via Unsloth |
| LoRA Rank / Alpha | 32 / 64 |
| Target Modules | All linear (vision + language layers) |
| Training Epochs | 3 |
| Learning Rate | 5e-5 (cosine schedule, 10% warmup) |
| Effective Batch Size | 8 (1 x 8 gradient accumulation) |
| Max Sequence Length | 2048 |
| Optimizer | AdamW 8-bit |
| Gradient Checkpointing | Unsloth (30% VRAM savings) |
| Hardware | 1x NVIDIA A100-80GB (Modal) |
| Training Samples | 19,272 |
| Training Loss | 0.7974 |
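The configuration above maps onto an Unsloth QLoRA setup roughly like the following sketch. It is not the project's published training script: dropout, bias, output directory, and other unlisted arguments are assumptions, and the trainer config assumes a recent trl version.

```python
from unsloth import FastVisionModel
from trl import SFTConfig

# Base model in 4-bit NF4 with Unsloth gradient checkpointing
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# LoRA on both vision and language layers, rank 32 / alpha 64
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,  # assumption
    bias="none",       # assumption
)

# Trainer hyperparameters matching the table (sketch)
training_args = SFTConfig(
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size 8
    optim="adamw_8bit",
    output_dir="outputs",
)
```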
Training Choices
- Vision layers trained: Unlike typical fine-tuning that freezes the vision encoder, we trained both vision and language layers. Scientific diagrams are visually different from natural images, so the vision encoder needs adaptation too.
- Response-only training: The loss is computed only on the model's answers, not on the questions. This prevents memorizing question templates and focuses learning on generating correct reasoning.
- Subject/difficulty tags: Each question is prepended with tags like [Subject: Physics] [Difficulty: hard], allowing the model to calibrate its reasoning depth (see the sketch after this list).
- Chemistry oversampling: Chemistry had fewer samples than Physics, so we oversampled it to balance the training distribution.
- Curriculum ordering: Training data was ordered Physics -> Mathematics -> Biology -> Chemistry for curriculum-style learning.
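To illustrate the tag prepending and response-only loss, a preprocessing step could look like the sketch below. The field names (subject, difficulty, question, answer) and the helper itself are illustrative assumptions, not the project's actual code; a full multimodal collator would also insert image tokens, which are omitted here for clarity.

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def build_example(sample, tokenizer):
    """Hypothetical preprocessing: prepend tags, mask the prompt in the labels."""
    # Subject/difficulty tags prepended to the question text
    prompt = (
        f"[Subject: {sample['subject']}] [Difficulty: {sample['difficulty']}]\n\n"
        f"{sample['question']}"
    )
    answer = sample["answer"]

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + answer_ids
    # Response-only training: loss is computed on answer tokens only
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}
```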
Data Split
| Split | Samples | Purpose |
|---|---|---|
| Train | 19,272 | Model training |
| Validation | 1,700 | Hyperparameter tuning, early stopping |
| Test | 1,701 | Final evaluation (stratified by subject) |
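A stratified split like the one above can be produced with the datasets library, assuming the subject column is cast to a ClassLabel; the column name, hold-out fraction, and seed below are assumptions used only to illustrate the approach.

```python
from datasets import ClassLabel

# Assume `ds` has a string "subject" column; cast it so stratification works
subjects = sorted(set(ds["subject"]))
ds = ds.cast_column("subject", ClassLabel(names=subjects))

# Hold out ~15%, then split it evenly into validation and test, stratified by subject
split = ds.train_test_split(test_size=0.15, stratify_by_column="subject", seed=42)
heldout = split["test"].train_test_split(test_size=0.5, stratify_by_column="subject", seed=42)

train_ds, val_ds, test_ds = split["train"], heldout["train"], heldout["test"]
```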
Usage
With Unsloth (recommended)
```python
from unsloth import FastVisionModel
from peft import PeftModel
from PIL import Image
import torch

# Load the base model in 4-bit
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach the LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(model, "Nalandadata/nalanda-image-vl")
FastVisionModel.for_inference(model)

# Build the input: the message carries an image placeholder,
# while the actual image is passed to the tokenizer/processor call below
img = Image.open("your_science_diagram.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "[Subject: Physics] [Difficulty: medium]\n\nAnswer the following question..."},
    ],
}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(img, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
With Transformers + PEFT
```python
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load the base model and attach the LoRA adapter
base_model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Nalandadata/nalanda-image-vl")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
```
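Inference then follows the standard Mllama processor pattern, as in the sketch below; the image path and prompt text are placeholders.

```python
from PIL import Image

img = Image.open("your_science_diagram.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "[Subject: Physics] [Difficulty: medium]\n\nAnswer the following question..."},
    ],
}]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(img, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```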
Limitations
- Evaluation scope: Tested on our held-out test set, not on external benchmarks (MMMU, MathVista, ScienceQA). Cross-benchmark evaluation is planned.
- Chemistry performance: Slight regression on Chemistry questions (-2.6 pp) suggests the model needs better chemistry-specific training strategies.
- Dataset size: At 22K samples, our dataset is smaller than state-of-the-art alternatives (Math-LLaVA: 360K, MAVIS: 834K). Scaling could yield further gains.
- Single base model: Only tested with LLaMA-3.2-Vision-11B. Results may differ with other base models (Qwen2.5-VL, InternVL2).
- Missing images: 3,253 samples were excluded due to missing image files, which may introduce selection bias.
Future Work
- Scale the dataset with synthetic augmentation
- Apply GRPO reinforcement learning for self-verification and reasoning
- Evaluate on external benchmarks (MMMU, MathVista, ScienceQA)
- Compare across multiple base models (Qwen2.5-VL, InternVL2)
- Improve Chemistry-specific performance
Citation
@misc{nalanda-image-vl-2025,
title={Nalanda Image VL: Domain-Specific Visual Instruction Tuning for Science Education},
author={Nalanda Data},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/Nalandadata/nalanda-image-vl}
}