Math Vision Dataset Integration Summary
This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.
Created Files
1. Dataloader (Builder)
Path: larm/data/builders/math_vision_builder.py
- Class: MathVisionBuilder
- Registry: Registered as "math_vision" in the builder registry
- Features:
  - Loads preprocessed data from the data/math_vision/ directory
  - Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
  - Validates required fields: prompt, completion, solution, image_path
  - Handles max_samples filtering for quick testing
  - For GRPO mode, clears the completion field to enable generation from scratch
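The builder behavior listed above (field validation, max_samples truncation, and clearing the completion in GRPO mode) can be illustrated with a minimal sketch. The function name `prepare_records` and its signature are hypothetical and are not taken from math_vision_builder.py; only the four field names and the two mode names come from this document.

```python
from typing import Any, Dict, List, Optional

# Field names listed in this document as required by the builder.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_records(
    records: List[Dict[str, Any]],
    mode: str = "sft",
    max_samples: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: validate, truncate, and (for GRPO) blank completions."""
    if mode not in ("sft", "grpo"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Optional truncation for quick tests.
    if max_samples is not None:
        records = records[:max_samples]

    prepared = []
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise KeyError(f"record {index} is missing required fields: {missing}")
        item = dict(record)  # shallow copy so the caller's data is untouched
        if mode == "grpo":
            item["completion"] = ""  # the policy generates the answer itself
        prepared.append(item)
    return prepared
```

Copying each record before blanking the completion keeps the loaded data reusable if the same records are later needed in SFT mode.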
2. Environment
Path: larm/data/envs/math_vision_env.py
- Class: MathVisionEnv
- Registry: Registered as "math_vision" in the environment registry
- Features:
  - Computes accuracy rewards for GRPO training
  - Supports multiple answer extraction formats:
    - LaTeX \boxed{} notation
    - <answer> XML tags
    - Multiple choice letters (A-E)
    - Direct numerical/text answers
  - Normalizes answers for comparison (handles whitespace, case)
  - Optional LLM-based verification (set USE_LLM_VERIFIER=true)
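To make the extraction order concrete, here is a rough sketch of a rule-based accuracy reward that tries \boxed{}, then <answer> tags, then a multiple choice letter, and finally falls back to the raw text. All function names and the regular expressions are assumptions for illustration; the actual MathVisionEnv may use different patterns and a different fallback order, and the optional LLM verifier is not modeled here.

```python
import re
from typing import List, Optional

# Heuristic (assumed): "answer is C", "Answer: (B)", "option D"; capitals A-E only.
_CHOICE_PATTERN = re.compile(
    r"(?i:answer|option)\s*(?i:is)?\s*:?\s*\(?([A-E])\)?(?![A-Za-z])"
)


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def extract_answer(text: str) -> str:
    """Try each format in turn and fall back to the stripped raw text."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    tagged = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged[-1]
    letters = _CHOICE_PATTERN.findall(text)
    if letters:
        return letters[-1]
    return text.strip()


def normalize(answer: str) -> str:
    """Remove all whitespace and ignore case."""
    return re.sub(r"\s+", "", answer).lower()


def accuracy_reward(completions: List[str], solutions: List[str]) -> List[float]:
    """1.0 when the normalized extracted answers match, else 0.0."""
    return [
        1.0 if normalize(extract_answer(c)) == normalize(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]
```

The brace-counting loop is used instead of a single regex so that answers such as \boxed{\frac{1}{2}} are captured whole.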
3. Configuration
Path: configs/latent_memory/math_vision.yaml
Model Config:
- Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
- Weaver: Qwen2.5-1.5B-Instruct (memory processor)
- Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)
Dataset Config:
- Mode: sft or grpo
- Train/valid/test split ratios
- Image root directory
Training Config:
- SFT settings: batch_size=4, max_epochs=2, lr=1e-5
- GRPO settings: batch_size=1, num_generations=8, lr=1e-5
- Generation config: max_response_length=1024, temperature=1.0
4. Data Processing Script
Path: scripts/math_vision_process.py
Purpose: Converts raw Math Vision data to training format
Input: Existing data/math_vision/train.json and test.json
Output: Preprocessed train.json, valid.json, test.json with fields:
- prompt: Formatted question with instructions and options
- completion: Solution text or generated answer
- solution: Ground truth in \boxed{} format
- image_path: Absolute path to image file
Features:
- Handles multiple choice questions (adds formatted options)
- Handles free-form questions
- Creates validation split from training data
- Filters samples with empty solutions
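The last two features (dropping samples without a solution and carving a validation split out of the training data) might look roughly like the sketch below. The function name, the seeded shuffle, and rounding the validation size up are all assumptions; rounding up was chosen only because it is consistent with the split sizes in the Quick Start log (28477 + 3165 = 31642 at val_ratio=0.1).

```python
import math
import random
from typing import Any, Dict, List, Tuple


def filter_and_split(
    samples: List[Dict[str, Any]],
    val_ratio: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Hypothetical helper: drop empty solutions, then split into (train, valid)."""
    if not 0.0 <= val_ratio < 1.0:
        raise ValueError("val_ratio must be in [0, 1)")

    # Keep only samples whose solution is non-empty after stripping whitespace.
    kept = [s for s in samples if str(s.get("solution", "")).strip()]

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)

    n_valid = math.ceil(len(shuffled) * val_ratio)  # assumed rounding rule
    return shuffled[n_valid:], shuffled[:n_valid]
```

Filtering before splitting keeps the validation ratio relative to the usable samples rather than to the raw file size.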
5. Training Scripts
Multi-GPU Training
Path: scripts/run_math_vision_multi.sh
- 8 GPUs (configurable via CUDA_VISIBLE_DEVICES)
- Uses DeepSpeed ZeRO-2 optimization
- GRPO training mode by default
Single-GPU Training
Path: scripts/run_math_vision.sh
- 1 GPU
- Simpler setup for development/testing
6. Documentation
Path: data/math_vision/README.md
- Complete usage guide
- Configuration explanations
- Troubleshooting tips
- Comparison with MM Math
7. Registry Updates
Path: larm/data/builders/__init__.py
- Added import for math_vision_builder
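Name-based registries are commonly implemented with a decorator that runs at import time, which would explain why the import in __init__.py is needed for "math_vision" to resolve. The following is a generic sketch of that pattern and not MemGen's actual registry; `_BUILDERS`, `register_builder`, `get_builder_class`, and the empty stand-in class are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical module-level table mapping registry names to builder classes.
_BUILDERS: Dict[str, type] = {}


def register_builder(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls: type) -> type:
        if name in _BUILDERS:
            raise KeyError(f"builder {name!r} is already registered")
        _BUILDERS[name] = cls
        return cls
    return decorator


def get_builder_class(name: str) -> type:
    """Look up a previously registered builder class."""
    return _BUILDERS[name]


# The decorator runs when this module is imported, so the class only
# becomes discoverable once something imports the module.
@register_builder("math_vision")
class MathVisionBuilder:  # stand-in for the real builder
    pass
```

Under this pattern, forgetting the import leaves the registry without the entry, and a lookup by name fails even though the class exists on disk.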
Quick Start Guide
Step 1: Preprocess the Data
cd /root/CVPR/MemGen
# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
Expected output:
[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json
Step 2: Train the Model
Option A: Multi-GPU (Recommended for production)
cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh
Option B: Single-GPU (For testing)
cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh
Step 3: Monitor Training
Check logs:
tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt
Check outputs:
ls -la /root/CVPR/MemGen/test_output/math_vision/
Architecture Overview
Data Flow:
┌─────────────────────┐
│ Raw Math Vision     │
│ train.json          │
│ test.json           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ math_vision_process │ ← Preprocessing script
│ .py                 │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Preprocessed Data   │
│ train.json          │
│ valid.json          │
│ test.json           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionBuilder   │ ← Dataloader
│                     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Training Loop       │
│ (SFT or GRPO)       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionEnv       │ ← Reward computation
│                     │
└─────────────────────┘
Key Design Decisions
1. Answer Format Flexibility
The environment supports multiple answer formats to handle:
- Multiple choice: Extract letter (A-E) from completion
- Free-form: Extract from \boxed{} or <answer> tags
- Robust parsing with fallbacks
2. Completion vs Solution
- completion: Full solution text (used in SFT mode)
- solution: Just the answer in \boxed{} format (used for reward)
- In GRPO mode, completion is cleared to force generation
3. Image Path Handling
- Original data has absolute paths → preserved as-is
- No need to reconstruct paths from relative components
- Dataloader just passes through the image_path field
4. Multiple Choice Formatting
- Options are formatted as "A. option1\nB. option2\n..." in the prompt
- Answer extraction recognizes single-letter answers
- Normalization handles case differences
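The option layout described above ("A. option1\nB. option2\n...") can be sketched as follows. The helper names are hypothetical, and the instruction text that the real preprocessing script adds to the prompt is deliberately left out because this document does not specify it.

```python
from typing import Optional, Sequence

CHOICE_LETTERS = "ABCDE"  # the document mentions letters A-E


def format_options(options: Sequence[str]) -> str:
    """Render options as 'A. ...' lines joined by newlines."""
    if len(options) > len(CHOICE_LETTERS):
        raise ValueError("at most five options (A-E) are supported")
    return "\n".join(
        f"{letter}. {option}" for letter, option in zip(CHOICE_LETTERS, options)
    )


def build_prompt(question: str, options: Optional[Sequence[str]] = None) -> str:
    """Append formatted options for multiple choice; leave free-form as is."""
    if not options:
        return question
    return f"{question}\n{format_options(options)}"
```

For example, `build_prompt("What is x?", ["1", "2"])` yields the question followed by the lines "A. 1" and "B. 2", while a free-form question is returned unchanged.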
Configuration Tips
For Quick Testing
# In configs/latent_memory/math_vision.yaml
datasets:
  math_vision:
    sft:
      max_samples: 100 # Add this line to limit dataset size
run:
  weaver:
    sft:
      max_epochs: 1
      batch_size: 2
For Production Training
run:
  weaver:
    grpo:
      max_epochs: 3
      batch_size: 2
      grad_accum_steps: 4 # Effective batch size = 2 * 4 * 8 GPUs = 64
      num_generations: 16
For Memory-Constrained GPUs
run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 8
  generation:
    max_response_length: 512 # Reduce from 1024
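The comment in the production settings computes the effective batch size as per-GPU batch size times gradient accumulation steps times number of GPUs. A small check, assuming the 8-GPU setup of the multi-GPU script, shows that the memory-constrained settings keep the same effective batch size by trading per-GPU batch size for accumulation steps. The function name is illustrative only.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


# Production settings: batch_size=2, grad_accum_steps=4, 8 GPUs.
production = effective_batch_size(2, 4, 8)

# Memory-constrained settings: batch_size=1, grad_accum_steps=8, 8 GPUs.
constrained = effective_batch_size(1, 8, 8)

print(production, constrained)  # prints "64 64"
```

Halving the per-GPU batch size while doubling the accumulation steps lowers peak memory without changing how many samples feed each update.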
Comparison: MM Math vs Math Vision
Similarities (Design Pattern)
- ✅ Same builder interface (BaseDatasetBuilder)
- ✅ Same environment interface (StaticEnv)
- ✅ Same preprocessing pipeline
- ✅ Same field structure (prompt, completion, solution, image_path)
- ✅ Same training scripts structure
- ✅ Same config file structure
Differences (Data Characteristics)
| Aspect | MM Math | Math Vision |
|---|---|---|
| Language | Chinese | English |
| Questions | Mixed math problems | Competition math (AMC, etc.) |
| Answer Type | Mostly numerical | Multiple choice + free-form |
| Solutions | Always provided | Sometimes missing |
| Images | Geometry diagrams | Various (graphs, diagrams, etc.) |
| Levels | Not categorized | Levels 1-5 |
| Subjects | General | Categorized (algebra, geometry, etc.) |
| Dataset Size | ~30k train | ~28k train after split |
Environment Differences
The key difference in MathVisionEnv vs MMMathEnv:
- Added multiple choice letter extraction
- Added subject/level awareness (for future filtering)
- More flexible answer normalization
Testing Checklist
Before starting full training, verify:
- Preprocessing runs without errors
- train.json, valid.json, test.json are created
- Check a few preprocessed examples manually
- Verify image paths are valid
- Test loading with dataloader (run with max_samples=10)
- Test single batch of training
- Check reward computation works
- Monitor GPU memory usage
- Verify checkpoints are saved
Troubleshooting
Issue: "No preprocessed data found"
Solution: Run preprocessing script first
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
Issue: "Missing required fields"
Solution: Re-run preprocessing, ensure all 4 fields are present
Issue: "Image file not found"
Solution: Check that images exist at the paths in the JSON files
jq -r '.[].image_path' data/math_vision/train.json | head -20
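To check every path rather than eyeballing the first few, a short script along these lines can list the records whose image file is missing. It assumes each preprocessed file is a JSON array of objects with an image_path field, as the jq filter above implies; the function name is hypothetical.

```python
import json
import os
from typing import List, Optional


def find_missing_images(json_path: str, limit: Optional[int] = None) -> List[str]:
    """Return the image_path values that do not point to an existing file."""
    with open(json_path, "r", encoding="utf-8") as handle:
        records = json.load(handle)  # assumed: a list of dicts
    if limit is not None:
        records = records[:limit]
    missing = []
    for record in records:
        path = record.get("image_path", "")
        if not os.path.isfile(path):
            missing.append(path)
    return missing


if __name__ == "__main__":
    for split in ("train", "valid", "test"):
        file_path = os.path.join("data", "math_vision", f"{split}.json")
        if os.path.isfile(file_path):
            print(split, "missing images:", len(find_missing_images(file_path)))
```

Run it from the repository root so that the relative data/math_vision paths resolve.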
Issue: Low accuracy (all rewards = 0)
Solution: Check answer extraction
# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv
completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores) # Should be [1.0]
Issue: Out of memory
Solution: Reduce batch size and response length
batch_size: 1
grad_accum_steps: 16
max_response_length: 384
Next Steps
After successful training:
- Evaluate: Use the test set to measure performance
- Tune: Adjust hyperparameters based on results
- Experiment: Try different latent lengths, models, or methods
- Analyze: Look at generated solutions to identify issues
- Iterate: Refine prompts or reward functions as needed
Files Summary
Created:
- ✅ larm/data/builders/math_vision_builder.py (137 lines)
- ✅ larm/data/envs/math_vision_env.py (115 lines)
- ✅ configs/latent_memory/math_vision.yaml (166 lines)
- ✅ scripts/math_vision_process.py (326 lines)
- ✅ scripts/run_math_vision_multi.sh (44 lines)
- ✅ scripts/run_math_vision.sh (32 lines)
- ✅ data/math_vision/README.md (comprehensive guide)
- ✅ MATH_VISION_SETUP.md (this file)
Modified:
- ✅ larm/data/builders/__init__.py (added import)
Total: 9 files created/modified
All files follow the MM Math pattern and plug into the existing MemGen builder and environment registries under the name "math_vision".