	# Math Vision Dataset Integration Summary

	This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.

	## Created Files

	### 1. Dataloader (Builder)
	Path: `larm/data/builders/math_vision_builder.py`

	- Class: `MathVisionBuilder`
	- Registry: Registered as `"math_vision"` in the builder registry
	- Features:
	  - Loads preprocessed data from the `data/math_vision/` directory
	  - Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
	  - Validates required fields: `prompt`, `completion`, `solution`, `image_path`
	  - Handles `max_samples` filtering for quick testing
	  - For GRPO mode, clears the `completion` field to enable generation from scratch
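
The builder features above can be illustrated with a minimal sketch. This is not the code in `math_vision_builder.py`; the function name `prepare_samples` and its arguments are hypothetical, and it assumes the preprocessed data is a list of dictionaries carrying the four required fields.

```python
from typing import Any, Dict, List, Optional

REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_samples(
    samples: List[Dict[str, Any]],
    mode: str = "sft",
    max_samples: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Validate, truncate, and adapt preprocessed records for one training mode.

    Hypothetical helper: the real builder may structure this differently.
    """
    if mode not in ("sft", "grpo"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Optional truncation for quick tests.
    if max_samples is not None:
        samples = samples[:max_samples]

    prepared = []
    for index, sample in enumerate(samples):
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            raise KeyError(f"sample {index} is missing required fields: {missing}")
        record = dict(sample)  # copy so the caller's data is not mutated
        if mode == "grpo":
            # GRPO generates its own completions, so the reference text is cleared.
            record["completion"] = ""
        prepared.append(record)
    return prepared
```

Calling `prepare_samples(rows, mode="grpo", max_samples=10)` would return at most ten records, each with an empty `completion`, while `solution` stays available for the reward.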

	### 2. Environment
	Path: `larm/data/envs/math_vision_env.py`

	- Class: `MathVisionEnv`
	- Registry: Registered as `"math_vision"` in the environment registry
	- Features:
	  - Computes accuracy rewards for GRPO training
	  - Supports multiple answer extraction formats:
	    - LaTeX `\boxed{}` notation
	    - `<answer>` XML tags
	    - Multiple choice letters (A-E)
	    - Direct numerical/text answers
	  - Normalizes answers for comparison (handles whitespace, case)
	  - Optional LLM-based verification (set `USE_LLM_VERIFIER=true`)
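
As an illustration of the extraction order listed above, here is a minimal sketch of a fallback chain followed by normalization. It is not the code in `math_vision_env.py`: the helper names, the regular expressions, and the choice to strip all whitespace during normalization are assumptions made for this example.

```python
import re
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def extract_answer(text: str) -> str:
    """Try each format in turn: boxed, <answer> tag, choice letter, last line."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    tagged = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged.group(1)
    # A capital letter A-E right after "answer", e.g. "the answer is (C)".
    letter = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-E])\)?(?![A-Za-z])", text)
    if letter:
        return letter.group(1)
    # Fall back to the last non-empty line as a direct answer.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def normalize(answer: str) -> str:
    """Remove whitespace and lowercase (an assumed normalization)."""
    return "".join(answer.split()).lower()


def accuracy_reward(completions: List[str], solutions: List[str]) -> List[float]:
    """Score 1.0 when the normalized extracted answers match, else 0.0."""
    return [
        1.0 if normalize(extract_answer(c)) == normalize(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]
```

Tracking brace depth instead of using a single regular expression lets nested answers such as `\boxed{\frac{1}{2}}` come out whole.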

	### 3. Configuration
	Path: `configs/latent_memory/math_vision.yaml`

	- Model Config:
	  - Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
	  - Weaver: Qwen2.5-1.5B-Instruct (memory processor)
	  - Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)

	- Dataset Config:
	  - Mode: `sft` or `grpo`
	  - Train/valid/test split ratios
	  - Image root directory

	- Training Config:
	  - SFT settings: batch_size=4, max_epochs=2, lr=1e-5
	  - GRPO settings: batch_size=1, num_generations=8, lr=1e-5
	  - Generation config: max_response_length=1024, temperature=1.0

	### 4. Data Processing Script
	Path: `scripts/math_vision_process.py`

	- Purpose: Converts raw Math Vision data to training format
	- Input: Existing `data/math_vision/train.json` and `test.json`
	- Output: Preprocessed `train.json`, `valid.json`, `test.json` with fields:
	  - `prompt`: Formatted question with instructions and options
	  - `completion`: Solution text or generated answer
	  - `solution`: Ground truth in `\boxed{}` format
	  - `image_path`: Absolute path to image file

	- Features:
	  - Handles multiple choice questions (adds formatted options)
	  - Handles free-form questions
	  - Creates validation split from training data
	  - Filters samples with empty solutions

	### 5. Training Scripts

	#### Multi-GPU Training
	Path: `scripts/run_math_vision_multi.sh`
	- 8 GPUs (configurable via CUDA_VISIBLE_DEVICES)
	- Uses DeepSpeed ZeRO-2 optimization
	- GRPO training mode by default

	#### Single-GPU Training
	Path: `scripts/run_math_vision.sh`
	- 1 GPU
	- Simpler setup for development/testing

	### 6. Documentation
	Path: `data/math_vision/README.md`
	- Complete usage guide
	- Configuration explanations
	- Troubleshooting tips
	- Comparison with MM Math

	### 7. Registry Updates
	Path: `larm/data/builders/__init__.py`
	- Added import for `math_vision_builder`

	## Quick Start Guide

	### Step 1: Preprocess the Data

	```bash
	cd /root/CVPR/MemGen

	# Preprocess with config file
	uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
	```

	Expected output:
	```
	[INFO] Loading dataset from data/math_vision
	[INFO] Found train.json at data/math_vision/train.json
	[INFO] Found test.json at data/math_vision/test.json
	[INFO] Splitting train set with val_ratio=0.1
	[INFO] Split sizes - train: 28477, valid: 3165
	[INFO] Preprocessing train split with batch_size=512
	[INFO] Preprocessing valid split with batch_size=512
	[INFO] Preprocessing test split with batch_size=512
	[INFO] Saved 28477 samples to data/math_vision/train.json
	[INFO] Saved 3165 samples to data/math_vision/valid.json
	[INFO] Saved 7954 samples to data/math_vision/test.json
	```
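
The split sizes in the log can be sanity-checked with a short calculation. The sketch below assumes the validation share is rounded up to a whole sample; that rounding rule is inferred from the logged numbers, not taken from the script, and `split_sizes` is a hypothetical helper.

```python
import math
from typing import Tuple


def split_sizes(total: int, val_ratio: float) -> Tuple[int, int]:
    """Return (train, valid) sizes, rounding the validation share up (assumed)."""
    n_valid = math.ceil(total * val_ratio)
    return total - n_valid, n_valid


total = 28477 + 3165  # samples available before the split, per the log above
print(split_sizes(total, 0.1))  # (28477, 3165)
```

If the sizes printed by your own run differ, the usual causes would be a different `val_ratio` or a different number of samples surviving the empty-solution filter.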

	### Step 2: Train the Model

	#### Option A: Multi-GPU (Recommended for production)
	```bash
	cd /root/CVPR/MemGen
	bash scripts/run_math_vision_multi.sh
	```

	#### Option B: Single-GPU (For testing)
	```bash
	cd /root/CVPR/MemGen
	bash scripts/run_math_vision.sh
	```

	### Step 3: Monitor Training

	Check logs:
	```bash
	tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt
	```

	Check outputs:
	```bash
	ls -la /root/CVPR/MemGen/test_output/math_vision/
	```

	## Architecture Overview

	```
	Data Flow:
	┌─────────────────────┐
	│ Raw Math Vision     │
	│ train.json          │
	│ test.json           │
	└──────────┬──────────┘
	           │
	           ▼
	┌─────────────────────┐
	│ math_vision_process │ ← Preprocessing script
	│ .py                 │
	└──────────┬──────────┘
	           │
	           ▼
	┌─────────────────────┐
	│ Preprocessed Data   │
	│ train.json          │
	│ valid.json          │
	│ test.json           │
	└──────────┬──────────┘
	           │
	           ▼
	┌─────────────────────┐
	│ MathVisionBuilder   │ ← Dataloader
	│                     │
	└──────────┬──────────┘
	           │
	           ▼
	┌─────────────────────┐
	│ Training Loop       │
	│ (SFT or GRPO)       │
	└──────────┬──────────┘
	           │
	           ▼
	┌─────────────────────┐
	│ MathVisionEnv       │ ← Reward computation
	│                     │
	└─────────────────────┘
	```

	## Key Design Decisions

	### 1. Answer Format Flexibility
	The environment accepts several answer formats:
	- Multiple choice: extracts the letter (A-E) from the completion
	- Free-form: extracts the answer from `\boxed{}` or `<answer>` tags
	- Fallbacks: when neither format is present, parsing falls back to a direct numerical/text answer

	### 2. Completion vs Solution
	- `completion`: Full solution text (used in SFT mode)
	- `solution`: Just the answer in `\boxed{}` format (used for the reward)
	- In GRPO mode, `completion` is cleared so that the model generates from scratch

	### 3. Image Path Handling
	- Original data has absolute paths → preserved as-is
	- No need to reconstruct paths from relative components
	- Dataloader just passes through the image_path field

	### 4. Multiple Choice Formatting
	- Options are formatted as "A. option1\nB. option2\n..." in the prompt
	- Answer extraction recognizes single-letter answers
	- Normalization handles case differences
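
The option layout described above can be sketched as follows. `format_prompt` is a hypothetical helper, not the function used in `scripts/math_vision_process.py`, and it leaves out the instruction text that the real prompt also contains.

```python
import string
from typing import Optional, Sequence


def format_prompt(question: str, options: Optional[Sequence[str]] = None) -> str:
    """Append lettered options to a question; free-form questions pass through."""
    question = question.strip()
    if not options:
        return question
    if len(options) > 5:
        # The environment only recognizes letters A-E.
        raise ValueError("at most five options (A-E) are supported")
    lines = [
        f"{letter}. {option}"
        for letter, option in zip(string.ascii_uppercase, options)
    ]
    return question + "\n" + "\n".join(lines)


print(format_prompt("What is 1 + 1?", ["1", "2", "3"]))
```

Keeping the letters in the prompt identical to the letters the environment extracts is what lets a single-letter completion be compared directly against the ground truth.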

	## Configuration Tips

	### For Quick Testing
	```yaml
	# In configs/latent_memory/math_vision.yaml
	datasets:
	  math_vision:
	    sft:
	      max_samples: 100 # Add this line to limit dataset size

	run:
	  weaver:
	    sft:
	      max_epochs: 1
	      batch_size: 2
	```

	### For Production Training
	```yaml
	run:
	  weaver:
	    grpo:
	      max_epochs: 3
	      batch_size: 2
	      grad_accum_steps: 4 # Effective batch size = 2 * 4 * 8 GPUs = 64
	      num_generations: 16
	```

	### For Memory-Constrained GPUs
	```yaml
	run:
	  weaver:
	    grpo:
	      batch_size: 1
	      grad_accum_steps: 8
	  # Nesting of `generation` is assumed here; keep it at the level your config already uses.
	  generation:
	    max_response_length: 512 # Reduce from 1024
	```
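
The effective batch size quoted in the production example is the product of the per-GPU batch size, the gradient accumulation steps, and the GPU count. The sketch below repeats that arithmetic for both configurations; it assumes 8 GPUs (as in the multi-GPU script) and that `batch_size` is counted per GPU, and `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


print(effective_batch_size(2, 4, 8))  # production settings: 64
print(effective_batch_size(1, 8, 8))  # memory-constrained settings: 64
```

Under these assumptions, halving `batch_size` while doubling `grad_accum_steps` keeps the effective batch size unchanged, which is why the memory-constrained settings trade speed rather than optimization behavior.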

	## Comparison: MM Math vs Math Vision

	### Similarities (Design Pattern)
	- ✅ Same builder interface (BaseDatasetBuilder)
	- ✅ Same environment interface (StaticEnv)
	- ✅ Same preprocessing pipeline
	- ✅ Same field structure (prompt, completion, solution, image_path)
	- ✅ Same training scripts structure
	- ✅ Same config file structure

	### Differences (Data Characteristics)

	| Aspect | MM Math | Math Vision |
	|--------|---------|-------------|
	| Language | Chinese | English |
	| Questions | Mixed math problems | Competition math (AMC, etc.) |
	| Answer Type | Mostly numerical | Multiple choice + free-form |
	| Solutions | Always provided | Sometimes missing |
	| Images | Geometry diagrams | Various (graphs, diagrams, etc.) |
	| Levels | Not categorized | Levels 1-5 |
	| Subjects | General | Categorized (algebra, geometry, etc.) |
	| Dataset Size | ~30k train | ~28k train after split |

	### Environment Differences
	The key difference in `MathVisionEnv` vs `MMMathEnv`:
	- Added multiple choice letter extraction
	- Added subject/level awareness (for future filtering)
	- More flexible answer normalization

	## Testing Checklist

	Before starting full training, verify:

	- [ ] Preprocessing runs without errors
	- [ ] train.json, valid.json, test.json are created
	- [ ] Check a few preprocessed examples manually
	- [ ] Verify image paths are valid
	- [ ] Test loading with dataloader (run with max_samples=10)
	- [ ] Test single batch of training
	- [ ] Check reward computation works
	- [ ] Monitor GPU memory usage
	- [ ] Verify checkpoints are saved
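
Several of the checklist items (files created, fields present, image paths valid) can be checked with a small script. This is a sketch, not part of the repository: `check_split` is a hypothetical helper, and it assumes each JSON file holds a list of objects with the four required fields.

```python
import json
import os
from typing import Dict, List

REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def check_split(json_path: str, limit: int = 0) -> Dict[str, List[int]]:
    """Return indices of records with missing fields or missing image files."""
    with open(json_path, "r", encoding="utf-8") as handle:
        records = json.load(handle)
    if limit:
        records = records[:limit]  # inspect only the first `limit` records

    problems: Dict[str, List[int]] = {"missing_fields": [], "missing_images": []}
    for index, record in enumerate(records):
        if any(field not in record for field in REQUIRED_FIELDS):
            problems["missing_fields"].append(index)
        elif not os.path.isfile(record["image_path"]):
            problems["missing_images"].append(index)
    return problems


if __name__ == "__main__":
    for split in ("train", "valid", "test"):
        path = os.path.join("data", "math_vision", f"{split}.json")
        if not os.path.isfile(path):
            print(f"{split}: {path} not found")
            continue
        result = check_split(path)
        print(
            f"{split}: {len(result['missing_fields'])} records missing fields, "
            f"{len(result['missing_images'])} records with missing images"
        )
```

Run it from the repository root after preprocessing; any non-zero count points to the records worth inspecting by hand.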

	## Troubleshooting

	### Issue: "No preprocessed data found"
	Solution: Run preprocessing script first
	```bash
	uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
	```

	### Issue: "Missing required fields"
	Solution: Re-run preprocessing, ensure all 4 fields are present

	### Issue: "Image file not found"
	Solution: Check that images exist at the paths in the JSON files
	```bash
	jq -r '.[].image_path' data/math_vision/train.json | head -20
	```

	### Issue: Low accuracy (all rewards = 0)
	Solution: Check answer extraction
	```python
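	# Test answer extraction on a completion that should score as correct.
	# Run from the repository root so that `larm` is importable.
	from larm.data.envs.math_vision_env import MathVisionEnv
	completions = ["The answer is \\boxed{A}"]
	solutions = ["\\boxed{A}"]
	# Assumes `_accuracy_reward` can be called on the class (e.g. a static method).
	scores = MathVisionEnv._accuracy_reward(completions, solutions)
	print(scores) # Should be [1.0]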
	```

	### Issue: Out of memory
	Solution: Reduce batch size and response length
	```yaml
	batch_size: 1
	grad_accum_steps: 16
	max_response_length: 384
	```

	## Next Steps

	After successful training:

	1. Evaluate: Use the test set to measure performance
	2. Tune: Adjust hyperparameters based on results
	3. Experiment: Try different latent lengths, models, or methods
	4. Analyze: Look at generated solutions to identify issues
	5. Iterate: Refine prompts or reward functions as needed

	## Files Summary

	Created:
	- ✅ `larm/data/builders/math_vision_builder.py` (137 lines)
	- ✅ `larm/data/envs/math_vision_env.py` (115 lines)
	- ✅ `configs/latent_memory/math_vision.yaml` (166 lines)
	- ✅ `scripts/math_vision_process.py` (326 lines)
	- ✅ `scripts/run_math_vision_multi.sh` (44 lines)
	- ✅ `scripts/run_math_vision.sh` (32 lines)
	- ✅ `data/math_vision/README.md` (comprehensive guide)
	- ✅ `MATH_VISION_SETUP.md` (this file)

	Modified:
	- ✅ `larm/data/builders/__init__.py` (added import)

	Total: 9 files created/modified

	All files follow the MM Math pattern and plug into the existing MemGen builder and environment registries.