# Math Vision Dataset Integration Summary
This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.
## Created Files
### 1. Dataloader (Builder)
**Path**: `larm/data/builders/math_vision_builder.py`
- **Class**: `MathVisionBuilder`
- **Registry**: Registered as `"math_vision"` in the builder registry
- **Features**:
  - Loads preprocessed data from `data/math_vision/` directory
  - Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
  - Validates required fields: `prompt`, `completion`, `solution`, `image_path`
  - Handles `max_samples` filtering for quick testing
  - For GRPO mode, clears completion field to enable generation from scratch
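
The builder behaviour listed above can be sketched in a few lines. This is a minimal illustration, not the code in `math_vision_builder.py`: the function name `prepare_samples` and its signature are hypothetical, and it assumes each sample is a plain dict.

```python
from typing import Any, Dict, List, Optional

# Field names taken from this document.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_samples(
    samples: List[Dict[str, Any]],
    mode: str,
    max_samples: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: validate, truncate, and adapt samples to the mode."""
    if mode not in ("sft", "grpo"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Truncate first so quick tests do not validate the whole dataset.
    if max_samples is not None:
        samples = samples[:max_samples]

    prepared = []
    for index, sample in enumerate(samples):
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            raise KeyError(f"sample {index} is missing required fields: {missing}")

        record = dict(sample)  # copy so the caller's data is left untouched
        if mode == "grpo":
            # GRPO generates its own completions, so the reference text is cleared.
            record["completion"] = ""
        prepared.append(record)
    return prepared
```

Copying each record keeps the original SFT completions available if the same list is reused for both modes.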
### 2. Environment
**Path**: `larm/data/envs/math_vision_env.py`
- **Class**: `MathVisionEnv`
- **Registry**: Registered as `"math_vision"` in the environment registry
- **Features**:
  - Computes accuracy rewards for GRPO training
  - Supports multiple answer extraction formats:
    - LaTeX `\boxed{}` notation
    - `<answer>` XML tags
    - Multiple choice letters (A-E)
    - Direct numerical/text answers
  - Normalizes answers for comparison (handles whitespace, case)
  - Optional LLM-based verification (set `USE_LLM_VERIFIER=true`)
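
To make the extraction and normalization steps concrete, here is a rough sketch of how such a reward could work. It is not the implementation in `math_vision_env.py`: the function names, the order of the fallbacks, and the exact normalization are assumptions chosen for illustration.

```python
import re
from typing import List


def extract_answer(text: str) -> str:
    """Hypothetical extractor: try each format in turn, then fall back to the raw text."""
    # 1. Last \boxed{...}, scanning braces so nested LaTeX such as \frac{1}{2} survives.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start != -1:
        start += len(marker)
        depth = 0
        for pos in range(start, len(text)):
            if text[pos] == "{":
                depth += 1
            elif text[pos] == "}":
                if depth == 0:
                    return text[start:pos]
                depth -= 1

    # 2. <answer>...</answer> tags.
    tagged = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL | re.IGNORECASE)
    if tagged:
        return tagged.group(1)

    # 3. A bare multiple-choice letter such as "C", "(C)" or "C.".
    stripped = text.strip()
    letter = re.fullmatch(r"\(?([A-Ea-e])\)?[.)]?", stripped)
    if letter:
        return letter.group(1)

    # 4. Direct numerical or text answer.
    return stripped


def normalize_answer(answer: str) -> str:
    """Collapse whitespace and ignore case before comparing."""
    return re.sub(r"\s+", " ", answer).strip().lower()


def accuracy_reward(completions: List[str], solutions: List[str]) -> List[float]:
    """Return 1.0 for each completion whose extracted answer matches the solution."""
    return [
        1.0 if normalize_answer(extract_answer(c)) == normalize_answer(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]
```

Exact string matching after normalization is deliberately strict; equivalent forms such as `0.5` and `\frac{1}{2}` would score 0.0 here, which is the kind of case the optional LLM verifier is presumably meant to cover.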
### 3. Configuration
**Path**: `configs/latent_memory/math_vision.yaml`
- **Model Config**:
  - Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
  - Weaver: Qwen2.5-1.5B-Instruct (memory processor)
  - Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)
- **Dataset Config**:
  - Mode: `sft` or `grpo`
  - Train/valid/test split ratios
  - Image root directory
- **Training Config**:
  - SFT settings: batch_size=4, max_epochs=2, lr=1e-5
  - GRPO settings: batch_size=1, num_generations=8, lr=1e-5
  - Generation config: max_response_length=1024, temperature=1.0
### 4. Data Processing Script
**Path**: `scripts/math_vision_process.py`
- **Purpose**: Converts raw Math Vision data to training format
- **Input**: Existing `data/math_vision/train.json` and `test.json`
- **Output**: Preprocessed `train.json`, `valid.json`, `test.json` with fields:
  - `prompt`: Formatted question with instructions and options
  - `completion`: Solution text or generated answer
  - `solution`: Ground truth in `\boxed{}` format
  - `image_path`: Absolute path to image file
- **Features**:
  - Handles multiple choice questions (adds formatted options)
  - Handles free-form questions
  - Creates validation split from training data
  - Filters samples with empty solutions
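
The filtering and validation split can be illustrated with a short sketch. The helper name `split_train_valid`, the seed, and the shuffling strategy are assumptions; the real script may split differently. Rounding the validation size up does reproduce the 28477/3165 split sizes shown in the Quick Start log for a 31642-sample training set.

```python
import math
import random
from typing import Any, Dict, List, Tuple

Sample = Dict[str, Any]


def split_train_valid(
    samples: List[Sample],
    val_ratio: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Sample], List[Sample]]:
    """Hypothetical helper: drop samples without a solution, then split train/valid."""
    # Filter out samples whose solution is missing or only whitespace.
    kept = [s for s in samples if str(s.get("solution") or "").strip()]

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)

    # Round up the validation size; round() first so float noise such as
    # 30 * 0.1 == 3.0000000000000004 does not add an extra sample.
    n_valid = math.ceil(round(len(shuffled) * val_ratio, 9))
    return shuffled[n_valid:], shuffled[:n_valid]
```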
### 5. Training Scripts
#### Multi-GPU Training
**Path**: `scripts/run_math_vision_multi.sh`
- 8 GPUs (configurable via CUDA_VISIBLE_DEVICES)
- Uses DeepSpeed ZeRO-2 optimization
- GRPO training mode by default
#### Single-GPU Training
**Path**: `scripts/run_math_vision.sh`
- 1 GPU
- Simpler setup for development/testing
### 6. Documentation
**Path**: `data/math_vision/README.md`
- Complete usage guide
- Configuration explanations
- Troubleshooting tips
- Comparison with MM Math
### 7. Registry Updates
**Path**: `larm/data/builders/__init__.py`
- Added import for `math_vision_builder`
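
The import matters because decorator-based registries only learn about a class when its module is imported. The sketch below shows that general pattern; the `Registry` class and the `MathVisionBuilderSketch` name are hypothetical stand-ins, not the actual `larm` registry API.

```python
from typing import Callable, Dict


class Registry:
    """Hypothetical name-to-class registry, for illustration only."""

    def __init__(self) -> None:
        self._classes: Dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            if name in self._classes:
                raise KeyError(f"{name!r} is already registered")
            self._classes[name] = cls
            return cls

        return decorator

    def get(self, name: str) -> type:
        if name not in self._classes:
            raise KeyError(f"{name!r} is not registered; was its module imported?")
        return self._classes[name]


registry = Registry()


# The decorator runs when this module is imported, which is why the
# package __init__ has to import the builder module.
@registry.register("math_vision")
class MathVisionBuilderSketch:
    pass
```

With this pattern, a config that names `"math_vision"` fails with a lookup error if the import in `__init__.py` is missing, even though the builder file exists.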
## Quick Start Guide
### Step 1: Preprocess the Data
```bash
cd /root/CVPR/MemGen
# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```
Expected output:
```
[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json
```
### Step 2: Train the Model
#### Option A: Multi-GPU (Recommended for production)
```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh
```
#### Option B: Single-GPU (For testing)
```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh
```
### Step 3: Monitor Training
Check logs:
```bash
tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt
```
Check outputs:
```bash
ls -la /root/CVPR/MemGen/test_output/math_vision/
```
## Architecture Overview
```
Data Flow:
┌─────────────────────┐
│ Raw Math Vision     │
│ train.json          │
│ test.json           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ math_vision_process │ ← Preprocessing script
│ .py                 │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Preprocessed Data   │
│ train.json          │
│ valid.json          │
│ test.json           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionBuilder   │ ← Dataloader
│                     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Training Loop       │
│ (SFT or GRPO)       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionEnv       │ ← Reward computation
│                     │
└─────────────────────┘
```
## Key Design Decisions
### 1. Answer Format Flexibility
The environment supports multiple answer formats:
- Multiple choice: Extract letter (A-E) from completion
- Free-form: Extract from `\boxed{}` or `<answer>` tags
- Robust parsing with fallbacks
### 2. Completion vs Solution
- **completion**: Full solution text (used in SFT mode)
- **solution**: Just the answer in `\boxed{}` format (used for reward)
- In GRPO mode, completion is cleared to force generation
### 3. Image Path Handling
- Original data has absolute paths → preserved as-is
- No need to reconstruct paths from relative components
- Dataloader just passes through the image_path field
### 4. Multiple Choice Formatting
- Options are formatted as "A. option1\nB. option2\n..." in the prompt
- Answer extraction recognizes single-letter answers
- Normalization handles case differences
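
As an illustration of the option formatting described above, the following sketch builds the `A. option1\nB. option2` block. The function names are hypothetical, and the instruction text that the real prompt adds is left out because this document does not specify it.

```python
from typing import List, Optional

# The document mentions answer letters A-E, so five options are assumed as the limit.
OPTION_LETTERS = "ABCDE"


def format_options(options: List[str]) -> str:
    """Render options as 'A. ...' lines joined by newlines."""
    if len(options) > len(OPTION_LETTERS):
        raise ValueError(f"at most {len(OPTION_LETTERS)} options are supported")
    return "\n".join(f"{letter}. {option}" for letter, option in zip(OPTION_LETTERS, options))


def build_prompt(question: str, options: Optional[List[str]] = None) -> str:
    """Append formatted options for multiple choice; leave free-form questions as-is."""
    if not options:
        return question
    return f"{question}\n{format_options(options)}"


print(build_prompt("What is 2 + 2?", ["3", "4", "5"]))
```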
## Configuration Tips
### For Quick Testing
```yaml
# In configs/latent_memory/math_vision.yaml
datasets:
  math_vision:
    sft:
      max_samples: 100 # Add this line to limit dataset size
run:
  weaver:
    sft:
      max_epochs: 1
      batch_size: 2
```
### For Production Training
```yaml
run:
  weaver:
    grpo:
      max_epochs: 3
      batch_size: 2
      grad_accum_steps: 4 # Effective batch size = 2 * 4 * 8 GPUs = 64
      num_generations: 16
```
### For Memory-Constrained GPUs
```yaml
run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 8
  generation:
    max_response_length: 512 # Reduce from 1024
```
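
The effective batch size arithmetic used in the production config can be checked with a small helper. This is only the multiplication stated in the config comment, assuming the 8-GPU setup of the multi-GPU script; the function name is hypothetical.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


# Production settings: 2 per GPU, 4 accumulation steps, 8 GPUs.
print(effective_batch_size(2, 4, 8))  # 64

# Memory-constrained settings: 1 per GPU, 8 accumulation steps, 8 GPUs.
print(effective_batch_size(1, 8, 8))  # 64
```

Halving the per-GPU batch size while doubling the accumulation steps keeps the effective batch size unchanged on the same number of GPUs, at the cost of more forward and backward passes per optimizer step.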
## Comparison: MM Math vs Math Vision
### Similarities (Design Pattern)
- ✅ Same builder interface (BaseDatasetBuilder)
- ✅ Same environment interface (StaticEnv)
- ✅ Same preprocessing pipeline
- ✅ Same field structure (prompt, completion, solution, image_path)
- ✅ Same training scripts structure
- ✅ Same config file structure
### Differences (Data Characteristics)
| Aspect | MM Math | Math Vision |
|--------|---------|-------------|
| **Language** | Chinese | English |
| **Questions** | Mixed math problems | Competition math (AMC, etc.) |
| **Answer Type** | Mostly numerical | Multiple choice + free-form |
| **Solutions** | Always provided | Sometimes missing |
| **Images** | Geometry diagrams | Various (graphs, diagrams, etc.) |
| **Levels** | Not categorized | Levels 1-5 |
| **Subjects** | General | Categorized (algebra, geometry, etc.) |
| **Dataset Size** | ~30k train | ~28k train after split |
### Environment Differences
`MathVisionEnv` differs from `MMMathEnv` in three ways:
- Added multiple choice letter extraction
- Added subject/level awareness (for future filtering)
- More flexible answer normalization
## Testing Checklist
Before starting full training, verify:
- [ ] Preprocessing runs without errors
- [ ] train.json, valid.json, test.json are created
- [ ] Check a few preprocessed examples manually
- [ ] Verify image paths are valid
- [ ] Test loading with dataloader (run with max_samples=10)
- [ ] Test single batch of training
- [ ] Check reward computation works
- [ ] Monitor GPU memory usage
- [ ] Verify checkpoints are saved
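
Several of the data checks above can be automated. The sketch below assumes each split file is a JSON array of sample objects; the helper name `check_split` is hypothetical and not part of the repository.

```python
import json
import os
from typing import List

REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def check_split(path: str) -> List[str]:
    """Return human-readable problems found in one preprocessed split file."""
    with open(path, encoding="utf-8") as handle:
        samples = json.load(handle)

    problems = []
    for index, sample in enumerate(samples):
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            problems.append(f"sample {index}: missing fields {missing}")
        elif not os.path.isfile(sample["image_path"]):
            problems.append(f"sample {index}: image not found at {sample['image_path']}")
    return problems
```

Calling `check_split("data/math_vision/train.json")` and printing the returned list covers the field and image path items; an empty list means both checks passed for that split.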
## Troubleshooting
### Issue: "No preprocessed data found"
**Solution**: Run the preprocessing script first:
```bash
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```
### Issue: "Missing required fields"
**Solution**: Re-run preprocessing and confirm that every sample contains all four required fields: `prompt`, `completion`, `solution`, and `image_path`.
### Issue: "Image file not found"
**Solution**: Check that images exist at the paths in the JSON files
```bash
# Print the image paths of the first 20 samples (jq needs the whole file, not a truncated head)
jq -r '.[:20][] | .image_path' data/math_vision/train.json
```
### Issue: Low accuracy (all rewards = 0)
**Solution**: Check answer extraction
```python
# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv
completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores) # Should be [1.0]
```
### Issue: Out of memory
**Solution**: Reduce batch size and response length
```yaml
batch_size: 1
grad_accum_steps: 16
max_response_length: 384
```
## Next Steps
After successful training:
1. **Evaluate**: Use the test set to measure performance
2. **Tune**: Adjust hyperparameters based on results
3. **Experiment**: Try different latent lengths, models, or methods
4. **Analyze**: Look at generated solutions to identify issues
5. **Iterate**: Refine prompts or reward functions as needed
## Files Summary
**Created**:
- ✅ `larm/data/builders/math_vision_builder.py` (137 lines)
- ✅ `larm/data/envs/math_vision_env.py` (115 lines)
- ✅ `configs/latent_memory/math_vision.yaml` (166 lines)
- ✅ `scripts/math_vision_process.py` (326 lines)
- ✅ `scripts/run_math_vision_multi.sh` (44 lines)
- ✅ `scripts/run_math_vision.sh` (32 lines)
- ✅ `data/math_vision/README.md` (comprehensive guide)
- ✅ `MATH_VISION_SETUP.md` (this file)
**Modified**:
- ✅ `larm/data/builders/__init__.py` (added import)
**Total**: 9 files created/modified
All files follow the MM Math pattern and integrate seamlessly with the existing MemGen codebase.