# Math Vision Dataset Integration Summary

This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.

## Created Files

### 1. Dataloader (Builder)
**Path**: `larm/data/builders/math_vision_builder.py`

- **Class**: `MathVisionBuilder`
- **Registry**: Registered as `"math_vision"` in the builder registry
- **Features**:
  - Loads preprocessed data from `data/math_vision/` directory
  - Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
  - Validates required fields: `prompt`, `completion`, `solution`, `image_path`
  - Handles `max_samples` filtering for quick testing
  - For GRPO mode, clears completion field to enable generation from scratch
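The field validation and GRPO handling listed above can be illustrated with a minimal sketch. The function name `prepare_records` is hypothetical and is not taken from `MathVisionBuilder`; the sketch also assumes that "clearing" the completion means replacing it with an empty string, which may differ from the actual builder.

```python
# Hypothetical helper, not the actual MathVisionBuilder code.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_records(records, mode="sft", max_samples=None):
    """Validate records and adapt them to the requested training mode."""
    if mode not in ("sft", "grpo"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Optional truncation for quick tests.
    if max_samples is not None:
        records = records[:max_samples]

    prepared = []
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise KeyError(f"record {index} is missing required fields: {missing}")

        item = dict(record)  # copy so the caller's data is left untouched
        if mode == "grpo":
            # Assumption: an empty string stands in for a "cleared" completion.
            item["completion"] = ""
        prepared.append(item)
    return prepared
```

Copying each record keeps the SFT data intact if the same list is later reused in a different mode.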

### 2. Environment
**Path**: `larm/data/envs/math_vision_env.py`

- **Class**: `MathVisionEnv`
- **Registry**: Registered as `"math_vision"` in the environment registry
- **Features**:
  - Computes accuracy rewards for GRPO training
  - Supports multiple answer extraction formats:
    - LaTeX `\boxed{}` notation
    - `<answer>` XML tags
    - Multiple choice letters (A-E)
    - Direct numerical/text answers
  - Normalizes answers for comparison (handles whitespace, case)
  - Optional LLM-based verification (set `USE_LLM_VERIFIER=true`)
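As a rough illustration of extraction with fallbacks, the sketch below tries each format in the order listed above and then compares normalized answers. The names `extract_answer`, `normalize`, and `accuracy_reward` are hypothetical, the priority order is an assumption, and the `\boxed{}` pattern does not handle nested braces such as `\boxed{\frac{1}{2}}`; the real `MathVisionEnv` may behave differently.

```python
import re


def extract_answer(text):
    """Return the answer found in text, trying formats in priority order."""
    # 1. LaTeX \boxed{...}; the last occurrence wins (no nested braces).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1]

    # 2. <answer>...</answer> tags.
    tagged = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged[-1]

    # 3. A bare multiple-choice letter such as "B", "(b)" or "C.".
    stripped = text.strip()
    letter = re.fullmatch(r"\(?([A-Ea-e])\)?[.)]?", stripped)
    if letter:
        return letter.group(1)

    # 4. Fall back to the whole text as a direct numerical/text answer.
    return stripped


def normalize(answer):
    """Collapse whitespace and ignore case."""
    return " ".join(answer.split()).lower()


def accuracy_reward(completions, solutions):
    """Score 1.0 when the normalized answers match, else 0.0."""
    return [
        1.0 if normalize(extract_answer(c)) == normalize(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]
```

Running the ground truth through the same extractor keeps `\boxed{A}` and a bare `a` comparable after normalization.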

### 3. Configuration
**Path**: `configs/latent_memory/math_vision.yaml`

- **Model Config**:
  - Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
  - Weaver: Qwen2.5-1.5B-Instruct (memory processor)
  - Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)
  
- **Dataset Config**:
  - Mode: `sft` or `grpo`
  - Train/valid/test split ratios
  - Image root directory
  
- **Training Config**:
  - SFT settings: batch_size=4, max_epochs=2, lr=1e-5
  - GRPO settings: batch_size=1, num_generations=8, lr=1e-5
  - Generation config: max_response_length=1024, temperature=1.0

### 4. Data Processing Script
**Path**: `scripts/math_vision_process.py`

- **Purpose**: Converts raw Math Vision data to training format
- **Input**: Raw `train.json` and `test.json` in `data/math_vision/`
- **Output**: Preprocessed `train.json`, `valid.json`, `test.json` written to the same directory (the raw `train.json` and `test.json` are overwritten), with fields:
  - `prompt`: Formatted question with instructions and options
  - `completion`: Solution text or generated answer
  - `solution`: Ground truth in `\boxed{}` format
  - `image_path`: Absolute path to image file
  
- **Features**:
  - Handles multiple choice questions (adds formatted options)
  - Handles free-form questions
  - Creates validation split from training data
  - Filters samples with empty solutions
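The filtering and validation split can be sketched as follows. This is an illustrative helper, not the code in `scripts/math_vision_process.py`: the function name, the seeded shuffle, and the rounding of the validation size are all assumptions, so the resulting split sizes may not match the script's output exactly.

```python
import random


def filter_and_split(samples, val_ratio=0.1, seed=0):
    """Drop samples without a solution, then split into (train, valid)."""
    # Keep only samples whose solution is non-empty after stripping.
    kept = [s for s in samples if str(s.get("solution", "")).strip()]

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = kept[:]
    random.Random(seed).shuffle(shuffled)

    # Assumption: the validation size is rounded down.
    n_valid = int(len(shuffled) * val_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]
```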

### 5. Training Scripts

#### Multi-GPU Training
**Path**: `scripts/run_math_vision_multi.sh`
- 8 GPUs (configurable via `CUDA_VISIBLE_DEVICES`)
- Uses DeepSpeed ZeRO-2 optimization
- GRPO training mode by default

#### Single-GPU Training
**Path**: `scripts/run_math_vision.sh`
- 1 GPU
- Simpler setup for development/testing

### 6. Documentation
**Path**: `data/math_vision/README.md`
- Complete usage guide
- Configuration explanations
- Troubleshooting tips
- Comparison with MM Math

### 7. Registry Updates
**Path**: `larm/data/builders/__init__.py`
- Added import for `math_vision_builder`

## Quick Start Guide

### Step 1: Preprocess the Data

```bash
cd /root/CVPR/MemGen

# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```

Expected output:
```
[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json
```

### Step 2: Train the Model

#### Option A: Multi-GPU (Recommended for production)
```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh
```

#### Option B: Single-GPU (For testing)
```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh
```

### Step 3: Monitor Training

Check logs:
```bash
tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt
```

Check outputs:
```bash
ls -la /root/CVPR/MemGen/test_output/math_vision/
```

## Architecture Overview

```
Data Flow:
┌─────────────────────┐
│  Raw Math Vision    │
│  train.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ math_vision_process │ ← Preprocessing script
│      .py            │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Preprocessed Data  │
│  train.json         │
│  valid.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionBuilder   │ ← Dataloader
│                     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Training Loop     │
│   (SFT or GRPO)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  MathVisionEnv      │ ← Reward computation
│                     │
└─────────────────────┘
```

## Key Design Decisions

### 1. Answer Format Flexibility
The environment supports multiple answer formats to handle:
- Multiple choice: Extract letter (A-E) from completion
- Free-form: Extract from `\boxed{}` or `<answer>` tags
- Robust parsing with fallbacks

### 2. Completion vs Solution
- **completion**: Full solution text (used in SFT mode)
- **solution**: Just the answer in `\boxed{}` format (used for reward)
- In GRPO mode, completion is cleared to force generation

### 3. Image Path Handling
- Original data has absolute paths → preserved as-is
- No need to reconstruct paths from relative components
- Dataloader just passes through the `image_path` field

### 4. Multiple Choice Formatting
- Options are formatted as "A. option1\nB. option2\n..." in the prompt
- Answer extraction recognizes single-letter answers
- Normalization handles case differences
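A minimal sketch of the option formatting described above, assuming the options arrive as a list of strings and are appended to the question on separate lines. The function name `format_prompt` is hypothetical, and the real prompt also includes instructions that are omitted here.

```python
import string


def format_prompt(question, options=None):
    """Append "A. ...", "B. ..." lines to the question when options exist."""
    if not options:
        # Free-form question: nothing to append.
        return question
    lines = [
        f"{letter}. {option}"
        for letter, option in zip(string.ascii_uppercase, options)
    ]
    return question + "\n" + "\n".join(lines)


print(format_prompt("What is 2 + 2?", ["3", "4", "5"]))
```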

## Configuration Tips

### For Quick Testing
```yaml
# In configs/latent_memory/math_vision.yaml
datasets:
  math_vision:
    sft:
      max_samples: 100  # Add this line to limit dataset size

run:
  weaver:
    sft:
      max_epochs: 1
      batch_size: 2
```

### For Production Training
```yaml
run:
  weaver:
    grpo:
      max_epochs: 3
      batch_size: 2
      grad_accum_steps: 4  # Effective batch size = 2 * 4 * 8 GPUs = 64
      num_generations: 16
```

### For Memory-Constrained GPUs
```yaml
run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 8
  generation:
    max_response_length: 512  # Reduce from 1024
```
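The effective batch size arithmetic from the production config can be checked with a few lines of Python. The helper name is hypothetical, and the second call assumes the memory-constrained config also runs on 8 GPUs, which the config itself does not state.

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


production = effective_batch_size(batch_size=2, grad_accum_steps=4, num_gpus=8)
# Assumption: the memory-constrained run also uses 8 GPUs.
constrained = effective_batch_size(batch_size=1, grad_accum_steps=8, num_gpus=8)
print(production, constrained)  # 64 64
```

Under that assumption, halving `batch_size` while doubling `grad_accum_steps` keeps the effective batch size unchanged while lowering per-step memory use.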

## Comparison: MM Math vs Math Vision

### Similarities (Design Pattern)
- ✅ Same builder interface (BaseDatasetBuilder)
- ✅ Same environment interface (StaticEnv)
- ✅ Same preprocessing pipeline
- ✅ Same field structure (prompt, completion, solution, image_path)
- ✅ Same training scripts structure
- ✅ Same config file structure

### Differences (Data Characteristics)

| Aspect | MM Math | Math Vision |
|--------|---------|-------------|
| **Language** | Chinese | English |
| **Questions** | Mixed math problems | Competition math (AMC, etc.) |
| **Answer Type** | Mostly numerical | Multiple choice + free-form |
| **Solutions** | Always provided | Sometimes missing |
| **Images** | Geometry diagrams | Various (graphs, diagrams, etc.) |
| **Levels** | Not categorized | Levels 1-5 |
| **Subjects** | General | Categorized (algebra, geometry, etc.) |
| **Dataset Size** | ~30k train | ~28k train after split |

### Environment Differences
The key difference in `MathVisionEnv` vs `MMMathEnv`:
- Added multiple choice letter extraction
- Added subject/level awareness (for future filtering)
- More flexible answer normalization

## Testing Checklist

Before starting full training, verify:

- [ ] Preprocessing runs without errors
- [ ] `train.json`, `valid.json`, `test.json` are created
- [ ] Check a few preprocessed examples manually
- [ ] Verify image paths are valid
- [ ] Test loading with dataloader (run with max_samples=10)
- [ ] Test single batch of training
- [ ] Check reward computation works
- [ ] Monitor GPU memory usage
- [ ] Verify checkpoints are saved
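The image path check can be scripted. The sketch below assumes each preprocessed file is a single JSON array of records with an `image_path` field; the function name `find_missing_images` is hypothetical.

```python
import json
import os


def find_missing_images(json_path, limit=None):
    """Return the image paths in json_path that do not exist on disk."""
    with open(json_path, encoding="utf-8") as handle:
        records = json.load(handle)  # assumption: a JSON array of records
    if limit is not None:
        records = records[:limit]

    missing = []
    for record in records:
        path = record.get("image_path") or ""
        if not os.path.isfile(path):
            missing.append(path)
    return missing


# Example usage (path taken from the preprocessing output above):
# missing = find_missing_images("data/math_vision/train.json", limit=100)
# print(f"{len(missing)} missing images")
```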

## Troubleshooting

### Issue: "No preprocessed data found"
**Solution**: Run preprocessing script first
```bash
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```

### Issue: "Missing required fields"
**Solution**: Re-run preprocessing and confirm that all four required fields (`prompt`, `completion`, `solution`, `image_path`) are present in every record

### Issue: "Image file not found"
**Solution**: Check that images exist at the paths in the JSON files
```bash
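# Print the first 20 image paths (truncating the file with head before jq would produce invalid JSON)
jq -r '.[].image_path' data/math_vision/train.json | head -20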
```

### Issue: Low accuracy (all rewards = 0)
**Solution**: Check answer extraction
```python
# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv
completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores)  # Should be [1.0]
```

### Issue: Out of memory
**Solution**: Reduce batch size and response length
```yaml
batch_size: 1
grad_accum_steps: 16
max_response_length: 384
```

## Next Steps

After successful training:

1. **Evaluate**: Use the test set to measure performance
2. **Tune**: Adjust hyperparameters based on results
3. **Experiment**: Try different latent lengths, models, or methods
4. **Analyze**: Look at generated solutions to identify issues
5. **Iterate**: Refine prompts or reward functions as needed

## Files Summary

**Created**:
- ✅ `larm/data/builders/math_vision_builder.py` (137 lines)
- ✅ `larm/data/envs/math_vision_env.py` (115 lines)
- ✅ `configs/latent_memory/math_vision.yaml` (166 lines)
- ✅ `scripts/math_vision_process.py` (326 lines)
- ✅ `scripts/run_math_vision_multi.sh` (44 lines)
- ✅ `scripts/run_math_vision.sh` (32 lines)
- ✅ `data/math_vision/README.md` (comprehensive guide)
- ✅ `MATH_VISION_SETUP.md` (this file)

**Modified**:
- ✅ `larm/data/builders/__init__.py` (added import)

**Total**: 9 files created/modified

All files follow the MM Math pattern and plug into the existing MemGen builder and environment registries.