# Math Vision Dataset Integration Summary
This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.
## Created Files
### 1. Dataloader (Builder)
**Path**: `larm/data/builders/math_vision_builder.py`
- **Class**: `MathVisionBuilder`
- **Registry**: Registered as `"math_vision"` in the builder registry
- **Features**:
- Loads preprocessed data from `data/math_vision/` directory
- Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
- Validates required fields: `prompt`, `completion`, `solution`, `image_path`
- Handles max_samples filtering for quick testing
- For GRPO mode, clears completion field to enable generation from scratch
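The validation, `max_samples` filtering, and GRPO completion-clearing steps above can be sketched as follows. This is an illustrative sketch, not the actual `MathVisionBuilder` code; the function name `prepare_samples` is hypothetical.

```python
def prepare_samples(samples, mode, max_samples=None):
    """Sketch of the builder's per-sample handling described above."""
    required = ("prompt", "completion", "solution", "image_path")
    out = []
    for sample in samples:
        # Validate that every required field is present
        missing = [k for k in required if k not in sample]
        if missing:
            raise ValueError(f"sample missing fields: {missing}")
        sample = dict(sample)
        # In GRPO mode, clear the completion so the policy generates from scratch
        if mode == "grpo":
            sample["completion"] = ""
        out.append(sample)
        # max_samples supports quick smoke tests on a small subset
        if max_samples is not None and len(out) >= max_samples:
            break
    return out
```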
### 2. Environment
**Path**: `larm/data/envs/math_vision_env.py`
- **Class**: `MathVisionEnv`
- **Registry**: Registered as `"math_vision"` in the environment registry
- **Features**:
- Computes accuracy rewards for GRPO training
- Supports multiple answer extraction formats:
- LaTeX `\boxed{}` notation
- `<answer>` XML tags
- Multiple choice letters (A-E)
- Direct numerical/text answers
- Normalizes answers for comparison (handles whitespace, case)
- Optional LLM-based verification (set `USE_LLM_VERIFIER=true`)
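The extraction and normalization behavior above can be sketched like this. The function names (`extract_answer`, `normalize`, `accuracy_reward`) are illustrative, not the actual `MathVisionEnv` API, and the real implementation may differ in details:

```python
import re

def extract_answer(completion: str) -> str:
    """Try the supported answer formats, from most to least specific."""
    # 1. LaTeX \boxed{...}
    m = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if m:
        return m.group(1)
    # 2. <answer>...</answer> XML tags
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m:
        return m.group(1)
    # 3. A bare multiple-choice letter A-E at the end of the text
    m = re.search(r"\b([A-E])\b\s*\.?\s*$", completion.strip())
    if m:
        return m.group(1)
    # 4. Fall back to the whole completion (direct numerical/text answer)
    return completion

def normalize(answer: str) -> str:
    # Collapse whitespace and lowercase for comparison
    return re.sub(r"\s+", " ", answer).strip().lower()

def accuracy_reward(completion: str, solution: str) -> float:
    match = normalize(extract_answer(completion)) == normalize(extract_answer(solution))
    return 1.0 if match else 0.0
```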
### 3. Configuration
**Path**: `configs/latent_memory/math_vision.yaml`
- **Model Config**:
- Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
- Weaver: Qwen2.5-1.5B-Instruct (memory processor)
- Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)
- **Dataset Config**:
- Mode: `sft` or `grpo`
- Train/valid/test split ratios
- Image root directory
- **Training Config**:
- SFT settings: batch_size=4, max_epochs=2, lr=1e-5
- GRPO settings: batch_size=1, num_generations=8, lr=1e-5
- Generation config: max_response_length=1024, temperature=1.0
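Taken together, these settings correspond to a layout roughly like the sketch below. The key names and nesting are illustrative; the actual `math_vision.yaml` may arrange them differently:

```yaml
# Sketch only -- not a verbatim copy of configs/latent_memory/math_vision.yaml
model:
  reasoner: Qwen2.5-VL-7B-Instruct
  weaver: Qwen2.5-1.5B-Instruct
  trigger: Qwen2.5-0.5B-Instruct   # optional, unused in the default config

datasets:
  math_vision:
    mode: sft             # or: grpo
    image_root: data/math_vision
    val_ratio: 0.1

run:
  weaver:
    sft:
      batch_size: 4
      max_epochs: 2
      lr: 1e-5
    grpo:
      batch_size: 1
      num_generations: 8
      lr: 1e-5
  generation:
    max_response_length: 1024
    temperature: 1.0
```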
### 4. Data Processing Script
**Path**: `scripts/math_vision_process.py`
- **Purpose**: Converts raw Math Vision data to training format
- **Input**: Existing `data/math_vision/train.json` and `test.json`
- **Output**: Preprocessed `train.json`, `valid.json`, `test.json` with fields:
- `prompt`: Formatted question with instructions and options
- `completion`: Solution text or generated answer
- `solution`: Ground truth in `\boxed{}` format
- `image_path`: Absolute path to image file
- **Features**:
- Handles multiple choice questions (adds formatted options)
- Handles free-form questions
- Creates validation split from training data
- Filters samples with empty solutions
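The empty-solution filtering and train/valid split can be sketched as below. This mirrors the behavior described above; the function name, default seed, and ratio are assumptions, not the actual `math_vision_process.py` code:

```python
import random

def split_train_valid(samples, val_ratio=0.1, seed=42):
    """Filter out empty solutions, then carve a validation split from train."""
    # Drop samples whose solution is empty before splitting
    samples = [s for s in samples if s.get("solution")]
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_valid = int(len(samples) * val_ratio)
    valid = [samples[i] for i in indices[:n_valid]]
    train = [samples[i] for i in indices[n_valid:]]
    return train, valid
```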
### 5. Training Scripts
#### Multi-GPU Training
**Path**: `scripts/run_math_vision_multi.sh`
- 8 GPUs (configurable via `CUDA_VISIBLE_DEVICES`)
- Uses DeepSpeed ZeRO-2 optimization
- GRPO training mode by default
#### Single-GPU Training
**Path**: `scripts/run_math_vision.sh`
- 1 GPU
- Simpler setup for development/testing
### 6. Documentation
**Path**: `data/math_vision/README.md`
- Complete usage guide
- Configuration explanations
- Troubleshooting tips
- Comparison with MM Math
### 7. Registry Updates
**Path**: `larm/data/builders/__init__.py`
- Added import for `math_vision_builder`
## Quick Start Guide
### Step 1: Preprocess the Data
```bash
cd /root/CVPR/MemGen
# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```
Expected output:
```
[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json
```
### Step 2: Train the Model
#### Option A: Multi-GPU (Recommended for production)
```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh
```
#### Option B: Single-GPU (For testing)
```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh
```
### Step 3: Monitor Training
Check logs:
```bash
tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt
```
Check outputs:
```bash
ls -la /root/CVPR/MemGen/test_output/math_vision/
```
## Architecture Overview
```
Data Flow:
┌────────────────────────┐
│  Raw Math Vision       │
│  train.json            │
│  test.json             │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│ math_vision_process.py │ ← Preprocessing script
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│  Preprocessed Data     │
│  train.json            │
│  valid.json            │
│  test.json             │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│  MathVisionBuilder     │ ← Dataloader
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│  Training Loop         │
│  (SFT or GRPO)         │
└───────────┬────────────┘
            │
            ▼
┌────────────────────────┐
│  MathVisionEnv         │ ← Reward computation
└────────────────────────┘
```
## Key Design Decisions
### 1. Answer Format Flexibility
The environment supports multiple answer formats to handle:
- Multiple choice: Extract letter (A-E) from completion
- Free-form: Extract from `\boxed{}` or `<answer>` tags
- Robust parsing with fallbacks
### 2. Completion vs Solution
- **completion**: Full solution text (used in SFT mode)
- **solution**: Just the answer in `\boxed{}` format (used for reward)
- In GRPO mode, completion is cleared to force generation
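Concretely, a preprocessed sample might look like the following. All values here are illustrative, not taken from the dataset:

```json
{
  "prompt": "Solve the problem shown in the image.\nA. 12\nB. 15\nC. 18\nD. 21\nE. 24",
  "completion": "From the diagram, the base is 5 and the height is 6, so the area is 15. The answer is B.",
  "solution": "\\boxed{B}",
  "image_path": "/root/CVPR/MemGen/data/math_vision/images/example.jpg"
}
```

In SFT mode the model is trained on `completion`; in GRPO mode that field is emptied and only `solution` is used, via the reward function.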
### 3. Image Path Handling
- Original data uses absolute image paths, which are preserved as-is
- There is no need to reconstruct paths from relative components
- The dataloader simply passes the `image_path` field through
### 4. Multiple Choice Formatting
- Options are formatted as "A. option1\nB. option2\n..." in the prompt
- Answer extraction recognizes single-letter answers
- Normalization handles case differences
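The option-formatting convention above can be sketched in a few lines. `format_options` is a hypothetical helper illustrating the behavior, not the actual preprocessing code:

```python
def format_options(options):
    """Render choices as 'A. ...\\nB. ...' for inclusion in the prompt."""
    letters = "ABCDE"
    return "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
```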
## Configuration Tips
### For Quick Testing
```yaml
# In configs/latent_memory/math_vision.yaml
datasets:
math_vision:
sft:
max_samples: 100 # Add this line to limit dataset size
run:
weaver:
sft:
max_epochs: 1
batch_size: 2
```
### For Production Training
```yaml
run:
weaver:
grpo:
max_epochs: 3
batch_size: 2
grad_accum_steps: 4 # Effective batch size = 2 * 4 * 8 GPUs = 64
num_generations: 16
```
### For Memory-Constrained GPUs
```yaml
run:
weaver:
grpo:
batch_size: 1
grad_accum_steps: 8
generation:
max_response_length: 512 # Reduce from 1024
```
## Comparison: MM Math vs Math Vision
### Similarities (Design Pattern)
- ✅ Same builder interface (`BaseDatasetBuilder`)
- ✅ Same environment interface (`StaticEnv`)
- ✅ Same preprocessing pipeline
- ✅ Same field structure (`prompt`, `completion`, `solution`, `image_path`)
- ✅ Same training script structure
- ✅ Same config file structure
### Differences (Data Characteristics)
| Aspect | MM Math | Math Vision |
|--------|---------|-------------|
| **Language** | Chinese | English |
| **Questions** | Mixed math problems | Competition math (AMC, etc.) |
| **Answer Type** | Mostly numerical | Multiple choice + free-form |
| **Solutions** | Always provided | Sometimes missing |
| **Images** | Geometry diagrams | Various (graphs, diagrams, etc.) |
| **Levels** | Not categorized | Levels 1-5 |
| **Subjects** | General | Categorized (algebra, geometry, etc.) |
| **Dataset Size** | ~30k train | ~28k train after split |
### Environment Differences
The key difference in `MathVisionEnv` vs `MMMathEnv`:
- Added multiple choice letter extraction
- Added subject/level awareness (for future filtering)
- More flexible answer normalization
## Testing Checklist
Before starting full training, verify:
- [ ] Preprocessing runs without errors
- [ ] train.json, valid.json, test.json are created
- [ ] Check a few preprocessed examples manually
- [ ] Verify image paths are valid
- [ ] Test loading with dataloader (run with max_samples=10)
- [ ] Test single batch of training
- [ ] Check reward computation works
- [ ] Monitor GPU memory usage
- [ ] Verify checkpoints are saved
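Several of the data checks above can be automated with a small script. `validate_samples` is a hypothetical helper written against the field layout described in this document:

```python
import os

REQUIRED = ("prompt", "completion", "solution", "image_path")

def validate_samples(samples, check_images=False):
    """Return (index, problem) pairs for samples failing the basic checks."""
    problems = []
    for i, s in enumerate(samples):
        for key in REQUIRED:
            if key not in s:
                problems.append((i, f"missing field {key!r}"))
        # Preprocessing should have filtered empty solutions already
        if not s.get("solution"):
            problems.append((i, "empty solution"))
        # Optionally confirm the referenced image file exists on disk
        if check_images and not os.path.exists(s.get("image_path", "")):
            problems.append((i, "image file not found"))
    return problems
```

Run it with `check_images=True` against the first few entries of `data/math_vision/train.json` before kicking off a full training run.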
## Troubleshooting
### Issue: "No preprocessed data found"
**Solution**: Run the preprocessing script first:
```bash
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```
### Issue: "Missing required fields"
**Solution**: Re-run preprocessing and confirm all four fields (`prompt`, `completion`, `solution`, `image_path`) are present in every sample
### Issue: "Image file not found"
**Solution**: Check that images exist at the paths in the JSON files
```bash
jq '.[0:20][].image_path' data/math_vision/train.json
```
### Issue: Low accuracy (all rewards = 0)
**Solution**: Check answer extraction
```python
# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv
completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores) # Should be [1.0]
```
### Issue: Out of memory
**Solution**: Reduce batch size and response length
```yaml
batch_size: 1
grad_accum_steps: 16
max_response_length: 384
```
## Next Steps
After successful training:
1. **Evaluate**: Use the test set to measure performance
2. **Tune**: Adjust hyperparameters based on results
3. **Experiment**: Try different latent lengths, models, or methods
4. **Analyze**: Look at generated solutions to identify issues
5. **Iterate**: Refine prompts or reward functions as needed
## Files Summary
**Created**:
- ✅ `larm/data/builders/math_vision_builder.py` (137 lines)
- ✅ `larm/data/envs/math_vision_env.py` (115 lines)
- ✅ `configs/latent_memory/math_vision.yaml` (166 lines)
- ✅ `scripts/math_vision_process.py` (326 lines)
- ✅ `scripts/run_math_vision_multi.sh` (44 lines)
- ✅ `scripts/run_math_vision.sh` (32 lines)
- ✅ `data/math_vision/README.md` (comprehensive guide)
- ✅ `MATH_VISION_SETUP.md` (this file)

**Modified**:
- ✅ `larm/data/builders/__init__.py` (added import)
**Total**: 9 files created/modified
All files follow the MM Math pattern and integrate seamlessly with the existing MemGen codebase.