# Math Vision Dataset Integration Summary

This document summarizes the integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.

## Created Files

### 1. Dataloader (Builder)

**Path**: `larm/data/builders/math_vision_builder.py`

- **Class**: `MathVisionBuilder`
- **Registry**: Registered as `"math_vision"` in the builder registry
- **Features**:
  - Loads preprocessed data from the `data/math_vision/` directory
  - Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
  - Validates the required fields: `prompt`, `completion`, `solution`, `image_path`
  - Handles `max_samples` filtering for quick testing
  - In GRPO mode, clears the `completion` field so the model generates from scratch

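The builder behaviour listed above can be sketched as follows. This is a minimal illustration, not the code in `math_vision_builder.py`: the function name `prepare_samples` and its signature are hypothetical, and only the four field names and the two mode names come from this document.

```python
from typing import Any, Dict, List, Optional

# Field names taken from this document; everything else is illustrative.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_samples(
    samples: List[Dict[str, Any]],
    mode: str = "sft",
    max_samples: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Truncate, validate, and (for GRPO) blank the completion of each sample."""
    if mode not in ("sft", "grpo"):
        raise ValueError(f"unknown mode: {mode!r}")
    if max_samples is not None:
        samples = samples[:max_samples]  # quick-testing subset

    prepared = []
    for index, sample in enumerate(samples):
        missing = [name for name in REQUIRED_FIELDS if name not in sample]
        if missing:
            raise KeyError(f"sample {index} is missing required fields: {missing}")
        item = dict(sample)  # shallow copy so the caller's data is not mutated
        if mode == "grpo":
            item["completion"] = ""  # the policy must generate its own answer
        prepared.append(item)
    return prepared
```

Calling `prepare_samples(records, mode="grpo", max_samples=10)` would return at most ten validated samples with empty completions.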
### 2. Environment

**Path**: `larm/data/envs/math_vision_env.py`

- **Class**: `MathVisionEnv`
- **Registry**: Registered as `"math_vision"` in the environment registry
- **Features**:
  - Computes accuracy rewards for GRPO training
  - Supports multiple answer extraction formats:
    - LaTeX `\boxed{}` notation
    - `<answer>` XML tags
    - Multiple choice letters (A-E)
    - Direct numerical/text answers
  - Normalizes answers for comparison (handles whitespace and case)
  - Optional LLM-based verification (set `USE_LLM_VERIFIER=true`)

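To make the extraction order concrete, here is a hedged sketch of an accuracy reward with the fallbacks listed above. It is not the implementation in `math_vision_env.py`: the helper names, the order of the fallbacks, and the normalization rule (collapse whitespace, lowercase) are assumptions made for illustration.

```python
import re
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def extract_answer(text: str) -> str:
    """Try boxed notation, then <answer> tags, then a standalone letter A-E."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    tagged = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged.group(1)
    # Crude fallback: last standalone capital letter A-E. A sentence that
    # starts with the article "A" would be a false positive.
    letters = re.findall(r"\b([A-E])\b", text)
    if letters:
        return letters[-1]
    return text  # direct numerical/text answer


def normalize(answer: str) -> str:
    """Collapse whitespace and lowercase (assumed normalization)."""
    return " ".join(answer.split()).lower()


def accuracy_reward(completions: List[str], solutions: List[str]) -> List[float]:
    """1.0 where the extracted answers match after normalization, else 0.0."""
    return [
        1.0 if normalize(extract_answer(c)) == normalize(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]
```

Exact string matching after normalization is deliberately strict; equivalent forms such as `0.5` and `\frac{1}{2}` would score 0.0 here, which is the kind of case the optional LLM verifier is meant to cover.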
### 3. Configuration

**Path**: `configs/latent_memory/math_vision.yaml`

- **Model Config**:
  - Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
  - Weaver: Qwen2.5-1.5B-Instruct (memory processor)
  - Trigger: Qwen2.5-0.5B-Instruct (optional, not used in the default config)
- **Dataset Config**:
  - Mode: `sft` or `grpo`
  - Train/valid/test split ratios
  - Image root directory
- **Training Config**:
  - SFT settings: `batch_size=4`, `max_epochs=2`, `lr=1e-5`
  - GRPO settings: `batch_size=1`, `num_generations=8`, `lr=1e-5`
  - Generation config: `max_response_length=1024`, `temperature=1.0`

### 4. Data Processing Script

**Path**: `scripts/math_vision_process.py`

- **Purpose**: Converts raw Math Vision data to the training format
- **Input**: Existing `data/math_vision/train.json` and `test.json`
- **Output**: Preprocessed `train.json`, `valid.json`, and `test.json`, written back to `data/math_vision/`, with these fields:
  - `prompt`: Formatted question with instructions and options
  - `completion`: Solution text or generated answer
  - `solution`: Ground truth in `\boxed{}` format
  - `image_path`: Absolute path to the image file
- **Features**:
  - Handles multiple choice questions (adds formatted options)
  - Handles free-form questions
  - Creates a validation split from the training data
  - Filters out samples with empty solutions

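The filtering and validation split can be illustrated with a short sketch. The function name, the fixed seed, and the rounding rule are assumptions; the real script may shuffle or round differently, so the split sizes it reports need not match this helper exactly.

```python
import random
from typing import Any, Dict, List, Tuple


def filter_and_split(
    records: List[Dict[str, Any]],
    val_ratio: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Drop records with an empty solution, then carve off a validation split."""
    kept = [r for r in records if str(r.get("solution", "")).strip()]
    shuffled = kept[:]  # copy so the input order is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = round(len(shuffled) * val_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]
```

A seeded shuffle keeps the split stable across runs, which matters when `valid.json` is regenerated and compared against earlier experiments.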
### 5. Training Scripts

#### Multi-GPU Training

**Path**: `scripts/run_math_vision_multi.sh`

- 8 GPUs (configurable via `CUDA_VISIBLE_DEVICES`)
- Uses DeepSpeed ZeRO-2 optimization
- GRPO training mode by default

#### Single-GPU Training

**Path**: `scripts/run_math_vision.sh`

- 1 GPU
- Simpler setup for development and testing

### 6. Documentation

**Path**: `data/math_vision/README.md`

- Complete usage guide
- Configuration explanations
- Troubleshooting tips
- Comparison with MM Math

### 7. Registry Updates

**Path**: `larm/data/builders/__init__.py`

- Added import for `math_vision_builder`

## Quick Start Guide

### Step 1: Preprocess the Data

```bash
cd /root/CVPR/MemGen
# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```

Expected output:

```
[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json
```

### Step 2: Train the Model

#### Option A: Multi-GPU (Recommended for production)

```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh
```

#### Option B: Single-GPU (For testing)

```bash
cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh
```

### Step 3: Monitor Training

Check logs:

```bash
tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt
```

Check outputs:

```bash
ls -la /root/CVPR/MemGen/test_output/math_vision/
```

## Architecture Overview

```
Data Flow:
┌─────────────────────┐
│  Raw Math Vision    │
│  train.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ math_vision_process │ ← Preprocessing script
│        .py          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Preprocessed Data  │
│  train.json         │
│  valid.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  MathVisionBuilder  │ ← Dataloader
│                     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Training Loop      │
│  (SFT or GRPO)      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  MathVisionEnv      │ ← Reward computation
│                     │
└─────────────────────┘
```

## Key Design Decisions

### 1. Answer Format Flexibility

The environment supports multiple answer formats:

- Multiple choice: extracts the letter (A-E) from the completion
- Free-form: extracts the answer from `\boxed{}` or `<answer>` tags
- Parsing falls back from one format to the next when a format is not found

### 2. Completion vs Solution

- **completion**: Full solution text (used in SFT mode)
- **solution**: Just the answer in `\boxed{}` format (used for the reward)
- In GRPO mode, `completion` is cleared to force generation

### 3. Image Path Handling

- The original data has absolute paths, which are preserved as-is
- There is no need to reconstruct paths from relative components
- The dataloader passes the `image_path` field through unchanged

### 4. Multiple Choice Formatting

- Options are formatted as `"A. option1\nB. option2\n..."` in the prompt
- Answer extraction recognizes single-letter answers
- Normalization handles case differences

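The option formatting described above can be sketched in a few lines. The function name `format_prompt` is hypothetical, and the sketch omits any instruction text that the real preprocessing script adds to the prompt.

```python
from string import ascii_uppercase
from typing import List, Optional


def format_prompt(question: str, options: Optional[List[str]] = None) -> str:
    """Append lettered options ("A. ...", "B. ...") to a question, if any."""
    question = question.strip()
    if not options:
        return question  # free-form question: nothing to append
    lines = [f"{letter}. {option}" for letter, option in zip(ascii_uppercase, options)]
    return question + "\n" + "\n".join(lines)


if __name__ == "__main__":
    print(format_prompt("Which number is prime?", ["4", "6", "7", "9"]))
```

Keeping the letters in the prompt aligned with the letters accepted by answer extraction is what lets a bare `C` in the completion be scored against a `\boxed{C}` solution.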
## Configuration Tips

### For Quick Testing

```yaml
# In configs/latent_memory/math_vision.yaml
datasets:
  math_vision:
    sft:
      max_samples: 100  # Add this line to limit dataset size
run:
  weaver:
    sft:
      max_epochs: 1
      batch_size: 2
```

### For Production Training

```yaml
run:
  weaver:
    grpo:
      max_epochs: 3
      batch_size: 2
      grad_accum_steps: 4  # Effective batch size = 2 * 4 * 8 GPUs = 64
      num_generations: 16
```

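The effective batch size quoted in the comment above is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The small helper below just makes that arithmetic explicit; it is an illustration, and whether the trainer counts prompts or generated completions per batch is not specified here.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return batch_size * grad_accum_steps * num_gpus


if __name__ == "__main__":
    # Production settings from this document: 2 * 4 * 8 GPUs
    print(effective_batch_size(batch_size=2, grad_accum_steps=4, num_gpus=8))
    # Memory-constrained settings (batch_size 1, grad_accum_steps 8) on 8 GPUs
    print(effective_batch_size(batch_size=1, grad_accum_steps=8, num_gpus=8))
```

Both calls give 64, so halving the per-device batch size while doubling accumulation trades speed for memory without changing the effective batch size.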
### For Memory-Constrained GPUs

```yaml
run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 8
  generation:  # Nesting level reconstructed; keep this key where math_vision.yaml defines it
    max_response_length: 512  # Reduce from 1024
```

## Comparison: MM Math vs Math Vision

### Similarities (Design Pattern)

- ✅ Same builder interface (`BaseDatasetBuilder`)
- ✅ Same environment interface (`StaticEnv`)
- ✅ Same preprocessing pipeline
- ✅ Same field structure (`prompt`, `completion`, `solution`, `image_path`)
- ✅ Same training scripts structure
- ✅ Same config file structure

### Differences (Data Characteristics)

| Aspect | MM Math | Math Vision |
|--------|---------|-------------|
| **Language** | Chinese | English |
| **Questions** | Mixed math problems | Competition math (AMC, etc.) |
| **Answer Type** | Mostly numerical | Multiple choice + free-form |
| **Solutions** | Always provided | Sometimes missing |
| **Images** | Geometry diagrams | Various (graphs, diagrams, etc.) |
| **Levels** | Not categorized | Levels 1-5 |
| **Subjects** | General | Categorized (algebra, geometry, etc.) |
| **Dataset Size** | ~30k train | ~28k train after split |

### Environment Differences

Compared with `MMMathEnv`, `MathVisionEnv` adds:

- Multiple choice letter extraction
- Subject/level awareness (for future filtering)
- More flexible answer normalization

## Testing Checklist

Before starting full training, verify:

- [ ] Preprocessing runs without errors
- [ ] `train.json`, `valid.json`, and `test.json` are created
- [ ] A few preprocessed examples look correct on manual inspection
- [ ] Image paths are valid
- [ ] The dataloader loads the data (run with `max_samples=10`)
- [ ] A single training batch runs
- [ ] Reward computation works
- [ ] GPU memory usage stays within limits
- [ ] Checkpoints are saved

## Troubleshooting

### Issue: "No preprocessed data found"

**Solution**: Run the preprocessing script first.

```bash
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml
```

### Issue: "Missing required fields"

**Solution**: Re-run preprocessing and make sure all four required fields (`prompt`, `completion`, `solution`, `image_path`) are present in every sample.

### Issue: "Image file not found"

**Solution**: Check that images exist at the paths in the JSON files.

```bash
# Print the image paths of the first five records.
# jq needs the whole file (piping `head` into jq would truncate the JSON);
# this assumes train.json is a top-level array.
jq -r '.[:5][] | .image_path' data/math_vision/train.json
```

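To check every path rather than eyeballing a few, a short script along these lines may help. It assumes each split file is a JSON array of objects with an `image_path` key, as described in this document; the function name is hypothetical.

```python
import json
import os
from typing import List, Optional


def find_missing_images(json_path: str, limit: Optional[int] = None) -> List[str]:
    """Return the image_path values that do not point to an existing file."""
    with open(json_path, "r", encoding="utf-8") as handle:
        records = json.load(handle)  # assumed: a list of dicts
    if limit is not None:
        records = records[:limit]
    missing = []
    for record in records:
        image_path = record.get("image_path", "")
        if not os.path.isfile(image_path):
            missing.append(image_path)  # "" means the field itself was absent
    return missing
```

For example, `find_missing_images("data/math_vision/train.json")` returns an empty list when every referenced image exists.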
### Issue: Low accuracy (all rewards = 0)

**Solution**: Check answer extraction.

```python
# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv

completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores)  # Should be [1.0]
```

### Issue: Out of memory

**Solution**: Reduce the batch size and response length.

```yaml
# Fragment: set each key where it already appears in math_vision.yaml
batch_size: 1
grad_accum_steps: 16
max_response_length: 384
```

## Next Steps

After successful training:

1. **Evaluate**: Use the test set to measure performance
2. **Tune**: Adjust hyperparameters based on results
3. **Experiment**: Try different latent lengths, models, or methods
4. **Analyze**: Look at generated solutions to identify issues
5. **Iterate**: Refine prompts or reward functions as needed

## Files Summary

**Created**:

- ✅ `larm/data/builders/math_vision_builder.py` (137 lines)
- ✅ `larm/data/envs/math_vision_env.py` (115 lines)
- ✅ `configs/latent_memory/math_vision.yaml` (166 lines)
- ✅ `scripts/math_vision_process.py` (326 lines)
- ✅ `scripts/run_math_vision_multi.sh` (44 lines)
- ✅ `scripts/run_math_vision.sh` (32 lines)
- ✅ `data/math_vision/README.md` (comprehensive guide)
- ✅ `MATH_VISION_SETUP.md` (this file)

**Modified**:

- ✅ `larm/data/builders/__init__.py` (added import)

**Total**: 9 files created/modified

All files follow the MM Math pattern and plug into the existing MemGen builder and environment registries.