
Math Vision Dataset Integration Summary

This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.

Created Files

1. Dataloader (Builder)

Path: larm/data/builders/math_vision_builder.py

  • Class: MathVisionBuilder
  • Registry: Registered as "math_vision" in the builder registry
  • Features:
    • Loads preprocessed data from data/math_vision/ directory
    • Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
    • Validates required fields: prompt, completion, solution, image_path
    • Handles max_samples filtering for quick testing
    • For GRPO mode, clears completion field to enable generation from scratch
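The validation and GRPO behaviour listed above can be sketched roughly as follows. This is an illustration only, not the code in math_vision_builder.py: the function name `prepare_records` and its signature are hypothetical, and only the four field names, `max_samples`, and the sft/grpo modes are taken from this document.

```python
from typing import Dict, List, Optional

# Field names as listed in this document.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_records(
    records: List[Dict[str, str]],
    mode: str = "sft",
    max_samples: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Hypothetical helper: validate fields, clear completions for GRPO, truncate."""
    if mode not in ("sft", "grpo"):
        raise ValueError(f"unknown mode: {mode!r}")

    prepared = []
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise KeyError(f"record {index} is missing required fields: {missing}")
        item = dict(record)  # copy so the caller's data is not mutated
        if mode == "grpo":
            # GRPO generates the completion itself, so the stored one is dropped.
            item["completion"] = ""
        prepared.append(item)

    # Optional truncation for quick tests.
    if max_samples is not None:
        prepared = prepared[:max_samples]
    return prepared
```

Calling `prepare_records(records, mode="grpo", max_samples=10)` would return at most ten records with empty `completion` fields, while `mode="sft"` leaves completions untouched.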

2. Environment

Path: larm/data/envs/math_vision_env.py

  • Class: MathVisionEnv
  • Registry: Registered as "math_vision" in the environment registry
  • Features:
    • Computes accuracy rewards for GRPO training
    • Supports multiple answer extraction formats:
      • LaTeX \boxed{} notation
      • <answer> XML tags
      • Multiple choice letters (A-E)
      • Direct numerical/text answers
    • Normalizes answers for comparison (handles whitespace, case)
    • Optional LLM-based verification (set USE_LLM_VERIFIER=true)
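As a rough sketch of how extraction with fallbacks and normalization could fit together, the standalone functions below try \boxed{} first, then <answer> tags, then a multiple choice letter, and finally fall back to the raw text. The function names, the fallback order, and the regular expressions are assumptions made for illustration; they are not taken from math_vision_env.py, and the optional LLM verifier is not covered.

```python
import re
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def extract_answer(text: str) -> str:
    """Try each supported format in turn (assumed order), then fall back to the text."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    tagged = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged.group(1)
    # An upper-case choice letter after the word "answer", e.g. "the answer is (C)".
    letter = re.search(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-E])\)?(?![A-Za-z])", text)
    if letter:
        return letter.group(1)
    return text


def normalize(answer: str) -> str:
    """Drop all whitespace and ignore case."""
    return "".join(answer.split()).lower()


def accuracy_reward(completions: List[str], solutions: List[str]) -> List[float]:
    """Score 1.0 when the normalized extracted answers match, else 0.0."""
    return [
        1.0 if normalize(extract_answer(c)) == normalize(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]
```

Under these assumptions, `accuracy_reward(["So the answer is (C)."], ["\\boxed{c}"])` gives `[1.0]`, because the letter is extracted and case is ignored during comparison.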

3. Configuration

Path: configs/latent_memory/math_vision.yaml

  • Model Config:

    • Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
    • Weaver: Qwen2.5-1.5B-Instruct (memory processor)
    • Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)
  • Dataset Config:

    • Mode: sft or grpo
    • Train/valid/test split ratios
    • Image root directory
  • Training Config:

    • SFT settings: batch_size=4, max_epochs=2, lr=1e-5
    • GRPO settings: batch_size=1, num_generations=8, lr=1e-5
    • Generation config: max_response_length=1024, temperature=1.0

4. Data Processing Script

Path: scripts/math_vision_process.py

  • Purpose: Converts raw Math Vision data to training format

  • Input: Existing data/math_vision/train.json and test.json

  • Output: Preprocessed train.json, valid.json, test.json with fields:

    • prompt: Formatted question with instructions and options
    • completion: Solution text or generated answer
    • solution: Ground truth in \boxed{} format
    • image_path: Absolute path to image file
  • Features:

    • Handles multiple choice questions (adds formatted options)
    • Handles free-form questions
    • Creates validation split from training data
    • Filters samples with empty solutions
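The filtering and validation split described above might look like the sketch below. The function name `filter_and_split`, the seeded shuffle, and the choice to round the validation size up are assumptions, not details confirmed by math_vision_process.py. Rounding up happens to reproduce the 28477/3165 sizes shown in the expected output if 31642 samples remain after filtering.

```python
import math
import random
from typing import Dict, List, Tuple


def filter_and_split(
    samples: List[Dict[str, str]],
    val_ratio: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Hypothetical helper: drop empty solutions, then carve out a validation set."""
    # Keep only samples whose solution is a non-blank string.
    kept = [s for s in samples if str(s.get("solution", "")).strip()]

    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = list(kept)
    random.Random(seed).shuffle(shuffled)

    # Validation size is rounded up (an assumption, see the lead-in).
    n_valid = math.ceil(len(shuffled) * val_ratio)
    valid = shuffled[:n_valid]
    train = shuffled[n_valid:]
    return train, valid
```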

5. Training Scripts

Multi-GPU Training

Path: scripts/run_math_vision_multi.sh

  • 8 GPUs (configurable via CUDA_VISIBLE_DEVICES)
  • Uses DeepSpeed ZeRO-2 optimization
  • GRPO training mode by default

Single-GPU Training

Path: scripts/run_math_vision.sh

  • 1 GPU
  • Simpler setup for development/testing

6. Documentation

Path: data/math_vision/README.md

  • Complete usage guide
  • Configuration explanations
  • Troubleshooting tips
  • Comparison with MM Math

7. Registry Updates

Path: larm/data/builders/__init__.py

  • Added import for math_vision_builder

Quick Start Guide

Step 1: Preprocess the Data

cd /root/CVPR/MemGen

# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml

Expected output:

[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json

Step 2: Train the Model

Option A: Multi-GPU (Recommended for production)

cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh

Option B: Single-GPU (For testing)

cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh

Step 3: Monitor Training

Check logs:

tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt

Check outputs:

ls -la /root/CVPR/MemGen/test_output/math_vision/

Architecture Overview

Data Flow:
┌─────────────────────┐
│  Raw Math Vision    │
│  train.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ math_vision_process │ ← Preprocessing script
│      .py            │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Preprocessed Data  │
│  train.json         │
│  valid.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionBuilder   │ ← Dataloader
│                     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Training Loop     │
│   (SFT or GRPO)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  MathVisionEnv      │ ← Reward computation
│                     │
└─────────────────────┘

Key Design Decisions

1. Answer Format Flexibility

The environment supports multiple answer formats to handle:

  • Multiple choice: Extract letter (A-E) from completion
  • Free-form: Extract from \boxed{} or <answer> tags
  • Robust parsing with fallbacks

2. Completion vs Solution

  • completion: Full solution text (used in SFT mode)
  • solution: Just the answer in \boxed{} format (used for reward)
  • In GRPO mode, completion is cleared to force generation

3. Image Path Handling

  • Original data has absolute paths → preserved as-is
  • No need to reconstruct paths from relative components
  • Dataloader just passes through the image_path field

4. Multiple Choice Formatting

  • Options are formatted as "A. option1\nB. option2\n..." in the prompt
  • Answer extraction recognizes single-letter answers
  • Normalization handles case differences
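A minimal sketch of the option formatting, assuming options arrive as a list of strings. The function name `format_prompt` is hypothetical, and the instruction text that the real preprocessing script adds to the prompt is not reproduced here because this document does not state it.

```python
from string import ascii_uppercase
from typing import List, Optional


def format_prompt(question: str, options: Optional[List[str]] = None) -> str:
    """Append lettered options ("A. ...", "B. ...") to a question, if any."""
    question = question.strip()
    if not options:
        # Free-form question: nothing to append.
        return question
    lines = [f"{letter}. {option}" for letter, option in zip(ascii_uppercase, options)]
    return question + "\n" + "\n".join(lines)
```

For example, `format_prompt("What is 1+1?", ["1", "2", "3"])` returns the question followed by the lines `A. 1`, `B. 2`, and `C. 3`.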

Configuration Tips

For Quick Testing

# In configs/latent_memory/math_vision.yaml
datasets:
  math_vision:
    sft:
      max_samples: 100  # Add this line to limit dataset size

run:
  weaver:
    sft:
      max_epochs: 1
      batch_size: 2

For Production Training

run:
  weaver:
    grpo:
      max_epochs: 3
      batch_size: 2
      grad_accum_steps: 4  # Effective batch size = 2 * 4 * 8 GPUs = 64
      num_generations: 16
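
The effective batch size in the comment above is the product of the per-GPU batch size, the gradient accumulation steps, and the GPU count. A small helper (hypothetical, for checking settings by hand) makes the arithmetic explicit:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


# Production settings from this section: 2 per GPU, 4 accumulation steps, 8 GPUs.
print(effective_batch_size(2, 4, 8))  # 64
```

With `batch_size: 1` and `grad_accum_steps: 8` on the same 8 GPUs, the result is also 64, so lowering the per-GPU batch size does not have to change the effective batch size.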

For Memory-Constrained GPUs

run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 8
  generation:
    max_response_length: 512  # Reduce from 1024

Comparison: MM Math vs Math Vision

Similarities (Design Pattern)

  • ✅ Same builder interface (BaseDatasetBuilder)
  • ✅ Same environment interface (StaticEnv)
  • ✅ Same preprocessing pipeline
  • ✅ Same field structure (prompt, completion, solution, image_path)
  • ✅ Same training scripts structure
  • ✅ Same config file structure

Differences (Data Characteristics)

| Aspect | MM Math | Math Vision |
| --- | --- | --- |
| Language | Chinese | English |
| Questions | Mixed math problems | Competition math (AMC, etc.) |
| Answer Type | Mostly numerical | Multiple choice + free-form |
| Solutions | Always provided | Sometimes missing |
| Images | Geometry diagrams | Various (graphs, diagrams, etc.) |
| Levels | Not categorized | Levels 1-5 |
| Subjects | General | Categorized (algebra, geometry, etc.) |
| Dataset Size | ~30k train | ~28k train after split |

Environment Differences

The key differences between MathVisionEnv and MMMathEnv:

  • Added multiple choice letter extraction
  • Added subject/level awareness (for future filtering)
  • More flexible answer normalization

Testing Checklist

Before starting full training, verify:

  • Preprocessing runs without errors
  • train.json, valid.json, test.json are created
  • Check a few preprocessed examples manually
  • Verify image paths are valid
  • Test loading with dataloader (run with max_samples=10)
  • Test single batch of training
  • Check reward computation works
  • Monitor GPU memory usage
  • Verify checkpoints are saved
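
Two of the checks above (required fields and image paths) can be automated with a short script. The sketch below assumes each split file is a JSON array of objects, and the function name `check_split` is hypothetical.

```python
import json
import os
from typing import Dict, List

# Field names as listed in this document.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def check_split(json_path: str, limit: int = 10) -> Dict[str, List[int]]:
    """Inspect the first `limit` samples and report the indices with problems."""
    with open(json_path, "r", encoding="utf-8") as handle:
        samples = json.load(handle)  # assumed to be a list of dicts

    problems: Dict[str, List[int]] = {"missing_fields": [], "missing_images": []}
    for index, sample in enumerate(samples[:limit]):
        if any(name not in sample for name in REQUIRED_FIELDS):
            problems["missing_fields"].append(index)
        elif not os.path.isfile(sample["image_path"]):
            problems["missing_images"].append(index)
    return problems
```

Running `check_split("data/math_vision/train.json")` after preprocessing should return two empty lists if the first ten samples are well-formed and their images exist.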

Troubleshooting

Issue: "No preprocessed data found"

Solution: Run preprocessing script first

uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml

Issue: "Missing required fields"

Solution: Re-run preprocessing, ensure all 4 fields are present

Issue: "Image file not found"

Solution: Check that images exist at the paths in the JSON files

# Print the first 20 image paths (assumes the file is a JSON array)
jq -r '.[:20] | .[] | .image_path' data/math_vision/train.json

Issue: Low accuracy (all rewards = 0)

Solution: Check answer extraction

# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv
completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores)  # Should be [1.0]

Issue: Out of memory

Solution: Reduce batch size and response length

batch_size: 1             # under run.weaver.grpo
grad_accum_steps: 16      # under run.weaver.grpo
max_response_length: 384  # under run.generation

Next Steps

After successful training:

  1. Evaluate: Use the test set to measure performance
  2. Tune: Adjust hyperparameters based on results
  3. Experiment: Try different latent lengths, models, or methods
  4. Analyze: Look at generated solutions to identify issues
  5. Iterate: Refine prompts or reward functions as needed

Files Summary

Created:

  • ✅ larm/data/builders/math_vision_builder.py (137 lines)
  • ✅ larm/data/envs/math_vision_env.py (115 lines)
  • ✅ configs/latent_memory/math_vision.yaml (166 lines)
  • ✅ scripts/math_vision_process.py (326 lines)
  • ✅ scripts/run_math_vision_multi.sh (44 lines)
  • ✅ scripts/run_math_vision.sh (32 lines)
  • ✅ data/math_vision/README.md (comprehensive guide)
  • ✅ MATH_VISION_SETUP.md (this file)

Modified:

  • ✅ larm/data/builders/__init__.py (added import)

Total: 9 files created/modified

All files follow the MM Math pattern and plug into the existing MemGen codebase through the same builder and environment registries.