model111 / MATH_VISION_SETUP.md

LCZZZZ

Upload MemGen code and data

e34b94f verified 4 months ago

Math Vision Dataset Integration Summary

This document summarizes the complete integration of the Math Vision dataset into the MemGen training pipeline, following the MM Math pattern.

Created Files

1. Dataloader (Builder)

Path: larm/data/builders/math_vision_builder.py

Class: MathVisionBuilder
Registry: Registered as "math_vision" in the builder registry
Features:
- Loads preprocessed data from data/math_vision/ directory
- Supports both SFT (supervised fine-tuning) and GRPO (reinforcement learning) modes
- Validates required fields: prompt, completion, solution, image_path
- Handles max_samples filtering for quick testing
- For GRPO mode, clears completion field to enable generation from scratch
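The builder behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the actual `MathVisionBuilder` code: the function name `prepare_samples` and its signature are hypothetical, and only the field names and the two modes come from this document.

```python
from typing import Optional

# Field names taken from the feature list above.
REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def prepare_samples(samples: list, mode: str, max_samples: Optional[int] = None) -> list:
    """Validate, truncate, and adapt preprocessed samples (hypothetical helper)."""
    if mode not in ("sft", "grpo"):
        raise ValueError(f"Unknown mode: {mode!r}")

    # max_samples keeps only the first N samples for quick tests.
    if max_samples is not None:
        samples = samples[:max_samples]

    prepared = []
    for index, sample in enumerate(samples):
        missing = [name for name in REQUIRED_FIELDS if name not in sample]
        if missing:
            raise KeyError(f"Sample {index} is missing required fields: {missing}")
        sample = dict(sample)  # copy so the caller's data is not mutated
        if mode == "grpo":
            # In GRPO mode the policy generates its own completion.
            sample["completion"] = ""
        prepared.append(sample)
    return prepared
```

Copying each sample before clearing `completion` keeps the original data usable for SFT in the same process.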

2. Environment

Path: larm/data/envs/math_vision_env.py

Class: MathVisionEnv
Registry: Registered as "math_vision" in the environment registry
Features:
- Computes accuracy rewards for GRPO training
- Supports multiple answer extraction formats:
  - LaTeX \boxed{} notation
  - <answer> XML tags
  - Multiple choice letters (A-E)
  - Direct numerical/text answers
- Normalizes answers for comparison (handles whitespace, case)
- Optional LLM-based verification (set USE_LLM_VERIFIER=true)
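To make the extraction order concrete, here is a minimal sketch of a fallback chain over the formats listed above, followed by normalization and a 0/1 accuracy reward. It is an illustration under assumptions, not the code in `math_vision_env.py`: the function names, the regular expressions, and the order of the fallbacks are all hypothetical.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def extract_answer(text: str) -> str:
    """Try each format in turn (assumed order): boxed, <answer> tag, letter, raw text."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    tagged = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    if tagged:
        return tagged[-1]
    # A capital letter A-E right after "answer", e.g. "The answer is (B)."
    choice = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-E])\)?(?![A-Za-z])", text)
    if choice:
        return choice.group(1)
    return text.strip()  # direct numerical/text answer


def normalize(answer: str) -> str:
    """Drop all whitespace and ignore case before comparing."""
    return "".join(answer.split()).lower()


def accuracy_reward(completions: list, solutions: list) -> list:
    """Score 1.0 when the normalized answers match, else 0.0."""
    return [
        1.0 if normalize(extract_answer(c)) == normalize(extract_answer(s)) else 0.0
        for c, s in zip(completions, solutions)
    ]


print(accuracy_reward(["The answer is \\boxed{A}"], ["\\boxed{A}"]))  # [1.0]
```

Exact string matching after normalization is deliberately strict: equivalent forms such as `0.5` and `\frac{1}{2}` would score 0.0 here, which is one reason an optional LLM-based verifier can be useful.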

3. Configuration

Path: configs/latent_memory/math_vision.yaml

Model Config:
- Reasoner: Qwen2.5-VL-7B-Instruct (vision-language model)
- Weaver: Qwen2.5-1.5B-Instruct (memory processor)
- Trigger: Qwen2.5-0.5B-Instruct (optional, not used in default config)
Dataset Config:
- Mode: sft or grpo
- Train/valid/test split ratios
- Image root directory
Training Config:
- SFT settings: batch_size=4, max_epochs=2, lr=1e-5
- GRPO settings: batch_size=1, num_generations=8, lr=1e-5
- Generation config: max_response_length=1024, temperature=1.0

4. Data Processing Script

Path: scripts/math_vision_process.py

Purpose: Converts raw Math Vision data to training format
Input: Raw data/math_vision/train.json and test.json
Output: Preprocessed train.json, valid.json, test.json, written back to data/math_vision/ (so the raw train.json and test.json are overwritten), with fields:
- prompt: Formatted question with instructions and options
- completion: Solution text or generated answer
- solution: Ground truth in \boxed{} format
- image_path: Absolute path to image file
Features:
- Handles multiple choice questions (adds formatted options)
- Handles free-form questions
- Creates validation split from training data
- Filters samples with empty solutions
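The filtering and validation split can be sketched as below. This is a minimal illustration, assuming the filter runs before the split and that the validation size is rounded up; the function name `filter_and_split`, the `seed` parameter, and the shuffling strategy are hypothetical, and the real script may differ. Rounding up is at least consistent with the split sizes shown in the Quick Start output (31,642 samples at val_ratio=0.1 give 3,165 validation samples).

```python
import math
import random


def filter_and_split(samples: list, val_ratio: float = 0.1, seed: int = 42):
    """Drop samples with empty solutions, then carve out a validation split."""
    # Keep only samples whose solution is non-empty after stripping whitespace.
    kept = [s for s in samples if str(s.get("solution", "")).strip()]

    # Shuffle indices with a fixed seed so the split is reproducible.
    indices = list(range(len(kept)))
    random.Random(seed).shuffle(indices)

    # Assumption: the validation size is rounded up.
    n_valid = math.ceil(len(kept) * val_ratio)
    valid = [kept[i] for i in indices[:n_valid]]
    train = [kept[i] for i in indices[n_valid:]]
    return train, valid
```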

5. Training Scripts

Multi-GPU Training

Path: scripts/run_math_vision_multi.sh

- 8 GPUs (configurable via CUDA_VISIBLE_DEVICES)
- Uses DeepSpeed ZeRO-2 optimization
- GRPO training mode by default

Single-GPU Training

Path: scripts/run_math_vision.sh

1 GPU
Simpler setup for development/testing

6. Documentation

Path: data/math_vision/README.md

Complete usage guide
Configuration explanations
Troubleshooting tips
Comparison with MM Math

7. Registry Updates

Path: larm/data/builders/__init__.py

Added import for math_vision_builder
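The import matters if, as is common in registry-based codebases, registration happens as a side effect of importing the module that defines the class. The sketch below is a generic illustration of that pattern with hypothetical names (`BUILDER_REGISTRY`, `register_builder`, `DemoMathVisionBuilder`); MemGen's actual registry API may look different.

```python
# Hypothetical registry: maps a dataset name to its builder class.
BUILDER_REGISTRY = {}


def register_builder(name: str):
    """Class decorator that records the class under the given name."""
    def decorator(cls):
        if name in BUILDER_REGISTRY:
            raise ValueError(f"Builder {name!r} is already registered")
        BUILDER_REGISTRY[name] = cls
        return cls
    return decorator


# The decorator runs when this module is imported, which is why the
# package __init__.py has to import the builder module at least once.
@register_builder("math_vision")
class DemoMathVisionBuilder:
    """Stand-in for a real builder class."""


# A config that names "math_vision" can now be resolved to the class.
builder_cls = BUILDER_REGISTRY["math_vision"]
print(builder_cls.__name__)  # DemoMathVisionBuilder
```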

Quick Start Guide

Step 1: Preprocess the Data

cd /root/CVPR/MemGen

# Preprocess with config file
uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml

Expected output:

[INFO] Loading dataset from data/math_vision
[INFO] Found train.json at data/math_vision/train.json
[INFO] Found test.json at data/math_vision/test.json
[INFO] Splitting train set with val_ratio=0.1
[INFO] Split sizes - train: 28477, valid: 3165
[INFO] Preprocessing train split with batch_size=512
[INFO] Preprocessing valid split with batch_size=512
[INFO] Preprocessing test split with batch_size=512
[INFO] Saved 28477 samples to data/math_vision/train.json
[INFO] Saved 3165 samples to data/math_vision/valid.json
[INFO] Saved 7954 samples to data/math_vision/test.json

Step 2: Train the Model

Option A: Multi-GPU (Recommended for production)

cd /root/CVPR/MemGen
bash scripts/run_math_vision_multi.sh

Option B: Single-GPU (For testing)

cd /root/CVPR/MemGen
bash scripts/run_math_vision.sh

Step 3: Monitor Training

Check logs:

tail -f /root/CVPR/MemGen/test_output/debug_log_math_vision.txt

Check outputs:

ls -la /root/CVPR/MemGen/test_output/math_vision/

Architecture Overview

Data Flow:
┌─────────────────────┐
│  Raw Math Vision    │
│  train.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ math_vision_process │ ← Preprocessing script
│      .py            │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Preprocessed Data  │
│  train.json         │
│  valid.json         │
│  test.json          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ MathVisionBuilder   │ ← Dataloader
│                     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Training Loop     │
│   (SFT or GRPO)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  MathVisionEnv      │ ← Reward computation
│                     │
└─────────────────────┘

Key Design Decisions

1. Answer Format Flexibility

The environment supports multiple answer formats so that it can score both question types:

- Multiple choice: extracts the letter (A-E) from the completion
- Free-form: extracts the answer from \boxed{} or <answer> tags
- Parsing is robust, with fallbacks when a format is not found

2. Completion vs Solution

- completion: Full solution text (used in SFT mode)
- solution: Just the answer in \boxed{} format (used for reward)
- In GRPO mode, completion is cleared to force generation

3. Image Path Handling

- Original data has absolute paths, which are preserved as-is
- No need to reconstruct paths from relative components
- Dataloader just passes through the image_path field

4. Multiple Choice Formatting

- Options are formatted as "A. option1\nB. option2\n..." in the prompt
- Answer extraction recognizes single-letter answers
- Normalization handles case differences
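The option formatting described above can be illustrated with a short sketch. The function name `format_prompt` is hypothetical, and the instruction text that the real preprocessing script adds to the prompt is not reproduced here because this document does not specify it.

```python
import string


def format_prompt(question: str, options: list) -> str:
    """Append lettered options to a question; free-form questions pass through."""
    if not options:
        return question
    # Pair each option with a letter: A, B, C, ...
    lines = [f"{letter}. {option}" for letter, option in zip(string.ascii_uppercase, options)]
    return question + "\n" + "\n".join(lines)


print(format_prompt("What is 2 + 2?", ["3", "4", "5"]))
# What is 2 + 2?
# A. 3
# B. 4
# C. 5
```

Using the same letters in the prompt and in answer extraction is what lets a single-letter completion be compared directly against the ground truth.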

Configuration Tips

For Quick Testing

# In configs/latent_memory/math_vision.yaml
datasets:
  math_vision:
    sft:
      max_samples: 100  # Add this line to limit dataset size

run:
  weaver:
    sft:
      max_epochs: 1
      batch_size: 2

For Production Training

run:
  weaver:
    grpo:
      max_epochs: 3
      batch_size: 2
      grad_accum_steps: 4  # Effective batch size = 2 * 4 * 8 GPUs = 64
      num_generations: 16
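The effective batch size in the comment above follows from multiplying the per-GPU batch size, the gradient accumulation steps, and the number of GPUs. A small helper (hypothetical, for illustration only) makes it easy to check other settings:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


print(effective_batch_size(2, 4, 8))  # 64, the production example
print(effective_batch_size(1, 8, 8))  # 64, batch_size=1 with grad_accum_steps=8 on 8 GPUs
```

Halving `batch_size` while doubling `grad_accum_steps` keeps the effective batch size unchanged and lowers peak memory per GPU.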

For Memory-Constrained GPUs

run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 8
  generation:
    max_response_length: 512  # Reduce from 1024

Comparison: MM Math vs Math Vision

Similarities (Design Pattern)

✅ Same builder interface (BaseDatasetBuilder)
✅ Same environment interface (StaticEnv)
✅ Same preprocessing pipeline
✅ Same field structure (prompt, completion, solution, image_path)
✅ Same training scripts structure
✅ Same config file structure

Differences (Data Characteristics)

| Aspect | MM Math | Math Vision |
| --- | --- | --- |
| Language | Chinese | English |
| Questions | Mixed math problems | Competition math (AMC, etc.) |
| Answer Type | Mostly numerical | Multiple choice + free-form |
| Solutions | Always provided | Sometimes missing |
| Images | Geometry diagrams | Various (graphs, diagrams, etc.) |
| Levels | Not categorized | Levels 1-5 |
| Subjects | General | Categorized (algebra, geometry, etc.) |
| Dataset Size | ~30k train | ~28k train after split |

Environment Differences

The key differences between MathVisionEnv and MMMathEnv:

- Adds multiple choice letter extraction
- Adds subject/level awareness (for future filtering)
- Uses more flexible answer normalization

Testing Checklist

Before starting full training, verify:

- Preprocessing runs without errors
- train.json, valid.json, test.json are created
- Check a few preprocessed examples manually
- Verify image paths are valid
- Test loading with dataloader (run with max_samples=10)
- Test single batch of training
- Check reward computation works
- Monitor GPU memory usage
- Verify checkpoints are saved
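Two of these checks (required fields and image paths) are easy to automate. The following is a minimal sketch, assuming each preprocessed file is a JSON array of objects; the function name `check_preprocessed` and the `limit` parameter are hypothetical.

```python
import json
import os

REQUIRED_FIELDS = ("prompt", "completion", "solution", "image_path")


def check_preprocessed(json_path: str, limit=None) -> list:
    """Return a list of human-readable problems found in a preprocessed file."""
    with open(json_path, encoding="utf-8") as f:
        samples = json.load(f)  # assumed to be a JSON array of samples

    problems = []
    for index, sample in enumerate(samples[:limit]):
        missing = [name for name in REQUIRED_FIELDS if name not in sample]
        if missing:
            problems.append(f"sample {index}: missing fields {missing}")
            continue
        if not os.path.isfile(sample["image_path"]):
            problems.append(f"sample {index}: image not found at {sample['image_path']}")
    return problems
```

For example, `check_preprocessed("data/math_vision/train.json", limit=100)` inspects the first 100 samples and returns an empty list when nothing is wrong.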

Troubleshooting

Issue: "No preprocessed data found"

Solution: Run preprocessing script first

uv run python scripts/math_vision_process.py --config configs/latent_memory/math_vision.yaml

Issue: "Missing required fields"

Solution: Re-run preprocessing, ensure all 4 fields are present

Issue: "Image file not found"

Solution: Check that images exist at the paths in the JSON files

jq -r '.[:20][].image_path' data/math_vision/train.json

Issue: Low accuracy (all rewards = 0)

Solution: Check answer extraction

# Test answer extraction
from larm.data.envs.math_vision_env import MathVisionEnv
completions = ["The answer is \\boxed{A}"]
solutions = ["\\boxed{A}"]
scores = MathVisionEnv._accuracy_reward(completions, solutions)
print(scores)  # Should be [1.0]

Issue: Out of memory

Solution: Reduce batch size and response length

run:
  weaver:
    grpo:
      batch_size: 1
      grad_accum_steps: 16
  generation:
    max_response_length: 384

Next Steps

After successful training:

- Evaluate: Use the test set to measure performance
- Tune: Adjust hyperparameters based on results
- Experiment: Try different latent lengths, models, or methods
- Analyze: Look at generated solutions to identify issues
- Iterate: Refine prompts or reward functions as needed

Files Summary

Created:

✅ larm/data/builders/math_vision_builder.py (137 lines)
✅ larm/data/envs/math_vision_env.py (115 lines)
✅ configs/latent_memory/math_vision.yaml (166 lines)
✅ scripts/math_vision_process.py (326 lines)
✅ scripts/run_math_vision_multi.sh (44 lines)
✅ scripts/run_math_vision.sh (32 lines)
✅ data/math_vision/README.md (comprehensive guide)
✅ MATH_VISION_SETUP.md (this file)

Modified:

✅ larm/data/builders/__init__.py (added import)

Total: 9 files created/modified

All files follow the MM Math pattern and plug into the existing MemGen codebase through its builder and environment registries.