# LMMS-Eval v0.5: Multimodal Expansion Release
## Introduction
LMMs-Eval v0.5 represents a significant expansion in multimodal evaluation capabilities, introducing comprehensive audio understanding support alongside continued vision and reasoning enhancements.
## Table of Contents
- [Introduction](#introduction)
- [Major Features](#major-features)
- [1. Response Caching System](#1-response-caching-system)
- [2. Audio Evaluation Suite](#2-audio-evaluation-suite)
- [3. New Model Support](#3-new-model-support)
- [4. New Benchmarks](#4-new-benchmarks)
- [5. Model Context Protocol (MCP) Integration](#5-model-context-protocol-mcp-integration)
- [6. Async OpenAI Improvements](#6-async-openai-improvements)
- [Usage Examples](#usage-examples)
- [Technical Details](#technical-details)
- [Migration Guide](#migration-guide)
- [Bug Fixes and Improvements](#bug-fixes-and-improvements)
- [Deprecated Features](#deprecated-features)
- [Contributing](#contributing)
- [Acknowledgments](#acknowledgments)
- [Getting Help](#getting-help)
## Major Features
### 1. Response Caching System
A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:
**Key Features:**
- **Per-document caching**: Cached at `(task_name, doc_id)` level
- **Distributed-safe**: Separate cache files per rank/world size
- **Zero-overhead**: Automatic cache hits with no code changes
- **Multi-backend**: Works with async OpenAI, vLLM, and custom models
**Enable Caching:**
```bash
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root" # optional
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
--tasks mmmu_val \
--batch_size 1 \
--output_path ./logs/
```
**Cache Location:**
- Default: `~/.cache/lmms-eval/eval_cache/<model_hash>/<task_name>_rank<rank>_world_size<world_size>.jsonl`
- Each line: `{"doc_id": <doc_id>, "response": <string>}`
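Because each cache entry is a single JSON object per line, cache files are easy to inspect by hand. A minimal sketch (the `<model_hash>` directory name is a placeholder to substitute with an actual hash directory, and the task filename is illustrative):

```python
import json
from pathlib import Path

# Substitute <model_hash> with an actual directory under eval_cache/
cache_file = (Path.home() / ".cache" / "lmms-eval" / "eval_cache"
              / "<model_hash>" / "mmmu_val_rank0_world_size1.jsonl")
with cache_file.open() as f:
    for line in f:
        entry = json.loads(line)
        print(entry["doc_id"], entry["response"][:80])  # preview each cached response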
**API Integration:**
```python
def generate_until(self, requests):
    # Load existing cache entries from disk
    self.load_cache()
    # Split incoming requests into cache hits and misses
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # placeholder for your model's inference call
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
```
See full documentation in `docs/caching.md`.
### 2. Audio Evaluation Suite
Comprehensive audio understanding capabilities with three major benchmark families:
#### Step2 Audio Paralinguistic (11 tasks)
Fine-grained paralinguistic feature evaluation:
- **Acoustic Features**: pitch, rhythm, speed, voice_tone, voice_styles
- **Speaker Attributes**: age, gender, emotions
- **Environmental**: scene, event, vocalsound
- Semantic Match metrics
```bash
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic \
--batch_size 1
```
#### VoiceBench (9 main categories, 30+ subtasks)
Comprehensive voice and speech evaluation:
- **Instruction Following**: ifeval, alpacaeval, advbench
- **Reasoning**: bbh (Big Bench Hard), commoneval
- **Knowledge**: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- **Q&A**: openbookqa
- **Accent Diversity**: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- **Expressiveness**: wildvoice
- Metrics vary by task type, including accuracy (scored 1-5), failure rate, and LLM-based evaluation
```bash
# Full VoiceBench
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks voicebench \
--batch_size 1
# Specific accent evaluation
python -m lmms_eval \
--tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
--batch_size 1
```
#### WenetSpeech (2 splits)
Large-scale ASR and speech evaluation:
- **dev**: Development set for validation
- **test_meeting**: Meeting domain evaluation
- MER (Mixed Error Rate) metrics
```bash
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks wenet_speech_dev,wenet_speech_test_meeting \
--batch_size 1
```
**Audio Pipeline Features:**
- HuggingFace audio dataset integration
- Unified audio message format
- Multiple metric support (Accuracy, WER, GPT-4 Judge; see the WER sketch below)
- Task grouping for multi-subset benchmarks
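Word Error Rate, one of the metrics listed above, is the word-level edit distance (substitutions + deletions + insertions) divided by the reference length. A self-contained sketch of the standard formula, not necessarily the exact implementation used by the audio tasks:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```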
### 3. New Model Support
Five new model integrations expanding audio and vision capabilities:
| Model | Type | Key Features | Usage Example |
|-------|------|--------------|---------------|
| **GPT-4o Audio Preview** | Audio+Text | Paralinguistic understanding, multi-turn audio | `--model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17` |
| **Gemma-3** | Vision+Text | Enhanced video handling, efficient architecture | `--model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it` |
| **LLaVA-OneVision 1.5** | Vision+Text | Improved vision understanding, latest LLaVA | `--model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b` |
| **LongViLA-R1** | Video+Text | Long-context video, efficient video processing | `--model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B` |
| **Thyme** | Vision+Text | Reasoning-focused, enhanced image handling | `--model thyme --model_args pretrained=thyme-ai/thyme-7b` |
**Example Usage:**
```bash
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench \
--batch_size 1
# LongViLA for video understanding
python -m lmms_eval \
--model longvila \
--model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
--tasks videomme,egoschema \
--batch_size 1
```
### 4. New Benchmarks
Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage into specialized domains:
#### Vision & Reasoning Benchmarks
| Benchmark | Variants | Focus | Metrics |
|-----------|----------|-------|---------|
| **CSBench** | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
| **SciBench** | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
| **MedQA** | 1 | Medical question answering | Accuracy |
| **SuperGPQA** | 1 | Graduate-level science Q&A | Accuracy |
| **Lemonade** | 1 | Video action recognition | Accuracy |
| **CharXiv** | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |
**Example Usage:**
```bash
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1
# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1
# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
```
#### Reproducibility Validation
We validated the new benchmark implementations using two popular language models. The table below compares lmms-eval scores against officially reported results to demonstrate reproducibility:
| Model | Task | lmms-eval | Reported | Δ | Status |
|-------|------|-----------|----------|-------|--------|
| **Qwen-2.5-7B-Instruct** | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| **Llama-3.1-8B** | MedQA | 64.49 | 67.01 | -2.52 | ± |
| | SciBench | 15.35 | 10.78 | +4.57 | ± |
| | CSBench | 62.49 | 57.87 | +4.62 | ± |
| | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |
**Status Legend**: ✓ = strong agreement (|Δ| ≤ 2.5 points) | ± = acceptable variance (2.5 < |Δ| ≤ 5 points)
### 5. Model Context Protocol (MCP) Integration
Support for MCP-enabled models with tool calling:
```bash
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
--tasks mmmu_val \
--batch_size 1
```
**Features:**
- Tool call parsing and execution
- Multi-step reasoning with tools
- Custom MCP server integration
- See `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for templates
### 6. Async OpenAI Improvements
Enhanced async API integration:
- Better rate limit handling
- Configurable retry logic with backoff delays (sketched below)
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints
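The retry logic above follows a standard exponential-backoff pattern. A simplified sketch of the idea (names such as `send_fn`, `max_retries`, and `base_delay` are illustrative, not the client's actual parameters):

```python
import asyncio
import random

async def request_with_retries(send_fn, payload, max_retries=5, base_delay=1.0):
    # send_fn is any coroutine that performs one request attempt
    for attempt in range(max_retries):
        try:
            return await send_fn(payload)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Back off exponentially, with jitter to spread out retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```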
**Common Args Support:**
```bash
# Generation parameters can now be passed directly via --model_args
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
--tasks mmstar
```
## Usage Examples
### Audio Evaluation with Caching
```bash
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench \
--batch_size 8 \
--output_path ./audio_results/ \
--log_samples
# Second run will use cache - much faster!
```
### Multi-Benchmark Evaluation
```bash
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20 \
--tasks voicebench_mmsu,csbench,scibench_math,charxiv \
--batch_size 4 \
--output_path ./multimodal_results/
```
### Distributed Evaluation with Caching
```bash
export LMMS_EVAL_USE_CACHE=True
torchrun --nproc_per_node=8 -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
--tasks step2_audio_paralinguistic,csbench,scibench \
--batch_size 16 \
--output_path ./distributed_results/
```
### Programmatic API with Caching
```python
import os
from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat
# Enable caching
os.environ["LMMS_EVAL_USE_CACHE"] = "True"
model = AsyncOpenAICompatibleChat(
    model_version="gpt-4o-audio-preview-2024-12-17",
    base_url="https://api.openai.com/v1"
)
results = simple_evaluate(
    model=model,
    tasks=["voicebench", "step2_audio_paralinguistic"],
    batch_size=8,
    device="cuda"
)
print(f"Results: {results['results']}")
```
## Technical Details
### Caching Architecture
**Design Philosophy:**
- **Simplicity**: JSONL format for easy inspection and debugging
- **Distributed-safe**: Per-rank files avoid write contention
- **Transparent**: No code changes needed for models using the API
**Cache Key:** `(task_name, doc_id)`
- Stable across runs if task and document IDs don't change
- Model hash derived from `model_version` and task list
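A minimal sketch of how such a cache path can be derived; the actual hash inputs and truncation in lmms-eval may differ:

```python
import hashlib
from pathlib import Path

def cache_file(model_version: str, tasks: list[str], task_name: str,
               rank: int, world_size: int) -> Path:
    # Illustrative: hash the model version together with the sorted task list
    model_hash = hashlib.sha256(
        f"{model_version}:{','.join(sorted(tasks))}".encode()
    ).hexdigest()[:16]
    root = Path.home() / ".cache" / "lmms-eval" / "eval_cache" / model_hash
    return root / f"{task_name}_rank{rank}_world_size{world_size}.jsonl"

# Rank-0 cache file for mmmu_val in a 2-process run
print(cache_file("gpt-4o-2024-11-20", ["mmmu_val"], "mmmu_val", 0, 2))
```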
**File Structure:**
```
~/.cache/lmms-eval/eval_cache/
└── <model_hash>/
    ├── task1_rank0_world_size2.jsonl
    ├── task1_rank1_world_size2.jsonl
    └── task2_rank0_world_size1.jsonl
```
**Performance:**
- Initial run: Full model inference
- Cached run: ~100x faster (I/O bound only)
- Distributed: Linear scaling with cache hits
### Audio Processing Pipeline
**Data Flow:**
1. Load HuggingFace audio datasets
2. Convert to unified message format with audio URLs
3. Process through audio-capable models
4. Apply task-specific metrics (WER, accuracy, GPT-4 judge)
5. Aggregate across task groups
**Message Format:**
```python
{
    "role": "user",
    "content": [
        {"type": "audio", "url": "path/to/audio.wav"},
        {"type": "text", "text": "Question about the audio"}
    ]
}
```
### Model Context Protocol
MCP enables models to call external tools during evaluation:
- Custom server implementation
- Tool definition and parsing
- Multi-step reasoning with tool results
- Compatible with OpenAI-style function calling (see the sketch below)
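Since the integration is compatible with OpenAI-style function calling, tools surface to the model in the familiar JSON-schema form. The weather tool below is purely illustrative, not a tool shipped with lmms-eval:

```python
# OpenAI-style tool definition (illustrative example)
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}
# Passed to the chat API as tools=[weather_tool]; the model's tool calls are
# then parsed and executed, and results fed back for multi-step reasoning.
```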
## Migration Guide
### From v0.4 to v0.5
**No Breaking Changes**: v0.5 is fully backward compatible with v0.4.
**New Features to Adopt:**
1. **Enable Caching for API Models:**
```bash
# Add these environment variables
export LMMS_EVAL_USE_CACHE=True
```
2. **Use New Audio Models:**
```bash
# GPT-4o Audio Preview
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17
```
3. **Leverage New Benchmarks:**
```bash
# Add audio, code, and STEM benchmarks
--tasks step2_audio_paralinguistic,voicebench,csbench,scibench
```
4. **Optimize Async OpenAI Calls:**
```python
# Use additional parameters for better control
model_args="model_version=gpt-4o,temperature=0.7,max_tokens=2048"
```
### Updating Existing Workflows
**Before (v0.4):**
```bash
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-08-06 \
--tasks mmmu_val \
--batch_size 1
```
**After (v0.5 with caching):**
```bash
export LMMS_EVAL_USE_CACHE=True
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-2024-11-20 \
--tasks mmmu_val,voicebench,csbench \
--batch_size 8 # Higher batch size with caching
```
## Bug Fixes and Improvements
### Fixed Issues
1. **`write_out` Flag Deprecated**: The `--write_out` flag is now deprecated in favor of `--log_samples`
```bash
# Old (deprecated)
--write_out
# New
--log_samples
```
2. **TypeError in `write_out` with `log_samples`**: Fixed crash when using both flags together
3. **Batch Size in OpenAI Endpoint**: Corrected batch size handling for OpenAI-compatible servers
4. **Gemma-3 Loading**: Fixed model loading to use `Gemma3ForConditionalGeneration` correctly
5. **`srt_api` Bugfix**: Resolved issues in the SGLang runtime (SRT) API model
6. **CharXiv Improvements**: Fixed chart understanding task configurations
7. **Async OpenAI Caching Order**: Corrected cache lookup order to avoid unnecessary API calls
### Performance Improvements
- **10-100x speedup** on cached evaluations
- **Better async handling** for API-based models
- **Reduced memory usage** in distributed settings
- **Faster audio dataset loading** from HuggingFace
## Deprecated Features
### Deprecated Flags
- **`--write_out`**: Use `--log_samples` instead
```bash
# Deprecated
python -m lmms_eval --write_out
# Use instead
python -m lmms_eval --log_samples
```
### Model Notes
- Models should implement caching API for best performance
- Legacy simple models continue to work but miss caching benefits
- See `lmms_eval.api.model.lmms` for caching integration
## Contributing
We welcome contributions to LMMS-Eval! The v0.5 release demonstrates the value of community contributions across models, benchmarks, and infrastructure.
### High-Priority Areas for v0.5.x
1. **Audio Model Integrations**: Help add support for more audio-capable models
2. **Audio Benchmark Implementations**: Expand audio evaluation coverage
3. **Caching Optimizations**: Improve cache hit rates and performance
4. **Documentation**: Enhance guides and examples for audio evaluation
5. **MCP Server Examples**: Create reference implementations for tool calling
### How to Contribute
1. **Fork the repository** and create a feature branch from `dev/v0d5`
2. **Follow the development guidelines** in `CLAUDE.md`:
- Use `uv` for package management (never pip)
- Add type hints and docstrings
- Run `uv run ruff format .` and `uv run ruff check . --fix`
- Run `uv run pyright` for type checking
3. **Test thoroughly**:
- Add tests for new features
- Verify caching works if implementing a model
- Test with realistic datasets
4. **Submit a pull request** with clear description
### Adding New Audio Benchmarks
Follow the pattern in existing audio tasks:
```python
# In tasks/your_audio_task/utils.py
def doc_to_messages(doc):
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "url": doc["audio_path"]},
            {"type": "text", "text": doc["question"]}
        ]
    }]
```
See `lmms_eval/tasks/step2_audio_paralinguistic/` and `lmms_eval/tasks/voicebench/` for examples.
### Adding Caching to Custom Models
Implement the caching API in your model's `generate_until`:
```python
class MyModel(lmms):
    def generate_until(self, requests):
        # Load existing cache entries from disk
        self.load_cache()
        # Separate cached hits from pending requests
        cached, pending = self.get_response_from_cache(requests)
        # Run inference only for the requests not found in the cache
        pending_responses = []
        for req in pending:
            response = self.my_inference_logic(req)
            self.add_request_response_to_cache(req, response)
            pending_responses.append(response)
        return [c["response"] for c in cached] + pending_responses
```
See `lmms_eval/models/chat/async_openai.py` for a complete example.
## Acknowledgments
The v0.5 release was made possible by contributions from the LMMS-Eval community:
### Core Contributors
- **Audio Evaluation Suite**: Implementation of Step2 Audio Paralinguistic, VoiceBench, and WenetSpeech benchmarks
- **Caching Infrastructure**: Design and implementation of the JSONL caching system
- **Model Integrations**: Support for GPT-4o Audio Preview, Gemma-3, LLaVA-OneVision 1.5, LongViLA-R1, and Thyme
- **Benchmark Additions**: CSBench, SciBench, Lemonade, and CharXiv implementations
- **MCP Integration**: Model Context Protocol client and tool calling support
- **Bug Fixes**: Numerous fixes to async OpenAI, batch handling, and model loading
### Special Thanks
- Community members who reported issues and provided feedback
- Contributors who improved documentation and examples
- Researchers who shared benchmark datasets and evaluation protocols
## Getting Help
### Documentation
- **Main README**: `README.md` - Quick start and overview
- **Model Guide**: `docs/model_guide.md` - Adding new models
- **Task Guide**: `docs/task_guide.md` - Implementing new benchmarks
- **Caching Guide**: `docs/caching.md` - Detailed caching documentation
- **Commands**: `docs/commands.md` - CLI reference
### Support Channels
- **GitHub Issues**: Report bugs or request features at [lmms-eval/issues](https://github.com/EvolvingLMMs-Lab/lmms-eval/issues)
- **GitHub Discussions**: Ask questions and share ideas at [lmms-eval/discussions](https://github.com/EvolvingLMMs-Lab/lmms-eval/discussions)
- **Documentation**: Check the `docs/` directory for implementation guides
### FAQs
**Q: How do I enable caching?**
```bash
export LMMS_EVAL_USE_CACHE=True
```
**Q: Where are cache files stored?**
```bash
~/.cache/lmms-eval/eval_cache/<model_hash>/
```
**Q: How do I evaluate audio models?**
```bash
python -m lmms_eval \
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17 \
--tasks step2_audio_paralinguistic,voicebench
```
**Q: Can I use caching with distributed evaluation?**
Yes! Caching works seamlessly with multi-GPU/multi-node evaluation. Each rank maintains its own cache file.
**Q: What's the difference between `--write_out` and `--log_samples`?**
`--write_out` is deprecated. Use `--log_samples` to save individual sample results.
---
**Version**: 0.5.0
**Release Date**: October 2025
**Previous Version**: [v0.4 Release Notes](lmms-eval-0.4.md)