# LMMS-Eval v0.5: Multimodal Expansion Release

## Introduction

LMMS-Eval v0.5 represents a significant expansion in multimodal evaluation capabilities, introducing comprehensive audio understanding support alongside continued vision and reasoning enhancements.

## Table of Contents

- [Introduction](#introduction)
- [Major Features](#major-features)
  - [1. Response Caching System](#1-response-caching-system)
  - [2. Audio Evaluation Suite](#2-audio-evaluation-suite)
  - [3. New Model Support](#3-new-model-support)
  - [4. New Benchmarks](#4-new-benchmarks)
  - [5. Model Context Protocol (MCP) Integration](#5-model-context-protocol-mcp-integration)
  - [6. Async OpenAI Improvements](#6-async-openai-improvements)
- [Usage Examples](#usage-examples)
- [Technical Details](#technical-details)
- [Migration Guide](#migration-guide)
- [Bug Fixes and Improvements](#bug-fixes-and-improvements)
- [Deprecated Features](#deprecated-features)
- [Contributing](#contributing)
- [Acknowledgments](#acknowledgments)
- [Getting Help](#getting-help)

## Major Features

### 1. Response Caching System

A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:

**Key Features:**
- **Per-document caching**: Cached at `(task_name, doc_id)` level
- **Distributed-safe**: Separate cache files per rank/world size
- **Zero-overhead**: Automatic cache hits with no code changes
- **Multi-backend**: Works with async OpenAI, vLLM, and custom models

**Enable Caching:**
```bash
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root"  # optional

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
  --tasks mmmu_val \
  --batch_size 1 \
  --output_path ./logs/
```

**Cache Location:**
- Default: `~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Each line: `{"doc_id": <doc_id>, "response": <string>}`

**API Integration:**
```python
def generate_until(self, requests):
    self.load_cache()
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
```

See full documentation in `docs/caching.md`.

### 2. Audio Evaluation Suite

Comprehensive audio understanding capabilities with three major benchmark families:

#### Step2 Audio Paralinguistic (11 tasks)
Fine-grained paralinguistic feature evaluation:
- **Acoustic Features**: pitch, rhythm, speed, voice_tone, voice_styles
- **Speaker Attributes**: age, gender, emotions
- **Environmental**: scene, event, vocalsound
- Semantic match metrics

```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic \
  --batch_size 1
```

#### VoiceBench (9 main categories, 30+ subtasks)
Comprehensive voice and speech evaluation:
- **Instruction Following**: ifeval, alpacaeval, advbench
- **Reasoning**: bbh (Big Bench Hard), commoneval
- **Knowledge**: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- **Q&A**: openbookqa
- **Accent Diversity**: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- **Expressiveness**: wildvoice
- Metrics vary by task type and include scored accuracy (1-5 scale), failure rate, and LLM-based evaluation

```bash
# Full VoiceBench
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench \
  --batch_size 1

# Specific accent evaluation
python -m lmms_eval \
  --tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
  --batch_size 1
```

#### WenetSpeech (2 splits)
Large-scale ASR and speech evaluation:
- **dev**: Development set for validation
- **test_meeting**: Meeting domain evaluation
- MER (Mixed Error Rate) metrics

```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks wenet_speech_dev,wenet_speech_test_meeting \
  --batch_size 1
```

**Audio Pipeline Features:**
- HuggingFace audio dataset integration
- Unified audio message format (see the sketch after this list)
- Multiple metric support (Accuracy, WER, GPT-4 Judge)
- Task grouping for multi-subset benchmarks
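
To see how these pieces fit together, here is a minimal sketch that loads a HuggingFace audio dataset and wraps one example in the unified message format (the dataset name and field names are placeholders, not any specific task's schema):

```python
from datasets import load_dataset

# Placeholder dataset and field names -- substitute the actual task's schema.
ds = load_dataset("org/audio-benchmark", split="test")

doc = ds[0]
messages = [
    {
        "role": "user",
        "content": [
            # The audio entry references the file; the text entry carries the question.
            {"type": "audio", "url": doc["audio"]["path"]},
            {"type": "text", "text": doc["question"]},
        ],
    }
]
```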

### 3. New Model Support

Five new model integrations expanding audio and vision capabilities:

| Model | Type | Key Features | Usage Example |
|-------|------|--------------|---------------|
| **GPT-4o Audio Preview** | Audio+Text | Paralinguistic understanding, multi-turn audio | `--model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17` |
| **Gemma-3** | Vision+Text | Enhanced video handling, efficient architecture | `--model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it` |
| **LLaVA-OneVision 1.5** | Vision+Text | Improved vision understanding, latest LLaVA | `--model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b` |
| **LongViLA-R1** | Video+Text | Long-context video, efficient video processing | `--model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B` |
| **Thyme** | Vision+Text | Reasoning-focused, enhanced image handling | `--model thyme --model_args pretrained=thyme-ai/thyme-7b` |

**Example Usage:**
```bash
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 1

# LongViLA for video understanding
python -m lmms_eval \
  --model longvila \
  --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
  --tasks videomme,egoschema \
  --batch_size 1
```

### 4. New Benchmarks

Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage of specialized domains:

#### Vision & Reasoning Benchmarks

| Benchmark | Variants | Focus | Metrics |
|-----------|----------|-------|---------|
| **CSBench** | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
| **SciBench** | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
| **MedQA** | 1 | Medical question answering | Accuracy |
| **SuperGPQA** | 1 | Graduate-level science Q&A | Accuracy |
| **Lemonade** | 1 | Video action recognition | Accuracy |
| **CharXiv** | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |

**Example Usage:**
```bash
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1

# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1

# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
```

#### Reproducibility Validation

We validated the new benchmark implementations using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:

| Model | Task | lmms-eval | Reported | Δ | Status |
|-------|------|-----------|----------|------|--------|
| **Qwen-2.5-7B-Instruct** | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| **Llama-3.1-8B** | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| | SciBench | 15.35 | 10.78 | +4.57 | ± |
| | CSBench | 62.49 | 57.87 | +4.62 | ± |
| | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |

**Status Legend**: ✓ = Strong agreement (Δ ≤ 2.5%) | ± = Acceptable variance (2.5% < Δ ≤ 5%)

### 5. Model Context Protocol (MCP) Integration

Support for MCP-enabled models with tool calling:

```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
  --tasks mmmu_val \
  --batch_size 1
```

**Features:**
- Tool call parsing and execution (see the schema sketch after this list)
- Multi-step reasoning with tools
- Custom MCP server integration
- See `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for templates
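
Tools exposed by an MCP server are surfaced to the model as OpenAI-style function definitions. A rough sketch of what one tool definition might look like (the tool name and parameters are hypothetical, not part of lmms-eval):

```python
# Hypothetical tool definition in the OpenAI function-calling format.
crop_image_tool = {
    "type": "function",
    "function": {
        "name": "crop_image",
        "description": "Crop a region of the input image for closer inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "Left edge in pixels"},
                "y": {"type": "integer", "description": "Top edge in pixels"},
                "width": {"type": "integer"},
                "height": {"type": "integer"},
            },
            "required": ["x", "y", "width", "height"],
        },
    },
}
```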

### 6. Async OpenAI Improvements

Enhanced async API integration:
- Better rate limit handling
- Configurable retry logic with delays
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints

**Common Args Support:**
```bash
# Now supports additional parameters
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
  --tasks mmstar
```
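
The retry behavior listed above can be approximated with an exponential-backoff wrapper around any async call. A minimal sketch of such a helper (a generic illustration, not the shipped implementation):

```python
import asyncio
import random


async def with_retries(call, max_retries=5, base_delay=1.0):
    """Retry an async callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Back off exponentially, with jitter to avoid synchronized retries.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())
```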

## Usage Examples

### Audio Evaluation with Caching
```bash
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 8 \
  --output_path ./audio_results/ \
  --log_samples

# Second run will use cache - much faster!
```

### Multi-Benchmark Evaluation
```bash
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks voicebench_mmsu,csbench,scibench_math,charxiv \
  --batch_size 4 \
  --output_path ./multimodal_results/
```

### Distributed Evaluation with Caching
```bash
export LMMS_EVAL_USE_CACHE=True

torchrun --nproc_per_node=8 -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
  --tasks step2_audio_paralinguistic,csbench,scibench \
  --batch_size 16 \
  --output_path ./distributed_results/
```

### Programmatic API with Caching
```python
import os
from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat

# Enable caching
os.environ["LMMS_EVAL_USE_CACHE"] = "True"

model = AsyncOpenAICompatibleChat(
    model_version="gpt-4o-audio-preview-2024-12-17",
    base_url="https://api.openai.com/v1"
)

results = simple_evaluate(
    model=model,
    tasks=["voicebench", "step2_audio_paralinguistic"],
    batch_size=8,
    device="cuda"
)

print(f"Results: {results['results']}")
```

## Technical Details

### Caching Architecture

**Design Philosophy:**
- **Simplicity**: JSONL format for easy inspection and debugging
- **Distributed-safe**: Per-rank files avoid write contention
- **Transparent**: No code changes needed for models using the API

**Cache Key:** `(task_name, doc_id)`
- Stable across runs if task and document IDs don't change
- Model hash derived from `model_version` and task list (one possible derivation is sketched below)
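
One straightforward way to derive such a hash (illustrative only; the shipped scheme may differ):

```python
import hashlib


def model_cache_hash(model_version: str, tasks: list[str]) -> str:
    """Stable, order-independent hash over the model version and task list."""
    key = model_version + "|" + ",".join(sorted(tasks))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```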

**File Structure:**
```
~/.cache/lmms-eval/eval_cache/
└── <model_hash>/
    β”œβ”€β”€ task1_rank0_world_size1.jsonl
    β”œβ”€β”€ task1_rank1_world_size1.jsonl
    └── task2_rank0_world_size1.jsonl
```
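
Because the cache is plain JSONL, it is easy to inspect; a minimal sketch assuming the default cache root shown above:

```python
import json
from pathlib import Path

cache_root = Path.home() / ".cache/lmms-eval/eval_cache"

# Walk every per-rank cache file and count cached responses per shard.
for cache_file in sorted(cache_root.glob("*/*.jsonl")):
    with cache_file.open() as f:
        entries = [json.loads(line) for line in f if line.strip()]
    print(f"{cache_file.name}: {len(entries)} cached responses")
```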

**Performance:**
- Initial run: Full model inference
- Cached run: ~100x faster (I/O bound only)
- Distributed: Linear scaling with cache hits

### Audio Processing Pipeline

**Data Flow:**
1. Load HuggingFace audio datasets
2. Convert to unified message format with audio URLs
3. Process through audio-capable models
4. Apply task-specific metrics (WER, accuracy, GPT-4 judge)
5. Aggregate across task groups

**Message Format:**
```python
{
    "role": "user",
    "content": [
        {"type": "audio", "url": "path/to/audio.wav"},
        {"type": "text", "text": "Question about the audio"}
    ]
}
```
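
The metrics applied in step 4 are standard. As one concrete example, WER (and the MER reported for WenetSpeech) is an edit-distance ratio over tokens; a minimal, illustrative implementation (not the exact code used by the tasks):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```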

### Model Context Protocol

MCP enables models to call external tools during evaluation:
- Custom server implementation
- Tool definition and parsing
- Multi-step reasoning with tool results
- Compatible with OpenAI-style function calling

## Migration Guide

### From v0.4 to v0.5

**No Breaking Changes**: v0.5 is fully backward compatible with v0.4.

**New Features to Adopt:**

1. **Enable Caching for API Models:**
```bash
# Add these environment variables
export LMMS_EVAL_USE_CACHE=True
```

2. **Use New Audio Models:**
```bash
# GPT-4o Audio Preview
--model async_openai \
--model_args model_version=gpt-4o-audio-preview-2024-12-17
```

3. **Leverage New Benchmarks:**
```bash
# Add audio, code, and STEM benchmarks
--tasks step2_audio_paralinguistic,voicebench,csbench,scibench
```

4. **Optimize Async OpenAI Calls:**
```python
# Use additional parameters for better control
model_args="model_version=gpt-4o,temperature=0.7,max_tokens=2048"
```

### Updating Existing Workflows

**Before (v0.4):**
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-08-06 \
  --tasks mmmu_val \
  --batch_size 1
```

**After (v0.5 with caching):**
```bash
export LMMS_EVAL_USE_CACHE=True

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks mmmu_val,voicebench,csbench \
  --batch_size 8  # Higher batch size with caching
```

## Bug Fixes and Improvements

### Fixed Issues

1. **`write_out` Flag Deprecated**: The `--write_out` flag is now deprecated in favor of `--log_samples`
   ```bash
   # Old (deprecated)
   --write_out

   # New
   --log_samples
   ```

2. **TypeError in `write_out` with `log_samples`**: Fixed crash when using both flags together

3. **Batch Size in OpenAI Endpoint**: Corrected batch size handling for OpenAI-compatible servers

4. **Gemma-3 Loading**: Fixed model loading to use `Gemma3ForConditionalGeneration` correctly

5. **SRT API Bugfix**: Resolved issues in the SRT API model backend

6. **CharXiv Improvements**: Fixed chart understanding task configurations

7. **Async OpenAI Caching Order**: Corrected cache lookup order to avoid unnecessary API calls

### Performance Improvements

- **10-100x speedup** on cached evaluations
- **Better async handling** for API-based models
- **Reduced memory usage** in distributed settings
- **Faster audio dataset loading** from HuggingFace

## Deprecated Features

### Deprecated Flags

- **`--write_out`**: Use `--log_samples` instead
  ```bash
  # Deprecated
  python -m lmms_eval --write_out

  # Use instead
  python -m lmms_eval --log_samples
  ```

### Model Notes

- Models should implement caching API for best performance
- Legacy simple models continue to work but miss caching benefits
- See `lmms_eval.api.model.lmms` for caching integration

## Contributing

We welcome contributions to LMMS-Eval! The v0.5 release demonstrates the value of community contributions across models, benchmarks, and infrastructure.

### High-Priority Areas for v0.5.x

1. **Audio Model Integrations**: Help add support for more audio-capable models
2. **Audio Benchmark Implementations**: Expand audio evaluation coverage
3. **Caching Optimizations**: Improve cache hit rates and performance
4. **Documentation**: Enhance guides and examples for audio evaluation
5. **MCP Server Examples**: Create reference implementations for tool calling

### How to Contribute

1. **Fork the repository** and create a feature branch from `dev/v0d5`
2. **Follow the development guidelines** in `CLAUDE.md`:
   - Use `uv` for package management (never pip)
   - Add type hints and docstrings
   - Run `uv run ruff format .` and `uv run ruff check . --fix`
   - Run `uv run pyright` for type checking
3. **Test thoroughly**:
   - Add tests for new features
   - Verify caching works if implementing a model
   - Test with realistic datasets
4. **Submit a pull request** with clear description

### Adding New Audio Benchmarks

Follow the pattern in existing audio tasks:

```python
# In tasks/your_audio_task/utils.py
def doc_to_messages(doc):
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "url": doc["audio_path"]},
            {"type": "text", "text": doc["question"]}
        ]
    }]
```

See `lmms_eval/tasks/step2_audio_paralinguistic/` and `lmms_eval/tasks/voicebench/` for examples.
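
Alongside `doc_to_messages`, a task usually also defines a results-processing hook that maps the raw model response onto metric values. A hedged sketch of the general shape (field and metric names are illustrative; follow the referenced tasks for the actual conventions):

```python
# In tasks/your_audio_task/utils.py
def process_results(doc, results):
    """Map the model's response onto the task's metric(s)."""
    prediction = results[0].strip().lower()
    target = doc["answer"].strip().lower()
    # Exact-match accuracy here; real audio tasks may use WER or an LLM judge instead.
    return {"accuracy": 1.0 if prediction == target else 0.0}
```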

### Adding Caching to Custom Models

Implement the caching API in your model's `generate_until`:

```python
class MyModel(lmms):
    def generate_until(self, requests):
        # Load any existing cache entries
        self.load_cache()

        # Separate cached vs pending requests
        cached, pending = self.get_response_from_cache(requests)
        responses = [c["response"] for c in cached]

        # Run inference only for uncached requests, caching each response
        for req in pending:
            response = self.my_inference_logic(req)
            self.add_request_response_to_cache(req, response)
            responses.append(response)

        return responses
```

See `lmms_eval/models/chat/async_openai.py` for a complete example.

## Acknowledgments

The v0.5 release was made possible by contributions from the LMMS-Eval community:

### Core Contributors

- **Audio Evaluation Suite**: Implementation of Step2 Audio Paralinguistic, VoiceBench, and WenetSpeech benchmarks
- **Caching Infrastructure**: Design and implementation of the JSONL caching system
- **Model Integrations**: Support for GPT-4o Audio Preview, Gemma-3, LLaVA-OneVision 1.5, LongViLA-R1, and Thyme
- **Benchmark Additions**: CSBench, SciBench, Lemonade, and CharXiv implementations
- **MCP Integration**: Model Context Protocol client and tool calling support
- **Bug Fixes**: Numerous fixes to async OpenAI, batch handling, and model loading

### Special Thanks

- Community members who reported issues and provided feedback
- Contributors who improved documentation and examples
- Researchers who shared benchmark datasets and evaluation protocols

## Getting Help

### Documentation

- **Main README**: `README.md` - Quick start and overview
- **Model Guide**: `docs/model_guide.md` - Adding new models
- **Task Guide**: `docs/task_guide.md` - Implementing new benchmarks
- **Caching Guide**: `docs/caching.md` - Detailed caching documentation
- **Commands**: `docs/commands.md` - CLI reference

### Support Channels

- **GitHub Issues**: Report bugs or request features at [lmms-eval/issues](https://github.com/EvolvingLMMs-Lab/lmms-eval/issues)
- **GitHub Discussions**: Ask questions and share ideas at [lmms-eval/discussions](https://github.com/EvolvingLMMs-Lab/lmms-eval/discussions)
- **Documentation**: Check the `docs/` directory for implementation guides

### FAQs

**Q: How do I enable caching?**
```bash
export LMMS_EVAL_USE_CACHE=True
```

**Q: Where are cache files stored?**
```bash
~/.cache/lmms-eval/eval_cache/<model_hash>/
```

**Q: How do I evaluate audio models?**
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench
```

**Q: Can I use caching with distributed evaluation?**

Yes! Caching works seamlessly with multi-GPU/multi-node evaluation. Each rank maintains its own cache file.

**Q: What's the difference between `--write_out` and `--log_samples`?**

`--write_out` is deprecated. Use `--log_samples` to save individual sample results.

---

**Version**: 0.5.0
**Release Date**: October 2025
**Previous Version**: [v0.4 Release Notes](lmms-eval-0.4.md)