# Throughput Metrics Documentation
This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.
## Overview
LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.
## Metrics Explained
### Core Timing Metrics
- **E2E (End-to-End Latency)**: Total time from request submission to response completion (in seconds)
- **TTFT (Time to First Token)**: Time from request submission until the first token is generated (in seconds)
- **TPOT (Time Per Output Token)**: Average time to generate each output token after the first (in seconds)
- **Speed (Inference Speed)**: Token generation rate calculated as 1/TPOT (tokens per second)
- **Output Tokens**: Number of tokens generated in the response
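As a rough sketch (illustrative function and field names, not the actual lmms-eval API), these quantities can be derived from three wall-clock timestamps and a token count:

```python
def compute_core_metrics(start_ts, first_token_ts, end_ts, output_tokens):
    """Derive E2E, TTFT, TPOT, and Speed from wall-clock timestamps.

    Timestamps are in seconds (e.g. from time.time()). Names here are
    illustrative; the real implementation lives in the chat model classes.
    """
    e2e = end_ts - start_ts                        # end-to-end latency
    ttft = first_token_ts - start_ts               # time to first token
    if output_tokens > 1:
        # average time per token after the first
        tpot = (e2e - ttft) / (output_tokens - 1)
    else:
        tpot = e2e                                 # single-token fallback
    # tokens/s; 0 for single-token responses, matching the edge-case rules below
    speed = 1.0 / tpot if tpot > 0 and output_tokens > 1 else 0.0
    return {"e2e": e2e, "ttft": ttft, "tpot": tpot, "speed": speed}
```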
### Batch Metrics
For models that process multiple requests in batches:
- **Batch Summary**: Aggregated metrics across all outputs in a batch
- **Total Time**: Total batch processing time
- **Total Tokens**: Sum of all output tokens in the batch
- **Avg Speed**: Average throughput across the entire batch (tokens/s)
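These batch aggregates can be sketched as follows (field names are illustrative assumptions, not the exact lmms-eval internals):

```python
def summarize_batch(batch_total_time, output_token_counts):
    """Aggregate batch-level throughput from per-output token counts.

    batch_total_time: wall-clock seconds spent processing the whole batch.
    output_token_counts: list of output-token counts, one per request.
    """
    total_tokens = sum(output_token_counts)
    # Average throughput across the entire batch (tokens/s).
    avg_speed = total_tokens / batch_total_time if batch_total_time > 0 else 0.0
    return {
        "total_time": batch_total_time,
        "total_tokens": total_tokens,
        "avg_speed": avg_speed,
    }
```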
## Log Format Examples
### Individual Output Metrics
```
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
```
### Batch Summary Metrics
```
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
```
### Single Request Metrics (Non-batched)
```
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
```
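If you need these numbers programmatically, lines in the formats above can be scraped with a regular expression. The pattern below is a sketch written against the examples in this document, not an official parser:

```python
import re

# Matches the metric fields shared by the individual and non-batched formats.
METRICS_RE = re.compile(
    r"E2E: (?P<e2e>[\d.]+)s, TTFT: (?P<ttft>[\d.]+)s, "
    r"TPOT: (?P<tpot>[\d.]+)s, Speed: (?P<speed>[\d.]+) tokens/s, "
    r"Output tokens: (?P<tokens>\d+)"
)

def parse_metrics_line(line):
    """Extract throughput metrics from a log line; return None if absent."""
    m = METRICS_RE.search(line)
    if m is None:
        return None
    out = {k: float(v) for k, v in m.groupdict().items()}
    out["tokens"] = int(out["tokens"])
    return out
```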
## Supported Models
The following chat models automatically log throughput metrics:
- **sglang_runtime** (`/lmms_eval/models/chat/sglang.py`)
- **vllm_chat** (`/lmms_eval/models/chat/vllm.py`)
- **llava_hf_chat** (`/lmms_eval/models/chat/llava_hf.py`)
- **openai_compatible_chat** (`/lmms_eval/models/chat/openai_compatible.py`)
- **qwen2_5_vl_chat** (`/lmms_eval/models/chat/qwen2_5_vl.py`)
- **huggingface_chat** (`/lmms_eval/models/chat/huggingface.py`)
## Usage
Throughput metrics are logged automatically during evaluation; no additional configuration is required. To view the metrics:
1. **Command Line Output**: Metrics appear in real-time during evaluation
2. **Log Files**: Metrics are written to log files if logging is configured
3. **Log Level**: Ensure logging level is set to INFO or lower to see metrics
### Example Evaluation Command
```bash
python -m lmms_eval \
    --model sglang_runtime \
    --model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks mme \
    --batch_size 4 \
    --log_samples \
    --output_path ./results
```
## Metric Calculation Details
### TTFT Calculation
- **Available from model**: Uses actual first token timestamp when provided
- **Estimated**: When unavailable, estimated as 10% of total inference time
### TPOT Calculation
- **Formula**: `(E2E_latency - TTFT) / (output_tokens - 1)`
- **Single-token responses**: The full E2E latency is used as TPOT
- **Batch processing**: The total batch time is divided by the number of outputs
### Speed Calculation
- **Formula**: `1 / TPOT` (when TPOT > 0)
- **Edge cases**: Speed is set to 0 for single-token responses or when TPOT is zero
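Putting these rules together, the calculation can be sketched as below. The 10% TTFT estimate and the edge-case handling mirror the rules above; the function name and signature are illustrative:

```python
def derive_metrics(e2e, output_tokens, ttft=None):
    """Compute TTFT, TPOT, and Speed per the rules above.

    e2e: end-to-end latency in seconds.
    ttft: actual time-to-first-token if the backend provides it; when
          None, it is estimated as 10% of the total inference time.
    """
    if ttft is None:
        ttft = 0.1 * e2e                  # estimated TTFT fallback
    if output_tokens > 1:
        tpot = (e2e - ttft) / (output_tokens - 1)
    else:
        tpot = e2e                        # single-token response
    # Speed is 0 for single-token responses or zero TPOT.
    speed = 1.0 / tpot if tpot > 0 and output_tokens > 1 else 0.0
    return ttft, tpot, speed
```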
## Performance Analysis
### Interpreting Metrics
- **High TTFT**: May indicate model loading, prompt processing, or scheduling delays
- **High TPOT**: Suggests slower token generation, possibly due to model size or hardware limitations
- **Low Speed**: Indicates throughput bottlenecks in token generation
- **E2E vs TTFT+TPOT**: Large differences may suggest batching overhead or system delays
### Optimization Insights
- **Reduce TTFT**: Optimize prompt processing, use model caching, improve scheduling
- **Reduce TPOT**: Use faster hardware, optimize model inference, adjust batch sizes
- **Batch Efficiency**: Compare individual vs batch metrics to assess batching benefits
## Troubleshooting
### Missing Metrics
- Ensure model supports throughput logging (see supported models list)
- Check log level is set to INFO or lower
- Verify model implementation includes timing instrumentation
### Inaccurate Metrics
- TTFT estimates may be imprecise when actual timing is unavailable
- Batch metrics are averaged across multiple outputs, so per-output variance is not captured
- Network latency may affect metrics for API-based models
## Implementation Notes
Throughput metrics are implemented consistently across chat models using:
- `time.time()` for wall-clock timing measurements
- Model-specific metadata when available (e.g., SGLang, VLLM native metrics)
- Fallback estimation methods for missing timing data
- Structured logging format for consistent parsing and analysis
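A minimal instrumentation pattern in this style might look like the following. `generate_fn` is an illustrative stand-in for a model's generation call, not an actual lmms-eval function:

```python
import time

def timed_generate(generate_fn, *args, **kwargs):
    """Wrap a generation call with wall-clock timing via time.time().

    generate_fn is assumed to return (text, n_output_tokens); real chat
    models record additional timestamps (e.g. first-token time) when the
    backend exposes them.
    """
    start = time.time()
    text, n_tokens = generate_fn(*args, **kwargs)
    e2e = time.time() - start             # end-to-end wall-clock latency
    return text, {"e2e": e2e, "output_tokens": n_tokens}
```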