# Throughput Metrics Documentation
This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.
## Overview
LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.
## Metrics Explained
### Core Timing Metrics
- **E2E (End-to-End Latency)**: Total time from request submission to response completion (in seconds)
- **TTFT (Time to First Token)**: Time from request submission until the first token is generated (in seconds)
- **TPOT (Time Per Output Token)**: Average time to generate each output token after the first (in seconds)
- **Speed (Inference Speed)**: Token generation rate calculated as 1/TPOT (tokens per second)
- **Output Tokens**: Number of tokens generated in the response
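As a rough sketch (illustrative function and field names, not the actual lmms-eval API), these quantities can be derived from three wall-clock timestamps and a token count:

```python
def compute_core_metrics(start_ts, first_token_ts, end_ts, output_tokens):
    """Derive E2E, TTFT, TPOT, and Speed from wall-clock timestamps.

    Timestamps are in seconds (e.g. from time.time()). Names here are
    illustrative; the real implementation lives in the chat model classes.
    """
    e2e = end_ts - start_ts                        # end-to-end latency
    ttft = first_token_ts - start_ts               # time to first token
    if output_tokens > 1:
        # average time per token after the first
        tpot = (e2e - ttft) / (output_tokens - 1)
    else:
        tpot = e2e                                 # single-token fallback
    # tokens/s; 0 for single-token responses, matching the edge-case rules below
    speed = 1.0 / tpot if tpot > 0 and output_tokens > 1 else 0.0
    return {"e2e": e2e, "ttft": ttft, "tpot": tpot, "speed": speed}
```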
### Batch Metrics
For models that process multiple requests in batches:
- **Batch Summary**: Aggregated metrics across all outputs in a batch
- **Total Time**: Total batch processing time
- **Total Tokens**: Sum of all output tokens in the batch
- **Avg Speed**: Average throughput across the entire batch (tokens/s)
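These batch aggregates can be sketched as follows (field names are illustrative assumptions, not the exact lmms-eval internals):

```python
def summarize_batch(batch_total_time, output_token_counts):
    """Aggregate batch-level throughput from per-output token counts.

    batch_total_time: wall-clock seconds spent processing the whole batch.
    output_token_counts: list of output-token counts, one per request.
    """
    total_tokens = sum(output_token_counts)
    # Average throughput across the entire batch (tokens/s).
    avg_speed = total_tokens / batch_total_time if batch_total_time > 0 else 0.0
    return {
        "total_time": batch_total_time,
        "total_tokens": total_tokens,
        "avg_speed": avg_speed,
    }
```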
## Log Format Examples
### Individual Output Metrics
```
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
```
### Batch Summary Metrics
```
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
```
### Single Request Metrics (Non-batched)
```
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
```
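If you need these numbers programmatically, lines in the formats above can be scraped with a regular expression. The pattern below is a sketch written against the examples in this document, not an official parser:

```python
import re

# Matches the metric fields shared by the individual and non-batched formats.
METRICS_RE = re.compile(
    r"E2E: (?P<e2e>[\d.]+)s, TTFT: (?P<ttft>[\d.]+)s, "
    r"TPOT: (?P<tpot>[\d.]+)s, Speed: (?P<speed>[\d.]+) tokens/s, "
    r"Output tokens: (?P<tokens>\d+)"
)

def parse_metrics_line(line):
    """Extract throughput metrics from a log line; return None if absent."""
    m = METRICS_RE.search(line)
    if m is None:
        return None
    out = {k: float(v) for k, v in m.groupdict().items()}
    out["tokens"] = int(out["tokens"])
    return out
```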
## Supported Models
The following chat models automatically log throughput metrics:
- **sglang_runtime** (`/lmms_eval/models/chat/sglang.py`)
- **vllm_chat** (`/lmms_eval/models/chat/vllm.py`)
- **llava_hf_chat** (`/lmms_eval/models/chat/llava_hf.py`)
- **openai_compatible_chat** (`/lmms_eval/models/chat/openai_compatible.py`)
- **qwen2_5_vl_chat** (`/lmms_eval/models/chat/qwen2_5_vl.py`)
- **huggingface_chat** (`/lmms_eval/models/chat/huggingface.py`)
## Usage
Throughput metrics are logged automatically during evaluation; no additional configuration is required. To view the metrics:
1. **Command Line Output**: Metrics appear in real-time during evaluation
2. **Log Files**: Metrics are written to log files if logging is configured
3. **Log Level**: Ensure logging level is set to INFO or lower to see metrics
### Example Evaluation Command
```bash
python -m lmms_eval \
    --model sglang_runtime \
    --model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks mme \
    --batch_size 4 \
    --log_samples \
    --output_path ./results
```
## Metric Calculation Details
### TTFT Calculation
- **Available from model**: Uses actual first token timestamp when provided
- **Estimated**: When unavailable, estimated as 10% of total inference time
### TPOT Calculation
- **Formula**: `(E2E_latency - TTFT) / (output_tokens - 1)`
- **Single-token responses**: The full E2E latency is used as TPOT
- **Batch processing**: The total batch time is divided by the number of outputs
### Speed Calculation
- **Formula**: `1 / TPOT` (when TPOT > 0)
- **Edge cases**: Speed is set to 0 for single-token responses or when TPOT is zero
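Putting these rules together, the calculation can be sketched as below. The 10% TTFT estimate and the edge-case handling mirror the rules above; the function name and signature are illustrative:

```python
def derive_metrics(e2e, output_tokens, ttft=None):
    """Compute TTFT, TPOT, and Speed per the rules above.

    e2e: end-to-end latency in seconds.
    ttft: actual time-to-first-token if the backend provides it; when
          None, it is estimated as 10% of the total inference time.
    """
    if ttft is None:
        ttft = 0.1 * e2e                  # estimated TTFT fallback
    if output_tokens > 1:
        tpot = (e2e - ttft) / (output_tokens - 1)
    else:
        tpot = e2e                        # single-token response
    # Speed is 0 for single-token responses or zero TPOT.
    speed = 1.0 / tpot if tpot > 0 and output_tokens > 1 else 0.0
    return ttft, tpot, speed
```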
## Performance Analysis
### Interpreting Metrics
- **High TTFT**: May indicate model loading, prompt processing, or scheduling delays
- **High TPOT**: Suggests slower token generation, possibly due to model size or hardware limitations
- **Low Speed**: Indicates throughput bottlenecks in token generation
- **E2E vs TTFT+TPOT**: Large differences may suggest batching overhead or system delays
### Optimization Insights
- **Reduce TTFT**: Optimize prompt processing, use model caching, improve scheduling
- **Reduce TPOT**: Use faster hardware, optimize model inference, adjust batch sizes
- **Batch Efficiency**: Compare individual vs batch metrics to assess batching benefits
## Troubleshooting
### Missing Metrics
- Ensure model supports throughput logging (see supported models list)
- Check log level is set to INFO or lower
- Verify model implementation includes timing instrumentation
### Inaccurate Metrics
- TTFT estimates may be imprecise when actual timing is unavailable
- Batch metrics are averaged across multiple outputs, so per-output variance is not captured
- Network latency may affect metrics for API-based models
## Implementation Notes
Throughput metrics are implemented consistently across chat models using:
- `time.time()` for wall-clock timing measurements
- Model-specific metadata when available (e.g., SGLang, VLLM native metrics)
- Fallback estimation methods for missing timing data
- Structured logging format for consistent parsing and analysis
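A minimal instrumentation pattern in this style might look like the following. `generate_fn` is an illustrative stand-in for a model's generation call, not an actual lmms-eval function:

```python
import time

def timed_generate(generate_fn, *args, **kwargs):
    """Wrap a generation call with wall-clock timing via time.time().

    generate_fn is assumed to return (text, n_output_tokens); real chat
    models record additional timestamps (e.g. first-token time) when the
    backend exposes them.
    """
    start = time.time()
    text, n_tokens = generate_fn(*args, **kwargs)
    e2e = time.time() - start             # end-to-end wall-clock latency
    return text, {"e2e": e2e, "output_tokens": n_tokens}
```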