Throughput Metrics Documentation
This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.
Overview
LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.
Metrics Explained
Core Timing Metrics
- E2E (End-to-End Latency): Total time from request submission to response completion (in seconds)
- TTFT (Time to First Token): Time from request submission until the first token is generated (in seconds)
- TPOT (Time Per Output Token): Average time to generate each output token after the first (in seconds)
- Speed (Inference Speed): Token generation rate calculated as 1/TPOT (tokens per second)
- Output Tokens: Number of tokens generated in the response
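The relationships between these metrics can be illustrated with a short sketch using the example numbers from the log samples below (values are illustrative, not produced by LMMS-Eval):

```python
# Relationship between the core metrics, using illustrative numbers:
e2e, ttft, output_tokens = 2.145, 0.215, 42

# Average time per token after the first
tpot = (e2e - ttft) / (output_tokens - 1)

# Token generation rate
speed = 1.0 / tpot

# By construction, e2e == ttft + tpot * (output_tokens - 1)
```

Note that E2E, TTFT, and TPOT are redundant given the token count: knowing any two (plus the number of output tokens) determines the third.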
Batch Metrics
For models that process multiple requests in batches:
- Batch Summary: Aggregated metrics across all outputs in a batch
- Total Time: Total batch processing time
- Total Tokens: Sum of all output tokens in the batch
- Avg Speed: Average throughput across the entire batch (tokens/s)
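A minimal sketch of how the batch summary aggregates per-output token counts; the function and field names here are illustrative, not LMMS-Eval internals:

```python
def batch_summary(total_time: float, token_counts: list[int]) -> dict:
    """Aggregate per-output token counts into a batch summary.

    Assumes `total_time` is the wall-clock duration of the whole batch
    and `token_counts` holds the output-token count of each response.
    """
    total_tokens = sum(token_counts)
    # Average throughput across the entire batch (tokens/s)
    avg_speed = total_tokens / total_time if total_time > 0 else 0.0
    return {
        "total_time": total_time,
        "total_tokens": total_tokens,
        "avg_speed": avg_speed,
    }
```

For example, three outputs of 42, 50, and 36 tokens completing in 2.145 s total give 128 tokens at roughly 59.7 tokens/s, matching the batch-summary log line shown below.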
Log Format Examples
Individual Output Metrics
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
Batch Summary Metrics
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
Single Request Metrics (Non-batched)
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
Supported Models
The following chat models automatically log throughput metrics:
- sglang_runtime (/lmms_eval/models/chat/sglang.py)
- vllm_chat (/lmms_eval/models/chat/vllm.py)
- llava_hf_chat (/lmms_eval/models/chat/llava_hf.py)
- openai_compatible_chat (/lmms_eval/models/chat/openai_compatible.py)
- qwen2_5_vl_chat (/lmms_eval/models/chat/qwen2_5_vl.py)
- huggingface_chat (/lmms_eval/models/chat/huggingface.py)
Usage
Throughput metrics are logged automatically during evaluation; no additional configuration is required. To view the metrics:
- Command Line Output: Metrics appear in real-time during evaluation
- Log Files: Metrics are written to log files if logging is configured
- Log Level: Ensure logging level is set to INFO or lower to see metrics
Example Evaluation Command
python -m lmms_eval \
--model sglang_runtime \
--model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
--tasks mme \
--batch_size 4 \
--log_samples \
--output_path ./results
Metric Calculation Details
TTFT Calculation
- Available from model: Uses actual first token timestamp when provided
- Estimated: When unavailable, estimated as 10% of total inference time
TPOT Calculation
- Formula: (E2E_latency - TTFT) / (output_tokens - 1)
- Single-token responses: Uses full E2E latency as TPOT
- Batch processing: Divides total batch time by number of outputs
Speed Calculation
- Formula: 1 / TPOT (when TPOT > 0)
- Edge cases: Set to 0 for single-token responses or zero TPOT
Performance Analysis
Interpreting Metrics
- High TTFT: May indicate model loading, prompt processing, or scheduling delays
- High TPOT: Suggests slower token generation, possibly due to model size or hardware limitations
- Low Speed: Indicates throughput bottlenecks in token generation
- E2E vs. TTFT + TPOT × (tokens − 1): A large gap between the measured E2E latency and the value reconstructed from TTFT and TPOT may suggest batching overhead or system delays
Optimization Insights
- Reduce TTFT: Optimize prompt processing, use model caching, improve scheduling
- Reduce TPOT: Use faster hardware, optimize model inference, adjust batch sizes
- Batch Efficiency: Compare individual vs batch metrics to assess batching benefits
Troubleshooting
Missing Metrics
- Ensure model supports throughput logging (see supported models list)
- Check log level is set to INFO or lower
- Verify model implementation includes timing instrumentation
Inaccurate Metrics
- TTFT estimates may be imprecise when actual first-token timing is unavailable
- Batch metrics average across multiple outputs, so per-output variance is not captured
- Network latency may affect metrics for API-based models
Implementation Notes
Throughput metrics are implemented consistently across chat models using:
- time.time() for wall-clock timing measurements
- Model-specific metadata when available (e.g., SGLang, vLLM native metrics)
- Fallback estimation methods for missing timing data
- Structured logging format for consistent parsing and analysis
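The instrumentation pattern described above can be sketched as a wrapper around a generate call. This is a minimal illustration of the approach, not the actual LMMS-Eval implementation; `generate_fn` and its return shape are assumptions:

```python
import logging
import time

logger = logging.getLogger(__name__)

def timed_generate(generate_fn, request):
    """Wrap a model's generate call with wall-clock timing and log the
    metrics in the structured format documented above.

    Assumes `generate_fn(request)` returns
    (text, output_token_count, first_token_timestamp_or_None).
    """
    start = time.time()
    text, output_tokens, first_token_time = generate_fn(request)
    end = time.time()

    e2e = end - start
    # Fallback estimation when the backend does not expose first-token timing
    ttft = (first_token_time - start) if first_token_time is not None else 0.1 * e2e
    tpot = (e2e - ttft) / (output_tokens - 1) if output_tokens > 1 else e2e
    speed = 1.0 / tpot if tpot > 0 and output_tokens > 1 else 0.0

    logger.info(
        "Inference metrics - E2E: %.3fs, TTFT: %.3fs, TPOT: %.3fs, "
        "Speed: %.1f tokens/s, Output tokens: %d",
        e2e, ttft, tpot, speed, output_tokens,
    )
    return text
```

Because the log line is emitted via the standard logging module at INFO level, it appears in both console output and any configured log files, consistent with the viewing options listed in the Usage section.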