# Throughput Metrics Documentation

This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.

## Overview

LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.

## Metrics Explained

### Core Timing Metrics

- **E2E (End-to-End Latency)**: Total time from request submission to response completion (in seconds)
- **TTFT (Time to First Token)**: Time from request submission until the first token is generated (in seconds)
- **TPOT (Time Per Output Token)**: Average time to generate each output token after the first (in seconds)
- **Speed (Inference Speed)**: Token generation rate, calculated as 1/TPOT (tokens per second)
- **Output Tokens**: Number of tokens generated in the response
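These quantities are related: for an *n*-token response, E2E ≈ TTFT + TPOT × (n − 1). A quick sanity check against the example values from the log format section below (a sketch, not lmms-eval code):

```python
# Verify the metric relationship E2E ~ TTFT + TPOT * (n - 1) using the
# example values shown in the log format section of this document.
e2e, ttft, tokens = 2.145, 0.215, 42
tpot = (e2e - ttft) / (tokens - 1)  # ~0.047 s per token
speed = 1 / tpot                    # ~21 tokens/s
print(f"TPOT: {tpot:.3f}s, Speed: {speed:.1f} tokens/s")
```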
### Batch Metrics

For models that process multiple requests in batches:

- **Batch Summary**: Aggregated metrics across all outputs in a batch
- **Total Time**: Total batch processing time
- **Total Tokens**: Sum of all output tokens in the batch
- **Avg Speed**: Average throughput across the entire batch (tokens/s)
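The aggregation amounts to dividing total output tokens by total wall-clock time, roughly as in this sketch (field names are illustrative assumptions, not the actual lmms-eval implementation):

```python
# Illustrative batch summary aggregation; names are assumptions, not the
# actual lmms-eval code.
def batch_summary(total_time: float, output_token_counts: list[int]) -> dict:
    total_tokens = sum(output_token_counts)
    avg_speed = total_tokens / total_time if total_time > 0 else 0.0
    return {
        "total_time": total_time,    # seconds for the whole batch
        "total_tokens": total_tokens,
        "avg_speed": avg_speed,      # tokens/s across the entire batch
    }

print(batch_summary(2.145, [42, 40, 30, 16]))  # -> avg_speed ~59.7 tokens/s
```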
## Log Format Examples

### Individual Output Metrics

```
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
```

### Batch Summary Metrics

```
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
```

### Single Request Metrics (Non-batched)

```
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
```
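For post-hoc analysis, these lines are easy to parse back out of a saved log. A minimal sketch, assuming only the formats shown above (the regex matches both the per-output and non-batched lines):

```python
import re

# The pattern mirrors the documented metric line format.
METRIC_RE = re.compile(
    r"E2E: (?P<e2e>[\d.]+)s, TTFT: (?P<ttft>[\d.]+)s, TPOT: (?P<tpot>[\d.]+)s, "
    r"Speed: (?P<speed>[\d.]+) tokens/s, Output tokens: (?P<tokens>\d+)"
)

line = "Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42"
m = METRIC_RE.search(line)
if m:
    metrics = {k: float(v) for k, v in m.groupdict().items()}
    print(metrics)  # {'e2e': 2.145, 'ttft': 0.215, 'tpot': 0.048, ...}
```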
## Supported Models

The following chat models automatically log throughput metrics:

- **sglang_runtime** (`/lmms_eval/models/chat/sglang.py`)
- **vllm_chat** (`/lmms_eval/models/chat/vllm.py`)
- **llava_hf_chat** (`/lmms_eval/models/chat/llava_hf.py`)
- **openai_compatible_chat** (`/lmms_eval/models/chat/openai_compatible.py`)
- **qwen2_5_vl_chat** (`/lmms_eval/models/chat/qwen2_5_vl.py`)
- **huggingface_chat** (`/lmms_eval/models/chat/huggingface.py`)
## Usage

Throughput metrics are logged automatically during evaluation; no additional configuration is required. To view the metrics:

1. **Command Line Output**: Metrics appear in real time during evaluation
2. **Log Files**: Metrics are written to log files if logging is configured
3. **Log Level**: Ensure the logging level is set to INFO or lower so metric lines appear (see the sketch below)
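If you need to adjust verbosity or add a file sink programmatically, a minimal sketch assuming a loguru-based logger (which lmms-eval uses for its eval logging; the file name here is an example, not a project convention):

```python
import sys
from loguru import logger

logger.remove()                       # drop the default stderr sink
logger.add(sys.stderr, level="INFO")  # INFO or lower so metric lines appear
logger.add("eval.log", level="INFO")  # also persist metrics to a file
```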
### Example Evaluation Command

```bash
python -m lmms_eval \
    --model sglang_runtime \
    --model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks mme \
    --batch_size 4 \
    --log_samples \
    --output_path ./results
```
## Metric Calculation Details

### TTFT Calculation

- **Available from model**: Uses the actual first-token timestamp when the backend provides one
- **Estimated**: When unavailable, estimated as 10% of total inference time

### TPOT Calculation

- **Formula**: `(E2E_latency - TTFT) / (output_tokens - 1)`
- **Single-token responses**: Uses the full E2E latency as TPOT
- **Batch processing**: Divides total batch time by the number of outputs

### Speed Calculation

- **Formula**: `1 / TPOT` (when TPOT > 0)
- **Edge cases**: Set to 0 for single-token responses or zero TPOT
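Putting the rules above together, a consolidated sketch (variable names are illustrative, not the actual lmms-eval code):

```python
# Consolidated sketch of the TTFT/TPOT/Speed rules documented above.
def compute_metrics(e2e: float, output_tokens: int, ttft: float | None = None):
    if ttft is None:
        ttft = 0.10 * e2e  # no first-token timestamp: estimate as 10% of E2E
    if output_tokens > 1:
        tpot = (e2e - ttft) / (output_tokens - 1)
        speed = 1.0 / tpot if tpot > 0 else 0.0
    else:
        tpot = e2e   # single-token response: full E2E latency as TPOT
        speed = 0.0  # edge case: speed reported as 0
    return ttft, tpot, speed

# Approximately reproduces the non-batched log example above.
print(compute_metrics(1.823, 32, ttft=0.182))
```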
## Performance Analysis

### Interpreting Metrics

- **High TTFT**: May indicate model loading, prompt processing, or scheduling delays
- **High TPOT**: Suggests slower token generation, possibly due to model size or hardware limitations
- **Low Speed**: Indicates throughput bottlenecks in token generation
- **E2E vs. TTFT + TPOT**: A large gap between E2E and TTFT + TPOT × (tokens − 1) may indicate batching overhead or other system delays

### Optimization Insights

- **Reduce TTFT**: Optimize prompt processing, use model caching, improve scheduling
- **Reduce TPOT**: Use faster hardware, optimize model inference, adjust batch sizes
- **Batch Efficiency**: Compare individual vs. batch metrics to assess batching benefits
## Troubleshooting

### Missing Metrics

- Ensure the model supports throughput logging (see the supported models list)
- Check that the log level is set to INFO or lower
- Verify the model implementation includes timing instrumentation

### Inaccurate Metrics

- TTFT estimates may be imprecise when actual first-token timing is unavailable
- Batch metrics are averaged across multiple outputs, so per-output variance is not captured
- Network latency may affect metrics for API-based models
## Implementation Notes

Throughput metrics are implemented consistently across chat models using:

- `time.time()` for wall-clock timing measurements
- Model-specific metadata when available (e.g., SGLang and vLLM native metrics)
- Fallback estimation methods for missing timing data
- A structured logging format for consistent parsing and analysis
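An illustrative version of this instrumentation pattern, as a sketch under stated assumptions: `generate` stands in for a real model call, and the fallback follows the 10% estimate described above.

```python
import time

def generate(prompt: str) -> list[str]:
    """Stand-in for a real model's generation call."""
    time.sleep(0.05)
    return ["hello", "world"]

start = time.time()
tokens = generate("describe the image")
e2e = time.time() - start  # wall-clock end-to-end latency

ttft = None  # would come from backend metadata (e.g., SGLang/vLLM) if exposed
if ttft is None:
    ttft = 0.10 * e2e  # fallback estimation, as noted above

print(f"Inference metrics - E2E: {e2e:.3f}s, TTFT: {ttft:.3f}s, "
      f"Output tokens: {len(tokens)}")
```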