# Throughput Metrics Documentation

This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.

## Overview

LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.

## Metrics Explained

### Core Timing Metrics

- **E2E (End-to-End Latency)**: Total time from request submission to response completion (in seconds)
- **TTFT (Time to First Token)**: Time from request submission until the first token is generated (in seconds)
- **TPOT (Time Per Output Token)**: Average time to generate each output token after the first (in seconds)
- **Speed (Inference Speed)**: Token generation rate, calculated as 1/TPOT (tokens per second)
- **Output Tokens**: Number of tokens generated in the response

### Batch Metrics

For models that process multiple requests in batches:

- **Batch Summary**: Aggregated metrics across all outputs in a batch
- **Total Time**: Total batch processing time
- **Total Tokens**: Sum of all output tokens in the batch
- **Avg Speed**: Average throughput across the entire batch (tokens/s)

## Log Format Examples

### Individual Output Metrics

```
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
```

### Batch Summary Metrics

```
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
```

### Single Request Metrics (Non-batched)

```
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
```
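Because these log lines use a fixed, structured format, they can be scraped from evaluation logs for offline analysis. Below is a minimal parsing sketch; the `parse_metrics_line` helper and its regex are illustrative assumptions based on the examples above, not part of the LMMS-Eval API.

```python
import re

# Hypothetical pattern matching the individual-output and single-request
# log lines shown above (assumption: the format stays exactly as documented).
METRICS_PATTERN = re.compile(
    r"E2E: (?P<e2e>[\d.]+)s, "
    r"TTFT: (?P<ttft>[\d.]+)s, "
    r"TPOT: (?P<tpot>[\d.]+)s, "
    r"Speed: (?P<speed>[\d.]+) tokens/s, "
    r"Output tokens: (?P<tokens>\d+)"
)

def parse_metrics_line(line: str) -> dict | None:
    """Extract timing metrics from a single log line, or None if absent."""
    match = METRICS_PATTERN.search(line)
    if match is None:
        return None
    return {
        "e2e_s": float(match.group("e2e")),
        "ttft_s": float(match.group("ttft")),
        "tpot_s": float(match.group("tpot")),
        "speed_tps": float(match.group("speed")),
        "output_tokens": int(match.group("tokens")),
    }

line = "Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42"
print(parse_metrics_line(line))
# {'e2e_s': 2.145, 'ttft_s': 0.215, 'tpot_s': 0.048, 'speed_tps': 20.8, 'output_tokens': 42}
```

Batch summary lines use a different format and would need a second pattern; lines without metrics simply return `None`.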
## Supported Models

The following chat models automatically log throughput metrics:

- **sglang_runtime** (`/lmms_eval/models/chat/sglang.py`)
- **vllm_chat** (`/lmms_eval/models/chat/vllm.py`)
- **llava_hf_chat** (`/lmms_eval/models/chat/llava_hf.py`)
- **openai_compatible_chat** (`/lmms_eval/models/chat/openai_compatible.py`)
- **qwen2_5_vl_chat** (`/lmms_eval/models/chat/qwen2_5_vl.py`)
- **huggingface_chat** (`/lmms_eval/models/chat/huggingface.py`)

## Usage

Throughput metrics are logged automatically during evaluation; no additional configuration is required. To view the metrics:

1. **Command Line Output**: Metrics appear in real time during evaluation
2. **Log Files**: Metrics are written to log files if logging is configured
3. **Log Level**: Ensure the logging level is set to INFO or lower so the metrics are emitted

### Example Evaluation Command

```bash
python -m lmms_eval \
    --model sglang_runtime \
    --model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
    --tasks mme \
    --batch_size 4 \
    --log_samples \
    --output_path ./results
```

## Metric Calculation Details

### TTFT Calculation

- **Available from model**: Uses the actual first-token timestamp when the model provides one
- **Estimated**: When unavailable, estimated as 10% of the total inference time

### TPOT Calculation

- **Formula**: `(E2E_latency - TTFT) / (output_tokens - 1)`
- **Single-token responses**: Uses the full E2E latency as TPOT
- **Batch processing**: Divides the total batch time by the number of outputs

### Speed Calculation

- **Formula**: `1 / TPOT` (when TPOT > 0)
- **Edge cases**: Reported as 0 for single-token responses or zero TPOT

A minimal sketch of these calculation rules appears at the end of this document.

## Performance Analysis

### Interpreting Metrics

- **High TTFT**: May indicate model loading, prompt processing, or scheduling delays
- **High TPOT**: Suggests slower token generation, possibly due to model size or hardware limitations
- **Low Speed**: Indicates throughput bottlenecks in token generation
- **E2E vs. TTFT + TPOT**: Large differences may point to batching overhead or system delays

### Optimization Insights

- **Reduce TTFT**: Optimize prompt processing, use model caching, improve scheduling
- **Reduce TPOT**: Use faster hardware, optimize model inference, adjust batch sizes
- **Batch Efficiency**: Compare individual vs. batch metrics to assess the benefits of batching

## Troubleshooting

### Missing Metrics

- Ensure the model supports throughput logging (see the supported models list above)
- Check that the log level is set to INFO or lower
- Verify that the model implementation includes timing instrumentation

### Inaccurate Metrics

- TTFT estimates may be imprecise when actual timing is unavailable
- Batch metrics average across multiple outputs, so per-output variance is not captured
- Network latency may affect metrics for API-based models

## Implementation Notes

Throughput metrics are implemented consistently across chat models using:

- `time.time()` for wall-clock timing measurements
- Model-specific metadata when available (e.g., SGLang and vLLM native metrics)
- Fallback estimation methods for missing timing data
- A structured logging format for consistent parsing and analysis
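To make the calculation rules concrete, the sketch below derives TTFT, TPOT, and Speed from raw timings following the rules in "Metric Calculation Details", including the 10% TTFT estimation fallback and the single-token edge cases. The `compute_metrics` function is a hypothetical helper for illustration, not the actual LMMS-Eval implementation; in the real models, the end-to-end latency comes from `time.time()` deltas around inference, as noted above.

```python
def compute_metrics(e2e_s: float, output_tokens: int, ttft_s: float | None = None) -> dict:
    """Derive TTFT, TPOT, and Speed from raw timings (illustrative only)."""
    # Fallback estimation: when no first-token timestamp is available,
    # TTFT is estimated as 10% of the total inference time.
    if ttft_s is None:
        ttft_s = 0.1 * e2e_s

    if output_tokens > 1:
        # TPOT averages over the tokens generated after the first one.
        tpot_s = (e2e_s - ttft_s) / (output_tokens - 1)
    else:
        # Single-token responses use the full E2E latency as TPOT.
        tpot_s = e2e_s

    # Speed is 1/TPOT; single-token responses and zero TPOT report 0.
    speed_tps = 1.0 / tpot_s if output_tokens > 1 and tpot_s > 0 else 0.0

    return {"e2e_s": e2e_s, "ttft_s": ttft_s, "tpot_s": tpot_s, "speed_tps": speed_tps}

# Example: 41 output tokens generated in 2.0s with a measured TTFT of 0.2s.
# TPOT = (2.0 - 0.2) / 40 = 0.045s, Speed = 1 / 0.045 ≈ 22.2 tokens/s.
print(compute_metrics(e2e_s=2.0, output_tokens=41, ttft_s=0.2))
```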