---
title: On-Device LLM Throughput Calculator
emoji: 🚀
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit
---
# On-Device LLM Throughput Calculator

A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.
## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that a Large Language Model (LLM) can achieve on devices constrained by memory bandwidth. It supports different attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding window attention impacts throughput at different context lengths.
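
The attention mechanism largely determines how much KV cache must be read back from memory at every decode step. As a rough sketch (the function and numbers below are illustrative, not taken from `src/app.py`), the per-token KV-cache footprint can be estimated like this, assuming fp16 cache entries:

```python
# Illustrative sketch, not the app's implementation: approximate KV-cache
# bytes written per generated token, assuming fp16 (2-byte) cache entries.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys + values across all layers."""
    return num_layers * num_kv_heads * head_dim * 2 * bytes_per_value

# Example: a hypothetical 32-layer model with head_dim 128.
gqa = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)  # GQA, 8 KV heads
mqa = kv_bytes_per_token(num_layers=32, num_kv_heads=1, head_dim=128)  # MQA, 1 KV head
print(f"GQA: {gqa / 1024:.0f} KiB/token, MQA: {mqa / 1024:.0f} KiB/token")
```

Fewer KV heads (MQA) or a compressed latent cache (MLA) shrink this footprint, which is why they raise the throughput ceiling at long context lengths.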
## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support
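
Sliding window attention caps how much of the KV cache is read at each step. A minimal sketch of that effect (the function name is illustrative, not the app's API):

```python
# Minimal sketch: with a sliding window, the KV cache read per decode step
# stops growing once the context exceeds the window size.
def cached_tokens(context_length: int, window_size: int | None) -> int:
    """Tokens whose KV entries must be read at each decode step."""
    if window_size is None:            # full attention: cache grows with context
        return context_length
    return min(context_length, window_size)

print(cached_tokens(32_768, 4_096))  # 4096 -> cache cost plateaus at the window
print(cached_tokens(2_048, 4_096))   # 2048 -> below the window, same as full attention
```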
## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput
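
For reference, the inputs map roughly to the quantities below (a hypothetical sketch; the field names are illustrative and may not match `src/app.py`):

```python
from dataclasses import dataclass

# Hypothetical input schema, for illustration only.
@dataclass
class DeviceSpec:
    name: str
    memory_bandwidth_gb_s: float  # step 1: device memory bandwidth

@dataclass
class ModelSpec:
    num_params_b: float           # step 2: parameter count in billions
    num_layers: int
    num_kv_heads: int             # step 3: varies with the attention mechanism
    head_dim: int

device = DeviceSpec(name="Example device", memory_bandwidth_gb_s=100.0)
model = ModelSpec(num_params_b=8.0, num_layers=32, num_kv_heads=8, head_dim=128)
```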
## Installation

```bash
pip install -r requirements.txt
```
## Running Locally

```bash
cd src
python app.py
```
## Theory

The calculations are based on memory bandwidth bottlenecks as described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput).

The basic formula for tokens per second:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
## License

MIT