---
title: LLM Inference Profiler
emoji: 
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
short_description: Interactive calculator for LLM inference performance
---

# LLM Inference Profiler

An interactive educational tool for understanding LLM inference performance. Explore how model size, GPU specs, and workload characteristics affect the prefill and decode phases.

## Features

- **Time Analysis**: See how long prefill and decode take, and why decode dominates total latency
- **GPU Utilization**: Understand why prefill achieves 50-70% utilization while decode is often below 5%
- **Arithmetic Intensity**: Visualize the compute-bound vs. memory-bound nature of each phase
- **KV Cache Growth**: Watch how memory usage grows during generation
- **Waste Factor**: See how much recomputation the KV cache saves
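The timing intuition behind the Time Analysis view can be sketched with a back-of-envelope model. This is illustrative only, not the app's actual code; the hardware numbers (A100: roughly 312 TFLOP/s dense FP16, ~2 TB/s HBM bandwidth) and the LLaMA-7B parameter count are assumptions for the example:

```python
# Back-of-envelope latency model (sketch; assumed figures, batch size 1).
PARAMS = 7e9           # LLaMA-7B parameter count (assumed)
BYTES_PER_PARAM = 2    # FP16
PEAK_FLOPS = 312e12    # A100 dense FP16 throughput (assumed)
MEM_BW = 2.0e12        # A100 HBM bandwidth in bytes/s (assumed)

def prefill_time(prompt_tokens, efficiency=0.5):
    """Prefill is compute-bound: ~2 FLOPs per parameter per token,
    processed in parallel at some fraction of peak throughput."""
    flops = 2 * PARAMS * prompt_tokens
    return flops / (PEAK_FLOPS * efficiency)

def decode_time(gen_tokens):
    """Decode is memory-bound: every step must re-read all weights
    from HBM, so bandwidth (not FLOPs) sets the step time."""
    bytes_per_step = PARAMS * BYTES_PER_PARAM
    return gen_tokens * bytes_per_step / MEM_BW

t_prefill = prefill_time(512)   # 512-token prompt
t_decode = decode_time(256)     # 256 generated tokens
print(f"prefill: {t_prefill * 1e3:.0f} ms, decode: {t_decode * 1e3:.0f} ms")
```

Even though prefill touches twice as many tokens here, decode takes dozens of times longer, which is exactly the asymmetry the tool visualizes.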

## Key Concepts Demonstrated

- **Prefill Phase**: Processes all prompt tokens in parallel (compute-bound)
- **Decode Phase**: Generates tokens one at a time (memory-bound)
- **KV Cache**: Trades memory for compute by storing each token's Key/Value vectors
- **Arithmetic Intensity**: The FLOPs-per-byte ratio that determines whether a phase is compute-limited or memory-limited
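The compute-bound vs. memory-bound distinction follows from comparing arithmetic intensity to the GPU's "ridge point" (peak FLOP/s divided by memory bandwidth). A minimal sketch, using assumed A100 and LLaMA-7B figures and counting only weight traffic:

```python
# Arithmetic-intensity sketch (illustrative assumptions, not the app's code):
# each forward pass does ~2 FLOPs per parameter per token, and must read
# every FP16 weight (2 bytes/param) from HBM at least once.
PARAMS = 7e9                    # LLaMA-7B (assumed)
A100_RIDGE = 312e12 / 2.0e12    # peak FLOP/s over bytes/s ~= 156 FLOP/byte

def arithmetic_intensity(tokens_per_pass):
    flops = 2 * PARAMS * tokens_per_pass
    bytes_read = BYTES_PER_PARAM = 2 * PARAMS  # weights dominate; KV traffic ignored
    return flops / bytes_read

for tokens, phase in [(512, "prefill"), (1, "decode")]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > A100_RIDGE else "memory-bound"
    print(f"{phase}: {ai:.0f} FLOP/byte -> {bound}")
```

Prefill over a 512-token prompt amortizes the weight reads across all tokens and lands well above the ridge point; decode, at one token per pass, sits near 1 FLOP/byte, far below it.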

## Based On

This tool accompanies the "Foundations of LLM Inference" article series, which covers:

  1. The Autoregressive Loop and Redundancy Problem
  2. The KV Cache
  3. Prefill and Decode Phases
  4. Why Prefill is Compute-Bound
  5. Why Decode is Memory-Bound
  6. The Utilization Paradox
  7. Optimization Strategies

## Usage

  1. Select a model (LLaMA-7B, 13B, 70B, etc.)
  2. Choose a GPU (A100, H100, T4, etc.)
  3. Set precision (FP16, INT8, INT4)
  4. Adjust prompt and generation lengths
  5. Experiment with batch size to see its effect on decode

The tool will show you timing breakdowns, utilization metrics, and interactive visualizations.
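As a worked example of the KV cache growth you will see, the cache size for the settings above can be estimated by hand. This sketch assumes standard LLaMA-7B shapes (32 layers, hidden size 4096, full multi-head attention) and FP16 storage; it is not the app's internal formula:

```python
# KV cache size estimate (assumed LLaMA-7B shapes, FP16 = 2 bytes/value).
def kv_cache_bytes(seq_len, batch=1, layers=32, hidden=4096, dtype_bytes=2):
    # Factor of 2 covers the Key and the Value vector stored per token per layer.
    return 2 * layers * seq_len * hidden * dtype_bytes * batch

per_token_mib = kv_cache_bytes(1) / 2**20
total_gib = kv_cache_bytes(2048) / 2**30
print(f"{per_token_mib:.2f} MiB per token, {total_gib:.2f} GiB at 2048 tokens")
```

Note the linear growth in both sequence length and batch size: doubling the batch doubles the cache, which is why batch size has such a visible effect on decode in the tool.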