<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM Quantization Formats &amp; CUDA Support Reference</title>
<style>
    * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }

    body {
        font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', sans-serif;
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        min-height: 100vh;
        padding: 2rem;
        color: #333;
    }

    .container {
        max-width: 1400px;
        margin: 0 auto;
        background: white;
        border-radius: 16px;
        box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
        overflow: hidden;
    }

    header {
        background: linear-gradient(135deg, #1e3c72 0%, #2a5298 100%);
        color: white;
        padding: 3rem 2rem;
        text-align: center;
    }

    h1 {
        font-size: 2.5rem;
        margin-bottom: 0.5rem;
        font-weight: 700;
    }

    .subtitle {
        font-size: 1.1rem;
        opacity: 0.9;
        font-weight: 300;
    }

    .content {
        padding: 2rem;
    }

    .section {
        margin-bottom: 3rem;
    }

    h2 {
        color: #1e3c72;
        font-size: 1.8rem;
        margin-bottom: 1.5rem;
        border-bottom: 3px solid #667eea;
        padding-bottom: 0.5rem;
    }

    .table-wrapper {
        overflow-x: auto;
        margin-bottom: 2rem;
        border-radius: 8px;
        box-shadow: 0 2px 8px rgba(0, 0, 0, 0.1);
    }

    table {
        width: 100%;
        border-collapse: collapse;
        font-size: 0.95rem;
    }

    thead {
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: white;
    }

    th {
        padding: 1rem;
        text-align: left;
        font-weight: 600;
        text-transform: uppercase;
        font-size: 0.85rem;
        letter-spacing: 0.5px;
    }

    td {
        padding: 0.9rem 1rem;
        border-bottom: 1px solid #e5e7eb;
    }

    tbody tr {
        transition: background-color 0.2s;
    }

    tbody tr:hover {
        background-color: #f3f4f6;
    }

    tbody tr:nth-child(even) {
        background-color: #f9fafb;
    }

    .highlight {
        background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%);
        font-weight: 600;
    }

    .cuda-grid {
        display: grid;
        grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
        gap: 1.5rem;
        margin-top: 1rem;
    }

    .cuda-card {
        background: linear-gradient(135deg, #f3f4f6 0%, #e5e7eb 100%);
        padding: 1.5rem;
        border-radius: 8px;
        border-left: 4px solid #667eea;
    }

    .cuda-card h3 {
        color: #1e3c72;
        font-size: 1.2rem;
        margin-bottom: 0.5rem;
    }

    .cuda-card p {
        color: #6b7280;
        line-height: 1.6;
    }

    .notes-grid {
        display: grid;
        gap: 1rem;
        margin-top: 1rem;
    }

    .note-item {
        background: #f0f9ff;
        padding: 1rem;
        border-radius: 6px;
        border-left: 3px solid #3b82f6;
    }

    .note-item strong {
        color: #1e40af;
    }

    footer {
        background: #f9fafb;
        padding: 2rem;
        text-align: center;
        color: #6b7280;
        border-top: 1px solid #e5e7eb;
    }

    @media (max-width: 768px) {
        body {
            padding: 1rem;
        }

        h1 {
            font-size: 1.8rem;
        }

        .content {
            padding: 1rem;
        }

        table {
            font-size: 0.85rem;
        }

        th, td {
            padding: 0.7rem 0.5rem;
        }
    }
</style>
</head>
<body>
<div class="container">
    <header>
        <h1>🚀 Quantization Formats &amp; CUDA Support</h1>
        <p class="subtitle">Complete reference guide for LLM quantization methods and hardware requirements</p>
    </header>
    <div class="content">
        <div class="section">
            <h2>📊 Quantization Formats</h2>
            <div class="table-wrapper">
                <table>
                    <thead>
                        <tr>
                            <th>Format</th>
                            <th>Bits</th>
                            <th>Min Compute Capability</th>
                            <th>GPU Examples</th>
                            <th>Notes</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td><strong>FP16</strong></td>
                            <td>16</td>
                            <td>5.3+</td>
                            <td>GTX 10-series, RTX 2000+</td>
                            <td>Native half precision</td>
                        </tr>
                        <tr>
                            <td><strong>BF16</strong></td>
                            <td>16</td>
                            <td>8.0+</td>
                            <td>A100, RTX 3090, 4090</td>
                            <td>Better dynamic range than FP16</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>FP8 (E4M3/E5M2)</strong></td>
                            <td>8</td>
                            <td>8.9+</td>
                            <td>H100, H200, L40S</td>
                            <td>Transformer Engine support</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>MXFP8</strong></td>
                            <td>8</td>
                            <td>8.9+</td>
                            <td>H100, H200, Blackwell</td>
                            <td>Block size 32, E8M0 scale</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>FP6</strong></td>
                            <td>6</td>
                            <td>10.0+</td>
                            <td>GB200, B100, B200</td>
                            <td>Blackwell native support</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>MXFP6</strong></td>
                            <td>6</td>
                            <td>8.9+</td>
                            <td>H100+, Blackwell</td>
                            <td>E2M3/E3M2, block size 32</td>
                        </tr>
                        <tr>
                            <td><strong>INT8</strong></td>
                            <td>8</td>
                            <td>6.1+</td>
                            <td>GTX 1080+, Tesla P4/P40</td>
                            <td>Wide compatibility</td>
                        </tr>
                        <tr>
                            <td><strong>INT4</strong></td>
                            <td>4</td>
                            <td>7.5+</td>
                            <td>RTX 2080+, T4, V100</td>
                            <td>CUTLASS kernels</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>MXFP4</strong></td>
                            <td>4</td>
                            <td>9.0+</td>
                            <td>H100, H200, GB200</td>
                            <td>E2M1, block size 32, OpenAI</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>NVFP4</strong></td>
                            <td>4</td>
                            <td>10.0+</td>
                            <td>GB200, B100, B200</td>
                            <td>E2M1, block size 16, dual-scale</td>
                        </tr>
                        <tr>
                            <td><strong>GPTQ</strong></td>
                            <td>2-8</td>
                            <td>7.0+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>Group-wise quantization</td>
                        </tr>
                        <tr>
                            <td><strong>AWQ</strong></td>
                            <td>4</td>
                            <td>7.5+</td>
                            <td>RTX 3000+, A100+</td>
                            <td>Activation-aware</td>
                        </tr>
                        <tr>
                            <td><strong>QuIP</strong></td>
                            <td>2-4</td>
                            <td>7.0+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>Incoherence processing</td>
                        </tr>
                        <tr>
                            <td><strong>QuIP#</strong></td>
                            <td>2-4</td>
                            <td>8.0+</td>
                            <td>RTX 3090+, A100+</td>
                            <td>E8P lattice codebook</td>
                        </tr>
                        <tr>
                            <td><strong>GGUF/GGML</strong></td>
                            <td>2-8</td>
                            <td>6.1+</td>
                            <td>GTX 1060+, most GPUs</td>
                            <td>CPU fallback available</td>
                        </tr>
                        <tr>
                            <td><strong>EXL2</strong></td>
                            <td>2-8</td>
                            <td>7.5+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>Variable bit-width</td>
                        </tr>
                        <tr>
                            <td><strong>NF4</strong></td>
                            <td>4</td>
                            <td>7.0+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>QLoRA, normal float</td>
                        </tr>
                        <tr>
                            <td><strong>GGUF-IQ</strong></td>
                            <td>1-8</td>
                            <td>6.1+</td>
                            <td>GTX 1060+</td>
                            <td>Importance matrix</td>
                        </tr>
                    </tbody>
                </table>
            </div>
        </div>
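Rows like INT4 vs FP16 translate directly into memory budgets: weights-only footprint is roughly parameters × bits-per-weight ÷ 8, plus the per-block scale overhead most quantized formats carry. A minimal back-of-envelope sketch (the 7B parameter count and the one-FP16-scale-per-32-weights overhead are illustrative assumptions, not properties of any particular format):

```python
def weight_memory_gib(n_params: float, bits: float, overhead_bits: float = 0.0) -> float:
    """Back-of-envelope weights-only memory for a model.

    overhead_bits: extra bits per weight spent on scales/zero-points,
    e.g. a 4-bit format with one FP16 scale per 32-weight block adds
    16 / 32 = 0.5 bits per weight.
    """
    total_bits = n_params * (bits + overhead_bits)
    return total_bits / 8 / 1024**3  # bytes -> GiB

# A hypothetical 7B-parameter model:
fp16 = weight_memory_gib(7e9, 16)        # ~13.0 GiB
int4 = weight_memory_gib(7e9, 4, 0.5)    # ~3.7 GiB with block-32 FP16 scales
```

The ~3.6x ratio between the two is where the "~3-4x memory reduction" figure for 4-bit formats comes from; activations and KV cache are extra.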
        <div class="section">
            <h2>🎯 CUDA Compute Capabilities</h2>
            <div class="cuda-grid">
                <div class="cuda-card">
                    <h3>6.1 - Pascal</h3>
                    <p>GTX 10-series, Tesla P4/P40 (the P100 is 6.0)</p>
                </div>
                <div class="cuda-card">
                    <h3>7.0 - Volta</h3>
                    <p>Tesla V100, Titan V</p>
                </div>
                <div class="cuda-card">
                    <h3>7.5 - Turing</h3>
                    <p>RTX 2000 series, T4, Quadro RTX 6000</p>
                </div>
                <div class="cuda-card">
                    <h3>8.0 - Ampere</h3>
                    <p>A100, A30</p>
                </div>
                <div class="cuda-card">
                    <h3>8.6 - Ampere</h3>
                    <p>RTX 3000 series consumer (incl. RTX 3090)</p>
                </div>
                <div class="cuda-card">
                    <h3>8.9 - Ada Lovelace</h3>
                    <p>RTX 4000 series, L40S</p>
                </div>
                <div class="cuda-card">
                    <h3>9.0 - Hopper</h3>
                    <p>H100, H200</p>
                </div>
                <div class="cuda-card">
                    <h3>10.0 - Blackwell</h3>
                    <p>GB200, B100, B200</p>
                </div>
            </div>
        </div>
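The format table can be turned into a quick compatibility check. The sketch below hard-codes a subset of the minimum compute capabilities listed above (the dictionary is just a transcription for illustration, not any library's API); on a live system with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to feed in:

```python
# Minimum compute capability per format, transcribed from the table above.
MIN_CC = {
    "FP16": (5, 3), "INT8": (6, 1), "GPTQ": (7, 0), "NF4": (7, 0),
    "INT4": (7, 5), "AWQ": (7, 5), "EXL2": (7, 5),
    "BF16": (8, 0), "FP8": (8, 9), "MXFP4": (9, 0),
    "FP6": (10, 0), "NVFP4": (10, 0),
}

def supported_formats(cc: tuple) -> list:
    """Formats whose minimum compute capability is <= cc.

    Tuple comparison handles (major, minor) correctly, e.g.
    (8, 9) < (9, 0) < (10, 0).
    """
    return sorted(fmt for fmt, floor in MIN_CC.items() if cc >= floor)

# e.g. an RTX 4090 (Ada Lovelace) reports (8, 9): FP8 is available,
# but MXFP4 (9.0+) and NVFP4 (10.0+) are not.
```

In practice a format also needs kernel support in the serving stack (vLLM, TensorRT-LLM, llama.cpp, etc.), so treat this as a necessary, not sufficient, condition.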
        <div class="section">
            <h2>⚡ Performance Notes</h2>
            <div class="notes-grid">
                <div class="note-item">
                    <strong>FP8/MXFP8:</strong> Transformer Engine support; up to 2x faster than BF16 on H100+
                </div>
                <div class="note-item">
                    <strong>NVFP4:</strong> Native on Blackwell; up to 2x faster than FP8, ~3.5x memory reduction vs FP16
                </div>
                <div class="note-item">
                    <strong>MXFP4:</strong> Requires H100+ (CC 9.0), uses Triton kernels, OpenAI GPT-OSS format
                </div>
                <div class="note-item">
                    <strong>MXFP6:</strong> Training &amp; inference on H100+, better accuracy than MXFP4
                </div>
                <div class="note-item">
                    <strong>QuIP#:</strong> 2-4 bit with E8P lattice codebook, 50% of peak bandwidth on RTX 4090
                </div>
                <div class="note-item">
                    <strong>INT4/GPTQ/AWQ:</strong> ~3-4x memory reduction, 1.5-2x faster inference
                </div>
                <div class="note-item">
                    <strong>GGUF:</strong> Best CPU/GPU hybrid performance
                </div>
                <div class="note-item">
                    <strong>EXL2:</strong> Highest quality at low bit-widths, slower than GPTQ
                </div>
            </div>
        </div>
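The MX formats above (MXFP4/MXFP6/MXFP8) all share one power-of-two scale, encoded as E8M0, across a block of 32 elements. A deliberately simplified toy of that mechanism for FP4 E2M1, operating on plain Python lists and ignoring the exact shared-exponent selection rule of the OCP MX specification:

```python
import math

# FP4 E2M1 representable magnitudes (sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def mx_block_dequant(block):
    """Toy MX-style block quantization: pick one shared power-of-two
    scale per block (cf. the E8M0 scale factor), snap each element to
    the nearest point on the E2M1 grid, and return the reconstruction.
    Real MX uses 32-element blocks and a spec-defined exponent rule;
    this is only meant to show the mechanics.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude fits within
    # E2M1's maximum representable value of 6.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for v in block:
        q = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(q * scale, v))
    return out
```

Values that are powers of two times a grid point round-trip exactly (e.g. `[12.0, 3.0]` with scale 2); everything else lands on the nearest representable point, which is where the MXFP4-vs-MXFP6 accuracy gap comes from.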
    </div>
    <footer>
        <p>Last updated: November 2025 | Reference for LLM quantization formats and CUDA compute capability requirements</p>
    </footer>
</div>
</body>
</html>