<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM Quantization Formats &amp; CUDA Support Reference</title>
<style>
    * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }

    body {
        font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', sans-serif;
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        min-height: 100vh;
        padding: 2rem;
        color: #333;
    }

    .container {
        max-width: 1400px;
        margin: 0 auto;
        background: white;
        border-radius: 16px;
        box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
        overflow: hidden;
    }

    header {
        background: linear-gradient(135deg, #1e3c72 0%, #2a5298 100%);
        color: white;
        padding: 3rem 2rem;
        text-align: center;
    }

    h1 {
        font-size: 2.5rem;
        margin-bottom: 0.5rem;
        font-weight: 700;
    }

    .subtitle {
        font-size: 1.1rem;
        opacity: 0.9;
        font-weight: 300;
    }

    .content {
        padding: 2rem;
    }

    .section {
        margin-bottom: 3rem;
    }

    h2 {
        color: #1e3c72;
        font-size: 1.8rem;
        margin-bottom: 1.5rem;
        border-bottom: 3px solid #667eea;
        padding-bottom: 0.5rem;
    }

    .table-wrapper {
        overflow-x: auto;
        margin-bottom: 2rem;
        border-radius: 8px;
        box-shadow: 0 2px 8px rgba(0, 0, 0, 0.1);
    }

    table {
        width: 100%;
        border-collapse: collapse;
        font-size: 0.95rem;
    }

    thead {
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: white;
    }

    th {
        padding: 1rem;
        text-align: left;
        font-weight: 600;
        text-transform: uppercase;
        font-size: 0.85rem;
        letter-spacing: 0.5px;
    }

    td {
        padding: 0.9rem 1rem;
        border-bottom: 1px solid #e5e7eb;
    }

    tbody tr {
        transition: background-color 0.2s;
    }

    tbody tr:hover {
        background-color: #f3f4f6;
    }

    tbody tr:nth-child(even) {
        background-color: #f9fafb;
    }

    .highlight {
        background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%);
        font-weight: 600;
    }

    .cuda-grid {
        display: grid;
        grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
        gap: 1.5rem;
        margin-top: 1rem;
    }

    .cuda-card {
        background: linear-gradient(135deg, #f3f4f6 0%, #e5e7eb 100%);
        padding: 1.5rem;
        border-radius: 8px;
        border-left: 4px solid #667eea;
    }

    .cuda-card h3 {
        color: #1e3c72;
        font-size: 1.2rem;
        margin-bottom: 0.5rem;
    }

    .cuda-card p {
        color: #6b7280;
        line-height: 1.6;
    }

    .notes-grid {
        display: grid;
        gap: 1rem;
        margin-top: 1rem;
    }

    .note-item {
        background: #f0f9ff;
        padding: 1rem;
        border-radius: 6px;
        border-left: 3px solid #3b82f6;
    }

    .note-item strong {
        color: #1e40af;
    }

    footer {
        background: #f9fafb;
        padding: 2rem;
        text-align: center;
        color: #6b7280;
        border-top: 1px solid #e5e7eb;
    }

    @media (max-width: 768px) {
        body {
            padding: 1rem;
        }

        h1 {
            font-size: 1.8rem;
        }

        .content {
            padding: 1rem;
        }

        table {
            font-size: 0.85rem;
        }

        th, td {
            padding: 0.7rem 0.5rem;
        }
    }
</style>
</head>
<body>
<div class="container">
    <header>
        <h1>🚀 Quantization Formats &amp; CUDA Support</h1>
        <p class="subtitle">Complete reference guide for LLM quantization methods and hardware requirements</p>
    </header>
    <div class="content">
        <div class="section">
            <h2>📊 Quantization Formats</h2>
            <div class="table-wrapper">
                <table>
                    <thead>
                        <tr>
                            <th>Format</th>
                            <th>Bits</th>
                            <th>Min Compute Capability</th>
                            <th>GPU Examples</th>
                            <th>Notes</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td><strong>FP16</strong></td>
                            <td>16</td>
                            <td>5.3+</td>
                            <td>GTX 10-series, RTX 2000+</td>
                            <td>Native half precision</td>
                        </tr>
                        <tr>
                            <td><strong>BF16</strong></td>
                            <td>16</td>
                            <td>8.0+</td>
                            <td>A100, RTX 3090, 4090</td>
                            <td>Better dynamic range than FP16</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>FP8 (E4M3/E5M2)</strong></td>
                            <td>8</td>
                            <td>8.9+</td>
                            <td>H100, H200, L40S</td>
                            <td>Transformer Engine support</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>MXFP8</strong></td>
                            <td>8</td>
                            <td>8.9+</td>
                            <td>H100, H200, Blackwell</td>
                            <td>Block size 32, E8M0 scale</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>FP6</strong></td>
                            <td>6</td>
                            <td>10.0+</td>
                            <td>GB200, B100, B200</td>
                            <td>Blackwell native support</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>MXFP6</strong></td>
                            <td>6</td>
                            <td>8.9+</td>
                            <td>H100+, Blackwell</td>
                            <td>E2M3/E3M2, block size 32</td>
                        </tr>
                        <tr>
                            <td><strong>INT8</strong></td>
                            <td>8</td>
                            <td>6.1+</td>
                            <td>GTX 1080+, Tesla P4/P40</td>
                            <td>Wide compatibility</td>
                        </tr>
                        <tr>
                            <td><strong>INT4</strong></td>
                            <td>4</td>
                            <td>7.5+</td>
                            <td>RTX 2080+, T4, V100</td>
                            <td>CUTLASS kernels</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>MXFP4</strong></td>
                            <td>4</td>
                            <td>9.0+</td>
                            <td>H100, H200, GB200</td>
                            <td>E2M1, block size 32, OpenAI</td>
                        </tr>
                        <tr class="highlight">
                            <td><strong>NVFP4</strong></td>
                            <td>4</td>
                            <td>10.0+</td>
                            <td>GB200, B100, B200</td>
                            <td>E2M1, block size 16, dual-scale</td>
                        </tr>
                        <tr>
                            <td><strong>GPTQ</strong></td>
                            <td>2-8</td>
                            <td>7.0+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>Group-wise quantization</td>
                        </tr>
                        <tr>
                            <td><strong>AWQ</strong></td>
                            <td>4</td>
                            <td>7.5+</td>
                            <td>RTX 3000+, A100+</td>
                            <td>Activation-aware</td>
                        </tr>
                        <tr>
                            <td><strong>QuIP</strong></td>
                            <td>2-4</td>
                            <td>7.0+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>Incoherence processing</td>
                        </tr>
                        <tr>
                            <td><strong>QuIP#</strong></td>
                            <td>2-4</td>
                            <td>8.0+</td>
                            <td>RTX 3090+, A100+</td>
                            <td>E8P lattice codebook</td>
                        </tr>
                        <tr>
                            <td><strong>GGUF/GGML</strong></td>
                            <td>2-8</td>
                            <td>6.1+</td>
                            <td>GTX 1060+, most GPUs</td>
                            <td>CPU fallback available</td>
                        </tr>
                        <tr>
                            <td><strong>EXL2</strong></td>
                            <td>2-8</td>
                            <td>7.5+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>Variable bit-width</td>
                        </tr>
                        <tr>
                            <td><strong>NF4</strong></td>
                            <td>4</td>
                            <td>7.0+</td>
                            <td>RTX 2000+, V100+</td>
                            <td>QLoRA, normal float</td>
                        </tr>
                        <tr>
                            <td><strong>GGUF-IQ</strong></td>
                            <td>1-8</td>
                            <td>6.1+</td>
                            <td>GTX 1060+</td>
                            <td>Importance matrix</td>
                        </tr>
                    </tbody>
                </table>
            </div>
        </div>
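Rows like INT4 vs FP16 translate directly into memory budgets: weights-only footprint is roughly parameters × bits-per-weight ÷ 8, plus the per-block scale overhead most quantized formats carry. A minimal back-of-envelope sketch (the 7B parameter count and the one-FP16-scale-per-32-weights overhead are illustrative assumptions, not properties of any particular format):

```python
def weight_memory_gib(n_params: float, bits: float, overhead_bits: float = 0.0) -> float:
    """Back-of-envelope weights-only memory for a model.

    overhead_bits: extra bits per weight spent on scales/zero-points,
    e.g. a 4-bit format with one FP16 scale per 32-weight block adds
    16 / 32 = 0.5 bits per weight.
    """
    total_bits = n_params * (bits + overhead_bits)
    return total_bits / 8 / 1024**3  # bytes -> GiB

# A hypothetical 7B-parameter model:
fp16 = weight_memory_gib(7e9, 16)        # ~13.0 GiB
int4 = weight_memory_gib(7e9, 4, 0.5)    # ~3.7 GiB with block-32 FP16 scales
```

The ~3.6x ratio between the two is where the "~3-4x memory reduction" figure for 4-bit formats comes from; activations and KV cache are extra.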
        <div class="section">
            <h2>🎯 CUDA Compute Capabilities</h2>
            <div class="cuda-grid">
                <div class="cuda-card">
                    <h3>6.1 - Pascal</h3>
                    <p>GTX 10-series, Tesla P4/P40 (the P100 is 6.0)</p>
                </div>
                <div class="cuda-card">
                    <h3>7.0 - Volta</h3>
                    <p>Tesla V100, Titan V</p>
                </div>
                <div class="cuda-card">
                    <h3>7.5 - Turing</h3>
                    <p>RTX 2000 series, T4, Quadro RTX 6000</p>
                </div>
                <div class="cuda-card">
                    <h3>8.0 - Ampere</h3>
                    <p>A100, A30</p>
                </div>
                <div class="cuda-card">
                    <h3>8.6 - Ampere</h3>
                    <p>RTX 3000 series consumer (incl. RTX 3090)</p>
                </div>
                <div class="cuda-card">
                    <h3>8.9 - Ada Lovelace</h3>
                    <p>RTX 4000 series, L40S</p>
                </div>
                <div class="cuda-card">
                    <h3>9.0 - Hopper</h3>
                    <p>H100, H200</p>
                </div>
                <div class="cuda-card">
                    <h3>10.0 - Blackwell</h3>
                    <p>GB200, B100, B200</p>
                </div>
            </div>
        </div>
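The format table can be turned into a quick compatibility check. The sketch below hard-codes a subset of the minimum compute capabilities listed above (the dictionary is just a transcription for illustration, not any library's API); on a live system with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to feed in:

```python
# Minimum compute capability per format, transcribed from the table above.
MIN_CC = {
    "FP16": (5, 3), "INT8": (6, 1), "GPTQ": (7, 0), "NF4": (7, 0),
    "INT4": (7, 5), "AWQ": (7, 5), "EXL2": (7, 5),
    "BF16": (8, 0), "FP8": (8, 9), "MXFP4": (9, 0),
    "FP6": (10, 0), "NVFP4": (10, 0),
}

def supported_formats(cc: tuple) -> list:
    """Formats whose minimum compute capability is <= cc.

    Tuple comparison handles (major, minor) correctly, e.g.
    (8, 9) < (9, 0) < (10, 0).
    """
    return sorted(fmt for fmt, floor in MIN_CC.items() if cc >= floor)

# e.g. an RTX 4090 (Ada Lovelace) reports (8, 9): FP8 is available,
# but MXFP4 (9.0+) and NVFP4 (10.0+) are not.
```

In practice a format also needs kernel support in the serving stack (vLLM, TensorRT-LLM, llama.cpp, etc.), so treat this as a necessary, not sufficient, condition.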
        <div class="section">
            <h2>⚡ Performance Notes</h2>
            <div class="notes-grid">
                <div class="note-item">
                    <strong>FP8/MXFP8:</strong> Transformer Engine support; up to 2x faster than BF16 on H100+
                </div>
                <div class="note-item">
                    <strong>NVFP4:</strong> Native on Blackwell; up to 2x faster than FP8, ~3.5x memory reduction vs FP16
                </div>
                <div class="note-item">
                    <strong>MXFP4:</strong> Requires H100+ (CC 9.0), uses Triton kernels, OpenAI GPT-OSS format
                </div>
                <div class="note-item">
                    <strong>MXFP6:</strong> Training &amp; inference on H100+, better accuracy than MXFP4
                </div>
                <div class="note-item">
                    <strong>QuIP#:</strong> 2-4 bit with E8P lattice codebook, 50% of peak bandwidth on RTX 4090
                </div>
                <div class="note-item">
                    <strong>INT4/GPTQ/AWQ:</strong> ~3-4x memory reduction, 1.5-2x faster inference
                </div>
                <div class="note-item">
                    <strong>GGUF:</strong> Best CPU/GPU hybrid performance
                </div>
                <div class="note-item">
                    <strong>EXL2:</strong> Highest quality at low bit-widths, slower than GPTQ
                </div>
            </div>
        </div>
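The MX formats above (MXFP4/MXFP6/MXFP8) all share one power-of-two scale, encoded as E8M0, across a block of 32 elements. A deliberately simplified toy of that mechanism for FP4 E2M1, operating on plain Python lists and ignoring the exact shared-exponent selection rule of the OCP MX specification:

```python
import math

# FP4 E2M1 representable magnitudes (sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def mx_block_dequant(block):
    """Toy MX-style block quantization: pick one shared power-of-two
    scale per block (cf. the E8M0 scale factor), snap each element to
    the nearest point on the E2M1 grid, and return the reconstruction.
    Real MX uses 32-element blocks and a spec-defined exponent rule;
    this is only meant to show the mechanics.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude fits within
    # E2M1's maximum representable value of 6.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for v in block:
        q = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(q * scale, v))
    return out
```

Values that are powers of two times a grid point round-trip exactly (e.g. `[12.0, 3.0]` with scale 2); everything else lands on the nearest representable point, which is where the MXFP4-vs-MXFP6 accuracy gap comes from.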
    </div>
    <footer>
        <p>Last updated: November 2025 | Reference for LLM quantization formats and CUDA compute capability requirements</p>
    </footer>
</div>
</body>
</html>