WeReCooking/Gemma-3-R1984-27B-EXL3

Nekochu committed 9 days ago

Commit 3aef82c (verified) · 1 Parent(s): d119f92

README with CSS, KLD plot, branch table


README.md +167 -158


@@ -1,159 +1,168 @@
-# Gemma-3-R1984-27B EXL3
-EXL3 quants of [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B) (27B).
-Each bpw variant is a separate branch. Attention tensors boosted to 8bpw via recompilation.
-Docs: [exllamav3 convert.md](https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md)
-## Branches
-| Branch | Target | Actual bpw | Method |
-|---|---|---|---|
-| `2.96bpw_H6` | 2.0 | 2.96 | base + recompile |
-| `2.98bpw_H6` | 2.5 | 2.98 | optimized (2.0+3.0) + recompile |
-| `3.80bpw_H6` | 3.0 | 3.80 | base + recompile |
-| `3.83bpw_H6` | 3.5 | 3.83 | optimized (3.0+5.0) + recompile |
-| `3.97bpw_H6` | 4.0 | 3.97 | optimized (3.0+5.0) + recompile |
-| `4.13bpw_H6` | 4.5 | 4.13 | optimized (3.0+5.0) + recompile |
-| `5.48bpw_H6` | 5.0 | 5.48 | base + recompile |
-| `6.32bpw_H6` | 6.0 | 6.32 | base + recompile |
-H6 = head_bits 6. All variants recompiled with `*.self_attn.*` boosted to 8bpw.
-Gemma-3 is dense (no MoE), so `*.shared_experts.*` is not applicable.
-## Build recipe
-### 1. Base quants
-```bash
-python convert.py -i <hf-model> -o <out> -w <work> -b <bpw>
-```
-5 base quants: 2.0, 3.0, 5.0, 6.0, 8.0 bpw.
-### 2. KLD measurement
-```bash
-python util/measure.py -r <hf-model> -ms 128 -i <2.0bpw> <8.0bpw> -o measurement.json
-```
-Reusable across all optimized targets. Included in main branch.
-### 3. Optimization (mixed-precision)
-```bash
-python util/optimize.py -i <lo-bpw> <hi-bpw> -m measurement.json -o <out> -b <target>
-```
-KLD-guided tensor replacement: tensors that matter most get higher-bpw versions.
-### 4. Recompilation (attn override)
-```yaml
-sources:
-  - id: 8
-    model_dir: /path/to/8.0bpw
-overrides:
-  - key: "*.self_attn.*"
-    source: 8
-```
-```bash
-python util/recompile.py -i <input> -o <final> -or override.yaml
-```
-Actual bpw is determined after recompile (attn@8bpw shifts average up).
-## Files
-- `main` branch: `measurement.json` (KLD map)
-- Each bpw branch: quantized model shards + config + tokenizer
-## Credits
-- Base model: [VIDraft/Gemma-3-R1984-27B](https://huggingface.co/VIDraft/Gemma-3-R1984-27B)
-- Quantization: [exllamav3](https://github.com/turboderp-org/exllamav3) v0.0.34
 ---
-# EXL3 Optimization Guide
-## Targets
-`2.5bpw_H6 3.0bpw_H6 3.5bpw_H6 4.0bpw_H6 4.5bpw_H6 5.0bpw_H6 6.0bpw_H6`
-| Target | Action |
-|---|---|
-| 2.5bpw_H6 | optimized |
-| 3.0bpw_H6 | direct convert |
-| 3.5bpw_H6 | optimized |
-| 4.0bpw_H6 | optimized |
-| 4.5bpw_H6 | optimized |
-| 5.0bpw_H6 | direct convert |
-| 6.0bpw_H6 | direct convert |
-## Overview
-Dynamic EXL3 quants mix tensor precision, similar to mixed-precision GGUFs. There are two frameworks:
-- **Optimization**
-- **Recompilation**
-Usually, optimization and recompilation are used together: create a mixed quant through optimization, then run recompilation on top of it.
-## Optimization
-1. Start with two quants at different bpw, for example 2bpw and 3bpw.
-2. `measure.py` measures KLD differences by replacing layer groups in the lower-bpw quant with groups from the higher-bpw quant; standard EXL3 calibration data is used.
-3. The resulting `measurement.json` can be reused. You only have to create it once, no matter how many mixed quants you make.
-4. `optimize.py` uses that `measurement.json` to create a third quant from two source quants, replacing the tensors that matter most with higher-bpw tensors.
-Measurement takes about 20min to an hour for big models. Optimization takes about 30s-1m.
-```bash
-python util/measure.py -i /path/to/model-2bpw /path/to/model-3bpw -r /path/to/hf-model -o measurement.json -cr 10 -cc 1024 -d 0
-python util/optimize.py -i /path/to/model-2bpw /path/to/model-3bpw -m measurement.json -o /path/to/model-optimized -b 2.5 -ss 8192
-```
-Alternative measure form with `-ms`:
-```bash
-python util/measure.py -r /workspace/models/original-model -ms 128 -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.5bpw -o /workspace/measurement.json
-```
-Optimize example:
-```bash
-python util/optimize.py -i /workspace/models/quant-2.5bpw /workspace/models/quant-3.0bpw -o /workspace/models/new-quant-2.75bpw -m /workspace/measurement.json -b 2.75
-```
-## Recompilation
-`override.yaml` replaces tensors in one quant with tensors from another quant. It is manual optimization. The source notes that attention and shared expert tensors were replaced with 8bpw tensors for all optimized quants. Recompilation takes about 30s-1m, and the actual new bpw is known after recompilation is done.
-### Multi-source example
-```yaml
-sources:
-  - id: 6
-    model_dir: /path/to/6bpw
-  - id: 8
-    model_dir: /path/to/8bpw
-overrides:
-  - key: "*.self_attn.*"
-    source: 6
-  - key: "*.shared_experts.*"
-    source: 8
-```
-### GLM-Air example
-The GLM-Air example replaces attention and shared experts with 8bpw tensors, and layers 2, 43, 1, 29 with 5bpw tensors because `measurement.json` showed those layers had the worst KLD.
-```yaml
-sources:
-  - id: 8
-    model_dir: /workspace/models/quants-8.0bpw
-  - id: 5
-    model_dir: /workspace/models/quants-5.0bpw
-overrides:
-  - key: "*.self_attn.*"
-    source: 8
-  - key: "*.shared_experts.*"
-    source: 8
-  - key: "model.layers.2.*"
-    source: 5
-  - key: "model.layers.43.*"
-    source: 5
-  - key: "model.layers.1.*"
-    source: 5
-  - key: "model.layers.29.*"
-    source: 5
-```
-```bash
-python util/recompile.py -i /workspace/models/quant-2.75bpw -o /workspace/models/quant-recompiled -or override.yaml
-```

 ---
+base_model: VIDraft/Gemma-3-R1984-27B
+base_model_relation: quantized
+quantized_by: WeReCooking
+pipeline_tag: text-generation
+tags:
+- exl3
+---
+<style>
+  .container-dark { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Arial, sans-serif; line-height: 1.6; color: #d4d4d4; }
+  a { color: #569cd6; text-decoration: none; font-weight: 600; }
+  a:hover { text-decoration: underline; }
+  .card-dark { background-color: #252526; border-radius: 12px; padding: 24px; margin-bottom: 20px; box-shadow: 0 4px 12px rgba(0,0,0,0.3); border: 1px solid #3c3c3c; }
+  .card-dark h1 { font-size: 2.2em; color: #ffffff; text-align: center; margin-bottom: 10px; }
+  .card-dark .subtitle { text-align: center; font-size: 1.1em; color: #a0a0a0; }
+  .card-dark h2 { font-size: 1.5em; margin-top: 0; padding-bottom: 10px; border-bottom: 1px solid #3c3c3c; color: #c586c0; }
+  .styled-table { display: table; border: none; width: 100%; font-size: 0.95em; }
+  .styled-table thead th { background-color: #333333; color: #c586c0; text-align: left; padding: 12px 15px; }
+  .styled-table td { padding: 0; border-bottom: 1px solid #3c3c3c; }
+  .styled-table tbody tr { transition: background-color 0.1s ease; }
+  .styled-table tbody tr:hover { background-color: #3a3a3a; }
+  .styled-table tr:last-child td { border-bottom: none; }
+  .styled-table td a { display: block; padding: 12px 15px; }
+  .styled-table td a.fake-link { text-decoration:none; color:inherit; }
+  details { margin-top: 20px; border: 1px solid #3c3c3c; border-radius: 8px; overflow: hidden; }
+  summary { cursor: pointer; padding: 12px 18px; background-color: #6A5ACD; font-weight: 600; display: flex; align-items: center; gap: 10px; justify-content: space-between; list-style: none; }
+  summary::-webkit-details-marker { display: none; }
+  summary:hover { filter: brightness(1.1); }
+  summary::after { content: ''; display: inline-block; width: 8px; height: 8px; border-bottom: 2px solid white; border-right: 2px solid white; transform: rotate(45deg); transition: transform 0.3s ease; }
+  details[open] > summary::after { transform: rotate(225deg); }
+  .details-content { padding: 18px; }
+</style>
+<div class="container-dark">
+  <div class="card-dark">
+    <h1>Gemma-3-R1984-27B EXL3</h1>
+    <p class="subtitle">
+      EXL3 quants of <a href="https://huggingface.co/VIDraft/Gemma-3-R1984-27B">VIDraft/Gemma-3-R1984-27B</a>
+      using <a href="https://github.com/turboderp-org/exllamav3/">exllamav3</a> v0.0.34
+    </p>
+  </div>
+  <div class="card-dark">
+    <h2>KL Divergence vs VRAM</h2>
+    <img src="kld_plot.png" alt="KLD plot" style="width:100%; border-radius: 8px;" />
+    <p class="subtitle">Reference: 6.0bpw. Lower KLD = closer to reference quality. Measured on wikitext-2 (20 rows, 2048 ctx).</p>
+  </div>
+  <div class="card-dark">
+    <h2>Quants</h2>
+    <table class="styled-table">
+      <thead>
+        <tr><th>Branch</th><th>BPW</th><th>Head</th><th>VRAM (GB)</th><th>KLD</th><th>Type</th></tr>
+      </thead>
+      <tbody>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.0bpw_H6">2.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.0bpw_H6">2.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.0bpw_H6">7.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.0bpw_H6">0.450</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.0bpw_H6">base</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.50bpw_H6">2.50bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.50bpw_H6">2.50</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.50bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.50bpw_H6">8.5</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.50bpw_H6">0.389</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/2.50bpw_H6">optimized</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.0bpw_H6">3.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.0bpw_H6">3.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.0bpw_H6">9.9</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.0bpw_H6">0.110</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.0bpw_H6">base</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.35bpw_H6">3.35bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.35bpw_H6">3.35</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.35bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.35bpw_H6">11.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.35bpw_H6">0.088</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/3.35bpw_H6">optimized</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/4.0bpw_H6">4.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/4.0bpw_H6">4.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/4.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/4.0bpw_H6">12.9</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/4.0bpw_H6">0.039</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/4.0bpw_H6">base</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/5.0bpw_H6">5.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/5.0bpw_H6">5.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/5.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/5.0bpw_H6">15.9</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/5.0bpw_H6">0.015</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/5.0bpw_H6">base</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/6.0bpw_H6">6.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/6.0bpw_H6">6.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/6.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/6.0bpw_H6">19.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/6.0bpw_H6">ref</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/6.0bpw_H6">base</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/7.0bpw_H6">7.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/7.0bpw_H6">7.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/7.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/7.0bpw_H6">~22</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/7.0bpw_H6">-</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/7.0bpw_H6">base</a></td>
+        </tr>
+        <tr>
+          <td><a href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/8.0bpw_H6">8.0bpw_H6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/8.0bpw_H6">8.0</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/8.0bpw_H6">6</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/8.0bpw_H6">~29</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/8.0bpw_H6">-</a></td>
+          <td><a class="fake-link" href="https://huggingface.co/WeReCooking/Gemma-3-R1984-27B-EXL3/tree/8.0bpw_H6">base</a></td>
+        </tr>
+      </tbody>
+    </table>
+    <p class="subtitle">Optimized variants use KLD-guided tensor mixing + attn@5bpw recompile. Bases are direct converts. 7.0/8.0bpw KLD not measured (exceed 32 GB VRAM).</p>
+  </div>
+  <div class="card-dark">
+    <h2>Download</h2>
+    <details>
+      <summary>Download commands</summary>
+      <div class="details-content">
+        <b>Install CLI:</b>
+        <pre><code>pip install -U "huggingface_hub[cli]"</code></pre>
+        <b>Download a specific quant:</b>
+        <pre><code>huggingface-cli download WeReCooking/Gemma-3-R1984-27B-EXL3 --revision "4.0bpw_H6" --local-dir ./</code></pre>
+      </div>
+    </details>
+    <p class="subtitle">EXL3 quants run with <a href="https://github.com/theroyallab/tabbyapi">TabbyAPI</a> or any exllamav3-compatible backend.</p>
+  </div>
+  <div class="card-dark">
+    <h2>Build Details</h2>
+    <details>
+      <summary>How these were made</summary>
+      <div class="details-content">
+        <p><b>Base quants:</b> <code>convert.py -b &lt;bpw&gt;</code> (2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)</p>
+        <p><b>KLD measurement:</b> <code>measure.py -r &lt;ref&gt; -ms 128 -i &lt;2.0bpw&gt; &lt;8.0bpw&gt;</code></p>
+        <p><b>Optimized (2.50, 3.35):</b> <code>optimize.py -i &lt;lo&gt; &lt;hi&gt; -m measurement.json -b &lt;target&gt;</code> then <code>recompile.py -or override.yaml</code> with <code>*.self_attn.* -> 5bpw</code></p>
+        <p><b>Note:</b> Gemma-3 is dense (no MoE), so <code>*.shared_experts.*</code> is not applicable. Only optimized variants are recompiled; bases stay at exact bpw.</p>
+        <p>Docs: <a href="https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md">exllamav3 convert.md</a></p>
+      </div>
+    </details>
+  </div>
+  <div class="card-dark">
+    <h2>Files</h2>
+    <p><code>main</code> branch: <code>measurement.json</code> (KLD map) + <code>kld_plot.png</code></p>
+    <p>Each bpw branch: quantized model shards + config + tokenizer</p>
+  </div>
+</div>
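
The new README mentions `recompile.py -or override.yaml` with `*.self_attn.* -> 5bpw` but no longer shows the override file itself. Below is a minimal sketch of how such a file could be generated, assuming the `sources` / `overrides` layout from the removed README section still applies. The helper name `build_override_yaml` and the path `/path/to/5.0bpw` are hypothetical placeholders, not part of exllamav3.

```python
def build_override_yaml(source_id: int, model_dir: str, keys: list[str]) -> str:
    """Render an override file in the sources/overrides layout shown in the
    removed README section. Hypothetical helper, not an exllamav3 API."""
    lines = [
        "sources:",
        f"  - id: {source_id}",
        f"    model_dir: {model_dir}",
        "overrides:",
    ]
    for key in keys:
        # Every listed tensor pattern is taken from the single source quant.
        lines.append(f'  - key: "{key}"')
        lines.append(f"    source: {source_id}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder path; point this at the local 5.0bpw quant directory.
    print(build_override_yaml(5, "/path/to/5.0bpw", ["*.self_attn.*"]), end="")
```

Writing the printed text to `override.yaml` should give a file that can be passed to `recompile.py -or override.yaml`, as in the build notes. The format is inferred from the earlier README, so check it against the exllamav3 docs before relying on it.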