---
license: unknown
---

# Tensor Type Testing

> [!TIP]
> Skip to the bottom of this document for a TL;DR

For more info, see [llama.cpp #12511: Handle user-defined quantization levels for additional tensors](https://github.com/ggml-org/llama.cpp/pull/12511) by @EAddario.

Testing was done by @ddh0 using [this branch](https://github.com/EAddario/llama.cpp/tree/quantize) as of commit [5a304b8](https://github.com/EAddario/llama.cpp/commit/5a304b8e26b8c53f43e8d12515e52f9bb7d199f0), with libllama built for Linux with CUDA.

## Quantization naming scheme

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf` means:
- Model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) are in Q4_K
- TYPE_FFN (MLP / feed-forward tensors) are in Q4_K
- TYPE_ATTN (K, Q, V attention and attention output tensors) are in Q8_0
- TYPE_OUTPUT (output tensor) is in Q8_0
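
The naming scheme above is mechanical enough to script. The following is a minimal sketch, not part of the original test tooling: the helper names `quant_name` and `parse_quant_name` are hypothetical, and the parser assumes that type names never contain a hyphen.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the E/F/A/O naming scheme (illustration only).

# Build a file name from a model name and the four tensor-group types.
quant_name() {
    local model=$1 embd=$2 ffn=$3 attn=$4 output=$5
    printf '%s-E%s-F%s-A%s-O%s.gguf\n' "$model" "$embd" "$ffn" "$attn" "$output"
}

# Split a file name back into its parts; returns 1 if it does not follow the scheme.
parse_quant_name() {
    local name=${1##*/}   # drop any leading directory
    local re='^(.+)-E([^-]+)-F([^-]+)-A([^-]+)-O([^-]+)\.gguf$'
    if [[ $name =~ $re ]]; then
        printf 'model=%s embd=%s ffn=%s attn=%s output=%s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" \
            "${BASH_REMATCH[4]}" "${BASH_REMATCH[5]}"
    else
        echo "not a per-tensor-type file name: $name" >&2
        return 1
    fi
}

quant_name Llama-3.1-8B-Instruct Q4_K Q4_K Q8_0 Q8_0
parse_quant_name Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf
```

Plain single-type files such as `Llama-3.2-3B-Q8_0.gguf` do not match the pattern, so the parser rejects them rather than guessing.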

---

## Command template

```bash
TYPE_EMBD=GGML_TYPE
TYPE_FFN=GGML_TYPE
TYPE_ATTN=GGML_TYPE
TYPE_OUTPUT=GGML_TYPE
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

# The positional type after the destination path is the base quantization type;
# this template passes $TYPE_FFN there.
./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type "$TYPE_EMBD" \
    --tensor-type ffn_down="$TYPE_FFN" --tensor-type ffn_gate="$TYPE_FFN" --tensor-type ffn_up="$TYPE_FFN" \
    --tensor-type attn_k="$TYPE_ATTN" --tensor-type attn_q="$TYPE_ATTN" --tensor-type attn_v="$TYPE_ATTN" --tensor-type attn_out="$TYPE_ATTN" \
    --output-tensor-type "$TYPE_OUTPUT" \
    "$SRC_GGUF" "$DST_GGUF" "$TYPE_FFN" "$N_THREADS"
```
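
Each test below fills this template with one tensor group set to a low type and the rest set to a high type. As a dry-run sketch of that pattern (it only prints the commands, it does not run them), the loop below generates all four variants. The function name `crush_cmd` and the values of `MODEL`, `GGUF_DIR`, `LOW` and `HIGH` are illustrative assumptions, as is the `<model>-BF16.gguf` source file name; the flags are exactly those of the template.

```bash
#!/usr/bin/env bash
# Dry run: print one llama-quantize command per "crushed" tensor group.
# All values below are placeholders for illustration.
MODEL=Model-Name
GGUF_DIR=/my/model
LOW=Q2_K
HIGH=Q8_0
N_THREADS=4

crush_cmd() {
    # $1 names the group to crush: EMBD, FFN, ATTN or OUTPUT.
    local embd=$HIGH ffn=$HIGH attn=$HIGH output=$HIGH
    case "$1" in
        EMBD)   embd=$LOW ;;
        FFN)    ffn=$LOW ;;
        ATTN)   attn=$LOW ;;
        OUTPUT) output=$LOW ;;
        *) echo "unknown group: $1" >&2; return 1 ;;
    esac
    local src="${GGUF_DIR}/${MODEL}-BF16.gguf"
    local dst="${GGUF_DIR}/${MODEL}-E${embd}-F${ffn}-A${attn}-O${output}.gguf"
    # As in the template, the FFN type doubles as the positional base type.
    echo ./llama.cpp/build/bin/llama-quantize \
        --token-embedding-type "$embd" \
        --tensor-type "ffn_down=$ffn" --tensor-type "ffn_gate=$ffn" --tensor-type "ffn_up=$ffn" \
        --tensor-type "attn_k=$attn" --tensor-type "attn_q=$attn" \
        --tensor-type "attn_v=$attn" --tensor-type "attn_out=$attn" \
        --output-tensor-type "$output" \
        "$src" "$dst" "$ffn" "$N_THREADS"
}

for group in EMBD FFN ATTN OUTPUT; do
    crush_cmd "$group"
done
```

Dropping the `echo` would execute the commands instead of printing them; keeping it makes the sketch safe to run on a machine without llama.cpp built.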

---

## Commands used for Llama 3.2

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0 ⚠️

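> **This quant was not included in the testing because Llama 3.2 3B has no separate output tensor (its output layer reuses the token embeddings), so the output tensor type has nothing to apply to. The resulting file is the same as a normal Q8_0.**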

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Llama 3.2 3B

```
          Number of input texts: 10
Shortest input length in tokens: 55
 Longest input length in tokens: 4678
 Average input length in tokens: 1605.5
   Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
                                      Llama-3.2-3B-Q2_K.gguf -- 1.5045396089553833
                   Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.34473341703414917
                   Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 4.693454742431641
                   Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 1.5281471014022827
                                      Llama-3.2-3B-Q8_0.gguf -- 0.0027674345765262842
--------------------------------------------------------------------------------
```
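
Each `Average MSD` line in a log like the one above is the mean of the ten per-prompt values listed before it. As a sanity check, the sketch below recomputes those means from a saved copy of the log. The function name `avg_msd_per_model` is hypothetical, and it assumes the log keeps the exact `Now processing:` and `-- Prompt N:` line shapes shown above.

```bash
#!/usr/bin/env bash
# Recompute the per-model mean of the "-- Prompt N: value" lines in an MSD log.
# Reads the files given as arguments, or stdin if none are given.
avg_msd_per_model() {
    awk '
        # A new model section starts here; the third field is the file name.
        /^Now processing: / {
            model = $3
            order[++k] = model
            next
        }
        # Per-prompt lines look like "-- Prompt 3: 1.105..."; the value is field 4.
        /^-- Prompt [0-9]+: / && model != "" {
            sum[model] += $4
            n[model]++
        }
        END {
            for (i = 1; i <= k; i++)
                if (n[order[i]] > 0)
                    printf "%s %.6f\n", order[i], sum[order[i]] / n[order[i]]
        }
    ' "$@"
}
```

For example, `avg_msd_per_model results.txt` prints one line per model (with `results.txt` standing in for wherever the log was saved). The recomputed means can differ from the logged ones in the last digits, since awk uses double-precision arithmetic.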

---

## Commands used for Qwen2.5-14B

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Qwen2.5-14B

```
          Number of input texts: 10
Shortest input length in tokens: 60
 Longest input length in tokens: 4801
 Average input length in tokens: 1589.3
   Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
                                       Qwen2.5-14B-Q2_K.gguf -- 1.3933132886886597
                    Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.01663336530327797
                    Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 2.160086154937744
                    Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 0.6060687303543091
                    Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf -- 0.9216748476028442
                                       Qwen2.5-14B-Q8_0.gguf -- 0.0053400141187012196
--------------------------------------------------------------------------------
```

---

## TL;DR

Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better). Values are rounded to three decimal places:

|              | Q2_K      | Crush TYPE_EMBD | Crush TYPE_FFN | Crush TYPE_ATTN | Crush TYPE_OUTPUT | Q8_0      |
| ------------ | --------- | --------------- | -------------- | --------------- | ----------------- | --------- |
| Llama 3.2 3B | 1.505     | 0.345           | 4.693          | 1.528           | N/A               | 0.003     |
| Qwen2.5-14B  | 1.393     | 0.017           | 2.160          | 0.606           | 0.922             | 0.005     |
| **Average**  | **1.449** | **0.181**       | **3.427**      | **1.067**       | **0.922**         | **0.004** |

In short, aggressive quantization of the FFN tensors causes the greatest deviation from BF16, and aggressive quantization of the token embeddings causes the least. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to some combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0, depending on your situation.
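
For reference, here is a sketch of how the numbers above fit together. The logs do not state which quantities are compared, so this assumes the deviation is taken between the output logits of the BF16 baseline and of the quantized model at every token position. The symbols $z$, $T_p$ (number of tokens in prompt $p$) and $V$ (vocabulary size) are introduced here for illustration only.

$$
\mathrm{MSD}_p = \frac{1}{T_p V} \sum_{t=1}^{T_p} \sum_{v=1}^{V} \left( z^{\mathrm{BF16}}_{p,t,v} - z^{\mathrm{quant}}_{p,t,v} \right)^2,
\qquad
\overline{\mathrm{MSD}} = \frac{1}{10} \sum_{p=0}^{9} \mathrm{MSD}_p
$$

The per-model figures in the table are the $\overline{\mathrm{MSD}}$ values from the raw results, and the **Average** row is the plain mean over the models that have a value in that column. For the FFN column, for instance:

$$
\frac{4.693455 + 2.160086}{2} \approx 3.427
$$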