ddh0 committed on
Commit 0a930e1 · verified · 1 Parent(s): 92ca806

Delete tensor_type_testing.txt

---

# Tensor Type Testing

---

## Quantization naming scheme:

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf`:
- Model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) are in Q4_K
- TYPE_FFN (MLP / feed-forward tensors) are in Q4_K
- TYPE_ATTN (K, Q, V attention and attention output tensors) are in Q8_0
- TYPE_OUTPUT (output tensor) is in Q8_0

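The scheme is mechanical enough to parse with a regular expression. A minimal sketch (the `parse_quant_name` helper and its regex are my own, not part of llama.cpp or any tool):

```python
import re

# Pattern for the naming scheme:
# Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
_SCHEME = re.compile(
    r"^(?P<model>.+)-E(?P<embd>[A-Z0-9_]+)-F(?P<ffn>[A-Z0-9_]+)"
    r"-A(?P<attn>[A-Z0-9_]+)-O(?P<output>[A-Z0-9_]+)\.gguf$"
)

def parse_quant_name(filename: str) -> dict:
    """Split a filename following the scheme into model name and four types."""
    m = _SCHEME.match(filename)
    if m is None:
        raise ValueError(f"not in E/F/A/O naming scheme: {filename}")
    return m.groupdict()
```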
---

## Command template:

```bash
TYPE_EMBD=GGML_TYPE
TYPE_FFN=GGML_TYPE
TYPE_ATTN=GGML_TYPE
TYPE_OUTPUT=GGML_TYPE
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

# Positional arguments: source GGUF, destination GGUF, base quant type
# (TYPE_FFN here; it applies to any tensor without an explicit override),
# and thread count.
./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

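When scripting many variants, the same flag layout can be generated programmatically. A hedged sketch (the `quantize_args` helper is mine; the flag names mirror the template above):

```python
def quantize_args(embd, ffn, attn, output, src, dst, n_threads=4):
    """Assemble the llama-quantize argument list used by the template above.

    Each --tensor-type NAME=TYPE flag overrides one tensor group; the
    positional type (here the FFN type) applies to anything not overridden.
    """
    args = ["./llama.cpp/build/bin/llama-quantize",
            "--token-embedding-type", embd]
    for name in ("ffn_down", "ffn_gate", "ffn_up"):
        args += ["--tensor-type", f"{name}={ffn}"]
    for name in ("attn_k", "attn_q", "attn_v", "attn_out"):
        args += ["--tensor-type", f"{name}={attn}"]
    args += ["--output-tensor-type", output, src, dst, ffn, str(n_threads)]
    return args
```

The resulting list can be handed to `subprocess.run` directly, avoiding shell quoting issues with long commands like the ones below.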
---

## Commands used for Llama 3.2

---

### Llama 3.2 3B - Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Llama 3.2 3B - Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Llama 3.2 3B - Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Llama 3.2 3B - Crush output tensor to Q2_K, otherwise Q8_0

> **This result was not included because Llama 3.2 3B has no separate output tensor (its output projection is tied to the token embeddings)! The resulting file is the same as a normal Q8_0.**

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

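Whether `--output-tensor-type` will have any effect can be checked up front by looking for a separate output tensor in the model's tensor list (obtainable e.g. with the `gguf` Python package's `GGUFReader`). A sketch using the conventional GGUF tensor names, with illustrative name lists rather than real dumps:

```python
def has_separate_output_tensor(tensor_names):
    """Return True if the model has its own output projection tensor.

    Models with tied embeddings (like Llama 3.2 3B) carry no "output.weight";
    the "token_embd.weight" tensor is reused for the output projection, so
    --output-tensor-type has nothing to act on.
    """
    return "output.weight" in tensor_names

# Illustrative tensor-name lists, not dumped from real files.
tied = ["token_embd.weight", "blk.0.attn_q.weight", "output_norm.weight"]
untied = tied + ["output.weight"]
```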
---

## Raw results for Llama 3.2 3B

```
Number of input texts: 10
Shortest input length in tokens: 55
Longest input length in tokens: 4678
Average input length in tokens: 1605.5
Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
Llama-3.2-3B-Q2_K.gguf -- 1.5045396089553833
Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.34473341703414917
Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 4.693454742431641
Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 1.5281471014022827
Llama-3.2-3B-Q8_0.gguf -- 0.0027674345765262842
--------------------------------------------------------------------------------
```

---

## Summarized results for Llama 3.2 3B

Approximate Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better):
- Standard Q8_0 quant: **0.002**
- Crush token embeddings to Q2_K, otherwise Q8_0: **0.344**
- Standard Q2_K quant: **1.504**
- Crush attention to Q2_K, otherwise Q8_0: **1.528**
- Crush FFN to Q2_K, otherwise Q8_0: **4.693**

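The MSD reported above is, presumably, the mean squared difference between the baseline and quantized models' logits, averaged over each prompt's tokens and then across prompts. A dependency-free sketch of that computation (the actual harness may differ in shapes and precision):

```python
def mean_squared_deviation(logits_a, logits_b):
    """MSD between two models' logits for one prompt.

    logits_* : nested lists of floats, one row of vocab-sized logits
    per token position, from the same prompt.
    """
    total, count = 0.0, 0
    for row_a, row_b in zip(logits_a, logits_b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count

def average_msd(pairs):
    """Average the per-prompt MSDs, as in the "Average MSD" lines above."""
    msds = [mean_squared_deviation(a, b) for a, b in pairs]
    return sum(msds) / len(msds)
```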
---

## Commands used for Qwen2.5-14B

---

### Qwen2.5-14B - Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Qwen2.5-14B - Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Qwen2.5-14B - Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Qwen2.5-14B - Crush output tensor to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Qwen2.5-14B

```
Number of input texts: 10
Shortest input length in tokens: 60
Longest input length in tokens: 4801
Average input length in tokens: 1589.3
Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
Qwen2.5-14B-Q2_K.gguf -- 1.3933132886886597
Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.01663336530327797
Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 2.160086154937744
Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 0.6060687303543091
Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf -- 0.9216748476028442
Qwen2.5-14B-Q8_0.gguf -- 0.0053400141187012196
--------------------------------------------------------------------------------
```

---

## Summarized results for Qwen2.5-14B

Approximate Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better):
- Standard Q8_0 quant: **0.005**
- Crush token embeddings to Q2_K, otherwise Q8_0: **0.016**
- Crush attention to Q2_K, otherwise Q8_0: **0.606**
- Crush output tensor to Q2_K, otherwise Q8_0: **0.921**
- Standard Q2_K quant: **1.393**
- Crush FFN to Q2_K, otherwise Q8_0: **2.160**

---
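Using the reported averages from both summaries, the tensor groups can be ranked by how much damage crushing them to Q2_K causes. A small sketch (the dict layout and `ranked` helper are mine; the numbers are copied from the summaries above):

```python
# Average MSD vs. BF16, copied from the two summaries above.
results = {
    "Llama-3.2-3B": {"embeddings": 0.344, "standard Q2_K": 1.504,
                     "attention": 1.528, "ffn": 4.693},
    "Qwen2.5-14B": {"embeddings": 0.016, "attention": 0.606,
                    "output": 0.921, "standard Q2_K": 1.393, "ffn": 2.160},
}

def ranked(model):
    """Crush targets ordered from least to most damaging for one model."""
    return sorted(results[model], key=results[model].get)

for model in results:
    print(model, "->", ", ".join(ranked(model)))
```

For both models the ordering is the same at the extremes: token embeddings tolerate Q2_K best, FFN tensors worst.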