---
base_model:
- mistralai/Mistral-Large-Instruct-2407
pipeline_tag: text-generation
tags:
- mistral
- 3bit
---

This is a 3-bit AutoRound GPTQ quantization of Mistral-Large-Instruct-2407.

The conversion used the original `model-*.safetensors` weight shards.

The quantized model needs at least ~50 GB of VRAM for the weights plus ~5 GB for context; I quantized it so that it fits within 64 GB of VRAM.
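
As a rough sanity check on that figure, here is a back-of-envelope sketch (assuming the model's ~123B parameters and the group size used in the script below; runtime overhead such as the KV cache is not included):

```
params = 123e9     # approximate parameter count of Mistral-Large-Instruct-2407
bits = 3           # weight precision after quantization
group_size = 128   # one fp16 scale per group of 128 weights

weights_gb = params * bits / 8 / 1e9       # packed 3-bit weights
scales_gb = params / group_size * 2 / 1e9  # ~2 bytes of fp16 scale per group

print(f"~{weights_gb:.1f} GB weights + ~{scales_gb:.1f} GB scales")
# ~46.1 GB weights + ~1.9 GB scales, consistent with the ~50 GB figure above
```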

Quantization script (conversion takes around 20 hours and needs ~520 GB of system RAM plus a 48 GB A40 GPU):

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
import torch

model_name = "mistralai/Mistral-Large-Instruct-2407"

# Load the full-precision model in fp16 together with its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3-bit symmetric quantization with a group size of 128
bits, group_size, sym = 3, 128, True

autoround = AutoRound(model, tokenizer, nsamples=256, iters=512,
                      low_gpu_mem_usage=True, batch_size=4,
                      bits=bits, group_size=group_size, sym=sym, device="cuda")
autoround.quantize()

# Export the quantized weights in GPTQ format
output_dir = "./Mistral-Large-Instruct-2407-3bit"
autoround.save_quantized(output_dir, format="auto_gptq", inplace=True)
```
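
For inference, the exported checkpoint loads like any GPTQ model through transformers (with auto-gptq and optimum installed). A minimal sketch; the repo id matches the eval header below, and the prompt and generation settings are illustrative:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize GPTQ quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```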

Evals were run with lm-eval-harness.

```
# example command (Jupyter/Colab; {m} is interpolated from the Python variable):
# !pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git auto-gptq optimum
m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
```

hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4103|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.3290|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |4.5765|±  |   N/A|
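
The three wikitext metrics are tied together (assuming lm-eval's standard definitions: byte_perplexity = 2^bits_per_byte, and word_perplexity is the byte perplexity raised to the corpus's average bytes per word), so the rows can be cross-checked:

```
import math

bits_per_byte = 0.4103
byte_ppl = 2 ** bits_per_byte
print(round(byte_ppl, 4))  # 1.329, matches the table

# Solving word_ppl = byte_ppl ** k for k recovers wikitext's average
# bytes per word, roughly 5.3:
word_ppl = 4.5765
print(math.log(word_ppl) / math.log(byte_ppl))  # ~5.35
```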

vs 3-bit VPTQ: hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4017|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.3211|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |4.4324|±  |   N/A|

vs 4-bit GPTQ: hf (pretrained=ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.3536|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.2777|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |3.7082|±  |   N/A|

vs 4-bit VPTQ: hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-65536-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.3415|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.2671|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |3.5463|±  |   N/A|

vs EXL2 4.00 bpw (I think these numbers come from a different evaluation setup, so they are not directly comparable):

|             |Wikitext| C4  |FineWeb|Max VRAM|
|-------------|--------|-----|-------|--------|
|EXL2 4.00 bpw| 2.885  |6.484| 6.246 |60.07 GB|