---
base_model:
- mistralai/Mistral-Large-Instruct-2407
pipeline_tag: text-generation
tags:
- mistral
- 3bit
---

This is a 3-bit AutoRound GPTQ quantization of Mistral-Large-Instruct-2407.

The conversion used the original `model-*.safetensors` weight shards.

The quantized model needs at least ~50 GB of VRAM for the weights plus ~5 GB for context; I quantized it so that it fits within 64 GB of VRAM.
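
As a rough sanity check on that figure, here is a back-of-envelope sketch (assuming the model's ~123B parameters and the group size used in the script below; runtime overhead such as the KV cache is not included):

```
params = 123e9     # approximate parameter count of Mistral-Large-Instruct-2407
bits = 3           # weight precision after quantization
group_size = 128   # one fp16 scale per group of 128 weights

weights_gb = params * bits / 8 / 1e9       # packed 3-bit weights
scales_gb = params / group_size * 2 / 1e9  # ~2 bytes of fp16 scale per group

print(f"~{weights_gb:.1f} GB weights + ~{scales_gb:.1f} GB scales")
# ~46.1 GB weights + ~1.9 GB scales, consistent with the ~50 GB figure above
```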

Quantization script (conversion takes around 20 hours and needs ~520 GB of system RAM plus a 48 GB A40 GPU):

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
import torch

model_name = "mistralai/Mistral-Large-Instruct-2407"

# Load the full-precision model in fp16 together with its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3-bit symmetric quantization with a group size of 128
bits, group_size, sym = 3, 128, True

autoround = AutoRound(model, tokenizer, nsamples=256, iters=512,
                      low_gpu_mem_usage=True, batch_size=4,
                      bits=bits, group_size=group_size, sym=sym, device="cuda")
autoround.quantize()

# Export the quantized weights in GPTQ format
output_dir = "./Mistral-Large-Instruct-2407-3bit"
autoround.save_quantized(output_dir, format="auto_gptq", inplace=True)
```
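
For inference, the exported checkpoint loads like any GPTQ model through transformers (with auto-gptq and optimum installed). A minimal sketch; the repo id matches the eval header below, and the prompt and generation settings are illustrative:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize GPTQ quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```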

Evals were run with lm-eval-harness.

```
# example command (Jupyter/Colab; {m} is interpolated from the Python variable):
# !pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git auto-gptq optimum
m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
```

hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4103|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.3290|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |4.5765|±  |   N/A|
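
The three wikitext metrics are tied together (assuming lm-eval's standard definitions: byte_perplexity = 2^bits_per_byte, and word_perplexity is the byte perplexity raised to the corpus's average bytes per word), so the rows can be cross-checked:

```
import math

bits_per_byte = 0.4103
byte_ppl = 2 ** bits_per_byte
print(round(byte_ppl, 4))  # 1.329, matches the table

# Solving word_ppl = byte_ppl ** k for k recovers wikitext's average
# bytes per word, roughly 5.3:
word_ppl = 4.5765
print(math.log(word_ppl) / math.log(byte_ppl))  # ~5.35
```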

vs 3-bit VPTQ: hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4017|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.3211|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |4.4324|±  |   N/A|

vs 4-bit GPTQ: hf (pretrained=ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.3536|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.2777|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |3.7082|±  |   N/A|

vs 4-bit VPTQ: hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-65536-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot| Metric        |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.3415|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.2671|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |3.5463|±  |   N/A|

vs EXL2 4.00 bpw (I think these numbers come from a different evaluation setup, so they are not directly comparable):

|             |Wikitext| C4  |FineWeb|Max VRAM|
|-------------|--------|-----|-------|--------|
|EXL2 4.00 bpw| 2.885  |6.484| 6.246 |60.07 GB|