Compumacy
/

Meta-Llama-3-8B-Instruct-WINT8

Text Generation

text-generation-inference

8-bit precision

compressed-tensors

Model card Files Files and versions

Meta-Llama-3-8B-Instruct-WINT8 / README.md

NoorNizar's picture

Initial full copy from personal account

bb932ae verified 8 months ago

|

history blame contribute delete

2.14 kB

	---
	library_name: transformers
	tags:
	- llmcompressor
	- quantization
	- wint8
	---

	# Meta-Llama-3-8B-Instruct-WINT8

	This model is a 8-bit quantized version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) "using the [llmcompressor](https://github.com/neuralmagic/llmcompressor) library.

	## Quantization Details

	* Base Model: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
	* Quantization Library: `llmcompressor`
	* Quantization Method: Weight-only 8-bit int (WINT8)
	* Quantization Recipe:
	```yaml
	quant_stage:
	quant_modifiers:
	QuantizationModifier:
	ignore: [lm_head]
	config_groups:
	group_0:
	weights: {num_bits: 8, type: int, symmetric: true, strategy: channel, dynamic: false}
	targets: [Linear]
	```

	## Evaluation Results

	The following table shows the evaluation results on various benchmarks compared to the baseline (non-quantized) model.

	\| Task \| Baseline Metric (10.0% Threshold) \| Quantized Metric \| Metric Type \|
	\|------------------\|-------------------------------------------------------\|------------------\|---------------------\|
	\| winogrande \| 0.7577 \| 0.7616 \| acc,none \|

	## How to Use

	You can load the quantized model and tokenizer using the `transformers` library:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "NoorNizar/Meta-Llama-3-8B-Instruct-WINT8"

	model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
	tokenizer = AutoTokenizer.from_pretrained(model_id)

	# Example usage (replace with your specific task)
	prompt = "Hello, world!"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=50)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Disclaimer

	This model was quantized automatically using a script. Performance and behavior might differ slightly from the original base model.