Llama 3.1 8B Instruct - W8A8 Quantized

This is a W8A8 (8-bit weights and 8-bit activations) quantized version of meta-llama/Llama-3.1-8B-Instruct.

Model Description

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Quantization: W8A8 (8-bit weights, 8-bit activations)
  • Quantization Method: GPTQ using llmcompressor
  • Model Size: ~8GB (reduced from ~16GB FP16)
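The size reduction follows directly from per-parameter storage cost. A quick back-of-the-envelope check (assuming roughly 8B parameters, counting weight storage only):

```python
# Approximate checkpoint size from parameter count and bytes per weight.
num_params = 8e9  # ~8B parameters (Llama 3.1 8B)

fp16_gb = num_params * 2 / 1e9  # 2 bytes per weight in FP16/BF16
int8_gb = num_params * 1 / 1e9  # 1 byte per weight in W8A8

print(f"FP16: ~{fp16_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB")
```

This matches the ~16 GB → ~8 GB figures above; the real checkpoint is slightly larger because scales and a few unquantized tensors are stored alongside the INT8 weights.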

Quantization Details

This model was quantized using llmcompressor with the following configuration:

  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (128 samples)
  • Max Sequence Length: 128
  • Scheme: W8A8
  • Targets: Linear layers (excluding lm_head)
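Under a W8A8 scheme, each quantized tensor is stored as INT8 values plus a floating-point scale. As a rough illustration only (not the llmcompressor implementation, which uses GPTQ error correction and per-channel/per-group scales), symmetric absmax quantization of a weight vector looks like this:

```python
def quantize_int8(values):
    """Symmetric absmax quantization: q = clamp(round(v / scale), -128, 127)."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from INT8 codes and the shared scale."""
    return [x * scale for x in q]

weights = [0.8, -1.27, 0.05, 0.4]
q, scale = quantize_int8(weights)  # scale = 1.27 / 127 ~ 0.01
restored = dequantize(q, scale)    # close to the original weights
```

GPTQ improves on this naive rounding by adjusting remaining weights to compensate for each quantization error, which is why a calibration dataset is needed.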

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Note: loading an llmcompressor checkpoint requires the compressed-tensors
# package (pip install compressed-tensors), which transformers uses to
# deserialize the quantized weights.
model_id = "lsm0729/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",  # use torch_dtype="auto" on transformers < 4.56
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example inference
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Performance

  • Memory Usage: ~8GB of GPU memory for weights (vs. ~16GB for FP16)
  • Inference Speed: Faster than FP16 on hardware with INT8 support; speedups are most pronounced in runtimes with fused INT8 kernels, such as vLLM
  • Accuracy: Comparable to the original model on most tasks, with minimal degradation from quantization

License

This model inherits the Llama 3.1 Community License from the base model. Please refer to the original model card for full license details.

Citation

If you use this model, please cite both the original Llama 3.1 paper and the quantization library:

@article{llama3.1,
  title={The Llama 3 Herd of Models},
  author={Meta AI},
  journal={arXiv preprint arXiv:2407.21783},
  year={2024},
  url={https://arxiv.org/abs/2407.21783}
}

Acknowledgments

  • Base model by Meta AI
  • Quantization using llmcompressor
  • Calibration dataset by HuggingFace H4 team