Llama 3.1 8B Instruct - W8A8 Quantized

This is a W8A8 (8-bit weights and 8-bit activations) quantized version of meta-llama/Llama-3.1-8B-Instruct.

Model Description

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Quantization: W8A8 (8-bit weights, 8-bit activations)
  • Quantization Method: GPTQ using llmcompressor
  • Model Size: ~8GB (reduced from ~16GB FP16)
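The size reduction follows directly from per-parameter storage cost. A quick back-of-the-envelope check (assuming roughly 8B parameters, counting weight storage only):

```python
# Approximate checkpoint size from parameter count and bytes per weight.
num_params = 8e9  # ~8B parameters (Llama 3.1 8B)

fp16_gb = num_params * 2 / 1e9  # 2 bytes per weight in FP16/BF16
int8_gb = num_params * 1 / 1e9  # 1 byte per weight in W8A8

print(f"FP16: ~{fp16_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB")
```

This matches the ~16 GB → ~8 GB figures above; the real checkpoint is slightly larger because scales and a few unquantized tensors are stored alongside the INT8 weights.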

Quantization Details

This model was quantized using llmcompressor with the following configuration:

  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (128 samples)
  • Max Sequence Length: 128
  • Scheme: W8A8
  • Targets: Linear layers (excluding lm_head)
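Under a W8A8 scheme, each quantized tensor is stored as INT8 values plus a floating-point scale. As a rough illustration only (not the llmcompressor implementation, which uses GPTQ error correction and per-channel/per-group scales), symmetric absmax quantization of a weight vector looks like this:

```python
def quantize_int8(values):
    """Symmetric absmax quantization: q = clamp(round(v / scale), -128, 127)."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from INT8 codes and the shared scale."""
    return [x * scale for x in q]

weights = [0.8, -1.27, 0.05, 0.4]
q, scale = quantize_int8(weights)  # scale = 1.27 / 127 ~ 0.01
restored = dequantize(q, scale)    # close to the original weights
```

GPTQ improves on this naive rounding by adjusting remaining weights to compensate for each quantization error, which is why a calibration dataset is needed.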

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Note: loading an llmcompressor checkpoint requires the compressed-tensors
# package (pip install compressed-tensors), which transformers uses to
# deserialize the quantized weights.
model_id = "lsm0729/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",  # use torch_dtype="auto" on transformers < 4.56
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example inference
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Performance

  • Memory Usage: ~8GB of GPU memory for weights (vs. ~16GB for FP16)
  • Inference Speed: Faster than FP16 on hardware with INT8 support; speedups are most pronounced in runtimes with fused INT8 kernels, such as vLLM
  • Accuracy: Comparable to the original model on most tasks, with minimal degradation from quantization

License

This model inherits the Llama 3.1 Community License from the base model. Please refer to the original model card for full license details.

Citation

If you use this model, please cite both the original Llama 3.1 paper and the quantization library:

@article{llama3.1,
  title={The Llama 3 Herd of Models},
  author={Meta AI},
  journal={arXiv preprint arXiv:2407.21783},
  year={2024},
  url={https://arxiv.org/abs/2407.21783}
}

Acknowledgments

  • Base model by Meta AI
  • Quantization using llmcompressor
  • Calibration dataset by HuggingFace H4 team