Paper: The Llama 3 Herd of Models (arXiv:2407.21783)
This is a W8A8 (8-bit weights and 8-bit activations) quantized version of meta-llama/Llama-3.1-8B-Instruct.
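To illustrate what W8A8 means in practice, here is a minimal sketch of symmetric per-tensor int8 quantization, the kind of mapping applied to both weights and activations. This is an illustrative toy, not the model's actual quantization kernels:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map floats to [-128, 127].

    scale = max|x| / 127, so the largest-magnitude value lands on +/-127.
    Returns (quantized int values, scale).
    """
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from int8 values and their scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
# Each reconstructed value is within one quantization step (scale) of the original.
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

Storing 8-bit integers instead of 16-bit floats roughly halves memory use; running matmuls on int8 activations is what distinguishes W8A8 from weight-only schemes such as W8A16.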
This model was quantized using llmcompressor. Example usage with transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "lsm0729/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example inference
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
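The exact llmcompressor recipe used for this checkpoint is not included in this card. For reference, a typical one-shot W8A8 recipe looks like the sketch below; the dataset choice, hyperparameters, and output path are illustrative assumptions, not the published configuration:

```python
# Illustrative llmcompressor W8A8 recipe (assumed, not this checkpoint's actual config)
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    # Shift activation outliers into weights so activations quantize cleanly
    SmoothQuantModifier(smoothing_strength=0.8),
    # 8-bit weights and activations on Linear layers; keep lm_head in full precision
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",          # calibration data (illustrative choice)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
)
```

SmoothQuant-style smoothing before GPTQ is a common pairing for W8A8, since activation outliers are usually the main source of accuracy loss when quantizing activations to 8 bits.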
This model inherits the Llama 3.1 Community License from the base model. Please refer to the original model card for full license details.
If you use this model, please cite both the original Llama 3.1 paper and the quantization library:
```bibtex
@article{llama3.1,
  title={The Llama 3 Herd of Models},
  author={Meta AI},
  year={2024},
  url={https://arxiv.org/abs/2407.21783}
}
```
Base model: meta-llama/Llama-3.1-8B