Model Description
This model is a 4-bit NormalFloat (NF4) quantized version of mlabonne's Meta-Llama-3.1-8B-Instruct-abliterated.
The quantization significantly reduces the memory footprint (VRAM usage), making the model practical to deploy on consumer-grade GPUs and other resource-constrained hardware, while preserving most of the original model's output quality: the NF4 data type is designed around the approximately normal distribution of pretrained weights, so it discards less information than naive 4-bit formats.
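As a rough back-of-the-envelope estimate (an approximation, not a measured figure): 8 billion parameters at 4 bits each come to about 8e9 × 0.5 bytes ≈ 4 GB of weight storage, versus roughly 16 GB in bfloat16, before activations and the KV cache are accounted for.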
Original Model Source
Original Model Name: mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
Original Base Model: Llama 3.1 8B Instruct
Original Description: A version of Llama 3.1 8B Instruct that has undergone "abliteration", an intervention that identifies and ablates the refusal direction in the model's activations so that it is far less likely to refuse requests. Note that this removes built-in safety behavior rather than adding capabilities or alignment.
Quantization Details
Quantization Technique: NF4 (NormalFloat 4-bit)
Library Used: Typically implemented using bitsandbytes via the Hugging Face transformers library.
Purpose: To enable loading and running the model in 4-bit precision, drastically cutting down VRAM requirements.
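For reference, 4-bit NF4 loading of this kind is typically configured through transformers' BitsAndBytesConfig. The exact settings used to produce this checkpoint are not documented here, so the values below (double quantization, bfloat16 compute dtype) are assumptions; a minimal sketch:

import torch
from transformers import BitsAndBytesConfig

# Illustrative NF4 setup; the precise options used for this checkpoint are
# assumptions, not confirmed by the repository.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

Pre-quantized repositories usually embed this configuration in their config.json, in which case from_pretrained applies it automatically and no explicit config argument is needed.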
How to Use the Model (4-bit Loading)
This model is intended to be used with the Hugging Face transformers library and bitsandbytes for 4-bit loading.
Installation
To utilize the 4-bit configuration, you must have the necessary libraries installed:
pip install torch transformers accelerate bitsandbytes
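If you want to confirm that bitsandbytes can see a working GPU backend before loading the model, recent releases ship a self-check (availability depends on your bitsandbytes version):

python -m bitsandbytes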
Python Usage Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ikarius/Meta-Llama-3.1-8B-Instruct-Abliterated-NF4"
# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # place layers on the available GPU(s)
    torch_dtype=torch.bfloat16,  # compute dtype for non-quantized operations
)
# 2. Build the prompt using the Llama 3.1 chat template
messages = [
    {"role": "system", "content": "You are a helpful and friendly AI assistant."},
    {"role": "user", "content": "What is the main benefit of 4-bit NF4 quantization?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model answers
    return_tensors="pt",
).to(model.device)
# 3. Generate a response and decode only the newly generated tokens
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
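Once the model is loaded, you can sanity-check the memory savings with transformers' built-in footprint helper; this reports the weights only, excluding activations and the KV cache:

# Approximate memory used by the model's weights, in gigabytes
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")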