📜 Model Description

This model is a 4-bit NormalFloat (NF4) quantized version of mlabonne's Meta-Llama-3.1-8B-Instruct-abliterated.

The quantization process significantly reduces the memory footprint (VRAM usage) and can improve throughput on memory-bound hardware, making the model practical to deploy on consumer-grade GPUs and other resource-constrained setups: at roughly 4 bits per weight, the 8B parameters occupy on the order of 5 GB instead of about 16 GB in FP16. Because NF4 is designed around the approximately normal distribution of pretrained weights, most of the original model's quality is preserved.

🔗 Original Model Source

Original Model Name: mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated

Original Base Model: Llama 3.1 8B Instruct

Original Description: A version of Llama 3.1 8B Instruct that has undergone "abliteration": the refusal direction is identified in the model's activations and ablated (removed), so the model no longer declines most requests. This modifies behavior through direct weight editing rather than conventional fine-tuning.

⚙️ Quantization Details

Quantization Technique: NF4 (NormalFloat 4-bit)

Library Used: bitsandbytes, through the Hugging Face transformers integration.

Purpose: To enable loading and running the model in 4-bit precision, drastically cutting down VRAM requirements.
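
For reference, this is roughly how an NF4 checkpoint like this one can be produced or loaded with bitsandbytes. The exact settings used for this repository (double quantization, compute dtype) are not documented, so treat the configuration below as an illustrative assumption rather than a record of the actual export:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative NF4 settings (assumed, not confirmed for this checkpoint)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat 4-bit
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)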

๐Ÿ› ๏ธ How to Use the Model (4-bit Loading)

This model is intended to be used with the Hugging Face transformers library and bitsandbytes for 4-bit loading.

💻 Installation

To utilize the 4-bit configuration, you must have the necessary libraries installed:

pip install torch transformers accelerate bitsandbytes

Python Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Meta-Llama-3.1-8B-Instruct-Abliterated-NF4"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The checkpoint ships pre-quantized NF4 weights, so from_pretrained should
# restore them in 4-bit automatically (bitsandbytes must be installed).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # compute dtype used for dequantized activations
)

# 2. Run Inference using the Instruct template
messages = [
    {"role": "system", "content": "You are a helpful and friendly AI assistant."},
    {"role": "user", "content": "What is the main benefit of 4-bit NF4 quantization?"}
]

# Apply the Llama 3.1 chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model answers rather than continuing the prompt
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
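
The decode call above returns the full sequence, prompt included. If you only want the newly generated reply, slice off the prompt tokens first; this small addition builds on the variables from the example above:

# Decode only the tokens generated after the prompt
response_ids = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response_ids, skip_special_tokens=True))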