# Quantized Model

This model has been quantized using bitsandbytes (4-bit NF4 quantization).

## Loading the Model

To load this quantized model, you need to specify the quantization configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit NF4 quantization with bf16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model (device_map="auto" requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    "WethosAI/llama3.3_70B_stu_persona_verbose_4bit",
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("WethosAI/llama3.3_70B_stu_persona_verbose_4bit")
```
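As a back-of-the-envelope check of why 4-bit loading matters for a model this size (the numbers below use decimal gigabytes and ignore the small per-block overhead that double quantization adds for quantization constants):

```python
# Rough weight-memory estimate for a 71B-parameter model.
# NF4 stores 4 bits (0.5 bytes) per weight vs 2 bytes for fp16/bf16.
params = 71e9

bytes_fp16 = params * 2    # 16-bit baseline
bytes_nf4 = params * 0.5   # 4-bit quantized weights

print(f"fp16 weights: {bytes_fp16 / 1e9:.1f} GB")  # 142.0 GB
print(f"nf4 weights:  {bytes_nf4 / 1e9:.1f} GB")   # 35.5 GB
```

In practice you also need headroom for activations and the KV cache, so plan for somewhat more GPU memory than the weight figure alone.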

## Requirements

- transformers
- bitsandbytes
- accelerate (needed for `device_map="auto"`)
- torch (with CUDA support; bitsandbytes 4-bit requires a CUDA GPU)
- peft (only if this model was originally a LoRA adapter)
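The requirements above can be installed with pip (no version pins are required; recent releases of each package should work):

```shell
# Core dependencies for loading the 4-bit model
pip install transformers bitsandbytes accelerate torch

# Only needed if you are also loading a LoRA adapter
pip install peft
```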

## Notes

- This model uses 4-bit quantization to reduce memory usage
- The quantization config is saved in the model's `config.json`
- Make sure to load with the `quantization_config` parameter to use the quantized weights
- Quantized models require a CUDA-enabled GPU to run efficiently
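Because the quantization settings are persisted in `config.json`, you can verify them without downloading or loading any weights. A minimal sketch using only the standard library; the JSON below is an illustrative excerpt (the real file contains many more fields), with key names matching `BitsAndBytesConfig`'s serialized form:

```python
import json

# Illustrative excerpt of a bitsandbytes-quantized model's config.json
config_text = """
{
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true
  }
}
"""

qc = json.loads(config_text)["quantization_config"]
assert qc["load_in_4bit"] and qc["bnb_4bit_quant_type"] == "nf4"
print("4-bit NF4 quantization confirmed")
```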
Safetensors · Model size: 71B params · Tensor types: F32, BF16, U8