# Quantized Model

This model has been quantized using bitsandbytes (4-bit NF4 quantization).

## Loading the Model

To load this quantized model, you need to specify the quantization configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit NF4 quantization with bf16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model (device_map="auto" requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    "WethosAI/llama3.3_70B_stu_persona_verbose_4bit",
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("WethosAI/llama3.3_70B_stu_persona_verbose_4bit")
```
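As a back-of-the-envelope check of why 4-bit loading matters for a model this size (the numbers below use decimal gigabytes and ignore the small per-block overhead that double quantization adds for quantization constants):

```python
# Rough weight-memory estimate for a 71B-parameter model.
# NF4 stores 4 bits (0.5 bytes) per weight vs 2 bytes for fp16/bf16.
params = 71e9

bytes_fp16 = params * 2    # 16-bit baseline
bytes_nf4 = params * 0.5   # 4-bit quantized weights

print(f"fp16 weights: {bytes_fp16 / 1e9:.1f} GB")  # 142.0 GB
print(f"nf4 weights:  {bytes_nf4 / 1e9:.1f} GB")   # 35.5 GB
```

In practice you also need headroom for activations and the KV cache, so plan for somewhat more GPU memory than the weight figure alone.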

## Requirements

- transformers
- bitsandbytes
- accelerate (needed for `device_map="auto"`)
- torch (with CUDA support; bitsandbytes 4-bit requires a CUDA GPU)
- peft (only if this model was originally a LoRA adapter)
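The requirements above can be installed with pip (no version pins are required; recent releases of each package should work):

```shell
# Core dependencies for loading the 4-bit model
pip install transformers bitsandbytes accelerate torch

# Only needed if you are also loading a LoRA adapter
pip install peft
```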

## Notes

- This model uses 4-bit quantization to reduce memory usage
- The quantization config is saved in the model's `config.json`
- Make sure to load with the `quantization_config` parameter to use the quantized weights
- Quantized models require a CUDA-enabled GPU to run efficiently
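Because the quantization settings are persisted in `config.json`, you can verify them without downloading or loading any weights. A minimal sketch using only the standard library; the JSON below is an illustrative excerpt (the real file contains many more fields), with key names matching `BitsAndBytesConfig`'s serialized form:

```python
import json

# Illustrative excerpt of a bitsandbytes-quantized model's config.json
config_text = """
{
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true
  }
}
"""

qc = json.loads(config_text)["quantization_config"]
assert qc["load_in_4bit"] and qc["bnb_4bit_quant_type"] == "nf4"
print("4-bit NF4 quantization confirmed")
```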
Safetensors · Model size: 71B params · Tensor types: F32, BF16, U8