Tower+ 9B (4-bit bitsandbytes)

This is a 4-bit quantized version of Tower+ 9B using bitsandbytes.

Model Description

Original Model: https://huggingface.co/Unbabel/Tower-Plus-9B

Quantization Method: 4-bit NormalFloat (NF4)

Quantization Library: bitsandbytes

Compute Dtype: float16

This model was quantized to reduce memory usage and improve inference efficiency while maintaining high performance. It can be loaded directly with Hugging Face transformers on a GPU.
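As a rough back-of-the-envelope illustration of the memory savings (assuming ~9B weight parameters; the exact footprint depends on which layers bitsandbytes leaves in higher precision, plus activations and the KV cache):

```python
# Rough weight-memory estimate for a ~9B-parameter model.
# Ignores activations, the KV cache, and layers kept in higher
# precision, so treat these numbers as approximate lower bounds.
params = 9e9

fp16_gib = params * 2 / 2**30   # float16: 2 bytes per weight
nf4_gib = params * 0.5 / 2**30  # NF4: 4 bits (0.5 bytes) per weight

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"NF4 weights:     ~{nf4_gib:.1f} GiB")
```

This is why the 4-bit variant fits comfortably on a single consumer GPU where the float16 weights alone would not.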

Usage

You can load this model with the transformers library. Ensure you have bitsandbytes and accelerate installed.

Prerequisites

pip install transformers bitsandbytes accelerate

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lonq/Tower-Plus-9B-bnb-4bit"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)


# Example generation: translate English to French
input_text = "Once upon a time"
source_name = "English"
target_name = "French"
messages = [
    {"role": "user", "content": f"Translate the following text from {source_name} to {target_name}.\nSource: {input_text}\nTarget:"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt
new_tokens = outputs[0][input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
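If you translate many segments, the prompt construction above can be factored into a small helper. This is just a sketch; `build_translation_messages` is not part of the model's API, only a refactoring of the same prompt template used in the example:

```python
def build_translation_messages(text: str, source_lang: str, target_lang: str) -> list:
    """Build the chat messages for one translation request,
    using the same prompt template as the example above."""
    prompt = (
        f"Translate the following text from {source_lang} to {target_lang}.\n"
        f"Source: {text}\n"
        "Target:"
    )
    return [{"role": "user", "content": prompt}]

# Each messages list is then passed to tokenizer.apply_chat_template as above.
messages = build_translation_messages("Once upon a time", "English", "French")
```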

Intended Use

This model is intended for efficient inference on consumer-grade GPUs or environments with limited VRAM.

Model size: 10B params (Safetensors; tensor types F32, F16, U8)

Model tree

Base model: google/gemma-2-9b (this model is one of its quantized variants)