Commit 9fad644 (verified, parent 6cdb530) by lonq: Update README.md
---
license: cc-by-nc-sa-4.0
base_model:
- Unbabel/Tower-Plus-9B
pipeline_tag: translation
---
# Tower+ 9B (4-bit bitsandbytes)

This is a 4-bit quantized version of Tower+ 9B using bitsandbytes.
## Model Description

- **Original model:** https://huggingface.co/Unbabel/Tower-Plus-9B
- **Quantization method:** 4-bit NormalFloat (NF4)
- **Quantization library:** bitsandbytes
- **Compute dtype:** float16

This model was quantized to reduce memory usage and improve inference efficiency while preserving most of the original model's translation quality. It can be loaded directly with Hugging Face transformers on a GPU.
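As a rough back-of-the-envelope estimate (assuming all 9B parameters are stored at the stated bit width; real NF4 checkpoints keep some layers in higher precision, so actual files are somewhat larger), 4-bit storage cuts weight memory to about a quarter of float16:

```python
# Rough weight-memory estimate for a 9B-parameter model.
# Assumption: every parameter is stored at the given bit width.
params = 9e9

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed to store `params` weights at `bits_per_param`."""
    return params * bits_per_param / 8 / 1e9

fp16_gb = weight_gb(16)  # 18.0 GB
nf4_gb = weight_gb(4)    # 4.5 GB
print(f"fp16: {fp16_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")
```

This is why the quantized model fits on a single consumer GPU where the float16 original does not.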
## Usage

You can load this model with the transformers library. Make sure bitsandbytes and accelerate are installed.

### Prerequisites

```bash
pip install transformers bitsandbytes accelerate
```
### Loading the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lonq/Tower-Plus-9B-bnb-4bit"

# The 4-bit quantization config is stored in the repo, so no extra
# quantization arguments are needed here.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example generation: translate a short English sentence into French.
input_text = "Once upon a time"
source_name = "English"
target_name = "French"
messages = [
    {"role": "user", "content": f"Translate the following text from {source_name} to {target_name}.\nSource: {input_text}\nTarget:"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=4096)
# Strip the prompt tokens and decode only the newly generated text.
new_tokens = outputs[0][inputs.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
```
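If you would rather quantize the original Unbabel/Tower-Plus-9B checkpoint yourself, a `BitsAndBytesConfig` along these lines reproduces the settings described above (NF4, float16 compute). This is a sketch only: it downloads the full-precision weights and requires a CUDA GPU.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with float16 compute, matching this repo's settings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Unbabel/Tower-Plus-9B",
    quantization_config=bnb_config,
    device_map="auto",
)
```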
## Intended Use

This model is intended for efficient inference on consumer-grade GPUs or other environments with limited VRAM.