Commit 9fad644 (verified, parent 6cdb530) by lonq: Update README.md
---
license: cc-by-nc-sa-4.0
base_model:
- Unbabel/Tower-Plus-9B
pipeline_tag: translation
---
# Tower+ 9B (4-bit bitsandbytes)

This is a 4-bit quantized version of Tower+ 9B using bitsandbytes.
## Model Description

- **Original model:** https://huggingface.co/Unbabel/Tower-Plus-9B
- **Quantization method:** 4-bit NormalFloat (NF4)
- **Quantization library:** bitsandbytes
- **Compute dtype:** float16

This model was quantized to reduce memory usage and improve inference efficiency while preserving most of the original model's translation quality. It can be loaded directly with Hugging Face transformers on a GPU.
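As a rough back-of-the-envelope estimate (assuming all 9B parameters are stored at the stated bit width; real NF4 checkpoints keep some layers in higher precision, so actual files are somewhat larger), 4-bit storage cuts weight memory to about a quarter of float16:

```python
# Rough weight-memory estimate for a 9B-parameter model.
# Assumption: every parameter is stored at the given bit width.
params = 9e9

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed to store `params` weights at `bits_per_param`."""
    return params * bits_per_param / 8 / 1e9

fp16_gb = weight_gb(16)  # 18.0 GB
nf4_gb = weight_gb(4)    # 4.5 GB
print(f"fp16: {fp16_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")
```

This is why the quantized model fits on a single consumer GPU where the float16 original does not.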
## Usage

You can load this model with the transformers library. Make sure bitsandbytes and accelerate are installed.

### Prerequisites

```bash
pip install transformers bitsandbytes accelerate
```
### Loading the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lonq/Tower-Plus-9B-bnb-4bit"

# The 4-bit quantization config is stored in the repo, so no extra
# quantization arguments are needed here.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example generation: translate a short English sentence into French.
input_text = "Once upon a time"
source_name = "English"
target_name = "French"
messages = [
    {"role": "user", "content": f"Translate the following text from {source_name} to {target_name}.\nSource: {input_text}\nTarget:"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=4096)
# Strip the prompt tokens and decode only the newly generated text.
new_tokens = outputs[0][inputs.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
```
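If you would rather quantize the original Unbabel/Tower-Plus-9B checkpoint yourself, a `BitsAndBytesConfig` along these lines reproduces the settings described above (NF4, float16 compute). This is a sketch only: it downloads the full-precision weights and requires a CUDA GPU.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with float16 compute, matching this repo's settings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Unbabel/Tower-Plus-9B",
    quantization_config=bnb_config,
    device_map="auto",
)
```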
## Intended Use

This model is intended for efficient inference on consumer-grade GPUs or other environments with limited VRAM.