michaelfeil committed on
Commit 62a96a7 · 1 Parent(s): f311b96

Upload togethercomputer/RedPajama-INCITE-7B-Chat ctranslate fp16 weights
README.md ADDED
---
tags:
- ctranslate2
- int8
- float16
license: apache-2.0
language:
- en
datasets:
- togethercomputer/RedPajama-Data-1T
- OpenAssistant/oasst1
- databricks/databricks-dolly-15k
widget:
- text: "<human>: Write an email to my friends inviting them to come to my home on Friday for a dinner party, bring their own food to share.\n<bot>:"
  example_title: "Email Writing"
- text: "<human>: Create a list of things to do in San Francisco\n<bot>:"
  example_title: "Brainstorming"
inference:
  parameters:
    temperature: 0.7
    top_p: 0.7
    top_k: 50
    max_new_tokens: 128
---
# Fast Inference with CTranslate2

Speed up inference and reduce memory use by 2x-4x using int8 inference in C++ on CPU or GPU.

Quantized version of [togethercomputer/RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat).

```bash
pip install hf-hub-ctranslate2>=2.0.8 ctranslate2>=3.15.0
```
Converted on 2023-06-07 using
```bash
ct2-transformers-converter --model togethercomputer/RedPajama-INCITE-7B-Chat --output_dir /home/michael/tmp-ct2fast-RedPajama-INCITE-7B-Chat --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json .gitattributes --quantization int8_float16 --trust_remote_code
```

Checkpoint compatible with [ctranslate2>=3.15.0](https://github.com/OpenNMT/CTranslate2)
and [hf-hub-ctranslate2>=2.0.8](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`
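The device-to-compute-type pairing above can be sketched as a tiny helper (the name `pick_compute_type` is illustrative, not part of either library):

```python
def pick_compute_type(device: str) -> str:
    """Map a device string to the recommended CTranslate2 compute type."""
    # int8_float16 on CUDA, plain int8 on CPU (per the list above)
    return "int8_float16" if device.startswith("cuda") else "int8"

print(pick_compute_type("cuda"))  # int8_float16
print(pick_compute_type("cpu"))   # int8
```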

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
from transformers import AutoTokenizer

model_name = "michaelfeil/ct2fast-RedPajama-INCITE-7B-Chat"
# use either TranslatorCT2fromHfHub or GeneratorCT2fromHfHub here, depending on the model
model = GeneratorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
    # tokenizer=AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat")
)
outputs = model.generate(
    text=["def fibonacci(", "User: How are you doing? Bot:"],
    max_length=64,
    include_prompt_in_result=False
)
print(outputs)
```

# License and other remarks
This is just a quantized version. License conditions are intended to be identical to those of the original Hugging Face repository.

# Original description

# RedPajama-INCITE-7B-Chat

RedPajama-INCITE-7B-Chat was developed by Together and leaders from the open-source AI community, including Ontocord.ai, ETH DS3Lab, AAI CERC, Université de Montréal, MILA - Québec AI Institute, the Stanford Center for Research on Foundation Models (CRFM), the Stanford Hazy Research group, and LAION.

It is fine-tuned on OASST1 and Dolly2 to enhance chatting ability.

- Base Model: [RedPajama-INCITE-7B-Base](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base)
- Instruction-tuned Version: [RedPajama-INCITE-7B-Instruct](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Instruct)
- Chat Version: [RedPajama-INCITE-7B-Chat](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Chat)

## Model Details
- **Developed by**: Together Computer.
- **Model type**: Language Model
- **Language(s)**: English
- **License**: Apache 2.0
- **Model Description**: A 6.9B parameter pretrained language model.

# Quick Start

Please note that the model requires `transformers` version >= 4.25.1.

To prompt the chat model, use the following format:
```
<human>: [Instruction]
<bot>:
```
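For programmatic use, the template can be applied with a small helper; `format_prompt` is an illustrative name, not part of the model's API:

```python
def format_prompt(instruction: str) -> str:
    """Wrap a user instruction in the <human>/<bot> chat template."""
    return f"<human>: {instruction}\n<bot>:"

print(format_prompt("Who is Alan Turing?"))
```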

## GPU Inference

This requires a GPU with 16GB memory.

```python
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version
assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'

# init
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat", torch_dtype=torch.float16)
model = model.to('cuda:0')

# infer
prompt = "<human>: Who is Alan Turing?\n<bot>:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)
"""
Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, mathematician, and theoretical biologist.
"""
```
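One caveat with the version check in the snippet above: comparing version strings with `>=` is lexicographic, so for example `"4.9.0" >= "4.25.1"` evaluates to `True` even though 4.9 predates 4.25. A sketch of a numeric comparison (the helper name `version_at_least` is illustrative; it assumes purely numeric dotted versions, whereas real-world checks can use `packaging.version`):

```python
def version_at_least(current: str, minimum: str) -> bool:
    """Compare dotted version strings numerically rather than lexicographically."""
    def parse(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))
    return parse(current) >= parse(minimum)

print("4.9.0" >= "4.25.1")                  # True  (misleading string comparison)
print(version_at_least("4.9.0", "4.25.1"))  # False (correct numeric comparison)
print(version_at_least("4.30.2", "4.25.1")) # True
```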

## GPU Inference in Int8

This requires a GPU with 12GB memory.

To run inference with int8, please ensure you have installed `accelerate` and `bitsandbytes`. You can install them with the following command:

```bash
pip install accelerate
pip install bitsandbytes
```

Then you can run inference with int8 as follows:

```python
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version
assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'

# init
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat", device_map='auto', torch_dtype=torch.float16, load_in_8bit=True)

# infer
prompt = "<human>: Who is Alan Turing?\n<bot>:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)
"""
Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist.
"""
```

## CPU Inference

```python
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

MIN_TRANSFORMERS_VERSION = '4.25.1'

# check transformers version
assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'

# init
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-7B-Chat", torch_dtype=torch.bfloat16)

# infer
prompt = "<human>: Who is Alan Turing?\n<bot>:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)
"""
Alan Mathison Turing, OBE, FRS, (23 June 1912 – 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist.
"""
```

Please note that since `LayerNormKernelImpl` is not implemented in fp16 for CPU, we use `bfloat16` for CPU inference.
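That dtype fallback can be captured in a small selection helper (a sketch; `pick_dtype` is an illustrative name, and it returns dtype names as strings rather than torch dtypes):

```python
def pick_dtype(device: str) -> str:
    """Choose a half-precision dtype for the given device.

    fp16 LayerNorm kernels are not implemented on CPU, so fall back
    to bfloat16 there, mirroring the note above.
    """
    return "float16" if device.startswith("cuda") else "bfloat16"

print(pick_dtype("cuda:0"))  # float16
print(pick_dtype("cpu"))     # bfloat16
```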

# Uses

## Direct Use

Excluded uses are described below.

### Misuse, Malicious Use, and Out-of-Scope Use

It is the responsibility of the end user to ensure that the model is used in a responsible and ethical manner.

#### Out-of-Scope Use

`RedPajama-INCITE-7B-Chat` is a language model and may not perform well for use cases outside of its intended scope.
For example, it may not be suitable for use in safety-critical applications or for making decisions that have a significant impact on individuals or society.
It is important to consider the limitations of the model and to only use it for its intended purpose.

#### Misuse and Malicious Use

`RedPajama-INCITE-7B-Chat` is designed for language modeling.
Misuse of the model, such as using it to engage in illegal or unethical activities, is strictly prohibited and goes against the principles of the project.

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:

- Generating fake news, misinformation, or propaganda
- Promoting hate speech, discrimination, or violence against individuals or groups
- Impersonating individuals or organizations without their consent
- Engaging in cyberbullying or harassment
- Defamatory content
- Spamming or scamming
- Sharing confidential or sensitive information without proper authorization
- Violating the terms of use of the model or the data used to train it
- Creating automated bots for malicious purposes such as spreading malware, phishing scams, or spamming

## Limitations

`RedPajama-INCITE-7B-Chat`, like other language models, has limitations that should be taken into consideration.
For example, the model may not always provide accurate or relevant answers, particularly for questions that are complex, ambiguous, or outside of its training data.
We therefore welcome contributions from individuals and organizations, and encourage collaboration towards creating a more robust and inclusive chatbot.

## Training

**Training Data**

Please refer to [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)

**Training Procedure**

- **Hardware:** 8 A100
- **Optimizer:** Adam
- **Gradient Accumulations**: 1
- **Num of Tokens:** 79M tokens
- **Learning rate:** 1e-5

## Community

Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4).
config.json ADDED
```json
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "layer_norm_epsilon": null,
  "unk_token": "<|endoftext|>"
}
```
generation_config.json ADDED
```json
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.29.1"
}
```
model.bin ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:fe18f0c0dd6d1ecd14f1a523046bceb6cd8963614204b9b54a855d40180fc57d
size 6864170920
```
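For reference, this pointer file follows the Git LFS spec: `oid` is the SHA-256 hex digest of the stored object's bytes and `size` is its byte length. A sketch with illustrative stand-in content:

```python
import hashlib

# Stand-in bytes; the real object is the 6.8 GB model.bin referenced above.
blob = b"illustrative model bytes"
oid = hashlib.sha256(blob).hexdigest()

print(f"oid sha256:{oid}")
print(f"size {len(blob)}")
```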
special_tokens_map.json ADDED
```json
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
```
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
```json
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 2048,
  "tokenizer_class": "GPTNeoXTokenizer",
  "unk_token": "<|endoftext|>"
}
```
vocabulary.txt ADDED
The diff for this file is too large to render. See raw diff