alexmarques committed
Commit 971d339 · verified · 1 Parent(s): 10fb995

Update README.md

Files changed (1)
  1. README.md +63 -49
README.md CHANGED
@@ -12,6 +12,8 @@ pipeline_tag: text-generation
  - **Output:** Text
  - **Model Optimizations:**
    - **Weight quantization:** INT8
+ - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  - **Release Date:** 7/2/2024
  - **Version:** 1.0
  - **Model Developers:** Neural Magic
@@ -28,57 +30,46 @@ Only the weights of the linear operators within transformers blocks are quantize
 [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with a 1% damping factor and 256 sequences of 8,192 random tokens.
 
- ## Usage and Creation
-
- - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
- - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+ ## Deployment
+
+ ### Use with vLLM
 
- ### Use with transformers
-
- This model is supported by Transformers leveraging the integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
- The following examples show how the model can be used as part of a Transformers pipeline or with the `generate()` function.
-
- #### Transformers pipeline
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 
 ```python
- import transformers
- import torch
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
 
 model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
 
- pipeline = transformers.pipeline(
-     "text-generation",
-     model=model_id,
-     model_kwargs={"torch_dtype": "auto"},
-     device_map="auto",
- )
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 messages = [
     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
     {"role": "user", "content": "Who are you?"},
 ]
 
- terminators = [
-     pipeline.tokenizer.eos_token_id,
-     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
- ]
+ prompts = tokenizer.apply_chat_template(messages, tokenize=False)
 
- outputs = pipeline(
-     messages,
-     max_new_tokens=256,
-     eos_token_id=terminators,
-     do_sample=True,
-     temperature=0.6,
-     top_p=0.9,
- )
- print(outputs[0]["generated_text"][-1])
+ llm = LLM(model=model_id)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
 ```
 
- #### Transformers AutoModelForCausalLM
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
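As a minimal sketch of querying a served copy of this model through that API (assuming the `openai` client package; the launch command, port, and flags below are illustrative and vary across vLLM versions):

```python
# Illustrative only: start an OpenAI-compatible vLLM server first, e.g.
#   python -m vllm.entrypoints.openai.api_server \
#       --model neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16
from openai import OpenAI

# vLLM serves on http://localhost:8000/v1 by default; no real API key is required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```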
+
+ ### Use with transformers
+
+ This model is supported by Transformers leveraging the integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
+ The following example shows how the model can be used with the `generate()` function.
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch
 
 model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
 
@@ -117,40 +108,63 @@ response = outputs[0][input_ids.shape[-1]:]
 print(tokenizer.decode(response, skip_special_tokens=True))
 ```
 
- ### vLLM Deployment
-
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+ ## Creation
+
+ This model was created by applying the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library as presented in the code snippet below.
+ Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoGPTQ; a sketch of an equivalent llm-compressor flow follows the example below.
 
 ```python
- from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
+ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+ import random
 
- model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
+ model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
 
- sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)
+ num_samples = 256
+ max_seq_len = 8192
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
- messages = [
-     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
-     {"role": "user", "content": "Who are you?"},
- ]
-
- prompts = tokenizer.apply_chat_template(messages, tokenize=False)
+ # Calibration data: 256 sequences of 8,192 random tokens each
+ max_token_id = len(tokenizer.get_vocab()) - 1
+ examples = []
+ for _ in range(num_samples):
+     examples.append(
+         {
+             "input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)],
+             "attention_mask": max_seq_len * [1],
+         }
+     )
 
- llm = LLM(model=model_id)
+ quantize_config = BaseQuantizeConfig(
+     bits=8,              # INT8 weights
+     group_size=-1,       # channel-wise quantization
+     desc_act=False,
+     model_file_base_name="model",
+     damp_percent=0.01,   # 1% damping factor
+ )
 
- outputs = llm.generate(prompts, sampling_params)
+ model = AutoGPTQForCausalLM.from_pretrained(
+     model_id,
+     quantize_config,
+     device_map="auto",
+ )
 
- generated_text = outputs[0].outputs[0].text
- print(generated_text)
+ model.quantize(examples)
+ model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a16")
 ```
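For comparison, here is a minimal sketch of the same W8A16 recipe written for llm-compressor. It is hypothetical: `SparseAutoModelForCausalLM`, `GPTQModifier`, `oneshot`, and the calibration dataset below are assumptions based on the llm-compressor project and may differ between releases:

```python
# Hypothetical llm-compressor counterpart of the AutoGPTQ recipe above;
# names and arguments are assumptions, so check the llm-compressor docs.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# W8A16 scheme: INT8 weights with 16-bit activations, skipping the lm_head.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W8A16",
    ignore=["lm_head"],
    dampening_frac=0.01,  # mirrors damp_percent=0.01 above
)

oneshot(
    model=model,
    dataset="open_platypus",  # illustrative calibration set
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=256,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a16")
```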
 
- vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 ## Evaluation
 
- The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096,trust_remote_code=True \
+   --tasks openllm \
+   --batch_size auto
+ ```
 
 ### Accuracy