RedHatAI/Meta-Llama-3-8B-Instruct-quantized.w8a16

@@ -12,6 +12,8 @@ pipeline_tag: text-generation
   - **Output:** Text
 - **Model Optimizations:**
   - **Weight quantization:** INT8
 - **Release Date:** 7/2/2024
 - **Version:** 1.0
 - **Model Developers:** Neural Magic
@@ -28,57 +30,46 @@ Only the weights of the linear operators within transformers blocks are quantize
 [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with 1% damping factor and 256 sequences of 8,192 random tokens.
-## Usage and Creation
-- **Intended Use Cases:** Intended for commercial and research use in English. Like [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
-- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
-### Use with transformers
-This model is supported by Transformers leveraging the integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
-The following examples show how to use the model in a Transformers pipeline or with the `generate()` function.
-#### Transformers pipeline
 ```python
-import transformers
-import torch
 model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
-pipeline = transformers.pipeline(
-    "text-generation",
-    model=model_id,
-    model_kwargs={"torch_dtype": "auto"},
-    device_map="auto",
-)
 messages = [
     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
     {"role": "user", "content": "Who are you?"},
 ]
-terminators = [
-    pipeline.tokenizer.eos_token_id,
-    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
-]
-outputs = pipeline(
-    messages,
-    max_new_tokens=256,
-    eos_token_id=terminators,
-    do_sample=True,
-    temperature=0.6,
-    top_p=0.9,
-)
-print(outputs[0]["generated_text"][-1])
 ```
-#### Transformers AutoModelForCausalLM
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
-import torch
 model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
@@ -117,40 +108,63 @@ response = outputs[0][input_ids.shape[-1]:]
 print(tokenizer.decode(response, skip_special_tokens=True))
 ```
-### vLLM Deployment
-This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 ```python
-from vllm import LLM, SamplingParams
 from transformers import AutoTokenizer
-model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
-sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-messages = [
-    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
-    {"role": "user", "content": "Who are you?"},
-]
-prompts = tokenizer.apply_chat_template(messages, tokenize=False)
-llm = LLM(model=model_id)
-outputs = llm.generate(prompts, sampling_params)
-generated_text = outputs[0].outputs[0].text
-print(generated_text)
 ```
-vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 ## Evaluation
-The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) using the [vLLM](https://docs.vllm.ai/en/stable/) engine.
 ### Accuracy

   - **Output:** Text
 - **Model Optimizations:**
   - **Weight quantization:** INT8
+- **Intended Use Cases:** Intended for commercial and research use in English. Like [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
+- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
 - **Release Date:** 7/2/2024
 - **Version:** 1.0
 - **Model Developers:** Neural Magic
 [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) is used for quantization with 1% damping factor and 256 sequences of 8,192 random tokens.
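For readers unfamiliar with the damping factor, the sketch below illustrates what a 1% value typically means in GPTQ-style quantization: a fraction of the mean diagonal of the layer's Hessian estimate is added to every diagonal entry, which keeps the matrix well conditioned before it is inverted. This is a plain-Python illustration under that assumption, not AutoGPTQ's actual code; `damp_hessian` and the toy 2x2 matrix are hypothetical names and values.

```python
def damp_hessian(hessian, damp_percent=0.01):
    """Return a copy of a square matrix with damp_percent * mean(diagonal)
    added to each diagonal entry (GPTQ-style damping, illustrative only)."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = damp_percent * mean_diag
    return [
        [value + damp if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(hessian)
    ]


# Toy Hessian estimate; a real one would be accumulated from calibration inputs.
hessian = [[2.0, 0.5], [0.5, 4.0]]
damped = damp_hessian(hessian, damp_percent=0.01)
print(damped)
```

Only the diagonal changes; off-diagonal entries are left as they are.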
+## Deployment
+### Use with vLLM
+This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 ```python
+from vllm import LLM, SamplingParams
+from transformers import AutoTokenizer
 model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
+sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
 messages = [
     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
     {"role": "user", "content": "Who are you?"},
 ]
+prompts = tokenizer.apply_chat_template(messages, tokenize=False)
+llm = LLM(model=model_id)
+outputs = llm.generate(prompts, sampling_params)
+generated_text = outputs[0].outputs[0].text
+print(generated_text)
 ```
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
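As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat completion request and shows how it could be posted to a locally running server. It assumes a server has already been started (for example with something like `vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16`) and is listening on `http://localhost:8000/v1`; the URL and the helper names `build_chat_request` and `chat` are assumptions for this example, not part of the model card.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
BASE_URL = "http://localhost:8000/v1"  # assumption: local server on the default port


def build_chat_request(messages, model=MODEL_ID, temperature=0.6, top_p=0.9, max_tokens=256):
    """Build the JSON body for an OpenAI-style /chat/completions request."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def chat(messages, base_url=BASE_URL):
    """POST the request to a running server and return the assistant's reply text."""
    body = json.dumps(build_chat_request(messages)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
payload = build_chat_request(messages)  # no network access needed to build the body
```

With a server running, `print(chat(messages))` would send the request and print the reply. The sampling values mirror the offline example above.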
+### Use with transformers
+Transformers supports this model through its integration with the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) data format.
+The following example shows how to use the model with the `generate()` function.
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 model_id = "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16"
 print(tokenizer.decode(response, skip_special_tokens=True))
 ```
+## Creation
+This model was created by applying the [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) library, as shown in the code snippet below.
+Although AutoGPTQ was used for this particular model, Neural Magic is transitioning to [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoGPTQ.
 ```python
 from transformers import AutoTokenizer
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+import random
+model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
+num_samples = 256
+max_seq_len = 8192
 tokenizer = AutoTokenizer.from_pretrained(model_id)
+max_token_id = len(tokenizer.get_vocab()) - 1
+examples = []
+for _ in range(num_samples):
+  examples.append(
+    {
+      "input_ids": [random.randint(0, max_token_id) for _ in range(max_seq_len)],
+      "attention_mask": max_seq_len * [1],
+    }
+  )
+quantize_config = BaseQuantizeConfig(
+  bits=8,
+  group_size=-1,
+  desc_act=False,
+  model_file_base_name="model",
+  damp_percent=0.01,
+)
+model = AutoGPTQForCausalLM.from_pretrained(
+  model_id,
+  quantize_config,
+  device_map="auto",
+)
+model.quantize(examples)
+model.save_pretrained("Meta-Llama-3-8B-Instruct-quantized.w8a16")
 ```
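To give a feel for what `bits=8` with `group_size=-1` produces, here is a minimal sketch of symmetric per-channel INT8 weight quantization, where each output row of a weight matrix gets a single scale. It uses plain round-to-nearest, which is simpler than GPTQ: GPTQ additionally adjusts the not-yet-quantized weights to compensate for rounding error. Symmetric quantization is assumed here, and the function names and toy weights are hypothetical.

```python
def quantize_row_int8(row):
    """Symmetric round-to-nearest INT8 quantization of one weight row
    (one output channel). Returns the integer codes and the row's scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-128, min(127, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [code * scale for code in codes]


# Toy weight matrix: two output channels, three input features.
weights = [[0.5, -1.27, 0.02], [0.1, 0.05, -0.2]]
quantized = [quantize_row_int8(row) for row in weights]
restored = [dequantize_row(codes, scale) for codes, scale in quantized]

for row, (codes, scale), approx in zip(weights, quantized, restored):
    print(row, "->", codes, f"(scale={scale:.5f})", "->", approx)
```

Activations are untouched in this scheme, which is what the `w8a16` suffix indicates: 8-bit weights and 16-bit activations.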
 ## Evaluation
+The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096,trust_remote_code=True \
+  --tasks openllm \
+  --batch_size auto
+```
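When varying settings such as `gpu_memory_utilization` or `max_model_len` across runs, it can help to assemble the command from a dictionary. The helper below is a small sketch that reproduces the argument list above; `build_lm_eval_command` is a hypothetical name, and the only flags used are the ones already shown in the command.

```python
import shlex


def build_lm_eval_command(model_args, tasks="openllm", batch_size="auto"):
    """Assemble the lm_eval argument list from a dict of vLLM model arguments."""
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", joined_args,
        "--tasks", tasks,
        "--batch_size", batch_size,
    ]


model_args = {
    "pretrained": "neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a16",
    "dtype": "auto",
    "gpu_memory_utilization": 0.4,
    "add_bos_token": True,
    "max_model_len": 4096,
    "trust_remote_code": True,
}
command = build_lm_eval_command(model_args)
print(shlex.join(command))  # shell-ready form of the command
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `lm_eval` and vLLM are installed.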
 ### Accuracy