Update README.md

README.md CHANGED

@@ -34,7 +34,7 @@ This optimization reduces the number of bits per parameter from 16 to 8, reducin
 Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
 [AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization with 512 sequences of UltraChat.
 
-
+## Deployment
 
 ### Use with vLLM
 
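The symmetric per-tensor scheme described in the hunk above can be sketched in plain Python. This is a simplified illustration, not code from the card: integer rounding stands in for true E4M3 mantissa rounding, and 448 is the largest finite E4M3 value.

```python
# Simplified sketch of symmetric per-tensor quantization (not from the
# card): a single linear scale maps the whole tensor into the FP8 range.
# Integer rounding stands in for true E4M3 mantissa rounding.
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_per_tensor(values):
    # One scale for the whole tensor, chosen so the largest
    # magnitude maps onto the edge of the representable range.
    scale = max(abs(v) for v in values) / FP8_E4M3_MAX
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.5, 0.7, 3.0]
q, s = quantize_per_tensor(weights)
approx = dequantize(q, s)  # close to the original weights
```

Per-tensor is the coarsest granularity; finer schemes (per-channel, per-token) trade extra scale storage for lower quantization error on outlier-heavy tensors.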
@@ -46,7 +46,7 @@ from transformers import AutoTokenizer
 
 model_id = "neuralmagic/Mistral-Nemo-Instruct-2407-FP8"
 
-sampling_params = SamplingParams(temperature=0.
+sampling_params = SamplingParams(temperature=0.3, top_p=0.9, max_tokens=256)
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -57,7 +57,7 @@ messages = [
 
 prompts = tokenizer.apply_chat_template(messages, tokenize=False)
 
-llm = LLM(model=model_id)
+llm = LLM(model=model_id, max_model_len=4096)
 
 outputs = llm.generate(prompts, sampling_params)
 
@@ -65,7 +65,7 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 ## Creation
 
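The OpenAI-compatible serving mentioned above accepts standard Chat Completions requests. A minimal sketch of a request body follows, assuming a server started locally with vLLM's OpenAI entrypoint (`python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8`); the prompt text and sampling values are illustrative, not from the card.

```python
import json

# Hypothetical request body (not from the card) for vLLM's
# OpenAI-compatible server; field names follow the OpenAI
# Chat Completions schema that vLLM implements.
payload = {
    "model": "neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    "temperature": 0.3,
    "max_tokens": 256,
}
# POST this body to http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
```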