Update README.md

README.md CHANGED

@@ -34,7 +34,7 @@ This optimization reduces the number of bits per parameter from 16 to 8, reducin
 Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
 [AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization with 512 sequences of UltraChat.
 
-
+## Deployment
 
 ### Use with vLLM
 
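The symmetric per-tensor scheme described in the hunk above can be sketched in plain Python. This is a simplified illustration, not code from the card: integer rounding stands in for true E4M3 mantissa rounding, and 448 is the largest finite E4M3 value.

```python
# Simplified sketch of symmetric per-tensor quantization (not from the
# card): a single linear scale maps the whole tensor into the FP8 range.
# Integer rounding stands in for true E4M3 mantissa rounding.
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_per_tensor(values):
    # One scale for the whole tensor, chosen so the largest
    # magnitude maps onto the edge of the representable range.
    scale = max(abs(v) for v in values) / FP8_E4M3_MAX
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.5, 0.7, 3.0]
q, s = quantize_per_tensor(weights)
approx = dequantize(q, s)  # close to the original weights
```

Per-tensor is the coarsest granularity; finer schemes (per-channel, per-token) trade extra scale storage for lower quantization error on outlier-heavy tensors.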
@@ -46,7 +46,7 @@ from transformers import AutoTokenizer
 
 model_id = "neuralmagic/Mistral-Nemo-Instruct-2407-FP8"
 
-sampling_params = SamplingParams(temperature=0.
+sampling_params = SamplingParams(temperature=0.3, top_p=0.9, max_tokens=256)
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
@@ -57,7 +57,7 @@ messages = [
 
 prompts = tokenizer.apply_chat_template(messages, tokenize=False)
 
-llm = LLM(model=model_id)
+llm = LLM(model=model_id, max_model_len=4096)
 
 outputs = llm.generate(prompts, sampling_params)
 
@@ -65,7 +65,7 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 ## Creation
 
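The OpenAI-compatible serving mentioned above accepts standard Chat Completions requests. A minimal sketch of a request body follows, assuming a server started locally with vLLM's OpenAI entrypoint (`python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8`); the prompt text and sampling values are illustrative, not from the card.

```python
import json

# Hypothetical request body (not from the card) for vLLM's
# OpenAI-compatible server; field names follow the OpenAI
# Chat Completions schema that vLLM implements.
payload = {
    "model": "neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    "temperature": 0.3,
    "max_tokens": 256,
}
# POST this body to http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
```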