Update README.md

README.md CHANGED

@@ -34,7 +34,7 @@ This optimization reduces the number of bits per parameter from 16 to 8, reducin
 Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale per tensor maps between the original values and their FP8 representations.
 [AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization, with 512 sequences from UltraChat as calibration data.
 
-## Deployment
+<!-- ## Deployment
 
 ### Use with vLLM
 
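The symmetric per-tensor scheme described above can be illustrated with a small NumPy sketch. This is an illustrative simulation, not the AutoFP8 implementation: one scale maps the tensor's largest magnitude onto the largest finite FP8 E4M3 value (448), and the cast's rounding to E4M3 values is omitted.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def per_tensor_scale(t: np.ndarray) -> float:
    # One scale for the whole tensor; symmetric, so no zero point.
    return float(np.max(np.abs(t))) / FP8_E4M3_MAX

def fake_quantize(t: np.ndarray) -> np.ndarray:
    # Divide by the scale to land in the FP8 range, clip to the
    # representable interval, then multiply back (dequantize).
    # A real kernel would also round to the nearest E4M3 value here.
    scale = per_tensor_scale(t)
    q = np.clip(t / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale

w = np.array([[0.5, -1.25], [2.0, -4.0]], dtype=np.float32)
print(per_tensor_scale(w))                 # 4.0 / 448 ≈ 0.00893
print(np.abs(fake_quantize(w) - w).max())  # tiny: only float rounding, since E4M3 rounding is not simulated
```

Because the scale is taken from the tensor's own maximum, nothing is clipped; in the real scheme the only loss comes from rounding to the coarse E4M3 grid.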
@@ -65,12 +65,13 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details. -->
 
 ## Creation
 
 This model was created by applying [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), as presented in the code snippet below.
 Although AutoFP8 was used for this particular model, Neural Magic is transitioning to [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoFP8.
+Note that `transformers` must be built from source.
 
 ```python
 from datasets import load_dataset
@@ -105,6 +106,7 @@ model.save_quantized(quantized_model_dir)
 ## Evaluation
 
 The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+Note that `vllm` must be built from source.
 ```
 lm_eval \
 --model vllm \
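The storage saving quoted at the top of the diff (16 bits per parameter down to 8) can be sanity-checked with a short calculation. The 8B parameter count below is an illustrative assumption, not a figure taken from this model card:

```python
def checkpoint_gib(num_params: float, bits_per_param: int) -> float:
    # Raw weight storage in GiB: params * bytes-per-param / 2^30.
    return num_params * (bits_per_param / 8) / 2**30

params = 8e9  # hypothetical 8B-parameter model
fp16_size = checkpoint_gib(params, 16)  # ≈ 14.9 GiB
fp8_size = checkpoint_gib(params, 8)    # ≈ 7.45 GiB
print(fp16_size / fp8_size)             # 2.0: FP8 halves the weight footprint
```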