- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 11/27/2024

It achieves an average score of 71.00 on the [OpenLLM](https://huggingface.co/sp

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.
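As a quick sanity check on those percentages, here is a back-of-the-envelope sketch. The 7-billion parameter count is inferred from the model name and is only illustrative; real checkpoints also store embeddings, scales, and metadata, so actual file sizes differ somewhat.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for model weights in GiB at a given precision."""
    return num_params * bits_per_param / 8 / 2**30

# ~7B parameters, per the Qwen2.5-7B name (approximate, for illustration)
params = 7_000_000_000
bf16_gib = weight_memory_gib(params, 16)  # original 16-bit weights
fp8_gib = weight_memory_gib(params, 8)    # FP8-quantized weights

print(f"BF16: {bf16_gib:.1f} GiB -> FP8: {fp8_gib:.1f} GiB "
      f"({1 - fp8_gib / bf16_gib:.0%} smaller)")
```

Halving the bits per parameter halves the weight storage, which is where the "approximately 50%" figure comes from.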

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and floating-point representations for each output channel dimension.
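To make the per-channel scheme concrete, here is a minimal, dependency-free sketch. It is not the code used to produce this model (quantization pipelines typically rely on dedicated libraries), and it only applies the scaling step: the actual cast to 8-bit FP8 storage is simulated by keeping floats. The 448.0 constant is the largest finite value of the FP8 E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format

def per_channel_scales(weights):
    """Static scheme: one fixed scale per output channel (row),
    chosen so the channel's largest weight maps to the FP8 max."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weights]

def scale_weights(weights, scales):
    """Apply the per-channel scales; a real pipeline would then cast
    the scaled values to 8-bit FP8 storage (simulated here as floats)."""
    return [[w / s for w in row] for row, s in zip(weights, scales)]

# Toy 2x3 weight matrix: two output channels, three input features
W = [[0.5, -2.0, 1.0],
     [4.0,  0.1, -3.0]]
scales = per_channel_scales(W)     # computed once, offline ("static")
W_scaled = scale_weights(W, scales)
```

Because the scales are fixed ahead of time, inference kernels can fold them into the matrix multiply with no runtime statistics over the weights.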
Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between FP8 and floating-point representations.

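The dynamic per-token scheme differs mainly in when and along which axis the scales are computed: at runtime, one per token, from values that are not known until the forward pass. A hedged sketch, again simulating the FP8 cast with floats:

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format

def quantize_per_token(activations):
    """Dynamic scheme: compute one scale per token at runtime from that
    token's own max magnitude, then rescale (FP8 cast simulated)."""
    scales, scaled = [], []
    for token in activations:  # each row: one token's activation vector
        amax = max(abs(a) for a in token)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0  # guard all-zero tokens
        scales.append(scale)
        scaled.append([a / scale for a in token])
    return scaled, scales

# Two tokens with very different dynamic ranges each get their own scale,
# so neither is crushed into a few FP8 values by the other's magnitude.
acts = [[0.01, -0.02, 0.005],
        [10.0, -7.5, 3.0]]
scaled, scales = quantize_per_token(acts)
```

Computing scales per token costs a small runtime reduction over each activation vector, but avoids the calibration data a static activation scheme would need.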
## Deployment