Update README.md

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

**NOTE**: Unsloth is currently working out optimal generation parameters for practical use.
Their recommendations target llama.cpp, but I'm sure they translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
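As a starting point, here is a minimal sketch of running this checkpoint with vLLM across two GPUs; the model id is a placeholder for this repository and the sampling values are illustrative, not tuned recommendations.

```python
# Minimal sketch: run this FP8 checkpoint with vLLM on 2 GPUs (tensor parallel).
# The model id is a placeholder; sampling values are illustrative, not tuned.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<this-repo-id>",   # replace with this repository's Hugging Face id or a local path
    tensor_parallel_size=2,   # e.g. 2x RTX 3090
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Briefly explain FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```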
## Quantization Details

- **Method**: FP8 E4M3 per-tensor quantization with embedded scales
- **Quantized size**: ~30GB (FP8)
- **Preserved in BF16**: lm_head, embed_tokens, layernorms, router weights
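For intuition, here is a rough sketch of what per-tensor FP8 E4M3 quantization with embedded scales looks like. It is an illustration only, not the exact conversion script used for this checkpoint; the substring-based module matching and the `_scale` key naming are assumptions.

```python
# Illustrative sketch of per-tensor FP8 E4M3 quantization with embedded scales.
# Not the exact conversion script for this checkpoint; module matching and the
# "_scale" key naming are simplifying assumptions.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Return (fp8_weight, scale) such that weight ~= fp8_weight.float() * scale."""
    amax = weight.abs().max().clamp(min=1e-12)
    scale = amax / FP8_MAX                                   # one scale for the whole tensor
    q = (weight / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

# Modules preserved in BF16, per the list above.
SKIP = ("lm_head", "embed_tokens", "norm", "router")

def quantize_state_dict(state_dict):
    out = {}
    for name, t in state_dict.items():
        if t.dim() == 2 and t.is_floating_point() and not any(s in name for s in SKIP):
            q, scale = quantize_fp8_per_tensor(t.float())
            out[name] = q
            out[name + "_scale"] = scale                     # scale embedded alongside the weight
        elif t.is_floating_point():
            out[name] = t.to(torch.bfloat16)                 # preserved in BF16
        else:
            out[name] = t                                    # non-float tensors left untouched
    return out
```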
## Performance
Tested on 2x RTX 3090 (24GB each) with vLLM 0.13.0: