Update README.md

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

**NOTE**: Unsloth is currently working out optimal generation parameters for practical use.
Their recommendations target llama.cpp, but I'm sure they translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
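As a starting point, here is a minimal sketch of running this checkpoint with vLLM across two GPUs; the model id is a placeholder for this repository and the sampling values are illustrative, not tuned recommendations.

```python
# Minimal sketch: run this FP8 checkpoint with vLLM on 2 GPUs (tensor parallel).
# The model id is a placeholder; sampling values are illustrative, not tuned.
from vllm import LLM, SamplingParams

llm = LLM(
    model="<this-repo-id>",   # replace with this repository's Hugging Face id or a local path
    tensor_parallel_size=2,   # e.g. 2x RTX 3090
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Briefly explain FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```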
## Quantization Details

- **Method**: FP8 E4M3 per-tensor quantization with embedded scales
- **Quantized size**: ~30GB (FP8)
- **Preserved in BF16**: lm_head, embed_tokens, layernorms, router weights
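For intuition, here is a rough sketch of what per-tensor FP8 E4M3 quantization with embedded scales looks like. It is an illustration only, not the exact conversion script used for this checkpoint; the substring-based module matching and the `_scale` key naming are assumptions.

```python
# Illustrative sketch of per-tensor FP8 E4M3 quantization with embedded scales.
# Not the exact conversion script for this checkpoint; module matching and the
# "_scale" key naming are simplifying assumptions.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Return (fp8_weight, scale) such that weight ~= fp8_weight.float() * scale."""
    amax = weight.abs().max().clamp(min=1e-12)
    scale = amax / FP8_MAX                                   # one scale for the whole tensor
    q = (weight / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

# Modules preserved in BF16, per the list above.
SKIP = ("lm_head", "embed_tokens", "norm", "router")

def quantize_state_dict(state_dict):
    out = {}
    for name, t in state_dict.items():
        if t.dim() == 2 and t.is_floating_point() and not any(s in name for s in SKIP):
            q, scale = quantize_fp8_per_tensor(t.float())
            out[name] = q
            out[name + "_scale"] = scale                     # scale embedded alongside the weight
        elif t.is_floating_point():
            out[name] = t.to(torch.bfloat16)                 # preserved in BF16
        else:
            out[name] = t                                    # non-float tensors left untouched
    return out
```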
## Performance
Tested on 2x RTX 3090 (24GB each) with vLLM 0.13.0: