marksverdhei committed · verified
Commit 8921e2e · 1 Parent(s): 5d2df64

Update README.md

Files changed (1):
  1. README.md +5 -0
README.md CHANGED
@@ -13,6 +13,9 @@ library_name: transformers
 
 FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).
 
+ **NOTE**: Unsloth is currently working out optimal generation parameters for practical use.
+ Their settings are for llama.cpp, but they should translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
+
 ## Quantization Details
 
 - **Method**: FP8 E4M3 per-tensor quantization with embedded scales
@@ -20,6 +23,8 @@ FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).
 - **Quantized size**: ~30GB (FP8)
 - **Preserved in BF16**: lm_head, embed_tokens, layernorms, router weights
 
+
+
 ## Performance
 
 Tested on 2x RTX 3090 (24GB each) with vLLM 0.13.0:
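
For readers who want to see what "FP8 E4M3 per-tensor quantization with embedded scales" amounts to, here is a minimal PyTorch sketch under stated assumptions: it is not the script that produced this checkpoint, the skip patterns only approximate the "Preserved in BF16" list above, and the `_scale` suffix is an illustrative naming convention.

```python
import torch

# Parameter-name patterns kept in BF16, per the README's "Preserved in
# BF16" list. These are illustrative substrings, not the exact
# parameter names used by GLM-4.7-Flash.
SKIP_PATTERNS = ("lm_head", "embed_tokens", "norm", "router")

def quantize_tensor_fp8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-tensor FP8 E4M3: pick one scale so the tensor's max |value|
    maps to the E4M3 maximum (448), then cast. The scale is returned so
    it can be stored ("embedded") next to the quantized weight."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = w.abs().max().clamp(min=1e-12) / finfo.max
    q = (w / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return q, scale

def quantize_state_dict(sd: dict) -> dict:
    out = {}
    for name, w in sd.items():
        if w.dim() < 2 or any(p in name for p in SKIP_PATTERNS):
            out[name] = w.to(torch.bfloat16)  # preserved in BF16
        else:
            q, scale = quantize_tensor_fp8(w.float())
            out[name] = q
            out[name + "_scale"] = scale  # illustrative scale naming
    return out

# Dequantization for reference: q.to(torch.float32) * scale
```

Per-tensor quantization is the coarsest FP8 granularity (a single scale per weight matrix), which is one reason precision-sensitive pieces like embeddings, layernorms, and router weights are typically left in BF16, as this checkpoint does.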
 
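Similarly, a hedged sketch of reproducing the test setup above with vLLM's offline Python API. The repo id is hypothetical (this commit page does not name it), `quantization="fp8"` is an assumption rather than a confirmed setting, and the sampling values are placeholders, per the NOTE in the diff about generation parameters still being tuned.

```python
from vllm import LLM, SamplingParams

# Assumed setup for the "2x RTX 3090 with vLLM 0.13.0" test above.
llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-FP8",  # hypothetical repo id
    tensor_parallel_size=2,  # shard across the two 24GB cards
    quantization="fp8",      # assumption, not confirmed by this commit
)

# Placeholder sampling values; see the NOTE above about Unsloth's
# parameter tuning before settling on real ones.
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

out = llm.generate(["Summarize FP8 E4M3 quantization in one sentence."], params)
print(out[0].outputs[0].text)
```

With ~30GB of FP8 weights, tensor parallelism across two 24GB GPUs is what makes the model fit at all; a single 3090 would not hold it.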