Fix license to MIT
README.md CHANGED

---
language:
- en
- zh
library_name: transformers
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- fp8
- quantized
- glm4
- vllm
pipeline_tag: text-generation
---

# GLM-4.7-Flash FP8

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) for efficient inference.

## Model Details

- **Base Model**: zai-org/GLM-4.7-Flash
- **Quantization**: FP8 (E4M3) weight quantization
- **Architecture**: GLM-4 MoE Lite (47 layers)
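
The card doesn't include the exact quantization recipe, so here is a minimal per-tensor sketch of what FP8 (E4M3) weight quantization does, in PyTorch. The weight tensor and the max-abs scaling scheme are illustrative assumptions, not the procedure verified for this checkpoint:

```python
import torch

# Illustrative per-tensor FP8 (E4M3) quantization: 4 exponent bits, 3 mantissa
# bits, one byte per weight. Not necessarily the recipe used for this repo.
w = torch.randn(4096, 4096)                      # hypothetical full-precision weight matrix
fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0, the largest finite E4M3 value
scale = fp8_max / w.abs().max()                  # map the largest weight onto the FP8 range
w_fp8 = (w * scale).to(torch.float8_e4m3fn)      # quantized weights, 1 byte each
w_deq = w_fp8.to(torch.float32) / scale          # dequantized values used at compute time
print("max abs rounding error:", (w - w_deq).abs().max().item())
```

At one byte per weight, FP8 halves weight memory relative to the BF16 original, which is where the efficiency gain comes from.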

## Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    trust_remote_code=True,
    dtype="bfloat16",
    quantization="fp8"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)
```
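
The `LLM` API above runs the engine in-process. The same checkpoint should also work behind vLLM's OpenAI-compatible server; a minimal sketch, assuming the standard `vllm serve` CLI and its default port (untested against this particular repo):

```python
# Start the server first, e.g.:
#   vllm serve marksverdhei/GLM-4.7-Flash-fp8 --quantization fp8 --trust-remote-code
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```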

## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "marksverdhei/GLM-4.7-Flash-fp8",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "marksverdhei/GLM-4.7-Flash-fp8",
    trust_remote_code=True
)
```
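
The snippet above only loads the model and tokenizer. A generation call might look like this; the prompt is illustrative, and the chat-template usage assumes the repo ships one (standard `transformers` API otherwise):

```python
# Build a chat prompt with the tokenizer's chat template, then generate.
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```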

## Original Model

See the original model at [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) for full capabilities and benchmarks.

GLM-4.7 features improvements in:

- Core coding and agentic tasks
- UI/Vibe coding
- Tool use
- Complex reasoning

## License

MIT License (same as the base model).
|