GadflyII committed on
Commit a3ad40a · verified · 1 Parent(s): 8b152a1

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +180 -62

README.md CHANGED
@@ -1,92 +1,210 @@
  ---
- license: other
- license_name: glm-4
- license_link: https://huggingface.co/THUDM/glm-4-9b/blob/main/LICENSE
- base_model: THUDM/GLM-4-9B-0414
  tags:
- - glm
- - moe
- - nvfp4
- - quantized
- - vllm
  library_name: transformers
  ---

- # GLM-4.7-Flash-MTP-NVFP4
-
- NVFP4 (Native FP4) quantized version of [THUDM/GLM-4-9B-0414](https://huggingface.co/THUDM/GLM-4-9B-0414) with MTP (Multi-Token Prediction) layers preserved in BF16.
-
- ## Model Details
-
- - **Base Model**: THUDM/GLM-4-9B-0414 (GLM-4.7-Flash)
- - **Architecture**: Glm4MoeLiteForCausalLM (MoE with 64 experts, top-4 routing)
- - **Quantization**: NVFP4 with FP8 scales, block size 16
- - **Size**: 20.9 GB (3.0x compression from 62.4 GB BF16)
- - **MTP Layers**: Preserved in BF16 for speculative decoding compatibility
-
- ## Quantization Details
-
- | Component | Precision | Notes |
- |-----------|-----------|-------|
- | MLP/FFN layers | NVFP4 | 4-bit weights, 4-bit activations |
- | Attention (self_attn) | BF16 | MLA architecture preserved |
- | MTP layers (eh_proj, shared_head) | BF16 | Speculative decoding compatible |
- | Embeddings | BF16 | Not quantized |
- | Gates | BF16 | Router gates preserved |
-
- ### Calibration Settings
- - **Samples**: 512 (from wikitext)
- - **Sequence Length**: 4096
- - **Strategy**: tensor_group with group_size=16
-
- ## Benchmark Results
-
- ### MMLU-Pro Accuracy
-
- | Model | Accuracy |
- |-------|----------|
- | BF16 (baseline) | 24.83% |
- | NVFP4-v2 (this model) | 23.91% |
-
  ### MTP Acceptance Rate
- - **BF16**: 60% acceptance, 1.60 mean accepted length
- - **NVFP4-v2**: 63% acceptance, 1.63 mean accepted length
-
- MTP quality is preserved after quantization.
-
- ### Performance Note
-
- MTP speculative decoding currently shows overhead rather than speedup due to missing torch.compile support for the MTP drafter model. For best throughput, run without MTP enabled.
-
- ## Usage with vLLM
-
  ```bash
- # Standard inference (recommended for performance)
- VLLM_ATTENTION_BACKEND=TRITON_MLA python -m vllm.entrypoints.openai.api_server \
-     --model GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
-     --trust-remote-code \
-     --max-model-len 4096 \
-     --gpu-memory-utilization 0.95

  # With MTP speculative decoding (experimental)
- VLLM_ATTENTION_BACKEND=TRITON_MLA python -m vllm.entrypoints.openai.api_server \
-     --model GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
-     --trust-remote-code \
-     --max-model-len 4096 \
-     --gpu-memory-utilization 0.90 \
-     --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
  ```

- ## Requirements
-
- - vLLM with NVFP4 support (v0.8.0+)
- - NVIDIA GPU with FP4 support (Blackwell/Ada Lovelace with appropriate kernels)
- - transformers >= 5.0.0
-
- ## License
-
- This model inherits the [GLM-4 License](https://huggingface.co/THUDM/glm-4-9b/blob/main/LICENSE) from the base model.
-
- ## Acknowledgments
-
- Quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor) with compressed-tensors format.
 
  ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ base_model: zai-org/GLM-4.7-Flash
  tags:
+ - moe
+ - nvfp4
+ - quantized
+ - vllm
+ - glm
+ - 30b
+ - mtp
+ - speculative-decoding
  library_name: transformers
+ pipeline_tag: text-generation
  ---
+ # Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream):
+ https://github.com/Gadflyii/vllm/tree/main

+ # GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP Support)

+ This is a **mixed-precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version preserves the **MTP (Multi-Token Prediction) layers in BF16** for speculative decoding compatibility.

+ ## What's Different from GLM-4.7-Flash-NVFP4?

+ | Feature | GLM-4.7-Flash-NVFP4 | **This Model** |
+ |---------|---------------------|----------------|
+ | MTP Layers | Quantized (broken) | **BF16 (working)** |
+ | MTP Speculative Decoding | ❌ Not supported | ✅ Supported |
+ | Calibration Samples | 128 | **512** |
+ | Calibration Seq Length | 2048 | **4096** |
+ | MMLU-Pro Accuracy | 23.56% | **23.91%** |

+ ## Quantization Strategy

+ This model uses **mixed precision** to preserve accuracy and MTP functionality:

+ | Component | Precision | Rationale |
+ |-----------|-----------|-----------|
+ | MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
+ | Dense MLP | FP4 (E2M1) | First-layer dense MLP |
+ | **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are quantization-sensitive |
+ | **MTP Layers** | **BF16** | `eh_proj`, `shared_head.head` for speculative decoding |
+ | Norms, Gates, Embeddings | BF16 | Standard practice |

+ ## Performance

+ | Metric | BF16 | NVFP4-v1 | **This Model** |
+ |--------|------|----------|----------------|
+ | MMLU-Pro | 24.83% | 23.56% | **23.91%** |
+ | Size | 62.4 GB | 20.4 GB | **20.9 GB** |
+ | Compression | 1x | 3.1x | **3.0x** |
+ | Accuracy Loss (pts) | - | -1.27 | **-0.92** |
+ | MTP Working | ✅ | ❌ | ✅ |
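The ~3.0x figure is consistent with the NVFP4 layout described below: each block of 16 weights stores 4-bit values plus one shared 8-bit (FP8) scale, i.e. 4 + 8/16 = 4.5 effective bits per quantized weight versus 16 for BF16. A quick arithmetic sanity check using only numbers from this card:

```python
# Effective bits per NVFP4 weight: 4-bit E2M1 value plus one FP8 scale
# shared across each block of 16 weights.
BLOCK_SIZE = 16
bits_per_weight = 4 + 8 / BLOCK_SIZE  # 4.5 bits

# Compression of the quantized tensors alone, relative to 16-bit BF16:
fp4_tensor_compression = 16 / bits_per_weight  # ~3.56x

# The full checkpoint compresses less because attention, MTP layers,
# embeddings, norms, and gates stay in BF16:
overall_compression = round(62.4 / 20.9, 1)  # GB BF16 / GB quantized
print(fp4_tensor_compression, overall_compression)
```

The gap between ~3.56x (FP4 tensors only) and 3.0x (whole checkpoint) is the cost of the BF16-retained components.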

  ### MTP Acceptance Rate

+ | Model | Acceptance Rate | Mean Accepted Length |
+ |-------|-----------------|----------------------|
+ | BF16 (baseline) | 60% | 1.60 |
+ | **This Model** | **63%** | **1.63** |
+
+ MTP quality is preserved (actually slightly improved) after quantization.
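With a single speculative token, mean accepted length follows directly from the acceptance rate: every decode step emits one verified token, plus the drafted token whenever it is accepted. A quick check of the numbers above:

```python
# With num_speculative_tokens=1:
#   mean_accepted_length = 1 + acceptance_rate
for name, acceptance in [("BF16", 0.60), ("NVFP4 (this model)", 0.63)]:
    print(name, 1 + acceptance)  # 1.60 and 1.63, matching the table
```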
+
+ ### MTP Performance Note
+
+ MTP speculative decoding currently shows overhead rather than speedup due to missing `torch.compile` support for the MTP drafter model in vLLM. For best throughput, run without MTP enabled until this is resolved upstream.
+
+ | Configuration | Tokens/sec | Recommendation |
+ |---------------|------------|----------------|
+ | Without MTP | 78.1 tok/s | ✅ **Use this** |
+ | With MTP (1 token) | 64.7 tok/s | ❌ |
+ | With MTP (2 tokens) | 56.8 tok/s | ❌ |
+ | With MTP (4 tokens) | 44.5 tok/s | ❌ |
+
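The table makes the overhead concrete: with one speculative token each step yields 1.63 tokens on average, so MTP only wins if a draft-plus-verify step costs less than 1.63x a plain decode step. Plugging in the measured throughputs (all numbers from the tables above):

```python
base_toks = 78.1        # tok/s without MTP
mtp_toks = 64.7         # tok/s with 1 speculative token
mean_accepted = 1.63    # tokens produced per MTP step

base_step_ms = 1000 / base_toks                 # plain decode step
mtp_step_ms = mean_accepted * 1000 / mtp_toks   # draft + verify step

# Relative cost of one MTP step; must be below mean_accepted (1.63) to win:
print(round(mtp_step_ms / base_step_ms, 2))  # ~1.97
```

An uncompiled drafter makes each MTP step cost nearly 2x a plain step while producing only 1.63 tokens, which is exactly the slowdown the table shows.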
+ ## Usage
+
+ ### Requirements
+
+ - **vLLM**: 0.8.0+ (for compressed-tensors NVFP4 support)
+ - **transformers**: 5.0.0+ (for the `glm4_moe_lite` architecture)
+ - **GPU**: NVIDIA GPU with FP4 support (native FP4 tensor cores on Blackwell; Ada Lovelace/Hopper depend on appropriate fallback kernels)
+
+ ### Installation

+ ```bash
+ pip install "vllm>=0.8.0"
+ pip install git+https://github.com/huggingface/transformers.git
+ ```
+
+ ### Inference with vLLM (Recommended)

+ ```python
+ from vllm import LLM, SamplingParams
+
+ model = LLM(
+     "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
+     tensor_parallel_size=1,
+     max_model_len=4096,
+     trust_remote_code=True,
+     gpu_memory_utilization=0.90,
+ )
+
+ params = SamplingParams(temperature=0.7, max_tokens=512)
+ outputs = model.generate(["Explain quantum computing in simple terms."], params)
+ print(outputs[0].outputs[0].text)
+ ```

+ ### Serving with vLLM

  ```bash
+ # Standard serving (recommended for performance)
+ VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
+     --tensor-parallel-size 1 \
+     --max-model-len 4096 \
+     --trust-remote-code \
+     --gpu-memory-utilization 0.90

  # With MTP speculative decoding (experimental)
+ VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
+     --tensor-parallel-size 1 \
+     --max-model-len 4096 \
+     --trust-remote-code \
+     --gpu-memory-utilization 0.90 \
+     --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
  ```
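Once a server is up, it exposes the standard OpenAI-compatible API. A minimal sketch of a chat-completions request body (the endpoint path and port 8000 are vLLM defaults; adjust if you pass `--port`):

```python
import json

# Request body for the vLLM OpenAI-compatible server, e.g.
# POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
}
body = json.dumps(payload)
print(body)
```

Send it with any HTTP client, e.g. `curl -H "Content-Type: application/json" -d @- http://localhost:8000/v1/chat/completions`.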

+ ## Model Details

+ - **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
+ - **Architecture**: `Glm4MoeLiteForCausalLM`
+ - **Parameters**: 30B total, 3B active per token (30B-A3B)
+ - **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
+ - **Layers**: 47 (with 1 MTP layer)
+ - **Context Length**: 202,752 tokens (max)
+ - **Languages**: English, Chinese

+ ## Quantization Details

+ - **Format**: compressed-tensors (NVFP4)
+ - **Block Size**: 16
+ - **Scale Format**: FP8 (E4M3)
+ - **Calibration**: 512 samples from the wikitext dataset
+ - **Calibration Sequence Length**: 4096
+ - **Full Expert Calibration**: All 64 experts calibrated per sample
+
+ ### Tensors by Precision
+
+ | Precision | Count | Description |
+ |-----------|-------|-------------|
+ | NVFP4 | 9,168 | MLP/FFN weights |
+ | BF16 | 240 | Attention weights (MLA) |
+ | BF16 | 2 | MTP layers (`eh_proj`, `shared_head.head`) |
+
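To make the block format concrete, here is a pure-Python sketch of NVFP4-style fake quantization: each block of 16 weights is scaled so its largest magnitude maps to the largest E2M1 value (±6), then every weight snaps to the nearest representable FP4 value. This illustrates the numeric grid only; a real kernel also quantizes the per-block scale itself to FP8 E4M3, which is omitted here.

```python
# Magnitudes representable in E2M1 (FP4): 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(weights, block_size=16):
    """Fake-quantize one block of weights to E2M1 with a shared scale."""
    assert len(weights) == block_size
    amax = max(abs(w) for w in weights)
    scale = amax / 6.0 if amax > 0 else 1.0  # map max |w| to E2M1 max (6)
    out = []
    for w in weights:
        # Snap |w|/scale to the nearest E2M1 magnitude, then restore sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(mag * scale if w >= 0 else -mag * scale)
    return out

block = [0.01 * i - 0.08 for i in range(16)]  # toy weights in [-0.08, 0.07]
deq = quantize_block(block)
max_err = max(abs(a - b) for a, b in zip(block, deq))
print(max_err)  # bounded by half the coarsest E2M1 step times the scale
```

Because the grid is non-uniform (steps of 0.5 near zero, 2.0 near the top), the largest rounding error lands on mid-range values, which is why calibration data matters for choosing scales.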
+ ## Evaluation
+
+ ### MMLU-Pro Overall Results
+
+ | Model | Accuracy | Correct | Total |
+ |-------|----------|---------|-------|
+ | BF16 (baseline) | 24.83% | 2988 | 12032 |
+ | NVFP4-v1 | 23.56% | 2835 | 12032 |
+ | **This Model** | **23.91%** | **2877** | 12032 |
+
+ ### MMLU-Pro by Category
+
+ | Category | BF16 | This Model | Difference |
+ |----------|------|------------|------------|
+ | Social Sciences | 32.70% | 31.26% | -1.44% |
+ | Other | 31.57% | 29.85% | -1.72% |
+ | Humanities | 23.78% | 22.82% | -0.96% |
+ | STEM | 19.94% | 19.48% | -0.46% |
+
+ ### MMLU-Pro by Subject
+
+ | Subject | BF16 | This Model | Difference |
+ |---------|------|------------|------------|
+ | Biology | 50.35% | 48.12% | -2.23% |
+ | Psychology | 44.99% | 41.23% | -3.76% |
+ | History | 33.60% | 34.12% | +0.52% |
+ | Health | 35.21% | 34.11% | -1.10% |
+ | Economics | 36.37% | 33.06% | -3.31% |
+ | Philosophy | 31.46% | 29.26% | -2.20% |
+ | Other | 28.35% | 26.08% | -2.27% |
+ | Computer Science | 26.10% | 21.95% | -4.15% |
+ | Business | 16.35% | 19.26% | +2.91% |
+ | Law | 16.89% | 15.99% | -0.90% |
+ | Math | 14.06% | 14.73% | +0.67% |
+ | Physics | 15.32% | 15.24% | -0.08% |
+ | Engineering | 16.00% | 14.96% | -1.04% |
+ | Chemistry | 14.13% | 14.84% | +0.71% |
+
+ ## Citation
+
+ If you use this model, please cite the original GLM-4.7-Flash:
+
+ ```bibtex
+ @misc{glm4flash2025,
+     title={GLM-4.7-Flash},
+     author={Zhipu AI},
+     year={2025},
+     howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
+ }
+ ```

+ ## License

+ This model inherits the Apache 2.0 license from the base model.