Upload folder using huggingface_hub

Files changed (15)

.gitattributes +1 -0
README.md +179 -0
base_weights.safetensors +3 -0
config.json +45 -0
model-00001-of-00007.safetensors +3 -0
model-00002-of-00007.safetensors +3 -0
model-00003-of-00007.safetensors +3 -0
model-00004-of-00007.safetensors +3 -0
model-00005-of-00007.safetensors +3 -0
model-00006-of-00007.safetensors +3 -0
model-00007-of-00007.safetensors +3 -0
model.safetensors.index.json +0 -0
quantization_index.json +0 -0
tokenizer.json +3 -0
tokenizer_config.json +321 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,179 @@

+---
+language:
+  - en
+  - zh
+library_name: transformers
+license: mit
+pipeline_tag: text-generation
+base_model: zai-org/GLM-4.7-Flash
+tags:
+  - trellis
+  - quantized
+  - moe
+  - 3-bit
+  - mixed-precision
+  - cuda
+  - glm
+quantized_by: Metal Marlin
+---
+# GLM-4.7-Flash-Trellis-3.8bpw
+<div align="center">
+<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
+</div>
+**Trellis-quantized GLM-4.7-Flash** — a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.
+| Metric | Value |
+|--------|-------|
+| **Effective bits** | 3.78 bpw |
+| **Compression** | 4.2× vs FP16 |
+| **Model size** | ~14 GB (vs ~60 GB FP16) |
+| **Parameters** | 29.3B |
+| **Format** | HuggingFace sharded safetensors |
+## Model Description
+This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), which Z.AI positions as the strongest model in the 30B class, balancing performance and efficiency.
+GLM-4.7-Flash features:
+- **30B-A3B MoE architecture** (64 routed experts plus 1 shared expert, 4 routed experts active per token)
+- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
+- **Strong reasoning and coding** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
+- **Bilingual** (English + Chinese)
+## Quantization Details
+Quantized using **Trellis** (EXL3-style) with Metal Marlin acceleration:
+### Bit Allocation
+| Bit Width | Tensors | Parameters | % of Model |
+|-----------|---------|------------|------------|
+| 6-bit | 3,037 | 9.4B | 32.2% |
+| 3-bit | 2,710 | 8.6B | 29.3% |
+| 2-bit | 2,736 | 8.6B | 29.3% |
+| 4-bit | 575 | 2.1B | 7.2% |
+| 5-bit | 196 | 591M | 2.0% |
+### Sensitivity-Aware Allocation
+- **8-bit**: Router weights, embeddings, LM head, layer norms
+- **6-bit**: Gate layers, attention projections with high outlier ratios
+- **4-5 bit**: Standard attention layers (q/k/v/o projections)
+- **2-3 bit**: MoE expert layers (lowest sensitivity)
+### Quantization Statistics
+- **Average MSE**: 0.000223
+- **Average RMSE**: 0.0149
+- **Quantization time**: ~110 seconds (RTX 3090 Ti)
+- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor, group-wise scales (g=128)
+## Files
+```
+GLM-4.7-Flash-Trellis-3.8bpw/
+├── model-00001-of-00007.safetensors   # ~2 GB each
+├── model-00002-of-00007.safetensors
+├── model-00003-of-00007.safetensors
+├── model-00004-of-00007.safetensors
+├── model-00005-of-00007.safetensors
+├── model-00006-of-00007.safetensors
+├── model-00007-of-00007.safetensors
+├── model.safetensors.index.json       # Weight map
+├── base_weights.safetensors           # Embeddings, norms (FP16)
+├── config.json                        # Model config
+├── tokenizer.json                     # Tokenizer
+├── tokenizer_config.json
+└── quantization_index.json            # Quantization metadata
+```
+## Usage
+### With Metal Marlin (Apple Silicon)
+```python
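+from metal_marlin.trellis import TrellisForCausalLM
+from transformers import AutoTokenizer
+
+# Load the Trellis-quantized weights onto the Apple GPU (Metal / MPS backend).
+model = TrellisForCausalLM.from_pretrained(
+    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
+    device="mps",
+)
+# The tokenizer is taken from the original, unquantized model repository.
+tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
+
+# Single-turn prompt written with the GLM role tokens.
+prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
+
+# Generate up to 256 new tokens and decode the result to text.
+output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
+print(tokenizer.decode(output[0], skip_special_tokens=True))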
+```
+### Tensor Format
+Each quantized tensor has 4 components:
+- `{name}__indices`: Packed uint8 Trellis indices
+- `{name}__scales`: FP16 per-group scales (group_size=128)
+- `{name}__su`: FP16 row scaling factors
+- `{name}__sv`: FP16 column scaling factors
+## Hardware Requirements
+| Device | Unified memory | Notes |
+|--------|----------------|-------|
+| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
+| Apple M4 Max | 36 GB+ | Via Metal Marlin |
+## Benchmarks
+### Original Model Performance (from Z.AI)
+| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
+|-----------|---------------|---------------|-------------|
+| AIME 2025 | 91.6 | 85.0 | **91.7** |
+| GPQA | **75.2** | 73.4 | 71.5 |
+| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
+| τ²-Bench | **79.5** | 49.0 | 47.7 |
+| BrowseComp | **42.8** | 2.29 | 28.3 |
+### Quantized Model (Metal Marlin, M4 Max)
+| Metric | Value |
+|--------|-------|
+| Decode | 5.4 tok/s |
+| Prefill (2K) | 42 tok/s |
+| Memory | 16.9 GB |
+## Limitations
+- **Not compatible with standard transformers** — requires Trellis-aware inference code
+- **No speculative decoding** yet
+- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 3-4 bit quantization)
+## Credits
+- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
+- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
+- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)
+## Citation
+If you use this model, please cite the original GLM-4.5 paper:
+```bibtex
+@misc{glm2025glm45,
+      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
+      author={GLM Team and Aohan Zeng and Xin Lv and others},
+      year={2025},
+      eprint={2508.06471},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2508.06471},
+}
+```
+## License
+This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.
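As a cross-check of the README's headline figures, the sketch below recomputes the effective bits per weight as a parameter-weighted average of the Bit Allocation table. It assumes the rounded parameter counts in that table are the only inputs and ignores any overhead from scales or packing. The names `BIT_ALLOCATION` and `effective_bpw` are illustrative and not part of Metal Marlin.

```python
# Rounded parameter counts per bit width, copied from the README's
# "Bit Allocation" table (bit width -> number of parameters).
BIT_ALLOCATION = {
    6: 9.4e9,
    3: 8.6e9,
    2: 8.6e9,
    4: 2.1e9,
    5: 591e6,
}


def effective_bpw(allocation):
    """Return the parameter-weighted average bit width."""
    total_bits = sum(bits * params for bits, params in allocation.items())
    total_params = sum(allocation.values())
    return total_bits / total_params


bpw = effective_bpw(BIT_ALLOCATION)
compression_vs_fp16 = 16 / bpw
# Payload of the quantized indices alone, in decimal gigabytes.
quantized_gb = sum(b * p for b, p in BIT_ALLOCATION.items()) / 8 / 1e9

print(f"effective bits per weight: {bpw:.2f}")
print(f"compression vs FP16: {compression_vs_fp16:.1f}x")
print(f"quantized payload: {quantized_gb:.1f} GB")
```

With the table's rounded inputs the average comes out close to the 3.78 bpw and 4.2× figures in the README's metrics table.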

base_weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e2eba0bf7f0c7390f02258831b37e405f77acdf3c7d1dc2257566c94275329aa
+size 1281372424
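The three added lines are a Git LFS pointer, not the weights themselves: a spec version, a SHA-256 object id, and the byte size of the real file. A minimal sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` is hypothetical, and the parser only handles the three keys shown here.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content for base_weights.safetensors, copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e2eba0bf7f0c7390f02258831b37e405f77acdf3c7d1dc2257566c94275329aa
size 1281372424
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```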

config.json ADDED

	@@ -0,0 +1,45 @@

+{
+  "architectures": [
+    "Glm4MoeLiteForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "pad_token_id": 154820,
+  "eos_token_id": [
+    154820,
+    154827,
+    154829
+  ],
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "intermediate_size": 10240,
+  "max_position_embeddings": 202752,
+  "model_type": "glm4_moe_lite",
+  "moe_intermediate_size": 1536,
+  "topk_method": "noaux_tc",
+  "norm_topk_prob": true,
+  "num_attention_heads": 20,
+  "n_group": 1,
+  "topk_group": 1,
+  "n_routed_experts": 64,
+  "n_shared_experts": 1,
+  "routed_scaling_factor": 1.8,
+  "num_experts_per_tok": 4,
+  "first_k_dense_replace": 1,
+  "num_hidden_layers": 47,
+  "num_key_value_heads": 20,
+  "num_nextn_predict_layers": 1,
+  "partial_rotary_factor": 1.0,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "tie_word_embeddings": false,
+  "dtype": "bfloat16",
+  "transformers_version": "5.0.0rc0",
+  "q_lora_rank": 768,
+  "kv_lora_rank": 512,
+  "qk_nope_head_dim": 192,
+  "qk_rope_head_dim": 64,
+  "v_head_dim": 256,
+  "vocab_size": 154880
+}
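A few derived quantities can be read straight off this config. The sketch below computes the embedding matrix size and the fraction of routed experts used per token. It assumes, as the README's file listing suggests but does not state outright, that the embeddings and the untied LM head are stored in FP16. The variable names are illustrative.

```python
# Values copied from config.json above.
config = {
    "vocab_size": 154880,
    "hidden_size": 2048,
    "tie_word_embeddings": False,
    "n_routed_experts": 64,
    "n_shared_experts": 1,
    "num_experts_per_tok": 4,
}

BYTES_PER_FP16 = 2

# One embedding row of hidden_size values per vocabulary entry.
embed_params = config["vocab_size"] * config["hidden_size"]
# With untied embeddings, the LM head is a second matrix of the same shape.
matrices = 1 if config["tie_word_embeddings"] else 2
embed_and_head_bytes = embed_params * matrices * BYTES_PER_FP16

# Share of routed experts that the router selects for each token.
routed_fraction = config["num_experts_per_tok"] / config["n_routed_experts"]

print(f"embedding parameters: {embed_params:,}")
print(f"embeddings + LM head at FP16: {embed_and_head_bytes:,} bytes")
print(f"routed experts used per token: {routed_fraction:.2%}")
```

The FP16 estimate is close to the 1,281,372,424-byte size recorded for `base_weights.safetensors`, which is consistent with that file holding both matrices plus smaller tensors, though the diff alone does not confirm it.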

model-00001-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:00378dc267fde0efcf44d6dce0a62d781ab6c68959a7c5b548de7656756e6538
+size 2147572838

model-00002-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4644617fd9d47acdd4bdba6624c2c3ce13fce4f1bd334f2e93da0421643715a4
+size 2146066344

model-00003-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8d9eb56bda20452832e2a5361a68459f2b77beb9368d5b523b0a7e5ba280258a
+size 2147438273

model-00004-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:444bc6949759b73e1bcd8e66e874123f79e58dd3121291c6d9fd0044a6eb8fea
+size 2147456636

model-00005-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b3b883bb05d63495cf5883b1c9ef75a508aecbd0bb0720ec4e2791df642b3b69
+size 2147436778

model-00006-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9ca14e004d3d8f13ab7f5064d5a279f83fd8f4db9d7b070bebba63cb1cdbbb2e
+size 2146626763

model-00007-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:888f9d7dce7576fcf3ca29756273eba96541f30338d45dfcee88449481a88266
+size 2038631358
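The `size` fields of the LFS pointers give the on-disk footprint of the upload without downloading anything. The sketch below totals the seven shards and `base_weights.safetensors`. The constants are copied from the pointers in this commit, and the helper `to_gib` is illustrative.

```python
# "size" fields from the Git LFS pointers in this commit, in bytes.
SHARD_SIZES = [
    2147572838,  # model-00001-of-00007.safetensors
    2146066344,  # model-00002-of-00007.safetensors
    2147438273,  # model-00003-of-00007.safetensors
    2147456636,  # model-00004-of-00007.safetensors
    2147436778,  # model-00005-of-00007.safetensors
    2146626763,  # model-00006-of-00007.safetensors
    2038631358,  # model-00007-of-00007.safetensors
]
BASE_WEIGHTS_SIZE = 1281372424  # base_weights.safetensors


def to_gib(n_bytes):
    """Convert a byte count to binary gigabytes (GiB)."""
    return n_bytes / 2**30


shard_total = sum(SHARD_SIZES)
weights_total = shard_total + BASE_WEIGHTS_SIZE

print(f"quantized shards: {shard_total:,} bytes ({to_gib(shard_total):.2f} GiB)")
print(f"shards + base weights: {weights_total:,} bytes ({to_gib(weights_total):.2f} GiB)")
```

The shard total is the number to compare against the README's "~14 GB" model size. The combined total is a lower bound on the memory needed to hold all weights at once.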

model.safetensors.index.json ADDED

The diff for this file is too large to render. See raw diff
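Although the index is not rendered here, the README describes each quantized tensor as four entries with the suffixes `__indices`, `__scales`, `__su` and `__sv`. Assuming the index follows the usual Hugging Face sharded-checkpoint layout with a `weight_map` from tensor key to shard file, the sketch below groups keys by base tensor name and flags tensors with missing components. The tensor names in the example are made up, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Component suffixes listed in the README's "Tensor Format" section.
COMPONENT_SUFFIXES = ("indices", "scales", "su", "sv")


def group_components(weight_map):
    """Group `{name}__{suffix}` keys by base tensor name."""
    groups = defaultdict(dict)
    for key, shard in weight_map.items():
        base, sep, suffix = key.rpartition("__")
        # Keys without a recognised suffix (e.g. unquantized tensors) are skipped.
        if sep and suffix in COMPONENT_SUFFIXES:
            groups[base][suffix] = shard
    return dict(groups)


def incomplete_tensors(groups):
    """Return base names that lack one or more of the four components."""
    return sorted(
        base for base, parts in groups.items()
        if set(parts) != set(COMPONENT_SUFFIXES)
    )


# Hypothetical excerpt of a "weight_map"; real key names may differ.
shard = "model-00001-of-00007.safetensors"
weight_map = {
    "layers.1.mlp.experts.0.up_proj__indices": shard,
    "layers.1.mlp.experts.0.up_proj__scales": shard,
    "layers.1.mlp.experts.0.up_proj__su": shard,
    "layers.1.mlp.experts.0.up_proj__sv": shard,
    "layers.1.mlp.experts.0.down_proj__indices": shard,
    "layers.1.mlp.experts.0.down_proj__scales": shard,
}

groups = group_components(weight_map)
print(incomplete_tensors(groups))
```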

quantization_index.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:19e773648cb4e65de8660ea6365e10acca112d42a854923df93db4a6f333a82d
+size 20217442

tokenizer_config.json ADDED

	@@ -0,0 +1,321 @@

+{
+  "added_tokens_decoder": {
+    "154820": {
+      "content": "<|endoftext|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154821": {
+      "content": "[MASK]",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154822": {
+      "content": "[gMASK]",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154823": {
+      "content": "[sMASK]",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154824": {
+      "content": "<sop>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154825": {
+      "content": "<eop>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154826": {
+      "content": "<|system|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154827": {
+      "content": "<|user|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154828": {
+      "content": "<|assistant|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154829": {
+      "content": "<|observation|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154830": {
+      "content": "<|begin_of_image|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154831": {
+      "content": "<|end_of_image|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154832": {
+      "content": "<|begin_of_video|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154833": {
+      "content": "<|end_of_video|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154834": {
+      "content": "<|begin_of_audio|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154835": {
+      "content": "<|end_of_audio|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154836": {
+      "content": "<|begin_of_transcription|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154837": {
+      "content": "<|end_of_transcription|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    "154838": {
+      "content": "<|code_prefix|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154839": {
+      "content": "<|code_middle|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154840": {
+      "content": "<|code_suffix|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154841": {
+      "content": "<think>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154842": {
+      "content": "</think>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154843": {
+      "content": "<tool_call>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154844": {
+      "content": "</tool_call>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154845": {
+      "content": "<tool_response>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154846": {
+      "content": "</tool_response>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154847": {
+      "content": "<arg_key>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154848": {
+      "content": "</arg_key>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154849": {
+      "content": "<arg_value>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154850": {
+      "content": "</arg_value>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154851": {
+      "content": "/nothink",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154852": {
+      "content": "<|begin_of_box|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154853": {
+      "content": "<|end_of_box|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154854": {
+      "content": "<|image|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    },
+    "154855": {
+      "content": "<|video|>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "[MASK]",
+    "[gMASK]",
+    "[sMASK]",
+    "<sop>",
+    "<eop>",
+    "<|system|>",
+    "<|user|>",
+    "<|assistant|>",
+    "<|observation|>",
+    "<|begin_of_image|>",
+    "<|end_of_image|>",
+    "<|begin_of_video|>",
+    "<|end_of_video|>",
+    "<|begin_of_audio|>",
+    "<|end_of_audio|>",
+    "<|begin_of_transcription|>",
+    "<|end_of_transcription|>"
+  ],
+  "clean_up_tokenization_spaces": false,
+  "do_lower_case": false,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": {},
+  "model_max_length": 128000,
+  "pad_token": "<|endoftext|>",
+  "padding_side": "left",
+  "remove_space": false,
+  "tokenizer_class": "PreTrainedTokenizer"
+}
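The token table above makes it possible to resolve the three `eos_token_id` values in `config.json` to their string forms. The sketch below does that lookup and shows one way a custom generation loop might cut its output at the first stop token. The helper `truncate_at_stop` is hypothetical and not part of Metal Marlin or transformers.

```python
# Excerpt of "added_tokens_decoder" from tokenizer_config.json (id -> content).
added_tokens = {
    154820: "<|endoftext|>",
    154826: "<|system|>",
    154827: "<|user|>",
    154828: "<|assistant|>",
    154829: "<|observation|>",
}

# "eos_token_id" list from config.json.
eos_token_ids = [154820, 154827, 154829]

# String form of every id that ends generation.
stop_tokens = [added_tokens[token_id] for token_id in eos_token_ids]


def truncate_at_stop(token_ids, stop_ids):
    """Return token_ids up to, but not including, the first stop id."""
    stop_set = set(stop_ids)
    for position, token_id in enumerate(token_ids):
        if token_id in stop_set:
            return token_ids[:position]
    return list(token_ids)


print(stop_tokens)
print(truncate_at_stop([11, 22, 154827, 33], eos_token_ids))
```

Treating `<|user|>` and `<|observation|>` as stop tokens means generation halts as soon as the model starts a new user or tool turn, not only at `<|endoftext|>`.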