	---
	license: mit
	base_model: zai-org/GLM-5.1
	tags:
	- nvidia
	- nvfp4
	- quantized
	- moe
	- modelopt
	- glm
	library_name: transformers
	pipeline_tag: text-generation
	---

	# CortexLM/GLM-5.1-NVFP4-MTP

	NVFP4-quantized version of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1), a 754B-parameter Mixture-of-Experts language model with 256 routed experts per MoE layer.

	Quantized with [NVIDIA Model Optimizer (modelopt)](https://github.com/NVIDIA/Model-Optimizer), using activation calibration on all 58,459 linear modules, including each routed expert individually.

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
	| Architecture | GlmMoeDsaForCausalLM (754B MoE) |
	| Layers | 78 transformer layers + 1 MTP layer |
	| Experts | 256 routed + 1 shared per MoE layer (layers 3-77) |
	| Hidden size | 6144 |
	| Context length | 202,752 tokens |
	| Quantization | NVFP4 (4-bit float weights, FP8 block scales, group size 16) |
	| KV cache | FP8 quantized |
	| MTP layer | BF16 (stored separately in `mtp.safetensors`) |
	| Total size | ~441 GB (vs 1.4 TB BF16 original) |
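
The size figures in the table can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes every one of the 754B parameters is stored as a 4-bit value plus one 8-bit block scale per 16 weights. That is a simplification, since `lm_head` and the MTP layer stay in BF16 and per-tensor scales are ignored, so it is not expected to reproduce the ~441 GB figure exactly. The helper names are illustrative only.

```python
def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """4-bit value plus one FP8 block scale amortized over the block."""
    return 4 + scale_bits / block_size


def estimate_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


bits = nvfp4_bits_per_weight()        # 4.5 bits per weight
nvfp4_gb = estimate_gb(754e9, bits)   # 424.125
bf16_gb = estimate_gb(754e9, 16)      # 1508.0
print(f"{bits} bits/weight, NVFP4 ~{nvfp4_gb:.0f} GB, BF16 ~{bf16_gb:.0f} GB")
```

The estimate lands near 424 GB. The gap to the reported total is plausibly explained by the tensors kept in BF16, though this sketch does not account for them.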

	## Quantization Details

	This model was quantized with the NVFP4 pipeline of NVIDIA's official [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) (`modelopt`), calibrating every routed expert individually:

	- Quantization format: NVFP4 -- 4-bit floating-point weights in blocks of 16, each block with an FP8 (`float8_e4m3fn`) scaling factor, plus a global FP32 per-tensor scale (`weight_scale_2`)
	- Calibration: 256 samples from [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) and [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (chat, code, math, stem splits), sequence length 2048
	- Quantized modules: 58,459 `nn.Linear` modules. Each of the 256 routed experts in every MoE layer is quantized individually, with its own calibrated `input_scale` (activation statistics)
	- KV cache: FP8 cast quantization on all attention layers
	- Excluded: `lm_head` (kept in BF16)
	- MTP: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate `mtp.safetensors` file (19.9 GB)
	- Hardware: 8x NVIDIA B300 SXM6 275GB GPUs
	- Calibration time: ~21 minutes
	- modelopt version: 0.42.0.dev (from source, April 2026)
	- transformers version: 5.5.0
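
To illustrate the two-level scaling described in the list above, here is a minimal pure-Python sketch that quantizes one block of 16 weights to FP4 codes. It is not modelopt's implementation. The global-scale formula (`amax / (6 * 448)`), the nearest-value rounding, and the helper names are assumptions made for the example, and the block scale is kept as a Python float rather than rounded to `float8_e4m3fn`.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
FP4_MAX = 6.0          # largest FP4 magnitude
FP8_E4M3_MAX = 448.0   # largest float8_e4m3fn magnitude


def global_scale_for(tensor_amax: float) -> float:
    """Per-tensor scale (the role of weight_scale_2); formula is an assumption."""
    return tensor_amax / (FP4_MAX * FP8_E4M3_MAX)


def quantize_block(block, global_scale):
    """Return (codes, block_scale) for one block of weights.

    Each code is a 4-bit int: bit 3 is the sign, bits 0-2 index E2M1_VALUES.
    """
    amax = max(abs(x) for x in block)
    # Choose the block scale so the largest magnitude maps to FP4_MAX.
    block_scale = amax / (FP4_MAX * global_scale) if amax > 0 else 1.0
    step = block_scale * global_scale
    codes = []
    for x in block:
        magnitude = abs(x) / step
        # Round to the nearest representable FP4 magnitude.
        index = min(range(len(E2M1_VALUES)),
                    key=lambda i: abs(E2M1_VALUES[i] - magnitude))
        codes.append((8 if x < 0 else 0) | index)
    return codes, block_scale


block = [0.0, 0.5, -1.0, 1.5, 2.0, -3.0, 4.0, 6.0] + [0.0] * 8
codes, block_scale = quantize_block(block, global_scale_for(6.0))
print(codes[:8])  # [0, 1, 10, 3, 4, 13, 6, 7]
```

The reconstructed weight is `fp4_value * block_scale * global_scale`, so the FP8 block scale only needs to capture each block's range relative to the tensor as a whole.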

	### Weight format

	Each quantized linear layer is stored as:
	- `weight`: `uint8` (two FP4 values packed per byte)
	- `weight_scale`: `float8_e4m3fn` (per-block FP8 scale, one per 16 elements)
	- `weight_scale_2`: `float32` scalar (global per-tensor scale)
	- `input_scale`: `float32` scalar (calibrated activation scale, where applicable)
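
The layout above can be read back with a short dequantization sketch. The nibble order (low nibble holds the first element) is an assumption and should be checked against the inference engine in use. The block scales are passed in as ordinary Python floats, as if already decoded from `float8_e4m3fn`, and the function names are illustrative.

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes


def decode_fp4(code: int) -> float:
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 select the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def dequantize_row(packed: bytes, block_scales, weight_scale_2: float,
                   block_size: int = 16):
    """Unpack two FP4 values per byte and apply both scales.

    ASSUMPTION: the low nibble of each byte is the earlier element.
    """
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F))
        values.append(decode_fp4(byte >> 4))
    return [v * block_scales[i // block_size] * weight_scale_2
            for i, v in enumerate(values)]


# 8 bytes hold one block of 16 weights.
packed = bytes([0x21, 0xF3] + [0x00] * 6)
row = dequantize_row(packed, block_scales=[2.0], weight_scale_2=0.5)
print(row[:4])  # [0.5, 1.0, 1.5, -6.0]
```

`input_scale` is not used here. It applies to activations at inference time, not to weight reconstruction.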

	## Usage

	This checkpoint is intended for inference engines that support the NVFP4 format, such as [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [vLLM](https://github.com/vllm-project/vllm) with its modelopt quantization backend.
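
Before pointing an engine at a downloaded copy, it can help to confirm that the quantization metadata is what the engine expects. The sketch below reads `hf_quant_config.json` using only the standard library. The field names (`quantization`, `quant_algo`, `kv_cache_quant_algo`, `group_size`, `exclude_modules`) are assumed from typical modelopt exports and may differ in this checkpoint, and the function names are hypothetical.

```python
import json
from pathlib import Path


def read_quant_summary(checkpoint_dir):
    """Summarize hf_quant_config.json; field names are ASSUMED, not verified."""
    path = Path(checkpoint_dir) / "hf_quant_config.json"
    with path.open() as f:
        cfg = json.load(f)
    quant = cfg.get("quantization", {})
    return {
        "quant_algo": quant.get("quant_algo"),
        "kv_cache_quant_algo": quant.get("kv_cache_quant_algo"),
        "group_size": quant.get("group_size"),
        "exclude_modules": quant.get("exclude_modules", []),
    }


def looks_like_nvfp4(summary) -> bool:
    """True when the metadata declares NVFP4 weights."""
    return summary["quant_algo"] == "NVFP4"
```

Calling `looks_like_nvfp4(read_quant_summary("path/to/checkpoint"))` returns `False` rather than raising when the expected fields are missing, so a mismatch in field names shows up as a failed check.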

	## Files

	- 85 model shards (`model-00001-of-00085.safetensors` to `model-00085-of-00085.safetensors`) -- NVFP4 quantized layers 0-77
	- `mtp.safetensors` -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
	- `model.safetensors.index.json` -- shard index mapping
	- `config.json` -- model configuration with `quantization_config`
	- `hf_quant_config.json` -- NVFP4 quantization metadata
	- `tokenizer.json`, `tokenizer_config.json` -- tokenizer files
	- `generation_config.json` -- generation defaults
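
With 85 shards, the index file is the quickest way to find where a given tensor lives. The sketch below assumes the usual sharded-safetensors index layout, a JSON object with a `weight_map` from tensor name to shard filename. The helper names are illustrative, and any tensor name passed in has to come from the index itself.

```python
import json
from collections import Counter
from pathlib import Path


def load_weight_map(checkpoint_dir):
    """Return the tensor-name -> shard-filename mapping from the index file."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    return index["weight_map"]


def shard_for(weight_map, tensor_name):
    """Return the shard holding tensor_name, or None if it is not indexed."""
    return weight_map.get(tensor_name)


def tensors_per_shard(weight_map):
    """Count how many tensors each shard file contains."""
    return Counter(weight_map.values())
```

Since `mtp.safetensors` is described as a separate file, layer 78 tensors may not appear in this index. `shard_for` returns `None` in that case instead of raising.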

	## Acknowledgements

	- Base model by [ZhipuAI](https://huggingface.co/zai-org)
	- Quantization tooling by [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)