Instructions to use CortexLM/test with libraries and local apps.

How to use CortexLM/test with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="CortexLM/test")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CortexLM/test")
model = AutoModelForCausalLM.from_pretrained("CortexLM/test")
```

How to use CortexLM/test with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "CortexLM/test"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CortexLM/test",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker

```bash
docker model run hf.co/CortexLM/test
```

How to use CortexLM/test with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CortexLM/test" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CortexLM/test",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CortexLM/test" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CortexLM/test",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

How to use CortexLM/test with Docker Model Runner:

```bash
docker model run hf.co/CortexLM/test
```

---
license: mit
base_model: zai-org/GLM-5.1
tags:
- nvidia
- nvfp4
- quantized
- moe
- modelopt
- glm
library_name: transformers
pipeline_tag: text-generation
---

# CortexLM/GLM-5.1-NVFP4-MTP

NVFP4-quantized version of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1), a 754B-parameter Mixture-of-Experts language model with 256 routed experts per layer.

Quantized using [NVIDIA Model Optimizer (modelopt)](https://github.com/NVIDIA/Model-Optimizer) with full activation calibration on all 58,459 linear modules, including every individual routed expert.

## Model Details

| Property | Value |
|---|---|
| **Base model** | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
| **Architecture** | GlmMoeDsaForCausalLM (754B MoE) |
| **Layers** | 78 transformer layers + 1 MTP layer |
| **Experts** | 256 routed + 1 shared per MoE layer (layers 3-77) |
| **Hidden size** | 6144 |
| **Context length** | 202,752 tokens |
| **Quantization** | NVFP4 (4-bit float weights, FP8 block scales, group size 16) |
| **KV cache** | FP8 quantized |
| **MTP layer** | BF16 (stored separately in `mtp.safetensors`) |
| **Total size** | ~441 GB (vs 1.4 TB BF16 original) |
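
As a rough sanity check on the size figures, the per-layer storage cost of NVFP4 can be estimated from the format description alone: two 4-bit values per byte plus one 1-byte FP8 scale per group of 16 weights. The sketch below is only an estimate under those assumptions; the helper names and the square 6144 x 6144 matrix are illustrative and are not taken from the checkpoint.

```python
def nvfp4_bytes(num_weights: int, block_size: int = 16) -> int:
    """Bytes for packed FP4 weights plus one FP8 scale per block.

    The two scalar FP32 scales per layer are ignored as negligible.
    """
    packed = num_weights // 2            # two 4-bit values per byte
    scales = num_weights // block_size   # one 1-byte FP8 scale per block
    return packed + scales


def bf16_bytes(num_weights: int) -> int:
    """Bytes for the same weights stored in BF16 (2 bytes each)."""
    return num_weights * 2


# Hypothetical square projection at the model's hidden size (assumption).
n = 6144 * 6144
bits_per_weight = 8 * nvfp4_bytes(n) / n
ratio = bf16_bytes(n) / nvfp4_bytes(n)

print(bits_per_weight)   # effective bits per weight, including block scales
print(round(ratio, 2))   # BF16 size divided by NVFP4 size for this layer
```

This gives 4.5 effective bits per weight and a per-layer compression ratio of about 3.56. The whole-checkpoint ratio implied by the table (~441 GB vs 1.4 TB) is somewhat lower, presumably because some tensors, such as `lm_head` and the MTP layer, stay in BF16.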
## Quantization Details

This model was quantized using NVIDIA's official [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) (`modelopt`) NVFP4 pipeline with per-expert calibration:

- **Quantization format**: NVFP4 -- 4-bit floating-point weights with one FP8 (`float8_e4m3fn`) scaling factor per block of 16 elements and a global FP32 `weight_scale_2`
- **Calibration**: 256 samples from [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) and [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (chat, code, math, stem splits), sequence length 2048
- **Quantized modules**: 58,459 `nn.Linear` modules, including all 256 routed experts per layer, each quantized individually with a calibrated `input_scale` (from activation statistics)
- **KV cache**: FP8 cast quantization on all attention layers
- **Excluded**: `lm_head` (kept in BF16)
- **MTP**: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate `mtp.safetensors` file (19.9 GB)
- **Hardware**: 8x NVIDIA B300 SXM6 GPUs (275 GB each)
- **Calibration time**: ~21 minutes
- **modelopt version**: 0.42.0.dev (from source, April 2026)
- **transformers version**: 5.5.0
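
To illustrate the two-level scaling in the quantization format above, here is a minimal pure-Python sketch of how one block of 16 weights could be mapped to FP4 codes. It assumes the common NVFP4 convention that the global scale is the tensor's absolute maximum divided by 6 x 448 (the largest FP4 and FP8 E4M3 magnitudes) and that each block scale maps the block's absolute maximum to 6. The function names are hypothetical, the block scale is left as a Python float rather than rounded to FP8, and ties are not rounded to even, so this is not modelopt's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0          # largest FP4 (E2M1) magnitude
FP8_E4M3_MAX = 448.0   # largest finite float8_e4m3fn magnitude


def global_scale(tensor_amax: float) -> float:
    """Per-tensor scale (plays the role of weight_scale_2 under the assumption above)."""
    return tensor_amax / (FP4_MAX * FP8_E4M3_MAX)


def quantize_block(block, g):
    """Return (4-bit codes, block scale) for one block of weights.

    Simplification: the block scale is NOT rounded to FP8 here.
    """
    block_amax = max(abs(x) for x in block)
    block_scale = block_amax / FP4_MAX / g if g else 0.0
    step = block_scale * g  # real value represented by an FP4 magnitude of 1.0
    codes = []
    for x in block:
        target = abs(x) / step if step else 0.0
        # Pick the nearest representable magnitude (first match wins on ties).
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - target))
        codes.append(idx | (0x8 if x < 0 else 0))  # bit 3 carries the sign
    return codes, block_scale


g = global_scale(tensor_amax=6.0)
codes, scale = quantize_block([6.0, -3.0, 1.5, 0.0] + [0.0] * 12, g)
print(codes[:4], scale)
```

In a real pipeline the activation `input_scale` comes from calibration data rather than from the weights, which this sketch does not cover.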
### Weight format

Each quantized linear layer is stored as:

- `weight`: `uint8` (two FP4 values packed per byte)
- `weight_scale`: `float8_e4m3fn` (per-block FP8 scale, one per 16 elements)
- `weight_scale_2`: `float32` scalar (global per-tensor scale)
- `input_scale`: `float32` scalar (calibrated activation scale, where applicable)
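
The stored tensors above can be turned back into approximate real-valued weights by unpacking the FP4 codes and applying both scales. The following is a minimal sketch, not the loader used by any inference engine. It assumes the first element of each pair sits in the low nibble of the byte, and it takes the block scales as already-decoded Python floats instead of raw `float8_e4m3fn` bytes. `decode_fp4` and `dequantize_row` are hypothetical names.

```python
# Magnitudes representable by a 4-bit E2M1 float; bit 3 of the code is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 code into a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]


def dequantize_row(packed, block_scales, weight_scale_2, block_size=16):
    """Dequantize one row of packed NVFP4 weights.

    packed:         bytes, two FP4 codes per byte (low nibble first: assumption)
    block_scales:   per-block scales, already converted from FP8 to float
    weight_scale_2: global per-tensor FP32 scale
    """
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F))
        values.append(decode_fp4(byte >> 4))
    return [
        v * block_scales[i // block_size] * weight_scale_2
        for i, v in enumerate(values)
    ]


# One block of 16 weights: every byte holds the codes 1 (0.5) and 2 (1.0).
row = dequantize_row(bytes([0x21] * 8), block_scales=[2.0], weight_scale_2=0.5)
print(row[:4])
```

`input_scale` is not used here because it applies to activations at inference time, not to weight reconstruction.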
## Usage

This checkpoint is designed for use with inference engines that support the NVFP4 format, such as [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [vLLM](https://github.com/vllm-project/vllm) with the modelopt backend.

## Files

- 85 model shards (`model-00001-of-00085.safetensors` to `model-00085-of-00085.safetensors`) -- NVFP4 quantized layers 0-77
- `mtp.safetensors` -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
- `model.safetensors.index.json` -- shard index mapping
- `config.json` -- model configuration with `quantization_config`
- `hf_quant_config.json` -- NVFP4 quantization metadata
- `tokenizer.json`, `tokenizer_config.json` -- tokenizer files
- `generation_config.json` -- generation defaults
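
To check what a downloaded file contains without loading any tensor data, the safetensors header can be read directly: the file starts with an 8-byte little-endian length, followed by a JSON header that lists each tensor's dtype, shape, and offsets. The sketch below uses only the standard library; the function names are illustrative, and the local file path is an assumption about where the checkpoint was downloaded.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Return the tensor entries from a .safetensors header (no tensor data is read)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 length prefix
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(path: str) -> Counter:
    """Count tensors per dtype string, for example BF16, U8, F8_E4M3, or F32."""
    return Counter(entry["dtype"] for entry in read_safetensors_header(path).values())
```

For example, `summarize("mtp.safetensors")` would be expected to report only `BF16` entries, 791 in total, if the file matches the description above. A quantized shard should instead show a mix of `U8`, `F8_E4M3`, and `F32` tensors.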
## Acknowledgements

- Base model by [ZhipuAI](https://huggingface.co/zai-org)
- Quantization tooling by [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)