---
base_model:
- inclusionAI/LLaDA2.0-mini
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
<div align="center">
<h1>DMax: Aggressive Parallel Decoding for dLLMs</h1>
<div align="center">
<a href="https://github.com/czg1225/DMax/blob/main/LICENSE">
<img alt="Apache" src="https://img.shields.io/badge/License-Apache-4E94CE.svg">
</a>
<a href="https://arxiv.org/abs/2604.08302">
<img src="https://img.shields.io/badge/Paper-Arxiv-darkred.svg" alt="Paper">
</a>
<a href="https://github.com/czg1225/DMax">
<img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub">
</a>
</div>
</div>

DMax is a new paradigm for efficient diffusion language models (dLLMs) that enables aggressive decoding parallelism while preserving generation quality. This repository hosts **DMax-16B**, a highly parallel general-purpose dLLM for code generation, mathematical reasoning, and everyday conversation.

## 💪 Highlights

- **Aggressive Decoding Parallelism**: Reaches 6.0 TPF (tokens per forward pass) on math and reasoning tasks and 6.6 TPF on code tasks while preserving accuracy.
- **Self-Revising dLLM**: Extends a pretrained masked diffusion language model (MDLM) into a uniform diffusion language model (UDLM) with an intrinsic ability to revise its own erroneous predictions during decoding.
- **Soft Parallel Decoding**: Interpolates between mask and token embeddings to propagate confidence priors from previous decoding steps.

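The interpolation idea behind soft parallel decoding can be illustrated with a toy example. This is only a sketch under assumptions, not the DMax implementation: the helper name `soft_embedding`, the use of a single scalar confidence as the interpolation weight, and the 3-dimensional vectors are all hypothetical, and the exact formulation in the paper may differ.

```python
from typing import List


def soft_embedding(
    token_emb: List[float], mask_emb: List[float], confidence: float
) -> List[float]:
    """Linearly interpolate between the mask embedding and a token embedding.

    Assumption (not taken from the DMax code): the interpolation weight is the
    model's confidence in the predicted token from the previous step.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    return [
        confidence * t + (1.0 - confidence) * m
        for t, m in zip(token_emb, mask_emb)
    ]


# Toy 3-dimensional embeddings with hypothetical values.
mask_emb = [0.0, 0.0, 1.0]
token_emb = [1.0, 2.0, 0.0]

print(soft_embedding(token_emb, mask_emb, 0.0))   # identical to the mask embedding
print(soft_embedding(token_emb, mask_emb, 1.0))   # identical to the token embedding
print(soft_embedding(token_emb, mask_emb, 0.25))  # mostly mask, partly token
```

With a confidence of 0 the position still looks fully masked to the next step, and with a confidence of 1 it looks like a committed token. Values in between let a tentative prediction influence the next step without being treated as final.
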
<div align="center">
<img src="assets/tradeoff.png" width="100%" />
<br>
<em>Superior Parallelism-Accuracy Trade-off, Increased TPF with Maintained Accuracy.</em>
</div>

## 💻 Model and Datasets

| Model | Description | Source Model | Link |
| --- | --- | --- | --- |
| 🤗 DMax-16B | Highly parallel general-purpose dLLM. | LLaDA2.0-mini | [HF](https://huggingface.co/Zigeng/DMax-16B) |
| 🤗 DMax-Math-16B | Highly parallel dLLM for math and reasoning. | LLaDA2.0-mini | [HF](https://huggingface.co/Zigeng/DMax-Math-16B) |
| 🤗 DMax-Coder-16B | Highly parallel dLLM for code generation. | LLaDA2.0-mini | [HF](https://huggingface.co/Zigeng/DMax-Coder-16B) |

| Dataset | Description | Link |
| --- | --- | --- |
| DMax-Math-Training-Data | Math trajectories generated by LLaDA2.0-mini. | [HF](https://huggingface.co/datasets/Zigeng/DMax-LLaDA-2.0-Mini-Math-Trajectories) |
| DMax-Code-Training-Data | Code trajectories generated by LLaDA2.0-mini. | [HF](https://huggingface.co/datasets/Zigeng/DMax-LLaDA-2.0-Mini-Code-Trajectories) |

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model with its custom modeling code, then cast it to bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    "Zigeng/DMax-16B", trust_remote_code=True, device_map="cuda:0"
)
model = model.to(torch.bfloat16)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("Zigeng/DMax-16B", trust_remote_code=True)

prompt = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?"
    + "\nLet's think step by step\n"
)

# Build the chat-formatted prompt and place it on the same device as the model.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

# generate_spd is provided by the model's remote code; it returns `nfe`
# and the generated token ids.
nfe, generated_tokens = model.generate_spd(
    inputs=input_ids,
    gen_length=2048,
    block_length=32,
    threshold=0.5,
)

generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)

print(generated_answer)
print("nfe:", nfe, "token length:", len(generated_tokens[0]))
```
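The two numbers printed at the end of the Quick Start can be combined into a rough TPF estimate. The sketch below assumes that TPF means tokens per forward pass, that `nfe` counts the forward passes used during decoding, and that `len(generated_tokens[0])` counts only generated tokens; none of this is confirmed by the model code shown here. The helper `tokens_per_forward` and the example numbers are hypothetical and are not measured results.

```python
def tokens_per_forward(num_generated_tokens: int, nfe: int) -> float:
    """Average number of tokens produced per forward pass (TPF)."""
    if nfe <= 0:
        raise ValueError("nfe must be a positive integer")
    return num_generated_tokens / nfe


# Hypothetical values for illustration only: 120 tokens decoded in 40 passes.
example_tpf = tokens_per_forward(num_generated_tokens=120, nfe=40)
print(f"TPF: {example_tpf:.2f}")
```

In the Quick Start script this would be called as `tokens_per_forward(len(generated_tokens[0]), nfe)`. A value of 1.0 would correspond to decoding one token per forward pass, so higher values indicate more parallel decoding.
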
## Experimental Results


## Citation

```bibtex
@article{chen2026dmax,
  title={DMax: Aggressive Parallel Decoding for dLLMs},
  author={Chen, Zigeng and Fang, Gongfan and Ma, Xinyin and Yu, Ruonan and Wang, Xinchao},
  journal={arXiv preprint arXiv:2604.08302},
  year={2026}
}
```