
YongganFu, abhgarg, trias702, and pmolchanov committed 1 day ago

Commit

cf02602

0 Parent(s):

Initial release of Nemotron-Labs-Diffusion-8B-Base

Co-authored-by: abhgarg <abhgarg@users.noreply.huggingface.co>
Co-authored-by: trias702 <trias702@users.noreply.huggingface.co>
Co-authored-by: trias702 <trias702@users.noreply.huggingface.co>
Co-authored-by: pmolchanov <pmolchanov@users.noreply.huggingface.co>

Files changed (21)

.gitattributes +41 -0
README.md +160 -0
assets/demo.gif +3 -0
assets/demo.mp4 +3 -0
assets/result_acc.png +3 -0
assets/result_efficiency.png +3 -0
assets/teaser.png +3 -0
chat_template.jinja +7 -0
config.json +49 -0
configuration_nemotron_labs_diffusion.py +186 -0
generation_config.json +7 -0
model.safetensors +3 -0
model_cards/bias.md +4 -0
model_cards/explainability.md +13 -0
model_cards/privacy.md +11 -0
model_cards/safety.md +6 -0
modeling_ministral.py +459 -0
modeling_nemotron_labs_diffusion.py +870 -0
special_tokens_map.json +23 -0
tokenizer.json +3 -0
tokenizer_config.json +0 -0

.gitattributes ADDED

	@@ -0,0 +1,41 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/demo.gif filter=lfs diff=lfs merge=lfs -text
+assets/demo.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/result_acc.png filter=lfs diff=lfs merge=lfs -text
+assets/result_efficiency.png filter=lfs diff=lfs merge=lfs -text
+assets/teaser.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,160 @@

+---
+library_name: transformers
+license: other
+license_name: nvidia-nemotron-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
+pipeline_tag: text-generation
+tags:
+- nvidia
+- pytorch
+---
+# Nemotron-Labs-Diffusion-8B-Base
+<div align="center" style="line-height: 1;">
+<a href="https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL" target="_blank" style="margin: 2px;">
+    <img alt="Paper" src="https://img.shields.io/badge/📝Paper-Read Now!-536af5?color=76B900&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+</a>
+<a href="https://huggingface.co/collections/nvidia/nemotron-labs-diffusion" target="_blank" style="margin: 2px;">
+    <img alt="Nemotron-Labs-Diffusion Model Family" src="https://img.shields.io/badge/%F0%9F%A4%97-Nemotron--Labs--Diffusion_Model_Family-76B900" style="display: inline-block; vertical-align: middle;"/>
+</a>
+<a href="https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/" style="margin: 2px;">
+  <img alt="License" src="https://img.shields.io/badge/License-NVIDIA Open Model License-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
+</a>
+</div>
+[![Demo](./assets/demo.gif)](./assets/demo.mp4)
+## Model Overview
+Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with shared KV cache, achieving high acceptance lengths and decoding efficiency. The seamless mode switching by simply changing attention patterns enables high efficiency at different concurrency levels in varying deployment scenarios with one single model.
+<div align="center">
+<img src="./assets/teaser.png" alt="An illustration of Tri-Mode LMs" width="500">
+</div>
+## Highlights
+- SOTA 3B, 8B, and 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation decoding, with a focus on decoding efficiency.
+- Generation moves from a memory-bound regime toward a compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation.
+- Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to multi-token prediction (MTP) approaches:
+  * 3× higher acceptance length and 2.2× speed-up vs. Qwen3-8B-Eagle3 in SGLang.
+  * 5.9× as many tokens per forward pass as Qwen3-8B (no MTP) at the same accuracy.
+- Real-device speed-up across platforms:
+  * DGX Spark (8B, concurrency 1): 2.7× faster, at 112 tok/sec vs. 41.8 tok/sec for AR, using w4a16.
+  * GB200 (8B, concurrency 1): 3.3× faster, at 850 tok/sec vs. 253 tok/sec for AR and 360 tok/sec for Eagle3. Custom CUDA kernels raise this to 1015 tok/sec (4×).
+- Diffusion speedup-of-light analysis shows that single-user throughput could be doubled again (vs. the current best) with better sampling, which is left to future research.
+<div align="center">
+<img src="./assets/result_acc.png" alt="Accuracy Results" width="800">
+</div>
+<div align="center">
+<img src="./assets/result_efficiency.png" alt="Efficiency Results" width="800">
+</div>
+## License/Terms of Use
+Use of this model is governed by the [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/).
+## Environment
+```bash
+pip install "transformers>=5.0.0"
+```
+## Chat with Our Model
+```python
+from transformers import AutoModel, AutoTokenizer
+import torch
+repo_name = "nvidia/Nemotron-Labs-Diffusion-8B-Base"
+tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
+model = model.cuda().to(torch.bfloat16)
+history = []
+user_input = input("User: ").strip()
+history.append({"role": "user", "content": user_input})
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
+prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
+# The three calls below are alternatives: each one overwrites out_ids and nfe, so keep only the mode you want.
+## Chat in AR Mode
+out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)
+## Chat in dLM Mode
+out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)
+## Chat in Linear Self-Speculation Mode
+out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)
+# Decode only the newly generated tokens (everything after the prompt).
+tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
+print(f"Model: {tokenized_out}")
+print(f"[Num Function Eval (NFE)={nfe}]")
+```
+## Inference with Linear Self-Speculation + LoRA-enhanced Drafter
+An optional LoRA adapter can be applied to the diffusion drafter in linear self-speculation mode to further increase the acceptance length:
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+from peft import PeftModel
+repo = "nvidia/Nemotron-Labs-Diffusion-8B-Base"
+tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo, trust_remote_code=True)
+model = model.cuda().to(torch.bfloat16)
+# Attach the linear_spec LoRA adapter.
+model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval()
+# Unwrap so we can call linear_spec_generate directly (it toggles LoRA internally).
+base = model.model
+history = [{"role": "user", "content": "Solve: What is 15% of 240?"}]
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
+prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
+out_ids, nfe = base.linear_spec_generate(
+    prompt_ids, max_new_tokens=512, block_length=32,
+    eos_token_id=tokenizer.eos_token_id,
+)
+print(tokenizer.decode(out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True))
+print(f"[NFE={nfe}]")
+```
+## Ethical Considerations
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the [bias](./model_cards/bias.md), [explainability](./model_cards/explainability.md), [safety & security](./model_cards/safety.md), and [privacy](./model_cards/privacy.md) subcards.
+Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
+## Citations
+```bibtex
+@techreport{fu2026nemotronlabsdiffusion,
+  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
+  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
+  institution = {NVIDIA},
+  year        = {2026},
+  note        = {Technical report}
+}
+```
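Reviewer note: the self-speculation mode described in the README's Model Overview (diffusion drafts a block, AR verifies it) follows the general draft-and-verify pattern. The sketch below shows one greedy verification step in plain Python. The function name `accept_draft` and the token IDs are hypothetical, and this is not the repository's `linear_spec_generate` implementation, only an illustration of how an acceptance length arises.

```python
def accept_draft(draft_tokens, verifier_tokens):
    """Greedy draft-and-verify step.

    draft_tokens[i]    -- token proposed by the drafter at position i
    verifier_tokens[i] -- token the verifier would emit at position i,
                          given the prefix plus draft_tokens[:i]
    Returns the tokens committed in this step.
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verifier_tokens):
        if draft != verified:
            # First disagreement: keep the verifier's token and stop.
            accepted.append(verified)
            break
        accepted.append(draft)
    return accepted


# Hypothetical token IDs: the drafter is right for 3 positions, wrong at the 4th.
committed = accept_draft([11, 12, 13, 99, 15], [11, 12, 13, 14, 15])
print(committed)  # [11, 12, 13, 14]
```

In this toy case a single verification pass commits four tokens, which is why a higher acceptance length translates into fewer forward passes per generated token.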

assets/demo.gif ADDED

Git LFS Details

SHA256: 0d09264e272ac0f82dee36417f6a16511287ec1f8dee3b5dba3da222d791fd2c
Pointer size: 132 Bytes
Size of remote file: 8.25 MB

assets/demo.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:666d8785ac4af75931d9c677757c4ef9945bf114d07f1c4e2ebb7b893ac39006
+size 9454873
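Reviewer note: the three `+` lines above are a Git LFS pointer, not the video itself. Each line is a key followed by a space and a value. A minimal sketch of reading such a pointer, where the helper name `parse_lfs_pointer` is hypothetical and only the three keys visible in this diff are assumed:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:666d8785ac4af75931d9c677757c4ef9945bf114d07f1c4e2ebb7b893ac39006
size 9454873
"""
info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"])  # sha256 9454873
```

The `size` field is the byte length of the real file, so `assets/demo.mp4` is roughly 9.45 MB once fetched.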

assets/result_acc.png ADDED

Git LFS Details

SHA256: 992aa22ca9eca3d0bddbcd9f49837e2a9f377bbc0f7545563b129a50b3811448
Pointer size: 131 Bytes
Size of remote file: 405 kB

assets/result_efficiency.png ADDED

Git LFS Details

SHA256: 4f6161912e2aa703e0ef1bdccbb85039529b97e759d6247c33afa2a209806ede
Pointer size: 131 Bytes
Size of remote file: 801 kB

assets/teaser.png ADDED

Git LFS Details

SHA256: 6c94aa7b0c6cf8fb739724d0c1ce45749c76443c592eeab94d7cbb9083c6c6b1
Pointer size: 131 Bytes
Size of remote file: 581 kB

chat_template.jinja ADDED

	@@ -0,0 +1,7 @@

+{{'<SPECIAL_10>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'
+' + message['content'].strip()}}{% endif %}{% endfor %}{{'
+'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '
+<SPECIAL_11>User
+' + message['content'].strip() + '
+<SPECIAL_11>Assistant
+' }}{% elif message['role'] == 'assistant' %}{{ message['content'].strip() }}{% endif %}{% endfor %}
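Reviewer note: the template is dense, so here is a plain-Python reading of what it appears to produce. The function name `render_chat` is hypothetical and this is a hand translation of the Jinja above, not code from the repository. Note that the template never references `add_generation_prompt`, so the Assistant header follows every user turn regardless of that flag.

```python
def render_chat(messages):
    """Plain-Python equivalent of chat_template.jinja, as read from the template text."""
    out = "<SPECIAL_10>System"
    # All system messages are emitted first, each on its own line.
    for m in messages:
        if m["role"] == "system":
            out += "\n" + m["content"].strip()
    out += "\n"
    # Every user turn opens an Assistant turn; assistant content is appended as-is.
    for m in messages:
        if m["role"] == "user":
            out += "\n<SPECIAL_11>User\n" + m["content"].strip() + "\n<SPECIAL_11>Assistant\n"
        elif m["role"] == "assistant":
            out += m["content"].strip()
    return out


prompt = render_chat([{"role": "user", "content": "Who are you?"}])
print(repr(prompt))
# '<SPECIAL_10>System\n\n<SPECIAL_11>User\nWho are you?\n<SPECIAL_11>Assistant\n'
```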

config.json ADDED

	@@ -0,0 +1,49 @@

+{
+  "ar_loss_weight": 1.0,
+  "architectures": [
+    "NemotronLabsDiffusionModel"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "attn_implementation": "sdpa",
+  "auto_map": {
+    "AutoConfig": "configuration_nemotron_labs_diffusion.NemotronLabsDiffusionConfig",
+    "AutoModel": "modeling_nemotron_labs_diffusion.NemotronLabsDiffusionModel"
+  },
+  "block_size": 32,
+  "bos_token_id": 1,
+  "dlm_loss_weight": null,
+  "dlm_paradigm": "bidirectional",
+  "dp_varying_mask_ratio": false,
+  "eos_token_id": 2,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "mask_token_id": 100,
+  "max_position_embeddings": 4096,
+  "mlp_bias": false,
+  "model_type": "nemotron_labs_diffusion",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 34,
+  "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "beta_fast": 32.0,
+    "beta_slow": 1.0,
+    "factor": 0.25,
+    "llama_4_scaling_beta": 0.1,
+    "mscale": 1.0,
+    "mscale_all_dim": 1.0,
+    "original_max_position_embeddings": 16384,
+    "rope_theta": 1000000.0,
+    "rope_type": "yarn"
+  },
+  "sliding_window": null,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "5.0.0",
+  "use_cache": false,
+  "vocab_size": 131072
+}
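Reviewer note: a few quantities follow directly from the attention fields in this config. The sketch below derives the grouped-query group size and a per-token KV cache footprint. It assumes a conventional cache that stores one key and one value vector per KV head per layer in bfloat16; the shipped config sets `use_cache` to `false` and the custom generation methods may manage their cache differently.

```python
# Values copied from config.json above.
num_hidden_layers = 34
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128
bytes_per_value = 2  # bfloat16 (assumed cache dtype)

# Grouped-query attention: how many query heads share one key/value head.
gqa_group_size = num_attention_heads // num_key_value_heads

# One K and one V vector per KV head, per layer, per token.
kv_bytes_per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_value

print(gqa_group_size)             # 4
print(kv_bytes_per_token)         # 139264
print(kv_bytes_per_token / 1024)  # 136.0
```

Under these assumptions a full 4096-token sequence (the shipped `max_position_embeddings`) would need about 544 MiB of KV cache.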

configuration_nemotron_labs_diffusion.py ADDED

	@@ -0,0 +1,186 @@

+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Nemotron-Labs Diffusion model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class NemotronLabsDiffusionConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`NemotronLabsDiffusionModel`] for diffusion language models.
+    It is used to instantiate a NemotronLabsDiffusionModel according to the specified arguments, defining the model architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 131072):
+            Vocabulary size of the Ministral model.
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 14336):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 34):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer.
+        num_key_value_heads (`int`, *optional*, defaults to 8):
+            Number of key_value heads for Grouped Query Attention.
+        head_dim (`int`, *optional*, defaults to 128):
+            The attention head dimension.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function.
+        max_position_embeddings (`int`, *optional*, defaults to 262144):
+            The maximum sequence length.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied.
+        rope_theta (`float`, *optional*, defaults to 1000000.0):
+            The base period of the RoPE embeddings.
+        rope_parameters (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. Defaults to `None`.
+            The shipped `config.json` uses YaRN scaling with factor=0.25, original_max_position_embeddings=16384.
+        attention_bias (`bool`, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        mlp_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in up_proj, down_proj and gate_proj layers.
+        sliding_window (`int`, *optional*, defaults to None):
+            Sliding window attention size.
+        mask_token_id (`int`, *optional*, defaults to -1):
+            Token ID for masking in diffusion.
+        dlm_paradigm (`str`, *optional*, defaults to 'bidirectional'):
+            Paradigm for diffusion ('bidirectional', 'autoregressive', 'block_diff').
+        block_size (`int`, *optional*, defaults to 32):
+            Block size for block diffusion paradigms.
+        dlm_loss_weight (`float`, *optional*):
+            Weight for diffusion LM loss.
+        ar_loss_weight (`float`, *optional*, defaults to 1.0):
+            Weight for autoregressive loss in block_diff paradigm. Use 10000 to only use AR loss.
+        dp_varying_mask_ratio (`bool`, *optional*, defaults to False):
+            Whether to use varying mask ratio for each DP rank during sampling.
+    """
+    model_type = "nemotron_labs_diffusion"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    # Default tensor parallel plan for base model `Ministral`
+    base_model_tp_plan = {
+        "layers.*.self_attn.q_proj": "colwise",
+        "layers.*.self_attn.k_proj": "colwise",
+        "layers.*.self_attn.v_proj": "colwise",
+        "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.mlp.gate_proj": "colwise",
+        "layers.*.mlp.up_proj": "colwise",
+        "layers.*.mlp.down_proj": "rowwise",
+    }
+    base_model_pp_plan = {
+        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "norm": (["hidden_states"], ["hidden_states"]),
+    }
+    def __init__(
+        self,
+        vocab_size=131072,
+        hidden_size=4096,
+        intermediate_size=14336,
+        num_hidden_layers=34,
+        num_attention_heads=32,
+        num_key_value_heads=8,
+        head_dim=128,
+        hidden_act="silu",
+        max_position_embeddings=262144,
+        initializer_range=0.02,
+        rms_norm_eps=1e-05,
+        use_cache=True,
+        pad_token_id=None,
+        bos_token_id=1,
+        eos_token_id=2,
+        tie_word_embeddings=False,
+        rope_theta=1000000.0,
+        rope_parameters=None,
+        attention_bias=False,
+        attention_dropout=0.0,
+        mlp_bias=False,
+        sliding_window=None,
+        attn_implementation="sdpa",
+        mask_token_id=-1,
+        dlm_paradigm='bidirectional',
+        block_size=32,
+        dlm_loss_weight=None,
+        ar_loss_weight=1.0,
+        dp_varying_mask_ratio=False,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.rope_parameters = rope_parameters
+        # `rope_theta` is read at the top level by transformers v4.55's yarn impl; mirror from rope_parameters when present.
+        self.rope_theta = (rope_parameters or {}).get("rope_theta", rope_theta)
+        # v4.55 reads rope params from `rope_scaling`; in v5.0 `rope_scaling` is a property alias for rope_parameters.
+        self.rope_scaling = rope_parameters
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        self.mlp_bias = mlp_bias
+        self.sliding_window = sliding_window
+        rope_config_validation(self)
+        self.attn_implementation = attn_implementation
+        self.mask_token_id = mask_token_id
+        self.dlm_paradigm = dlm_paradigm
+        self.block_size = block_size
+        self.dlm_loss_weight = dlm_loss_weight
+        self.ar_loss_weight = ar_loss_weight
+        self.dp_varying_mask_ratio = dp_varying_mask_ratio
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+__all__ = ["NemotronLabsDiffusionConfig"]
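Reviewer note: the `rope_theta` handling in `__init__` is easy to misread. A value inside `rope_parameters` wins, and the keyword argument is only a fallback. The standalone function below isolates that one expression; the name `resolve_rope_theta` is hypothetical and exists only for illustration.

```python
def resolve_rope_theta(rope_parameters, rope_theta=1000000.0):
    """Same fallback expression as in NemotronLabsDiffusionConfig.__init__."""
    return (rope_parameters or {}).get("rope_theta", rope_theta)


print(resolve_rope_theta(None))                      # 1000000.0
print(resolve_rope_theta({"rope_type": "yarn"}))     # 1000000.0
print(resolve_rope_theta({"rope_theta": 500000.0}))  # 500000.0
```

With the shipped `config.json`, `rope_parameters` carries `rope_theta: 1000000.0`, so both paths give the same value.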

generation_config.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "transformers_version": "5.0.0",
+  "use_cache": false
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:73af2cd1c982f85bac01c7da43765deb3f2deced76eb93dbd2a6a968ff531349
+size 16979144720
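Reviewer note: the pointer size can be sanity-checked against `config.json`. The estimate below assumes a Llama-style decoder layer: q/k/v/o projections and a gate/up/down MLP (the names that appear in `base_model_tp_plan`), no biases, two RMSNorm weights per layer, one final norm, and an output head untied from the input embedding. The norm layout and output head are assumptions, since the modeling code is not fully shown here.

```python
# Shapes from config.json; the layer layout is an assumption (see note above).
vocab_size, hidden, inter = 131072, 4096, 14336
layers, heads, kv_heads, head_dim = 34, 32, 8, 128

attn = hidden * heads * head_dim * 2 + hidden * kv_heads * head_dim * 2  # q, o + k, v
mlp = 3 * hidden * inter          # gate, up, down
norms = 2 * hidden                # two RMSNorm weights per layer
per_layer = attn + mlp + norms

embeddings = 2 * vocab_size * hidden  # input embedding + untied output head
total_params = layers * per_layer + embeddings + hidden  # + final norm

estimated_bytes = total_params * 2    # bfloat16, 2 bytes per parameter
file_bytes = 16979144720              # size recorded in the LFS pointer above

print(total_params)                   # 8489553920
print(file_bytes - estimated_bytes)   # 36880
```

Under these assumptions the weights account for all but about 36 KB of the file, a remainder that is plausibly the safetensors header, and the parameter count of roughly 8.49B matches the "8B" in the model name.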

model_cards/bias.md ADDED

	@@ -0,0 +1,4 @@

+Field                                                                                               |  Response
+:---------------------------------------------------------------------------------------------------|:---------------
+Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing:  |  [None]
+Measures taken to mitigate against unwanted bias:                                                   |  [None]

model_cards/explainability.md ADDED

	@@ -0,0 +1,13 @@

+Field                                                                                                  |  Response
+:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
+Intended Task/Domain:                                                                   |  Text generation
+Model Type:                                                                                            |  Transformer
+Intended Users:                                                                                        |  Generative AI creators working with conversational AI models.
+Output:                                                                                                |  Text (Responds to posed question, Stateful - remembers previous answers)
+Describe how the model works:                                                                          |  Text input is encoded into tokens and passed into a transformer-based language model, which returns a text response.
+Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:  |  Not Applicable
+Technical Limitations & Mitigation:                                                                    |  The model cannot perform long-horizon reasoning and tool calling.
+Verified to have met prescribed NVIDIA quality standards:  |  Yes
+Performance Metrics:                                                                                   |  Accuracy, Latency, Throughput
+Potential Known Risks:                                                                                 |  In some instances, the model may think too long and struggle to derive final answers. The model's output can generate all forms of text, including what may be considered toxic, offensive, or indecent.
+Licensing:                                                                                             |  nvidia-open-model-license.

model_cards/privacy.md ADDED

	@@ -0,0 +1,11 @@

+Field                                                                                                                              |  Response
+:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
+Generatable or reverse engineerable personal data?                                                     |  [No]
+Personal data used to create this model?                                                                                       |  [No]
+Was consent obtained for any personal data used?                                                                                             |  [Not Applicable]
+How often is dataset reviewed?                                                                                                     |  [During dataset creation, model training, evaluation, and the prerelease phase.]
+Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? |  [Yes]
+Is there provenance for all datasets used in training?                                                                                |  Yes
+Does data labeling (annotation, metadata) comply with privacy laws?                                                                |  Yes
+Is data compliant with data subject requests for data correction or removal, if such a request was made?                           | Not Applicable.
+Applicable Privacy Policy        | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

model_cards/safety.md ADDED

	@@ -0,0 +1,6 @@

+Field                                               |  Response
+:---------------------------------------------------|:----------------------------------
+Model Application Field(s):                               |  [Media & Entertainment].
+Describe the life critical impact (if present).   |  Not Applicable
+Model and dataset restrictions:            |  The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development.  Restrictions enforce dataset access during training, and dataset license constraints adhered to.
+Use Case Restrictions: | Abide by nvidia-open-model-license.

modeling_ministral.py ADDED

	@@ -0,0 +1,459 @@

+from collections.abc import Callable
+from typing import Optional, Union
+import torch
+from torch import nn
+from transformers.utils.generic import check_model_inputs
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache
+from transformers.generation import GenerationMixin
+# from transformers.integrations import use_kernel_forward_from_hub, use_kernel_func_from_hub, use_kernelized_func
+from transformers.integrations import use_kernel_forward_from_hub
+from transformers.masking_utils import create_causal_mask, create_sliding_window_causal_mask, ALL_MASK_ATTENTION_FUNCTIONS
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_layers import (
+    GenericForQuestionAnswering,
+    GenericForSequenceClassification,
+    GenericForTokenClassification,
+    GradientCheckpointingLayer,
+)
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.processing_utils import Unpack
+from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple
+# from transformers.utils.generic import maybe_autocast
+from .configuration_nemotron_labs_diffusion import NemotronLabsDiffusionConfig
+#ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa'] = sdpa_mask_older_torch
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+# @use_kernel_func_from_hub("rotary_pos_emb")
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs: Unpack[TransformersKwargs],
+):
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+    return attn_output, attn_weights
+def _get_llama_4_attn_scale(positions_ids: torch.Tensor, beta: float, max_position_embeddings: int) -> torch.Tensor:
+    scaling = 1 + beta * torch.log(1 + torch.floor(positions_ids / max_position_embeddings))
+    return scaling.unsqueeze(-1)
+# @use_kernelized_func(apply_rotary_pos_emb)
+class Ministral3Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: NemotronLabsDiffusionConfig, layer_idx: int):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.head_dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.scaling = self.head_dim**-0.5
+        self.attention_dropout = config.attention_dropout
+        self.is_causal = True
+        self.q_proj = nn.Linear(config.hidden_size, config.num_attention_heads * self.head_dim, bias=False)
+        self.k_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)
+        self.v_proj = nn.Linear(config.hidden_size, config.num_key_value_heads * self.head_dim, bias=False)
+        self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=False)
+        self.diffusion_lm = config.diffusion_lm
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_values: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = False,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        query_states = query_states * _get_llama_4_attn_scale(
+            cache_position,
+            self.config.rope_parameters.get("llama_4_scaling_beta"),
+            self.config.rope_parameters.get("original_max_position_embeddings"),
+        ).to(query_states.dtype)
+        if past_key_values is not None:
+            if use_cache:
+                # sin and cos are specific to RoPE models; cache_position needed for the static cache
+                cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+                key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+            else:  # use_cache is False: read the cached keys/values without updating the cache
+                old_k, old_v = past_key_values.layers[self.layer_idx].keys, past_key_values.layers[self.layer_idx].values
+                key_states   = torch.cat([old_k, key_states], dim=-2)
+                value_states = torch.cat([old_v, value_states], dim=-2)
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+        if self.diffusion_lm:
+            attn_output, attn_weights = attention_interface(
+                self,
+                query_states,
+                key_states,
+                value_states,
+                None,
+                dropout=0.0 if not self.training else self.attention_dropout,
+                scaling=self.scaling,
+                is_causal=False,
+                **kwargs,
+            )
+        else:
+            attn_output, attn_weights = attention_interface(
+                self,
+                query_states,
+                key_states,
+                value_states,
+                attention_mask,
+                dropout=0.0 if not self.training else self.attention_dropout,
+                scaling=self.scaling,
+                sliding_window=getattr(self.config, "sliding_window", None),  # main diff with Llama
+                **kwargs,
+            )
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+class Ministral3MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+        self.act_fn = ACT2FN[config.hidden_act]
+    def forward(self, x):
+        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return down_proj
+@use_kernel_forward_from_hub("RMSNorm")
+class Ministral3RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        Ministral3RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
+class Ministral3DecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: NemotronLabsDiffusionConfig, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        if hasattr(config, 'attn_class'):
+            attn_class = config.attn_class
+        else:
+            attn_class = Ministral3Attention
+        self.self_attn = attn_class(config=config, layer_idx=layer_idx)
+        self.mlp = Ministral3MLP(config)
+        self.input_layernorm = Ministral3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Ministral3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> torch.Tensor:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        # Self Attention
+        hidden_states, _ = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        return hidden_states
+@auto_docstring
+class Ministral3PreTrainedModel(PreTrainedModel):
+    config: NemotronLabsDiffusionConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["Ministral3DecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
+    _can_record_outputs = {
+        "hidden_states": Ministral3DecoderLayer,
+        "attentions": Ministral3Attention,
+    }
+class Ministral3RotaryEmbedding(nn.Module):
+    inv_freq: torch.Tensor  # fix linting for `register_buffer`
+    def __init__(self, config: NemotronLabsDiffusionConfig, device=None):
+        super().__init__()
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+        self.config = config
+        self.rope_type = self.config.rope_parameters["rope_type"]
+        rope_init_fn: Callable = self.compute_default_rope_parameters
+        if self.rope_type != "default":
+            rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+        inv_freq, self.attention_scaling = rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = inv_freq
+    @staticmethod
+    def compute_default_rope_parameters(
+        config: Optional[NemotronLabsDiffusionConfig] = None,
+        device: Optional["torch.device"] = None,
+        seq_len: Optional[int] = None,
+    ) -> tuple["torch.Tensor", float]:
+        """
+        Computes the inverse frequencies according to the original RoPE implementation
+        Args:
+            config ([`~transformers.PreTrainedConfig`]):
+                The model configuration.
+            device (`torch.device`):
+                The device to use for initialization of the inverse frequencies.
+            seq_len (`int`, *optional*):
+                The current sequence length. Unused for this type of RoPE.
+        Returns:
+            Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+            post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+        """
+        base = config.rope_parameters["rope_theta"]
+        dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads
+        attention_factor = 1.0  # Unused in this type of RoPE
+        # Compute the inverse frequencies
+        inv_freq = 1.0 / (
+            base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim)
+        )
+        return inv_freq, attention_factor
+    @torch.no_grad()
+    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+    def forward(self, x, position_ids):
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+        # with maybe_autocast(device_type=device_type, enabled=False):  # Force float32
+        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+        emb = torch.cat((freqs, freqs), dim=-1)
+        cos = emb.cos() * self.attention_scaling
+        sin = emb.sin() * self.attention_scaling
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+@auto_docstring
+class Ministral3Model(Ministral3PreTrainedModel):
+    def __init__(self, config: NemotronLabsDiffusionConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList(
+            [Ministral3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.norm = Ministral3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = Ministral3RotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+    @check_model_inputs
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BaseModelOutputWithPast:
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+        if use_cache and past_key_values is None:
+            # past_key_values = DynamicCache(config=self.config)
+            past_key_values = DynamicCache()
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+            )
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+        if kwargs.get("use_causal_mask", False):
+            mask_function = create_causal_mask if self.config.sliding_window is None else create_sliding_window_causal_mask
+            causal_mask = mask_function(
+                config=self.config,
+                input_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                cache_position=cache_position,
+                past_key_values=past_key_values,
+                position_ids=position_ids,
+            )
+        else:
+            causal_mask = None
+        hidden_states = inputs_embeds
+        position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)
+        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+            hidden_states = decoder_layer(
+                hidden_states,
+                attention_mask=causal_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                cache_position=cache_position,
+                position_embeddings=position_embeddings,
+                **kwargs,
+            )
+        hidden_states = self.norm(hidden_states)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values if use_cache else None,
+        )
+__all__ = [
+    "Ministral3Model",
+    "Ministral3PreTrainedModel",
+]
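
The rotary embedding helpers in `modeling_ministral.py` (`rotate_half`, `apply_rotary_pos_emb`) can be checked on a toy vector. The following is a minimal pure-Python sketch, not the model's code path: `rope_angles`, `apply_rope`, and `dot` are hypothetical helper names, and `theta=10000.0` is an assumed base (the model reads `rope_theta` from its config). It illustrates that the rotated query/key dot product depends only on the relative offset between positions.

```python
import math

def rotate_half(x):
    """Pure-Python analogue of rotate_half: [x1, x2] -> [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rope_angles(position, head_dim, theta=10000.0):
    """cos/sin vectors for one position (theta is an assumed base)."""
    inv_freq = [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    freqs = [position * f for f in inv_freq]
    emb = freqs + freqs  # mirrors torch.cat((freqs, freqs), dim=-1)
    return [math.cos(a) for a in emb], [math.sin(a) for a in emb]

def apply_rope(x, position, theta=10000.0):
    """x * cos + rotate_half(x) * sin, as in apply_rotary_pos_emb."""
    cos, sin = rope_angles(position, len(x), theta)
    rot = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rot, sin)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (3) at two different absolute positions.
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 105), apply_rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair of dimensions is rotated by an angle proportional to the position, the vector norm is unchanged and only the position difference survives in the attention score.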

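The query scaling in `_get_llama_4_attn_scale` is easier to read with concrete numbers. The sketch below re-implements the same formula, `1 + beta * log(1 + floor(position / max_position_embeddings))`, as a scalar function in plain Python. The function name `llama_4_attn_scale` and the values of `beta` and `original_max` are illustrative assumptions, not values read from this model's config.

```python
import math

def llama_4_attn_scale(position: int, beta: float, max_position_embeddings: int) -> float:
    """Scalar version of the formula used in _get_llama_4_attn_scale."""
    return 1.0 + beta * math.log(1.0 + math.floor(position / max_position_embeddings))

# Illustrative values only (assumptions, not taken from the model config).
beta = 0.1
original_max = 16384

for pos in (0, 16383, 16384, 65536):
    print(pos, round(llama_4_attn_scale(pos, beta, original_max), 4))
```

Positions below `original_max` leave the query unscaled (factor 1.0); the factor then grows logarithmically, in steps, each time the position crosses another multiple of `original_max`.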
modeling_nemotron_labs_diffusion.py ADDED

	@@ -0,0 +1,870 @@

+import copy
+from dataclasses import dataclass
+from typing import Optional, Tuple
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch import nn
+from transformers.modeling_outputs import CausalLMOutputWithPast, BaseModelOutput
+from transformers.utils import ModelOutput
+from torch.nn.attention.flex_attention import flex_attention, create_block_mask
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.processing_utils import Unpack
+from transformers.cache_utils import Cache, DynamicCache
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from transformers.generation import GenerationMixin
+import math
+from .modeling_ministral import Ministral3Model, Ministral3PreTrainedModel, Ministral3Attention, apply_rotary_pos_emb, repeat_kv, _get_llama_4_attn_scale
+from .configuration_nemotron_labs_diffusion import NemotronLabsDiffusionConfig
+__all__ = ["NemotronLabsDiffusionModel", "NemotronLabsDiffusionFlexAttention"]
+@dataclass
+class NemotronLabsDiffusionOutputWithPast(ModelOutput):
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    causal_logits: torch.FloatTensor | None = None
+    past_key_values: Cache | None = None
+    hidden_states: tuple[torch.FloatTensor, ...] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+@torch.compile(fullgraph=True, mode="max-autotune-no-cudagraphs", dynamic=False)
+def fused_flex_attention(q, k, v, block_mask=None):
+    return flex_attention(q, k, v, block_mask=block_mask)
+class NemotronLabsDiffusionFlexAttention(Ministral3Attention):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.block_size = self.config.block_size
+        self.block_diff_mask = None
+        import torch._dynamo.config as dcfg
+        dcfg.cache_size_limit = 512
+    def compute_block_mask(self, mode, q_len, block_size=None):
+        def block_diff_mask(block_size, b, h, q_idx, kv_idx, n):
+            # The sequence is [noisy half | clean half]; indices >= n belong to the clean copy (x0).
+            x0_flag_q = (q_idx >= n)
+            x0_flag_kv = (kv_idx >= n)
+            # Compute block indices
+            block_q = torch.where(x0_flag_q == 1,
+                                    (q_idx - n) // block_size,
+                                    q_idx // block_size)
+            block_kv = torch.where(x0_flag_kv == 1,
+                                    (kv_idx - n) // block_size,
+                                    kv_idx // block_size)
+            # **1. Block Diagonal Mask (M_BD) **
+            block_diagonal = (block_q == block_kv) & (x0_flag_kv == 0) & (x0_flag_q == 0)
+            # **2. Offset Block-Causal Mask (M_OBC) **
+            offset_block_causal = (
+                (block_q > block_kv)
+                & (x0_flag_kv == 1)
+                & (x0_flag_q == 0)
+            )
+            # **3. Fully Causal Mask (M_BC) **
+            fully_causal = (q_idx >= kv_idx) & (x0_flag_kv == 1) & (x0_flag_q == 1)
+            # **4. Combine Masks **
+            return block_diagonal | offset_block_causal | fully_causal
+        attn_mask = lambda b, h, q, kv: block_diff_mask(block_size, b, h, q, kv, q_len//2)
+        block_mask = create_block_mask(
+            attn_mask, B=None, H=None, Q_LEN=q_len, KV_LEN=q_len
+        )
+        return block_mask
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_values: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        is_training: bool = True,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        bsz, q_len, _ = hidden_states.size()
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        if is_training:
+            # Split query and key states in half along sequence length dimension
+            q1, q2 = query_states.chunk(2, dim=2)
+            k1, k2 = key_states.chunk(2, dim=2)
+            # Apply RoPE independently to each half
+            q1, k1 = apply_rotary_pos_emb(q1, k1, cos, sin)
+            q2, k2 = apply_rotary_pos_emb(q2, k2, cos, sin)
+            # Recombine the halves
+            query_states = torch.cat([q1, q2], dim=2)
+            key_states = torch.cat([k1, k2], dim=2)
+        else:
+            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        query_states = query_states * _get_llama_4_attn_scale(
+            cache_position,
+            self.config.rope_parameters.get("llama_4_scaling_beta"),
+            self.config.rope_parameters.get("original_max_position_embeddings"),
+        ).to(query_states.dtype)
+        if past_key_values is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        key_states = repeat_kv(key_states, self.num_key_value_groups)
+        value_states = repeat_kv(value_states, self.num_key_value_groups)
+        if self.block_diff_mask is None or q_len != self.block_diff_mask.shape[-2]:
+            # Cache the mask so it is rebuilt only when the sequence length changes
+            self.block_diff_mask = self.compute_block_mask(mode='block_diff', block_size=self.block_size, q_len=q_len)
+        block_mask = self.block_diff_mask
+        attn_output = fused_flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+        attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+        attn_output = self.o_proj(attn_output)
+        return attn_output, None
+class NemotronLabsDiffusionModel(Ministral3PreTrainedModel, GenerationMixin):
+    """
+    A single model with:
+      - a bidirectional encoder + diffusion‐LM head over A
+      - a causal decoder + LM head over B, conditioned on F_A
+    """
+    def __init__(self, config: NemotronLabsDiffusionConfig):
+        super().__init__(config)
+        self.mask_token_id = config.mask_token_id
+        diffusion_config = copy.deepcopy(config)
+        diffusion_config.diffusion_lm = True
+        if config.dlm_paradigm == 'block_diff':
+            diffusion_config.attn_class = NemotronLabsDiffusionFlexAttention
+        elif config.dlm_paradigm in ['bidirectional', 'autoregressive']:
+            diffusion_config.attn_class = Ministral3Attention
+            if config.dlm_paradigm == 'autoregressive':
+                diffusion_config.diffusion_lm = False
+        else:
+            raise ValueError(f"Unsupported DLM paradigm: {config.dlm_paradigm}")
+        self.encoder = Ministral3Model(diffusion_config)
+        self.diffusion_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.vocab_size = config.vocab_size
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.encoder.embed_tokens
+    def set_input_embeddings(self, value):
+        self.encoder.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.diffusion_head
+    def set_output_embeddings(self, new_embeddings):
+        self.diffusion_head = new_embeddings
+    def forward_process(self, input_ids, eps=1e-3, block_size=None, loss_mask=None):
+        b, l = input_ids.shape
+        device = input_ids.device
+        if self.config.dp_varying_mask_ratio:
+            # Enable different random seeds for each DP rank during sampling
+            import torch.distributed as dist
+            dp_rank = 0
+            if dist.is_initialized():
+                try:
+                    dp_rank = dist.get_rank()
+                except Exception:
+                    dp_rank = 0
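+            # Use a local generator so sampling t does not draw from the global RNG stream.
+            # Note: torch.seed() itself reseeds the global RNG with a fresh non-deterministic value.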
+            generator = torch.Generator(device=device)
+            generator.manual_seed(torch.seed() + dp_rank)
+        else:
+            generator = None
+        t = torch.rand(b, device=device, generator=generator)
+        p_mask = (1 - eps) * t + eps  # shape: (b,)
+        p_mask = p_mask[:, None].expand(-1, l)  # shape: (b, l)
+        masked_indices = torch.rand((b, l), device=device) < p_mask
+        if loss_mask is not None:
+            masked_indices[loss_mask == 0] = 0
+        noisy_batch = torch.where(masked_indices, self.mask_token_id, input_ids)
+        return noisy_batch, masked_indices, p_mask
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor]   = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        labels: Optional[torch.LongTensor]       = None,
+        split_len: Optional[int]                 = None,
+        past_key_values: Optional[Cache]         = None,
+        block_size: Optional[int]                = None,
+        eps: float                               = 1e-3,
+        is_teacher: bool                        = False,
+        masked_indices: Optional[torch.Tensor]   = None,
+        p_mask: Optional[torch.Tensor]           = None,
+        teacher_logits: Optional[torch.Tensor]   = None,
+        masked_indices_teacher: Optional[torch.Tensor] = None,
+        loss_mask: Optional[torch.Tensor] = None,
+        ce_loss_weight: float = 1.0,
+        output_last_hidden_states_only: bool = False,
+        skip_loss: bool = False,
+        **kwargs,
+    ) -> NemotronLabsDiffusionOutputWithPast | BaseModelOutput:
+        batch_size, seq_len = input_ids.shape
+        if self.config.dlm_paradigm == 'block_diff':
+            if labels is not None and block_size is None:
+                block_size = self.config.block_size
+        elif self.config.dlm_paradigm not in ('bidirectional', 'autoregressive'):
+            raise ValueError(f"Unknown dLM paradigm: {self.config.dlm_paradigm}")
+        if labels is not None and self.config.dlm_paradigm != 'autoregressive':
+            if masked_indices is not None:
+                # assert p_mask is not None
+                if loss_mask is not None:
+                    masked_indices[loss_mask == 0] = 0
+                noisy_inputs = torch.where(masked_indices, self.mask_token_id, input_ids)
+            else:
+                noisy_inputs, masked_indices, p_mask = self.forward_process(input_ids, eps=eps, block_size=block_size, loss_mask=loss_mask)
+        else:
+            noisy_inputs = input_ids
+            masked_indices = None
+            p_mask = None
+        input_ids_len = noisy_inputs.shape[1]
+        if labels is not None and self.config.dlm_paradigm == 'block_diff':
+            if position_ids is None:
+                position_ids = torch.arange(input_ids_len, device=noisy_inputs.device).unsqueeze(0)
+            noisy_inputs = torch.cat([noisy_inputs, input_ids], dim=1)
+        enc_out  = self.encoder(
+            past_key_values=past_key_values,
+            input_ids=noisy_inputs,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            is_training=(labels is not None),
+            **kwargs,
+        )
+        if output_last_hidden_states_only:
+            return BaseModelOutput(last_hidden_state=enc_out.last_hidden_state)
+        logits = self.diffusion_head(enc_out.last_hidden_state)  # (batch, seq_len, vocab)
+        causal_logits = None
+        if labels is not None and self.config.dlm_paradigm == 'block_diff':
+            causal_logits = logits[:, input_ids_len:]
+            logits = logits[:, :input_ids_len]
+        loss = None
+        if labels is not None and not skip_loss:
+            if self.config.dlm_paradigm == 'autoregressive':
+                shift_logits = logits[..., :-1, :].contiguous()
+                shift_labels = labels[..., 1:].contiguous()
+                if loss_mask is None:
+                    loss_fct = CrossEntropyLoss()
+                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+                    shift_labels = shift_labels.view(-1)
+                    loss = loss_fct(shift_logits, shift_labels)
+                else:
+                    loss_mask = loss_mask[..., 1:].contiguous()
+                    loss_fct = CrossEntropyLoss(reduction='none')
+                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+                    shift_labels = shift_labels.view(-1)
+                    shift_labels = shift_labels.to(shift_logits.device)
+                    token_losses = loss_fct(shift_logits, shift_labels)
+                    flat_loss_mask = loss_mask.reshape(-1)
+                    loss = token_losses[flat_loss_mask == 1].sum() / flat_loss_mask.sum()
+            else:
+                # LLaDA-style diffusion loss: token-wise cross entropy on masked positions,
+                # with each token's loss weighted by 1 / p_mask.
+                token_loss = torch.nn.functional.cross_entropy(
+                    logits[masked_indices],
+                    labels[masked_indices],
+                    reduction='none'
+                ) / p_mask[masked_indices]
+                num_mask_tokens = masked_indices.sum()
+                # global_loss_avg=True: loss is reduced externally by global token count.
+                loss = token_loss.sum()
+                if self.config.dlm_loss_weight is not None:
+                    loss = self.config.dlm_loss_weight * loss
+                if self.config.dlm_paradigm == 'block_diff':
+                    # AR-side loss for block-diffusion paradigm.
+                    causal_logits = causal_logits[..., :-1, :].contiguous()
+                    causal_logits = causal_logits.view(-1, causal_logits.size(-1))
+                    causal_labels = labels[..., 1:].contiguous().view(-1)
+                    loss_fct = CrossEntropyLoss(reduction='sum')
+                    ar_loss = loss_fct(causal_logits, causal_labels)
+                    self.loss_diffusion = loss.detach().item() / num_mask_tokens
+                    self.loss_ar = ar_loss.detach().item() / seq_len
+                    loss = loss + self.config.ar_loss_weight * ar_loss
+                # global_loss_avg=True: return (sum_loss, token_count) for external mean.
+                if self.config.dlm_paradigm == 'block_diff':
+                    loss = (loss, num_mask_tokens + int(self.config.ar_loss_weight * seq_len))
+                else:
+                    loss = (loss, num_mask_tokens)
+        return NemotronLabsDiffusionOutputWithPast(
+            loss=loss if not is_teacher else logits,
+            logits=logits,
+            causal_logits=causal_logits,
+            past_key_values=enc_out.past_key_values,
+            hidden_states=None,
+            attentions=None,
+        )
+    @torch.no_grad()
+    def generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int,
+        block_length: int,
+        threshold: Optional[float] = None,
+        causal_context: bool = True,
+        temperature: float = 0.0,
+        eos_token_id: Optional[int] = None,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+    ):
+        """Block-wise diffusion decoding with prefix-cached KV (LLaDA-style).
+        Each block: append `block_length` mask tokens, then iteratively unmask
+        by confidence top-k (with optional threshold). When `causal_context` is
+        True, the KV cache and the next-block seed are produced via a causal
+        forward between blocks (flipping `self_attn.diffusion_lm`), matching
+        the AR objective at block boundaries.
+        `max_new_tokens` must be a multiple of `block_length`.
+        Returns (output_ids, nfe): output_ids includes the prompt, and nfe
+        counts the forward passes made after the initial prefill.
+        """
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        mask_id = self.mask_token_id
+        x_accum = prompt_ids.clone()
+        B = prompt_ids.shape[0]
+        assert max_new_tokens % block_length == 0, "max_new_tokens must be a multiple of block_length"
+        num_blocks = max_new_tokens // block_length
+        # one denoising step per generated token (matches legacy chat_utils call)
+        steps_per_block = block_length
+        nfe = 0
+        def _set_diffusion_lm(val: bool):
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, "diffusion_lm"):
+                    layer.self_attn.diffusion_lm = val
+        # Initial causal prefill produces the KV cache and the next-block seed.
+        if causal_context:
+            _set_diffusion_lm(False)
+        output = self(prompt_ids, use_cache=True, use_causal_mask=causal_context)
+        past_key_values = output.past_key_values
+        if causal_context:
+            _set_diffusion_lm(True)
+        next_token = None
+        if causal_context:
+            last_logit = output.logits[:, -1, :]
+            if temperature > 0:
+                next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
+            else:
+                next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+        for num_block in range(num_blocks):
+            mask_block = torch.full(
+                (B, block_length), mask_id, dtype=prompt_ids.dtype, device=prompt_ids.device,
+            )
+            if causal_context:
+                mask_block[:, 0] = next_token[:, 0]
+            x_accum = torch.cat([x_accum, mask_block], dim=1)
+            block_start = prompt_ids.size(1) + num_block * block_length
+            block_slice = slice(block_start, block_start + block_length)
+            # Thinking-budget enforcement: if we've passed max_thinking_tokens
+            # without an end-think marker, inject one into this block.
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                tokens_before = num_block * block_length
+                tokens_after = tokens_before + block_length
+                if tokens_after > max_thinking_tokens:
+                    gen_so_far = x_accum[:, prompt_ids.size(1):block_start]
+                    has_end_think = (
+                        (gen_so_far == end_think_token_id).any(dim=1)
+                        if gen_so_far.size(1) > 0
+                        else torch.zeros(B, dtype=torch.bool, device=prompt_ids.device)
+                    )
+                    if not has_end_think.all():
+                        offset = max(0, max_thinking_tokens - tokens_before)
+                        inject_pos = block_start + offset
+                        # Inject only into rows that have not produced an end-think token yet.
+                        x_accum[~has_end_think, inject_pos] = end_think_token_id
+            mask_block_idx0 = x_accum[:, block_slice] == mask_id
+            num_transfer_tokens = _get_num_transfer_tokens(mask_block_idx0, steps_per_block)
+            # Denoise the current block by repeated confidence-based unmasking.
+            for i in range(steps_per_block):
+                mask_block_idx = x_accum[:, block_slice] == mask_id
+                if mask_block_idx.sum() == 0:
+                    break
+                nfe += 1
+                logits_block = self(
+                    x_accum[:, block_slice],
+                    past_key_values=past_key_values,
+                    use_cache=False,
+                ).logits
+                x0, transfer_idx = _get_transfer_index(
+                    logits_block, temperature, mask_block_idx, x_accum[:, block_slice],
+                    num_transfer_tokens=num_transfer_tokens[:, i], threshold=threshold,
+                )
+                cur = x_accum[:, block_slice].clone()
+                cur[transfer_idx] = x0[transfer_idx]
+                x_accum[:, block_slice] = cur
+                if eos_token_id is not None:
+                    block_tokens = x_accum[:, block_slice]
+                    eos_mask = block_tokens == eos_token_id
+                    if eos_mask.any(dim=1).any():
+                        after_eos = eos_mask.cumsum(dim=1).bool()
+                        mask_before = (block_tokens == mask_id) & ~after_eos
+                        if (eos_mask.any(dim=1) & ~mask_before.any(dim=1)).any():
+                            break
+            # Post-block: causal forward over the block to update the KV cache
+            # and (when causal_context) sample the seed for the next block.
+            if causal_context:
+                _set_diffusion_lm(False)
+            output = self(
+                x_accum[:, block_slice],
+                past_key_values=past_key_values,
+                use_cache=True,
+                use_causal_mask=causal_context,
+            )
+            past_key_values = output.past_key_values
+            nfe += 1
+            if causal_context:
+                _set_diffusion_lm(True)
+                last_logit = output.logits[:, -1, :]
+                if temperature > 0:
+                    next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
+                else:
+                    next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+            if eos_token_id is not None:
+                gen_so_far = x_accum[:, prompt_ids.size(1):]
+                is_eos = gen_so_far == eos_token_id
+                if is_eos.any(dim=1).all():
+                    first_eos = is_eos.to(torch.int64).argmax(dim=1)
+                    max_eos = first_eos.max().item()
+                    return x_accum[:, : prompt_ids.size(1) + max_eos + 1], nfe
+        return x_accum, nfe
+    @torch.no_grad()
+    def ar_generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        temperature: float = 0.0,
+        eos_token_id: Optional[int] = None,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+    ) -> tuple:
+        """Autoregressive generation calling the encoder directly (injected by build_hf_tidar_repo).
+        Bypasses NemotronLabsDiffusionModel.forward() to avoid diffusion-specific
+        code paths. Calls self.encoder (Ministral3Model) with explicit cache_position,
+        position_ids, and use_cache so the KV cache and causal masking behave
+        identically to MistralForCausalLM / vLLM.
+        Returns:
+            (output_ids, nfe) where output_ids includes the prompt.
+        """
+        for layer in self.encoder.layers:
+            if hasattr(layer.self_attn, 'diffusion_lm'):
+                layer.self_attn.diffusion_lm = False
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, 'eos_token_id', None)
+        device = prompt_ids.device
+        batch_size, prompt_len = prompt_ids.shape
+        past_key_values = DynamicCache()
+        cache_position = torch.arange(prompt_len, device=device)
+        position_ids = cache_position.unsqueeze(0).expand(batch_size, -1)
+        enc_out = self.encoder(
+            input_ids=prompt_ids,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            use_cache=True,
+            cache_position=cache_position,
+        )
+        past_key_values = enc_out.past_key_values
+        next_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        generated_tokens = []
+        nfe = 0
+        for step in range(max_new_tokens):
+            nfe += 1
+            if temperature > 0:
+                probs = torch.softmax(next_logit / temperature, dim=-1)
+                next_token = torch.multinomial(probs, num_samples=1)
+            else:
+                next_token = torch.argmax(next_logit, dim=-1, keepdim=True)
+            # ---- thinking budget enforcement ----
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                if step >= max_thinking_tokens:
+                    if generated_tokens:
+                        gen_tensor = torch.cat(generated_tokens, dim=1)
+                        has_end_think = (gen_tensor == end_think_token_id).any(dim=1)
+                    else:
+                        has_end_think = torch.zeros(batch_size, dtype=torch.bool, device=device)
+                    # Overwrite only rows that have not produced an end-think token yet.
+                    next_token[~has_end_think] = end_think_token_id
+            generated_tokens.append(next_token)
+            # Stops only when every sequence in the batch emits EOS on this same step.
+            if eos_token_id is not None and (next_token == eos_token_id).all():
+                break
+            if step < max_new_tokens - 1:
+                cur_pos = prompt_len + step
+                step_cache_pos = torch.tensor([cur_pos], device=device)
+                step_pos_ids = step_cache_pos.unsqueeze(0).expand(batch_size, -1)
+                enc_out = self.encoder(
+                    input_ids=next_token,
+                    position_ids=step_pos_ids,
+                    past_key_values=past_key_values,
+                    use_cache=True,
+                    cache_position=step_cache_pos,
+                )
+                past_key_values = enc_out.past_key_values
+                next_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        all_generated = torch.cat(generated_tokens, dim=1)
+        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
+        return output_ids, nfe
+    @torch.no_grad()
+    def linear_spec_generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        block_length: int = 32,
+        temperature: float = 0.0,
+        mask_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+        threshold: float = 0.0,
+    ):
+        """Linear speculative decoding: diffusion draft + AR verify.
+        Each iteration: (1) draft the next block under bidirectional attention,
+        (2) verify the drafted block under causal attention, accept the longest
+        prefix where draft matches AR + one bonus token, advance the KV cache.
+        LoRA-aware: when a PEFT adapter is attached to the model (e.g.
+        ``linear_spec_lora``), it is toggled ON for the bidirectional draft
+        phase and OFF for the causal prefill / verify phases — so the adapter
+        only specializes the diffusion-mode forward and AR semantics are
+        preserved. With no adapter loaded the calls are no-ops.
+        Returns ``(output_ids, nfe)`` — ``output_ids`` includes the prompt.
+        """
+        if prompt_ids.shape[0] != 1:
+            raise ValueError("Linear speculative decoding requires batch_size == 1")
+        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        device = prompt_ids.device
+        def _set_diffusion_lm(val: bool):
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, "diffusion_lm"):
+                    layer.self_attn.diffusion_lm = val
+        def _toggle_adapters(enable: bool):
+            # No-op when no PEFT/LoRA modules are attached.
+            for module in self.modules():
+                if hasattr(module, "_disable_adapters"):
+                    module._disable_adapters = not enable
+        # Prefill (causal, LoRA OFF).
+        _set_diffusion_lm(False)
+        _toggle_adapters(False)
+        enc_out = self.encoder(
+            input_ids=prompt_ids,
+            past_key_values=DynamicCache(),
+            use_cache=True,
+            use_causal_mask=True,
+        )
+        past_key_values = enc_out.past_key_values
+        last_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        nfe = 1
+        if temperature > 0:
+            next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
+        else:
+            next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+        if eos_token_id is not None and next_token.item() == eos_token_id:
+            return torch.cat([prompt_ids, next_token], dim=1), nfe
+        generated = [next_token]
+        total_gen = 1
+        while total_gen < max_new_tokens:
+            cache_len = past_key_values.get_seq_length()
+            block = torch.full((1, block_length), token_mask_id, dtype=torch.long, device=device)
+            block[0, 0] = next_token.item()
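+            # Draft phase (bidirectional, LoRA ON). With threshold > 0, unmask
+            # iteratively: commit positions whose confidence clears the threshold,
+            # or the single most confident one if none do. With threshold == 0,
+            # fill every masked position in one pass.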
+            _set_diffusion_lm(True)
+            _toggle_adapters(True)
+            while True:
+                is_mask = block == token_mask_id
+                if not is_mask.any():
+                    break
+                enc_out = self.encoder(input_ids=block, past_key_values=past_key_values, use_cache=False)
+                nfe += 1
+                draft_logits = self.diffusion_head(enc_out.last_hidden_state)
+                # LLaDA: logit[i] directly predicts position i — no shift needed.
+                if temperature > 0:
+                    draft_probs = torch.softmax(draft_logits / temperature, dim=-1)
+                    draft_tokens = torch.multinomial(
+                        draft_probs.view(-1, draft_probs.shape[-1]), num_samples=1
+                    ).view(1, block_length)
+                else:
+                    draft_tokens = draft_logits.argmax(dim=-1)
+                    draft_probs = torch.softmax(draft_logits, dim=-1)
+                if threshold > 0:
+                    draft_conf = torch.gather(draft_probs, -1, draft_tokens.unsqueeze(-1)).squeeze(-1)
+                    draft_conf = torch.where(is_mask, draft_conf, -torch.inf)
+                    unmask = draft_conf >= threshold
+                    # Force progress even when every masked position is below threshold.
+                    if not unmask.any():
+                        best_idx = draft_conf.view(-1).argmax()
+                        unmask = torch.zeros_like(is_mask, dtype=torch.bool)
+                        unmask.view(-1)[best_idx] = True
+                    block[unmask] = draft_tokens[unmask]
+                else:
+                    block[is_mask] = draft_tokens[is_mask]
+                    break
+            # Verify phase (causal, LoRA OFF).
+            _set_diffusion_lm(False)
+            _toggle_adapters(False)
+            enc_out = self.encoder(
+                input_ids=block,
+                past_key_values=past_key_values,
+                use_cache=True,
+                use_causal_mask=True,
+            )
+            past_key_values = enc_out.past_key_values
+            nfe += 1
+            verify_logits = self.diffusion_head(enc_out.last_hidden_state)
+            if temperature > 0:
+                ar_tokens = torch.multinomial(
+                    torch.softmax(verify_logits / temperature, dim=-1).view(-1, verify_logits.shape[-1]),
+                    num_samples=1,
+                ).view(1, block_length)
+            else:
+                ar_tokens = verify_logits.argmax(dim=-1)
+            # Accept consecutive matches; AR also gives one bonus token at the tail.
+            accepted = 0
+            for i in range(block_length - 1):
+                if ar_tokens[0, i].item() == block[0, i + 1].item():
+                    accepted += 1
+                else:
+                    break
+            accepted += 1
+            accepted_toks = ar_tokens[:, :accepted]
+            generated.append(accepted_toks)
+            total_gen += accepted
+            _crop_dynamic_cache(past_key_values, cache_len + accepted)
+            next_token = ar_tokens[:, accepted - 1 : accepted]
+            if eos_token_id is not None:
+                eos_pos = (accepted_toks[0] == eos_token_id).nonzero(as_tuple=True)[0]
+                if len(eos_pos) > 0:
+                    first_eos = eos_pos[0].item()
+                    generated[-1] = accepted_toks[:, : first_eos + 1]
+                    total_gen = total_gen - accepted + first_eos + 1
+                    break
+            # Thinking-budget enforcement: force end-think as next seed if budget exhausted.
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                if total_gen > max_thinking_tokens:
+                    all_gen = torch.cat(generated, dim=1)
+                    if not (all_gen == end_think_token_id).any():
+                        next_token = torch.tensor([[end_think_token_id]], device=device)
+            if total_gen >= max_new_tokens:
+                break
+        all_generated = torch.cat(generated, dim=1)
+        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
+        return output_ids, nfe
+# ─── Module-level helpers used by `generate` and `linear_spec_generate` ──
+def _crop_dynamic_cache(past_key_values: DynamicCache, max_length: int):
+    """Crop a DynamicCache to max_length, compatible with both old and new transformers."""
+    if hasattr(past_key_values, 'crop'):
+        past_key_values.crop(max_length)
+    else:
+        for layer_idx in range(len(past_key_values)):
+            past_key_values.key_cache[layer_idx] = past_key_values.key_cache[layer_idx][:, :, :max_length]
+            past_key_values.value_cache[layer_idx] = past_key_values.value_cache[layer_idx][:, :, :max_length]
+        past_key_values._seen_tokens = max_length
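+def _add_gumbel_noise(logits, temperature):
+    """Gumbel-max sampling in float64 (low-precision Gumbel hurts MDM quality).
+    Returns exp(logits) / (-log u)**temperature with u ~ Uniform(0, 1); the argmax
+    of the result is a sample from softmax(logits / temperature). With
+    temperature == 0 the logits are returned unchanged (greedy argmax).
+    """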
+    if temperature == 0:
+        return logits
+    logits = logits.to(torch.float64)
+    noise = torch.rand_like(logits, dtype=torch.float64)
+    gumbel_noise = (- torch.log(noise)) ** temperature
+    return logits.exp() / gumbel_noise
+def _get_num_transfer_tokens(mask_index, steps: int):
+    """Even split of masked positions across `steps`, with remainder front-loaded."""
+    mask_num = mask_index.sum(dim=1, keepdim=True)
+    base = mask_num // steps
+    remainder = mask_num % steps
+    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
+    for i in range(mask_num.size(0)):
+        num_transfer_tokens[i, : int(remainder[i])] += 1
+    return num_transfer_tokens
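+def _get_transfer_index(logits, temperature, mask_index, x, num_transfer_tokens, threshold=None):
+    """Pick which masked positions to commit this denoising step.
+    Returns (x0, transfer_index): x0 holds the predicted tokens (argmax of the
+    Gumbel-perturbed logits, with non-masked positions kept from x);
+    transfer_index is a bool mask over positions to finalize. Without
+    `threshold`, the `num_transfer_tokens` most confident masked positions are
+    chosen. With `threshold`, every masked position is a candidate and those
+    below the threshold are dropped, except the most confident one, which is
+    always kept so that each step makes progress.
+    """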
+    logits_with_noise = _add_gumbel_noise(logits, temperature=temperature)
+    x0 = torch.argmax(logits_with_noise, dim=-1)
+    p = F.softmax(logits, dim=-1)
+    x0_p = torch.squeeze(torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
+    x0 = torch.where(mask_index, x0, x)
+    confidence = torch.where(mask_index, x0_p, -np.inf)
+    transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
+    if threshold is not None:
+        num_transfer_tokens = mask_index.sum(dim=1, keepdim=True)
+    for j in range(confidence.shape[0]):
+        _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j])
+        transfer_index[j, select_index] = True
+        if threshold is not None:
+            for k in range(1, num_transfer_tokens[j]):
+                if confidence[j, select_index[k]] < threshold:
+                    transfer_index[j, select_index[k]] = False
+    return x0, transfer_index
+def gumbel_topk(log_w: torch.Tensor, k: int) -> torch.Tensor:
+    """Return a bool mask shaped like the 1-D `log_w` with exactly `k` True entries,
+    sampled without replacement by adding Gumbel noise and taking the top k.
+    """
+    g = -torch.log(-torch.log(torch.rand_like(log_w) + 1e-9) + 1e-9)
+    topk = torch.topk(log_w + g, k).indices
+    mask = torch.zeros_like(log_w, dtype=torch.bool)
+    mask[topk] = True
+    return mask
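
Note on the verify step in `linear_spec_generate`: it keeps the longest run of drafted tokens that the causal pass reproduces, then adds one extra AR token. The following is a minimal pure-Python sketch of that rule, assuming plain lists in place of tensors. The name `accept_prefix` is hypothetical and not part of this commit.

```python
def accept_prefix(draft_block: list[int], ar_tokens: list[int]) -> list[int]:
    """Return the AR tokens accepted for one verified block.

    draft_block[0] is the seed token that was already emitted earlier;
    ar_tokens[i] is the causal prediction for the position after draft_block[i].
    """
    accepted = 0
    for i in range(len(draft_block) - 1):
        if ar_tokens[i] != draft_block[i + 1]:
            break  # first disagreement ends the accepted run
        accepted += 1
    # One bonus token: the AR prediction at the first mismatch (or after the block).
    return ar_tokens[: accepted + 1]


draft = [10, 11, 12, 13]
partial = accept_prefix(draft, [11, 12, 99, 14])   # two matches, then a correction
full = accept_prefix(draft, [11, 12, 13, 14])      # whole draft confirmed
rejected = accept_prefix(draft, [50, 51, 52, 53])  # draft rejected at once
print(partial, full, rejected)
```

Because the seed was emitted by the previous iteration, every verify pass yields at least one new token, even when the draft is rejected outright.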

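The split performed by `_get_num_transfer_tokens` can be checked by hand for a single row. The sketch below is a list-based stand-in for the tensor version, written under the assumption that only one sequence is considered. The name `split_transfer_counts` is illustrative only.

```python
def split_transfer_counts(num_masked: int, steps: int) -> list[int]:
    """Split `num_masked` positions across `steps`, front-loading the remainder."""
    base, remainder = divmod(num_masked, steps)
    # The first `remainder` steps commit one extra token each.
    return [base + 1 if i < remainder else base for i in range(steps)]


small = split_transfer_counts(num_masked=7, steps=3)
seeded = split_transfer_counts(num_masked=31, steps=32)
print(small)        # [3, 2, 2]
print(sum(seeded))  # 31
```

The second call mirrors a block of 32 positions whose first slot is already filled by the causal seed: 31 masks spread over 32 steps leaves the last step with nothing to commit, which is harmless because the denoising loop in `generate` exits once no masks remain.
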
special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
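
To see which special tokens this map declares, it can be read with the standard `json` module. The sketch below inlines a condensed copy of the file contents shown above rather than reading from disk; `special_token_strings` is a hypothetical helper name.

```python
import json

# Condensed copy of the special_tokens_map.json contents from this diff.
SPECIAL_TOKENS_MAP = """
{
  "bos_token": {"content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "</s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "unk_token": {"content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
"""


def special_token_strings(raw: str) -> dict[str, str]:
    """Map each special-token role to its literal string."""
    entries = json.loads(raw)
    return {name: spec["content"] for name, spec in entries.items()}


tokens = special_token_strings(SPECIAL_TOKENS_MAP)
print(tokens)
```

The map declares only BOS, EOS and UNK. The mask token used by the generation code is not listed here; the modeling file reads it from `self.mask_token_id` or `self.config.mask_token_id`.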

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3277c00fe5fb3963b3cb7c07b7f183722d2af4d775a4aea7cfb3684d7cccbc2f
+size 17078330
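
The three lines above are a Git LFS pointer, not the tokenizer itself: each line is a key, a space, and a value. A small sketch of reading those fields follows, using only string operations. `parse_lfs_pointer` is a hypothetical helper, and the pointer text is copied from this diff without the `+` prefixes.

```python
def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:3277c00fe5fb3963b3cb7c07b7f183722d2af4d775a4aea7cfb3684d7cccbc2f
size 17078330
"""

info = parse_lfs_pointer(POINTER)
algo, _, digest = info["oid"].partition(":")
size_mib = int(info["size"]) / (1024 * 1024)  # size is given in bytes
print(algo, len(digest), round(size_mib, 1))
```

The `oid` can be compared against a SHA-256 of the downloaded `tokenizer.json` to confirm that the real file, and not the pointer, was fetched.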

tokenizer_config.json ADDED

The diff for this file is too large to render.