Instructions to use nvidia/Nemotron-Labs-Diffusion-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Nemotron-Labs-Diffusion-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Nemotron-Labs-Diffusion-3B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/Nemotron-Labs-Diffusion-3B", trust_remote_code=True, dtype="auto")


vLLM

How to use nvidia/Nemotron-Labs-Diffusion-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Labs-Diffusion-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
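
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style chat completion body. The helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text.

    Assumes an OpenAI-compatible response: choices[0].message.content.
    """
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "nvidia/Nemotron-Labs-Diffusion-3B", "What is the capital of France?")` should return the reply text. For the SGLang server shown later on this page, only the port would change.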

Use Docker

docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-3B

SGLang

How to use nvidia/Nemotron-Labs-Diffusion-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Labs-Diffusion-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Labs-Diffusion-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/Nemotron-Labs-Diffusion-3B with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-3B
```

Clean up rope params; ensure transformers 4.55/5.0 compatibility

by abhgarg - opened May 15

base: refs/heads/main ← from: refs/pr/2

Files changed (20): +2475 -1098

.gitattributes +0 -5
README.md +97 -100
assets/demo.gif +0 -3
assets/demo.mp4 +0 -3
assets/result_acc.png +0 -3
assets/result_efficiency.png +0 -3
assets/teaser.png +0 -3
chat_utils.py +313 -0
config.json +21 -4
configuration_nemotron_labs_diffusion.py → configuration_ministral_dlm.py +75 -8
generation_config.json +1 -1
linear_spec_lora/adapter_config.json +0 -34
linear_spec_lora/adapter_model.safetensors +0 -3
model_cards/bias.md +0 -4
model_cards/explainability.md +0 -13
model_cards/privacy.md +0 -11
model_cards/safety.md +0 -6
modeling_ministral.py +108 -24
modeling_ministral_dlm.py +1860 -0
modeling_nemotron_labs_diffusion.py +0 -870

.gitattributes CHANGED

@@ -34,8 +34,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
-assets/demo.gif filter=lfs diff=lfs merge=lfs -text
-assets/demo.mp4 filter=lfs diff=lfs merge=lfs -text
-assets/result_acc.png filter=lfs diff=lfs merge=lfs -text
-assets/result_efficiency.png filter=lfs diff=lfs merge=lfs -text
-assets/teaser.png filter=lfs diff=lfs merge=lfs -text


README.md CHANGED

@@ -1,163 +1,160 @@
 ---
 library_name: transformers
-license: other
-license_name: nvidia-nemotron-open-model-license
-license_link: >-
-  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
-pipeline_tag: text-generation
-tags:
-- nvidia
-- pytorch
 ---
-# Nemotron-Labs-Diffusion-3B
-<div align="center" style="line-height: 1;">
-<a href="https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report.pdf?VersionId=1tm4XZATEzGV7cs51XAf.xmWupU20vYW" target="_blank" style="margin: 2px;">
-    <img alt="Chat" src="https://img.shields.io/badge/📝Paper-Read Now!-536af5?color=76B900&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
-</a>
-<a href="https://huggingface.co/collections/nvidia/nemotron-labs-diffusion" target="_blank" style="margin: 2px;">
-    <img alt="Nemotron-Labs-Diffusion Model Family" src="https://img.shields.io/badge/%F0%9F%A4%97-Nemotron--Labs--Diffusion_Model_Family-76B900" style="display: inline-block; vertical-align: middle;"/>
-</a>
-<a href="https://github.com/NVlabs/Nemotron-Labs-Diffusion" target="_blank" style="margin: 2px;">
-    <img alt="GitHub" src="https://img.shields.io/badge/GitHub-Github Repository-76B900?logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
-</a>
-<a href="https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/" style="margin: 2px;">
-  <img alt="License" src="https://img.shields.io/badge/License-NVIDIA Open Model License-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
-</a>
-</div>
-[![Demo](./assets/demo.gif)](./assets/demo.mp4)
-## Model Overview
-Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with shared KV cache, achieving high acceptance lengths and decoding efficiency. The seamless mode switching by simply changing attention patterns enables high efficiency at different concurrency levels in varying deployment scenarios with one single model.
-<div align="center">
-<img src="./assets/teaser.png" alt="An illustration of Tri-Mode LMs" width="500">
-</div>
-## Highlights
-- SOTA 3B, 8B, 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation with the focus on decode efficiency.
-- Generation moved from a memory-bound regime toward a compute-bound regime. Model weights are loaded once and reused to compute multiple tokens during generation.
-- Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
-  * 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
-  * 5.9× tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
-- Real-device speed-up across platforms:
-  * DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
-  * GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
-- Diffusion speedup-of-light analysis shows that throughput can be further doubled (vs. current best) for a single user with better sampling - future research.
-<div align="center">
-<img src="./assets/result_acc.png" alt="Efficiency Results" width="800">
-</div>
-<div align="center">
-<img src="./assets/result_efficiency.png" alt="Acc Results" width="800">
-</div>
-## License/Terms of Use
-Use of this model is governed by the [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/).
-## Environment
-```bash
-transformers>=5.0.0
 ```
-## Chat with Our Model
 ```
-from transformers import AutoModel, AutoTokenizer
 import torch
-repo_name = "nvidia/Nemotron-Labs-Diffusion-3B"
 tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
-model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
-model = model.cuda().to(torch.bfloat16)
 history = []
 user_input = input("User: ").strip()
 history.append({"role": "user", "content": user_input})
-prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
-prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
-## Chat in AR Mode
-out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)
-## Chat in dLM Mode
-out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)
-## Chat in Linear Self-Speculation Mode
-out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)
-tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
 print(f"Model: {tokenized_out}")
 print(f"[Num Function Eval (NFE)={nfe}]")
 ```
-## Inference with Linear Self-Speculation + LoRA-enhanced Drafter
-An optional LoRA adapter can be applied to the diffusion drafter in the linear self-speculation mode to further increase the acceptance length:
-```python
-import torch
-from transformers import AutoModel, AutoTokenizer
-from peft import PeftModel
-repo = "nvidia/Nemotron-Labs-Diffusion-3B"
-tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
-model = AutoModel.from_pretrained(repo, trust_remote_code=True)
-model = model.cuda().to(torch.bfloat16)
-# Attach the linear_spec LoRA adapter.
-model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval()
-# Unwrap so we can call linear_spec_generate directly (it toggles LoRA internally).
-base = model.model
-history = [{"role": "user", "content": "Solve: What is 15% of 240?"}]
-prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
-prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
-out_ids, nfe = base.linear_spec_generate(
-    prompt_ids, max_new_tokens=512, block_length=32,
-    eos_token_id=tokenizer.eos_token_id,
-)
-print(tokenizer.decode(out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True))
-print(f"[NFE={nfe}]")
 ```
-## Ethical Considerations
-NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the [bias](./model_cards/bias.md), [explainability](./model_cards/explainability.md), [safety & security](./model_cards/safety.md), and [privacy](./model_cards/privacy.md) subcards.
-Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
-## Citations
-```bibtex
-@techreport{fu2026nemotronlabsdiffusion,
-  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
-  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Jingyu Liu and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
-  institution = {NVIDIA},
-  year        = {2026},
-  note        = {Technical report}
-}
 ```

 ---
 library_name: transformers
+tags: []
 ---
+# Nemotron-Diffusion-Exp-Ministral-3B-Instruct
+Developed by the [DLER team](https://nv-dler.github.io/) @ NVR and actively updated. Contact Yonggan Fu and Pavlo Molchanov with any questions.
+# Environment
+Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_dllm_ministral.sqsh` on CW-DFW. Apply for interactive nodes with the following command:
+```
+srun -A {account} --partition interactive --time 4:00:00 --gpus 8 --container-image /lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_dllm_ministral.sqsh --container-mounts=$HOME:/home,/lustre:/lustre  --pty bash
+```
+## Chat with Our Model in dLM Mode
+```
+from transformers import AutoModel, AutoTokenizer
+import torch
+repo_name = "nvidia/Nemotron-Diffusion-Exp-Ministral-3B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
+model = model.cuda().to(torch.bfloat16)
+history = []
+user_input = input("User: ").strip()
+history.append({"role": "user", "content": user_input})
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
+prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
+out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, steps=512, block_length=32, shift_logits=False, causal_context=True, threshold=0.9, eos_token_id=tokenizer.eos_token_id)
+tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
+print(f"Model: {tokenized_out}")
+print(f"[Num Function Eval (NFE)={nfe}]")
+```
+## Chat with Our Model in AR Mode
+```
+from transformers import AutoModel, AutoTokenizer
+import torch
+repo_name = "nvidia/Nemotron-Diffusion-Exp-Ministral-3B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
+model = model.cuda().to(torch.bfloat16)
+history = []
+user_input = input("User: ").strip()
+history.append({"role": "user", "content": user_input})
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True, enable_thinking=False)
+prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
+out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)
+tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
+print(f"Model: {tokenized_out}")
+print(f"[Num Function Eval (NFE)={nfe}]")
 ```
+## Chat with Our Model in Quadratic Self-Speculation Mode
 ```
+from transformers import AutoModel, AutoTokenizer, AutoConfig
 import torch
+repo_name = "nvidia/Nemotron-Diffusion-Exp-Ministral-3B-Instruct"
 tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+config = AutoConfig.from_pretrained(repo_name, trust_remote_code=True)
+config.enable_self_spec = True
+model = AutoModel.from_pretrained(repo_name, config=config, trust_remote_code=True).cuda().to(torch.bfloat16)
 history = []
 user_input = input("User: ").strip()
 history.append({"role": "user", "content": user_input})
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True, enable_thinking=False)
+inputs = tokenizer(prompt, return_tensors="pt")
+inputs = inputs.to("cuda")
+out_ids, nfe = model.self_spec_generate(inputs.input_ids, max_new_tokens=512, steps=512, block_length=32, ar_mix_weight=0.5, eos_token_id=tokenizer.eos_token_id)
+tokenized_out = tokenizer.batch_decode(out_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
 print(f"Model: {tokenized_out}")
 print(f"[Num Function Eval (NFE)={nfe}]")
 ```
+## Chat with Our Model in Linear Self-Speculation Mode
+```
+from transformers import AutoModel, AutoTokenizer
+import torch
+repo_name = "nvidia/Nemotron-Diffusion-Exp-Ministral-3B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
+model = model.cuda().to(torch.bfloat16)
+history = []
+user_input = input("User: ").strip()
+history.append({"role": "user", "content": user_input})
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True, enable_thinking=False)
+prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
+out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)
+tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
+print(f"Model: {tokenized_out}")
+print(f"[Num Function Eval (NFE)={nfe}]")
+```
+## Chat with Our Model in Linear Decoding Mode with Multi-Path Verification
 ```
+from transformers import AutoModel, AutoTokenizer
+import torch
+repo_name = "nvidia/Nemotron-Diffusion-Exp-Ministral-3B-Instruct"
+tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
+model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
+model = model.cuda().to(torch.bfloat16)
+history = []
+user_input = input("User: ").strip()
+history.append({"role": "user", "content": user_input})
+prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True, enable_thinking=False)
+prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
+out_ids, nfe = model.linear_spec_generate_mp(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)
+tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
+print(f"Model: {tokenized_out}")
+print(f"[Num Function Eval (NFE)={nfe}]")
 ```
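
The five README snippets above differ only in the generation method and its keyword arguments. A minimal sketch of a dispatcher that captures this, assuming the method names and arguments exactly as they appear in the README diff; `MODE_CALLS` and `run_mode` are hypothetical helpers, not part of the repository:

```python
from typing import Any, Dict, Tuple

# Method name and extra keyword arguments per mode, copied from the README calls.
MODE_CALLS: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "dlm": ("generate", {"steps": 512, "block_length": 32, "shift_logits": False,
                         "causal_context": True, "threshold": 0.9}),
    "ar": ("ar_generate", {}),
    "self_spec": ("self_spec_generate", {"steps": 512, "block_length": 32,
                                         "ar_mix_weight": 0.5}),
    "linear_spec": ("linear_spec_generate", {"block_length": 32}),
    "linear_spec_mp": ("linear_spec_generate_mp", {"block_length": 32}),
}


def run_mode(model: Any, prompt_ids: Any, mode: str, eos_token_id: int,
             max_new_tokens: int = 512) -> Any:
    """Call the generation method for `mode`; each README call returns (out_ids, nfe)."""
    method_name, extra = MODE_CALLS[mode]
    kwargs = dict(extra, max_new_tokens=max_new_tokens)
    if mode != "ar":
        # The README's AR call is the only one that does not pass eos_token_id.
        kwargs["eos_token_id"] = eos_token_id
    return getattr(model, method_name)(prompt_ids, **kwargs)
```

Note that the `"self_spec"` mode additionally needs the model to be loaded with `config.enable_self_spec = True`, as in the quadratic self-speculation snippet; the dispatcher does not handle that step.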

assets/demo.gif DELETED

Git LFS Details

SHA256: 0d09264e272ac0f82dee36417f6a16511287ec1f8dee3b5dba3da222d791fd2c
Pointer size: 132 Bytes
Size of remote file: 8.25 MB

assets/demo.mp4 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:666d8785ac4af75931d9c677757c4ef9945bf114d07f1c4e2ebb7b893ac39006
-size 9454873

assets/result_acc.png DELETED

Git LFS Details

SHA256: 992aa22ca9eca3d0bddbcd9f49837e2a9f377bbc0f7545563b129a50b3811448
Pointer size: 131 Bytes
Size of remote file: 405 kB

assets/result_efficiency.png DELETED

Git LFS Details

SHA256: 4f6161912e2aa703e0ef1bdccbb85039529b97e759d6247c33afa2a209806ede
Pointer size: 131 Bytes
Size of remote file: 801 kB

assets/teaser.png DELETED

Git LFS Details

SHA256: 6c94aa7b0c6cf8fb739724d0c1ce45749c76443c592eeab94d7cbb9083c6c6b1
Pointer size: 131 Bytes
Size of remote file: 581 kB

chat_utils.py ADDED

@@ -0,0 +1,313 @@

+import numpy as np
+import torch
+import torch.nn.functional as F
+def add_gumbel_noise(logits, temperature):
+    '''
+    The Gumbel max is a method for sampling categorical distributions.
+    According to arXiv:2409.02908, for MDM, low-precision Gumbel Max improves perplexity score but reduces generation quality.
+    Thus, we use float64.
+    '''
+    if temperature == 0:
+        return logits
+    logits = logits.to(torch.float64)
+    noise = torch.rand_like(logits, dtype=torch.float64)
+    gumbel_noise = (- torch.log(noise)) ** temperature
+    return logits.exp() / gumbel_noise
+def get_transfer_index(logits, temperature, remasking, mask_index, x, num_transfer_tokens, threshold=None, neg_entropy=False):
+    logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
+    x0 = torch.argmax(logits_with_noise, dim=-1)
+    if remasking == 'low_confidence':
+        # p = F.softmax(logits.to(torch.float64), dim=-1)
+        p = F.softmax(logits, dim=-1)
+        x0_p = torch.squeeze(
+            torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1) # b, l
+    elif remasking == 'top_p_margin':
+        # Compute probabilities
+        p = F.softmax(logits, dim=-1)                       # (B, L, V)
+        # Top-2 per position
+        top2 = torch.topk(p, k=2, dim=-1).values            # (B, L, 2)
+        margin = top2[..., 0] - top2[..., 1]                # (B, L)
+        # Normalize margin to [0,1] over MASKED positions per row
+        plus_inf  = torch.full_like(margin, float('inf'))
+        minus_inf = torch.full_like(margin, float('-inf'))
+        masked_for_min = torch.where(mask_index, margin, plus_inf)
+        masked_for_max = torch.where(mask_index, margin, minus_inf)
+        row_min = masked_for_min.amin(dim=1, keepdim=True)  # (B, 1)
+        row_max = masked_for_max.amax(dim=1, keepdim=True)  # (B, 1)
+        denom = (row_max - row_min)
+        # If denom==0 (all equal), set normalized=1 on masked; 0 elsewhere by default
+        normalized = torch.zeros_like(margin)
+        nonzero = denom > 0
+        normalized = torch.where(
+            mask_index & nonzero,
+            (margin - row_min) / (denom + 1e-12),
+            normalized
+        )
+        normalized = torch.where(
+            mask_index & (~nonzero),
+            torch.ones_like(normalized),
+            normalized
+        )
+        x0_p = normalized  # ∈ [0,1] on masked positions
+    elif remasking == 'random':
+        x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
+    else:
+        raise NotImplementedError(remasking)
+    # Calculate negative entropy if requested
+    if neg_entropy:
+        # p = F.softmax(logits.to(torch.float64), dim=-1)
+        p = F.softmax(logits, dim=-1)
+        epsilon = 1e-10
+        log_probs = torch.log(p + epsilon)
+        confidence_scores = torch.sum(p * log_probs, dim=-1)  # negative entropy per position
+    else:
+        confidence_scores = x0_p
+    x0 = torch.where(mask_index, x0, x)
+    confidence = torch.where(mask_index, confidence_scores, -np.inf)
+    transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
+    if threshold is not None:
+        num_transfer_tokens = mask_index.sum(dim=1, keepdim=True)
+    # print(f'confidence: {confidence}')
+    for j in range(confidence.shape[0]):
+        _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j])
+        transfer_index[j, select_index] = True
+        if threshold is not None:
+            for k in range(1, num_transfer_tokens[j]):
+                if confidence[j, select_index[k]] < threshold:
+                    transfer_index[j, select_index[k]] = False
+    return x0, transfer_index
+def get_num_transfer_tokens(mask_index, steps: int):
+    mask_num = mask_index.sum(dim=1, keepdim=True)
+    base = mask_num // steps
+    remainder = mask_num % steps
+    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
+    for i in range(mask_num.size(0)):
+        num_transfer_tokens[i, : int(remainder[i])] += 1
+    return num_transfer_tokens
+@torch.no_grad()
+def generate_with_prefix_cache_block_diff(
+    model,
+    prompt,
+    steps=128,
+    gen_length=128,
+    block_length=128,
+    temperature=0.,
+    remasking='low_confidence',
+    mask_id=126336,
+    threshold=None,
+    factor=None,
+    shift_logits=False,
+    neg_entropy=False,
+    causal_context=False,
+    eos_token_id=None,
+    max_thinking_tokens=None,
+    end_think_token_id=None,
+):
+    dream_style=shift_logits
+    x_accum = prompt.clone()
+    B = prompt.shape[0]
+    assert gen_length % block_length == 0
+    num_blocks = gen_length // block_length
+    assert steps % num_blocks == 0
+    steps_per_block = steps // num_blocks
+    nfe = 0
+    if causal_context:
+        model_module = model.module if hasattr(model, "module") else model
+        for layer in model_module.encoder.layers:
+            if hasattr(layer.self_attn, 'diffusion_lm'):
+                layer.self_attn.diffusion_lm=False
+    # Compute KV cache for the prompt initially
+    output = model(prompt, use_cache=True, use_causal_mask=causal_context)
+    past_key_values = output.past_key_values
+    if causal_context:
+        for layer in model_module.encoder.layers:
+            if hasattr(layer.self_attn, 'diffusion_lm'):
+                layer.self_attn.diffusion_lm=True
+    # Causal prefill: next token from last position (same as linear_spec_generate).
+    next_token = None
+    if causal_context:
+        last_logit = output.logits[:, -1, :]
+        if temperature > 0:
+            probs = torch.softmax(last_logit / temperature, dim=-1)
+            next_token = torch.multinomial(probs, num_samples=1)
+        else:
+            next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+    # For dream_style: store the "next token logit" of the context
+    next_logits_context = None
+    if dream_style:
+        next_logits_context = output.logits[:, -1:, :]  # (B, 1, V)
+    for num_block in range(num_blocks):
+        # Create a new block with mask tokens; under causal context, seed position 0
+        # with the next-token prediction from the previous causal forward (prefill or
+        # post-block encode), matching linear_spec_generate.
+        mask_block = torch.ones(
+            (prompt.shape[0], block_length),
+            dtype=prompt.dtype,
+            device=prompt.device
+        ) * mask_id
+        if causal_context:
+            mask_block[:, 0] = next_token[:, 0]
+        # Append the block of masks
+        x_accum = torch.cat([x_accum, mask_block], dim=1)
+        current_block_start = prompt.size(1) + num_block * block_length
+        block_slice = slice(current_block_start, current_block_start + block_length)
+        # ---- thinking budget enforcement ----
+        # If we've generated >= max_thinking_tokens without a </think>, inject one.
+        if end_think_token_id is not None and max_thinking_tokens is not None:
+            tokens_before_block = num_block * block_length
+            tokens_after_block = tokens_before_block + block_length
+            if tokens_after_block > max_thinking_tokens:
+                gen_so_far = x_accum[:, prompt.size(1):current_block_start]
+                has_end_think = (
+                    (gen_so_far == end_think_token_id).any(dim=1)
+                    if gen_so_far.size(1) > 0
+                    else torch.zeros(B, dtype=torch.bool, device=prompt.device)
+                )
+                if not has_end_think.all():
+                    if tokens_before_block < max_thinking_tokens:
+                        offset = max_thinking_tokens - tokens_before_block
+                    else:
+                        offset = 0
+                    inject_pos = current_block_start + offset
+                    for b in range(B):
+                        if not has_end_think[b]:
+                            x_accum[b, inject_pos] = end_think_token_id
+        # Build the initial mask for this block
+        mask_block_idx0 = (x_accum[:, block_slice] == mask_id)  # (B, Lb)
+        # Precompute the transfer schedule for this block
+        # Masked positions only (position 0 may be causal-seeded, not mask_id);
+        # the schedule mask is the same whether or not dream_style is set.
+        schedule_mask = mask_block_idx0
+        num_transfer_tokens = get_num_transfer_tokens(schedule_mask, steps_per_block)  # (B, steps)
+        # Denoise the current block
+        for i in range(steps_per_block):
+            mask_block_idx = (x_accum[:, block_slice] == mask_id)  # (B, Lb)
+            if mask_block_idx.sum() == 0:
+                break
+            nfe += 1
+            # Forward only the current noisy block using cached context
+            logits_block = model(
+                x_accum[:, block_slice],
+                past_key_values=past_key_values,
+                use_cache=False
+            ).logits
+            if dream_style:
+                # Align logits so that each masked position has a predictor:
+                # prepend context-next logit, then use logits_block[:-1]
+                if block_length == 1:
+                    logits_use = next_logits_context              # (B, 1, V)
+                else:
+                    logits_use = torch.cat(
+                        [next_logits_context, logits_block[:, :-1, :]],
+                        dim=1
+                    )  # (B, Lb, V)
+                mask_use = mask_block_idx                        # (B, Lb)
+                x_use   = x_accum[:, block_slice]                # (B, Lb)
+                x0, transfer_idx = get_transfer_index(
+                    logits_use, temperature, remasking, mask_use, x_use,
+                    num_transfer_tokens=num_transfer_tokens[:, i],
+                    threshold=threshold, neg_entropy=neg_entropy
+                )
+                cur = x_accum[:, block_slice].clone()
+                cur[transfer_idx] = x0[transfer_idx]
+                x_accum[:, block_slice] = cur
+            else:
+                # non-AR (same-position) case
+                x0, transfer_idx = get_transfer_index(
+                    logits_block, temperature, remasking, mask_block_idx,
+                    x_accum[:, block_slice],
+                    num_transfer_tokens=num_transfer_tokens[:, i],
+                    threshold=threshold, neg_entropy=neg_entropy
+                )
+                cur = x_accum[:, block_slice].clone()
+                cur[transfer_idx] = x0[transfer_idx]
+                x_accum[:, block_slice] = cur
+            if eos_token_id is not None:
+                block_tokens = x_accum[:, block_slice]              # (B, Lb)
+                eos_mask = (block_tokens == eos_token_id)           # (B, Lb)
+                any_eos = eos_mask.any(dim=1)                       # (B,)
+                if any_eos.any():
+                    after_eos = eos_mask.cumsum(dim=1).bool()       # (B, Lb)
+                    mask_before = (block_tokens == mask_id) & ~after_eos
+                    if (any_eos & ~mask_before.any(dim=1)).any():
+                        break
+        if causal_context:
+            for layer in model_module.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm=False
+        # after block is fully denoised, update KV cache
+        output = model(
+            x_accum[:, block_slice],
+            past_key_values=past_key_values,
+            use_cache=True,
+            use_causal_mask=causal_context
+        )
+        past_key_values = output.past_key_values
+        nfe += 1
+        if causal_context:
+            for layer in model_module.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm=True
+            # Next block's first position = greedy/sampled next token from this causal encode
+            last_logit = output.logits[:, -1, :]
+            if temperature > 0:
+                probs = torch.softmax(last_logit / temperature, dim=-1)
+                next_token = torch.multinomial(probs, num_samples=1)
+            else:
+                next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+        if dream_style and num_block < num_blocks - 1:
+            # refresh context-next logit for the next block
+            next_logits_context = output.logits[:, -1:, :]  # (B, 1, V)
+        if eos_token_id is not None:
+            gen_so_far = x_accum[:, prompt.size(1):]                    # (B, gen_len_so_far)
+            is_eos = (gen_so_far == eos_token_id)                       # (B, gen_len_so_far)
+            has_eos = is_eos.any(dim=1)                                 # (B,)
+            if has_eos.all():
+                first_eos_pos = is_eos.to(torch.int64).argmax(dim=1)    # (B,)
+                max_eos = first_eos_pos.max().item()
+                return x_accum[:, : prompt.size(1) + max_eos + 1], nfe
+    return x_accum, nfe
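
Review note: the final early-exit check in the hunk above returns once every sequence in the batch contains an EOS, and truncates the output at the latest "first EOS" position. The sketch below mirrors that logic with plain Python lists standing in for `x_accum`. The helper name `truncate_at_eos` and the sample batch are hypothetical and not part of the commit; the token IDs 11 (EOS) and 100 (mask) are taken from config.json.

```python
from typing import List, Optional


def truncate_at_eos(rows: List[List[int]], prompt_len: int, eos_id: int) -> Optional[List[List[int]]]:
    """Return rows cut just after the latest first-EOS, or None if any row has no EOS yet."""
    first_eos = []
    for row in rows:
        generated = row[prompt_len:]
        if eos_id not in generated:
            return None  # at least one sequence is still generating
        first_eos.append(generated.index(eos_id))  # position of the first EOS in this row
    cut = prompt_len + max(first_eos) + 1
    return [row[:cut] for row in rows]


# Hypothetical batch: prompt length 2, EOS id 11, mask id 100.
batch = [
    [5, 6, 7, 11, 100, 100],
    [5, 6, 8, 9, 11, 100],
]
print(truncate_at_eos(batch, prompt_len=2, eos_id=11))
```

Because the cut uses the maximum of the per-row first-EOS positions, rows that finished earlier keep whatever tokens follow their own EOS up to that point, so callers still need to trim each row individually.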

config.json CHANGED

@@ -1,21 +1,31 @@
 {
   "ar_loss_weight": 1.0,
   "architectures": [
-    "NemotronLabsDiffusionModel"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
   "attn_implementation": "sdpa",
   "auto_map": {
-    "AutoConfig": "configuration_nemotron_labs_diffusion.NemotronLabsDiffusionConfig",
-    "AutoModel": "modeling_nemotron_labs_diffusion.NemotronLabsDiffusionModel"
   },
   "block_size": 32,
   "bos_token_id": 1,
   "dlm_loss_weight": null,
   "dlm_paradigm": "bidirectional",
   "dp_varying_mask_ratio": false,
   "eos_token_id": 11,
   "head_dim": 128,
   "hidden_act": "silu",
   "hidden_size": 3072,
@@ -24,10 +34,16 @@
   "mask_token_id": 100,
   "max_position_embeddings": 262144,
   "mlp_bias": false,
-  "model_type": "nemotron_labs_diffusion",
   "num_attention_heads": 32,
   "num_hidden_layers": 26,
   "num_key_value_heads": 8,
   "rms_norm_eps": 1e-05,
   "rope_parameters": {
     "beta_fast": 32.0,
@@ -42,6 +58,7 @@
   },
   "sliding_window": null,
   "tie_word_embeddings": false,
   "torch_dtype": "bfloat16",
   "transformers_version": "5.0.0",
   "use_cache": false,

 {
+  "ada_dlm_loss_ratio": null,
+  "ada_perm_ratio_global": null,
+  "ada_perm_ratio_per_block": null,
+  "adaptive_mask_rate": false,
   "ar_loss_weight": 1.0,
   "architectures": [
+    "MinistralDiffEncoderModel"
   ],
   "attention_bias": false,
   "attention_dropout": 0.0,
   "attn_implementation": "sdpa",
   "auto_map": {
+    "AutoConfig": "configuration_ministral_dlm.MinistralDLMConfig",
+    "AutoModel": "modeling_ministral_dlm.MinistralDiffEncoderModel"
   },
   "block_size": 32,
   "bos_token_id": 1,
+  "diff_loss_weight": 1,
+  "dlm_arch": "encoder",
   "dlm_loss_weight": null,
   "dlm_paradigm": "bidirectional",
+  "dlm_type": "llada",
   "dp_varying_mask_ratio": false,
+  "enable_self_spec": false,
+  "enforce_mask": false,
   "eos_token_id": 11,
+  "global_loss_avg": false,
   "head_dim": 128,
   "hidden_act": "silu",
   "hidden_size": 3072,
   "mask_token_id": 100,
   "max_position_embeddings": 262144,
   "mlp_bias": false,
+  "model_type": "ministral_dlm",
+  "multi_sampling": null,
+  "num_ar_layers": 0,
   "num_attention_heads": 32,
+  "num_diffusion_layers": 0,
   "num_hidden_layers": 26,
   "num_key_value_heads": 8,
+  "num_skip_loss_tokens": 0,
+  "prefix_ratio": 0.8,
+  "random_length_prob": 0,
   "rms_norm_eps": 1e-05,
   "rope_parameters": {
     "beta_fast": 32.0,
   },
   "sliding_window": null,
   "tie_word_embeddings": false,
+  "tok_mask_half_life_ratio": null,
   "torch_dtype": "bfloat16",
   "transformers_version": "5.0.0",
   "use_cache": false,
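
Review note: the attention-related values in this config are unchanged by the commit, and their relationship is easy to misread because `head_dim` is set explicitly rather than derived from `hidden_size`. The arithmetic below is a small sketch that assumes the usual grouped-query attention layout (query width = heads × head_dim); that layout is an assumption, not something this diff states.

```python
# Values copied from the config.json shown above.
hidden_size = 3072
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128

q_width = num_attention_heads * head_dim        # total query projection width
kv_width = num_key_value_heads * head_dim       # total key (or value) projection width
queries_per_kv = num_attention_heads // num_key_value_heads  # query heads sharing one KV head

print(q_width, kv_width, queries_per_kv)
```

Under that assumption the query width (4096) is larger than `hidden_size` (3072), and each key/value head serves four query heads.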

configuration_nemotron_labs_diffusion.py → configuration_ministral_dlm.py RENAMED

@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Nemotron-Labs Diffusion model configuration"""
 from transformers.configuration_utils import PretrainedConfig
 from transformers.modeling_rope_utils import rope_config_validation
@@ -22,10 +22,10 @@ from transformers.utils import logging
 logger = logging.get_logger(__name__)
-class NemotronLabsDiffusionConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a [`NemotronLabsDiffusionModel`] for diffusion language models.
-    It is used to instantiate a NemotronLabsDiffusionModel according to the specified arguments, defining the model architecture.
     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
     documentation from [`PretrainedConfig`] for more information.
@@ -72,19 +72,52 @@ class NemotronLabsDiffusionConfig(PretrainedConfig):
             Sliding window attention size.
         mask_token_id (`int`, *optional*, defaults to -1):
             Token ID for masking in diffusion.
         dlm_paradigm (`str`, *optional*, defaults to 'bidirectional'):
-            Paradigm for diffusion ('bidirectional', 'autoregressive', 'block_diff').
         block_size (`int`, *optional*, defaults to 32):
             Block size for block diffusion paradigms.
         dlm_loss_weight (`float`, *optional*):
             Weight for diffusion LM loss.
         ar_loss_weight (`float`, *optional*, defaults to 1.0):
-            Weight for autoregressive loss in block_diff paradigm. Use 10000 to only use AR loss.
         dp_varying_mask_ratio (`bool`, *optional*, defaults to False):
             Whether to use varying mask ratio for each DP rank during sampling.
     """
-    model_type = "nemotron_labs_diffusion"
     keys_to_ignore_at_inference = ["past_key_values"]
     # Default tensor parallel plan for base model `Ministral`
@@ -129,11 +162,28 @@ class NemotronLabsDiffusionConfig(PretrainedConfig):
         sliding_window=None,
         attn_implementation="sdpa",
         mask_token_id=-1,
         dlm_paradigm='bidirectional',
         block_size=32,
         dlm_loss_weight=None,
         ar_loss_weight=1.0,
         dp_varying_mask_ratio=False,
         **kwargs,
     ):
         self.vocab_size = vocab_size
@@ -168,11 +218,28 @@ class NemotronLabsDiffusionConfig(PretrainedConfig):
         self.attn_implementation = attn_implementation
         self.mask_token_id = mask_token_id
         self.dlm_paradigm = dlm_paradigm
         self.block_size = block_size
         self.dlm_loss_weight = dlm_loss_weight
         self.ar_loss_weight = ar_loss_weight
         self.dp_varying_mask_ratio = dp_varying_mask_ratio
         super().__init__(
             pad_token_id=pad_token_id,
             bos_token_id=bos_token_id,
@@ -182,5 +249,5 @@ class NemotronLabsDiffusionConfig(PretrainedConfig):
         )
-__all__ = ["NemotronLabsDiffusionConfig"]

 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+"""Ministral DLM model configuration"""
 from transformers.configuration_utils import PretrainedConfig
 from transformers.modeling_rope_utils import rope_config_validation
 logger = logging.get_logger(__name__)
+class MinistralDLMConfig(PretrainedConfig):
     r"""
+    This is the configuration class to store the configuration of a [`Ministral3Model`] for diffusion language models.
+    It is used to instantiate a Ministral model according to the specified arguments, defining the model architecture.
     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
     documentation from [`PretrainedConfig`] for more information.
             Sliding window attention size.
         mask_token_id (`int`, *optional*, defaults to -1):
             Token ID for masking in diffusion.
+        dlm_type (`str`, *optional*, defaults to 'llada'):
+            Type of diffusion language model ('llada', 'dream').
+        random_length_prob (`float`, *optional*):
+            Probability of using random lengths during training.
+        num_ar_layers (`int`, *optional*, defaults to 0):
+            Number of autoregressive layers.
+        num_diffusion_layers (`int`, *optional*, defaults to 0):
+            Number of diffusion layers.
+        diff_loss_weight (`float`, *optional*, defaults to 1):
+            Weight for diffusion loss.
+        enforce_mask (`bool`, *optional*, defaults to False):
+            Whether to enforce masking.
+        prefix_ratio (`float`, *optional*, defaults to 0.8):
+            Ratio for prefix in prefix_bidirectional mode.
         dlm_paradigm (`str`, *optional*, defaults to 'bidirectional'):
+            Paradigm for diffusion ('bidirectional', 'autoregressive', 'prefix_bidirectional', 'efficient_block_diff', 'block_diff', 'sbd_block_diff').
+        dlm_arch (`str`, *optional*, defaults to 'encoder'):
+            Architecture type ('encoder', 'encoder_decoder').
         block_size (`int`, *optional*, defaults to 32):
             Block size for block diffusion paradigms.
+        tok_mask_half_life_ratio (`float`, *optional*):
+            Half-life ratio for token masking.
+        adaptive_mask_rate (`bool`, *optional*, defaults to False):
+            Whether to use adaptive mask rate.
+        multi_sampling (`int`, *optional*):
+            Number of samples for multi-sampling.
+        num_skip_loss_tokens (`int`, *optional*, defaults to 0):
+            Number of tokens to skip in loss calculation.
         dlm_loss_weight (`float`, *optional*):
             Weight for diffusion LM loss.
         ar_loss_weight (`float`, *optional*, defaults to 1.0):
+            Weight for autoregressive loss in sbd_block_diff paradigm. Use 10000 to only use AR loss.
+        global_loss_avg (`bool`, *optional*, defaults to False):
+            Whether to use global loss average.
         dp_varying_mask_ratio (`bool`, *optional*, defaults to False):
             Whether to use varying mask ratio for each DP rank during sampling.
+        ada_perm_ratio_per_block (`float`, *optional*):
+            Adaptive permutation ratio for each block.
+        ada_perm_ratio_global (`float`, *optional*):
+            Adaptive permutation ratio applied globally.
+        enable_self_spec (`bool`, *optional*, defaults to `False`):
+            Force MinistralFlexAttention for all paradigms (including bidirectional/autoregressive).
+            Required for self speculative generation; leave False for standard eval to use faster SDPA kernels.
     """
+    model_type = "ministral_dlm"
     keys_to_ignore_at_inference = ["past_key_values"]
     # Default tensor parallel plan for base model `Ministral`
         sliding_window=None,
         attn_implementation="sdpa",
         mask_token_id=-1,
+        dlm_type='llada',
+        random_length_prob=None,
+        num_ar_layers=0,
+        num_diffusion_layers=0,
+        diff_loss_weight=1,
+        enforce_mask=False,
+        prefix_ratio=0.8,
         dlm_paradigm='bidirectional',
+        dlm_arch='encoder',
         block_size=32,
+        tok_mask_half_life_ratio=None,
+        adaptive_mask_rate=False,
+        multi_sampling=None,
+        num_skip_loss_tokens=0,
         dlm_loss_weight=None,
         ar_loss_weight=1.0,
+        global_loss_avg=False,
         dp_varying_mask_ratio=False,
+        ada_perm_ratio_per_block=None,
+        ada_perm_ratio_global=None,
+        ada_dlm_loss_ratio=None,
+        enable_self_spec=False,
         **kwargs,
     ):
         self.vocab_size = vocab_size
         self.attn_implementation = attn_implementation
         self.mask_token_id = mask_token_id
+        self.dlm_type = dlm_type
+        self.random_length_prob = random_length_prob
+        self.num_ar_layers = num_ar_layers
+        self.num_diffusion_layers = num_diffusion_layers
+        self.diff_loss_weight = diff_loss_weight
+        self.enforce_mask = enforce_mask
+        self.prefix_ratio = prefix_ratio
         self.dlm_paradigm = dlm_paradigm
+        self.dlm_arch = dlm_arch
         self.block_size = block_size
+        self.tok_mask_half_life_ratio = tok_mask_half_life_ratio
+        self.adaptive_mask_rate = adaptive_mask_rate
+        self.multi_sampling = multi_sampling
+        self.num_skip_loss_tokens = num_skip_loss_tokens
         self.dlm_loss_weight = dlm_loss_weight
         self.ar_loss_weight = ar_loss_weight
+        self.global_loss_avg = global_loss_avg
         self.dp_varying_mask_ratio = dp_varying_mask_ratio
+        self.ada_perm_ratio_per_block = ada_perm_ratio_per_block
+        self.ada_perm_ratio_global = ada_perm_ratio_global
+        self.ada_dlm_loss_ratio = ada_dlm_loss_ratio
+        self.enable_self_spec = enable_self_spec
         super().__init__(
             pad_token_id=pad_token_id,
             bos_token_id=bos_token_id,
         )
+__all__ = ["MinistralDLMConfig"]
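
Review note: the new docstring lists the accepted string values for `dlm_paradigm`, `dlm_type`, and `dlm_arch`, but the `__init__` shown in this diff stores them without checking. The sketch below shows one way such a check could look. The function `check_dlm_options` is hypothetical and not part of the commit; the allowed values are copied from the docstring above.

```python
# Allowed values as listed in the MinistralDLMConfig docstring.
ALLOWED_PARADIGMS = (
    "bidirectional",
    "autoregressive",
    "prefix_bidirectional",
    "efficient_block_diff",
    "block_diff",
    "sbd_block_diff",
)
ALLOWED_TYPES = ("llada", "dream")
ALLOWED_ARCHS = ("encoder", "encoder_decoder")


def check_dlm_options(dlm_paradigm="bidirectional", dlm_type="llada", dlm_arch="encoder", block_size=32):
    """Hypothetical validator: raise ValueError on an unknown option, else return the options."""
    if dlm_paradigm not in ALLOWED_PARADIGMS:
        raise ValueError(f"Unknown dlm_paradigm: {dlm_paradigm!r}")
    if dlm_type not in ALLOWED_TYPES:
        raise ValueError(f"Unknown dlm_type: {dlm_type!r}")
    if dlm_arch not in ALLOWED_ARCHS:
        raise ValueError(f"Unknown dlm_arch: {dlm_arch!r}")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return {
        "dlm_paradigm": dlm_paradigm,
        "dlm_type": dlm_type,
        "dlm_arch": dlm_arch,
        "block_size": block_size,
    }


print(check_dlm_options(dlm_paradigm="sbd_block_diff"))
```

Failing early in the config would surface a typo in `dlm_paradigm` before any attention module is built.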

generation_config.json CHANGED

@@ -2,6 +2,6 @@
   "_from_model_config": true,
   "bos_token_id": 1,
   "eos_token_id": 11,
-  "transformers_version": "5.0.0",
   "use_cache": false
 }

   "_from_model_config": true,
   "bos_token_id": 1,
   "eos_token_id": 11,
+  "transformers_version": "4.55.4",
   "use_cache": false
 }

linear_spec_lora/adapter_config.json DELETED

@@ -1,34 +0,0 @@
-{
-  "alpha_pattern": {},
-  "auto_mapping": {
-    "base_model_class": "NemotronLabsDiffusionModel",
-    "parent_library": "transformers_modules.Nemotron-Labs-Diffusion-3B.modeling_nemotron_labs_diffusion"
-  },
-  "base_model_name_or_path": "nvidia/Nemotron-Labs-Diffusion-3B",
-  "bias": "none",
-  "eva_config": null,
-  "exclude_modules": null,
-  "fan_in_fan_out": false,
-  "inference_mode": true,
-  "init_lora_weights": true,
-  "layer_replication": null,
-  "layers_pattern": null,
-  "layers_to_transform": null,
-  "loftq_config": {},
-  "lora_alpha": 512,
-  "lora_bias": false,
-  "lora_dropout": 0.0,
-  "megatron_config": null,
-  "megatron_core": "megatron.core",
-  "modules_to_save": null,
-  "peft_type": "LORA",
-  "r": 128,
-  "rank_pattern": {},
-  "revision": null,
-  "target_modules": [
-    "o_proj"
-  ],
-  "task_type": null,
-  "use_dora": false,
-  "use_rslora": false
-}

linear_spec_lora/adapter_model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:897ef67dff8a69bd1a908fa390ef2164fdaa738e0e47bec502e2f0d86311ff74
-size 95427600

model_cards/bias.md DELETED

@@ -1,4 +0,0 @@
-Field                                                                                               |  Response
-:---------------------------------------------------------------------------------------------------|:---------------
-Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing:  |  [None]
-Measures taken to mitigate against unwanted bias:                                                   |  [None]

model_cards/explainability.md DELETED

@@ -1,13 +0,0 @@
-Field                                                                                                  |  Response
-:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
-Intended Task/Domain:                                                                   |  Text generation
-Model Type:                                                                                            |  Transformer
-Intended Users:                                                                                        |  Generative AI creators working with conversational AI models.
-Output:                                                                                                |  Text (Responds to posed question, Stateful - remembers previous answers)
-Describe how the model works:                                                                          |  Text input is encoded into tokens and passed into a transformer-based language model, which returns a text response.
-Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:  |  Not Applicable
-Technical Limitations & Mitigation:                                                                    |  The model cannot perform long-horizon reasoning and tool calling.
-Verified to have met prescribed NVIDIA quality standards:  |  Yes
-Performance Metrics:                                                                                   |  Accuracy, Latency, Throughput
-Potential Known Risks:                                                                                 |  In some instances, the model may think too long and struggle to derive final answers. The model's output can generate all forms of text, including what may be considered toxic, offensive, or indecent.
-Licensing:                                                                                             |  nvidia-open-model-license.

model_cards/privacy.md DELETED

@@ -1,11 +0,0 @@
-Field                                                                                                                              |  Response
-:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
-Generatable or reverse engineerable personal data?                                                     |  [No]
-Personal data used to create this model?                                                                                       |  [No]
-Was consent obtained for any personal data used?                                                                                             |  [Not Applicable]
-How often is dataset reviewed?                                                                                                     |  [During dataset creation, model training, evaluation, and the prerelease phase.]
-Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? |  [Yes]
-Is there provenance for all datasets used in training?                                                                                |  Yes
-Does data labeling (annotation, metadata) comply with privacy laws?                                                                |  Yes
-Is data compliant with data subject requests for data correction or removal, if such a request was made?                           | Not Applicable.
-Applicable Privacy Policy        | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

model_cards/safety.md DELETED

@@ -1,6 +0,0 @@
-Field                                               |  Response
-:---------------------------------------------------|:----------------------------------
-Model Application Field(s):                               |  [Media & Entertainment].
-Describe the life critical impact (if present).   |  Not Applicable
-Model and dataset restrictions:            |  The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development.  Restrictions enforce dataset access during training, and dataset license constraints adhered to.
-Use Case Restrictions: | Abide by nvidia-open-model-license.

modeling_ministral.py CHANGED

@@ -25,7 +25,7 @@ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
 from transformers.processing_utils import Unpack
 from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple
 # from transformers.utils.generic import maybe_autocast
-from .configuration_nemotron_labs_diffusion import NemotronLabsDiffusionConfig
 #ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa'] = sdpa_mask_older_torch
@@ -110,7 +110,7 @@ def _get_llama_4_attn_scale(positions_ids: torch.Tensor, beta: float, max_positi
 class Ministral3Attention(nn.Module):
     """Multi-headed attention from 'Attention Is All You Need' paper"""
-    def __init__(self, config: NemotronLabsDiffusionConfig, layer_idx: int):
         super().__init__()
         self.config = config
         self.layer_idx = layer_idx
@@ -234,7 +234,7 @@ class Ministral3RMSNorm(nn.Module):
 class Ministral3DecoderLayer(GradientCheckpointingLayer):
-    def __init__(self, config: NemotronLabsDiffusionConfig, layer_idx: int):
         super().__init__()
         self.hidden_size = config.hidden_size
@@ -284,7 +284,7 @@ class Ministral3DecoderLayer(GradientCheckpointingLayer):
 @auto_docstring
 class Ministral3PreTrainedModel(PreTrainedModel):
-    config: NemotronLabsDiffusionConfig
     base_model_prefix = "model"
     supports_gradient_checkpointing = True
     _no_split_modules = ["Ministral3DecoderLayer"]
@@ -304,7 +304,7 @@ class Ministral3PreTrainedModel(PreTrainedModel):
 class Ministral3RotaryEmbedding(nn.Module):
     inv_freq: torch.Tensor  # fix linting for `register_buffer`
-    def __init__(self, config: NemotronLabsDiffusionConfig, device=None):
         super().__init__()
         self.max_seq_len_cached = config.max_position_embeddings
         self.original_max_seq_len = config.max_position_embeddings
@@ -323,7 +323,7 @@ class Ministral3RotaryEmbedding(nn.Module):
     @staticmethod
     def compute_default_rope_parameters(
-        config: Optional[NemotronLabsDiffusionConfig] = None,
         device: Optional["torch.device"] = None,
         seq_len: Optional[int] = None,
     ) -> tuple["torch.Tensor", float]:
@@ -370,7 +370,7 @@ class Ministral3RotaryEmbedding(nn.Module):
 @auto_docstring
 class Ministral3Model(Ministral3PreTrainedModel):
-    def __init__(self, config: NemotronLabsDiffusionConfig):
         super().__init__(config)
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
@@ -420,23 +420,15 @@ class Ministral3Model(Ministral3PreTrainedModel):
         if kwargs.get("use_causal_mask", False):
             mask_function = create_causal_mask if self.config.sliding_window is None else create_sliding_window_causal_mask
-            # Build candidate kwargs and filter against the function's signature
-            # for cross-transformers-version compatibility:
-            #   - `input_embeds` (<= 4.x) was renamed to `inputs_embeds` (>= 5.0)
-            #   - `cache_position` was removed from the signature in 5.9.0
-            import inspect
-            sig_params = inspect.signature(mask_function).parameters
-            embeds_kw = "inputs_embeds" if "inputs_embeds" in sig_params else "input_embeds"
-            candidate = {
-                "config": self.config,
-                "attention_mask": attention_mask,
-                "cache_position": cache_position,
-                "past_key_values": past_key_values,
-                "position_ids": position_ids,
-                embeds_kw: inputs_embeds,
-            }
-            causal_mask = mask_function(**{k: v for k, v in candidate.items() if k in sig_params})
         else:
             causal_mask = None
@@ -461,7 +453,99 @@ class Ministral3Model(Ministral3PreTrainedModel):
         )
 __all__ = [
     "Ministral3Model",
     "Ministral3PreTrainedModel",
 ]
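
Review note: the lines removed in the `@@ -420,23 +420,15 @@` hunk above filtered keyword arguments against the mask function's signature so that one call site could work across transformers versions. The sketch below reproduces that technique in isolation. The two stand-in functions `mask_old` and `mask_new` and their signatures are hypothetical; only the use of `inspect.signature` and the `input_embeds` / `inputs_embeds` naming choice come from the removed code.

```python
import inspect


def mask_old(config, input_embeds, cache_position):
    # Stand-in for an older API that takes `input_embeds` and `cache_position`.
    return ("old", input_embeds, cache_position)


def mask_new(config, inputs_embeds):
    # Stand-in for a newer API that renamed the argument and dropped `cache_position`.
    return ("new", inputs_embeds)


def build_mask(mask_function, config, embeds, cache_position):
    """Call mask_function with only the keyword arguments its signature accepts."""
    sig_params = inspect.signature(mask_function).parameters
    embeds_kw = "inputs_embeds" if "inputs_embeds" in sig_params else "input_embeds"
    candidate = {
        "config": config,
        "cache_position": cache_position,
        embeds_kw: embeds,
    }
    return mask_function(**{k: v for k, v in candidate.items() if k in sig_params})


print(build_mask(mask_old, "cfg", "E", 0))
print(build_mask(mask_new, "cfg", "E", 0))
```

Calling with a fixed `input_embeds=` keyword, as the added side of this hunk does, gives up that tolerance and will raise a `TypeError` against any signature that uses a different argument name.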

 from transformers.processing_utils import Unpack
 from transformers.utils import TransformersKwargs, auto_docstring, can_return_tuple
 # from transformers.utils.generic import maybe_autocast
+from .configuration_ministral_dlm import MinistralDLMConfig
 #ALL_MASK_ATTENTION_FUNCTIONS._global_mapping['sdpa'] = sdpa_mask_older_torch
 class Ministral3Attention(nn.Module):
     """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: MinistralDLMConfig, layer_idx: int):
         super().__init__()
         self.config = config
         self.layer_idx = layer_idx
 class Ministral3DecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: MinistralDLMConfig, layer_idx: int):
         super().__init__()
         self.hidden_size = config.hidden_size
 @auto_docstring
 class Ministral3PreTrainedModel(PreTrainedModel):
+    config: MinistralDLMConfig
     base_model_prefix = "model"
     supports_gradient_checkpointing = True
     _no_split_modules = ["Ministral3DecoderLayer"]
 class Ministral3RotaryEmbedding(nn.Module):
     inv_freq: torch.Tensor  # fix linting for `register_buffer`
+    def __init__(self, config: MinistralDLMConfig, device=None):
         super().__init__()
         self.max_seq_len_cached = config.max_position_embeddings
         self.original_max_seq_len = config.max_position_embeddings
     @staticmethod
     def compute_default_rope_parameters(
+        config: Optional[MinistralDLMConfig] = None,
         device: Optional["torch.device"] = None,
         seq_len: Optional[int] = None,
     ) -> tuple["torch.Tensor", float]:
 @auto_docstring
 class Ministral3Model(Ministral3PreTrainedModel):
+    def __init__(self, config: MinistralDLMConfig):
         super().__init__(config)
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         if kwargs.get("use_causal_mask", False):
             mask_function = create_causal_mask if self.config.sliding_window is None else create_sliding_window_causal_mask
+            causal_mask = mask_function(
+                config=self.config,
+                input_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                cache_position=cache_position,
+                past_key_values=past_key_values,
+                position_ids=position_ids,
+            )
         else:
             causal_mask = None
         )
+@auto_docstring
+class Ministral3ForCausalLM(Ministral3PreTrainedModel, GenerationMixin):
+    _tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = Ministral3Model(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> CausalLMOutputWithPast:
+        r"""
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, Ministral3ForCausalLM
+        >>> model = Ministral3ForCausalLM.from_pretrained("meta-ministral3/Ministral3-2-7b-hf")
+        >>> tokenizer = AutoTokenizer.from_pretrained("meta-ministral3/Ministral3-2-7b-hf")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        outputs: BaseModelOutputWithPast = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        hidden_states = outputs.last_hidden_state
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+class Ministral3ForTokenClassification(GenericForTokenClassification, Ministral3PreTrainedModel):
+    pass
+class Ministral3ForSequenceClassification(GenericForSequenceClassification, Ministral3PreTrainedModel):
+    pass
+class Ministral3ForQuestionAnswering(GenericForQuestionAnswering, Ministral3PreTrainedModel):
+    pass
 __all__ = [
+    "Ministral3ForCausalLM",
+    "Ministral3ForQuestionAnswering",
     "Ministral3Model",
     "Ministral3PreTrainedModel",
+    "Ministral3ForSequenceClassification",
+    "Ministral3ForTokenClassification",
 ]
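
Review note: in the added `Ministral3ForCausalLM.forward`, the default `logits_to_keep=0` keeps every position, which can look surprising next to `slice(-logits_to_keep, None)`. The sketch below isolates that expression on a plain list; the helper name `keep_slice` is hypothetical.

```python
def keep_slice(logits_to_keep):
    # Same expression as in forward(): an int selects the trailing positions,
    # anything else is used directly as an index.
    return slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep


positions = list(range(6))
print(positions[keep_slice(0)])  # -0 == 0, so slice(0, None) keeps all positions
print(positions[keep_slice(1)])  # only the last position
print(positions[keep_slice(2)])  # the last two positions
```

The "keep everything" default therefore relies on `-0` being `0`, not on a special case in the code.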

modeling_ministral_dlm.py ADDED

@@ -0,0 +1,1860 @@

+import copy
+from dataclasses import dataclass
+from typing import Callable, Optional, Tuple, Union
+import random
+import os
+import sys
+import json
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch import nn
+from transformers.modeling_outputs import CausalLMOutputWithPast, BaseModelOutput
+from transformers.utils import ModelOutput
+from torch.nn.attention.flex_attention import BlockMask, flex_attention, create_block_mask, or_masks
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.processing_utils import Unpack
+from transformers.cache_utils import Cache, DynamicCache
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+from transformers.generation import GenerationMixin
+import math
+from .chat_utils import generate_with_prefix_cache_block_diff
+from .modeling_ministral import Ministral3Model, Ministral3PreTrainedModel, Ministral3Attention, apply_rotary_pos_emb, repeat_kv, _get_llama_4_attn_scale
+from .configuration_ministral_dlm import MinistralDLMConfig
+__all__ = ["MinistralDiffEncoderModel", "MinistralFlexAttention"]
+@dataclass
+class MinistralDiffOutputWithPast(ModelOutput):
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    causal_logits: torch.FloatTensor | None = None
+    past_key_values: Cache | None = None
+    hidden_states: tuple[torch.FloatTensor, ...] | None = None
+    attentions: tuple[torch.FloatTensor, ...] | None = None
+# @torch.compile(dynamic=True, mode="reduce-overhead")
+# @torch.compile(mode="default")
+# @torch.compile(fullgraph=True, mode="reduce-overhead", dynamic=False)
+@torch.compile(fullgraph=True, mode="max-autotune-no-cudagraphs", dynamic=False)
+def fused_flex_attention(q, k, v, block_mask=None):
+    return flex_attention(q, k, v, block_mask=block_mask)
+def _crop_dynamic_cache(past_key_values: DynamicCache, max_length: int):
+    """Crop a DynamicCache to max_length, compatible with both old and new transformers."""
+    if hasattr(past_key_values, 'crop'):
+        past_key_values.crop(max_length)
+    else:
+        for layer_idx in range(len(past_key_values)):
+            past_key_values.key_cache[layer_idx] = past_key_values.key_cache[layer_idx][:, :, :max_length]
+            past_key_values.value_cache[layer_idx] = past_key_values.value_cache[layer_idx][:, :, :max_length]
+        past_key_values._seen_tokens = max_length
+def _extract_draft_kv_cache(past_key_values: DynamicCache, clean_len: int, block_length: int):
+    """After quadratic decoding, extract only draft tokens (first of each block) from cache."""
+    for layer_idx in range(len(past_key_values)):
+        if hasattr(past_key_values, 'layers'):
+            layer_cache = past_key_values.layers[layer_idx]
+            k, v = layer_cache.keys, layer_cache.values
+        else:
+            k = past_key_values.key_cache[layer_idx]
+            v = past_key_values.value_cache[layer_idx]
+        clean_k, draft_k = k[:, :, :clean_len], k[:, :, clean_len::block_length + 1]
+        clean_v, draft_v = v[:, :, :clean_len], v[:, :, clean_len::block_length + 1]
+        new_k = torch.cat([clean_k, draft_k], dim=2)
+        new_v = torch.cat([clean_v, draft_v], dim=2)
+        if hasattr(past_key_values, 'layers'):
+            layer_cache.keys = new_k
+            layer_cache.values = new_v
+        else:
+            past_key_values.key_cache[layer_idx] = new_k
+            past_key_values.value_cache[layer_idx] = new_v
+    past_key_values._seen_tokens = clean_len + block_length
+# with reference to https://github.com/pytorch-labs/attention-gym/blob/main/examples/flex_attn.ipynb
+class MinistralFlexAttention(Ministral3Attention):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.max_seq_length = getattr(self.config, 'max_seq_length', 4096)
+        self.block_size_orig = self.config.block_size
+        if self.config.dlm_paradigm == 'bidirectional':
+            self.bidirectional_mask = self.compute_block_mask(mode='bidirectional')
+        elif self.config.dlm_paradigm == 'autoregressive':
+            self.autoregressive_mask = self.compute_block_mask(mode='autoregressive')
+        elif self.config.dlm_paradigm == 'block_diff':
+            self.block_diff_mask = None
+        elif self.config.dlm_paradigm == 'sbd_block_diff':
+            self.sbd_block_diff_mask = None
+        else:
+            raise ValueError(f"Unknown attention mode: {self.config.dlm_paradigm}")
+        self.block_size = self.block_size_orig
+        self.mode = self.config.dlm_paradigm
+        self._quadratic_block_mask = {}
+        import torch._dynamo.config as dcfg
+        dcfg.cache_size_limit = 512
+    def _get_sbd_inference_quadratic_decoding_block_mask(self, block_length: int):
+        if block_length not in self._quadratic_block_mask:
+            draft_len = block_length * (block_length + 1)
+            def quadratic(b, h, q_idx, kv_idx):
+                first_clean = torch.logical_and(
+                    kv_idx % (block_length + 1) == 0,
+                    kv_idx < draft_len,
+                )
+                first_clean = torch.logical_and(first_clean, q_idx >= kv_idx)
+                block_q = q_idx // (block_length + 1)
+                block_kv = kv_idx // (block_length + 1)
+                same_block = torch.logical_and(block_q == block_kv, q_idx < draft_len)
+                same_block_except_first = torch.logical_and(
+                    same_block,
+                    q_idx % (block_length + 1) != 0,
+                )
+                draft_part = torch.logical_or(first_clean, same_block_except_first)
+                clean_part = kv_idx >= draft_len
+                return torch.logical_or(draft_part, clean_part)
+            block_mask = create_block_mask(
+                quadratic,
+                B=None,
+                H=None,
+                Q_LEN=draft_len,
+                KV_LEN=draft_len + self.config.max_position_embeddings,
+                device="cuda",
+            )
+            self._quadratic_block_mask[block_length] = block_mask
+        return self._quadratic_block_mask[block_length]
+    def set_attention_mode(self, mode, block_size=None):
+        self.mode = mode
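+        # Fall back to the configured block size when none is given, so the block-diffusion
+        # modes never rebuild their mask with block_size=None.
+        self.block_size = block_size if block_size is not None else self.block_size_orig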
+    def compute_block_mask(self, mode, q_len=None, block_size=None):
+        def bidirectional_mask(b, h, q, kv):
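+            # Tautology: always True, so every query attends to every key. It is written in
+            # terms of q and kv so that the result is a tensor, as create_block_mask expects.
+            return (q >= kv) | (q < kv)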
+        def autoregressive_mask(b, h, q, kv):
+            return (q >= kv)
+        def block_diff_mask(block_size, b, h, q_idx, kv_idx, n):
+            x0_flag_q = (q_idx >= n)
+            x0_flag_kv = (kv_idx >= n)
+            # Compute block indices
+            block_q = torch.where(x0_flag_q == 1,
+                                    (q_idx - n) // block_size,
+                                    q_idx // block_size)
+            block_kv = torch.where(x0_flag_kv == 1,
+                                    (kv_idx - n) // block_size,
+                                    kv_idx // block_size)
+            # 1. Block-diagonal mask (M_BD): attend within the same block of the same half
+            #    (noisy-to-noisy or clean-to-clean).
+            block_diagonal = (block_q == block_kv) & (x0_flag_q == x0_flag_kv)
+            # 2. Offset block-causal mask (M_OBC): noisy queries attend to clean keys
+            #    in strictly earlier blocks.
+            offset_block_causal = (
+                (block_q > block_kv)
+                & (x0_flag_kv == 1)
+                & (x0_flag_q == 0)
+            )
+            # 3. Block-causal mask (M_BC): clean queries attend to clean keys in the
+            #    same or earlier blocks.
+            block_causal = (block_q >= block_kv) & (x0_flag_kv == 1) & (x0_flag_q == 1)
+            # 4. Combine the three masks.
+            return block_diagonal | offset_block_causal | block_causal
+        def sbd_block_diff_mask(block_size, b, h, q_idx, kv_idx, n):
+            x0_flag_q = (q_idx >= n)
+            x0_flag_kv = (kv_idx >= n)
+            # Compute block indices
+            block_q = torch.where(x0_flag_q == 1,
+                                    (q_idx - n) // block_size,
+                                    q_idx // block_size)
+            block_kv = torch.where(x0_flag_kv == 1,
+                                    (kv_idx - n) // block_size,
+                                    kv_idx // block_size)
+            # 1. Block-diagonal mask (M_BD): noisy queries attend to noisy keys in the same block.
+            block_diagonal = (block_q == block_kv) & (x0_flag_kv == 0) & (x0_flag_q == 0)
+            # 2. Offset block-causal mask (M_OBC): noisy queries attend to clean keys
+            #    in strictly earlier blocks.
+            offset_block_causal = (
+                (block_q > block_kv)
+                & (x0_flag_kv == 1)
+                & (x0_flag_q == 0)
+            )
+            # 3. Fully causal mask: clean queries attend to clean keys token by token
+            #    (this replaces the block-causal M_BC used by block_diff).
+            fully_causal = (q_idx >= kv_idx) & (x0_flag_kv == 1) & (x0_flag_q == 1)
+            # 4. Combine the three masks.
+            return block_diagonal | offset_block_causal | fully_causal
+        if mode == 'bidirectional':
+            attn_mask = bidirectional_mask
+        elif mode == 'autoregressive':
+            attn_mask = autoregressive_mask
+        elif mode == 'block_diff':
+            assert block_size is not None
+            attn_mask = lambda b, h, q, kv: block_diff_mask(block_size, b, h, q, kv, self.max_seq_length)
+        elif mode == 'sbd_block_diff':
+            assert block_size is not None
+            attn_mask = lambda b, h, q, kv: sbd_block_diff_mask(block_size, b, h, q, kv, self.max_seq_length)
+        else:
+            raise ValueError(f"Unknown attention mode: {mode}")
+        if q_len is not None:
+            Q_LEN = q_len
+        else:
+            if mode in ['block_diff', 'sbd_block_diff']:
+                Q_LEN = self.max_seq_length * 2
+            else:
+                Q_LEN = self.max_seq_length
+        block_mask = create_block_mask(
+            attn_mask, B=None, H=None, Q_LEN=Q_LEN, KV_LEN=Q_LEN
+        )
+        return block_mask
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_values: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        is_training: bool = True,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        bsz, q_len, _ = hidden_states.size()
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        cos, sin = position_embeddings
+        if self.mode in ['block_diff', 'sbd_block_diff'] and is_training:
+            # Split query and key states in half along sequence length dimension
+            q1, q2 = query_states.chunk(2, dim=2)
+            k1, k2 = key_states.chunk(2, dim=2)
+            # Apply RoPE independently to each half
+            q1, k1 = apply_rotary_pos_emb(q1, k1, cos, sin)
+            q2, k2 = apply_rotary_pos_emb(q2, k2, cos, sin)
+            # Recombine the halves
+            query_states = torch.cat([q1, q2], dim=2)
+            key_states = torch.cat([k1, k2], dim=2)
+        else:
+            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        query_states = query_states * _get_llama_4_attn_scale(
+            cache_position,
+            self.config.rope_parameters.get("llama_4_scaling_beta"),
+            self.config.rope_parameters.get("original_max_position_embeddings"),
+        ).to(query_states.dtype)
+        if past_key_values is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        self_spec_inference_mode = getattr(self.config, "self_spec_inference_mode", None)
+        if self_spec_inference_mode is not None:
+            if self_spec_inference_mode == "quadratic":
+                block_length = getattr(self.config, "block_length", None) or getattr(self.config, "block_size", None)
+                if block_length is None:
+                    raise ValueError("SBD quadratic decoding requires block_length in config.")
+                if past_key_values is not None:
+                    seq_len = key_states.shape[2]
+                    draft_len = block_length * (block_length + 1)
+                    clean_keys = key_states[:, :, :-draft_len]
+                    draft_keys = key_states[:, :, -draft_len:]
+                    clean_values = value_states[:, :, :-draft_len]
+                    draft_values = value_states[:, :, -draft_len:]
+                    key_states = torch.cat([draft_keys, clean_keys], dim=2)
+                    value_states = torch.cat([draft_values, clean_values], dim=2)
+                    block_mask: BlockMask = self._get_sbd_inference_quadratic_decoding_block_mask(
+                        block_length=block_length
+                    )
+                    block_mask.seq_lengths = (draft_len, seq_len)
+                else:
+                    seq_len = query_states.shape[2]
+                    draft_len = block_length * (block_length + 1)
+                    clean_len = seq_len - draft_len
+                    def _causal_mask(b, h, q_idx, kv_idx):
+                        return torch.logical_and(q_idx >= kv_idx, q_idx < clean_len)
+                    def _draft2clean_mask(b, h, q_idx, kv_idx):
+                        full_clean = torch.logical_and(q_idx >= clean_len, kv_idx <= clean_len)
+                        first_clean = torch.logical_and(
+                            q_idx >= clean_len, (kv_idx - clean_len) % (block_length + 1) == 0
+                        )
+                        first_clean = torch.logical_and(first_clean, q_idx >= kv_idx)
+                        return torch.logical_or(full_clean, first_clean)
+                    def _draft_mask(b, h, q_idx, kv_idx):
+                        block_q = (q_idx - clean_len) // (block_length + 1)
+                        block_kv = (kv_idx - clean_len) // (block_length + 1)
+                        quadrant = torch.logical_and(q_idx >= clean_len, kv_idx >= clean_len)
+                        same_block = torch.logical_and(block_q == block_kv, quadrant)
+                        same_block_except_first = torch.logical_and(
+                            same_block,
+                            (q_idx - clean_len) % (block_length + 1) != 0,
+                        )
+                        # same_block_except_first already requires block_q == block_kv
+                        return same_block_except_first
+                    mask = or_masks(_causal_mask, _draft2clean_mask)
+                    mask = or_masks(mask, _draft_mask)
+                    block_mask = create_block_mask(
+                        mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len,
+                    )
+                key_states = repeat_kv(key_states, self.num_key_value_groups)
+                value_states = repeat_kv(value_states, self.num_key_value_groups)
+                attn_output = flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+                attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+                attn_output = self.o_proj(attn_output)
+                return attn_output, None
+            elif self_spec_inference_mode == "default":
+                block_length = getattr(self.config, "block_length", None) or getattr(self.config, "block_size", None)
+                if block_length is None:
+                    raise ValueError("SBD default decoding requires block_length in config.")
+                seq_len = query_states.shape[2]
+                prefix_len = seq_len - block_length
+                def _clean_q_mask(b, h, q_idx, kv_idx):
+                    return torch.logical_and(q_idx >= kv_idx, q_idx < prefix_len)
+                def _noisy_q_mask(b, h, q_idx, kv_idx):
+                    return q_idx >= prefix_len
+                block_mask = create_block_mask(
+                    or_masks(_clean_q_mask, _noisy_q_mask),
+                    B=None,
+                    H=None,
+                    Q_LEN=seq_len,
+                    KV_LEN=seq_len,
+                )
+                key_states = repeat_kv(key_states, self.num_key_value_groups)
+                value_states = repeat_kv(value_states, self.num_key_value_groups)
+                attn_output = flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+                attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+                attn_output = self.o_proj(attn_output)
+                return attn_output, None
+            else:
+                raise ValueError(f"Unknown self_spec_inference_mode: {self_spec_inference_mode}")
+        else:
+            key_states = repeat_kv(key_states, self.num_key_value_groups)
+            value_states = repeat_kv(value_states, self.num_key_value_groups)
+            if self.mode == 'bidirectional':
+                if self.bidirectional_mask is None or q_len != self.bidirectional_mask.shape[-2]:
+                    block_mask = self.compute_block_mask(mode='bidirectional', q_len=q_len)
+                else:
+                    block_mask = self.bidirectional_mask
+            elif self.mode == 'autoregressive':
+                if self.autoregressive_mask is None or q_len != self.autoregressive_mask.shape[-2]:
+                    block_mask = self.compute_block_mask(mode='autoregressive', q_len=q_len)
+                else:
+                    block_mask = self.autoregressive_mask
+            elif self.mode == 'block_diff':
+                if self.block_diff_mask is None or self.block_size != self.block_size_orig or q_len != self.block_diff_mask.shape[-2]:
+                    block_mask = self.compute_block_mask(mode='block_diff', block_size=self.block_size, q_len=q_len)
+                else:
+                    block_mask = self.block_diff_mask
+            elif self.mode == 'sbd_block_diff':
+                if self.sbd_block_diff_mask is None or self.block_size != self.block_size_orig or q_len != self.sbd_block_diff_mask.shape[-2]:
+                    block_mask = self.compute_block_mask(mode='sbd_block_diff', block_size=self.block_size, q_len=q_len)
+                else:
+                    block_mask = self.sbd_block_diff_mask
+            else:
+                raise ValueError(f"Unknown attention mode: {self.mode}")
+            attn_output = fused_flex_attention(query_states, key_states, value_states, block_mask=block_mask)
+            attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
+            attn_output = self.o_proj(attn_output)
+            return attn_output, None
+def gumbel_topk(log_w: torch.Tensor, k: int) -> torch.Tensor:
+    """Return a Bool mask of length len(log_w) with exactly k True."""
+    g = -torch.log(-torch.log(torch.rand_like(log_w) + 1e-9) + 1e-9)
+    topk = torch.topk(log_w + g, k).indices
+    mask = torch.zeros_like(log_w, dtype=torch.bool)
+    mask[topk] = True
+    return mask
+class MinistralDiffEncoderModel(Ministral3PreTrainedModel, GenerationMixin):
+    """
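+    A diffusion language model built from:
+      - a single Ministral3 backbone (`self.encoder`) whose attention pattern is chosen by
+        `config.dlm_paradigm` ('bidirectional', 'autoregressive', 'block_diff', 'sbd_block_diff')
+      - a linear `diffusion_head` that projects hidden states to vocabulary logits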
+    """
+    def __init__(self, config: MinistralDLMConfig):
+        super().__init__(config)
+        self.mask_token_id = config.mask_token_id
+        diffusion_config = copy.deepcopy(config)
+        diffusion_config.diffusion_lm = True
+        use_flex = getattr(config, 'enable_self_spec', False)
+        if config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            diffusion_config.attn_class = MinistralFlexAttention
+        elif config.dlm_paradigm in ['bidirectional', 'autoregressive']:
+            diffusion_config.attn_class = MinistralFlexAttention if use_flex else Ministral3Attention
+            if config.dlm_paradigm == 'autoregressive':
+                diffusion_config.diffusion_lm = False
+        else:
+            raise ValueError(f"Unsupported DLM paradigm: {config.dlm_paradigm}")
+        self.encoder = Ministral3Model(diffusion_config)
+        self.diffusion_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.vocab_size = config.vocab_size
+        self.current_iter_ratio = None
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.encoder.embed_tokens
+    def set_input_embeddings(self, value):
+        self.encoder.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.diffusion_head
+    def set_output_embeddings(self, new_embeddings):
+        self.diffusion_head = new_embeddings
+    def forward_process(self, input_ids, eps=1e-3, block_size=None, loss_mask=None):
+        b, l = input_ids.shape
+        device = input_ids.device
+        if self.config.dp_varying_mask_ratio:
+            # Enable different random seeds for each DP rank during sampling
+            import torch.distributed as dist
+            dp_rank = 0
+            if dist.is_initialized():
+                try:
+                    dp_rank = dist.get_rank()
+                except Exception:
+                    dp_rank = 0
+            # Use a local generator for sampling t. Note that torch.seed() itself reseeds the
+            # global RNG with a fresh non-deterministic value before returning it.
+            generator = torch.Generator(device=device)
+            generator.manual_seed(torch.seed() + dp_rank)
+        else:
+            generator = None
+        if self.config.adaptive_mask_rate:
+            assert block_size is not None
+            # --- simple linear window mapping ---
+            bs_min = getattr(self.config, "t_bs_min", 16)
+            bs_max = getattr(self.config, "t_bs_max", 128)
+            w = getattr(self.config, "t_window_width", 0.6)  # fixed width
+            # position of block_size within [bs_min, bs_max]; may fall outside [0, 1]
+            frac = (float(block_size) - float(bs_min)) / max(1.0, float(bs_max - bs_min))
+            # upper bound decreases linearly from 1.0 (at bs_min) to 1.0 - w (at bs_max)
+            u_max = 1.0 - w * frac
+            # clamp to [0.6, 1.0]; this also handles block sizes outside [bs_min, bs_max]
+            u_max = max(0.6, min(1.0, u_max))
+            u_min = u_max - w  # ensures width = w
+            # sample t ~ Uniform(u_min, u_max)
+            t = u_min + (u_max - u_min) * torch.rand(b, device=device, generator=generator)
+        else:
+            t = torch.rand(b, device=device, generator=generator)
+        p_mask = (1 - eps) * t + eps  # shape: (b,)
+        p_mask = p_mask[:, None].expand(-1, l)  # shape: (b, l)
+        masked_indices = torch.rand((b, l), device=device) < p_mask
+        if loss_mask is not None:
+            masked_indices[loss_mask == 0] = 0
+        noisy_batch = torch.where(masked_indices, self.mask_token_id, input_ids)
+        return noisy_batch, masked_indices, p_mask
+    def forward_process_exp(
+        self,
+        input_ids: torch.Tensor,
+        eps: float = 1e-3,
+        block_size: int | None = None,
+        half_life_ratio: float = 0.25, # λ = ln 2 / (half_life_ratio·L)
+        loss_mask: Optional[torch.Tensor] = None,
+    ):
+        """
+        Two-stage corruption with optional per-block sampling.
+        • Stage 1:  m ~ U(eps, 1)   →   k = round(m · len)  (exact budget).
+        • Stage 2:  sample exactly k positions with weights
+                    w_i(m) = exp[ λ · (1−m) · i ]   (late-heavy when m→0,
+                                                     uniform when m→1).
+          If `block_size` is given, the procedure is run *independently*
+          inside each contiguous block of that length (last block may be shorter).
+          When block_size is provided, m is sampled per-block and p_mask is per-block.
+        Args
+        ----
+        input_ids : (B, L)  LongTensor
+        eps       : minimum corruption ratio
+        block_size: if not None, operate block-wise with per-block m sampling
+        half_life_ratio : controls steepness when m→0
+        loss_mask : if not None, positions where loss_mask == 0 are never masked
+        """
+        B, L = input_ids.shape
+        device = input_ids.device
+        dtype  = torch.float32
+        masked_indices = torch.zeros((B, L), dtype=torch.bool, device=device)
+        p_mask = torch.zeros((B, L), dtype=dtype, device=device)
+        # ---------- Stage 1 & 2: whole-sentence or block-wise -------------------
+        for b in range(B):
+            if block_size is None:
+                # ---------- Per-sequence sampling (original behavior) ----------
+                m = eps + (1.0 - eps) * torch.rand(1, device=device).item()   # scalar
+                k_tot = int(round(m * L))
+                k_tot = max(1, min(k_tot, L))  # clamp to [1, L]
+                # Fill p_mask for this sequence
+                p_mask[b, :] = m
+                slope = 1.0 - m          # ∈ [0,1]; 0 ⇒ uniform, 1 ⇒ late-heavy
+                # ------- single pool over the whole sentence -------------
+                lam_base = math.log(2.0) / (half_life_ratio * L) # base decay rate (λ when slope=1)
+                pos   = torch.arange(L, device=device, dtype=dtype)
+                log_w = (lam_base * slope * pos).clone()
+                masked_indices[b] = gumbel_topk(log_w, k_tot)
+            else:
+                # ---------- Per-block sampling ----------
+                num_blocks = math.ceil(L / block_size)
+                lam_base = math.log(2.0) / (half_life_ratio * block_size) # base decay rate (λ when slope=1)
+                for blk in range(num_blocks):
+                    start = blk * block_size
+                    end   = min((blk + 1) * block_size, L)
+                    blk_len = end - start
+                    # Sample m per block
+                    m_blk = eps + (1.0 - eps) * torch.rand(1, device=device).item()
+                    # Fill p_mask for this block
+                    p_mask[b, start:end] = m_blk
+                    # per-block budget
+                    k_blk = int(round(m_blk * blk_len))
+                    k_blk = max(0, min(k_blk, blk_len))
+                    if k_blk == 0:
+                        continue
+                    slope = 1.0 - m_blk          # ∈ [0,1]; 0 ⇒ uniform, 1 ⇒ late-heavy
+                    pos   = torch.arange(blk_len, device=device, dtype=dtype)
+                    log_w = lam_base * slope * pos
+                    blk_mask = gumbel_topk(log_w, k_blk)
+                    masked_indices[b, start:end] = blk_mask
+        if loss_mask is not None:
+            masked_indices[loss_mask == 0] = 0
+        noisy_batch = torch.where(masked_indices, self.mask_token_id, input_ids)
+        return noisy_batch, masked_indices, p_mask
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor]   = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        labels: Optional[torch.LongTensor]       = None,
+        split_len: Optional[int]                 = None,
+        past_key_values: Optional[Cache]         = None,
+        block_size: Optional[int]                = None,
+        block_diff_ppl: bool                     = False,
+        eps: float                               = 1e-3,
+        is_teacher: bool                        = False,
+        masked_indices: Optional[torch.Tensor]   = None,
+        p_mask: Optional[torch.Tensor]           = None,
+        teacher_logits: Optional[torch.Tensor]   = None,
+        masked_indices_teacher: Optional[torch.Tensor] = None,
+        loss_mask: Optional[torch.Tensor] = None,
+        ce_loss_weight: float = 1.0,
+        output_last_hidden_states_only: bool = False,
+        skip_loss: bool = False,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        batch_size, seq_len = input_ids.shape
+        if self.config.dlm_paradigm == 'bidirectional' or self.config.dlm_paradigm == 'autoregressive':
+            if labels is not None and torch.rand(1) < self.config.random_length_prob:
+                random_length = torch.randint(2, input_ids.shape[1] + 1, (1,))
+                input_ids = input_ids[:, :random_length]
+                labels = labels[:, :random_length]
+                if attention_mask is not None:
+                    attention_mask = attention_mask[:, :random_length]
+                if position_ids is not None:
+                    position_ids = position_ids[:, :random_length]
+                if loss_mask is not None:
+                    loss_mask = loss_mask[:, :random_length]
+        elif self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            if labels is not None and block_size is None:
+                if torch.rand(1) < self.config.random_length_prob:
+                    block_size = torch.randint(1, 8, (1,)).item() * 4  # multiple of 4 in [4, 28]; randint's upper bound is exclusive
+                else:
+                    block_size = self.config.block_size
+        else:
+            raise ValueError(f"Unknown dLM paradigm: {self.config.dlm_paradigm}")
+        if labels is not None and self.config.dlm_paradigm != 'autoregressive':
+            if masked_indices is not None:
+                # assert p_mask is not None
+                if loss_mask is not None:
+                    masked_indices[loss_mask == 0] = 0
+                noisy_inputs = torch.where(masked_indices, self.mask_token_id, input_ids)
+            else:
+                if self.config.tok_mask_half_life_ratio is not None:
+                    noisy_inputs, masked_indices, p_mask = self.forward_process_exp(input_ids, eps=eps, block_size=block_size, half_life_ratio=self.config.tok_mask_half_life_ratio, loss_mask=loss_mask)
+                else:
+                    noisy_inputs, masked_indices, p_mask = self.forward_process(input_ids, eps=eps, block_size=block_size, loss_mask=loss_mask)
+        else:
+            noisy_inputs = input_ids
+            masked_indices = None
+            p_mask = None
+        if self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, 'set_attention_mode'):
+                    layer.self_attn.set_attention_mode(self.config.dlm_paradigm, block_size=block_size)
+        input_ids_len = noisy_inputs.shape[1]
+        if labels is not None and self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            if position_ids is None:
+                position_ids = torch.arange(input_ids_len, device=noisy_inputs.device).unsqueeze(0)
+            noisy_inputs = torch.cat([noisy_inputs, input_ids], dim=1)
+        if block_diff_ppl:
+            if position_ids is None:
+                position_ids = torch.arange(input_ids_len // 2, device=noisy_inputs.device).unsqueeze(0)
+        enc_out  = self.encoder(
+            past_key_values=past_key_values,
+            input_ids=noisy_inputs,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            is_training=(labels is not None) or (block_diff_ppl),
+            **kwargs,
+        )
+        if output_last_hidden_states_only:
+            return BaseModelOutput(last_hidden_state=enc_out.last_hidden_state)
+        logits = self.diffusion_head(enc_out.last_hidden_state)  # (batch, seq_len, vocab)
+        causal_logits = None
+        if labels is not None and self.config.dlm_paradigm in ['block_diff', 'sbd_block_diff']:
+            if self.config.dlm_paradigm == 'sbd_block_diff':
+                causal_logits = logits[:, input_ids_len:]
+            else:
+                causal_logits = None
+            logits = logits[:, :input_ids_len]
+        loss = None
+        if labels is not None and not skip_loss:
+            if self.config.dlm_paradigm == 'autoregressive':
+                shift_logits = logits[..., :-1, :].contiguous()
+                shift_labels = labels[..., 1:].contiguous()
+                if loss_mask is None:
+                    loss_fct = CrossEntropyLoss()
+                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+                    shift_labels = shift_labels.view(-1)
+                    loss = loss_fct(shift_logits, shift_labels)
+                else:
+                    loss_mask = loss_mask[..., 1:].contiguous()
+                    loss_fct = CrossEntropyLoss(reduction='none')
+                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+                    shift_labels = shift_labels.view(-1)
+                    shift_labels = shift_labels.to(shift_logits.device)
+                    token_losses = loss_fct(shift_logits, shift_labels)
+                    flat_loss_mask = loss_mask.reshape(-1)
+                    loss = token_losses[flat_loss_mask == 1].sum() / flat_loss_mask.sum()
+            else:
+                # Handle DREAM vs LLADA style losses
+                if hasattr(self.config, 'dlm_type') and self.config.dlm_type == 'dream':
+                    logits = logits[..., :-1, :].contiguous()
+                    labels = labels[..., 1:].contiguous()
+                    masked_indices = masked_indices[:, 1:]
+                    p_mask = p_mask[:, 1:]
+                if self.config.ada_perm_ratio_per_block is not None:
+                    # Only compute loss for the top ada_perm_ratio_per_block tokens by confidence within each block
+                    block_size = self.config.block_size
+                    batch_size, seq_len = masked_indices.shape
+                    num_blocks = (seq_len + block_size - 1) // block_size  # ceil, so a trailing partial block is included
+                    # Get the max logit (confidence) for each position
+                    confidence = logits.max(dim=-1).values.detach()  # (batch_size, seq_len)
+                    # Create a mask for tokens to include in loss
+                    selected_mask = torch.zeros_like(masked_indices, dtype=torch.bool)
+                    for blk in range(num_blocks):
+                        start = blk * block_size
+                        end = min((blk + 1) * block_size, seq_len)
+                        # Get masked indices within this block
+                        block_masked = masked_indices[:, start:end]  # (batch_size, block_len)
+                        block_confidence = confidence[:, start:end]  # (batch_size, block_len)
+                        for b in range(batch_size):
+                            # Get positions that are masked in this block for this batch
+                            masked_positions = torch.where(block_masked[b])[0]
+                            num_masked = len(masked_positions)
+                            if num_masked > 0:
+                                # Number of tokens to keep (top by confidence)
+                                k = min(max(1, int(block_size * self.config.ada_perm_ratio_per_block)), num_masked)
+                                # Get confidence values for masked positions
+                                masked_confidence = block_confidence[b, masked_positions]
+                                # Get indices of top-k confident tokens
+                                _, topk_indices = torch.topk(masked_confidence, k)
+                                selected_positions = masked_positions[topk_indices]
+                                # Mark these positions in the selected mask
+                                selected_mask[b, start + selected_positions] = True
+                    # Calculate loss only for selected positions
+                    token_loss = torch.nn.functional.cross_entropy(
+                        logits[selected_mask],
+                        labels[selected_mask],
+                        reduction='none'
+                    ) / p_mask[selected_mask]
+                    num_mask_tokens = selected_mask.sum()
+                else:
+                    # Calculate token-wise cross entropy loss for masked positions in B
+                    token_loss = torch.nn.functional.cross_entropy(
+                        logits[masked_indices],
+                        labels[masked_indices],
+                        reduction='none'
+                    ) / p_mask[masked_indices]
+                    num_mask_tokens = masked_indices.sum()
+                if self.config.global_loss_avg:
+                    loss = token_loss.sum()
+                else:
+                    # Guard against division by zero when no position was selected for the loss.
+                    loss = token_loss.sum() / num_mask_tokens.clamp(min=1)
+                if self.config.ada_dlm_loss_ratio is not None:
+                    assert self.current_iter_ratio is not None
+                    assert self.config.dlm_loss_weight is not None
+                    dlm_loss_weight = min(self.config.dlm_loss_weight, self.current_iter_ratio / self.config.ada_dlm_loss_ratio * self.config.dlm_loss_weight)
+                    loss = dlm_loss_weight * loss
+                elif self.config.dlm_loss_weight is not None:
+                    loss = self.config.dlm_loss_weight * loss
+                if self.config.dlm_paradigm == 'sbd_block_diff':
+                    causal_logits = causal_logits[..., :-1, :].contiguous()
+                    causal_logits = causal_logits.view(-1, causal_logits.size(-1))
+                    if hasattr(self.config, 'dlm_type') and self.config.dlm_type == 'dream':
+                        causal_labels = labels.view(-1)
+                    else:
+                        causal_labels = labels[..., 1:].contiguous().view(-1)
+                    if self.config.global_loss_avg:
+                        loss_fct = CrossEntropyLoss(reduction='sum')
+                        ar_loss = loss_fct(causal_logits, causal_labels)
+                        self.loss_diffusion = loss.detach().item() / num_mask_tokens
+                        self.loss_ar = ar_loss.detach().item() / seq_len
+                        loss = loss + self.config.ar_loss_weight * ar_loss
+                    else:
+                        loss_fct = CrossEntropyLoss()
+                        ar_loss = loss_fct(causal_logits, causal_labels)
+                        self.loss_diffusion = loss.detach().item()
+                        self.loss_ar = ar_loss.detach().item()
+                        loss = loss + self.config.ar_loss_weight * ar_loss
+                if self.config.global_loss_avg:
+                    if self.config.dlm_paradigm == 'sbd_block_diff':
+                        loss = (loss, num_mask_tokens + int(self.config.ar_loss_weight * seq_len))
+                    else:
+                        loss = (loss, num_mask_tokens)
+        return MinistralDiffOutputWithPast(
+            loss=loss if not is_teacher else logits,
+            logits=logits,
+            causal_logits=causal_logits,
+            past_key_values=enc_out.past_key_values,
+            hidden_states=None,
+            attentions=None,
+        )
+    def generate(
+        self,
+        prompt_ids,
+        max_new_tokens,
+        steps,
+        block_length,
+        shift_logits,
+        threshold,
+        causal_context=True,
+        temperature=0,
+        eos_token_id=None,
+        max_thinking_tokens=None,
+        end_think_token_id=None,
+    ):
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, 'eos_token_id', None)
+        out_ids, nfe = generate_with_prefix_cache_block_diff(
+            model=self,
+            prompt=prompt_ids,
+            gen_length=max_new_tokens,
+            steps=steps,
+            block_length=block_length,
+            remasking="low_confidence",
+            temperature=temperature,
+            mask_id=self.mask_token_id,
+            threshold=threshold,
+            shift_logits=shift_logits,
+            neg_entropy=False,
+            causal_context=causal_context,
+            eos_token_id=eos_token_id,
+            max_thinking_tokens=max_thinking_tokens,
+            end_think_token_id=end_think_token_id,
+        )
+        return out_ids, nfe
+    @torch.no_grad()
+    def sbd_inference_diffusion_quadratic(
+        self,
+        clean_input_ids: Optional[torch.Tensor],
+        draft_input_ids: torch.Tensor,
+        block_length: int,
+        draft_only: bool = False,
+        past_key_values: Optional[Cache] = None,
+        use_cache: bool = False,
+    ):
+        """Run one SBD forward pass in draft-only or quadratic mode.
+        draft_only=True: encode [clean_input_ids, draft_input_ids] in "default" mode and,
+        when caching, crop the cache back to the clean prefix.
+        draft_only=False: expand each of the block_length draft tokens into a row of
+        [token, mask * block_length] (block_length * (block_length + 1) tokens in total)
+        and encode them in "quadratic" mode.
+        Note: this mutates self.encoder.config (use_sbd_objective, block_length,
+        self_spec_inference_mode).
+        Returns:
+            (logits, past_key_values)
+        """
+        enc_config = self.encoder.config
+        enc_config.use_sbd_objective = True
+        enc_config.block_length = block_length
+        if draft_only:
+            assert clean_input_ids is not None
+            if use_cache and past_key_values is None:
+                past_key_values = DynamicCache()
+            enc_config.self_spec_inference_mode = "default"
+            input_ids = torch.cat([clean_input_ids, draft_input_ids], dim=-1)
+            outputs = self.encoder(
+                input_ids=input_ids,
+                position_ids=None,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                is_training=False,
+            )
+            hidden_states = outputs.last_hidden_state
+            logits = self.diffusion_head(hidden_states)
+            past_key_values = getattr(outputs, "past_key_values", None)
+            if use_cache and past_key_values is not None:
+                _crop_dynamic_cache(past_key_values, clean_input_ids.shape[1])
+            return logits, past_key_values
+        else:
+            enc_config.self_spec_inference_mode = "quadratic"
+            draft_len = block_length * (block_length + 1)
+            draft_input_ids = torch.cat(
+                [
+                    draft_input_ids.view(-1, block_length, 1),
+                    torch.full(
+                        (draft_input_ids.shape[0], block_length, block_length),
+                        fill_value=self.config.mask_token_id,
+                        device=draft_input_ids.device,
+                    ),
+                ],
+                dim=-1,
+            ).view(-1, draft_len)
+            if use_cache:
+                assert past_key_values is not None, (
+                    "Past key values should be provided when using cache, e.g. run draft_only=True first."
+                )
+                assert clean_input_ids is None, (
+                    "Clean input ids should already be in cache, thus none should be provided."
+                )
+                clean_len = past_key_values.get_seq_length()
+                input_ids = draft_input_ids
+            else:
+                assert clean_input_ids is not None, (
+                    "Clean input ids are required when not using cache."
+                )
+                clean_len = clean_input_ids.shape[1]
+                input_ids = torch.cat([clean_input_ids, draft_input_ids], dim=-1)
+            per_block_position_ids = torch.arange(
+                clean_len, clean_len + block_length + 1, device=draft_input_ids.device
+            )[None,].repeat(block_length, 1)
+            per_block_position_ids += torch.arange(block_length, device=draft_input_ids.device).view(-1, 1)
+            if use_cache:
+                position_ids = per_block_position_ids.view(-1)[None,]
+            else:
+                clean_position_ids = torch.arange(clean_len, device=draft_input_ids.device)
+                position_ids = torch.cat([clean_position_ids, per_block_position_ids.view(-1)], dim=-1)[None,]
+            outputs = self.encoder(
+                input_ids=input_ids,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                is_training=False,
+            )
+            hidden_states = outputs.last_hidden_state
+            logits = self.diffusion_head(hidden_states)
+            past_key_values = getattr(outputs, "past_key_values", None)
+            if use_cache and past_key_values is not None:
+                _extract_draft_kv_cache(past_key_values, clean_len, block_length)
+            return logits, past_key_values
+    @torch.no_grad()
+    def ar_generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        temperature: float = 0.0,
+        eos_token_id: Optional[int] = None,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+    ) -> tuple:
+        """Autoregressive generation calling the encoder directly (injected by build_hf_tidar_repo).
+        Bypasses MinistralDiffEncoderModel.forward() to avoid diffusion-specific
+        code paths. Calls self.encoder (Ministral3Model) with explicit cache_position,
+        position_ids, and use_cache so the KV cache and causal masking behave
+        identically to MistralForCausalLM / vLLM.
+        Returns:
+            (output_ids, nfe) where output_ids includes the prompt.
+        """
+        # Switch every attention layer to causal mode. Note: the flag is not restored afterwards.
+        for layer in self.encoder.layers:
+            if hasattr(layer.self_attn, 'diffusion_lm'):
+                layer.self_attn.diffusion_lm = False
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, 'eos_token_id', None)
+        device = prompt_ids.device
+        batch_size, prompt_len = prompt_ids.shape
+        past_key_values = DynamicCache()
+        cache_position = torch.arange(prompt_len, device=device)
+        position_ids = cache_position.unsqueeze(0).expand(batch_size, -1)
+        enc_out = self.encoder(
+            input_ids=prompt_ids,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            use_cache=True,
+            cache_position=cache_position,
+        )
+        past_key_values = enc_out.past_key_values
+        next_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        generated_tokens = []
+        nfe = 0
+        for step in range(max_new_tokens):
+            nfe += 1
+            if temperature > 0:
+                probs = torch.softmax(next_logit / temperature, dim=-1)
+                next_token = torch.multinomial(probs, num_samples=1)
+            else:
+                next_token = torch.argmax(next_logit, dim=-1, keepdim=True)
+            # ---- thinking budget enforcement ----
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                if step >= max_thinking_tokens:
+                    if generated_tokens:
+                        gen_tensor = torch.cat(generated_tokens, dim=1)
+                        has_end_think = (gen_tensor == end_think_token_id).any(dim=1)
+                    else:
+                        has_end_think = torch.zeros(batch_size, dtype=torch.bool, device=device)
+                    # Force end_think on every sequence that has not emitted it yet.
+                    next_token[~has_end_think] = end_think_token_id
+            generated_tokens.append(next_token)
+            if eos_token_id is not None and (next_token == eos_token_id).all():
+                break
+            if step < max_new_tokens - 1:
+                cur_pos = prompt_len + step
+                step_cache_pos = torch.tensor([cur_pos], device=device)
+                step_pos_ids = step_cache_pos.unsqueeze(0).expand(batch_size, -1)
+                enc_out = self.encoder(
+                    input_ids=next_token,
+                    position_ids=step_pos_ids,
+                    past_key_values=past_key_values,
+                    use_cache=True,
+                    cache_position=step_cache_pos,
+                )
+                past_key_values = enc_out.past_key_values
+                next_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        all_generated = torch.cat(generated_tokens, dim=1)
+        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
+        return output_ids, nfe
+    @torch.no_grad()
+    def self_spec_generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        steps: int = 128,
+        block_length: int = 16,
+        ar_mix_weight: Optional[float] = None,
+        temperature: float = 0.0,
+        mask_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+    ):
+        self.config.use_sbd_objective = True
+        self.config.dlm_paradigm = "sbd"
+        if prompt_ids.shape[0] != 1:
+            raise ValueError("Self speculation quadratic decoding currently requires batch_size == 1")
+        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        if max_new_tokens % block_length != 0:
+            raise ValueError("max_new_tokens must be divisible by block_length")
+        num_blocks = max_new_tokens // block_length
+        # `steps` is only validated here; the decoding loop below does not use it.
+        if steps % num_blocks != 0:
+            raise ValueError("steps must be divisible by (max_new_tokens // block_length)")
+        prompt_len = prompt_ids.shape[1]
+        # Allocate prompt + generation, plus two blocks of slack for drafts written past the end.
+        x = torch.full(
+            (1, prompt_len + max_new_tokens + block_length * 2),
+            token_mask_id,
+            dtype=torch.long,
+            device=prompt_ids.device,
+        )
+        x[:, :prompt_len] = prompt_ids.clone()
+        nfe = 1  # counts the initial draft-only pass below
+        logits, past_key_values = self.sbd_inference_diffusion_quadratic(
+            clean_input_ids=x[:, :prompt_len],
+            draft_input_ids=x[:, prompt_len : prompt_len + block_length],
+            block_length=block_length,
+            draft_only=True,
+            use_cache=True,
+        )
+        logits_proposal = logits[:, prompt_len - 1 : prompt_len + block_length]
+        logits_proposal[:, 1] = logits_proposal[:, 0]
+        logits_proposal = logits_proposal[:, 1:]
+        x0_proposal = torch.argmax(logits_proposal, dim=-1)
+        x[:, prompt_len : prompt_len + block_length] = x0_proposal
+        total_accept_token = 0
+        while True:
+            nfe += 1
+            block_start = prompt_len + total_accept_token
+            block_end = block_start + block_length
+            draft_input_ids = x[:, block_start:block_end]
+            logits, past_key_values = self.sbd_inference_diffusion_quadratic(
+                clean_input_ids=None,
+                draft_input_ids=draft_input_ids,
+                block_length=block_length,
+                draft_only=False,
+                past_key_values=past_key_values,
+                use_cache=True,
+            )
+            useful_token_logits = logits.view(1, block_length, block_length + 1, -1)
+            if ar_mix_weight is None:
+                useful_token_logits[:, :, 1] = useful_token_logits[:, :, 0]
+            else:
+                if not (0.0 <= ar_mix_weight <= 1.0):
+                    raise ValueError("ar_mix_weight must be between 0 and 1")
+                mix_logits = useful_token_logits[:, :, 0] * ar_mix_weight + useful_token_logits[:, :, 1] * (1 - ar_mix_weight)
+                useful_token_logits[:, :, 0] = mix_logits
+                useful_token_logits[:, :, 1] = mix_logits
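+            if temperature > 0:
+                # Note: dividing by a positive temperature does not change the argmax below,
+                # so token selection here is greedy for any temperature.
+                useful_token_logits = useful_token_logits / temperature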
+            useful_token_pred = torch.argmax(useful_token_logits, dim=-1)
+            new_draft_input_ids = useful_token_pred[:, 0, 1:]
+            accept_cnt = 1
+            while accept_cnt < block_length:
+                if useful_token_pred[:, accept_cnt - 1, 0].item() != draft_input_ids[:, accept_cnt].item():
+                    break
+                new_draft_input_ids = useful_token_pred[:, accept_cnt, 1:]
+                accept_cnt += 1
+            x[:, block_start : block_start + accept_cnt] = draft_input_ids[:, :accept_cnt]
+            # EoS early stopping: all accepted tokens are finalized left-to-right,
+            # so if any is EoS we can truncate and return immediately.
+            if eos_token_id is not None:
+                accepted = x[0, block_start : block_start + accept_cnt]
+                eos_positions = (accepted == eos_token_id).nonzero(as_tuple=True)[0]
+                if len(eos_positions) > 0:
+                    first_eos_rel = eos_positions[0].item()
+                    total_accept_token += first_eos_rel + 1
+                    output_end = prompt_len + total_accept_token
+                    return x[:, :output_end], nfe
+            x[:, block_start + accept_cnt : block_start + accept_cnt + block_length] = new_draft_input_ids
+            past_key_values.crop(block_start + accept_cnt)
+            # ---- thinking budget enforcement ----
+            # Insert end_think as the first token of the next draft block,
+            # shifting all subsequent tokens right by 1 (discarding the last).
+            # The first draft token is always accepted unconditionally, so
+            # end_think is guaranteed to be finalized in the next iteration
+            # without needing to re-encode or touch the KV cache.
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                tokens_so_far = total_accept_token + accept_cnt
+                if tokens_so_far > max_thinking_tokens:
+                    gen_so_far = x[0, prompt_len : prompt_len + tokens_so_far]
+                    has_end_think = (gen_so_far == end_think_token_id).any()
+                    if not has_end_think:
+                        insert_pos = block_start + accept_cnt
+                        x[0, insert_pos + 1:] = x[0, insert_pos:-1].clone()
+                        x[0, insert_pos] = end_think_token_id
+            total_accept_token += accept_cnt
+            if total_accept_token >= max_new_tokens:
+                break
+        return x[:, : -(block_length * 2)], nfe
+    @torch.no_grad()
+    def linear_spec_generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        block_length: int = 32,
+        temperature: float = 0.0,
+        mask_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+        threshold: float = 0.0,
+    ):
+        """Linear speculative decoding: diffusion draft + AR verification.
+        Each step:
+          1. Draft: forward [last_accepted, mask, ...] with bidirectional attention
+             (diffusion_lm=True, use_cache=False). For DREAM-style models, shift the
+             logits right by one so each position predicts itself; then apply
+             confidence filtering when threshold > 0.
+          2. Verify: forward the drafted block with causal attention
+             (diffusion_lm=False, use_cache=True, use_causal_mask=True).
+             Accept consecutive AR-matching tokens plus one bonus token.
+        Args:
+            prompt_ids: Input token IDs of shape (1, prompt_len).
+            max_new_tokens: Maximum number of tokens to generate.
+            block_length: Number of tokens per draft/verify block.
+            temperature: Sampling temperature (0 = greedy).
+            mask_token_id: Override for config.mask_token_id.
+            eos_token_id: Override for config.eos_token_id.
+            max_thinking_tokens: Budget for thinking tokens before forcing end_think.
+            end_think_token_id: Token ID inserted when thinking budget is exceeded.
+            threshold: Confidence threshold for unmasking draft predictions. With 0 (default),
+                the whole block is filled in a single draft pass.
+        Returns:
+            (output_ids, nfe): output_ids includes the prompt; nfe is the number
+            of forward evaluations (matching self_spec_generate interface).
+        """
+        if prompt_ids.shape[0] != 1:
+            raise ValueError("Linear speculative decoding requires batch_size == 1")
+        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        device = prompt_ids.device
+        prompt_len = prompt_ids.shape[1]
+        dream_style = getattr(self.config, 'dlm_type', 'llada') == 'dream'
+        def _set_diffusion_lm(val: bool):
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm = val
+        # ===== Prefill (causal) =====
+        _set_diffusion_lm(False)
+        enc_out = self.encoder(
+            input_ids=prompt_ids,
+            past_key_values=DynamicCache(),
+            use_cache=True,
+            use_causal_mask=True,
+        )
+        past_key_values = enc_out.past_key_values
+        last_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        nfe = 1
+        if temperature > 0:
+            probs = torch.softmax(last_logit / temperature, dim=-1)
+            next_token = torch.multinomial(probs, num_samples=1)
+        else:
+            next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+        if eos_token_id is not None and next_token.item() == eos_token_id:
+            output_ids = torch.cat([prompt_ids, next_token], dim=1)
+            return output_ids, nfe
+        generated = [next_token]
+        total_gen = 1
+        # ===== Main loop =====
+        while total_gen < max_new_tokens:
+            cache_len = past_key_values.get_seq_length()
+            block = torch.full(
+                (1, block_length), token_mask_id, dtype=torch.long, device=device
+            )
+            block[0, 0] = next_token.item()
+            # -------- Draft (bidirectional, don't update cache) --------
+            _set_diffusion_lm(True)
+            while True:
+                is_mask = block == token_mask_id
+                if not is_mask.any():
+                    break
+                enc_out = self.encoder(
+                    input_ids=block,
+                    past_key_values=past_key_values,
+                    use_cache=False,
+                )
+                nfe += 1
+                draft_logits = self.diffusion_head(enc_out.last_hidden_state)
+                if dream_style:
+                    # DREAM: logit[i] predicts position i+1 → shift to self-prediction
+                    draft_logits = torch.cat(
+                        [draft_logits[:, :1, :], draft_logits[:, :-1, :]], dim=1
+                    )
+                # LLaDA: logit[i] already predicts position i → no shift needed
+                if temperature > 0:
+                    draft_probs = torch.softmax(draft_logits / temperature, dim=-1)
+                    draft_tokens = torch.multinomial(
+                        draft_probs.view(-1, draft_probs.shape[-1]), num_samples=1
+                    ).view(1, block_length)
+                else:
+                    draft_tokens = draft_logits.argmax(dim=-1)
+                    draft_probs = torch.softmax(draft_logits, dim=-1)
+                if threshold > 0:
+                    draft_conf = torch.gather(
+                        draft_probs, -1, draft_tokens.unsqueeze(-1)
+                    ).squeeze(-1)
+                    draft_conf = torch.where(is_mask, draft_conf, -torch.inf)
+                    unmask = draft_conf >= threshold
+                    # Ensure each iteration makes progress even when every masked
+                    # position falls below the confidence threshold.
+                    if not unmask.any():
+                        best_idx = draft_conf.view(-1).argmax()
+                        unmask = torch.zeros_like(is_mask, dtype=torch.bool)
+                        unmask.view(-1)[best_idx] = True
+                    block[unmask] = draft_tokens[unmask]
+                else:
+                    block[is_mask] = draft_tokens[is_mask]
+                    break
+            # -------- Verify (causal, update cache) --------
+            _set_diffusion_lm(False)
+            enc_out = self.encoder(
+                input_ids=block,
+                past_key_values=past_key_values,
+                use_cache=True,
+                use_causal_mask=True,
+            )
+            past_key_values = enc_out.past_key_values
+            nfe += 1
+            verify_logits = self.diffusion_head(enc_out.last_hidden_state)
+            if temperature > 0:
+                verify_probs = torch.softmax(verify_logits / temperature, dim=-1)
+                ar_tokens = torch.multinomial(
+                    verify_probs.view(-1, verify_probs.shape[-1]), num_samples=1
+                ).view(1, block_length)
+            else:
+                ar_tokens = verify_logits.argmax(dim=-1)
+            accepted = 0
+            for i in range(block_length - 1):
+                if ar_tokens[0, i].item() == block[0, i + 1].item():
+                    accepted += 1
+                else:
+                    break
+            accepted += 1  # bonus token from AR verification
+            accepted_toks = ar_tokens[:, :accepted]
+            generated.append(accepted_toks)
+            total_gen += accepted
+            _crop_dynamic_cache(past_key_values, cache_len + accepted)
+            next_token = ar_tokens[:, accepted - 1 : accepted]
+            # -------- EOS check --------
+            if eos_token_id is not None:
+                eos_pos = (accepted_toks[0] == eos_token_id).nonzero(as_tuple=True)[0]
+                if len(eos_pos) > 0:
+                    first_eos = eos_pos[0].item()
+                    generated[-1] = accepted_toks[:, : first_eos + 1]
+                    total_gen = total_gen - accepted + first_eos + 1
+                    break
+            # -------- Thinking budget enforcement --------
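+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                if total_gen > max_thinking_tokens:
+                    all_gen = torch.cat(generated, dim=1)
+                    if not (all_gen == end_think_token_id).any():
+                        next_token = torch.tensor(
+                            [[end_think_token_id]], device=device
+                        )
+                        # The bonus token is already the last entry of `generated` and is not
+                        # yet in the KV cache. Overwrite it so the output matches what the
+                        # model is fed next and this check stops firing on later iterations.
+                        generated[-1] = torch.cat(
+                            [generated[-1][:, :-1], next_token], dim=1
+                        )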
+            if total_gen >= max_new_tokens:
+                break
+        # The last block may overshoot the budget; truncate to honor max_new_tokens.
+        all_generated = torch.cat(generated, dim=1)[:, :max_new_tokens]
+        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
+        return output_ids, nfe
+    @torch.no_grad()
+    def linear_spec_generate_mp(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 512,
+        block_length: int = 32,
+        temperature: float = 0.0,
+        mask_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        max_paths: int = 16,
+        uncertain_threshold: float = 0.7,
+        top_k_candidates: int = 2,
+        threshold: float = 0.0,
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+    ):
+        """Linear speculative decoding with multi-path tree verification.
+        Self-contained method — no external file dependencies beyond the model itself.
+        Each iteration costs 2 NFE (1 draft + 1 verify):
+          1. Draft: single-step bidirectional diffusion fills a block of masks.
+          2. Verify: tree-structured AR verification with multiple candidate paths.
+        Multi-path verification identifies low-confidence draft positions and
+        explores top-k alternative tokens. All candidate paths share a trie
+        prefix and are verified in one forward pass via a 4D tree-ancestry
+        attention mask (~40 tokens), picking the path with the longest
+        accepted prefix.
+        Benchmark results (NeMo Skills prompt, enable_thinking=False):
+          GSM8K bl=32: +17.1% UW-TPF vs vanilla (acc 93.9%)
+          MBPP  bl=64: +17.8% UW-TPF vs vanilla (pass@1 78.2%)
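+        Args:
+            prompt_ids: (1, prompt_len) input token IDs.
+            max_new_tokens: Maximum tokens to generate.
+            block_length: Draft block size. Use 32 for math, 64 for code.
+            temperature: Sampling temperature (0.0 = greedy).
+            mask_token_id: Override for config.mask_token_id.
+            eos_token_id: Stop token ID. Defaults to config.eos_token_id.
+            max_paths: Tree verification budget. 16 = up to 4 uncertain
+                positions x 2 candidates each.
+            uncertain_threshold: Confidence below which a position is
+                considered uncertain and expanded with alternatives.
+            top_k_candidates: Number of alternative tokens to try at each
+                uncertain position.
+            threshold: Confidence threshold for accepting draft predictions.
+            max_thinking_tokens: Budget for thinking tokens before forcing end_think.
+            end_think_token_id: Token ID inserted when thinking budget is exceeded.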
+        Returns:
+            output_ids: (1, prompt_len + generated_len) full sequence.
+            nfe: Total number of forward evaluations.
+        """
+        from itertools import product as _product
+        if prompt_ids.shape[0] != 1:
+            raise ValueError("Requires batch_size == 1")
+        device = prompt_ids.device
+        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        def _set_dlm(val: bool):
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm = val
+        def _crop_cache(kv, length):
+            # transformers 4.55 exposes .key_cache/.value_cache lists; 5.0 moved them under .layers[i].keys/.values.
+            for li in range(len(kv)):
+                if hasattr(kv, 'layers'):
+                    layer = kv.layers[li]
+                    layer.keys = layer.keys[:, :, :length]
+                    layer.values = layer.values[:, :, :length]
+                else:
+                    kv.key_cache[li] = kv.key_cache[li][:, :, :length]
+                    kv.value_cache[li] = kv.value_cache[li][:, :, :length]
+            kv._seen_tokens = length
+        # ----- tree verify helpers (inlined) -----
+        def _mp_verify(block, draft_probs, draft_conf, past_kv, cache_len):
+            """Multi-path verify via batch-stacking (flash-attention compatible).
+            Unlike tree attention (4D mask), batch-stacking expands the KV cache
+            batch dimension and runs all candidate paths as separate batch entries.
+            This keeps flash attention + GQA enabled, avoiding OOM from the 4D
+            mask path which disables both.
+            Returns (accepted_toks, n_accepted, past_kv, next_tok) or None.
+            """
+            bl = block.shape[1]
+            # Identify uncertain positions
+            is_filled = block[0] != token_mask_id
+            pos_conf = torch.zeros(bl, device=device)
+            pos_conf[0] = float('inf')
+            for p in range(1, bl):
+                if is_filled[p]:
+                    c = draft_conf[0, p].item()
+                    pos_conf[p] = c if c != float('-inf') else float('inf')
+                else:
+                    pos_conf[p] = float('-inf')
+            unc_mask = (pos_conf < uncertain_threshold) & (pos_conf > float('-inf'))
+            unc_pos = unc_mask.nonzero(as_tuple=True)[0].tolist()
+            if not unc_pos:
+                return None
+            import math as _math
+            max_unc = min(len(unc_pos), max(1, int(_math.log2(max_paths))))
+            unc_pos = sorted(unc_pos)[:max_unc]
+            # Build candidate blocks
+            topk_at = {}
+            for p in unc_pos:
+                _, ids = draft_probs[0, p].topk(top_k_candidates)
+                topk_at[p] = ids.tolist()
+            combos = list(_product(*(topk_at[p] for p in sorted(topk_at))))[:max_paths]
+            num_paths = len(combos)
+            if num_paths <= 1:
+                return None
+            candidate_blocks = block.expand(num_paths, -1).clone()
+            pos_list = sorted(topk_at.keys())
+            for pi, combo in enumerate(combos):
+                for ci, p in enumerate(pos_list):
+                    candidate_blocks[pi, p] = combo[ci]
+            # Expand KV cache batch dimension (shared, no copy)
+            for li in range(len(past_kv)):
+                if hasattr(past_kv, 'layers'):
+                    layer = past_kv.layers[li]
+                    layer.keys = layer.keys.expand(num_paths, -1, -1, -1)
+                    layer.values = layer.values.expand(num_paths, -1, -1, -1)
+                else:
+                    past_kv.key_cache[li] = past_kv.key_cache[li].expand(num_paths, -1, -1, -1)
+                    past_kv.value_cache[li] = past_kv.value_cache[li].expand(num_paths, -1, -1, -1)
+            # Batched causal verify — uses flash attention + GQA
+            _set_dlm(False)
+            enc_out = self.encoder(
+                input_ids=candidate_blocks,
+                past_key_values=past_kv,
+                use_cache=True,
+                use_causal_mask=True,
+            )
+            past_kv = enc_out.past_key_values
+            vlogits = self.diffusion_head(enc_out.last_hidden_state)
+            if temperature > 0:
+                vp = torch.softmax(vlogits / temperature, dim=-1)
+                ar_tokens = torch.multinomial(vp.view(-1, vp.shape[-1]), 1).view(num_paths, bl)
+            else:
+                ar_tokens = vlogits.argmax(dim=-1)
+            # Find best path (longest accepted prefix)
+            best_acc, best_pidx = 0, 0
+            for pi in range(num_paths):
+                acc = 0
+                for i in range(bl - 1):
+                    if ar_tokens[pi, i].item() == candidate_blocks[pi, i + 1].item():
+                        acc += 1
+                    else:
+                        break
+                acc += 1  # bonus token
+                if acc > best_acc:
+                    best_acc, best_pidx = acc, pi
+            accepted_toks = ar_tokens[best_pidx:best_pidx+1, :best_acc]
+            # Extract winning path's KV cache slice
+            for li in range(len(past_kv)):
+                if hasattr(past_kv, 'layers'):
+                    layer = past_kv.layers[li]
+                    layer.keys = layer.keys[best_pidx:best_pidx+1].contiguous()
+                    layer.values = layer.values[best_pidx:best_pidx+1].contiguous()
+                else:
+                    past_kv.key_cache[li] = past_kv.key_cache[li][best_pidx:best_pidx+1].contiguous()
+                    past_kv.value_cache[li] = past_kv.value_cache[li][best_pidx:best_pidx+1].contiguous()
+            _crop_cache(past_kv, cache_len + best_acc)
+            return accepted_toks, best_acc, past_kv, accepted_toks[:, -1:]
+        # ── Prefill (causal) ──
+        _set_dlm(False)
+        enc_out = self.encoder(
+            input_ids=prompt_ids, past_key_values=DynamicCache(),
+            use_cache=True, use_causal_mask=True,
+        )
+        past_key_values = enc_out.past_key_values
+        last_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        nfe = 1
+        if temperature > 0:
+            next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), 1)
+        else:
+            next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+        if eos_token_id is not None and next_token.item() == eos_token_id:
+            return torch.cat([prompt_ids, next_token], dim=1), nfe
+        generated = [next_token]
+        total_gen = 1
+        # ── Main draft-verify loop ──
+        while total_gen < max_new_tokens:
+            cache_len = past_key_values.get_seq_length()
+            block = torch.full((1, block_length), token_mask_id, dtype=torch.long, device=device)
+            block[0, 0] = next_token.item()
+            # Draft: single-step bidirectional diffusion (1 NFE)
+            _set_dlm(True)
+            enc_out = self.encoder(input_ids=block, past_key_values=past_key_values, use_cache=False)
+            nfe += 1
+            draft_logits = self.diffusion_head(enc_out.last_hidden_state)
+            if temperature > 0:
+                draft_probs = torch.softmax(draft_logits / temperature, dim=-1)
+                draft_tokens = torch.multinomial(
+                    draft_probs.view(-1, draft_probs.shape[-1]), 1
+                ).view(1, block_length)
+            else:
+                draft_tokens = draft_logits.argmax(dim=-1)
+                draft_probs = torch.softmax(draft_logits, dim=-1)
+            draft_conf = torch.gather(draft_probs, -1, draft_tokens.unsqueeze(-1)).squeeze(-1)
+            is_mask = block == token_mask_id
+            draft_conf = torch.where(is_mask, draft_conf, -torch.inf)
+            block[is_mask] = draft_tokens[is_mask]
+            # Verify: multi-path batch-stacking (1 NFE, flash-attention compatible)
+            result = _mp_verify(block, draft_probs, draft_conf, past_key_values, cache_len)
+            if result is not None:
+                accepted_toks, accepted, past_key_values, next_token = result
+                nfe += 1
+            else:
+                # No uncertain positions — single-path causal verify
+                _set_dlm(False)
+                enc_out = self.encoder(
+                    input_ids=block, past_key_values=past_key_values,
+                    use_cache=True, use_causal_mask=True,
+                )
+                past_key_values = enc_out.past_key_values
+                nfe += 1
+                vlogits = self.diffusion_head(enc_out.last_hidden_state)
+                if temperature > 0:
+                    vp = torch.softmax(vlogits / temperature, dim=-1)
+                    ar_tokens = torch.multinomial(vp.view(-1, vp.shape[-1]), 1).view(1, block_length)
+                else:
+                    ar_tokens = vlogits.argmax(dim=-1)
+                accepted = 0
+                for i in range(block_length - 1):
+                    if ar_tokens[0, i].item() == block[0, i + 1].item():
+                        accepted += 1
+                    else:
+                        break
+                accepted += 1  # bonus token
+                accepted_toks = ar_tokens[:, :accepted]
+                _crop_cache(past_key_values, cache_len + accepted)
+                next_token = ar_tokens[:, accepted - 1 : accepted]
+            generated.append(accepted_toks)
+            total_gen += accepted
+            if eos_token_id is not None:
+                eos_pos = (accepted_toks[0] == eos_token_id).nonzero(as_tuple=True)[0]
+                if len(eos_pos) > 0:
+                    first_eos = eos_pos[0].item()
+                    generated[-1] = accepted_toks[:, :first_eos + 1]
+                    total_gen = total_gen - accepted + first_eos + 1
+                    break
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                if total_gen > max_thinking_tokens:
+                    all_gen = torch.cat(generated, dim=1)
+                    if not (all_gen == end_think_token_id).any():
+                        next_token = torch.tensor(
+                            [[end_think_token_id]], device=device
+                        )
+            if total_gen >= max_new_tokens:
+                break
+        all_generated = torch.cat(generated, dim=1)
+        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
+        return output_ids, nfe
+    @torch.no_grad()
+    def linear_spec_generate_lora(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 128,
+        block_length: int = 32,
+        temperature: float = 0.0,
+        mask_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        threshold: float = 0.0,
+        rebuild_kv: str = 'none',
+        max_thinking_tokens: Optional[int] = None,
+        end_think_token_id: Optional[int] = None,
+    ):
+        """Linear speculative decoding: diffusion draft + AR verify.
+        LoRA adapter toggling: ON for draft (bidirectional), OFF for verify (causal).
+        Returns (output_ids, nfe).
+        """
+        if prompt_ids.shape[0] != 1:
+            raise ValueError("linear_spec_generate_lora requires batch_size == 1")
+        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
+        if eos_token_id is None:
+            eos_token_id = getattr(self.config, "eos_token_id", None)
+        device = prompt_ids.device
+        dream_style = getattr(self.config, 'dlm_type', 'llada') == 'dream'
+        def _set_diffusion_lm(val: bool):
+            for layer in self.encoder.layers:
+                if hasattr(layer.self_attn, 'diffusion_lm'):
+                    layer.self_attn.diffusion_lm = val
+        def _toggle_adapters(model, enable: bool):
+            for module in model.modules():
+                if hasattr(module, '_disable_adapters'):
+                    module._disable_adapters = not enable
+        # Prefill (causal, LoRA OFF)
+        _set_diffusion_lm(False)
+        _toggle_adapters(self, False)
+        enc_out = self.encoder(
+            input_ids=prompt_ids,
+            past_key_values=DynamicCache(),
+            use_cache=True,
+            use_causal_mask=True,
+        )
+        past_key_values = enc_out.past_key_values
+        last_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
+        nfe = 1
+        if temperature > 0:
+            next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
+        else:
+            next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
+        if eos_token_id is not None and next_token.item() == eos_token_id:
+            return torch.cat([prompt_ids, next_token], dim=1), nfe
+        generated = [next_token]
+        total_gen = 1
+        while total_gen < max_new_tokens:
+            cache_len = past_key_values.get_seq_length()
+            block = torch.full((1, block_length), token_mask_id, dtype=torch.long, device=device)
+            block[0, 0] = next_token.item()
+            # Draft (bidirectional, LoRA ON)
+            _set_diffusion_lm(True)
+            _toggle_adapters(self, True)
+            enc_out = self.encoder(input_ids=block, past_key_values=past_key_values, use_cache=False)
+            nfe += 1
+            draft_logits = self.diffusion_head(enc_out.last_hidden_state)
+            if dream_style:
+                draft_logits = torch.cat([draft_logits[:, :1, :], draft_logits[:, :-1, :]], dim=1)
+            if temperature > 0:
+                draft_probs = torch.softmax(draft_logits / temperature, dim=-1)
+                draft_tokens = torch.multinomial(draft_probs.view(-1, draft_probs.shape[-1]), num_samples=1).view(1, block_length)
+            else:
+                draft_tokens = draft_logits.argmax(dim=-1)
+                draft_probs = torch.softmax(draft_logits, dim=-1)
+            draft_conf = torch.gather(draft_probs, -1, draft_tokens.unsqueeze(-1)).squeeze(-1)
+            is_mask = block == token_mask_id
+            draft_conf = torch.where(is_mask, draft_conf, -torch.inf)
+            unmask = draft_conf > threshold
+            if unmask.sum() > 0:
+                block[unmask] = draft_tokens[unmask]
+            # Verify (causal, LoRA OFF)
+            _set_diffusion_lm(False)
+            _toggle_adapters(self, False)
+            enc_out = self.encoder(input_ids=block, past_key_values=past_key_values, use_cache=True, use_causal_mask=True)
+            past_key_values = enc_out.past_key_values
+            nfe += 1
+            verify_logits = self.diffusion_head(enc_out.last_hidden_state)
+            if temperature > 0:
+                ar_tokens = torch.multinomial(torch.softmax(verify_logits / temperature, dim=-1).view(-1, verify_logits.shape[-1]), num_samples=1).view(1, block_length)
+            else:
+                ar_tokens = verify_logits.argmax(dim=-1)
+            accepted = 0
+            for i in range(block_length - 1):
+                if ar_tokens[0, i].item() == block[0, i + 1].item():
+                    accepted += 1
+                else:
+                    break
+            accepted += 1  # bonus token
+            accepted_toks = ar_tokens[:, :accepted]
+            generated.append(accepted_toks)
+            total_gen += accepted
+            _crop_dynamic_cache(past_key_values, cache_len + accepted)
+            next_token = ar_tokens[:, accepted - 1 : accepted]
+            # EOS check
+            if eos_token_id is not None:
+                eos_pos = (accepted_toks[0] == eos_token_id).nonzero(as_tuple=True)[0]
+                if len(eos_pos) > 0:
+                    first_eos = eos_pos[0].item()
+                    generated[-1] = accepted_toks[:, : first_eos + 1]
+                    total_gen = total_gen - accepted + first_eos + 1
+                    break
+            # Thinking budget enforcement
+            if end_think_token_id is not None and max_thinking_tokens is not None:
+                if total_gen > max_thinking_tokens:
+                    all_gen = torch.cat(generated, dim=1)
+                    if not (all_gen == end_think_token_id).any():
+                        next_token = torch.tensor([[end_think_token_id]], device=device)
+            if total_gen >= max_new_tokens:
+                break
+        all_generated = torch.cat(generated, dim=1)
+        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
+        return output_ids, nfe
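
The draft-verify loops above share one acceptance rule: keep verifier tokens for as long as they agree with the draft, then keep one more. The following is a minimal sketch of that rule on plain Python lists rather than tensors. The helper names `accept_longest_prefix` and `best_path` are hypothetical and do not appear in the diff; the logic mirrors the loops over `ar_tokens` and `block` shown there.

```python
from typing import List, Tuple


def accept_longest_prefix(block: List[int], ar_tokens: List[int]) -> int:
    """Return how many verifier tokens to keep for one drafted block.

    block[0] is the already-committed seed token and block[1:] are drafted.
    ar_tokens[i] is the causal prediction for the token following block[i].
    """
    accepted = 0
    for i in range(len(block) - 1):
        if ar_tokens[i] != block[i + 1]:
            break  # first disagreement ends the accepted prefix
        accepted += 1
    # The verifier's token after the matched prefix is always kept
    # (the "bonus token"), so each round yields at least one token.
    return accepted + 1


def best_path(blocks: List[List[int]], ar_rows: List[List[int]]) -> Tuple[int, int]:
    """Pick the candidate path with the longest accepted prefix (ties: first)."""
    counts = [accept_longest_prefix(b, a) for b, a in zip(blocks, ar_rows)]
    best_idx = max(range(len(counts)), key=counts.__getitem__)
    return best_idx, counts[best_idx]


# Single path: two drafted tokens match, the third does not.
block = [5, 7, 9, 4, 2]
ar_tokens = [7, 9, 8, 2, 1]
accepted = accept_longest_prefix(block, ar_tokens)
accepted_toks = ar_tokens[:accepted]

# Two candidate paths that differ at one uncertain position.
idx, n = best_path(
    [[5, 7, 9, 4, 2], [5, 7, 9, 8, 2]],
    [[7, 9, 8, 2, 1], [7, 9, 8, 2, 6]],
)
```

Here `accepted` counts the two matching draft tokens plus the bonus token, and `accepted_toks` is taken from the verifier output, not from the draft, which is why the corrected token at the first mismatch is kept. In the multi-path case the second candidate wins because its substituted token agrees with the verifier.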

modeling_nemotron_labs_diffusion.py DELETED

@@ -1,870 +0,0 @@
-import copy
-from dataclasses import dataclass
-from typing import Optional, Tuple
-import numpy as np
-import torch
-import torch.nn.functional as F
-from torch import nn
-from transformers.modeling_outputs import CausalLMOutputWithPast, BaseModelOutput
-from transformers.utils import ModelOutput
-from torch.nn.attention.flex_attention import flex_attention, create_block_mask
-from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
-from transformers.processing_utils import Unpack
-from transformers.cache_utils import Cache, DynamicCache
-from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
-from transformers.generation import GenerationMixin
-import math
-from .modeling_ministral import Ministral3Model, Ministral3PreTrainedModel, Ministral3Attention, apply_rotary_pos_emb, repeat_kv, _get_llama_4_attn_scale
-from .configuration_nemotron_labs_diffusion import NemotronLabsDiffusionConfig
-__all__ = ["NemotronLabsDiffusionModel", "NemotronLabsDiffusionFlexAttention"]
-@dataclass
-class NemotronLabsDiffusionOutputWithPast(ModelOutput):
-    loss: torch.FloatTensor | None = None
-    logits: torch.FloatTensor | None = None
-    causal_logits: torch.FloatTensor | None = None
-    past_key_values: Cache | None = None
-    hidden_states: tuple[torch.FloatTensor, ...] | None = None
-    attentions: tuple[torch.FloatTensor, ...] | None = None
-@torch.compile(fullgraph=True, mode="max-autotune-no-cudagraphs", dynamic=False)
-def fused_flex_attention(q, k, v, block_mask=None):
-    return flex_attention(q, k, v, block_mask=block_mask)
-class NemotronLabsDiffusionFlexAttention(Ministral3Attention):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.block_size = self.config.block_size
-        self.block_diff_mask = None
-        import torch._dynamo.config as dcfg
-        dcfg.cache_size_limit = 512
-    def compute_block_mask(self, mode, q_len, block_size=None):
-        def block_diff_mask(block_size, b, h, q_idx, kv_idx, n):
-            x0_flag_q = (q_idx >= n)
-            x0_flag_kv = (kv_idx >= n)
-            # Compute block indices
-            block_q = torch.where(x0_flag_q == 1,
-                                    (q_idx - n) // block_size,
-                                    q_idx // block_size)
-            block_kv = torch.where(x0_flag_kv == 1,
-                                    (kv_idx - n) // block_size,
-                                    kv_idx // block_size)
-            # **1. Block Diagonal Mask (M_BD) **
-            block_diagonal = (block_q == block_kv) & (x0_flag_kv == 0) & (x0_flag_q == 0)
-            # **2. Offset Block-Causal Mask (M_OBC) **
-            offset_block_causal = (
-                (block_q > block_kv)
-                & (x0_flag_kv == 1)
-                & (x0_flag_q == 0)
-            )
-            # **3. Fully Causal Mask (M_BC) **
-            fully_causal = (q_idx >= kv_idx) & (x0_flag_kv == 1) & (x0_flag_q == 1)
-            # **4. Combine Masks **
-            return block_diagonal | offset_block_causal | fully_causal
-        attn_mask = lambda b, h, q, kv: block_diff_mask(block_size, b, h, q, kv, q_len//2)
-        block_mask = create_block_mask(
-            attn_mask, B=None, H=None, Q_LEN=q_len, KV_LEN=q_len
-        )
-        return block_mask
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        position_embeddings: Tuple[torch.Tensor, torch.Tensor],
-        attention_mask: Optional[torch.Tensor],
-        past_key_values: Optional[Cache] = None,
-        cache_position: Optional[torch.LongTensor] = None,
-        is_training: bool = True,
-        **kwargs: Unpack[FlashAttentionKwargs],
-    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
-        bsz, q_len, _ = hidden_states.size()
-        input_shape = hidden_states.shape[:-1]
-        hidden_shape = (*input_shape, -1, self.head_dim)
-        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
-        cos, sin = position_embeddings
-        if is_training:
-            # Split query and key states in half along sequence length dimension
-            q1, q2 = query_states.chunk(2, dim=2)
-            k1, k2 = key_states.chunk(2, dim=2)
-            # Apply RoPE independently to each half
-            q1, k1 = apply_rotary_pos_emb(q1, k1, cos, sin)
-            q2, k2 = apply_rotary_pos_emb(q2, k2, cos, sin)
-            # Recombine the halves
-            query_states = torch.cat([q1, q2], dim=2)
-            key_states = torch.cat([k1, k2], dim=2)
-        else:
-            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
-        query_states = query_states * _get_llama_4_attn_scale(
-            cache_position,
-            self.config.rope_parameters.get("llama_4_scaling_beta"),
-            self.config.rope_parameters.get("original_max_position_embeddings"),
-        ).to(query_states.dtype)
-        if past_key_values is not None:
-            # sin and cos are specific to RoPE models; cache_position needed for the static cache
-            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
-            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)
-        key_states = repeat_kv(key_states, self.num_key_value_groups)
-        value_states = repeat_kv(value_states, self.num_key_value_groups)
-        if self.block_diff_mask is None or q_len != self.block_diff_mask.shape[-2]:
-            block_mask = self.compute_block_mask(mode='block_diff', block_size=self.block_size, q_len=q_len)
-        else:
-            block_mask = self.block_diff_mask
-        attn_output = fused_flex_attention(query_states, key_states, value_states, block_mask=block_mask)
-        attn_output = attn_output.transpose(1, 2).reshape(*input_shape, -1).contiguous()
-        attn_output = self.o_proj(attn_output)
-        return attn_output, None
-class NemotronLabsDiffusionModel(Ministral3PreTrainedModel, GenerationMixin):
-    """
-    A single model with:
-      - a bidirectional encoder + diffusion‐LM head over A
-      - a causal decoder + LM head over B, conditioned on F_A
-    """
-    def __init__(self, config: NemotronLabsDiffusionConfig):
-        super().__init__(config)
-        self.mask_token_id = config.mask_token_id
-        diffusion_config = copy.deepcopy(config)
-        diffusion_config.diffusion_lm = True
-        if config.dlm_paradigm == 'block_diff':
-            diffusion_config.attn_class = NemotronLabsDiffusionFlexAttention
-        elif config.dlm_paradigm in ['bidirectional', 'autoregressive']:
-            diffusion_config.attn_class = Ministral3Attention
-            if config.dlm_paradigm == 'autoregressive':
-                diffusion_config.diffusion_lm = False
-        else:
-            raise ValueError(f"Unsupported DLM paradigm: {config.dlm_paradigm}")
-        self.encoder = Ministral3Model(diffusion_config)
-        self.diffusion_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-        self.vocab_size = config.vocab_size
-        self.post_init()
-    def get_input_embeddings(self):
-        return self.encoder.embed_tokens
-    def set_input_embeddings(self, value):
-        self.encoder.embed_tokens = value
-    def get_output_embeddings(self):
-        return self.diffusion_head
-    def set_output_embeddings(self, new_embeddings):
-        self.diffusion_head = new_embeddings
-    def forward_process(self, input_ids, eps=1e-3, block_size=None, loss_mask=None):
-        b, l = input_ids.shape
-        device = input_ids.device
-        if self.config.dp_varying_mask_ratio:
-            # Enable different random seeds for each DP rank during sampling
-            import torch.distributed as dist
-            dp_rank = 0
-            if dist.is_initialized():
-                try:
-                    dp_rank = dist.get_rank()
-                except Exception:
-                    dp_rank = 0
-            # Use a local generator to avoid affecting global RNG state
-            generator = torch.Generator(device=device)
-            generator.manual_seed(torch.seed() + dp_rank)
-        else:
-            generator = None
-        t = torch.rand(b, device=device, generator=generator)
-        p_mask = (1 - eps) * t + eps  # shape: (b,)
-        p_mask = p_mask[:, None].expand(-1, l)  # shape: (b, l)
-        masked_indices = torch.rand((b, l), device=device) < p_mask
-        if loss_mask is not None:
-            masked_indices[loss_mask == 0] = 0
-        noisy_batch = torch.where(masked_indices, self.mask_token_id, input_ids)
-        return noisy_batch, masked_indices, p_mask
-    def forward(
-        self,
-        input_ids: torch.LongTensor,
-        attention_mask: Optional[torch.Tensor]   = None,
-        position_ids: Optional[torch.LongTensor] = None,
-        labels: Optional[torch.LongTensor]       = None,
-        split_len: Optional[int]                 = None,
-        past_key_values: Optional[Cache]         = None,
-        block_size: Optional[int]                = None,
-        eps: float                               = 1e-3,
-        is_teacher: bool                        = False,
-        masked_indices: Optional[torch.Tensor]   = None,
-        p_mask: Optional[torch.Tensor]           = None,
-        teacher_logits: Optional[torch.Tensor]   = None,
-        masked_indices_teacher: Optional[torch.Tensor] = None,
-        loss_mask: Optional[torch.Tensor] = None,
-        ce_loss_weight: float = 1.0,
-        output_last_hidden_states_only: bool = False,
-        skip_loss: bool = False,
-        **kwargs,
-    ) -> CausalLMOutputWithPast:
-        batch_size, seq_len = input_ids.shape
-        if self.config.dlm_paradigm == 'block_diff':
-            if labels is not None and block_size is None:
-                block_size = self.config.block_size
-        elif self.config.dlm_paradigm not in ('bidirectional', 'autoregressive'):
-            raise ValueError(f"Unknown dLM paradigm: {self.config.dlm_paradigm}")
-        if labels is not None and self.config.dlm_paradigm != 'autoregressive':
-            if masked_indices is not None:
-                # assert p_mask is not None
-                if loss_mask is not None:
-                    masked_indices[loss_mask == 0] = 0
-                noisy_inputs = torch.where(masked_indices, self.mask_token_id, input_ids)
-            else:
-                noisy_inputs, masked_indices, p_mask = self.forward_process(input_ids, eps=eps, block_size=block_size, loss_mask=loss_mask)
-        else:
-            noisy_inputs = input_ids
-            masked_indices = None
-            p_mask = None
-        input_ids_len = noisy_inputs.shape[1]
-        if labels is not None and self.config.dlm_paradigm == 'block_diff':
-            if position_ids is None:
-                position_ids = torch.arange(input_ids_len, device=noisy_inputs.device).unsqueeze(0)
-            noisy_inputs = torch.cat([noisy_inputs, input_ids], dim=1)
-        enc_out  = self.encoder(
-            past_key_values=past_key_values,
-            input_ids=noisy_inputs,
-            attention_mask=attention_mask,
-            position_ids=position_ids,
-            is_training=(labels is not None),
-            **kwargs,
-        )
-        if output_last_hidden_states_only:
-            return BaseModelOutput(last_hidden_state=enc_out.last_hidden_state)
-        logits = self.diffusion_head(enc_out.last_hidden_state)  # (batch, len_B, vocab)
-        causal_logits = None
-        if labels is not None and self.config.dlm_paradigm == 'block_diff':
-            causal_logits = logits[:, input_ids_len:]
-            logits = logits[:, :input_ids_len]
-        loss = None
-        if labels is not None and not skip_loss:
-            if self.config.dlm_paradigm == 'autoregressive':
-                shift_logits = logits[..., :-1, :].contiguous()
-                shift_labels = labels[..., 1:].contiguous()
-                if loss_mask is None:
-                    loss_fct = CrossEntropyLoss()
-                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
-                    shift_labels = shift_labels.view(-1)
-                    loss = loss_fct(shift_logits, shift_labels)
-                else:
-                    loss_mask = loss_mask[..., 1:].contiguous()
-                    loss_fct = CrossEntropyLoss(reduction='none')
-                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
-                    shift_labels = shift_labels.view(-1)
-                    shift_labels = shift_labels.to(shift_logits.device)
-                    token_losses = loss_fct(shift_logits, shift_labels)
-                    flat_loss_mask = loss_mask.reshape(-1)
-                    loss = token_losses[flat_loss_mask == 1].sum() / flat_loss_mask.sum()
-            else:
-                # LLaDA-style diffusion loss on masked positions.
-                # Token-wise cross entropy loss on masked positions.
-                token_loss = torch.nn.functional.cross_entropy(
-                    logits[masked_indices],
-                    labels[masked_indices],
-                    reduction='none'
-                ) / p_mask[masked_indices]
-                num_mask_tokens = masked_indices.sum()
-                # global_loss_avg=True: loss is reduced externally by global token count.
-                loss = token_loss.sum()
-                if self.config.dlm_loss_weight is not None:
-                    loss = self.config.dlm_loss_weight * loss
-                if self.config.dlm_paradigm == 'block_diff':
-                    # AR-side loss for block-diffusion paradigm.
-                    causal_logits = causal_logits[..., :-1, :].contiguous()
-                    causal_logits = causal_logits.view(-1, causal_logits.size(-1))
-                    causal_labels = labels[..., 1:].contiguous().view(-1)
-                    loss_fct = CrossEntropyLoss(reduction='sum')
-                    ar_loss = loss_fct(causal_logits, causal_labels)
-                    self.loss_diffusion = loss.detach().item() / num_mask_tokens
-                    self.loss_ar = ar_loss.detach().item() / seq_len
-                    loss = loss + self.config.ar_loss_weight * ar_loss
-                # global_loss_avg=True: return (sum_loss, token_count) for external mean.
-                if self.config.dlm_paradigm == 'block_diff':
-                    loss = (loss, num_mask_tokens + int(self.config.ar_loss_weight * seq_len))
-                else:
-                    loss = (loss, num_mask_tokens)
-        return NemotronLabsDiffusionOutputWithPast(
-            loss=loss if not is_teacher else logits,
-            logits=logits,
-            causal_logits=causal_logits,
-            past_key_values=enc_out.past_key_values,
-            hidden_states=None,
-            attentions=None,
-        )
-    @torch.no_grad()
-    def generate(
-        self,
-        prompt_ids: torch.Tensor,
-        max_new_tokens: int,
-        block_length: int,
-        threshold: Optional[float] = None,
-        causal_context: bool = True,
-        temperature: float = 0.0,
-        eos_token_id: Optional[int] = None,
-        max_thinking_tokens: Optional[int] = None,
-        end_think_token_id: Optional[int] = None,
-    ):
-        """Block-wise diffusion decoding with prefix-cached KV (LLaDA-style).
-        Each block: append `block_length` mask tokens, then iteratively unmask
-        by confidence top-k (with optional threshold). When `causal_context`,
-        the KV cache and the next-block seed are produced via a causal forward
-        between blocks (flipping `self_attn.diffusion_lm`), matching the AR
-        objective at block boundaries.
-        Returns (output_ids, nfe) — output_ids includes the prompt.
-        """
-        if eos_token_id is None:
-            eos_token_id = getattr(self.config, "eos_token_id", None)
-        mask_id = self.mask_token_id
-        x_accum = prompt_ids.clone()
-        B = prompt_ids.shape[0]
-        assert max_new_tokens % block_length == 0
-        num_blocks = max_new_tokens // block_length
-        # one denoising step per generated token (matches legacy chat_utils call)
-        steps_per_block = block_length
-        nfe = 0
-        def _set_diffusion_lm(val: bool):
-            for layer in self.encoder.layers:
-                if hasattr(layer.self_attn, "diffusion_lm"):
-                    layer.self_attn.diffusion_lm = val
-        # Initial causal prefill produces the KV cache and the next-block seed.
-        if causal_context:
-            _set_diffusion_lm(False)
-        output = self(prompt_ids, use_cache=True, use_causal_mask=causal_context)
-        past_key_values = output.past_key_values
-        if causal_context:
-            _set_diffusion_lm(True)
-        next_token = None
-        if causal_context:
-            last_logit = output.logits[:, -1, :]
-            if temperature > 0:
-                next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
-            else:
-                next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
-        for num_block in range(num_blocks):
-            mask_block = torch.full(
-                (B, block_length), mask_id, dtype=prompt_ids.dtype, device=prompt_ids.device,
-            )
-            if causal_context:
-                mask_block[:, 0] = next_token[:, 0]
-            x_accum = torch.cat([x_accum, mask_block], dim=1)
-            block_start = prompt_ids.size(1) + num_block * block_length
-            block_slice = slice(block_start, block_start + block_length)
-            # Thinking-budget enforcement: if we've passed max_thinking_tokens
-            # without an end-think marker, inject one into this block.
-            if end_think_token_id is not None and max_thinking_tokens is not None:
-                tokens_before = num_block * block_length
-                tokens_after = tokens_before + block_length
-                if tokens_after > max_thinking_tokens:
-                    gen_so_far = x_accum[:, prompt_ids.size(1):block_start]
-                    has_end_think = (
-                        (gen_so_far == end_think_token_id).any(dim=1)
-                        if gen_so_far.size(1) > 0
-                        else torch.zeros(B, dtype=torch.bool, device=prompt_ids.device)
-                    )
-                    if not has_end_think.all():
-                        offset = max(0, max_thinking_tokens - tokens_before)
-                        inject_pos = block_start + offset
-                        for b in range(B):
-                            if not has_end_think[b]:
-                                x_accum[b, inject_pos] = end_think_token_id
-            mask_block_idx0 = x_accum[:, block_slice] == mask_id
-            num_transfer_tokens = _get_num_transfer_tokens(mask_block_idx0, steps_per_block)
-            # Denoise the current block by repeated confidence-based unmasking.
-            for i in range(steps_per_block):
-                mask_block_idx = x_accum[:, block_slice] == mask_id
-                if mask_block_idx.sum() == 0:
-                    break
-                nfe += 1
-                logits_block = self(
-                    x_accum[:, block_slice],
-                    past_key_values=past_key_values,
-                    use_cache=False,
-                ).logits
-                x0, transfer_idx = _get_transfer_index(
-                    logits_block, temperature, mask_block_idx, x_accum[:, block_slice],
-                    num_transfer_tokens=num_transfer_tokens[:, i], threshold=threshold,
-                )
-                cur = x_accum[:, block_slice].clone()
-                cur[transfer_idx] = x0[transfer_idx]
-                x_accum[:, block_slice] = cur
-                if eos_token_id is not None:
-                    block_tokens = x_accum[:, block_slice]
-                    eos_mask = block_tokens == eos_token_id
-                    if eos_mask.any(dim=1).any():
-                        after_eos = eos_mask.cumsum(dim=1).bool()
-                        mask_before = (block_tokens == mask_id) & ~after_eos
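-                        # Stop early only once every row has an EOS with no masks left before it;
-                        # breaking on a single finished row would leave masks in the other rows.
-                        if (eos_mask.any(dim=1) & ~mask_before.any(dim=1)).all():
-                            break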
-            # Post-block: causal forward over the block to update the KV cache
-            # and (when causal_context) sample the seed for the next block.
-            if causal_context:
-                _set_diffusion_lm(False)
-            output = self(
-                x_accum[:, block_slice],
-                past_key_values=past_key_values,
-                use_cache=True,
-                use_causal_mask=causal_context,
-            )
-            past_key_values = output.past_key_values
-            nfe += 1
-            if causal_context:
-                _set_diffusion_lm(True)
-                last_logit = output.logits[:, -1, :]
-                if temperature > 0:
-                    next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
-                else:
-                    next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
-            if eos_token_id is not None:
-                gen_so_far = x_accum[:, prompt_ids.size(1):]
-                is_eos = gen_so_far == eos_token_id
-                if is_eos.any(dim=1).all():
-                    first_eos = is_eos.to(torch.int64).argmax(dim=1)
-                    max_eos = first_eos.max().item()
-                    return x_accum[:, : prompt_ids.size(1) + max_eos + 1], nfe
-        return x_accum, nfe
-    @torch.no_grad()
-    def ar_generate(
-        self,
-        prompt_ids: torch.Tensor,
-        max_new_tokens: int = 128,
-        temperature: float = 0.0,
-        eos_token_id: Optional[int] = None,
-        max_thinking_tokens: Optional[int] = None,
-        end_think_token_id: Optional[int] = None,
-    ) -> tuple:
-        """Autoregressive generation calling the encoder directly (injected by build_hf_tidar_repo).
-        Bypasses NemotronLabsDiffusionModel.forward() to avoid diffusion-specific
-        code paths. Calls self.encoder (Ministral3Model) with explicit cache_position,
-        position_ids, and use_cache so the KV cache and causal masking behave
-        identically to MistralForCausalLM / vLLM.
-        Returns:
-            (output_ids, nfe) where output_ids includes the prompt.
-        """
-        for layer in self.encoder.layers:
-            if hasattr(layer.self_attn, 'diffusion_lm'):
-                layer.self_attn.diffusion_lm = False
-        if eos_token_id is None:
-            eos_token_id = getattr(self.config, 'eos_token_id', None)
-        device = prompt_ids.device
-        batch_size, prompt_len = prompt_ids.shape
-        past_key_values = DynamicCache()
-        cache_position = torch.arange(prompt_len, device=device)
-        position_ids = cache_position.unsqueeze(0).expand(batch_size, -1)
-        enc_out = self.encoder(
-            input_ids=prompt_ids,
-            position_ids=position_ids,
-            past_key_values=past_key_values,
-            use_cache=True,
-            cache_position=cache_position,
-        )
-        past_key_values = enc_out.past_key_values
-        next_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
-        generated_tokens = []
-        nfe = 0
-        for step in range(max_new_tokens):
-            nfe += 1
-            if temperature > 0:
-                probs = torch.softmax(next_logit / temperature, dim=-1)
-                next_token = torch.multinomial(probs, num_samples=1)
-            else:
-                next_token = torch.argmax(next_logit, dim=-1, keepdim=True)
-            # ---- thinking budget enforcement ----
-            if end_think_token_id is not None and max_thinking_tokens is not None:
-                if step >= max_thinking_tokens:
-                    if generated_tokens:
-                        gen_tensor = torch.cat(generated_tokens, dim=1)
-                        has_end_think = (gen_tensor == end_think_token_id).any(dim=1)
-                    else:
-                        has_end_think = torch.zeros(batch_size, dtype=torch.bool, device=device)
-                    for b in range(batch_size):
-                        if not has_end_think[b]:
-                            next_token[b] = end_think_token_id
-            generated_tokens.append(next_token)
-            if eos_token_id is not None and (next_token == eos_token_id).all():
-                break
-            if step < max_new_tokens - 1:
-                cur_pos = prompt_len + step
-                step_cache_pos = torch.tensor([cur_pos], device=device)
-                step_pos_ids = step_cache_pos.unsqueeze(0).expand(batch_size, -1)
-                enc_out = self.encoder(
-                    input_ids=next_token,
-                    position_ids=step_pos_ids,
-                    past_key_values=past_key_values,
-                    use_cache=True,
-                    cache_position=step_cache_pos,
-                )
-                past_key_values = enc_out.past_key_values
-                next_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
-        all_generated = torch.cat(generated_tokens, dim=1)
-        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
-        return output_ids, nfe
-    @torch.no_grad()
-    def linear_spec_generate(
-        self,
-        prompt_ids: torch.Tensor,
-        max_new_tokens: int = 128,
-        block_length: int = 32,
-        temperature: float = 0.0,
-        mask_token_id: Optional[int] = None,
-        eos_token_id: Optional[int] = None,
-        max_thinking_tokens: Optional[int] = None,
-        end_think_token_id: Optional[int] = None,
-        threshold: float = 0.0,
-    ):
-        """Linear speculative decoding: diffusion draft + AR verify.
-        Each iteration: (1) draft the next block under bidirectional attention,
-        (2) verify the drafted block under causal attention, accept the longest
-        prefix where draft matches AR + one bonus token, advance the KV cache.
-        LoRA-aware: when a PEFT adapter is attached to the model (e.g.
-        ``linear_spec_lora``), it is toggled ON for the bidirectional draft
-        phase and OFF for the causal prefill / verify phases — so the adapter
-        only specializes the diffusion-mode forward and AR semantics are
-        preserved. With no adapter loaded the calls are no-ops.
-        Returns ``(output_ids, nfe)`` — ``output_ids`` includes the prompt.
-        """
-        if prompt_ids.shape[0] != 1:
-            raise ValueError("Linear speculative decoding requires batch_size == 1")
-        token_mask_id = mask_token_id if mask_token_id is not None else self.config.mask_token_id
-        if eos_token_id is None:
-            eos_token_id = getattr(self.config, "eos_token_id", None)
-        device = prompt_ids.device
-        def _set_diffusion_lm(val: bool):
-            for layer in self.encoder.layers:
-                if hasattr(layer.self_attn, "diffusion_lm"):
-                    layer.self_attn.diffusion_lm = val
-        def _toggle_adapters(enable: bool):
-            # No-op when no PEFT/LoRA modules are attached.
-            for module in self.modules():
-                if hasattr(module, "_disable_adapters"):
-                    module._disable_adapters = not enable
-        # Prefill (causal, LoRA OFF).
-        _set_diffusion_lm(False)
-        _toggle_adapters(False)
-        enc_out = self.encoder(
-            input_ids=prompt_ids,
-            past_key_values=DynamicCache(),
-            use_cache=True,
-            use_causal_mask=True,
-        )
-        past_key_values = enc_out.past_key_values
-        last_logit = self.diffusion_head(enc_out.last_hidden_state[:, -1:, :]).squeeze(1)
-        nfe = 1
-        if temperature > 0:
-            next_token = torch.multinomial(torch.softmax(last_logit / temperature, dim=-1), num_samples=1)
-        else:
-            next_token = torch.argmax(last_logit, dim=-1, keepdim=True)
-        if eos_token_id is not None and next_token.item() == eos_token_id:
-            return torch.cat([prompt_ids, next_token], dim=1), nfe
-        generated = [next_token]
-        total_gen = 1
-        while total_gen < max_new_tokens:
-            cache_len = past_key_values.get_seq_length()
-            block = torch.full((1, block_length), token_mask_id, dtype=torch.long, device=device)
-            block[0, 0] = next_token.item()
-            # Draft phase (bidirectional, LoRA ON). With threshold > 0, iterate until no
-            # masks remain, committing at least one token per pass; otherwise fill the
-            # whole block in a single pass.
-            _set_diffusion_lm(True)
-            _toggle_adapters(True)
-            while True:
-                is_mask = block == token_mask_id
-                if not is_mask.any():
-                    break
-                enc_out = self.encoder(input_ids=block, past_key_values=past_key_values, use_cache=False)
-                nfe += 1
-                draft_logits = self.diffusion_head(enc_out.last_hidden_state)
-                # LLaDA: logit[i] directly predicts position i — no shift needed.
-                if temperature > 0:
-                    draft_probs = torch.softmax(draft_logits / temperature, dim=-1)
-                    draft_tokens = torch.multinomial(
-                        draft_probs.view(-1, draft_probs.shape[-1]), num_samples=1
-                    ).view(1, block_length)
-                else:
-                    draft_tokens = draft_logits.argmax(dim=-1)
-                    draft_probs = torch.softmax(draft_logits, dim=-1)
-                if threshold > 0:
-                    draft_conf = torch.gather(draft_probs, -1, draft_tokens.unsqueeze(-1)).squeeze(-1)
-                    draft_conf = torch.where(is_mask, draft_conf, -torch.inf)
-                    unmask = draft_conf >= threshold
-                    # Force progress even when every masked position is below threshold.
-                    if not unmask.any():
-                        best_idx = draft_conf.view(-1).argmax()
-                        unmask = torch.zeros_like(is_mask, dtype=torch.bool)
-                        unmask.view(-1)[best_idx] = True
-                    block[unmask] = draft_tokens[unmask]
-                else:
-                    block[is_mask] = draft_tokens[is_mask]
-                    break
-            # Verify phase (causal, LoRA OFF).
-            _set_diffusion_lm(False)
-            _toggle_adapters(False)
-            enc_out = self.encoder(
-                input_ids=block,
-                past_key_values=past_key_values,
-                use_cache=True,
-                use_causal_mask=True,
-            )
-            past_key_values = enc_out.past_key_values
-            nfe += 1
-            verify_logits = self.diffusion_head(enc_out.last_hidden_state)
-            if temperature > 0:
-                ar_tokens = torch.multinomial(
-                    torch.softmax(verify_logits / temperature, dim=-1).view(-1, verify_logits.shape[-1]),
-                    num_samples=1,
-                ).view(1, block_length)
-            else:
-                ar_tokens = verify_logits.argmax(dim=-1)
-            # Accept consecutive matches; AR also gives one bonus token at the tail.
-            accepted = 0
-            for i in range(block_length - 1):
-                if ar_tokens[0, i].item() == block[0, i + 1].item():
-                    accepted += 1
-                else:
-                    break
-            accepted += 1
-            accepted_toks = ar_tokens[:, :accepted]
-            generated.append(accepted_toks)
-            total_gen += accepted
-            _crop_dynamic_cache(past_key_values, cache_len + accepted)
-            next_token = ar_tokens[:, accepted - 1 : accepted]
-            if eos_token_id is not None:
-                eos_pos = (accepted_toks[0] == eos_token_id).nonzero(as_tuple=True)[0]
-                if len(eos_pos) > 0:
-                    first_eos = eos_pos[0].item()
-                    generated[-1] = accepted_toks[:, : first_eos + 1]
-                    total_gen = total_gen - accepted + first_eos + 1
-                    break
-            # Thinking-budget enforcement: force end-think as next seed if budget exhausted.
-            if end_think_token_id is not None and max_thinking_tokens is not None:
-                if total_gen > max_thinking_tokens:
-                    all_gen = torch.cat(generated, dim=1)
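-                    if not (all_gen == end_think_token_id).any():
-                        next_token = torch.tensor([[end_think_token_id]], device=device)
-                        # The forced token replaces the bonus token as the next seed, so
-                        # overwrite it in the output too; otherwise the returned ids diverge
-                        # from the KV cache and this check never sees the end-think marker.
-                        generated[-1] = torch.cat([generated[-1][:, :-1], next_token], dim=1)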
-            if total_gen >= max_new_tokens:
-                break
-        all_generated = torch.cat(generated, dim=1)
-        output_ids = torch.cat([prompt_ids, all_generated], dim=1)
-        return output_ids, nfe
-# ─── Module-level helpers used by `generate` and `linear_spec_generate` ──
-def _crop_dynamic_cache(past_key_values: DynamicCache, max_length: int):
-    """Crop a DynamicCache to max_length, compatible with both old and new transformers."""
-    if hasattr(past_key_values, 'crop'):
-        past_key_values.crop(max_length)
-    else:
-        for layer_idx in range(len(past_key_values)):
-            past_key_values.key_cache[layer_idx] = past_key_values.key_cache[layer_idx][:, :, :max_length]
-            past_key_values.value_cache[layer_idx] = past_key_values.value_cache[layer_idx][:, :, :max_length]
-        past_key_values._seen_tokens = max_length
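-def _add_gumbel_noise(logits, temperature):
-    """Gumbel-max sampling in float64 (low-precision Gumbel hurts MDM quality).
-    Returns exp(logits) / (-log u)**temperature; its argmax equals that of
-    logits / temperature + Gumbel noise, i.e. a sample from softmax(logits / temperature).
-    At temperature 0 the logits are returned unchanged, so argmax is greedy.
-    """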
-    if temperature == 0:
-        return logits
-    logits = logits.to(torch.float64)
-    noise = torch.rand_like(logits, dtype=torch.float64)
-    gumbel_noise = (- torch.log(noise)) ** temperature
-    return logits.exp() / gumbel_noise
-def _get_num_transfer_tokens(mask_index, steps: int):
-    """Even split of masked positions across `steps`, with remainder front-loaded."""
-    mask_num = mask_index.sum(dim=1, keepdim=True)
-    base = mask_num // steps
-    remainder = mask_num % steps
-    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base
-    for i in range(mask_num.size(0)):
-        num_transfer_tokens[i, : int(remainder[i])] += 1
-    return num_transfer_tokens
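-def _get_transfer_index(logits, temperature, mask_index, x, num_transfer_tokens, threshold=None):
-    """Pick which masked positions to commit this denoising step.
-    Returns (x0, transfer_index): x0 holds the candidate tokens (argmax of the
-    Gumbel-perturbed logits, i.e. plain argmax at temperature 0), clamped to the
-    original x at non-masked positions; transfer_index is a bool mask over the
-    positions to finalize. Without `threshold`, the top `num_transfer_tokens`
-    masked positions by confidence are committed. With `threshold`, every masked
-    position is a candidate and those below the threshold are dropped, except
-    the single most confident one, which is always kept so each step makes progress.
-    """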
-    logits_with_noise = _add_gumbel_noise(logits, temperature=temperature)
-    x0 = torch.argmax(logits_with_noise, dim=-1)
-    p = F.softmax(logits, dim=-1)
-    x0_p = torch.squeeze(torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)
-    x0 = torch.where(mask_index, x0, x)
-    confidence = torch.where(mask_index, x0_p, -np.inf)
-    transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
-    if threshold is not None:
-        num_transfer_tokens = mask_index.sum(dim=1, keepdim=True)
-    for j in range(confidence.shape[0]):
-        _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j])
-        transfer_index[j, select_index] = True
-        if threshold is not None:
-            for k in range(1, num_transfer_tokens[j]):
-                if confidence[j, select_index[k]] < threshold:
-                    transfer_index[j, select_index[k]] = False
-    return x0, transfer_index
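-def gumbel_topk(log_w: torch.Tensor, k: int) -> torch.Tensor:
-    """Gumbel top-k: sample k indices without replacement, weighted by exp(log_w).
-    Expects a 1-D log_w; returns a bool mask of the same length with exactly k True.
-    """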
-    g = -torch.log(-torch.log(torch.rand_like(log_w) + 1e-9) + 1e-9)
-    topk = torch.topk(log_w + g, k).indices
-    mask = torch.zeros_like(log_w, dtype=torch.bool)
-    mask[topk] = True
-    return mask
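The acceptance rule in the verify phase of `linear_spec_generate` above is easier to follow on plain lists. The following is a minimal sketch, not the repository's code: `accept_prefix` is a hypothetical helper name, the token ids are made up, and tensors are replaced by Python lists so it runs without PyTorch.

```python
from typing import List


def accept_prefix(draft_block: List[int], ar_tokens: List[int]) -> List[int]:
    """Return the tokens accepted from one verify pass.

    draft_block[0] is the seed token (already emitted); draft_block[1:] are
    the drafted tokens. ar_tokens[i] is the causal prediction for the token
    that follows draft_block[i], so it is compared with draft_block[i + 1].
    """
    accepted = 0
    for i in range(len(draft_block) - 1):
        if ar_tokens[i] != draft_block[i + 1]:
            break  # first disagreement ends the accepted prefix
        accepted += 1
    # The causal pass also yields one extra token: the correction at the
    # first mismatch, or the token after the block when everything matched.
    accepted += 1
    return ar_tokens[:accepted]


# Two drafted tokens match, the third does not: keep 7, 9 plus the correction 2.
print(accept_prefix([5, 7, 9, 4], [7, 9, 2, 8]))
# Every drafted token matches: keep all three plus the bonus token 1.
print(accept_prefix([5, 7, 9, 4], [7, 9, 4, 1]))
```

The last returned token is the one the method reuses as the seed of the next block, which is why the KV cache is cropped to the cached length plus the number of accepted tokens rather than to the full block.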