Instructions for using nvidia/Nemotron-Labs-Diffusion-8B-Base with libraries and local apps.

Libraries

How to use nvidia/Nemotron-Labs-Diffusion-8B-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Nemotron-Labs-Diffusion-8B-Base", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/Nemotron-Labs-Diffusion-8B-Base", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use nvidia/Nemotron-Labs-Diffusion-8B-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Labs-Diffusion-8B-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-8B-Base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
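
For scripted use, the curl call above can be approximated in Python with only the standard library. This is a minimal sketch, not part of the original instructions: the helper names `build_chat_request` and `send` are hypothetical, and it assumes the server started in the previous step is listening on localhost:8000.

```python
import json
import urllib.request

MODEL = "nvidia/Nemotron-Labs-Diffusion-8B-Base"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_chat_request("What is the capital of France?"))` returns the parsed response; in the OpenAI chat completions format the generated text is expected under `choices[0]["message"]["content"]`. Passing `base_url="http://localhost:30000"` targets the SGLang server instead.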

Use Docker

docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-8B-Base

SGLang

How to use nvidia/Nemotron-Labs-Diffusion-8B-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Labs-Diffusion-8B-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-8B-Base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Labs-Diffusion-8B-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Labs-Diffusion-8B-Base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/Nemotron-Labs-Diffusion-8B-Base with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Labs-Diffusion-8B-Base
```

Clean up rope params; ensure transformers 4.55/5.0 compatibility

by abhgarg - opened May 15

base: refs/heads/main ← from: refs/pr/1

Files changed: +2416 -1312

initial commit (3fa953fd)
Upload model (a1950408)
Upload tokenizer (46c0f958)
Update README.md (618a1c84)
Added custom MinistralDiffOutputWithPast return type and skip_loss functionality (418d9e48)
Upload model (dfcb979c)
Upload tokenizer (1c298a2e)
Upload model (378400a4)
Delete model-00004-of-00004.safetensors (e93dcdea)
Delete model-00003-of-00004.safetensors (c4c48810)
Delete model-00001-of-00004.safetensors (57e8939e)
Delete model-00002-of-00004.safetensors (987c6528)
Delete model.safetensors.index.json (b4fbb98c)
Removed p_mask assert for compatibility with nemo-rl (bad4dec3)
Upload model (6294f2a6)
Upload model (4cabc4d9)
Made some potential fixes for DSA, need to test if they work (ff077483)
Trying to force transformers to use the older causal mask (456e96b9)
Overriding the old function doesn't work, reverting to old approach (33b2954f)
set default causal_context=True (262d4024)
Upload model (f9e0c410)
Upload model (15597b8e)
Update config.json (8200ec05)
Update chat_template.jinja (abf48b33)
Changed chat_template to remove alternating check (0820ac4d)
Update chat_template.jinja (0a9534ae)
Trying new settings for tokenizer_config.json to hopefully fix issues (d42bc629)
Upload model (f318bfe5)
Upload tokenizer (85eb9d3c)

abhgarg (NVIDIA org), May 15

Remove duplicate top-level rope_scaling block and stray rope_theta from config.json
Remove duplicate 'type' key from rope_parameters
For 3B-Base/8B-Base: set max_position_embeddings=4096 and factor=0.25 to match training
Mirror rope_theta and rope_scaling from rope_parameters in MinistralDLMConfig so that YaRN works on transformers v4.55
Drop unused sdpa_mask_older_torch import (removed in transformers v5.0)
Bump transformers_version to 5.0.0
In linear_spec_generate_mp, guard direct past_kv.key_cache / value_cache access behind a hasattr(past_kv, 'layers') check so v5.0's DynamicCache API works too
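
The last item describes a feature-detection guard. The sketch below illustrates that pattern only; it is not the code from this PR. The helper name `get_layer_kv` is hypothetical, the stand-in objects replace a real cache, and the per-layer attribute names `keys` and `values` are an assumption about the newer DynamicCache layout.

```python
from types import SimpleNamespace


def get_layer_kv(past_kv, layer_idx):
    """Return (key, value) for one layer from either cache layout (hypothetical helper)."""
    if hasattr(past_kv, "layers"):
        # Newer layout (assumed): one object per layer exposing .keys and .values.
        layer = past_kv.layers[layer_idx]
        return layer.keys, layer.values
    # Older layout: parallel per-layer lists stored on the cache object itself.
    return past_kv.key_cache[layer_idx], past_kv.value_cache[layer_idx]


# Stand-in objects that only mimic the two attribute layouts; strings replace tensors.
old_style = SimpleNamespace(key_cache=["k0", "k1"], value_cache=["v0", "v1"])
new_style = SimpleNamespace(
    layers=[
        SimpleNamespace(keys="k0", values="v0"),
        SimpleNamespace(keys="k1", values="v1"),
    ]
)

print(get_layer_kv(old_style, 1))
print(get_layer_kv(new_style, 1))
```

Checking for `layers` first means a cache object that exposes both layouts takes the newer path, so the deprecated attributes are only touched when nothing else is available.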

Clean up rope params; ensure transformers 4.55/5.0 compatibility (a4574aef)

YongganFu changed pull request status to merged May 16
