YongganFu committed on Nov 12, 2025

Commit be9e4da (verified) · 1 Parent(s): 7a07ecc

Update modeling_nvrdiff.py

Files changed (1)

modeling_nvrdiff.py +35 -35

@@ -486,45 +486,45 @@ class DiffEncoderModel(Qwen3PreTrainedModel, GenerationMixin):
             logits = logits[:, :input_ids_len]
         loss = None
-        if labels is not None:
-            if self.config.dlm_paradigm == 'autoregressive':
-                shift_logits = logits[..., :-1, :].contiguous()
-                shift_labels = labels[..., 1:].contiguous()
-                if loss_mask is None:
-                    loss_fct = CrossEntropyLoss()
-                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
-                    shift_labels = shift_labels.view(-1)
-                    loss = loss_fct(shift_logits, shift_labels)
-                else:
-                    loss_mask = loss_mask[..., 1:].contiguous()
-                    loss_fct = CrossEntropyLoss(reduction='none')
-                    shift_logits = shift_logits.view(-1, shift_logits.size(-1))
-                    shift_labels = shift_labels.view(-1)
-                    shift_labels = shift_labels.to(shift_logits.device)
-                    token_losses = loss_fct(shift_logits, shift_labels)
-                    loss = token_losses[loss_mask].sum() / loss_mask.sum()
-            else:
-                # Handle DREAM vs LLADA style losses
-                if hasattr(self.config, 'dlm_type') and self.config.dlm_type == 'dream':
-                    logits = logits[..., :-1, :].contiguous()
-                    labels = labels[..., 1:].contiguous()
-                    masked_indices = masked_indices[:, 1:]
-                    p_mask = p_mask[:, 1:]
-                # Calculate token-wise cross entropy loss for masked positions in B
-                token_loss = torch.nn.functional.cross_entropy(
-                    logits[masked_indices],
-                    labels[masked_indices],
-                    reduction='none'
-                ) / p_mask[masked_indices]
-                loss = token_loss.sum() / masked_indices.sum()
+        # if labels is not None:
+        #     if self.config.dlm_paradigm == 'autoregressive':
+        #         shift_logits = logits[..., :-1, :].contiguous()
+        #         shift_labels = labels[..., 1:].contiguous()
+        #         if loss_mask is None:
+        #             loss_fct = CrossEntropyLoss()
+        #             shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+        #             shift_labels = shift_labels.view(-1)
+        #             loss = loss_fct(shift_logits, shift_labels)
+        #         else:
+        #             loss_mask = loss_mask[..., 1:].contiguous()
+        #             loss_fct = CrossEntropyLoss(reduction='none')
+        #             shift_logits = shift_logits.view(-1, shift_logits.size(-1))
+        #             shift_labels = shift_labels.view(-1)
+        #             shift_labels = shift_labels.to(shift_logits.device)
+        #             token_losses = loss_fct(shift_logits, shift_labels)
+        #             loss = token_losses[loss_mask].sum() / loss_mask.sum()
+        #     else:
+        #         # Handle DREAM vs LLADA style losses
+        #         if hasattr(self.config, 'dlm_type') and self.config.dlm_type == 'dream':
+        #             logits = logits[..., :-1, :].contiguous()
+        #             labels = labels[..., 1:].contiguous()
+        #             masked_indices = masked_indices[:, 1:]
+        #             p_mask = p_mask[:, 1:]
+        #         # Calculate token-wise cross entropy loss for masked positions in B
+        #         token_loss = torch.nn.functional.cross_entropy(
+        #             logits[masked_indices],
+        #             labels[masked_indices],
+        #             reduction='none'
+        #         ) / p_mask[masked_indices]
+        #         loss = token_loss.sum() / masked_indices.sum()
         return CausalLMOutputWithPast(
             loss=loss if not is_teacher else logits,
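
With the loss branch commented out, `forward` always returns `loss = None` (or the logits when `is_teacher` is set). For readers who want to see what the removed diffusion branch computed, the arithmetic can be sketched without any dependencies. This is a minimal illustration, not the repository's code. The function name `masked_diffusion_loss`, the `dream_shift` flag, and the plain-list inputs are assumptions made for the example. It handles a single flattened sequence, whereas the original works on batched tensors.

```python
import math


def masked_diffusion_loss(logits, labels, masked, p_mask, dream_shift=False):
    """Average of (cross entropy / p_mask) over masked positions.

    logits: list of per-position score lists (one float per vocabulary entry)
    labels: list of target token ids, one per position
    masked: list of bools, True where the input token was masked
    p_mask: list of masking probabilities, one per position
    dream_shift: hypothetical flag mirroring the 'dream' branch, where the
        logits at position i are scored against the label at position i + 1
    """
    if dream_shift:
        logits = logits[:-1]
        labels = labels[1:]
        masked = masked[1:]
        p_mask = p_mask[1:]

    total = 0.0
    count = 0
    for scores, target, is_masked, p in zip(logits, labels, masked, p_mask):
        if not is_masked:
            continue
        # Cross entropy = log-sum-exp(scores) - scores[target];
        # subtract the max before exponentiating for numerical stability.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += (log_z - scores[target]) / p
        count += 1

    if count == 0:
        raise ValueError("no masked positions to average over")
    return total / count


# Usage: uniform scores over a 4-token vocabulary, two masked positions.
example_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4]
example_labels = [1, 2, 3]
example_masked = [False, True, True]
example_p_mask = [0.5, 0.5, 0.5]
example_loss = masked_diffusion_loss(
    example_logits, example_labels, example_masked, example_p_mask
)
```

Dividing each token's loss by its masking probability up-weights tokens that were unlikely to be masked. The sum is then normalized by the number of masked positions, matching `token_loss.sum() / masked_indices.sum()` in the removed lines.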