Instructions to use google/diffusiongemma-26B-A4B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/diffusiongemma-26B-A4B-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/diffusiongemma-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("google/diffusiongemma-26B-A4B-it")
model = AutoModelForMultimodalLM.from_pretrained("google/diffusiongemma-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use google/diffusiongemma-26B-A4B-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/diffusiongemma-26B-A4B-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/diffusiongemma-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/google/diffusiongemma-26B-A4B-it

SGLang

How to use google/diffusiongemma-26B-A4B-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/diffusiongemma-26B-A4B-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/diffusiongemma-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/diffusiongemma-26B-A4B-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/diffusiongemma-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use google/diffusiongemma-26B-A4B-it with Docker Model Runner:
```
docker model run hf.co/google/diffusiongemma-26B-A4B-it
```

Architectural proposal: Dynamic "Endocrine" Neuromodulation & Homeostatic Flushing for DiffusionGemma

#17

by Soulfate24 - opened 9 days ago


Dear Brendan, Sebastian, and the Google DeepMind team,

First of all, congratulations on the release of DiffusionGemma! The hybrid use of the Gemma 4 backbone (repurposing causal layers for bidirectional denoising) is a beautiful example of what evolutionary biology calls exaptation—reusing a functional system for a completely new regulatory purpose.

While reading about your entropy-bound denoising and temperature scheduling, a fascinating biological analogy came to mind that could push the adaptive inference capabilities of DiffusionGemma even further.

In biological systems, the brain regulates stress, cognitive overload, and chemical imbalances through closed-loop hormonal feedback and waste clearance (for instance, emotional tears, which some studies suggest carry stress-related hormones out of the body under emotional strain).

I would like to propose a translation of this concept into LLM architectures: Dynamic Endocrine Neuromodulation.

The Proposal: An Active Meta-Orchestrator for Diffusion

Instead of relying on predefined temperature schedules (0.8 down to 0.4) or static entropy budgets, we could implement a lightweight, parallel meta-orchestrator acting as an artificial endocrine system:

"Hormonal" Parameter Modulation (Dopamine, Cortisol & Noradrenaline):
- Dopamine (Exploration): If the task is open-ended or creative, a "dopamine-like" signal is injected, relaxing the attention constraints and raising the entropy budget to allow wider exploration.
- Cortisol (Constraint/Stress): If the orchestrator detects highly structured patterns (like code or mathematics), it injects a "cortisol-like" signal. This dynamically tightens the top-p and entropy threshold, forcing the model into a hyper-focused, low-temperature denoising state.
- Noradrenaline (Signal-to-Noise Ratio & MoE Routing): In biology, noradrenaline (from the locus coeruleus) dynamically adjusts the brain's signal-to-noise ratio (SNR) to filter out distractions. In DiffusionGemma, a "noradrenaline-like" signal could scale the per-step noise injection during the diffusion process. Early in the denoising of a canvas, low noradrenaline allows maximum random noise (exploration). As confidence increases, high noradrenaline aggressively prunes low-probability tokens and sharpens the MoE routing weights, so the network spends compute only on the most critical token transitions.
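
To make the three signals above concrete, here is a minimal sketch of how such an orchestrator might map them onto sampling parameters. Everything in it is hypothetical: the `Hormones` and `SamplingParams` classes, the `modulate` function, and the gain constants are illustrative assumptions, not part of DiffusionGemma or any library. The temperature bounds reuse the 0.8 to 0.4 range mentioned above, and MoE routing is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hormones:
    """Hypothetical control signals, each expected in [0, 1]."""
    dopamine: float = 0.0       # exploration
    cortisol: float = 0.0       # constraint / stress
    noradrenaline: float = 0.0  # signal-to-noise sharpening


@dataclass(frozen=True)
class SamplingParams:
    """Hypothetical per-step denoising parameters."""
    temperature: float
    top_p: float
    entropy_budget: float


def _clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def modulate(base: SamplingParams, h: Hormones) -> SamplingParams:
    """Return parameters adjusted by the hormone levels (assumed gains)."""
    d = _clamp(h.dopamine, 0.0, 1.0)
    c = _clamp(h.cortisol, 0.0, 1.0)
    n = _clamp(h.noradrenaline, 0.0, 1.0)

    # Dopamine raises temperature, cortisol and noradrenaline lower it.
    # The result stays inside the 0.4 to 0.8 range of the static schedule.
    temperature = _clamp(
        base.temperature * (1.0 + 0.5 * (d - c) - 0.25 * n), 0.4, 0.8
    )
    # Cortisol and noradrenaline both prune the low-probability tail.
    top_p = _clamp(base.top_p * (1.0 - 0.3 * c - 0.2 * n), 0.1, 1.0)
    # Dopamine widens the entropy budget, cortisol tightens it.
    entropy_budget = max(0.0, base.entropy_budget * (1.0 + 0.5 * d - 0.5 * c))
    return SamplingParams(temperature, top_p, entropy_budget)


base = SamplingParams(temperature=0.6, top_p=0.95, entropy_budget=1.0)
creative = modulate(base, Hormones(dopamine=1.0))  # open-ended task
strict = modulate(base, Hormones(cortisol=1.0))    # code or mathematics
print(creative)
print(strict)
```

With all signals at zero, `modulate` returns the base parameters unchanged, so the static schedule remains the default behaviour.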

The "Cognitive Tear" (Homeostatic Flushing):
- Sometimes, bidirectional attention in a diffusion canvas gets trapped in local minima or contradictory states (token collisions).
- Instead of wasting compute cycles on a failing canvas, the orchestrator could trigger a "flushing event" (analogous to crying/excreting cognitive stress). If the entropy delta stalls over $N$ steps, the model purges the most unstable parts of the canvas, resets a portion of the KV cache, and restarts the denoising pass with a modified prior.
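
The flushing trigger described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `entropy_stalled`, `flush_unstable`, the `MASK` id, and the per-token confidence scores are hypothetical names, and the KV-cache reset and modified prior are not modelled here.

```python
from typing import List, Sequence

MASK = -1  # hypothetical mask-token id used to mark purged positions


def entropy_stalled(history: Sequence[float], n_steps: int, min_drop: float) -> bool:
    """True if canvas entropy fell by less than min_drop over the last n_steps."""
    if len(history) <= n_steps:
        return False  # not enough steps observed yet
    return history[-1 - n_steps] - history[-1] < min_drop


def flush_unstable(
    tokens: Sequence[int],
    confidence: Sequence[float],
    fraction: float,
    mask_id: int = MASK,
) -> List[int]:
    """Re-mask the least confident `fraction` of the canvas."""
    if len(tokens) != len(confidence):
        raise ValueError("tokens and confidence must have the same length")
    k = int(len(tokens) * fraction)
    # Indices of the k positions with the lowest confidence.
    worst = sorted(range(len(tokens)), key=lambda i: confidence[i])[:k]
    flushed = list(tokens)
    for i in worst:
        flushed[i] = mask_id
    return flushed


# Entropy per denoising step: progress stops after the third step.
history = [5.0, 3.0, 2.0, 1.95, 1.94, 1.94]
canvas = [11, 12, 13, 14]
scores = [0.9, 0.2, 0.8, 0.1]

if entropy_stalled(history, n_steps=3, min_drop=0.1):
    canvas = flush_unstable(canvas, scores, fraction=0.5)
print(canvas)
```

Re-masking only the lowest-confidence positions keeps the stable parts of the canvas as context for the restarted pass, which is the point of a partial purge rather than a full restart.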

Why this fits DiffusionGemma:

Because DiffusionGemma generates in parallel blocks (canvases of 256 tokens), it can revise any position on a canvas at any denoising step, which makes non-linear, holistic self-correction possible. Giving it an active, homeostatic feedback loop would transition it from a statically scheduled denoiser to a self-regulating cognitive agent.

I would love to hear your thoughts on whether dynamic, entropy-driven parameter modulation (neuromodulation) is something you are exploring for future iterations of the Gemma family.

Thank you for your inspiring work!

Best regards,
Soulfate
