GDN-primed-HQwen3-32B-Instruct
GDN-primed-HQwen3-32B-Instruct is a Hybrid language model consisting of 50% Attention layers and 50% Gated DeltaNet (GDN) layers, primed from Qwen3-32B using the Hybrid Model Factory Priming pipeline. The model is instruction-tuned and supports context lengths up to 128K tokens.
GDN is a State-Space Model layer with constant memory and linear compute cost in the sequence length.
By combining Attention with GDN, our Hybrid model achieves up to 2× faster inference at long contexts while closely matching the base Transformer's quality.
Why Hybrid?
Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:
- Higher throughput at long contexts — less memory on KV cache means more memory for batching
- More concurrent sequences — ~2× as many concurrent sequences before hitting memory limits
- Growing advantage with context length — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length
Increasing the hybridization ratio (replacing more Attention layers with SSM layers) further reduces memory and increases throughput, typically at the expense of some model quality.
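For intuition on the memory saving, here is a back-of-the-envelope sketch using the head configuration listed under Architecture Details (8 KV heads, head dimension 128, bfloat16 cache); actual memory use depends on the serving stack, paging granularity, and cache dtype:

```python
# Rough KV-cache size per sequence (illustrative only; real usage depends on
# the serving stack, paging granularity, and cache dtype).
def kv_cache_bytes(num_attn_layers, seq_len, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors kept per Attention layer
    return 2 * num_kv_heads * head_dim * dtype_bytes * num_attn_layers * seq_len

seq_len = 128 * 1024
transformer = kv_cache_bytes(num_attn_layers=64, seq_len=seq_len)  # base Transformer
hybrid = kv_cache_bytes(num_attn_layers=32, seq_len=seq_len)       # 50% Hybrid; GDN state is fixed-size

print(f"Transformer KV cache: {transformer / 2**30:.0f} GiB per sequence")  # ~32 GiB
print(f"Hybrid KV cache:      {hybrid / 2**30:.0f} GiB per sequence")       # ~16 GiB
```

Halving the per-sequence KV footprint is what allows roughly twice as many concurrent sequences within the same memory budget.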
Model Overview
- Type: Causal Language Model (Hybrid Attention + SSM)
- Base Model: Qwen3-32B
- Hybrid Layer Type: Gated DeltaNet (GDN)
- Hybrid Ratio: 50% (32 Attention + 32 GDN layers)
- Parameters: ~32B
- Context Length: 128K natively
- Precision: bfloat16
- License: Apache 2.0
Note: this is an instruction-tuned model, not a thinking model; it does not natively produce chain-of-thought "thinking" tokens in its generation trace.
Benchmark Results
Below we report benchmark performance for all our instruct-tuned Primed models. All Hybrid models use a 50% Hybrid ratio and are Primed from Qwen3-32B.
We consider the following Transformer as a baseline:
- Qwen3-32B (Long): The Qwen model fine-tuned on our priming data, extending its native context length from 32K to 128K. All Primed Hybrid models use the same training hyperparameters and data as this baseline, making it a fair comparison for differing architectures.
On both long- and short-context benchmarks, our Primed Hybrid models closely match the performance of the Transformer model while having considerably lower deployment costs, showcasing the efficacy of the Priming process.
Long-Context Benchmarks
Evaluated on HELMET, MRCR, and BABILong across context lengths from 8K to 128K, using a weighted average with geometrically increasing weights for longer contexts.
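The exact weighting scheme is described in the evaluation setup; as a minimal illustration of the idea, the sketch below assumes weights that double with each doubling of context length (the ratio of 2 is an assumption for illustration, not necessarily the exact scheme used):

```python
def weighted_long_context_average(scores, ratio=2.0):
    """Weighted average over per-context-length scores ordered from shortest
    (8K) to longest (128K). Weights grow geometrically so longer contexts
    count more; ratio=2.0 is an illustrative assumption."""
    weights = [ratio ** i for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```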
The plot below shows performance averaged over context lengths from 8K to 128K.
How close are the Hybrid models to the Transformer baseline on long-context tasks? Primed GKA and GDN Hybrids are competitive on long-context capabilities, trailing the Transformer [Qwen3-32B (Long)] by ~2.5–3 points on average while being 1.5–2× faster at inference on long contexts.
Short-Context NLP Benchmarks
Evaluated on the Tulu3-dev suite from OLMES. All tasks use short context lengths (≤ 8K). Each category in the table below averages the following Tulu3-dev subtasks:
- Math: GSM8K, MATH.
- Knowledge: MMLU, PopQA, TruthfulQA.
- Coding: HumanEval, HumanEval+.
- Reasoning: BigBenchHard.
- Instruction Following: IFEval.
| Model | Math | Knowledge | Coding | Reasoning | Instruction Following | Average |
|---|---|---|---|---|---|---|
| Qwen3-32B (Long) | 74.43 | 54.47 | 94.54 | 82.89 | 81.52 | 77.56 |
| GKA-primed-HQwen3-32B-Instruct | 74.02 | 53.95 | 93.43 | 80.31 | 78.74 | 76.09 |
| GDN-primed-HQwen3-32B-Instruct | 73.65 | 54.35 | 94.40 | 80.99 | 79.30 | 76.54 |
How close are the Hybrid models to the Transformer baseline on short-context tasks? Our Primed Hybrid models come within ~1–1.5 points of the Transformer's [Qwen3-32B (Long)] average performance, despite being primed with <0.5% of the base Transformer's pre-training token budget.
For applications to complex reasoning and coding problems check out our Primed Hybrid Reasoning models.
About Gated DeltaNet (GDN)
Gated DeltaNet is a State-Space Model layer with diagonal-plus-low-rank transition dynamics. It extends Mamba2 with the delta update rule, combining Mamba2's gated state decay with targeted, error-correcting memory updates, which improves expressiveness while retaining Mamba2's efficiency.
For more details, see the GDN paper.
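For intuition, below is a minimal per-token sketch of the gated delta-rule state update (a simplified reference recurrence; the actual layer uses chunked, hardware-efficient kernels and additional per-head parameterization not shown here):

```python
import torch

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One recurrent step of the gated delta rule for a single head.
    S: (d_v, d_k) state matrix; q, k: (d_k,) query/key; v: (d_v,) value;
    alpha, beta: scalars in (0, 1) (decay gate and write strength)."""
    # Gated decay plus delta-rule erase: S <- alpha * S @ (I - beta * k k^T)
    S = alpha * (S - beta * torch.outer(S @ k, k))
    # Write the new key-value association: S <- S + beta * v k^T
    S = S + beta * torch.outer(v, k)
    # Read out with the query
    o = S @ q
    return S, o
```

Because the state S has a fixed size independent of the sequence length, both memory and per-token compute stay constant as the context grows.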
Architecture Details
| Component | Details |
|---|---|
| Number of Layers | 64 (32 Attention + 32 GDN) |
| Hidden Dimension | 5120 |
| Attention Heads | 64 (Q) / 8 (KV) |
| Head Dimension | 128 |
| Intermediate Dimension (FFN) | 25600 |
| Vocabulary Size | 151,936 |
| Position Encoding | RoPE (θ = 5,000,000) |
| Layer Layout | GDN layer indices were selected with our selective hybridization procedure |
Inference Efficiency
Sustained decode throughput (tokens/s) on 8× H200 GPUs (TP=8), measured during pure decode with a saturated KV cache. Benchmarked with random data (no prefix-caching benefits). See the full Inference guide for methodology and additional models.
| Model | 16K | 32K | 64K | 128K |
|---|---|---|---|---|
| GDN-primed-HQwen3-32B-Instruct | 8,133 (1.53×) | 4,876 (1.70×) | 2,688 (2.06×) | 1,238 (2.11×) |
| GKA-primed-HQwen3-32B (num_iter=30, default) | 6,810 (1.29×) | 4,152 (1.45×) | 2,385 (1.82×) | 1,168 (1.99×) |
| Qwen3-32B (Long) | 5,299 | 2,865 | 1,308 | 586 |
Mean TTFT at the Transformer's saturated batch size (the Hybrid models still have memory to spare at this batch size):
| Model | 16K | 32K | 64K | 128K |
|---|---|---|---|---|
| GDN-primed-HQwen3-32B-Instruct | 42,492 ms (1.08×) | 48,417 ms (1.00×) | 57,525 ms (0.88×) | 73,145 ms (0.77×) |
| GKA-primed-HQwen3-32B (num_iter=30, default) | 52,053 ms (1.32×) | 58,613 ms (1.21×) | 68,241 ms (1.05×) | 84,935 ms (0.90×) |
| Qwen3-32B (Long) | 39,421 ms | 48,527 ms | 65,104 ms | 94,479 ms |
The decode throughput advantage grows with context length — from 1.53× at 16K to 2.11× at 128K — thanks to GDN layers maintaining a fixed-size recurrent state instead of a growing KV cache. TTFT crosses over at 32K and reaches 0.77× (23% faster) at 128K.
Usage
With vLLM (recommended)
Install the Hybrid Model Factory vLLM plugin in your local environment, then serve:
```bash
vllm serve amazon/GDN-primed-HQwen3-32B-Instruct \
  --enable-prefix-caching \
  --mamba-cache-mode align \
  --mamba-cache-dtype float32 \
  --mamba-ssm-cache-dtype float32
```
Query the server:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon/GDN-primed-HQwen3-32B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Linear Attention in the context of LLMs?"}
    ]
  }'
```
The `--mamba-cache-dtype float32` and `--mamba-ssm-cache-dtype float32` flags are important for accurate long-context generation. See the Inference guide for details on all recommended flags.
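The same endpoint can also be queried from Python with any OpenAI-compatible client, for example (assuming the openai package is installed and the server was started without an API key):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is a placeholder unless the
# server was launched with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="amazon/GDN-primed-HQwen3-32B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Linear Attention in the context of LLMs?"},
    ],
)
print(response.choices[0].message.content)
```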
With Hugging Face Transformers
See the Inference guide for details on when we recommend the Hugging Face Transformers implementation as opposed to the highly optimized vLLM one.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

import hmf.model.hybrid_zoo.models.model_register  # Register Hybrid models

model = AutoModelForCausalLM.from_pretrained(
    "amazon/GDN-primed-HQwen3-32B-Instruct",
    torch_dtype="auto",  # load in the checkpoint's bfloat16 precision
    trust_remote_code=True,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("amazon/GDN-primed-HQwen3-32B-Instruct")

messages = [{"role": "user", "content": "What is linear attention in the context of LLMs?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training-Free Context Extension
This model supports training-free context extension to 2–4× its native context length via an extension of PICASO cache composition to Hybrid models. See the State Composition guide for usage. Note that this is currently supported in Hugging Face Transformers only.
Training data
These models were produced through the multi-stage Priming pipeline from Hybrid Model Factory. Training data spans web documents, mathematics, long-context documents, and instruction-following and reasoning examples — each targeting a different capability axis. This diversity is critical: it allows the Priming procedure to convert a base Transformer into a more memory- and compute-efficient Hybrid architecture at nearly the same level of performance, using <0.5% of the base Transformer model's pre-training token budget.
Responsible AI Considerations
At Amazon, we are committed to developing AI responsibly and take a people-centric approach that prioritizes education, science, and our customers, to integrate responsible AI across the end-to-end AI lifecycle. We believe the use of AI must respect the rule of law and human rights, and we encourage the safe and responsible development of AI. When downloaded or used in accordance with AWS Responsible AI Policy, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or Amazon AI Concerns here.
Citation
```bibtex
@software{hybrid_model_factory,
  title = {Hybrid Model Factory},
  year  = {2026},
  url   = {https://github.com/awslabs/hybrid-model-factory}
}

@misc{yang2025gateddeltanetworksimproving,
  title         = {Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author        = {Songlin Yang and Jan Kautz and Ali Hatamizadeh},
  year          = {2025},
  eprint        = {2412.06464},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2412.06464},
}
```
License
This model is licensed under the Apache 2.0 License.