Qwen3-8B-DMS-8x

Description:

Qwen3-8B-DMS-8x is a derivative of Qwen3-8B that integrates Dynamic Memory Sparsification (DMS) with an 8x compression ratio during inference. DMS adaptively sparsifies the KV cache to reduce memory footprint and improve throughput and latency for long-context and reasoning generations. The method learns per-head eviction policies that interpolate between a sliding window over the last 512 tokens and full attention. Inference-time code is provided with the checkpoint.
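
To give a sense of what an 8x compression ratio means in memory terms, the sketch below estimates the KV cache size for a single sequence. The architecture values (36 layers, 8 key-value heads, head dimension 128, bfloat16 storage) are assumptions taken from the public Qwen3-8B configuration rather than from this card, and the calculation idealizes DMS as retaining exactly 1/8 of the cache entries.

```python
def kv_cache_bytes(num_tokens, compression_ratio=1.0, num_layers=36,
                   num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate the KV cache size in bytes for one sequence.

    The default architecture values are assumptions (public Qwen3-8B config).
    """
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token / compression_ratio


GIB = 1024 ** 3
dense = kv_cache_bytes(32_768)
compressed = kv_cache_bytes(32_768, compression_ratio=8.0)
print(f"dense: {dense / GIB:.2f} GiB, 8x compressed: {compressed / GIB:.2f} GiB")
```

Under these assumptions, a full 32,768-token context needs about 4.5 GiB of cache uncompressed and roughly 0.56 GiB at 8x. The actual savings depend on the learned per-head eviction behavior.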

This model is for research and development only.

License/Terms of Use:

This model is released under the NVIDIA License.

Under the NVIDIA License, NVIDIA confirms:

Models can be used noncommercially, which means for non-commercial research and educational purposes only.

Deployment Geography:

Global

Use Case:

A compact, general-purpose LLM with advanced reasoning capabilities, optimized for inference-time scaling and reduced key-value (KV) cache memory footprint.

Release Date:

Hugging Face on Jan 19th, 2026 via https://huggingface.co/nvidia/Qwen3-8B-DMS-8x

References:

Suggested citation:

@misc{lancucki2025inferencetime,
      title={Inference-Time Hyper-Scaling with KV Cache Compression},
      author={Adrian Łańcucki and Konrad Staniszewski and Piotr Nawrot and Edoardo M. Ponti},
      year={2025},
      eprint={2506.05345},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.05345},
}

Model Architecture:

Architecture Type: Autoregressive Transformer
Network Architecture: Qwen3
Base Model: Qwen3-8B
Number of Model Parameters: 8.2B

Input:

Input Type(s): Text
Input Format: String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: The same chat template used for Qwen3-8B should be applied.

Output:

Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Native context length 32,768, up to 131,072 tokens with YaRN
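
As an illustration of the YaRN extension mentioned above: upstream Qwen3 documentation enables YaRN through a `rope_scaling` entry in the model configuration. The sketch below only computes that entry. Whether this checkpoint ships with YaRN already enabled, and whether its DMS code reads the entry in the same way, are assumptions not confirmed by this card.

```python
def yarn_rope_scaling(target_context, native_context=32_768):
    """Build a YaRN rope_scaling entry for the requested context length.

    Key names follow the upstream Qwen3 convention (an assumption here).
    """
    if target_context < native_context:
        raise ValueError("target_context must be at least native_context")
    return {
        "rope_type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


print(yarn_rope_scaling(131_072))
```

Going from 32,768 to 131,072 tokens corresponds to a scaling factor of 4.0.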

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): HuggingFace Transformers

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Hopper

Preferred/Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

Qwen3-8B-DMS-8x

Quick Start and Usage Recommendations:

To run the model, transformers==4.57.3, torch, and flash-attn are required.

python3 -m venv venv
source venv/bin/activate
pip3 install transformers==4.57.3
pip3 install accelerate  # (optional) for device placement
pip3 install torch
pip3 install flash-attn --no-build-isolation
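
After installation, a quick check that the required packages are visible to the interpreter can catch environment problems early. The helper below is a hypothetical convenience, not part of the released code. It uses only the standard library.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# An empty list means all three requirements were found.
print(missing_packages(["transformers", "torch", "flash-attn"]))
```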

Model weights and the corresponding DMS-adapted Qwen3 code are available at the HuggingFace hub: nvidia/Qwen3-8B-DMS-8x.

To download and load the model, you can use the following snippet:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Qwen3-8B-DMS-8x",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # The `trust_remote_code` part is important
    # as otherwise Qwen3 code (without DMS) will be loaded
    trust_remote_code=True
)
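
Because loading without `trust_remote_code=True` silently falls back to the stock Qwen3 implementation, it can be useful to confirm which code was actually loaded. The sketch below relies on the fact that Transformers places Hub-hosted model code under the `transformers_modules` package; the helper name is hypothetical.

```python
def loaded_from_remote_code(obj):
    """Return True if the class of `obj` was defined by Hub-hosted code."""
    # Transformers imports remote code under the `transformers_modules` package.
    return type(obj).__module__.split(".")[0] == "transformers_modules"


# Usage with the model loaded above (left as a comment so this block
# runs without downloading the checkpoint):
# assert loaded_from_remote_code(model), "DMS code was not loaded"
```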

The rest follows the standard Qwen3 usage pattern. To ask the model to solve a quadratic equation, one can use the following snippet:

conversation = [
    {"role": "user", "content": "Solve: x^2 -2x + 1 = 0"}
]
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)

streamer = TextStreamer(tokenizer, skip_prompt=False)
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    streamer=streamer,
    max_new_tokens=2048
)
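
With thinking enabled, the generated text contains a reasoning trace followed by the final answer. The sketch below separates the two in the decoded string. It assumes that the trace is closed by a `</think>` tag, as in upstream Qwen3, and that the tag survives decoding. The helper name is hypothetical.

```python
def split_thinking(text, close_tag="</think>"):
    """Split decoded output into (reasoning, answer) at the closing think tag."""
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()


# Usage with the objects from the snippet above (left as comments so this
# block runs without the model):
# new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
# text = tokenizer.decode(new_tokens, skip_special_tokens=True)
# reasoning, answer = split_thinking(text)

print(split_thinking("<think>\nFactor: (x - 1)^2 = 0\n</think>\n\nx = 1"))
```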

Training and Evaluation Datasets:

Training Dataset:

Link: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
Data Modality: Text
Text Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset: Hybrid: Human, Automated
Labeling Method by dataset: Automated
Properties: The dataset contains 220k math problems from NuminaMath 1.5 along with a number of reasoning traces generated by DeepSeek R1 for each problem, some of which were automatically verified with Math Verify and Llama 3.3 70B Instruct. After tokenization, the entire dataset contains 36M tokens. We use only the traces which are labelled as correctly verified.

Evaluation Dataset:

Data Collection Method: Hybrid: Human, Synthetic, Automated
Labeling Method: Hybrid: Human, Synthetic, Automated

Evaluation Results:

We evaluate the model using temperature=0.6 and top_p=0.95 with a sequence length limit of 131072 tokens.
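
For readers who want to approximate these settings with `model.generate`, the sketch below builds the corresponding keyword arguments. It assumes that the sequence length limit covers prompt and generated tokens together, which this card does not state explicitly. `do_sample=True` is included because `generate` ignores `temperature` and `top_p` under greedy decoding.

```python
def generation_kwargs(prompt_tokens, sequence_limit=131_072):
    """Build sampling arguments matching the evaluation settings above."""
    if prompt_tokens >= sequence_limit:
        raise ValueError("prompt already fills the sequence length limit")
    return {
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.95,
        # Assumption: the limit counts prompt and generated tokens together.
        "max_new_tokens": sequence_limit - prompt_tokens,
    }


# Example: model.generate(**model_inputs, **generation_kwargs(prompt_len))
print(generation_kwargs(1_000))
```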

Benchmark	Thinking	Qwen3-8B	Qwen3-8B-DMS-8x
GPQA Diamond	y	58.8	57.6
MMLU-Pro	y	74.2	73.5
AIME 2024	y	75.0	73.0
MATH-500	y	95.1	95.5
HumanEval	y	87.8	89.6
IFEval	y	90.3	88.8
ArenaHard v0.1	y	88.4	89.7
RULER 64K	n	69.2	76.2
RULER 128K	n	25.0	21.4

AIME 2024 results were averaged over 10 runs (different seeds) and MATH-500 over 3; MMLU-Pro uses micro-averaging.
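
To make the averaging terminology concrete, the sketch below contrasts micro-averaging, which pools all questions before computing accuracy, with macro-averaging, which averages per-category accuracies. The category names and counts are made up for illustration and are not MMLU-Pro data.

```python
def micro_average(per_category):
    """Accuracy in percent over all questions pooled across categories."""
    correct = sum(c for c, _ in per_category.values())
    total = sum(n for _, n in per_category.values())
    return 100.0 * correct / total


def macro_average(per_category):
    """Unweighted mean in percent of the per-category accuracies."""
    accuracies = [c / n for c, n in per_category.values()]
    return 100.0 * sum(accuracies) / len(accuracies)


# Hypothetical (correct, total) counts per category.
counts = {"math": (90, 100), "law": (10, 20)}
print(f"micro: {micro_average(counts):.1f}, macro: {macro_average(counts):.1f}")
```

With micro-averaging, larger categories carry proportionally more weight, so the two figures can differ noticeably when category sizes are unbalanced.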

Inference:

Acceleration Engine: HuggingFace Transformers
Test Hardware: H100 PCIe/SXM

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.