DilatedQwen3-0.6B

A Qwen3-0.6B checkpoint repackaged as a custom architecture (model_type: dilated_qwen3) with a non-standard attention pattern. The weights are vanilla Qwen3-0.6B; only how attention is computed changes.

This is a self-contained HuggingFace bundle: it loads with trust_remote_code=True and does not depend on any external repo.

Attention mechanism

Standard Qwen3 self-attention is replaced by a local-dense + dilated long-range causal pattern. Write delta = i - j for the causal distance from query position i to key position j (delta >= 0). Query i attends to key j if and only if:

delta < local_window        # dense local window: every recent token
  OR  delta % dilation == 0  # dilated long range: every dilation-th token

So the most recent local_window tokens are attended in full, and everything older is attended at a stride of dilation, all the way back to the start of the sequence. Both parts are causal.
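The rule above can be written as a small predicate. This is a minimal sketch in plain Python, not the code from modeling_dilated_qwen3.py; the function names attends and build_mask are illustrative assumptions.

```python
def attends(i: int, j: int, local_window: int = 128, dilation: int = 2) -> bool:
    """Return True if query position i may attend to key position j."""
    delta = i - j
    if delta < 0:  # future key: never attended (causal)
        return False
    # Dense local window OR strided long-range key.
    return delta < local_window or delta % dilation == 0


def build_mask(seq_len: int, local_window: int = 128, dilation: int = 2) -> list[list[bool]]:
    """Dense seq_len x seq_len boolean mask; mask[i][j] is True where attended."""
    return [
        [attends(i, j, local_window, dilation) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = build_mask(12, local_window=6, dilation=2)
# Key 0 is skipped for query 7 (delta = 7, odd); key 1 is kept (delta = 6, even).
print(mask[7][0], mask[7][1])  # False True
```

A dense boolean mask like this is only practical for checking correctness on short sequences; it says nothing about how an optimized kernel should lay out the computation.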

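Defaults: local_window = 128, dilation = 2. Setting dilation = 1 recovers standard causal attention, because every delta is then a multiple of dilation. Sequences of at most local_window tokens also get full causal attention, since every delta then falls inside the local window.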
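For profiling it helps to know how many keys each query actually touches. The sketch below derives a closed-form count from the rule (the local window plus the multiples of dilation among the distances local_window..i) and checks it against a brute-force count. The function names are illustrative assumptions, not part of the bundle.

```python
def attended_keys(i: int, local_window: int = 128, dilation: int = 2) -> int:
    """Number of keys that query position i attends to (closed form)."""
    if i < local_window:
        return i + 1  # every causal key is inside the local window
    # Multiples of `dilation` among the distances local_window..i.
    strided = i // dilation - (local_window - 1) // dilation
    return local_window + strided


def attended_keys_bruteforce(i: int, local_window: int = 128, dilation: int = 2) -> int:
    """Same count, by applying the rule to every causal distance 0..i."""
    return sum(
        1
        for delta in range(i + 1)
        if delta < local_window or delta % dilation == 0
    )


for i in (0, 127, 128, 4095):
    assert attended_keys(i) == attended_keys_bruteforce(i)

print(attended_keys(4095), "of", 4096)  # 2112 of 4096
```

With the defaults, the attended fraction approaches 1/dilation for long contexts, so the upper bound on attention work saved relative to full causal attention is roughly a factor of two.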

Mask for local_window = 6, dilation = 2 (# = attended, row = query i, column = key j):

     j: 0123456789...
 i= 0   #
 i= 1   ##
 i= 2   ###
 i= 3   ####
 i= 4   #####
 i= 5   ######          <- still inside the local window: dense
 i= 6   #######
 i= 7   .#######         <- past the window: oldest key now skipped (stride 2)
 i= 8   #.#######
 i= 9   .#.#######
 i=10   #.#.#######
 i=11   .#.#.#######

The right-hand ####### run is the dense local window of 6 keys plus the adjacent key at delta = 6, which is attended because 6 % 2 == 0; the #.#. prefix is the rest of the dilated long-range tail.
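The diagram can be regenerated for any setting with a few lines of Python. This is a sketch for inspection only; render_mask is a hypothetical helper, not a function shipped in the bundle.

```python
def render_mask(seq_len: int, local_window: int, dilation: int) -> list[str]:
    """One string per query row: '#' = attended key, '.' = skipped key."""
    rows = []
    for i in range(seq_len):
        cells = []
        for j in range(i + 1):  # causal: only keys j <= i are drawn
            delta = i - j
            attended = delta < local_window or delta % dilation == 0
            cells.append("#" if attended else ".")
        rows.append("".join(cells))
    return rows


for i, row in enumerate(render_mask(12, local_window=6, dilation=2)):
    print(f" i={i:2d}   {row}")
```

Printing the rows for other values of local_window and dilation is a quick way to sanity-check a custom kernel's mask against the reference rule.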

The take-home task

  1. Register this architecture (custom attention) with vLLM.
  2. Profile it end-to-end.
  3. Optimize end-to-end performance.
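Step 1 could start from vLLM's out-of-tree registration hook, ModelRegistry.register_model. The sketch below is an assumption-heavy outline: the architecture string and the module path are hypothetical placeholders that this card does not specify, and the vLLM-side model class still has to be written.

```python
# Hypothetical names: neither the architecture string nor the module path
# below is given in this card. The architecture string has to match the
# "architectures" entry in config.json, and the module path has to point at
# a vLLM-compatible implementation of the model.
ARCH_NAME = "DilatedQwen3ForCausalLM"
VLLM_IMPL = "dilated_qwen3_vllm.model:DilatedQwen3ForCausalLM"


def register_dilated_qwen3() -> None:
    """Register the custom architecture with vLLM."""
    # Imported lazily so this module can be loaded without vLLM installed.
    from vllm import ModelRegistry

    # Skip if already registered, e.g. when the hook runs more than once.
    if ARCH_NAME not in ModelRegistry.get_supported_archs():
        # The "module:Class" string form defers importing the model class
        # until vLLM actually needs it.
        ModelRegistry.register_model(ARCH_NAME, VLLM_IMPL)
```

The function would be called before constructing the engine, or exposed through a vllm.general_plugins entry point so that vLLM worker processes run it as well.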

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "DilatedQwen3-0.6B", trust_remote_code=True
)
tok = AutoTokenizer.from_pretrained("DilatedQwen3-0.6B")

trust_remote_code=True is required: model_type="dilated_qwen3" is unknown to transformers, so the architecture must be loaded from the local modeling_dilated_qwen3.py (and registered explicitly in vLLM).

Files

| File | Purpose |
| --- | --- |
| configuration_dilated_qwen3.py | Config (local_window, dilation) |
| modeling_dilated_qwen3.py | Model + the local-dense / dilated long-range attention |
| config.json | auto_map → local files |
| model.safetensors | Weights (Qwen3-0.6B, 596M params) |
| tokenizer files | Qwen3 tokenizer |