# Qwen3-Coder-30B-A3B-Instruct-RTPurbo

## Model Overview
- **Model Optimizations:**
  - **Sliding Window Attention:** 85% of heads
  - **Full Attention:** 15% of heads
- **Version:** 1.0

<img src="./headwise.png" alt="screenshot">

RTPurbo uses hybrid HeadWise Attention to compress the Qwen3-Coder model. Specifically, it partitions the attention heads into two groups by attention type:

1.  **Retrieval Heads**: These heads perform **Full Attention** over the entire sequence (or a large chunk of it), allowing them to capture rich long-range dependencies and act as a powerful information-retrieval component.
2.  **Non-Retrieval Heads**: These heads use **Sink + Sliding-Window Attention (SWA)**, processing tokens in a sliding-window or fixed-cache manner. They are highly efficient and ideal for very long sequences while still maintaining local context.
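The head split above can be sketched as a set of per-head boolean attention masks. The snippet below is a minimal illustration only: the head partition, window size, and sink size are made-up values for demonstration, not the model's actual configuration, and the released model uses its own optimized kernels rather than explicit masks.

```python
import torch

def headwise_attention_mask(seq_len, num_heads, retrieval_heads,
                            window=4096, sink=4):
    """Build a per-head boolean attention mask (True = may attend).

    Retrieval heads get full causal attention. The remaining heads get
    sink + sliding-window attention: each query attends to the first
    `sink` tokens plus the most recent `window` tokens.
    All sizes here are illustrative, not the shipped configuration.
    """
    q = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = k <= q                          # standard causal mask
    swa = causal & ((q - k < window) | (k < sink))  # sink + window
    mask = torch.empty(num_heads, seq_len, seq_len, dtype=torch.bool)
    for h in range(num_heads):
        mask[h] = causal if h in retrieval_heads else swa
    return mask

mask = headwise_attention_mask(seq_len=8, num_heads=4,
                               retrieval_heads={0}, window=3, sink=1)
# Head 0 (retrieval): the last query can attend to all 8 keys.
# Heads 1-3 (sink SWA): the last query sees token 0 plus the last 3 keys.
print(mask[0, -1].sum().item(), mask[1, -1].sum().item())  # 8 4
```

Such a mask could be passed per head to an attention implementation that supports additive or boolean masks; in practice the efficiency win comes from never materializing the out-of-window keys at all.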

The following code can be used for inference; HeadWise attention is triggered when the sequence length exceeds 16,384 tokens.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

model_name = "RTP-LLM/Qwen3-Coder-30B-A3B-Instruct-RTPurbo"

tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# Prepare the model input
prompt = "Write a quick sort algorithm."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a completion and strip the prompt tokens from the output
generated_ids = model.generate(**model_inputs, max_new_tokens=128)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```

## Evaluation

This model was evaluated with the [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) benchmark harness, using [Qwen3-Coder-30B-A3B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct) as the full-attention baseline.

<table style="border-collapse:collapse; border-top:2px solid #000; border-bottom:2px solid #000;">
  <thead>
    <tr style="border-bottom:2px solid #000;">
      <th align="center" style="padding:8px 14px;">Longbench</th>
      <th align="center" style="padding:8px 14px;">lcc</th>
      <th align="center" style="padding:8px 14px;">repo-p</th>
      <th align="center" style="padding:8px 14px;">samsum</th>
      <th align="center" style="padding:8px 14px;">trec</th>
      <th align="center" style="padding:8px 14px;">lsht</th>
      <th align="center" style="padding:8px 14px;">2wikim</th>
      <th align="center" style="padding:8px 14px;">hotpot</th>
      <th align="center" style="padding:8px 14px;">multi-en</th>
      <th align="center" style="padding:8px 14px;">multi-zh</th>
      <th align="center" style="padding:8px 14px;">musique</th>
      <th align="center" style="padding:8px 14px;">qasper</th>
      <th align="center" style="padding:8px 14px;">vcsum</th>
      <th align="center" style="padding:8px 14px;">qmsum</th>
      <th align="center" style="padding:8px 14px;">PR-en</th>
      <th align="center" style="padding:8px 14px;">PR-zh</th>
      <th align="center" style="padding:8px 14px;">Avg. (%)</th>
    </tr>
    <tr style="border-bottom:2px solid #000;">
      <th align="center" colspan="17" style="padding:10px 14px;">Qwen3-Coder-30B-A3B</th>
    </tr>
  </thead>

  <tbody>
    <tr style="border-bottom:2px solid #000;">
      <td align="center" style="padding:8px 14px;"><b>Full Attn</b></td>
      <td align="center" style="padding:8px 14px;">34.34</td>
      <td align="center" style="padding:8px 14px;">27.14</td>
      <td align="center" style="padding:8px 14px;">45.80</td>
      <td align="center" style="padding:8px 14px;">81.00</td>
      <td align="center" style="padding:8px 14px;">47.50</td>
      <td align="center" style="padding:8px 14px;">42.08</td>
      <td align="center" style="padding:8px 14px;">57.64</td>
      <td align="center" style="padding:8px 14px;">52.89</td>
      <td align="center" style="padding:8px 14px;">65.99</td>
      <td align="center" style="padding:8px 14px;">38.30</td>
      <td align="center" style="padding:8px 14px;">39.25</td>
      <td align="center" style="padding:8px 14px;">13.55</td>
      <td align="center" style="padding:8px 14px;">23.77</td>
      <td align="center" style="padding:8px 14px;">99.00</td>
      <td align="center" style="padding:8px 14px;">99.75</td>
      <td align="center" style="padding:8px 14px;">51.20</td>
    </tr>
    <tr style="border-bottom:2px solid #000;">
      <td align="center" style="padding:8px 14px;"><b>RTPurbo</b></td>
      <td align="center" style="padding:8px 14px;">35.96</td>
      <td align="center" style="padding:8px 14px;">35.21</td>
      <td align="center" style="padding:8px 14px;">46.49</td>
      <td align="center" style="padding:8px 14px;">81.00</td>
      <td align="center" style="padding:8px 14px;">49.00</td>
      <td align="center" style="padding:8px 14px;">47.39</td>
      <td align="center" style="padding:8px 14px;">55.44</td>
      <td align="center" style="padding:8px 14px;">52.93</td>
      <td align="center" style="padding:8px 14px;">65.23</td>
      <td align="center" style="padding:8px 14px;">35.58</td>
      <td align="center" style="padding:8px 14px;">39.78</td>
      <td align="center" style="padding:8px 14px;">13.80</td>
      <td align="center" style="padding:8px 14px;">23.68</td>
      <td align="center" style="padding:8px 14px;">99.00</td>
      <td align="center" style="padding:8px 14px;">99.75</td>
      <td align="center" style="padding:8px 14px;">52.02</td>
    </tr>
  </tbody>
</table>
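As a sanity check, the Avg. (%) column matches the unweighted mean of the 15 per-task scores (assuming a simple arithmetic average, which the numbers bear out):

```python
# Per-task LongBench scores copied from the table above (lcc ... PR-zh).
full_attn = [34.34, 27.14, 45.80, 81.00, 47.50, 42.08, 57.64, 52.89,
             65.99, 38.30, 39.25, 13.55, 23.77, 99.00, 99.75]
rtpurbo = [35.96, 35.21, 46.49, 81.00, 49.00, 47.39, 55.44, 52.93,
           65.23, 35.58, 39.78, 13.80, 23.68, 99.00, 99.75]

def avg(scores):
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)

print(avg(full_attn), avg(rtpurbo))  # 51.2 52.02
```

RTPurbo thus matches or slightly exceeds the full-attention baseline on average despite 85% of heads using sliding-window attention.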

## Media Coverage

Our work has been featured by **Synced (机器之心)**; see the [article](https://mp.weixin.qq.com/s/wFAJ6oG1CsKBJiCBE45BsQ) for details.