Commit b21df42 · verified · committed by smy111 · 1 Parent(s): 9377294

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +40 -0
README.md CHANGED
@@ -13,6 +13,46 @@ RTPurbo uses hybrid HeadWise Attention to compress the Qwen3Coder model. Specifi
  1. **Retrieval Heads**: These heads perform **Full Attention** over the entire sequence (or a large chunk), allowing them to capture rich, long-range dependencies and act as a powerful information retrieval component.
  2. **Non-Retrieval Heads**: These heads use **Sink SWA Attention**, processing tokens in a sliding-window or fixed-cache manner. They are highly efficient and well suited to very long sequences while maintaining local context.
 
+ The following code can be used for inference. HeadWise attention is triggered when the sequence length (SeqLen) exceeds 16,384.
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
+
+ model_name = "RTP-LLM/Qwen3-Coder-30B-A3B-Instruct-RTPurbo"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     config=config,
+     trust_remote_code=True,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+
+ # prepare the model input
+ prompt = "Write a quick sort algorithm."
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # generate the completion
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=128
+ )
+ # keep only the newly generated tokens (drop the prompt tokens)
+ output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+
+ content = tokenizer.decode(output_ids, skip_special_tokens=True)
+
+ print("content:", content)
+ ```
+
  ## Evaluation
 
  This model was evaluated in the [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) benchmark using [Qwen3-Coder-30B-A3B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct) as evaluator.
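
The head-wise split described in the README text above can be illustrated with boolean attention masks. The following is a minimal sketch and not the RTPurbo implementation: the function names, the number of sink tokens, the window size, and the choice of which head is a retrieval head are all assumptions made for illustration. Only the 16,384 threshold comes from the README.

```python
from typing import List, Set

# From the README: HeadWise is triggered when SeqLen > 16,384.
HEADWISE_THRESHOLD = 16_384


def headwise_active(seq_len: int, threshold: int = HEADWISE_THRESHOLD) -> bool:
    """Return True when the sequence is long enough for the hybrid scheme."""
    return seq_len > threshold


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sink_swa_mask(seq_len: int, num_sink: int, window: int) -> List[List[bool]]:
    """Sink + sliding window: query i sees the first `num_sink` keys and the
    most recent `window` keys (itself included), and never a future key."""
    return [
        [j <= i and (j < num_sink or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def head_masks(
    seq_len: int,
    retrieval_heads: Set[int],
    num_heads: int,
    num_sink: int = 1,   # hypothetical value, not the model's real setting
    window: int = 3,     # hypothetical value, not the model's real setting
) -> List[List[List[bool]]]:
    """One mask per head: full causal for retrieval heads, sink + window otherwise."""
    full = causal_mask(seq_len)
    local = sink_swa_mask(seq_len, num_sink, window)
    return [full if h in retrieval_heads else local for h in range(num_heads)]


if __name__ == "__main__":
    # Head 0 is treated as a retrieval head here purely for illustration.
    masks = head_masks(seq_len=8, retrieval_heads={0}, num_heads=2)
    for name, mask in zip(("retrieval", "non-retrieval"), masks):
        print(name)
        for row in mask:
            print("".join("x" if allowed else "." for allowed in row))
```

In this toy setup a non-retrieval head attends to at most `num_sink + window` keys per query regardless of sequence length, while a retrieval head attends to every earlier position. That difference is where the efficiency gain on long sequences would come from.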