---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-Next/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- heretic
- uncensored
- decensored
- abliterated
---
# This is a decensored version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next), made using [Heretic](https://github.com/p-e-w/heretic) v1.2.0

## Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| **direction_index** | 39.45 |
| **attn.o_proj.max_weight** | 1.98 |
| **attn.o_proj.max_weight_position** | 44.55 |
| **attn.o_proj.min_weight** | 1.58 |
| **attn.o_proj.min_weight_distance** | 34.05 |
| **mlp.down_proj.max_weight** | 1.91 |
| **mlp.down_proj.max_weight_position** | 28.82 |
| **mlp.down_proj.min_weight** | 1.02 |
| **mlp.down_proj.min_weight_distance** | 10.46 |

## Performance

| Metric | This model | Original model ([Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next)) |
| :----- | :--------: | :---------------------------: |
| **KL divergence** | 0.0708 | 0 *(by definition)* |
| **Refusals** | 14/100 | 99/100 |
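The KL divergence above measures how far this model's next-token distribution drifts from the original's; identical models score exactly 0. As a rough illustration of the quantity being reported (this is not Heretic's actual evaluation harness, just the textbook definition applied to two logit vectors):

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions, in nats."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give KL = 0, matching the "0 (by definition)" entry above
print(kl_divergence([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
# A slightly perturbed distribution gives a small positive divergence
print(kl_divergence([2.0, 1.0, 0.1], [1.8, 1.1, 0.3]) > 0)  # True
```

In practice the reported number averages this quantity over many prompts and token positions, so a value of 0.0708 indicates the abliterated model's outputs stay close to the original's.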
-----

# Qwen3-Coder-Next
## Highlights

Today, we're announcing **Qwen3-Coder-Next**, an open-weight language model designed specifically for coding agents and local development. It features the following key enhancements:

- **Super Efficient with Significant Performance**: With only 3B activated parameters (80B total parameters), it achieves performance comparable to models with 10–20x more active parameters, making it highly cost-effective for agent deployment.
- **Advanced Agentic Capabilities**: Through an elaborate training recipe, it excels at long-horizon reasoning, complex tool usage, and recovery from execution failures, ensuring robust performance in dynamic coding tasks.
- **Versatile Integration with Real-World IDEs**: Its 256K context length, combined with adaptability to various scaffold templates, enables seamless integration with different CLI/IDE platforms (e.g., Claude Code, Qwen Code, Qoder, Kilo, Trae, and Cline), supporting diverse development environments.

![image/jpeg](https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen3-Coder-Next/benchmarks.png)

![image/jpeg](https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen3-Coder-Next/swebench_pro.png)
## Model Overview

**Qwen3-Coder-Next** has the following features:
- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 80B in total and 3B activated
- Number of Parameters (Non-Embedding): 79B
- Hidden Dimension: 2048
- Number of Layers: 48
- Hybrid Layout: 12 \* (3 \* (Gated DeltaNet -> MoE) -> 1 \* (Gated Attention -> MoE))
- Gated Attention:
  - Number of Attention Heads: 16 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
  - Number of Linear Attention Heads: 32 for V and 16 for QK
  - Head Dimension: 128
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10
  - Number of Shared Experts: 1
  - Expert Intermediate Dimension: 512
- Context Length: 262,144 tokens natively
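The hybrid layout notation above expands to a concrete per-layer pattern: each of the 12 blocks stacks three Gated DeltaNet (linear attention) layers followed by one Gated Attention (full attention) layer, every layer feeding an MoE FFN. A minimal sketch of that expansion (the string labels here are illustrative, not the actual module names in `transformers`):

```python
# Expand the hybrid layout: 12 repetitions of
# (3 x (Gated DeltaNet -> MoE)) followed by 1 x (Gated Attention -> MoE)
def build_layer_types(num_blocks: int = 12) -> list:
    layers = []
    for _ in range(num_blocks):
        layers.extend(["gated_deltanet"] * 3)  # linear-attention layers
        layers.append("gated_attention")       # full-attention layer
    return layers

layer_types = build_layer_types()
print(len(layer_types))                     # 48 layers in total
print(layer_types.count("gated_deltanet"))  # 36
print(layer_types.count("gated_attention")) # 12
```

So only 12 of the 48 layers pay the quadratic cost of full attention, which is part of why long contexts remain tractable with 3B activated parameters.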
**NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is no longer required.**

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwen.ai/blog?id=qwen3-coder-next), [GitHub](https://github.com/QwenLM/Qwen3-Coder), and [documentation](https://qwen.readthedocs.io/en/latest/).
## Quickstart

We advise you to use the latest version of `transformers`.

The following code snippet illustrates how to use the model to generate content from a given prompt:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-Next"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Write a quick sort algorithm."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

**Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32768`.**

For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

## Deployment

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision-language models.
It can be used to launch a server with an OpenAI-compatible API.

`sglang>=0.5.8` is required for Qwen3-Coder-Next and can be installed using:
```shell
pip install 'sglang[all]>=0.5.8'
```
See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.

The following command creates an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs:
```shell
python -m sglang.launch_server --model Qwen/Qwen3-Coder-Next --port 30000 --tp-size 2 --tool-call-parser qwen3_coder
```

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.
### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
It can be used to launch a server with an OpenAI-compatible API.

`vllm>=0.15.0` is required for Qwen3-Coder-Next and can be installed using:
```shell
pip install 'vllm>=0.15.0'
```
See [its documentation](https://docs.vllm.ai/en/stable/getting_started/installation/index.html) for more details.

The following command creates an API endpoint at `http://localhost:8000/v1` with a maximum context length of 256K tokens, using tensor parallelism across 2 GPUs:
```shell
vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.


## Agentic Coding

Qwen3-Coder-Next excels at tool calling.

You can simply define or use any tools, as in the following example:
```python
# Your tool implementation
def square_the_number(num: float) -> float:
    return num ** 2

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "Output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared"
                    }
                },
            }
        }
    }
]

from openai import OpenAI
# Define the LLM client
client = OpenAI(
    # Use a custom endpoint compatible with the OpenAI API
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-Next",
    max_tokens=65536,
    tools=tools,
)

print(completion.choices[0])
```
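When the model decides to call a tool, the response carries the call in `completion.choices[0].message.tool_calls` rather than executing anything itself; your agent loop must run the tool and send the result back in a `{"role": "tool", ...}` message. A minimal dispatch sketch, using a hand-written tool-call payload in place of a live server response:

```python
import json

def square_the_number(num: float) -> float:
    return num ** 2

# Map tool names (as declared in the `tools` schema) to implementations
TOOL_REGISTRY = {"square_the_number": square_the_number}

def dispatch_tool_call(name: str, arguments: str) -> float:
    """Execute one tool call; `arguments` arrives as a JSON string."""
    args = json.loads(arguments)
    return TOOL_REGISTRY[name](args["input_num"])

# Simulated payload; a live server would return this structure in
# completion.choices[0].message.tool_calls[0].function
result = dispatch_tool_call("square_the_number", '{"input_num": 1024}')
print(result)  # 1048576
```

The result would then be appended to `messages` as a tool message and the conversation re-sent, letting the model compose its final answer from the tool output.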
## Best Practices

To achieve optimal performance, we recommend the following sampling parameters: `temperature=1.0`, `top_p=0.95`, `top_k=40`.
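For context, these three parameters interact during decoding: temperature rescales the logits, top-k keeps only the k highest-probability tokens, and top-p (nucleus sampling) then keeps the smallest prefix of those whose cumulative probability reaches p. A framework-free sketch of that filtering order (illustrative only; serving engines implement this internally, and exact ordering can vary by engine):

```python
import math

def filter_logits(logits, temperature=1.0, top_p=0.95, top_k=40):
    """Return the token indices that survive temperature/top-k/top-p filtering."""
    scaled = [l / temperature for l in logits]
    # top-k: keep the k highest-scoring tokens, most likely first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # softmax over the surviving tokens
    m = max(scaled[i] for i in order)
    exps = {i: math.exp(scaled[i] - m) for i in order}
    z = sum(exps.values())
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += exps[i] / z
        if cum >= top_p:
            break
    return kept

# A sharply peaked distribution leaves only the top token in the nucleus
print(filter_logits([10.0, 1.0, 0.5, 0.2]))  # [0]
```

Sampling then draws from the renormalized distribution over the kept indices; with `temperature=1.0` the model's raw probabilities are used unchanged before the cutoffs.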
## Citation

If you find our work helpful, feel free to cite it.

```bibtex
@techreport{qwen_qwen3_coder_next_tech_report,
  title  = {Qwen3-Coder-Next Technical Report},
  author = {{Qwen Team}},
  url    = {https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf},
  note   = {Accessed: 2026-02-03}
}
```