Update README.md

#10
by oql - opened
Files changed (1)
  1. README.md +18 -0
README.md CHANGED
@@ -138,6 +138,24 @@ vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-a
  > The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.
 
 
+ ### KTransformers
+
+ [KTransformers](https://github.com/kvcache-ai/ktransformers) is a CPU-GPU heterogeneous inference engine for large language models.
+ KTransformers can be used to launch an OpenAI-compatible API service on a single GPU.
+
+ `KTransformers` can be installed and run by following its [Qwen3-Coder-Next Tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/Qwen3-Coder-Next-Tutorial.md).
+
+ See [its documentation](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md) for more details.
+
+ The following command can be used to create an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens on 1 GPU.
+ ```shell
+ kt run Qwen3-Coder-Next
+ ```
+
+ > [!Note]
+ > The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.
+
+
  ## Agentic Coding
 
  Qwen3-Coder-Next excels in tool calling capabilities.
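Once `kt run Qwen3-Coder-Next` is up, the service speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client should work against it. A minimal sketch using only the Python standard library is below; the `/v1/chat/completions` route and the `Qwen3-Coder-Next` model name are assumptions based on the OpenAI-compatible interface described above, so adjust them to whatever the server actually reports.

```python
import json
import urllib.request

# Assumed endpoint exposed by `kt run Qwen3-Coder-Next` (see the section above);
# the /chat/completions route follows the OpenAI-compatible convention.
API_URL = "http://localhost:30000/v1/chat/completions"


def build_request(prompt: str, model: str = "Qwen3-Coder-Next") -> urllib.request.Request:
    """Build a chat-completion HTTP request for the local KTransformers server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Requires the server from `kt run Qwen3-Coder-Next` to be running locally.
    req = build_request("Write a Python function that reverses a string.")
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
```

Because the payload construction is separated from the network call, the same `build_request` helper can be pointed at the vLLM endpoint mentioned earlier in the README by changing `API_URL`.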