Update README.md
#10 by oql · opened

README.md CHANGED

@@ -138,6 +138,24 @@ vllm serve Qwen/Qwen3-Coder-Next --port 8000 --tensor-parallel-size 2 --enable-a
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

### KTransformers

[KTransformers](https://github.com/kvcache-ai/ktransformers) is a CPU-GPU heterogeneous inference engine for large language models. It can be used to launch an OpenAI-compatible API service with a single GPU.

KTransformers can be installed and run by following its [Qwen3-Coder-Next Tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/Qwen3-Coder-Next-Tutorial.md). See [its documentation](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md) for more details.

The following command creates an API endpoint at `http://localhost:30000/v1` with a maximum context length of 256K tokens on one GPU:

```shell
kt run Qwen3-Coder-Next
```
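Once the server is up, the endpoint speaks the standard OpenAI chat-completions protocol. A minimal sketch of such a request using only the Python standard library; the prompt is illustrative, and the actual network call is commented out so the snippet runs without a live server:

```python
import json
from urllib import request

# OpenAI-compatible chat-completions request against the local endpoint.
# Model name mirrors the served model; adjust if your deployment differs.
payload = {
    "model": "Qwen3-Coder-Next",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}
req = request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
)
# resp = request.urlopen(req)  # requires the server launched above to be running
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.full_url)
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `base_url="http://localhost:30000/v1"`) works the same way.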
> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

## Agentic Coding

Qwen3-Coder-Next excels in tool calling capabilities.
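Tool calling goes through the same OpenAI-compatible chat-completions API: the client declares the available tools in the request, and the model responds with structured `tool_calls` instead of plain text. A minimal sketch of such a request body; the `read_file` tool schema here is an illustrative assumption, not part of this repository:

```python
import json

# Illustrative tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to read."}
            },
            "required": ["path"],
        },
    },
}]

# Chat-completions request body: the model may answer with a tool_calls
# entry naming read_file, which the client executes and feeds back.
payload = {
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "Show me the contents of README.md."}],
    "tools": tools,
    "tool_choice": "auto",
}
body = json.dumps(payload)
print(body[:80])
```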