Single RTX Pro 6000 users

#1
by lightenup - opened

Hi - many thanks unsloth for this exceptionally fast quant job!!

Does anyone know whether this fits on a single RTX Pro 6000 with 96 GB VRAM? (On Reddit I have seen claims that it should work with vllm.)
If it fits:
- What kind of pp/tg (prompt processing / token generation) can one expect on sm120 once the context is filled with up to 20k tokens?
- Which inference engine gives the best performance with reliable tool support on a single RTX Pro 6000? Can you share your launch/docker command?

Thanks!!

From Reddit, https://www.reddit.com/r/LocalLLaMA/comments/1quvvtv/comment/o3f35uf/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button :

Absolutely, it rips! On an RTX 6000 you get 80-120 tok/s, and that holds up well at long context and with concurrent requests. Prompt processing is insanely fast at 6K-10K tok/s: pasting a 15-page doc and asking for a summary is a two-second thing.
That's why I'm excited about the coder version: when developing, for example, (sub-)agentic tools, it could allow very fast local iteration if it's good enough to handle the test tasks, on top of being a decent coding assistant and doing IDE auto-complete while at it.
Here's my local vllm command, which uses around 92 of 96 GB:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
--port ${PORT} \
--enable-chunked-prefill \
--max-model-len 262144 \
--max-num-seqs 4 \
--max-num-batched-tokens 16384 \
--tool-call-parser hermes \
--chat-template-content-format string \
--enable-auto-tool-choice \
--disable-custom-all-reduce \
--gpu-memory-utilization 0.95

Ok, I tried it, and with vllm 0.16.0rc1.dev158+g2a99c5a6c.precompiled the suggested launch command just led to an OOM error.

This works however:

vllm serve /path/to/unsloth/Qwen3-Coder-Next-FP8-Dynamic \
        --port ${PORT} \
        --max-model-len 200000 \
        --max-num-seqs 2 \
        --tool-call-parser qwen3_coder \
        --enable-auto-tool-choice \
        --gpu-memory-utilization 0.93 \
        --enable-sleep-mode \
        --attention-backend FLASHINFER \
        --served-model-name qwen3-coder-next \
        --enable-prefix-caching

I am seeing about 8000 tokens/sec pp and 130 tokens/sec tg on a single concurrent request at a context size of about 20k tokens (RTX Pro 6000 @ 300 W).
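For a back-of-the-envelope feel for what those rates mean in practice, here is a quick sketch using the numbers above; the 1000-token reply length is an assumption for illustration:

```shell
# Rough latency estimate from the measured rates (illustrative only).
PROMPT_TOKENS=20000    # context size from the benchmark above
REPLY_TOKENS=1000      # assumed reply length
PP_RATE=8000           # prompt processing, tokens/sec
TG_RATE=130            # token generation, tokens/sec

# POSIX shell integer arithmetic; times in whole seconds (truncated).
PREFILL=$((PROMPT_TOKENS / PP_RATE))
DECODE=$((REPLY_TOKENS / TG_RATE))
echo "prefill ~${PREFILL}s, decode ~${DECODE}s, total ~$((PREFILL + DECODE))s"
```

So even a fully loaded 20k context only costs a couple of seconds of prefill before generation starts.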

About 50 tool calls have succeeded so far without errors. The model makes a very good first impression!
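For anyone wanting to sanity-check tool calling against their own instance: a request like the following exercises the `qwen3_coder` tool-call parser through the OpenAI-compatible API. The `get_weather` tool schema is purely illustrative, and the served model name matches the launch command above. The script only builds and validates the payload; the actual curl call is shown commented out:

```shell
#!/bin/sh
# Build a chat-completions request that exercises tool calling.
# The get_weather tool schema below is a made-up example.
cat > /tmp/tool_request.json <<'EOF'
{
  "model": "qwen3-coder-next",
  "messages": [
    {"role": "user", "content": "What is the weather in Berlin?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"}
          },
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
EOF

# Validate that the payload is well-formed JSON before sending it.
python3 -m json.tool < /tmp/tool_request.json > /dev/null && echo "payload OK"

# Against a running server this would be:
# curl -s http://localhost:${PORT}/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/tool_request.json
```

A successful response should contain a `tool_calls` entry with the function name and JSON arguments rather than plain text.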

Unsloth AI org

Oh nice! Sorry I didn't respond earlier - this is very cool!

@lightenup Yes, the --gpu-memory-utilization 0.93 really is critical, thanks. Even 0.95 fails with the FP8 K/V cache; it seems this model needs some extra headroom.
