Inconsistent output
Compared to Qwen's "official" FP8 quant, this one tends to add redundant characters to text output.
For example, testing with vLLM nightly and the recommended sampling parameters, the following question
is /users/me endpoint a bad practice?
results in the following issues in the output:
Forgetting to require auth → anyone gets someonesomeone'’s data*Use Vary: Authorization, avoid server-side caching per endpoint without per-user granularitycache keys�💡 Alternatives & Complements:�✅ Best Practices for /users/meHowever, whether it's *appropriate* depends on **context, **security considerations**, **consistency**, and **implementation quality**. Here’s a balanced breakdown:
There are broken unicode chars, missing closing tags (`**context` without a closing `**`), repetitions inside words (someonesomeone), and missing spaces.
Changing sampling parameters doesn't affect these issues. With temp=0.0 the output has many more mistakes than with temp=1.0.
But despite this the model still performs well in agentic tasks with OpenCode, and I don't know how 🫥
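For reference, a minimal sketch of the request payload I'm sending to a local vLLM OpenAI-compatible server (the model id and endpoint are placeholders; the sampling values are my reading of Qwen's recommended settings for the coder models, so double-check the model card):

```python
import json

# Sampling parameters as (I believe) recommended on Qwen's model card
# for the coder models - the issue reproduces even more strongly at
# temperature=0.0 than at these settings.
payload = {
    "model": "unsloth/qwen3-coder-next-fp8-dynamic",  # placeholder id
    "messages": [
        {"role": "user", "content": "is /users/me endpoint a bad practice?"}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "repetition_penalty": 1.05,
}

# POST this to http://localhost:8000/v1/chat/completions on a local
# vLLM server (e.g. with curl or requests); printed here for inspection.
print(json.dumps(payload, indent=2))
```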
Oh hey! Yes, this is somewhat expected - Qwen's official quant, like https://huggingface.co/unsloth/Qwen3-Coder-Next-FP8, uses block-wise [128, 128] FP8, whilst this one uses per-channel FP8, which I think is 8-10% faster.
We actually ran a benchmark for Qwen3-8B as well, e.g.: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning
We plan in the future to mix block-wise and per-row/column scaling to make it slightly more accurate.
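To make the difference between the two schemes concrete, here is a small NumPy sketch (my own illustration, not Unsloth's actual kernel): per-channel keeps one scale per output row, block-wise keeps one scale per [128, 128] tile, and the FP8 E4M3 cast is crudely simulated by rounding to 3 explicit mantissa bits and clipping to ±448 (ignoring subnormals/NaN handling).

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fake_e4m3(x):
    # Crude simulation of an E4M3 cast: clip to the finite range, then
    # round the mantissa to 4 bits total (3 explicit + implicit leading 1).
    m, e = np.frexp(np.clip(x, -E4M3_MAX, E4M3_MAX))  # x = m * 2**e, m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0
    return np.ldexp(m, e)

def quant_per_channel(w):
    # One scale per output channel (row) - the faster scheme.
    scale = np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)
    return fake_e4m3(w / scale) * scale

def quant_block(w, b=128):
    # One scale per [b, b] tile - the block-wise [128, 128] scheme.
    out = np.empty_like(w)
    for i in range(0, w.shape[0], b):
        for j in range(0, w.shape[1], b):
            tile = w[i:i + b, j:j + b]
            s = np.abs(tile).max() / E4M3_MAX
            s = 1.0 if s == 0 else s
            out[i:i + b, j:j + b] = fake_e4m3(tile / s) * s
    return out

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256))
err_ch = np.mean((w - quant_per_channel(w)) ** 2)
err_blk = np.mean((w - quant_block(w)) ** 2)
print(f"per-channel MSE: {err_ch:.3e}, block-wise MSE: {err_blk:.3e}")
```

Which scheme wins on accuracy depends on where the outliers sit: a single outlier inflates the scale for its whole row under per-channel, but only for its own tile under block-wise, which is presumably what mixing the two is meant to exploit.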
I haven't noticed any formatting/spelling issues yet. However, I haven't used the model outside an agent harness, meaning there are always 10k+ tokens of instructions in my context, including instructions about the expected output format. The only potentially related issue I've seen is that, despite detailed instructions, qwen3-coder-next-fp8-dynamic isn't very consistent with Codex's 'apply_patch' tool. It doesn't mess up the tool call itself, but the tool input argument (essentially a diff file) is often wrong. I'll try the block-wise FP8 so I can compare...
Yes this is expected a bit [..]
So you also observed these formatting/spelling issues? Are other Unsloth qwen3-coder-next quants showing this too? To me it's unexpected. I assumed minor accuracy loss in larger models would show up differently (a slightly higher tendency to confuse things, rambling, an increased chance of failed tool calls, etc.). Maybe this is something else (an inference bug)?
FYI: I encountered two lone out-of-place Chinese characters in the output of the Qwen-provided FP8 version. Against my intuition, it might therefore simply be a property of this model to show such token-level/formatting errors under loss of accuracy; after all, it has only 3B active parameters.