Anyone running this on AMD MI300X / vLLM ROCm 7 at 256K context?
Hi everyone - building an open-source repo-scale coding agent (REPOMIND) on top of Qwen3-Coder-Next-FP8 for the AMD Developer Hackathon. Submission May 11, MIT licensed.
The whole architecture relies on the MI300X 192GB single-GPU memory advantage: load 256K tokens of code + KV cache on one card, which physically can't fit on H100 80GB at FP8.
Two questions for the community:
1. Has anyone here actually run vLLM ROCm 7 with --tool-call-parser qwen3_coder at >128K context length? Any pitfalls before I burn AMD Cloud credits?
2. For long-context tool-calling, what's the recommended --max-model-len / --kv-cache-dtype combination on MI300X? I see Day-0 ROCm support announced but no community reports yet at 256K specifically.
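For reference, the launch configuration being asked about can be sketched as a small command builder. This is only a sketch of the question, not a recommended setting: the `fp8` value for `--kv-cache-dtype` is one candidate, and `build_vllm_serve_command` is a hypothetical helper name. The flags themselves (`--max-model-len`, `--kv-cache-dtype`, `--enable-auto-tool-choice`, `--tool-call-parser`) are standard vLLM server options.

```python
import shlex


def build_vllm_serve_command(
    model: str = "Qwen/Qwen3-Coder-Next-FP8",
    max_model_len: int = 262144,           # 256K tokens
    kv_cache_dtype: str = "fp8",           # assumption: a candidate value, not a recommendation
    tool_call_parser: str = "qwen3_coder",
) -> list[str]:
    """Return the argv for a vLLM server launch with tool calling enabled."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", kv_cache_dtype,
        "--enable-auto-tool-choice",       # needed for the parser to be used on "auto" tool choice
        "--tool-call-parser", tool_call_parser,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(shlex.join(build_vllm_serve_command()))
```

Keeping the flags in one function makes it easy to sweep `max_model_len` or `kv_cache_dtype` across runs and log exactly which command produced which result.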
The agent uses an SC-TIR loop (PLAN → CALL → OBSERVE → THINK → ANSWER) with 5 tools (read_file, grep, sandboxed exec, run_tests, git_log). Will publish benchmarks (H100 OOM vs MI300X works) once credits land.
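The post does not describe the loop's internals, so the following is only a minimal sketch of what a PLAN → CALL → OBSERVE → THINK → ANSWER cycle could look like. Everything here is hypothetical: the `run_agent` and `scripted_model` names, the stubbed tool bodies, and the tuple format the model step returns. A real agent would replace `scripted_model` with a call to the served model.

```python
from typing import Callable

# Hypothetical tool registry: names follow the post, bodies are stubs.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda arg: f"<contents of {arg}>",
    "grep": lambda arg: f"<matches for {arg}>",
}


def run_agent(question: str, model_step, max_steps: int = 8):
    """Drive the loop until the model answers or the step budget runs out.

    model_step(transcript) returns ("call", tool_name, arg) or ("answer", text).
    """
    transcript = [("PLAN", question)]
    for _ in range(max_steps):
        action = model_step(transcript)              # THINK: model decides the next move
        if action[0] == "answer":
            transcript.append(("ANSWER", action[1]))
            return action[1], transcript
        _, tool_name, arg = action                   # CALL: run the requested tool
        transcript.append(("CALL", f"{tool_name}({arg})"))
        tool_fn = TOOLS.get(tool_name)
        observation = tool_fn(arg) if tool_fn else f"error: unknown tool {tool_name}"
        transcript.append(("OBSERVE", observation))  # OBSERVE: feed the result back
    return None, transcript                          # budget exhausted without an answer


def scripted_model(transcript):
    """Stand-in for the LLM: read one file, then answer."""
    if not any(kind == "OBSERVE" for kind, _ in transcript):
        return ("call", "read_file", "setup.py")
    return ("answer", "done")
```

Calling `run_agent("Where is the entry point?", scripted_model)` walks through one CALL/OBSERVE round and then returns the answer together with the full transcript.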
Repo: https://github.com/SRKRZ23/repomind
HF Space: https://huggingface.co/spaces/ZeroR3/repomind
Thanks - and huge respect to the Qwen team for FP8 release + Day-0 ROCm support.
Quick update: the HF Space has been moved to the official AMD Developer Hackathon org:
https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind
Likes there contribute to the HF Special Prize judging 🤗
Quick update: smoke-tested the vLLM 0.17.1 + ROCm 7.2 Quick Start image with Qwen3-Coder-Next-FP8 on a single AMD MI300X (192 GB) yesterday.
Verified:
- max_model_len 262144 (256K) starts cleanly and reaches "Application startup complete"
- 77.29 GiB weights + 95.26 GiB KV cache available at 256K config
- 31.31× max concurrency at 256K context per request
- Cold start ~3.5 min (with model download), warm restart ~1.5 min
- Generation throughput: 30 tok/s at 8K config (warm)
- Real Python code generation through /v1/chat/completions verified
Full evidence (rocm-smi, vLLM logs, JSON responses):
github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-smoke-test
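As a sanity check on the numbers above, the cache capacity in tokens and the implied cache cost per token can be backed out of the reported figures. This assumes that vLLM's reported max concurrency is simply cache capacity in tokens divided by max_model_len; if the model mixes full attention with other layer types, the per-token figure is only an effective average.

```python
GIB = 1024 ** 3

# Figures reported in the smoke test above.
kv_cache_gib = 95.26        # KV cache available at the 256K config
max_model_len = 262_144     # 256K tokens per request
max_concurrency = 31.31     # max concurrency reported at startup

# Assumption: concurrency = cache capacity in tokens / max_model_len.
kv_capacity_tokens = max_concurrency * max_model_len
bytes_per_token = kv_cache_gib * GIB / kv_capacity_tokens

print(f"{kv_capacity_tokens / 1e6:.2f}M tokens of cache capacity")
print(f"{bytes_per_token / 1024:.1f} KiB of cache per token (effective)")
```

Under that assumption the cache holds a little over 8 million tokens, which is why a single 256K request leaves most of the cache free for concurrent requests.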
Huge thanks to the Qwen team: Day-0 ROCm support + FP8 release made this possible without manual quantization. The qwen3_coder tool-call parser will be wired in next for the agentic loop (SC-TIR-style, adapted from AIMO3 math).
Final update: the REPOMIND submission for the AMD Developer Hackathon 2026 just landed at lablab.ai/ai-hackathons/amd-developer/repomind/repomind
Full verified results on Qwen3-Coder-Next-FP8 + single MI300X + vLLM 0.17.1 + ROCm 7.2 (124 min, $4.12 total):
Memory: 77.29 GiB weights + 94.58 GiB KV cache available + 92% VRAM peak.
Concurrency (24-cell matrix, default Triton): 31/31 success at 8K, 16K, 32K, AND 64K. 6.49× faster aggregate throughput on 8K vs 32K at N=31.
Long-context: 3/3 needle pass at 200K tokens (usable, not just allocated).
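For readers unfamiliar with needle tests, here is a minimal sketch of how such a prompt can be built and scored. It is not the harness used for the result above: the filler text, the secret, and the helper names are all hypothetical, and reaching 200K tokens would mean scaling `n_lines` and counting with the model's tokenizer.

```python
def build_haystack(filler: str, needle: str, depth: float, n_lines: int) -> str:
    """Return n_lines of numbered filler with the needle inserted at a relative depth in [0, 1]."""
    lines = [f"{i}: {filler}" for i in range(n_lines)]
    position = int(depth * n_lines)
    lines.insert(position, needle)
    return "\n".join(lines)


def needle_passed(model_answer: str, secret: str) -> bool:
    """A run passes if the model's answer contains the planted secret."""
    return secret in model_answer


secret = "7421-ALPHA"  # hypothetical value planted in the context
needle = f"The deploy token is {secret}."
prompt = build_haystack("def noop(): pass", needle, depth=0.5, n_lines=1000)
prompt += "\n\nWhat is the deploy token?"
```

The prompt is then sent to the server and `needle_passed(answer, secret)` decides the cell; varying `depth` checks whether retrieval holds at the start, middle, and end of the context.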
Repo Q&A: 9/9 correct including pytorch/vision (1.3M tokens, about 5× larger than the 256K context window).
Tuning A/B: tried --attention-backend ROCM_AITER_FA. Got 2-4× throughput BUT output degenerated to repeating punctuation on 137/144 cells under FP8 KV cache. Default Triton stays production-safe (0/144 broken). Filing for AMD upstream: vLLM startup logs flag q_scale and prob_scale as uncalibrated for the FP8 attention path.
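A check for this kind of failure can be automated so that every cell in a benchmark matrix is screened, not just eyeballed. The heuristic below is a hypothetical sketch, not the classifier used for the 137/144 count: the thresholds are arbitrary, and punctuation-heavy but valid code could trip it.

```python
import string


def looks_degenerate(text: str, punct_ratio: float = 0.5, min_unique: int = 5) -> bool:
    """Flag output that is empty, mostly punctuation, or built from very few distinct characters."""
    stripped = "".join(text.split())  # ignore all whitespace
    if not stripped:
        return True
    punct = sum(ch in string.punctuation for ch in stripped)
    mostly_punct = punct / len(stripped) >= punct_ratio
    few_symbols = len(set(stripped)) < min_unique
    return mostly_punct or few_symbols


def count_broken(outputs: list[str]) -> int:
    """Number of outputs in a benchmark matrix that look degenerate."""
    return sum(looks_degenerate(o) for o in outputs)
```

Running `count_broken` over the responses from both attention backends gives a per-backend broken-cell count that can be compared directly.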
The qwen3_coder tool-call parser parsed our 5-tool agent registry (read_file, grep_codebase, execute_code, run_tests, git_log) without modification. Day-0 unlock from the Qwen team β huge thanks.
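For anyone wiring up something similar, a tool registry is passed to the server through the `tools` field of an OpenAI-compatible chat completion request. The sketch below uses the five tool names from this post; the descriptions and parameter schemas are hypothetical placeholders, not REPOMIND's actual definitions.

```python
import json


def tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build one entry in the OpenAI-style `tools` list."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Tool names come from the post; every schema below is an assumption.
TOOLS = [
    tool("read_file", "Read a file from the repository.",
         {"path": {"type": "string"}}, ["path"]),
    tool("grep_codebase", "Search the repository for a pattern.",
         {"pattern": {"type": "string"}}, ["pattern"]),
    tool("execute_code", "Run a code snippet in a sandbox.",
         {"code": {"type": "string"}}, ["code"]),
    tool("run_tests", "Run the test suite, optionally for one target.",
         {"target": {"type": "string"}}, []),
    tool("git_log", "Show recent commits, optionally for one path.",
         {"path": {"type": "string"}}, []),
]

request_body = {
    "model": "Qwen/Qwen3-Coder-Next-FP8",
    "messages": [{"role": "user", "content": "Where is the CLI entry point defined?"}],
    "tools": TOOLS,
    "tool_choice": "auto",
}

if __name__ == "__main__":
    # This JSON is what would be POSTed to /v1/chat/completions.
    print(json.dumps(request_body, indent=2))
```

With the server started with a tool-call parser, the response's `tool_calls` field then carries the parsed function name and arguments for the agent loop to execute.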
Full evidence pack: github.com/SRKRZ23/repomind/tree/main/benchmarks
HF Space (judged for HF Special Prize): huggingface.co/spaces/lablab-ai-amd-developer-hackathon/repomind
Demo video (1:38): youtu.be/BvSBR1QazLU
If anyone from the Qwen team wants raw vLLM logs / repro for the AITER FP8 regression, happy to share.
- Sardor / ZeroR3