How can I repeat the eval results?

by bash99 - opened May 6, 2025

Discussion

bash99

May 6, 2025

Should I change the chat template, since Qwen3 is a thinking model by default?

I ran lm_eval with vLLM 0.8.5 and the latest lm-eval from git.

I used almost the same script as in the model card (I have two 48 GB 4090s, so I set tensor_parallel_size=2):

export CUDA_VISIBLE_DEVICES=0,1
export MODEL=Qwen3-30B-A3B-FP8_dynamic
lm_eval \
  --model vllm \
  --model_args pretrained="$MODEL",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto

But the result I got is:

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|---|---|---|---|---|---|---|---|---|
|Open LLM Leaderboard | N/A| | | | | | | |
| - arc_challenge | 1|none | 25|acc |↑ | 0.6382|± |0.0140|
| | |none | 25|acc_norm |↑ | 0.5623|± |0.0145|
| - gsm8k | 3|flexible-extract| 5|exact_match|↑ | 0.2146|± |0.0113|
| | |strict-match | 5|exact_match|↑ | 0.0061|± |0.0021|
| - hellaswag | 1|none | 10|acc |↑ | 0.6301|± |0.0048|
| | |none | 10|acc_norm |↑ | 0.7173|± |0.0045|
| - mmlu | 2|none | |acc |↑ | 0.4318|± |0.0041|
| - truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5571|± |0.0154|
| - winogrande | 1|none | 5|acc |↑ | 0.7285|± |0.0125|

alexmarques

Red Hat AI org May 6, 2025

The discrepancy is likely due to the thinking mode, which is enabled by default. OpenLLM-style evaluations work significantly better when this behavior is disabled.

I used this branch from lm-evaluation-harness: https://github.com/neuralmagic/lm-evaluation-harness/tree/enable_thinking, which disables thinking mode by default (although the user can enable it via a vllm argument). I have pushed a PR to the upstream repo, but it hasn't landed yet.

bash99

May 6, 2025

Update: I tried with --system_instruction "You are a helpful assistant. /no_think."

At least for gsm8k_platinum_cot I got 0.8776. For the official FP8 model (https://huggingface.co/Qwen/Qwen3-32B-FP8) I got 0.8983, and for the bf16 version I got 0.8809.
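
For anyone who wants to script the workaround above, here is a minimal sketch that assembles the same lm_eval invocation with the /no_think system instruction added. The helper name build_lm_eval_command is hypothetical; the flags are the ones already used in this thread, and the model name is just the MODEL value from the earlier command, so substitute your own.

```python
import shlex


def build_lm_eval_command(model, tasks, system_instruction=None, tensor_parallel_size=2):
    """Hypothetical helper: assemble the lm_eval argument list used in this thread."""
    # Same vLLM model_args as in the command posted earlier in the thread.
    model_args = ",".join([
        f"pretrained={model}",
        "dtype=auto",
        "gpu_memory_utilization=0.5",
        "max_model_len=8192",
        "enable_chunked_prefill=True",
        f"tensor_parallel_size={tensor_parallel_size}",
    ])
    cmd = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", tasks,
        "--apply_chat_template",
        "--fewshot_as_multiturn",
        "--batch_size", "auto",
    ]
    # Only add the system instruction when one is given.
    if system_instruction is not None:
        cmd += ["--system_instruction", system_instruction]
    return cmd


cmd = build_lm_eval_command(
    model="Qwen3-30B-A3B-FP8_dynamic",  # placeholder taken from the earlier command
    tasks="gsm8k_platinum_cot",
    system_instruction="You are a helpful assistant. /no_think.",
)
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the command as a list keeps the quoted system instruction intact as a single argument, which is easy to get wrong when pasting into a shell.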

alexmarques

Red Hat AI org May 6, 2025

Interesting. Thanks for the update. This level of variability is not uncommon for quantized models.
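
As a rough way to put numbers on that variability, the sketch below estimates the sampling standard error of an accuracy score and compares two scores. It assumes each score is a plain proportion over n independent problems, treats the two runs as independent (a simplification, since they share a test set), and assumes 1209 problems for gsm8k_platinum and 1319 for gsm8k; those counts are not stated in this thread, so check them against your own lm_eval output.

```python
import math


def binomial_stderr(accuracy, n_samples):
    """Standard error of a proportion measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def z_score(acc_a, acc_b, n_samples):
    """Difference between two accuracies in units of their combined standard error.

    Treats the two runs as independent samples, which is a simplification.
    """
    combined = math.sqrt(
        binomial_stderr(acc_a, n_samples) ** 2 + binomial_stderr(acc_b, n_samples) ** 2
    )
    return (acc_a - acc_b) / combined


N_GSM8K = 1319           # ASSUMPTION: size of the gsm8k test split
N_GSM8K_PLATINUM = 1209  # ASSUMPTION: size of the gsm8k_platinum test split

# Sanity check against the gsm8k flexible-extract row in the table (0.2146 ± 0.0113).
print(f"gsm8k stderr: {binomial_stderr(0.2146, N_GSM8K):.4f}")

# gsm8k_platinum_cot scores reported in this thread.
scores = {"FP8-dynamic": 0.8776, "official FP8": 0.8983, "bf16": 0.8809}
for name, acc in scores.items():
    z = z_score(acc, scores["FP8-dynamic"], N_GSM8K_PLATINUM)
    print(f"{name}: acc={acc:.4f}, z vs FP8-dynamic={z:+.2f}")
```

Under these assumptions the largest gap (0.8983 vs 0.8776) comes out at about 1.6 combined standard errors, below the usual 1.96 threshold, so sampling noise alone could account for it.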

alexmarques changed discussion status to closed May 6, 2025
