DMA Paging Error (0xc01e0200)
flm run gpt-oss:20b
[FLM] Configuring NPU Power Mode to performance (flm default)
[Warning] Version check timed out; continuing without update info.
[FLM] Loading model: C:\Users\johndoe\Documents\flm\models\GPT-OSS-20B-NPU2
Error: Failed to submit command to hw queue (0xc01e0200):
Even after the video memory manager split the DMA buffer, the video
memory manager could not page-in all of the required allocations into
video memory at the same time. The device is unable to continue.
Device: HX 370 with 32 GB RAM
NPU driver: 32.0.203.314
Please check this note on memory requirements: https://fastflowlm.com/docs/models/gpt-oss/#:~:text=Copy-,%F0%9F%93%9D%20NOTE,-Memory%20Requirements%0A%E2%9A%A0%EF%B8%8F%20Note%3A%20Running%20gpt%2Doss%3A20b
Also, are you currently using flm v0.9.23?
I have reviewed the official troubleshooting guide at https://fastflowlm.com/docs/instructions/cli/, which suggests that RAM shortage is a primary culprit. However, my system telemetry shows ~20GB of free physical RAM at the moment the flm run command initiates.
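To make the "free RAM at launch" observation easier to reason about, here is a minimal pre-flight sketch (not part of flm). The 14.4 GB weight size is mentioned later in this thread; the headroom figure and the function itself are illustrative assumptions, not flm behavior:

```python
# Hypothetical pre-flight check: does free physical RAM cover the model
# weights plus some headroom? All numbers are illustrative.

GIB = 1024 ** 3

def enough_ram(free_bytes: int,
               model_bytes: int = int(14.4 * GIB),  # gpt-oss:20b weights (per this thread)
               headroom_bytes: int = 4 * GIB):      # assumed margin for KV cache / OS
    """Return True if free RAM covers the model weights plus headroom."""
    return free_bytes >= model_bytes + headroom_bytes

# ~20 GiB free (as reported above) passes this naive check,
# yet the NPU's internal memory cap can still trigger the DMA paging error.
print(enough_ram(int(20 * GIB)))
```

This illustrates why "plenty of free RAM" is not sufficient on its own: the NPU cap discussed below is a separate, lower limit.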
Please check "total mem" in Task Manager -> Performance -> NPU.
There is an internal cap on the amount of memory the NPU can access.
Also, keep an eye on https://github.com/lemonade-sdk/lemonade/issues/688 for updates.
Hope the cap can be lifted soon ...
I solved the problem by adjusting the amount of RAM dedicated to the iGPU in the BIOS.
Really, how did you do that? How much "total mem" do you have in Task Manager -> Perf. -> NPU?
Do you mind sharing it? Many of us have 32 GB systems and could really benefit from it! TY!!
I got the same DMA paging error due to insufficient system RAM, which in my case is 32 GB. I understand I can adjust the RAM split to solve the problem; this is done with the AMD Adrenalin tool (see the picture below). My setup is 32 GB of system RAM + 96 GB of VRAM, for a combined total of 128 GB, which is typical of Strix Halo. I could reduce the 96 GB of VRAM and transfer that amount to system RAM. However, I prefer not to, as I want as much VRAM as possible reserved for running larger models in VRAM.
One thing I would like to point out is that this DMA paging error only occurs when the prompt is relatively long. In my case, I set ctx-len to 16384 and the error occurs when my prompt is around 10k tokens (the response is around 1k tokens, so the total stays under the 16k context setting). Shorter prompts do not trigger a DMA error. With either long or short prompts, the flm server launches successfully with a ctx-len of 16384.
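The long-prompt trigger is consistent with KV-cache growth: that part of the memory footprint scales linearly with the number of tokens in context. A back-of-the-envelope sketch; the layer count, KV heads, head dimension, and fp16 precision here are assumed round numbers, not the actual gpt-oss:20b NPU configuration:

```python
def kv_cache_bytes(tokens, layers=24, kv_heads=8, head_dim=64, bytes_per_val=2):
    # 2x for K and V tensors; all architecture parameters are illustrative.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_val

short = kv_cache_bytes(1_000)    # ~1k-token prompt
long_ = kv_cache_bytes(10_000)   # ~10k-token prompt
print(long_ / short)  # KV memory grows 10x with a 10x longer prompt
```

So a server that launches fine at ctx-len 16384 can still run out of pageable memory only once the prompt actually fills most of that context.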
For gpt-oss:20b, I would love to see a smaller model than the current 14.4 GB; the recent AMD Ryzen AI 1.7 ONNX NPU model release gets it down to around 13.4 GB. That 1 GB difference is extremely valuable for longer-context inference: it gives my NPU a stronger edge serving normal LLM requests while keeping the GPU free for more demanding tasks such as coding.
The NPU shares memory with the operating system. If too much system memory is allocated exclusively to the iGPU, the NPU will not have enough memory for GPT-OSS model inference. In my case, flm launched GPT-OSS inference successfully when there was 24 GB of free system memory.
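The point above can be framed as a simple budget: total system RAM, minus the exclusive iGPU carve-out and OS usage, is what remains for the NPU. The carve-out and OS figures below are made-up examples; check your real numbers in Task Manager -> Performance -> NPU ("total mem") and your BIOS/Adrenalin settings:

```python
def free_for_npu(total_ram_gib=32, igpu_carveout_gib=8, os_and_apps_gib=4):
    # Illustrative budget: what is left of system RAM after the exclusive
    # iGPU carve-out and OS/application usage are subtracted.
    return total_ram_gib - igpu_carveout_gib - os_and_apps_gib

# Shrinking the iGPU carve-out in the BIOS raises what is left for the NPU:
print(free_for_npu(igpu_carveout_gib=8))  # 20 GiB remaining
print(free_for_npu(igpu_carveout_gib=4))  # 24 GiB remaining
```

With these example numbers, dropping the carve-out from 8 GB to 4 GB is what moves a 32 GB system past the ~24 GB-free threshold reported above.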
