Ornith-9B-DFlash GGUF

GGUF conversion of z-lab/Qwen3.5-9B-DFlash for llama.cpp.

⚠️ Important: This is a DFlash draft model, not a standalone language model.
It must be used together with a compatible Ornith-1.0-9B target model for speculative decoding.

Hardware Optimization (16GB VRAM)

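This Q5_K_M version is intended for setups with 16GB of VRAM (e.g., NVIDIA RTX 4080, RTX 4070 Ti Super).

By pairing the Q5_K_M target model with this lightweight DFlash draft model, both models can fit entirely or mostly into 16GB of video memory, which lets speculative decoding speed up token generation without running out of memory (OOM). The gain depends on the workload: in the sample performance log below, the long generation improved from 75.32 to 85.31 tokens per second, while the short one was slower with DFlash (49.53 vs. 76.94 tokens per second).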

Model Profiles

Base Model (Draft): z-lab/Qwen3.5-9B-DFlash
Target Model (Main): deepreinforce-ai/Ornith-1.0-9B-GGUF
Format: GGUF
Quantization: Q5_K_M

Compatibility

Requires a recent version of llama.cpp with native DFlash support.

Tested with:

llama.cpp b9831 or newer
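
As a rough illustration of the version requirement, the sketch below compares a llama.cpp release tag against the minimum build listed above. The helper names `parse_build` and `is_supported` are hypothetical, and the sketch assumes the tag has the form `b<number>` (as in `b9831`); how you obtain the tag for your install is left to you.

```python
import re

# Minimum build taken from the "Tested with" line above.
MIN_BUILD = 9831


def parse_build(tag: str) -> int:
    """Extract the numeric build from a tag such as 'b9831' (hypothetical helper)."""
    match = re.fullmatch(r"b?(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized build tag: {tag!r}")
    return int(match.group(1))


def is_supported(tag: str, minimum: int = MIN_BUILD) -> bool:
    """Return True when the build number is at least the tested minimum."""
    return parse_build(tag) >= minimum
```

For example, `is_supported("b9830")` is false and `is_supported("b9831")` is true.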

Usage

To run this model efficiently on a 16GB GPU, start from the configuration below and adjust the GPU layer counts (-ngl / --ngl-draft) to match your free VRAM. The example does not set these flags, so add them as needed.

Example: Running llama-server (Optimized for 16GB VRAM)

llama-server \
  --model Ornith-1.0-9B.gguf \
  --spec-draft-model ornith-9b-dflash.gguf \
  --spec-type draft-dflash \
  --spec-draft-n-max 3
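
Once the server is running, speculative decoding happens on the server side, so a client request looks the same as for any other model. The sketch below sends a chat request to llama-server's OpenAI-compatible endpoint using only the Python standard library. It assumes the server listens on its default address (`127.0.0.1:8080`); `build_chat_request` and `chat` are illustrative names, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server was started with its default host and port.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload for a single user turn."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # The generated text sits in the first choice of the response.
    return reply["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` prints the model's reply. If you changed the host or port, pass a different `base_url`.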

Sample Performance Log

Original

0.45.831.305 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =  22027, progress = 1.00, t =  12.93 s / 1704.11 tokens per second
0.47.047.913 I slot print_timing: id  2 | task 2 | n_decoded =    100, tg =  16.14 t/s, tg_3s =  16.14 t/s
0.47.442.970 I slot print_timing: id  3 | task 0 | prompt eval time =   12964.80 ms / 22031 tokens (    0.59 ms per token,  1699.29 tokens per second)
0.47.442.975 I slot print_timing: id  3 | task 0 |        eval time =    1572.66 ms /   121 tokens (   13.00 ms per token,    76.94 tokens per second)
0.47.442.976 I slot print_timing: id  3 | task 0 |       total time =   14537.45 ms / 22152 tokens
0.47.442.977 I slot print_timing: id  3 | task 0 |    graphs reused =        119
0.47.443.439 I slot      release: id  3 | task 0 | stop processing: n_tokens = 22151, truncated = 0
0.50.048.242 I slot print_timing: id  2 | task 2 | n_decoded =    355, tg =  38.61 t/s, tg_3s =  84.99 t/s
0.53.057.872 I slot print_timing: id  2 | task 2 | n_decoded =    615, tg =  50.39 t/s, tg_3s =  86.39 t/s
0.56.058.042 I slot print_timing: id  2 | task 2 | n_decoded =    869, tg =  57.15 t/s, tg_3s =  84.66 t/s
0.59.067.949 I slot print_timing: id  2 | task 2 | n_decoded =   1127, tg =  61.87 t/s, tg_3s =  85.72 t/s
1.02.075.459 I slot print_timing: id  2 | task 2 | n_decoded =   1385, tg =  65.26 t/s, tg_3s =  85.79 t/s
1.05.087.242 I slot print_timing: id  2 | task 2 | n_decoded =   1643, tg =  67.80 t/s, tg_3s =  85.66 t/s
1.08.088.191 I slot print_timing: id  2 | task 2 | n_decoded =   1899, tg =  69.73 t/s, tg_3s =  85.31 t/s
1.11.099.528 I slot print_timing: id  2 | task 2 | n_decoded =   2153, tg =  71.18 t/s, tg_3s =  84.35 t/s
1.14.105.071 I slot print_timing: id  2 | task 2 | n_decoded =   2409, tg =  72.45 t/s, tg_3s =  85.18 t/s
1.17.116.458 I slot print_timing: id  2 | task 2 | n_decoded =   2665, tg =  73.49 t/s, tg_3s =  85.01 t/s
1.20.117.334 I slot print_timing: id  2 | task 2 | n_decoded =   2924, tg =  74.47 t/s, tg_3s =  86.31 t/s
1.23.128.212 I slot print_timing: id  2 | task 2 | n_decoded =   3183, tg =  75.29 t/s, tg_3s =  86.02 t/s
1.23.234.862 I slot print_timing: id  2 | task 2 | prompt eval time =    7603.25 ms / 29038 tokens (    0.26 ms per token,  3819.15 tokens per second)
1.23.234.867 I slot print_timing: id  2 | task 2 |        eval time =   42381.89 ms /  3192 tokens (   13.28 ms per token,    75.32 tokens per second)
1.23.234.868 I slot print_timing: id  2 | task 2 |       total time =   49985.14 ms / 32230 tokens
1.23.234.869 I slot print_timing: id  2 | task 2 |    graphs reused =       3167
1.23.235.407 I slot      release: id  2 | task 2 | stop processing: n_tokens = 32229, truncated = 0

DFlash

1.43.873.969 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =  22027, progress = 1.00, t =  15.24 s / 1445.06 tokens per second
1.45.289.579 I slot print_timing: id  2 | task 2 | n_decoded =    102, tg =  14.34 t/s, tg_3s =  14.34 t/s
1.46.980.800 I slot print_timing: id  3 | task 0 | n_decoded =    152, tg =  50.12 t/s, tg_3s =  50.12 t/s
1.47.622.243 I slot print_timing: id  3 | task 0 | prompt eval time =   15316.84 ms / 22031 tokens (    0.70 ms per token,  1438.35 tokens per second)
1.47.622.247 I slot print_timing: id  3 | task 0 |        eval time =    3674.21 ms /   182 tokens (   20.19 ms per token,    49.53 tokens per second)
1.47.622.249 I slot print_timing: id  3 | task 0 |       total time =   18991.05 ms / 22213 tokens
1.47.622.250 I slot print_timing: id  3 | task 0 |    graphs reused =          1
1.47.622.253 I slot print_timing: id  3 | task 0 | draft acceptance = 0.43038 (  102 accepted /   237 generated), mean len =  2.29
1.47.622.580 I slot      release: id  3 | task 0 | stop processing: n_tokens = 22212, truncated = 0
1.48.293.079 I slot print_timing: id  2 | task 2 | n_decoded =    298, tg =  29.46 t/s, tg_3s =  65.26 t/s
1.51.307.219 I slot print_timing: id  2 | task 2 | n_decoded =    668, tg =  50.88 t/s, tg_3s = 122.75 t/s
1.54.319.203 I slot print_timing: id  2 | task 2 | n_decoded =   1045, tg =  64.74 t/s, tg_3s = 125.17 t/s
1.57.321.633 I slot print_timing: id  2 | task 2 | n_decoded =   1429, tg =  74.64 t/s, tg_3s = 127.90 t/s
2.00.336.201 I slot print_timing: id  2 | task 2 | n_decoded =   1770, tg =  79.88 t/s, tg_3s = 113.12 t/s
2.03.341.293 I slot print_timing: id  2 | task 2 | n_decoded =   2094, tg =  83.21 t/s, tg_3s = 107.82 t/s
2.05.772.221 I slot print_timing: id  2 | task 2 | prompt eval time =    9144.43 ms / 29038 tokens (    0.31 ms per token,  3175.48 tokens per second)
2.05.772.226 I slot print_timing: id  2 | task 2 |        eval time =   27594.70 ms /  2354 tokens (   11.72 ms per token,    85.31 tokens per second)
2.05.772.227 I slot print_timing: id  2 | task 2 |       total time =   36739.13 ms / 31392 tokens
2.05.772.228 I slot print_timing: id  2 | task 2 |    graphs reused =        884
2.05.772.232 I slot print_timing: id  2 | task 2 | draft acceptance = 0.46619 ( 1372 accepted /  2943 generated), mean len =  2.40
2.05.772.752 I slot      release: id  2 | task 2 | stop processing: n_tokens = 31391, truncated = 0
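
To compare the two runs without reading the numbers by eye, the summary lines can be parsed with a short script. The sketch below copies three lines for task 2 from the logs above and computes the generation speedup and the draft acceptance rate. The regular expressions assume the log format shown here and may need adjusting for other llama.cpp builds; the function names are illustrative.

```python
import re

# Summary lines for task 2, copied verbatim from the sample logs above.
ORIGINAL_LOG = """\
1.23.234.867 I slot print_timing: id  2 | task 2 |        eval time =   42381.89 ms /  3192 tokens (   13.28 ms per token,    75.32 tokens per second)
"""
DFLASH_LOG = """\
2.05.772.226 I slot print_timing: id  2 | task 2 |        eval time =   27594.70 ms /  2354 tokens (   11.72 ms per token,    85.31 tokens per second)
2.05.772.232 I slot print_timing: id  2 | task 2 | draft acceptance = 0.46619 ( 1372 accepted /  2943 generated), mean len =  2.40
"""

# Matches the generation "eval time" line, not the "prompt eval time" line.
EVAL_RE = re.compile(r"\|\s+eval time =.*?([\d.]+) tokens per second")
ACCEPT_RE = re.compile(
    r"draft acceptance = [\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\)"
)


def eval_tokens_per_second(log: str) -> float:
    """Return the generation throughput reported on the eval time line."""
    match = EVAL_RE.search(log)
    if match is None:
        raise ValueError("no eval time line found")
    return float(match.group(1))


def draft_acceptance(log: str) -> float:
    """Recompute the acceptance rate from the accepted/generated counts."""
    match = ACCEPT_RE.search(log)
    if match is None:
        raise ValueError("no draft acceptance line found")
    accepted, generated = map(int, match.groups())
    return accepted / generated


baseline = eval_tokens_per_second(ORIGINAL_LOG)
with_draft = eval_tokens_per_second(DFLASH_LOG)
print(f"speedup: {with_draft / baseline:.2f}x")               # speedup: 1.13x
print(f"acceptance: {draft_acceptance(DFLASH_LOG):.3f}")      # acceptance: 0.466
```

The same functions can be pointed at the task 0 lines to check the shorter request.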

Conversion

Converted from the original Hugging Face model using the latest convert_hf_to_gguf.py.

No model weights were modified.

Notes

This repository contains only the DFlash draft model.

A compatible Ornith-1.0-9B GGUF target model is required for speculative decoding.

Credits

z-lab — Original DFlash model
deepreinforce-ai Team — Ornith-1.0-9B
ggml-org/llama.cpp — GGUF format and DFlash inference implementation

License

This repository contains a converted GGUF version of the original DFlash draft model.

All original licenses, usage restrictions, and intellectual property remain with the upstream authors. Please refer to the original repositories for complete licensing information.

GGUF

Model size: 1B params
Architecture: dflash
Base model (model tree): deepreinforce-ai/Ornith-1.0-9B-GGUF