Ornith-9B-DFlash GGUF
GGUF conversion of z-lab/Qwen3.5-9B-DFlash for llama.cpp.
⚠️ Important: This is a DFlash draft model, not a standalone language model.
It must be paired with a compatible Ornith-1.0-9B target model for speculative decoding.
Hardware Optimization (16GB VRAM)
This Q5_K_M version targets setups with 16GB of VRAM (e.g., NVIDIA RTX 4080, RTX 4070 Ti Super).
Pairing the Q5_K_M target model with this lightweight DFlash draft model lets both models fit entirely or mostly in 16GB of video memory, so speculative decoding can speed up token generation without running out of memory (OOM). See Sample Performance Log for measured throughput.
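As a rough way to check whether both models are likely to fit before launching, the sketch below sums the two GGUF file sizes and compares them against the VRAM budget. It assumes that file size on disk approximates weight memory and that a 2 GiB reserve covers the KV cache and compute buffers; both are illustrative assumptions, not measured values, and the function names are hypothetical.

```python
import os

GIB = 1024 ** 3


def fits_in_vram(target_bytes, draft_bytes, vram_bytes=16 * GIB, reserve_bytes=2 * GIB):
    """Rough check: do both sets of weights fit while leaving a reserve?

    reserve_bytes is an assumed allowance for KV cache and compute buffers;
    the real requirement depends on context length and batch settings.
    """
    return target_bytes + draft_bytes + reserve_bytes <= vram_bytes


def gguf_size(path):
    """Size of a GGUF file on disk, used here as a proxy for weight memory."""
    return os.path.getsize(path)


if __name__ == "__main__":
    # File names follow the usage example in this README; adjust to your paths.
    target = gguf_size("Ornith-1.0-9B.gguf")
    draft = gguf_size("ornith-9b-dflash.gguf")
    if fits_in_vram(target, draft):
        print("Both models should fit in VRAM")
    else:
        print("Offload fewer layers or shorten the context")
```

If the check fails, lowering the number of offloaded layers or the context size is the usual first adjustment.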
Model Profiles
- Base Model (Draft): z-lab/Qwen3.5-9B-DFlash
- Target Model (Main): deepreinforce-ai/Ornith-1.0-9B-GGUF
- Format: GGUF
- Quantization: Q5_K_M
Compatibility
Requires a recent version of llama.cpp with native DFlash support.
Tested with:
- llama.cpp b9831 or newer
Usage
To run this model efficiently on a 16GB GPU, start from the configuration below and adjust the GPU layer counts (-ngl / --ngl-draft) to match your free VRAM. The example does not set them explicitly.
Example: Running llama-server (Optimized for 16GB VRAM)
llama-server \
--model Ornith-1.0-9B.gguf \
--spec-draft-model ornith-9b-dflash.gguf \
--spec-type draft-dflash \
--spec-draft-n-max 3
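Once the server is running, speculative decoding is transparent to clients: requests go to the usual OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on llama-server's default address (127.0.0.1:8080); the `max_tokens` value and the helper name `build_chat_request` are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8080"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # illustrative cap on generated tokens
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Requires a running llama-server started as shown above.
    req = build_chat_request("What is the capital of France?")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

The draft model only affects how quickly tokens are produced, so the request itself needs no DFlash-specific fields.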
Sample Performance Log
Original (without the DFlash draft model)
0.45.831.305 I slot print_timing: id 3 | task 0 | prompt processing, n_tokens = 22027, progress = 1.00, t = 12.93 s / 1704.11 tokens per second
0.47.047.913 I slot print_timing: id 2 | task 2 | n_decoded = 100, tg = 16.14 t/s, tg_3s = 16.14 t/s
0.47.442.970 I slot print_timing: id 3 | task 0 | prompt eval time = 12964.80 ms / 22031 tokens ( 0.59 ms per token, 1699.29 tokens per second)
0.47.442.975 I slot print_timing: id 3 | task 0 | eval time = 1572.66 ms / 121 tokens ( 13.00 ms per token, 76.94 tokens per second)
0.47.442.976 I slot print_timing: id 3 | task 0 | total time = 14537.45 ms / 22152 tokens
0.47.442.977 I slot print_timing: id 3 | task 0 | graphs reused = 119
0.47.443.439 I slot release: id 3 | task 0 | stop processing: n_tokens = 22151, truncated = 0
0.50.048.242 I slot print_timing: id 2 | task 2 | n_decoded = 355, tg = 38.61 t/s, tg_3s = 84.99 t/s
0.53.057.872 I slot print_timing: id 2 | task 2 | n_decoded = 615, tg = 50.39 t/s, tg_3s = 86.39 t/s
0.56.058.042 I slot print_timing: id 2 | task 2 | n_decoded = 869, tg = 57.15 t/s, tg_3s = 84.66 t/s
0.59.067.949 I slot print_timing: id 2 | task 2 | n_decoded = 1127, tg = 61.87 t/s, tg_3s = 85.72 t/s
1.02.075.459 I slot print_timing: id 2 | task 2 | n_decoded = 1385, tg = 65.26 t/s, tg_3s = 85.79 t/s
1.05.087.242 I slot print_timing: id 2 | task 2 | n_decoded = 1643, tg = 67.80 t/s, tg_3s = 85.66 t/s
1.08.088.191 I slot print_timing: id 2 | task 2 | n_decoded = 1899, tg = 69.73 t/s, tg_3s = 85.31 t/s
1.11.099.528 I slot print_timing: id 2 | task 2 | n_decoded = 2153, tg = 71.18 t/s, tg_3s = 84.35 t/s
1.14.105.071 I slot print_timing: id 2 | task 2 | n_decoded = 2409, tg = 72.45 t/s, tg_3s = 85.18 t/s
1.17.116.458 I slot print_timing: id 2 | task 2 | n_decoded = 2665, tg = 73.49 t/s, tg_3s = 85.01 t/s
1.20.117.334 I slot print_timing: id 2 | task 2 | n_decoded = 2924, tg = 74.47 t/s, tg_3s = 86.31 t/s
1.23.128.212 I slot print_timing: id 2 | task 2 | n_decoded = 3183, tg = 75.29 t/s, tg_3s = 86.02 t/s
1.23.234.862 I slot print_timing: id 2 | task 2 | prompt eval time = 7603.25 ms / 29038 tokens ( 0.26 ms per token, 3819.15 tokens per second)
1.23.234.867 I slot print_timing: id 2 | task 2 | eval time = 42381.89 ms / 3192 tokens ( 13.28 ms per token, 75.32 tokens per second)
1.23.234.868 I slot print_timing: id 2 | task 2 | total time = 49985.14 ms / 32230 tokens
1.23.234.869 I slot print_timing: id 2 | task 2 | graphs reused = 3167
1.23.235.407 I slot release: id 2 | task 2 | stop processing: n_tokens = 32229, truncated = 0
DFlash (with this draft model)
1.43.873.969 I slot print_timing: id 3 | task 0 | prompt processing, n_tokens = 22027, progress = 1.00, t = 15.24 s / 1445.06 tokens per second
1.45.289.579 I slot print_timing: id 2 | task 2 | n_decoded = 102, tg = 14.34 t/s, tg_3s = 14.34 t/s
1.46.980.800 I slot print_timing: id 3 | task 0 | n_decoded = 152, tg = 50.12 t/s, tg_3s = 50.12 t/s
1.47.622.243 I slot print_timing: id 3 | task 0 | prompt eval time = 15316.84 ms / 22031 tokens ( 0.70 ms per token, 1438.35 tokens per second)
1.47.622.247 I slot print_timing: id 3 | task 0 | eval time = 3674.21 ms / 182 tokens ( 20.19 ms per token, 49.53 tokens per second)
1.47.622.249 I slot print_timing: id 3 | task 0 | total time = 18991.05 ms / 22213 tokens
1.47.622.250 I slot print_timing: id 3 | task 0 | graphs reused = 1
1.47.622.253 I slot print_timing: id 3 | task 0 | draft acceptance = 0.43038 ( 102 accepted / 237 generated), mean len = 2.29
1.47.622.580 I slot release: id 3 | task 0 | stop processing: n_tokens = 22212, truncated = 0
1.48.293.079 I slot print_timing: id 2 | task 2 | n_decoded = 298, tg = 29.46 t/s, tg_3s = 65.26 t/s
1.51.307.219 I slot print_timing: id 2 | task 2 | n_decoded = 668, tg = 50.88 t/s, tg_3s = 122.75 t/s
1.54.319.203 I slot print_timing: id 2 | task 2 | n_decoded = 1045, tg = 64.74 t/s, tg_3s = 125.17 t/s
1.57.321.633 I slot print_timing: id 2 | task 2 | n_decoded = 1429, tg = 74.64 t/s, tg_3s = 127.90 t/s
2.00.336.201 I slot print_timing: id 2 | task 2 | n_decoded = 1770, tg = 79.88 t/s, tg_3s = 113.12 t/s
2.03.341.293 I slot print_timing: id 2 | task 2 | n_decoded = 2094, tg = 83.21 t/s, tg_3s = 107.82 t/s
2.05.772.221 I slot print_timing: id 2 | task 2 | prompt eval time = 9144.43 ms / 29038 tokens ( 0.31 ms per token, 3175.48 tokens per second)
2.05.772.226 I slot print_timing: id 2 | task 2 | eval time = 27594.70 ms / 2354 tokens ( 11.72 ms per token, 85.31 tokens per second)
2.05.772.227 I slot print_timing: id 2 | task 2 | total time = 36739.13 ms / 31392 tokens
2.05.772.228 I slot print_timing: id 2 | task 2 | graphs reused = 884
2.05.772.232 I slot print_timing: id 2 | task 2 | draft acceptance = 0.46619 ( 1372 accepted / 2943 generated), mean len = 2.40
2.05.772.752 I slot release: id 2 | task 2 | stop processing: n_tokens = 31391, truncated = 0
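To compare the two runs without reading the logs by eye, the sketch below extracts decode throughput and draft acceptance from `print_timing` lines. The regular expressions are written against the log format shown above and are an assumption about that format, not a llama.cpp API; the three sample lines are copied from the task 2 entries of each run.

```python
import re

# Matches "| eval time = X ms / N tokens" but not "prompt eval time".
EVAL_RE = re.compile(r"\|\s+eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s+tokens")
ACCEPT_RE = re.compile(
    r"draft acceptance\s*=\s*[\d.]+\s*\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\)"
)


def decode_tps(line):
    """Return decode tokens per second from an 'eval time' line, or None."""
    m = EVAL_RE.search(line)
    if m is None:
        return None
    millis, tokens = float(m.group(1)), int(m.group(2))
    return tokens / (millis / 1000.0)


def acceptance(line):
    """Return accepted/generated draft tokens from a 'draft acceptance' line, or None."""
    m = ACCEPT_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) / int(m.group(2))


ORIGINAL = "1.23.234.867 I slot print_timing: id 2 | task 2 | eval time = 42381.89 ms / 3192 tokens ( 13.28 ms per token, 75.32 tokens per second)"
DFLASH = "2.05.772.226 I slot print_timing: id 2 | task 2 | eval time = 27594.70 ms / 2354 tokens ( 11.72 ms per token, 85.31 tokens per second)"
DRAFT = "2.05.772.232 I slot print_timing: id 2 | task 2 | draft acceptance = 0.46619 ( 1372 accepted / 2943 generated), mean len = 2.40"

if __name__ == "__main__":
    speedup = decode_tps(DFLASH) / decode_tps(ORIGINAL)
    print(f"decode speedup: {speedup:.2f}x")
    print(f"draft acceptance: {acceptance(DRAFT):.3f}")
```

For task 2 this gives a decode speedup of about 1.13x (85.31 vs 75.32 tokens per second) at a draft acceptance rate of about 0.47. The shorter task 0 was slower with DFlash in this log (49.53 vs 76.94 tokens per second), and the two runs generated different numbers of tokens, so treat these figures as a single sample rather than a benchmark.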
Conversion
Converted from the original Hugging Face model using the latest convert_hf_to_gguf.py.
No model weights were modified.
Notes
This repository contains only the DFlash draft model.
A compatible Ornith-1.0-9B GGUF target model is required for speculative decoding.
Credits
- z-lab — Original DFlash model
- deepreinforce-ai Team — Ornith-1.0-9B
- ggml-org/llama.cpp — GGUF format and DFlash inference implementation
License
This repository contains a converted GGUF version of the original DFlash draft model.
All original licenses, usage restrictions, and intellectual property remain with the upstream authors. Please refer to the original repositories for complete licensing information.