How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

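# Download the GGUF file from the Hugging Face Hub and load it.
llm = Llama.from_pretrained(
    repo_id="onion515/ornith-9b-dflash",
    filename="ornith-9b-dflash-q5_k_m.gguf",
)

# Send a single-turn chat request and print the reply text.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])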

Ornith-9B-DFlash GGUF

GGUF conversion of z-lab/Qwen3.5-9B-DFlash for llama.cpp.

⚠️ Important: This is a DFlash draft model, not a standalone language model.
It must be paired with a compatible Ornith-1.0-9B target model for speculative decoding.

Hardware Optimization (16GB VRAM)

This Q5_K_M version is sized for setups with 16GB of VRAM (e.g., NVIDIA RTX 4080, RTX 4070 Ti Super).

When the Q5_K_M target model is paired with this lightweight DFlash draft model, both models can fit entirely or mostly into 16GB of video memory, so speculative decoding can speed up token generation without running out of memory (OOM).
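
As a rough illustration of the fit described above, the following sketch estimates how many target-model layers could be offloaded for a given amount of free VRAM. The function name, the even-split-per-layer heuristic, and every number in the usage line (file sizes, layer count, reserve for KV cache and compute buffers) are placeholder assumptions, not measurements of these models.

```python
import math

def layers_that_fit(free_vram_gib, target_size_gib, target_layers,
                    draft_size_gib, reserve_gib=2.0):
    """Crude estimate of how many target layers fit on the GPU.

    Assumes the draft model is fully offloaded, the target's weights are
    spread evenly across its layers, and `reserve_gib` covers KV cache and
    compute buffers. All of these are simplifying assumptions.
    """
    budget = free_vram_gib - draft_size_gib - reserve_gib
    if budget <= 0:
        return 0
    per_layer = target_size_gib / target_layers
    return min(target_layers, math.floor(budget / per_layer))

# Hypothetical numbers for illustration only.
print(layers_that_fit(free_vram_gib=15.0, target_size_gib=6.5,
                      target_layers=32, draft_size_gib=1.0))
```

Treat the result only as a starting point for the GPU layer setting; actual memory use depends on context length and the llama.cpp build, so check the server's startup log.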

Model Profiles

Compatibility

Requires a recent version of llama.cpp with native DFlash support.

Tested with:

  • llama.cpp b9831 or newer

Usage

To run this model efficiently on a 16GB GPU, start from the configuration below and adjust the GPU layers (-ngl / --ngl-draft) to match your free VRAM.

Example: Running llama-server (Optimized for 16GB VRAM)

llama-server \
  --model Ornith-1.0-9B.gguf \
  --spec-draft-model ornith-9b-dflash.gguf \
  --spec-type draft-dflash \
  --spec-draft-n-max 3
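
Once the server is running, it can be queried over HTTP. The sketch below is a minimal client that assumes llama-server is listening on its default address (127.0.0.1:8080) and uses its OpenAI-compatible /v1/chat/completions endpoint; the helper names build_chat_payload and ask are illustrative, not part of llama.cpp. Speculative decoding happens server-side, so the client needs no draft-specific options.

```python
import json
import urllib.request

# Assumption: llama-server was started without --host/--port overrides.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt, url=SERVER_URL, timeout=120):
    """POST the prompt to the server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server up, ask("What is the capital of France?") should return the generated answer as a string.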

Sample Performance Log

Original (without DFlash)

0.45.831.305 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =  22027, progress = 1.00, t =  12.93 s / 1704.11 tokens per second
0.47.047.913 I slot print_timing: id  2 | task 2 | n_decoded =    100, tg =  16.14 t/s, tg_3s =  16.14 t/s
0.47.442.970 I slot print_timing: id  3 | task 0 | prompt eval time =   12964.80 ms / 22031 tokens (    0.59 ms per token,  1699.29 tokens per second)
0.47.442.975 I slot print_timing: id  3 | task 0 |        eval time =    1572.66 ms /   121 tokens (   13.00 ms per token,    76.94 tokens per second)
0.47.442.976 I slot print_timing: id  3 | task 0 |       total time =   14537.45 ms / 22152 tokens
0.47.442.977 I slot print_timing: id  3 | task 0 |    graphs reused =        119
0.47.443.439 I slot      release: id  3 | task 0 | stop processing: n_tokens = 22151, truncated = 0
0.50.048.242 I slot print_timing: id  2 | task 2 | n_decoded =    355, tg =  38.61 t/s, tg_3s =  84.99 t/s
0.53.057.872 I slot print_timing: id  2 | task 2 | n_decoded =    615, tg =  50.39 t/s, tg_3s =  86.39 t/s
0.56.058.042 I slot print_timing: id  2 | task 2 | n_decoded =    869, tg =  57.15 t/s, tg_3s =  84.66 t/s
0.59.067.949 I slot print_timing: id  2 | task 2 | n_decoded =   1127, tg =  61.87 t/s, tg_3s =  85.72 t/s
1.02.075.459 I slot print_timing: id  2 | task 2 | n_decoded =   1385, tg =  65.26 t/s, tg_3s =  85.79 t/s
1.05.087.242 I slot print_timing: id  2 | task 2 | n_decoded =   1643, tg =  67.80 t/s, tg_3s =  85.66 t/s
1.08.088.191 I slot print_timing: id  2 | task 2 | n_decoded =   1899, tg =  69.73 t/s, tg_3s =  85.31 t/s
1.11.099.528 I slot print_timing: id  2 | task 2 | n_decoded =   2153, tg =  71.18 t/s, tg_3s =  84.35 t/s
1.14.105.071 I slot print_timing: id  2 | task 2 | n_decoded =   2409, tg =  72.45 t/s, tg_3s =  85.18 t/s
1.17.116.458 I slot print_timing: id  2 | task 2 | n_decoded =   2665, tg =  73.49 t/s, tg_3s =  85.01 t/s
1.20.117.334 I slot print_timing: id  2 | task 2 | n_decoded =   2924, tg =  74.47 t/s, tg_3s =  86.31 t/s
1.23.128.212 I slot print_timing: id  2 | task 2 | n_decoded =   3183, tg =  75.29 t/s, tg_3s =  86.02 t/s
1.23.234.862 I slot print_timing: id  2 | task 2 | prompt eval time =    7603.25 ms / 29038 tokens (    0.26 ms per token,  3819.15 tokens per second)
1.23.234.867 I slot print_timing: id  2 | task 2 |        eval time =   42381.89 ms /  3192 tokens (   13.28 ms per token,    75.32 tokens per second)
1.23.234.868 I slot print_timing: id  2 | task 2 |       total time =   49985.14 ms / 32230 tokens
1.23.234.869 I slot print_timing: id  2 | task 2 |    graphs reused =       3167
1.23.235.407 I slot      release: id  2 | task 2 | stop processing: n_tokens = 32229, truncated = 0

DFlash

1.43.873.969 I slot print_timing: id  3 | task 0 | prompt processing, n_tokens =  22027, progress = 1.00, t =  15.24 s / 1445.06 tokens per second
1.45.289.579 I slot print_timing: id  2 | task 2 | n_decoded =    102, tg =  14.34 t/s, tg_3s =  14.34 t/s
1.46.980.800 I slot print_timing: id  3 | task 0 | n_decoded =    152, tg =  50.12 t/s, tg_3s =  50.12 t/s
1.47.622.243 I slot print_timing: id  3 | task 0 | prompt eval time =   15316.84 ms / 22031 tokens (    0.70 ms per token,  1438.35 tokens per second)
1.47.622.247 I slot print_timing: id  3 | task 0 |        eval time =    3674.21 ms /   182 tokens (   20.19 ms per token,    49.53 tokens per second)
1.47.622.249 I slot print_timing: id  3 | task 0 |       total time =   18991.05 ms / 22213 tokens
1.47.622.250 I slot print_timing: id  3 | task 0 |    graphs reused =          1
1.47.622.253 I slot print_timing: id  3 | task 0 | draft acceptance = 0.43038 (  102 accepted /   237 generated), mean len =  2.29
1.47.622.580 I slot      release: id  3 | task 0 | stop processing: n_tokens = 22212, truncated = 0
1.48.293.079 I slot print_timing: id  2 | task 2 | n_decoded =    298, tg =  29.46 t/s, tg_3s =  65.26 t/s
1.51.307.219 I slot print_timing: id  2 | task 2 | n_decoded =    668, tg =  50.88 t/s, tg_3s = 122.75 t/s
1.54.319.203 I slot print_timing: id  2 | task 2 | n_decoded =   1045, tg =  64.74 t/s, tg_3s = 125.17 t/s
1.57.321.633 I slot print_timing: id  2 | task 2 | n_decoded =   1429, tg =  74.64 t/s, tg_3s = 127.90 t/s
2.00.336.201 I slot print_timing: id  2 | task 2 | n_decoded =   1770, tg =  79.88 t/s, tg_3s = 113.12 t/s
2.03.341.293 I slot print_timing: id  2 | task 2 | n_decoded =   2094, tg =  83.21 t/s, tg_3s = 107.82 t/s
2.05.772.221 I slot print_timing: id  2 | task 2 | prompt eval time =    9144.43 ms / 29038 tokens (    0.31 ms per token,  3175.48 tokens per second)
2.05.772.226 I slot print_timing: id  2 | task 2 |        eval time =   27594.70 ms /  2354 tokens (   11.72 ms per token,    85.31 tokens per second)
2.05.772.227 I slot print_timing: id  2 | task 2 |       total time =   36739.13 ms / 31392 tokens
2.05.772.228 I slot print_timing: id  2 | task 2 |    graphs reused =        884
2.05.772.232 I slot print_timing: id  2 | task 2 | draft acceptance = 0.46619 ( 1372 accepted /  2943 generated), mean len =  2.40
2.05.772.752 I slot      release: id  2 | task 2 | stop processing: n_tokens = 31391, truncated = 0
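
To compare the two runs, the summary lines for task 2 can be parsed directly. The sketch below copies the eval time and draft acceptance fields from the logs above and recomputes throughput, the speedup ratio, and the acceptance rate; the regular expressions are written for this log layout and are an assumption about its stability across llama.cpp versions.

```python
import re

# Summary fields copied from the two logs above (task 2 only).
ORIGINAL = "eval time =   42381.89 ms /  3192 tokens (   13.28 ms per token,    75.32 tokens per second)"
DFLASH = "eval time =   27594.70 ms /  2354 tokens (   11.72 ms per token,    85.31 tokens per second)"
ACCEPT = "draft acceptance = 0.46619 ( 1372 accepted /  2943 generated), mean len =  2.40"

# Skip "prompt eval time" lines; match only the generation "eval time".
EVAL_RE = re.compile(r"(?<!prompt )eval time =\s*([\d.]+) ms /\s*(\d+) tokens")
ACCEPT_RE = re.compile(r"\(\s*(\d+) accepted /\s*(\d+) generated\)")

def tokens_per_second(line):
    """Recompute generation throughput from milliseconds and token count."""
    ms, tokens = EVAL_RE.search(line).groups()
    return int(tokens) / (float(ms) / 1000.0)

def acceptance_rate(line):
    """Fraction of drafted tokens that the target model accepted."""
    accepted, generated = ACCEPT_RE.search(line).groups()
    return int(accepted) / int(generated)

speedup = tokens_per_second(DFLASH) / tokens_per_second(ORIGINAL)
print(f"original: {tokens_per_second(ORIGINAL):.2f} t/s")
print(f"dflash:   {tokens_per_second(DFLASH):.2f} t/s")
print(f"speedup:  {speedup:.2f}x")
print(f"accepted: {acceptance_rate(ACCEPT):.1%}")
```

This is a single-run comparison with two slots active at once, not a controlled benchmark. In the same logs, the shorter task 0 generated more slowly with DFlash (49.53 t/s versus 76.94 t/s), so the gain depends on the workload.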

Conversion

Converted from the original Hugging Face model using the latest convert_hf_to_gguf.py.

No model weights were modified.

Notes

This repository contains only the DFlash draft model.

A compatible Ornith-1.0-9B GGUF target model is required for speculative decoding.

Credits

  • z-lab — Original DFlash model
  • deepreinforce-ai Team — Ornith-1.0-9B
  • ggml-org/llama.cpp — GGUF format and DFlash inference implementation

License

This repository contains a converted GGUF version of the original DFlash draft model.

All original licenses, usage restrictions, and intellectual property remain with the upstream authors. Please refer to the original repositories for complete licensing information.

Model details

  • Format: GGUF
  • Model size: 1B params
  • Architecture: dflash
  • Quantization: 5-bit