SecurityLLM GGUF

GGUF quantizations of ZySec-AI/SecurityLLM, verified end-to-end on the NVIDIA DGX Spark (GB10, 128 GB unified memory).

Spark-tested

Every Orionfold quant ships with a measurement quad recorded on the NVIDIA DGX Spark (GB10, 128 GB unified memory): perplexity, sustained tok/s, thermal envelope, and CyberMetric (n=50, mcq_letter) accuracy. The numbers below are the actual run, not a wishlist.

Variant  Size     Perplexity (wikitext-2)  tok/s on Spark  CyberMetric (n=50, mcq_letter)
Q4_K_M   4.1 GB   7.400                    47.7            40.0%
Q5_K_M   4.8 GB   7.314                    40.0            38.0%
Q6_K     5.5 GB   7.313                    35.0            36.0%
Q8_0     7.2 GB   7.307                    30.3            36.0%
F16      13.5 GB  7.301                    17.4            34.0%

Thermal envelope: a single GB10 sustains full load for 5 minutes before thermal throttling kicks in. Beyond that, expect tok/s degradation; the duty-cycle disclosure follows Orionfold's quant-card standard.
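
For context on the accuracy column: mcq_letter scoring reduces each reply to its first standalone A-D letter and checks it against the gold answer. A minimal sketch of that rule (the function name and inputs are hypothetical, not Orionfold's actual harness):

import re

def mcq_letter_score(replies, golds):
    """Fraction of replies whose first standalone A-D letter matches the gold answer."""
    correct = 0
    for reply, gold in zip(replies, golds):
        # Find the first A/B/C/D that stands alone as a word in the reply.
        m = re.search(r"\b([ABCD])\b", reply.upper())
        if m and m.group(1) == gold.upper():
            correct += 1
    return correct / len(golds)

# 2 of 3 correct -> ~0.667
print(mcq_letter_score(["A", "The answer is B.", "C"], ["A", "B", "D"]))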

Variants

Variant  Recommended use
Q4_K_M   Best balance: fits comfortably in Spark unified memory at 7B; default pick.
Q5_K_M   Higher quality than Q4_K_M with a modest size bump.
Q6_K     Near-lossless; recommended if memory headroom allows.
Q8_0     Effectively lossless; reach for this when quality matters more than throughput.
F16      Reference (no quantization). Use only for measurement / baseline.

How to run

Pull a variant:

huggingface-cli download Orionfold/SecurityLLM-GGUF model-Q5_K_M.gguf \
  --local-dir ./models/securityllm
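
The same pull can be scripted from Python via huggingface_hub (assumes the package is installed; the filename must match one of the variants above):

from huggingface_hub import hf_hub_download

# Downloads the Q5_K_M variant into ./models/securityllm and returns the local path.
path = hf_hub_download(
    repo_id="Orionfold/SecurityLLM-GGUF",
    filename="model-Q5_K_M.gguf",
    local_dir="./models/securityllm",
)
print(path)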

Serve it via llama-server (OpenAI-compatible API):

llama-server -m ./models/securityllm/model-Q5_K_M.gguf \
  -c 4096 -ngl 99 -t 8 \
  --host 0.0.0.0 --port 8080
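
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with requests (assumes the host/port above; llama-server serves whatever GGUF it loaded, so the model field is effectively just a label):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "securityllm",  # label only; the server answers with the loaded GGUF
        "messages": [
            {"role": "user", "content": "In one sentence, what does a KDF do?"}
        ],
        "temperature": 0.0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])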

Or run in-process via llama-cpp-python:

from llama_cpp import Llama
llm = Llama(
    model_path="./models/securityllm/model-Q5_K_M.gguf",
    n_ctx=4096, n_gpu_layers=99, chat_format="zephyr",
)
out = llm.create_chat_completion(
    messages=[
        {"role": "user",
         "content": "What is the primary purpose of a key-derivation function (KDF)?\n\n"
                    "A) Generate public keys\n"
                    "B) Authenticate digital signatures\n"
                    "C) Encrypt data using a password\n"
                    "D) Transform a secret into keys and Initialization Vectors\n\n"
                    "Reply with only the single letter A, B, C, or D."}
    ],
    temperature=0.0,
)
print(out["choices"][0]["message"]["content"])

LM Studio and Ollama (via a Modelfile) load the GGUF directly with no additional setup.
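
For Ollama specifically, a minimal Modelfile sketch (the model name "securityllm" and the path are placeholders; point FROM at whichever variant you pulled):

FROM ./models/securityllm/model-Q5_K_M.gguf

Then register and run it:

ollama create securityllm -f Modelfile
ollama run securityllm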

Methods

Full methodology and Spark-side measurement protocol: Vertical-curator quants on Spark - SecurityLLM-GGUF + CyberMetric mini-eval.

Other Orionfold vertical curators

Same Spark-tested recipe across the curator-on-Spark series.

Each card lists its own measurement quad; the headline numbers are recorded as the actual sweep ran, never pre-corrected.


Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Want to know when the next Orionfold vertical curator drops? Join the launch list at orionfold.com.
