finance chat GGUF

GGUF quantizations of AdaptLLM/finance-chat, verified end-to-end on the NVIDIA DGX Spark (GB10, 128 GB unified memory).

Notebooks

Two runnable notebooks ship with this model; open either on a free cloud GPU:

| Notebook | What it does | Open |
| --- | --- | --- |
| Builder | Reproduce this model's build and DGX Spark benchmarks end-to-end with fieldkit. | Open in Colab, Open in Kaggle |
| User | Load the published model and call it from your own app in a few lines. | Open in Colab, Open in Kaggle |

Spark-tested

Every Orionfold quant ships with a measurement quad on the NVIDIA DGX Spark (GB10, 128 GB unified memory): perplexity, sustained tok/s, thermal envelope, and FinanceBench (n=50, numeric_match) accuracy. The numbers below are the actual run, not a wishlist.

| Variant | Size | Perplexity (wikitext-2) | tok/s on Spark | FinanceBench (n=50, numeric_match) |
| --- | --- | --- | --- | --- |
| Q4_K_M | 3.8 GB | 6.221 | 31.1 | 14.0% |
| Q5_K_M | 4.5 GB | 6.164 | 26.9 | 16.0% |
| Q6_K | 5.1 GB | 6.147 | 23.9 | 16.0% |
| Q8_0 | 6.7 GB | 6.137 | 8.9 | 18.0% |
| F16 | 12.6 GB | 6.137 | 11.5 | 18.0% |

Thermal envelope: a single GB10 sustains load for 2 minutes before thermal throttling. Beyond that, expect tok/s to degrade; the duty-cycle disclosure follows Orionfold's quant-card standard.
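
To put the table in perspective, the sketch below recomputes two derived figures from the numbers above: each variant's perplexity increase relative to F16, and the number of FinanceBench questions each accuracy figure corresponds to at n=50. The function names are illustrative and not part of any Orionfold tooling.

```python
# Figures copied from the results table on this card:
# variant -> (perplexity on wikitext-2, FinanceBench accuracy in percent).
RESULTS = {
    "Q4_K_M": (6.221, 14.0),
    "Q5_K_M": (6.164, 16.0),
    "Q6_K": (6.147, 16.0),
    "Q8_0": (6.137, 18.0),
    "F16": (6.137, 18.0),
}
N_QUESTIONS = 50  # FinanceBench sample size stated on this card


def perplexity_increase_pct(variant: str, baseline: str = "F16") -> float:
    """Relative perplexity increase over the baseline variant, in percent."""
    ppl = RESULTS[variant][0]
    base = RESULTS[baseline][0]
    return (ppl - base) / base * 100.0


def correct_answers(variant: str) -> int:
    """Number of FinanceBench questions the accuracy figure corresponds to."""
    return round(RESULTS[variant][1] / 100.0 * N_QUESTIONS)


for name in RESULTS:
    print(
        f"{name:7s} +{perplexity_increase_pct(name):.2f}% perplexity, "
        f"{correct_answers(name)}/{N_QUESTIONS} correct"
    )
```

With n=50, each question is worth 2 percentage points, so the gap between Q4_K_M and F16 amounts to two questions. Differences of that size are worth reading with caution.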

Variants

| Variant | Recommended use |
| --- | --- |
| Q4_K_M | Best balance: fits comfortably in Spark unified memory at 7B; default pick. |
| Q5_K_M | Higher quality than Q4_K_M with modest size bump. |
| Q6_K | Near-lossless; recommended if memory headroom allows. |
| Q8_0 | Effectively lossless; reach for this when quality matters more than throughput. |
| F16 | Reference: no quantization. Use only for measurement / baseline. |
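
On hardware with less memory than the Spark, one way to apply these recommendations is to pick the largest quantized variant whose file fits a given budget. The sketch below is a minimal illustration using the file sizes from the results table; `pick_variant` and its `headroom_gb` allowance (a rough reserve for KV cache and runtime buffers) are hypothetical, and F16 is left out because the card reserves it for baseline measurement.

```python
from typing import Optional

# File sizes in GB, copied from the results table on this card.
# F16 is omitted on purpose: the card recommends it only as a baseline.
SIZES_GB = {"Q4_K_M": 3.8, "Q5_K_M": 4.5, "Q6_K": 5.1, "Q8_0": 6.7}


def pick_variant(budget_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest quantized variant that fits the memory budget.

    headroom_gb is an assumed reserve for KV cache and runtime buffers;
    the real requirement depends on context length and backend.
    """
    usable = budget_gb - headroom_gb
    fitting = [(size, name) for name, size in SIZES_GB.items() if size <= usable]
    if not fitting:
        return None
    # max() compares sizes first, so it yields the largest fitting file.
    return max(fitting)[1]


print(pick_variant(6.0))
print(pick_variant(8.0))
```

File size is only a lower bound on memory use, so treat the result as a starting point and confirm it against actual usage.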

How to run

Pull a variant:

huggingface-cli download Orionfold/finance-chat-GGUF model-Q5_K_M.gguf \
  --local-dir ./models/finance-chat

Serve it via llama-server (OpenAI-compatible API):

llama-server -m ./models/finance-chat/model-Q5_K_M.gguf \
  -c 4096 -ngl 99 -t 8 \
  --host 0.0.0.0 --port 8080
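
Once the server is up, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library and assumes the server started by the command above is reachable at `localhost:8080` and exposes the `/v1/chat/completions` route; `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

# Assumption: llama-server from the command above is running on this machine.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Explain working capital."))
```

Keeping request construction separate from the network call makes the payload easy to inspect without a live server.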

Or run in-process via llama-cpp-python:

from llama_cpp import Llama
llm = Llama(
    model_path="./models/finance-chat/model-Q5_K_M.gguf",
    n_ctx=4096, n_gpu_layers=99, chat_format="llama-2",
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain working capital."}],
    temperature=0.0,
)
print(out["choices"][0]["message"]["content"])

LM Studio loads the GGUF directly, and Ollama loads it through a Modelfile; neither needs a conversion step.

Methods

Full methodology and Spark-side measurement protocol: Vertical-curator quants on Spark - finance-chat-GGUF + FinanceBench mini-eval.

Other Orionfold vertical curators

Same Spark-tested recipe across the curator-on-Spark series:

Each card lists its own measurement quad; the headline numbers are recorded as the actual sweep ran, never pre-corrected.


Published by Orionfold LLC · orionfold.com · Methods documented at ainative.business/field-notes.

Want to know when the next Orionfold vertical curator drops? Join the launch list at orionfold.com.

Format: GGUF
Model size: 7B params
Architecture: llama