GGUF Header Edit Benchmark

Benchmark script that measures how long it takes to edit GGUF headers in place on Hugging Face with streaming blobs (xet), creating one pull request per file.
For each file it fetches the metadata, rebuilds the header with a small change, commits an edit that covers only the header byte range, and records timings to a CSV.
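To make "header" concrete, here is a minimal sketch of reading the fixed-size preamble at the start of a GGUF file (magic, version, tensor count, metadata key/value count). It is not taken from benchmark.ts: the function and type names are hypothetical, the layout follows the public GGUF specification (version 2 and later), and the endianness check is a common heuristic rather than something this README confirms.

```typescript
// Hypothetical helper, not part of benchmark.ts.
// GGUF preamble (v2+): 4-byte magic "GGUF", uint32 version,
// uint64 tensor count, uint64 metadata key/value count = 24 bytes.

interface GgufPreamble {
  version: number;
  tensorCount: number;
  kvCount: number;
  littleEndian: boolean;
}

// Read a uint64 as a JS number. Safe for values below 2^53,
// which is far more than any realistic tensor or key count.
function readU64(view: DataView, offset: number, littleEndian: boolean): number {
  const first = view.getUint32(offset, littleEndian);
  const second = view.getUint32(offset + 4, littleEndian);
  const low = littleEndian ? first : second;
  const high = littleEndian ? second : first;
  return high * 4294967296 + low;
}

function readGgufPreamble(bytes: Uint8Array): GgufPreamble {
  if (bytes.byteLength < 24) {
    throw new Error("Need at least 24 bytes to read the GGUF preamble");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Magic bytes must spell "GGUF".
  const magic = String.fromCharCode(
    view.getUint8(0),
    view.getUint8(1),
    view.getUint8(2),
    view.getUint8(3)
  );
  if (magic !== "GGUF") {
    throw new Error("Not a GGUF file (bad magic)");
  }

  // Assumed heuristic: version numbers are small, so if the low 16 bits
  // are zero when read as little-endian, treat the file as big-endian.
  let littleEndian = true;
  let version = view.getUint32(4, true);
  if ((version & 0xffff) === 0) {
    littleEndian = false;
    version = view.getUint32(4, false);
  }

  return {
    version,
    tensorCount: readU64(view, 8, littleEndian),
    kvCount: readU64(view, 16, littleEndian),
    littleEndian,
  };
}
```

The metadata key/value pairs and tensor infos follow this preamble. Together they form the header region that gets rebuilt and re-committed, while the tensor data after it stays untouched.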

Results from benchmark.ts

Rule of thumb (linear fit):
time_minutes ≈ 0.36 × size_GB + 0.25

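| Model Size (GB) | Time (minutes) |
|-----------------|----------------|
| 0.5             | 0.28           |
| 1.0             | 0.47           |
| 1.5             | 0.24           |
| 2.0             | 1.06           |
| 2.5             | 1.29           |
| 3.0             | 1.43           |
| 3.5             | 1.59           |
| 4.0             | 1.61           |
| 4.5             | 1.82           |
| 5.0             | 1.98           |
| 5.5             | 2.10           |
| 6.0             | 2.18           |
| 6.5             | 2.14           |
| 7.0             | 4.73           |
| 7.5             | 5.04           |
| 8.0             | 2.71           |
| 8.5             | 2.75           |
| 9.0             | 3.03           |
| 9.5             | 3.11           |
| 10.0            | 3.24           |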
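
The rule of thumb can be reproduced from the table with an ordinary least-squares fit. The sketch below is illustrative and not part of benchmark.ts: `linearFit` and `estimateMinutes` are hypothetical names, and the data points are copied from the table above.

```typescript
// [sizeGb, minutes] pairs copied from the results table.
const results: Array<[number, number]> = [
  [0.5, 0.28], [1.0, 0.47], [1.5, 0.24], [2.0, 1.06], [2.5, 1.29],
  [3.0, 1.43], [3.5, 1.59], [4.0, 1.61], [4.5, 1.82], [5.0, 1.98],
  [5.5, 2.10], [6.0, 2.18], [6.5, 2.14], [7.0, 4.73], [7.5, 5.04],
  [8.0, 2.71], [8.5, 2.75], [9.0, 3.03], [9.5, 3.11], [10.0, 3.24],
];

interface Fit {
  slope: number;
  intercept: number;
}

// Ordinary least-squares fit of y = slope * x + intercept.
function linearFit(points: Array<[number, number]>): Fit {
  const n = points.length;
  let sumX = 0;
  let sumY = 0;
  for (const [x, y] of points) {
    sumX += x;
    sumY += y;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let sxx = 0;
  let sxy = 0;
  for (const [x, y] of points) {
    sxx += (x - meanX) * (x - meanX);
    sxy += (x - meanX) * (y - meanY);
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Estimate edit time for a file size, defaulting to the rounded rule of thumb.
function estimateMinutes(sizeGb: number, fit: Fit = { slope: 0.36, intercept: 0.25 }): number {
  return fit.slope * sizeGb + fit.intercept;
}

const fit = linearFit(results);
console.log(`slope=${fit.slope.toFixed(2)} intercept=${fit.intercept.toFixed(2)}`);
// prints "slope=0.36 intercept=0.25"
```

Individual rows can differ noticeably from the fitted line (for example the 7.0 GB and 7.5 GB rows), so the estimate is best read as a rough guide rather than a guarantee.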

✨ What this does

For each *.gguf file in a model repo:

  1. Discover files via the Hugging Face model tree API.
  2. Fetch GGUF + typed metadata with @huggingface/gguf.
  3. Rebuild the header using buildGgufHeader (preserving endianness, alignment, and tensor info range).
  4. Commit a slice edit (header bytes only) using commitIter with useXet: true to avoid full re-uploads.
  5. Create a PR titled benchmark.
  6. Record timing (wall-clock) to benchmark-results.csv.
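
The outer loop around these steps can be sketched as a small timing harness. This is an assumption-laden outline, not the code in benchmark.ts: the per-file work (steps 2 to 5) is passed in as a hypothetical `step` callback so no Hugging Face API calls are guessed at, and the CSV column names are illustrative because the real layout of benchmark-results.csv is not documented here.

```typescript
// Hypothetical per-file step: in the real script this would fetch metadata,
// rebuild the header, commit the slice edit, and open the PR.
type EditStep = (path: string) => Promise<void>;

interface TimingRow {
  path: string;
  seconds: number;
  ok: boolean;
}

// Keep only *.gguf paths from a repo file listing.
function selectGgufFiles(paths: string[]): string[] {
  return paths.filter((p) => /\.gguf$/.test(p));
}

// Run the step for each GGUF file and record wall-clock time.
// The clock is injectable so the harness can be tested deterministically.
async function runBenchmark(
  paths: string[],
  step: EditStep,
  now: () => number = Date.now
): Promise<TimingRow[]> {
  const rows: TimingRow[] = [];
  for (const path of selectGgufFiles(paths)) {
    const start = now();
    let ok = true;
    try {
      await step(path);
    } catch (err) {
      // Record the failure and keep going with the next file.
      ok = false;
    }
    rows.push({ path, seconds: (now() - start) / 1000, ok });
  }
  return rows;
}

// Quote a CSV field, doubling any embedded quotes.
function csvField(value: string): string {
  return `"${value.replace(/"/g, '""')}"`;
}

// Serialize rows to CSV text (illustrative columns).
function toCsv(rows: TimingRow[]): string {
  const lines = rows.map((r) => `${csvField(r.path)},${r.seconds.toFixed(2)},${r.ok}`);
  return ["path,seconds,ok", ...lines].join("\n");
}
```

Timing each file separately and catching errors per file means one failed commit does not discard the measurements already collected.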

🧱 Requirements

  • Node 18+
  • A Hugging Face token with read + write on the target repo: HF_TOKEN
  • NPM packages:
    • @huggingface/gguf
    • @huggingface/hub
  • Network access to huggingface.co

🔧 Setup

npm i
npm run benchmark