KaLM-Reranker-V1-Small Q8_0 GGUF

This model requires the patched llama.cpp runtime bundled in this repository's llama.cpp/ directory. Stock llama.cpp does not currently recognize the t5gemma2 architecture, so Ollama, LM Studio, llama-server and other stock llama.cpp frontends are not supported by this release.

This repository contains the text-only Q8_0 GGUF conversion of KaLM-Embedding/KaLM-Reranker-V1-Small. It is a reranker, not a text-generation or chat model.

Model file

| File | Quantization | Size | SHA256 |
| --- | --- | --- | --- |
| `kalm-reranker-v1-small-q8_0.gguf` | Q8_0 | 1,819,439,552 bytes | `fa120c141bfb3cc3a1a1234462e0d45bd92988b29c82fc8a33423a7468683e65` |

Architecture: t5gemma2; physical tensors: 679; text parameters: 1,697,782,016. The tokenizer is embedded in the GGUF.

Download

The custom CLI accepts only a local -m/--model path and does not implement llama.cpp's -hf option, so download the repository first and verify the checksum:

hf download KaLM-Embedding/KaLM-Reranker-V1-Small-Q8_0-GGUF \
  --local-dir KaLM-Reranker-V1-Small-Q8_0-GGUF
cd KaLM-Reranker-V1-Small-Q8_0-GGUF
sha256sum --check SHA256SUMS
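Where sha256sum is not available, the same check can be approximated in Python. This is a minimal sketch, not part of the repository: the helper name `sha256_of_file` is hypothetical, and the expected digest is copied from the "Model file" table.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Expected digest copied from the "Model file" table in this README.
EXPECTED = "fa120c141bfb3cc3a1a1234462e0d45bd92988b29c82fc8a33423a7468683e65"
MODEL = Path("kalm-reranker-v1-small-q8_0.gguf")

if MODEL.exists():
    actual = sha256_of_file(MODEL)
    print("OK" if actual == EXPECTED else f"MISMATCH: {actual}")
else:
    print(f"{MODEL} not found; run the download step first")
```

Reading in chunks keeps memory use flat even though the GGUF is about 1.8 GB.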

Build the required runtime

Prerequisites are Git, CMake, Ninja and a C++17 compiler. A CUDA toolkit is also required for the CUDA build. Download this repository first so the patch bundle is available, then:

MODEL_REPO="$PWD"
git clone https://github.com/ggml-org/llama.cpp llama.cpp-src
git -C llama.cpp-src checkout 277a105dc8f8643dab54331926a9830860a03292
bash "$MODEL_REPO/llama.cpp/apply-patches.sh" "$MODEL_REPO/llama.cpp-src"

cmake -S llama.cpp-src -B llama.cpp-src/build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build llama.cpp-src/build --target llama-kalm-reranker -j

For a CPU-only build:

cmake -S llama.cpp-src -B llama.cpp-src/build-cpu -G Ninja \
  -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF
cmake --build llama.cpp-src/build-cpu --target llama-kalm-reranker -j

Score one query-passage pair

CUDA, with an explicit guard against silent CPU fallback:

llama.cpp-src/build/bin/llama-kalm-reranker \
  -m kalm-reranker-v1-small-q8_0.gguf -ngl 99 --require-gpu \
  --query "What is the capital of China?" \
  --passage "The capital of China is Beijing."

CPU:

llama.cpp-src/build-cpu/bin/llama-kalm-reranker \
  -m kalm-reranker-v1-small-q8_0.gguf -ngl 0 \
  --query "What is the capital of China?" \
  --passage "The capital of China is Beijing."

The output is JSON containing yes_logit, no_logit, margin=yes_logit-no_logit, and score=sigmoid(margin). Logits can vary slightly across hardware and backends; use the margin or score for ranking rather than comparing against a hard-coded example logit.
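The relationship between the four output fields can be reproduced in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical and the logits are made-up values, not output from the model.

```python
import json
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_from_logits(yes_logit, no_logit):
    """Return (margin, score) as defined above: margin = yes - no, score = sigmoid(margin)."""
    margin = yes_logit - no_logit
    return margin, sigmoid(margin)


# Illustrative logits only; these are not outputs of the model.
margin, score = score_from_logits(3.0, 1.0)
print(json.dumps({"margin": margin, "score": round(score, 4)}))
```

Because the sigmoid is strictly increasing, sorting candidates by margin or by score produces the same order; the score is simply the margin mapped into the range (0, 1).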

JSONL scoring

llama.cpp-src/build/bin/llama-kalm-reranker \
  -m kalm-reranker-v1-small-q8_0.gguf -ngl 99 --require-gpu \
  --jsonl examples/input.jsonl --out scores.jsonl

Each input line is a JSON object with the string fields qid, pid, query and passage, plus an optional instruction field. Results preserve input order. The implementation scores pairs sequentially and reuses one context; it is not a parallel batch server.
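One way to prepare the input file and turn the results into a ranking is sketched below in Python. It makes two assumptions that this README does not confirm: each output row exposes a `score` field, as the single-pair JSON does, and rows are matched to inputs by position, relying on the order guarantee above. The helper names, the second passage and the placeholder scores are all illustrative.

```python
import json
from collections import defaultdict


def write_pairs(path, rows):
    """Write one JSON object per line with the fields the CLI expects."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            record = {
                "qid": str(row["qid"]),
                "pid": str(row["pid"]),
                "query": row["query"],
                "passage": row["passage"],
            }
            if row.get("instruction"):  # optional field, omitted when empty
                record["instruction"] = row["instruction"]
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def rank(inputs, outputs):
    """Pair rows by position, then sort each query's passages by score, best first."""
    if len(inputs) != len(outputs):
        raise ValueError("input and output row counts differ")
    by_query = defaultdict(list)
    for src, out in zip(inputs, outputs):
        # Assumption: each output row carries a "score" field.
        by_query[src["qid"]].append((src["pid"], out["score"]))
    return {
        qid: sorted(pairs, key=lambda item: item[1], reverse=True)
        for qid, pairs in by_query.items()
    }


pairs = [
    {"qid": "q1", "pid": "p1", "query": "What is the capital of China?",
     "passage": "The capital of China is Beijing."},
    {"qid": "q1", "pid": "p2", "query": "What is the capital of China?",
     "passage": "Shanghai is a large port city."},
]
write_pairs("input.jsonl", pairs)

# Placeholder scores standing in for scores.jsonl; not real model output.
placeholder_outputs = [{"score": 0.9}, {"score": 0.2}]
print(rank(read_jsonl("input.jsonl"), placeholder_outputs))
```

In practice `placeholder_outputs` would be replaced by `read_jsonl("scores.jsonl")` once the CLI has run.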

Frozen inference contract

Encoder text: `<Document>: {passage}`, right-padded, maximum 1024 tokens.
Encoder states use masked mean pooling over consecutive groups of 4 tokens.
Query content is limited to 512 tokens and inserted into the KaLM reranker system/instruction/query template.
The decoder starts with BOS token 2.
The answer token IDs are yes=4443 and no=1904, and score=sigmoid(yes_logit-no_logit).
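The pooling step in the contract can be illustrated with a small pure-Python sketch. It shows only the arithmetic of masked mean pooling over groups of 4; the actual runtime is the patched C++ code, and the function name `grouped_masked_mean` is hypothetical. How the runtime treats a group made up entirely of padding is not stated here, so the sketch assumes such a group yields a zero vector that remains masked.

```python
def grouped_masked_mean(states, mask, group=4):
    """Average the valid token vectors within each consecutive group of tokens.

    states: list of equal-length float vectors, one per token position.
    mask:   list of 0/1 flags, 1 for a real token and 0 for padding.
    Returns (pooled_states, pooled_mask).
    """
    if len(states) != len(mask):
        raise ValueError("states and mask must have the same length")
    if len(states) % group != 0:
        raise ValueError("sequence length must be a multiple of the group size")
    dim = len(states[0]) if states else 0
    pooled, pooled_mask = [], []
    for start in range(0, len(states), group):
        total = [0.0] * dim
        count = 0
        window = zip(states[start:start + group], mask[start:start + group])
        for vec, keep in window:
            if keep:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
        # Assumption: an all-padding group yields a zero vector that stays masked.
        pooled.append([t / count for t in total] if count else total)
        pooled_mask.append(1 if count else 0)
    return pooled, pooled_mask


# Six real tokens right-padded to eight positions, hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
          [2.0, 2.0], [4.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 1, 1, 1, 1, 0, 0]
pooled, pooled_mask = grouped_masked_mean(states, mask)
print(pooled, pooled_mask)
```

Padding values never enter the average, so the second group is the mean of its two real tokens only. Under this scheme a full 1024-token document would reduce to 256 pooled states.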

Evaluation

On the complete FiQA test set (all 648 queries and their frozen top-100 retriever candidates, 64,800 scored pairs in total), this Q8_0 model passed the full-set fidelity gate. The gate requires that neither NDCG@10 nor MRR@10 drops by more than 0.005 relative to Transformers BF16. See EVALUATION.md for the complete metrics and reproducibility note.
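The gate logic can be sketched in Python as follows. This is not the evaluation code behind EVALUATION.md: the function names are hypothetical, the NDCG variant uses linear gain and takes its ideal ordering from the candidate list itself (a simplification), and the metric values are placeholders rather than measured results.

```python
import math


def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for index, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / index
    return 0.0


def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k with linear gain; the ideal ordering comes from the same list."""
    def dcg(values):
        return sum(rel / math.log2(pos + 1)
                   for pos, rel in enumerate(values, start=1))
    ideal = dcg(sorted(ranked_relevance, reverse=True)[:k])
    return dcg(ranked_relevance[:k]) / ideal if ideal > 0 else 0.0


def passes_gate(reference, candidate, max_drop=0.005):
    """True when every metric drops by at most max_drop relative to the reference."""
    return all(reference[name] - candidate[name] <= max_drop
               for name in reference)


# Placeholder values for illustration; not results from EVALUATION.md.
reference = {"ndcg@10": 0.500, "mrr@10": 0.600}   # stands in for the BF16 baseline
candidate = {"ndcg@10": 0.497, "mrr@10": 0.599}   # stands in for the Q8_0 run
print(passes_gate(reference, candidate))
```

The gate is one-sided: a quantized run that scores higher than the reference on a metric still passes on that metric.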

Runtime limitations

One sequence at a time; JSONL rows are processed sequentially.
No KV cache, prefix cache, Flash Attention, server API, or generation.
CUDA and CPU are supported by the bundled patch set.
The first validated CUDA target is an NVIDIA H100 40 GB MIG instance.
The GGUF is text-only and excludes the vision tower and projector.

Troubleshooting

`unsupported architecture: t5gemma2` means a stock or unpatched llama.cpp binary is being used.
`llama-kalm-reranker: No such file or directory` means the patched target was not built, or the selected build directory is wrong.
A CUDA command that reports CPU buffers means the binary was built without GGML_CUDA=ON or cannot see the GPU. Keep --require-gpu enabled to turn fallback into an explicit error.
`unknown argument: -hf` is expected: this custom CLI requires a prior hf download and a local -m path.
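For scripted runs, the messages above can be mapped to their remedies with a small lookup. This is a hypothetical helper, not part of the repository, and it assumes the error text appears in the captured output exactly as written in this section.

```python
# Substrings taken from the troubleshooting notes, paired with the suggested fix.
HINTS = [
    ("unsupported architecture: t5gemma2",
     "Use the patched llama-kalm-reranker binary, not stock llama.cpp."),
    ("llama-kalm-reranker: No such file or directory",
     "Build the llama-kalm-reranker target and check the build directory."),
    ("unknown argument: -hf",
     "Download the model with hf download and pass a local -m path."),
]


def diagnose(stderr_text):
    """Return the hints whose error substring occurs in the captured text."""
    return [hint for needle, hint in HINTS if needle in stderr_text]


print(diagnose("error: unknown argument: -hf"))
```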

Reproducibility

manifest.json records model, tokenizer, quantization, runtime and validation provenance.
SHA256SUMS verifies the GGUF.
llama.cpp/PATCHSET.json pins the runtime patch base, final tested tree and patch hashes.

License and attribution

The source model declares Apache-2.0. See LICENSE and THIRD_PARTY_NOTICES.md. The bundled llama.cpp patches retain the upstream MIT license in llama.cpp/LICENSE.

Citation

If you find this model useful, please consider citing our papers.

@misc{zhao2026kalmrerankerv1,
      title={KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking}, 
      author={Xinping Zhao and Jiaxin Xu and Ziqi Dai and Xin Zhang and Shouzheng Huang and Danyu Tang and Xinshuo Hu and Meishan Zhang and Baotian Hu and Min Zhang},
      year={2026},
      eprint={2606.22807},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22807}, 
}

@misc{zhao2026kalmembeddingv2,
      title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model}, 
      author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Xin Zhang and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
      year={2025},
      eprint={2506.20923},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.20923}, 
}

@misc{hu2025kalmembedding,
      title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model}, 
      author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
      year={2025},
      eprint={2501.01028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01028}, 
}