KaLM-Reranker-V1-Small Q8_0 GGUF
This model requires the patched llama.cpp runtime bundled in
llama.cpp/. Stock llama.cpp does not currently recognize the `t5gemma2` architecture. Ollama, LM Studio, `llama-server` and other stock llama.cpp frontends are not supported by this release.
This repository contains the text-only Q8_0 GGUF conversion of
KaLM-Embedding/KaLM-Reranker-V1-Small. It is a reranker,
not a text-generation or chat model.
Model file
| File | Quantization | Size | SHA256 |
|---|---|---|---|
| kalm-reranker-v1-small-q8_0.gguf | Q8_0 | 1,819,439,552 bytes | fa120c141bfb3cc3a1a1234462e0d45bd92988b29c82fc8a33423a7468683e65 |
Architecture: t5gemma2; physical tensors: 679;
text parameters: 1,697,782,016. The tokenizer is embedded
in the GGUF.
Download
The custom CLI accepts a local -m/--model path and does not implement
llama.cpp's -hf option:
hf download KaLM-Embedding/KaLM-Reranker-V1-Small-Q8_0-GGUF \
--local-dir KaLM-Reranker-V1-Small-Q8_0-GGUF
cd KaLM-Reranker-V1-Small-Q8_0-GGUF
sha256sum --check SHA256SUMS
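Where `sha256sum` is not available, the same integrity check can be sketched in Python with only the standard library. The expected digest is the one listed in the Model file table; the helper name `sha256_of_file` is illustrative and not part of this repository.

```python
import hashlib
from pathlib import Path

# Digest listed for kalm-reranker-v1-small-q8_0.gguf in the Model file table.
EXPECTED_SHA256 = "fa120c141bfb3cc3a1a1234462e0d45bd92988b29c82fc8a33423a7468683e65"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the large GGUF is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    model = Path("kalm-reranker-v1-small-q8_0.gguf")
    if model.exists():
        actual = sha256_of_file(model)
        print("OK" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
    else:
        print(f"{model} not found; run the download step first")
```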
Build the required runtime
Prerequisites are Git, CMake, Ninja and a C++17 compiler. A CUDA toolkit is also required for the CUDA build. Download this repository first so the patch bundle is available, then:
MODEL_REPO="$PWD"
git clone https://github.com/ggml-org/llama.cpp llama.cpp-src
git -C llama.cpp-src checkout 277a105dc8f8643dab54331926a9830860a03292
bash "$MODEL_REPO/llama.cpp/apply-patches.sh" "$MODEL_REPO/llama.cpp-src"
cmake -S llama.cpp-src -B llama.cpp-src/build -G Ninja \
-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build llama.cpp-src/build --target llama-kalm-reranker -j
For a CPU-only build:
cmake -S llama.cpp-src -B llama.cpp-src/build-cpu -G Ninja \
-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF
cmake --build llama.cpp-src/build-cpu --target llama-kalm-reranker -j
Score one query-passage pair
CUDA, with an explicit guard against silent CPU fallback:
llama.cpp-src/build/bin/llama-kalm-reranker \
-m kalm-reranker-v1-small-q8_0.gguf -ngl 99 --require-gpu \
--query "What is the capital of China?" \
--passage "The capital of China is Beijing."
CPU:
llama.cpp-src/build-cpu/bin/llama-kalm-reranker \
-m kalm-reranker-v1-small-q8_0.gguf -ngl 0 \
--query "What is the capital of China?" \
--passage "The capital of China is Beijing."
The output is JSON containing yes_logit, no_logit,
margin=yes_logit-no_logit, and score=sigmoid(margin). Logits can vary
slightly across hardware and backends; use the margin or score for ranking
rather than comparing against a hard-coded example logit.
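The relation between the four output fields can be sketched as follows. The field names come from the description above; the helper names and the example logits are illustrative and are not values produced by the model.

```python
import json
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_from_logits(yes_logit, no_logit):
    """Return (margin, score) with margin=yes_logit-no_logit, score=sigmoid(margin)."""
    margin = yes_logit - no_logit
    return margin, sigmoid(margin)


def parse_result(json_text):
    """Parse one JSON result and recompute margin and score from its logits."""
    result = json.loads(json_text)
    margin, score = score_from_logits(result["yes_logit"], result["no_logit"])
    return {**result, "margin": margin, "score": score}


# Illustrative logits only, not real model output.
example = '{"yes_logit": 3.0, "no_logit": -1.0}'
print(parse_result(example))
```

Because the sigmoid is monotonic, sorting candidates by margin or by score yields the same order. For very large margins the score saturates near 0 or 1 in floating point, so the margin is the safer sort key.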
JSONL scoring
llama.cpp-src/build/bin/llama-kalm-reranker \
-m kalm-reranker-v1-small-q8_0.gguf -ngl 99 --require-gpu \
--jsonl examples/input.jsonl --out scores.jsonl
Each input line contains string fields qid, pid, query, passage, and
an optional instruction. Results preserve input order. The implementation
scores pairs sequentially and reuses one context; it is not a parallel batch
server.
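A minimal sketch of preparing an input file and ranking the results. The input field names are the ones listed above; the example rows, the file name `input.jsonl`, and the helper names are illustrative. Because results preserve input order, the sketch pairs input and output rows by position. It assumes each output row carries a score field like the single-pair output, and it does not assume that output rows repeat qid or pid.

```python
import json
from collections import defaultdict


def write_input(path, rows):
    """Write one JSON object per line with the documented string fields."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def rank(input_rows, output_rows):
    """Group by qid and sort passages by descending score.

    Rows are paired by position, relying on preserved input order.
    Assumption: each output row has a "score" field.
    """
    if len(input_rows) != len(output_rows):
        raise ValueError("input and output row counts differ")
    ranked = defaultdict(list)
    for src, out in zip(input_rows, output_rows):
        ranked[src["qid"]].append((src["pid"], out["score"]))
    for pairs in ranked.values():
        pairs.sort(key=lambda item: item[1], reverse=True)
    return dict(ranked)


rows = [
    {"qid": "q1", "pid": "p1", "query": "What is the capital of China?",
     "passage": "The capital of China is Beijing."},
    {"qid": "q1", "pid": "p2", "query": "What is the capital of China?",
     "passage": "Shanghai is a large port city."},
]
write_input("input.jsonl", rows)
# After running llama-kalm-reranker with --jsonl input.jsonl --out scores.jsonl:
# ranking = rank(read_jsonl("input.jsonl"), read_jsonl("scores.jsonl"))
```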
Frozen inference contract
- Encoder text: `<Document>: {passage}`, right padding, maximum 1024 tokens.
- Encoder states use masked mean pooling over consecutive groups of 4 tokens.
- Query content is limited to 512 tokens and inserted into the KaLM reranker system/instruction/query template.
- The decoder starts with BOS token 2.
- `yes=4443`, `no=1904`, and `score=sigmoid(yes_logit-no_logit)`.
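The pooling step in the contract can be illustrated in plain Python. This is a sketch of masked mean pooling over consecutive groups of 4 positions, not the bundled runtime's implementation. How a group made up entirely of padding is handled is an assumption: here it yields a zero vector with mask 0.

```python
GROUP_SIZE = 4  # group width stated in the contract above


def pool_groups(states, mask, group_size=GROUP_SIZE):
    """Masked mean pooling over consecutive groups of tokens.

    states: list of per-token vectors (lists of floats), right-padded.
    mask:   list of 1 (real token) or 0 (padding), same length as states.
    Returns (pooled_states, pooled_mask).
    """
    if len(states) != len(mask):
        raise ValueError("states and mask must have the same length")
    dim = len(states[0]) if states else 0
    pooled, pooled_mask = [], []
    for start in range(0, len(states), group_size):
        total = [0.0] * dim
        count = 0
        for vec, keep in zip(states[start:start + group_size],
                             mask[start:start + group_size]):
            if keep:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
        # Assumption: an all-padding group becomes a zero vector with mask 0.
        pooled.append([t / count for t in total] if count else total)
        pooled_mask.append(1 if count else 0)
    return pooled, pooled_mask


# Six real tokens right-padded to eight positions, hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
          [2.0, 0.0], [4.0, 2.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 1, 1, 1, 1, 0, 0]
print(pool_groups(states, mask))
# → ([[4.0, 5.0], [3.0, 1.0]], [1, 1])
```

With the 1024-token limit and groups of 4, a passage produces at most 256 pooled encoder states.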
Evaluation
On the complete FiQA test set—all 648 queries and their frozen top-100
retriever candidates, or 64,800 scored pairs—this Q8_0 model passed the
full-set fidelity gate. The gate requires both NDCG@10 and MRR@10 drops
relative to Transformers BF16 to be no greater than 0.005. See
EVALUATION.md for the complete metrics and reproducibility
note.
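The gate condition can be written as a small check. This sketch only encodes the threshold rule stated above; the function name, the metric keys, and the numbers in the usage example are illustrative placeholders, not the metrics reported in EVALUATION.md.

```python
MAX_DROP = 0.005  # threshold stated above


def passes_fidelity_gate(reference, candidate, max_drop=MAX_DROP):
    """Return True if NDCG@10 and MRR@10 each drop by at most max_drop.

    reference: metrics of the Transformers BF16 baseline.
    candidate: metrics of the quantized model.
    A candidate that scores higher than the reference has a negative drop
    and therefore passes.
    """
    for metric in ("ndcg@10", "mrr@10"):
        drop = reference[metric] - candidate[metric]
        if drop > max_drop:
            return False
    return True


# Placeholder values for illustration only.
reference = {"ndcg@10": 0.500, "mrr@10": 0.600}
candidate = {"ndcg@10": 0.497, "mrr@10": 0.601}
print(passes_fidelity_gate(reference, candidate))
```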
Runtime limitations
- One sequence at a time; JSONL rows are processed sequentially.
- No KV cache, prefix cache, Flash Attention, server API, or generation.
- CUDA and CPU are supported by the bundled patch set.
- The first validated CUDA target is an NVIDIA H100 40 GB MIG instance.
- The GGUF is text-only and excludes the vision tower and projector.
Troubleshooting
- `unsupported architecture: t5gemma2` means a stock or unpatched llama.cpp binary is being used.
- `llama-kalm-reranker: No such file or directory` means the patched target was not built, or the selected build directory is wrong.
- A CUDA command that reports CPU buffers was built without `GGML_CUDA=ON` or cannot see the GPU. Keep `--require-gpu` enabled to turn fallback into an explicit error.
- `unknown argument: -hf` is expected: this custom CLI requires a prior `hf download` and a local `-m` path.
Reproducibility
- `manifest.json` records model, tokenizer, quantization, runtime and validation provenance.
- `SHA256SUMS` verifies the GGUF.
- `llama.cpp/PATCHSET.json` pins the runtime patch base, final tested tree and patch hashes.
License and attribution
The source model declares Apache-2.0. See LICENSE and
THIRD_PARTY_NOTICES.md. The bundled llama.cpp
patches retain the upstream MIT license in
llama.cpp/LICENSE.
Citation
If you find this model useful, please consider citing our papers.
@misc{zhao2026kalmrerankerv1,
title={KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking},
author={Xinping Zhao and Jiaxin Xu and Ziqi Dai and Xin Zhang and Shouzheng Huang and Danyu Tang and Xinshuo Hu and Meishan Zhang and Baotian Hu and Min Zhang},
year={2026},
eprint={2606.22807},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.22807},
}
@misc{zhao2026kalmembeddingv2,
title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model},
author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Xin Zhang and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
year={2025},
eprint={2506.20923},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.20923},
}
@misc{hu2025kalmembedding,
title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
year={2025},
eprint={2501.01028},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.01028},
}
Model tree for KaLM-Embedding/KaLM-Reranker-V1-Small-Q8_0-GGUF
Base model
google/t5gemma-2-1b-1b