KaLM-Reranker-V1-Large Q8_0 GGUF

This model requires the patched llama.cpp runtime bundled in llama.cpp/. Stock llama.cpp does not currently recognize the t5gemma2 architecture. Ollama, LM Studio, llama-server and other stock llama.cpp frontends are not supported by this release.

This repository contains the text-only Q8_0 GGUF conversion of KaLM-Embedding/KaLM-Reranker-V1-Large. It is a reranker, not a text-generation or chat model.

Model file

| File | Quantization | Size | SHA256 |
| --- | --- | --- | --- |
| kalm-reranker-v1-large-q8_0.gguf | Q8_0 | 7,549,112,576 bytes | 773b20cb15c4dd459be2c6eabbb1f488ee7379cfec51b8cb7f30ca9e04926829 |

Architecture: t5gemma2; physical tensors: 887; text parameters: 7,089,110,016. The tokenizer is embedded in the GGUF.
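
As a quick sanity check on a downloaded file, the GGUF header can be read directly. The sketch below assumes the standard little-endian GGUF layout for version 2 or later (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key-value count); `read_gguf_header` is a hypothetical helper, not part of the bundled runtime. The header's tensor count presumably corresponds to the 887 physical tensors listed above.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumed layout (GGUF v2+): uint32 version, uint64 tensors, uint64 KV pairs.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count


# Example call (requires the downloaded model file):
# version, tensors, kv_pairs = read_gguf_header("kalm-reranker-v1-large-q8_0.gguf")
```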

Download

The custom CLI accepts only a local -m/--model path and does not implement llama.cpp's -hf option, so download the repository and verify the checksums first:

hf download KaLM-Embedding/KaLM-Reranker-V1-Large-Q8_0-GGUF \
  --local-dir KaLM-Reranker-V1-Large-Q8_0-GGUF
cd KaLM-Reranker-V1-Large-Q8_0-GGUF
sha256sum --check SHA256SUMS
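
Where sha256sum is not available, the same check can be approximated in Python. This is a minimal sketch that compares the file against the size and SHA256 listed in the model file table; `sha256_of_file` and `verify_model` are illustrative names, not part of this repository.

```python
import hashlib
import os

# Values copied from the model file table in this README.
EXPECTED_SHA256 = "773b20cb15c4dd459be2c6eabbb1f488ee7379cfec51b8cb7f30ca9e04926829"
EXPECTED_SIZE = 7_549_112_576


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so the 7.5 GB model never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path):
    """Return True only if both the byte size and the SHA256 digest match."""
    if os.path.getsize(path) != EXPECTED_SIZE:
        return False
    return sha256_of_file(path) == EXPECTED_SHA256


# Example call (requires the downloaded model file):
# print(verify_model("kalm-reranker-v1-large-q8_0.gguf"))
```

Checking the size first is a cheap way to catch a truncated download before hashing the whole file.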

Build the required runtime

Prerequisites are Git, CMake, Ninja and a C++17 compiler; the CUDA build also requires a CUDA toolkit. Download this repository first so the patch bundle is available, then run the following from the repository root:

MODEL_REPO="$PWD"
git clone https://github.com/ggml-org/llama.cpp llama.cpp-src
git -C llama.cpp-src checkout 277a105dc8f8643dab54331926a9830860a03292
bash "$MODEL_REPO/llama.cpp/apply-patches.sh" "$MODEL_REPO/llama.cpp-src"

cmake -S llama.cpp-src -B llama.cpp-src/build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build llama.cpp-src/build --target llama-kalm-reranker -j

For a CPU-only build:

cmake -S llama.cpp-src -B llama.cpp-src/build-cpu -G Ninja \
  -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF
cmake --build llama.cpp-src/build-cpu --target llama-kalm-reranker -j

Score one query-passage pair

CUDA, with an explicit guard against silent CPU fallback:

llama.cpp-src/build/bin/llama-kalm-reranker \
  -m kalm-reranker-v1-large-q8_0.gguf -ngl 99 --require-gpu \
  --query "What is the capital of China?" \
  --passage "The capital of China is Beijing."

CPU:

llama.cpp-src/build-cpu/bin/llama-kalm-reranker \
  -m kalm-reranker-v1-large-q8_0.gguf -ngl 0 \
  --query "What is the capital of China?" \
  --passage "The capital of China is Beijing."
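
The same invocation can be driven from Python. The sketch below uses only the flags shown in the commands above; `build_command` and `score_pair` are hypothetical helpers, and the parsing step assumes the tool prints exactly one JSON object to stdout with any logs going to stderr, which should be checked against the actual binary.

```python
import json
import subprocess


def build_command(binary, model, query, passage, use_gpu=True):
    """Assemble the argument list for one query-passage pair."""
    cmd = [binary, "-m", model]
    # Mirror the two documented modes: full offload with a GPU guard, or CPU only.
    cmd += ["-ngl", "99", "--require-gpu"] if use_gpu else ["-ngl", "0"]
    cmd += ["--query", query, "--passage", passage]
    return cmd


def score_pair(binary, model, query, passage, use_gpu=True):
    """Run the CLI once and return the parsed result as a dict."""
    cmd = build_command(binary, model, query, passage, use_gpu)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Assumption: stdout contains a single JSON object and nothing else.
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems when a query or passage contains quotes or newlines.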

The output is JSON containing yes_logit, no_logit, margin=yes_logit-no_logit, and score=sigmoid(margin). Logits can vary slightly across hardware and backends; use the margin or score for ranking rather than comparing against a hard-coded example logit.
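
The relation between the four output fields can be written out in a few lines. This is a sketch of the arithmetic only, with an illustrative function name; it does not reproduce the runtime's code.

```python
import math


def score_from_logits(yes_logit, no_logit):
    """Return (margin, score) where score = sigmoid(yes_logit - no_logit)."""
    margin = yes_logit - no_logit
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        score = 1.0 / (1.0 + math.exp(-margin))
    else:
        e = math.exp(margin)
        score = e / (1.0 + e)
    return margin, score
```

Because the sigmoid is strictly increasing, sorting candidates by margin or by score yields the same order; the score is simply the margin mapped into the range 0 to 1.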

JSONL scoring

llama.cpp-src/build/bin/llama-kalm-reranker \
  -m kalm-reranker-v1-large-q8_0.gguf -ngl 99 --require-gpu \
  --jsonl examples/input.jsonl --out scores.jsonl

Each input line contains string fields qid, pid, query, passage, and an optional instruction. Results preserve input order. The implementation scores pairs sequentially and reuses one context; it is not a parallel batch server.
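
A small helper can prepare the input file and turn the results into per-query rankings. The sketch below writes the documented input fields and relies on the stated order preservation to match outputs to inputs by position. It assumes each output row is a JSON object with a score field, as in the single-pair output; the function names are illustrative.

```python
import json
from collections import defaultdict

REQUIRED_FIELDS = ("qid", "pid", "query", "passage")


def write_pairs(path, pairs):
    """Write one JSON object per line with the documented string fields."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            row = {key: str(pair[key]) for key in REQUIRED_FIELDS}
            if pair.get("instruction"):
                row["instruction"] = str(pair["instruction"])
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def rank_by_query(pairs, score_lines):
    """Match outputs to inputs by position, then sort each query's passages."""
    score_lines = [line for line in score_lines if line.strip()]
    if len(score_lines) != len(pairs):
        raise ValueError("input and output row counts differ")
    by_query = defaultdict(list)
    for pair, line in zip(pairs, score_lines):
        # Assumption: every output row carries a "score" field.
        by_query[pair["qid"]].append((json.loads(line)["score"], pair["pid"]))
    return {
        qid: [pid for _, pid in sorted(items, key=lambda item: item[0], reverse=True)]
        for qid, items in by_query.items()
    }
```

After running the CLI, `rank_by_query(pairs, open("scores.jsonl", encoding="utf-8"))` would return a mapping from each qid to its pids in descending score order.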

Frozen inference contract

  • Encoder text: <Document>: {passage}, right padding, maximum 1024 tokens.
  • Encoder states use masked mean pooling over consecutive groups of 4 tokens.
  • Query content is limited to 512 tokens and inserted into the KaLM reranker system/instruction/query template.
  • The decoder starts with BOS token 2.
  • yes=4443, no=1904, and score=sigmoid(yes_logit-no_logit).
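
The pooling step in the second bullet can be illustrated with a small pure-Python sketch. It is not the runtime's implementation: the function name, the use of plain lists, and the handling of groups that contain only padding (a zero vector that is masked out) are assumptions made for illustration.

```python
def group_mean_pool(states, mask, group=4):
    """Average unmasked token vectors within consecutive groups of `group` tokens.

    states: list of T vectors (lists of floats), one per encoder token.
    mask:   list of T ints, 1 for a real token and 0 for right padding.
    Returns (pooled_states, pooled_mask), one entry per group.
    """
    if len(states) != len(mask):
        raise ValueError("states and mask must have the same length")
    pooled, pooled_mask = [], []
    for start in range(0, len(states), group):
        chunk = states[start:start + group]
        # Keep only the positions that the mask marks as real tokens.
        kept = [vec for vec, m in zip(chunk, mask[start:start + group]) if m]
        if kept:
            dim = len(kept[0])
            pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
            pooled_mask.append(1)
        else:
            # Assumed behaviour: an all-padding group stays masked out.
            pooled.append([0.0] * len(chunk[0]))
            pooled_mask.append(0)
    return pooled, pooled_mask
```

Under this reading, a passage padded to the 1024-token maximum would yield 256 pooled positions, and padding inside a partly filled group does not dilute the mean.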

Evaluation

On the complete FiQA test set (all 648 queries with their frozen top-100 retriever candidates, 64,800 scored pairs in total), this Q8_0 model passed the full-set fidelity gate. The gate requires that NDCG@10 and MRR@10 each drop by no more than 0.005 relative to Transformers BF16. See EVALUATION.md for the complete metrics and reproducibility note.
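
For readers who want to apply a similar gate to their own runs, the check can be sketched as follows. The metric definitions are the common textbook ones (linear gain with a log2 rank discount for NDCG) and are an assumption; EVALUATION.md remains the reference for how the reported numbers were computed. All function names and dictionary keys are illustrative.

```python
import math


def mrr_at_k(ranked_pids, relevance, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_pids[:k], start=1):
        if relevance.get(pid, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_pids, relevance, k=10):
    """NDCG@k with linear gain; `relevance` maps pid to a graded label."""
    dcg = sum(relevance.get(pid, 0) / math.log2(i + 2)
              for i, pid in enumerate(ranked_pids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def passes_gate(reference, quantized, max_drop=0.005):
    """True if neither metric drops by more than max_drop versus the reference."""
    return all(reference[m] - quantized[m] <= max_drop
               for m in ("ndcg@10", "mrr@10"))
```

Each metric would be averaged over all queries before calling `passes_gate`. A quantized run that scores higher than the reference passes trivially, because only drops are penalized.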

Runtime limitations

  • One sequence at a time; JSONL rows are processed sequentially.
  • No KV cache, prefix cache, Flash Attention, server API, or generation.
  • CUDA and CPU are supported by the bundled patch set.
  • The first validated CUDA target is an NVIDIA H100 40 GB MIG instance.
  • The GGUF is text-only and excludes the vision tower and projector.

Troubleshooting

  • unsupported architecture: t5gemma2 means a stock or unpatched llama.cpp binary is being used.
  • llama-kalm-reranker: No such file or directory means the patched target was not built, or the selected build directory is wrong.
  • If a CUDA command reports CPU buffers, the binary was built without GGML_CUDA=ON or cannot see the GPU. Keep --require-gpu enabled to turn fallback into an explicit error.
  • unknown argument: -hf is expected: this custom CLI requires a prior hf download and a local -m path.

Reproducibility

License and attribution

The source model declares Apache-2.0. See LICENSE and THIRD_PARTY_NOTICES.md. The bundled llama.cpp patches retain the upstream MIT license in llama.cpp/LICENSE.

Citation

If you find this model useful, please consider citing our papers.

@misc{zhao2026kalmrerankerv1,
      title={KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking}, 
      author={Xinping Zhao and Jiaxin Xu and Ziqi Dai and Xin Zhang and Shouzheng Huang and Danyu Tang and Xinshuo Hu and Meishan Zhang and Baotian Hu and Min Zhang},
      year={2026},
      eprint={2606.22807},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22807}, 
}

@misc{zhao2026kalmembeddingv2,
      title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model}, 
      author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Xin Zhang and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
      year={2025},
      eprint={2506.20923},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.20923}, 
}

@misc{hu2025kalmembedding,
      title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model}, 
      author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
      year={2025},
      eprint={2501.01028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01028}, 
}