metadata
license: mit
base_model:
  - nomic-ai/CodeRankEmbed
base_model_relation: quantized
library_name: gguf
pipeline_tag: feature-extraction
language:
  - code
tags:
  - gguf
  - llama.cpp
  - flash-attention
  - code-retrieval
  - embeddings
  - nomic-bert

CodeRankEmbed-GGUF

GGUF quantizations of nomic-ai/CodeRankEmbed, a 137M-parameter nomic-bert code-embedding model (768-dim, CLS-pooled, trained to 2048 tokens).

CodeRankEmbed's stock attention is eager-only: there is no flash-attention or SDPA path in its trust_remote_code modeling file, so peak memory grows as O(batch × heads × seq²) and it OOMs at high batch even at 137M params. This repo is one of two ways we give it flash attention: the llama.cpp / GGUF runtime path. Converting to GGUF discards the eager Python entirely; llama.cpp re-implements the architecture in its own graph and runs its own flash-attention kernel. The result, built from the original FP32 safetensors and verified end-to-end: faithful metadata, flash attention engages, retrieval-lossless vs the full-precision reference, and ~5.8× lower peak VRAM.

Two paths to flash attention

The same eager-only model, fixed two different ways:

| Path | Repo | How | Runtime & deps | Reach for it when |
|---|---|---|---|---|
| PyTorch native varlen | handwoven8588/CodeRankEmbed-flash-attn | flash_attn varlen baked into modeling_hf_nomic_bert.py (same weights → identical embeddings) | sentence-transformers / PyTorch; needs flash_attn + a CUDA half-precision GPU | you serve through PyTorch/ST and want full/half-precision embeddings without changing stack |
| GGUF + llama.cpp (this repo) | handwoven8588/CodeRankEmbed-GGUF | convert to GGUF; inherit llama.cpp's own ggml flash-attention kernel | llama.cpp (llama-server); no torch / flash_attn / triton; quantized, small | you serve through llama.cpp, want to drop trust_remote_code, or want a small (CPU-capable) artifact |

Files

| File | Quant | Size | Cosine vs FP32 sentence-transformers |
|---|---|---|---|
| CodeRankEmbed-f16.gguf | F16 | 274 MB | 0.99999 (very high quality) |
| CodeRankEmbed-Q8_0.gguf | Q8_0 | 146 MB | 0.998 (high quality) |

Both are retrieval-lossless on our fairly simple delexicalized code-search benchmark (below).
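The cosine column compares each GGUF embedding with the FP32 sentence-transformers embedding of the same text. A minimal sketch of such a parity check follows; averaging the per-text cosines is an assumption (this card does not say how the figure was aggregated), and `cosine` and `mean_cosine_parity` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine_parity(reference, candidate):
    """Average per-text cosine between two aligned lists of embeddings.

    Assumption: reference[i] and candidate[i] embed the same input text.
    """
    if not reference or len(reference) != len(candidate):
        raise ValueError("need two non-empty, aligned lists of embeddings")
    total = sum(cosine(r, c) for r, c in zip(reference, candidate))
    return total / len(reference)


# Toy 3-dim vectors standing in for the real 768-dim embeddings.
fp32_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gguf_embeddings = [[0.999, 0.01, 0.0], [0.0, 0.998, 0.02]]
print(f"mean cosine: {mean_cosine_parity(fp32_embeddings, gguf_embeddings):.5f}")
```

A high mean cosine only says the vectors point in nearly the same direction; it does not by itself show that rankings are preserved.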

Quick start: serve as an embedder

CodeRankEmbed is CLS-pooled and uses a query-only instruction prefix. Both are serve-time settings the GGUF does not carry, so you must set them:

llama-server -m CodeRankEmbed-Q8_0.gguf --embeddings --pooling cls -c 2048 --embd-normalize 2 -fa on -ngl 99
  • Pooling: --pooling cls (NOT mean). nomic-embed-text is mean-pooled; copy-pasting its serve command silently lowers recall. This is the single most likely mistake.
  • Query prefix: prepend Represent this query for searching relevant code: to queries only, never to documents/code. Do it client-side; the GGUF doesn't carry the sentence-transformers prompt.
  • Normalization: --embd-normalize 2 (L2) so downstream cosine is correct.
  • Context / RoPE: cap at the trained length (-c 2048); the unusual rope.freq_base = 1000 is baked in and must not be overridden.
  • Flash attention: -fa on (or -fa auto on GPU) is what cuts peak VRAM ~5.8× (see below). On a CPU-only backend the fused kernel falls back and the win evaporates, but it still runs.
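The client-side half of the settings above can be sketched as follows. This is a minimal example, assuming llama-server is listening on its default http://localhost:8080 and that you call its OpenAI-compatible /v1/embeddings endpoint; the helper names (`prepare_inputs`, `build_request`, `embed`) are hypothetical, and the single trailing space after the prefix's colon is an assumption about how the prefix joins the query.

```python
import json
import urllib.request

# Query-only instruction prefix from this card. The trailing space is an
# assumption about the separator between prefix and query text.
QUERY_PREFIX = "Represent this query for searching relevant code: "


def prepare_inputs(texts, is_query):
    """Prefix queries only; documents/code are passed through unchanged."""
    return [QUERY_PREFIX + t if is_query else t for t in texts]


def build_request(texts, is_query, base_url="http://localhost:8080"):
    """Build (but do not send) a POST to the /v1/embeddings endpoint."""
    payload = {"input": prepare_inputs(texts, is_query)}
    return urllib.request.Request(
        base_url + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, is_query, base_url="http://localhost:8080"):
    """Send the request and return embeddings in input order."""
    request = build_request(texts, is_query, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry an "index" per item; sort defensively.
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With a server running, `embed(["binary search over a sorted list"], is_query=True)` would embed a query and `embed([source_code], is_query=False)` a document. Pooling, normalization and flash attention stay on the server command line; nothing in the client can correct a server started with the wrong `--pooling`.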

Quality: retrieval-lossless, not just high cosine

Cosine ≈ 0.998 is necessary but not sufficient: it does not prove ranking is preserved. We validated on a delexicalized code-retrieval benchmark (N≈1170 graded queries with the lexical overlap stripped against the deployed tokenizer, so only semantic matching can recover the gold), measuring nDCG@10 across a difficulty curve from L0 (original query) to Lall (every content token paraphrased away):

| model | L0 | L1 | L3 | Lall | drop (L0→Lall) |
|---|---|---|---|---|---|
| BM25 (deployed tokenizer) | 0.608 | 0.512 | 0.462 | 0.456 | 0.152 |
| FP32 (sentence-transformers, reference) | 0.954 | 0.944 | 0.920 | 0.916 | 0.038 |
| GGUF f16 | 0.954 | 0.944 | 0.921 | 0.916 | 0.037 |
| GGUF Q8_0 | 0.954 | 0.943 | 0.921 | 0.916 | 0.038 |

Every difference from the FP32 reference is ≤ 0.0006 nDCG@10 at every level, including at Lall where the lexical shortcut is entirely gone. The degradation slope, the quantity that would expose a real loss of code understanding, is flat: FP32 0.0379, f16 0.0374, Q8_0 0.0377 (Δslope −0.0002 for Q8_0 vs FP32). Quantizing to eight bits costs no retrieval quality on the queries hard enough to measure it.
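For readers who want to reproduce the metric, here is a minimal sketch of nDCG@10 and of the L0→Lall drop. It assumes the common linear-gain, log2-discount form of DCG and builds the ideal ranking from the same list of grades; the benchmark's exact gain function and judged pool are not specified in this card, so treat both as assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k, with the ideal ordering taken from the same grades.

    Assumption: every judged document for the query appears in
    ranked_relevances, so sorting it gives the ideal ranking.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def drop(scores_by_level):
    """Degradation from the original query (L0) to full paraphrase (Lall)."""
    return scores_by_level["L0"] - scores_by_level["Lall"]


# Rounded values copied from the table above.
results = {
    "BM25": {"L0": 0.608, "Lall": 0.456},
    "FP32": {"L0": 0.954, "Lall": 0.916},
    "GGUF Q8_0": {"L0": 0.954, "Lall": 0.916},
}
for name, scores in results.items():
    print(f"{name}: drop = {drop(scores):.3f}")
```

Because the inputs are the rounded table entries, the recomputed drops can differ in the last digit from figures derived from unrounded scores.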

Efficiency: the GGUF flash-attention win

The full-precision model's eager attention has peak memory O(batch × heads × seq²). The GGUF path discards that Python entirely and runs llama.cpp's own flash-attention kernel, which engages even for this bidirectional encoder. Measured on an RTX 3090 Ti (24 GB), 16 sequences × 2048-token context:

| Setting | peak VRAM |
|---|---|
| -fa off (eager) | 11.9 GB |
| -fa on | 2.0 GB (~5.8× less) |

The same corpus that OOMs the FP32 reference at batch 64 embeds end-to-end in seconds on the GGUF.
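To see why the eager path scales so badly, the O(batch × heads × seq²) term can be evaluated directly. The sketch below counts only the materialized attention-score matrix for one layer; the head count of 12 and the 4-byte element size are assumptions (neither is stated in this card), and the result is an illustration of the scaling, not a reconstruction of the measured 11.9 GB.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_element):
    """Bytes needed to materialize one layer's batch x heads x seq x seq scores."""
    return batch * heads * seq_len * seq_len * bytes_per_element


ASSUMED_HEADS = 12       # assumption: not stated in this card
ASSUMED_ELEM_BYTES = 4   # assumption: scores held in FP32

one_layer = attention_score_bytes(16, ASSUMED_HEADS, 2048, ASSUMED_ELEM_BYTES)
print(f"16 x 2048 tokens: {one_layer / 2**30:.1f} GiB of scores per layer")

# Quadrupling the batch to 64 quadruples the term; doubling seq_len would
# also quadruple it, because the score matrix is seq_len x seq_len.
batch_64 = attention_score_bytes(64, ASSUMED_HEADS, 2048, ASSUMED_ELEM_BYTES)
print(f"64 x 2048 tokens: {batch_64 / 2**30:.1f} GiB of scores per layer")
```

A fused flash-attention kernel avoids materializing this matrix, which is why the quadratic term stops dominating peak memory.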

Quant choice & roadmap

We ship f16 and Q8_0; both are retrieval-lossless above. No importance matrix (imatrix) is attached: llama.cpp consumes an imatrix only at ≤ Q6_K and ignores it at Q8_0 and above, so a "dynamic" Q8 and a uniform Q8 are byte-identical. There is no per-tensor lever to pull at eight bits.

A lower-bit dynamic Q4/Q3 (imatrix-driven, where the lever does exist) is a sensible next step for this encoder and is not yet measured; we expect to add and validate it here later. No claim is made about sub-Q8 quality for this model in the meantime.

Build provenance

Built from the original FP32 nomic-ai/CodeRankEmbed safetensors with a current-master llama.cpp (convert_hf_to_gguf.py → llama-quantize), June 2026. Every metadata field above was checked against the source; cosine parity was proven against the FP32 sentence-transformers model, and the Q8_0 matches the community awhiteside/CodeRankEmbed-Q8_0-GGUF artifact to 1.000000.

One converter fix was required. Dense (non-MoE) NomicBERT models trip a bug in the converter's BERT path: NomicBertModel.modify_tensors reads MoE expert hparams unconditionally, which a dense model doesn't have.

# conversion/bert.py
-        n_experts = self.find_hparam(["num_local_experts", "num_experts"])
+        n_experts = self.find_hparam(["num_local_experts", "num_experts"]) if self.is_moe else 0

(This likely explains why pre-made CodeRankEmbed GGUFs exist at all: whoever built them hit and worked around the same thing.)

License & citation

MIT, inherited from nomic-ai/CodeRankEmbed. MIT requires attribution (retaining the license/notice, satisfied by the credit here), not citation. The CodeRankEmbed authors (the CoRNStack team) additionally request citation; please cite their work:

@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}

Related