---
license: mit
base_model:
- nomic-ai/CodeRankEmbed
base_model_relation: quantized
library_name: gguf
pipeline_tag: feature-extraction
language:
- code
tags:
- gguf
- llama.cpp
- flash-attention
- code-retrieval
- embeddings
- nomic-bert
---
# CodeRankEmbed-GGUF
GGUF quantizations of [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed), a 137M-parameter `nomic-bert` code-embedding model (768-dim, CLS-pooled, trained to 2048 tokens).
`CodeRankEmbed`'s stock attention is **eager-only**: there is no flash-attention or SDPA path in its `trust_remote_code` modeling file, so peak memory grows as `O(batch × heads × seq²)` and it OOMs at high batch even at 137M params. **This repo is one of two ways we give it flash attention**: the **llama.cpp / GGUF runtime path**. Converting to GGUF discards the eager Python entirely; llama.cpp re-implements the architecture in its own graph and runs *its own* flash-attention kernel. The result, built from the original FP32 safetensors and verified end-to-end: faithful metadata, flash attention that actually engages, **retrieval-lossless** quality vs the full-precision reference, and ~5.8× lower peak VRAM.
## Two paths to flash attention
The same eager-only model, fixed two different ways:
| Path | Repo | How | Runtime & deps | Reach for it when |
|---|---|---|---|---|
| **PyTorch native varlen** | [`handwoven8588/CodeRankEmbed-flash-attn`](https://huggingface.co/handwoven8588/CodeRankEmbed-flash-attn) | `flash_attn` varlen baked into `modeling_hf_nomic_bert.py` (same weights → identical embeddings) | sentence-transformers / PyTorch; needs `flash_attn` + a CUDA half-precision GPU | you serve through PyTorch/ST and want full/half-precision embeddings without changing stack |
| **GGUF + llama.cpp** *(this repo)* | `handwoven8588/CodeRankEmbed-GGUF` | convert to GGUF; inherit llama.cpp's own ggml flash-attention kernel | llama.cpp (`llama-server`); **no** torch / `flash_attn` / triton; quantized, small | you serve through llama.cpp, want to drop `trust_remote_code`, or want a small (CPU-capable) artifact |
## Files
| File | Quant | Size | Cosine vs FP32 sentence-transformers |
|---|---|---|---|
| `CodeRankEmbed-f16.gguf` | F16 | 274 MB | 0.99999 (very high quality) |
| `CodeRankEmbed-Q8_0.gguf` | Q8_0 | 146 MB | 0.998 (high quality) |
Both are **retrieval-lossless** on our fairly simple delexicalized code-search benchmark (below).
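A cosine figure like the one in the table can be approximated with a small parity check. The sketch below is one plausible way to compute it, assuming you already hold the FP32 reference embeddings and the GGUF embeddings for the same texts as plain lists of floats. The function names are illustrative, and the exact procedure behind the table is not specified in this card.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_parity(reference_rows, candidate_rows):
    """Average cosine between row i of the reference and row i of the candidate.

    Assumption: both lists embed the same texts in the same order.
    """
    if len(reference_rows) != len(candidate_rows):
        raise ValueError("reference and candidate must have the same number of rows")
    scores = [cosine(r, c) for r, c in zip(reference_rows, candidate_rows)]
    return sum(scores) / len(scores)
```

A value near 1.0 says the vectors point the same way; it does not by itself say that rankings are preserved.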
## Quick start: serve as an embedder
CodeRankEmbed is **CLS-pooled** and uses a **query-only instruction prefix**. Both are serve-time settings the GGUF does not carry; you must set them:
```bash
llama-server -m CodeRankEmbed-Q8_0.gguf --embeddings --pooling cls -c 2048 --embd-normalize 2 -fa on -ngl 99
```
- **Pooling: `--pooling cls`** (NOT mean). nomic-embed-*text* is mean-pooled; copy-pasting its serve command silently lowers recall. This is the single most likely mistake.
- **Query prefix**: prepend `Represent this query for searching relevant code: ` to **queries only**, never to documents/code. Do it client-side; the GGUF doesn't carry the sentence-transformers prompt.
- **Normalization: `--embd-normalize 2`** (L2) so downstream cosine is correct.
- **Context / RoPE**: cap at the trained length (`-c 2048`); the unusual `rope.freq_base = 1000` is baked in and must not be overridden.
- **Flash attention: `-fa on`** (or `-fa auto` on GPU) is what cuts peak VRAM ~5.8× (see below). On a CPU-only backend the fused kernel falls back and the win evaporates, but it still runs.
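The client-side half of these settings, the query-only prefix, can be sketched as follows. This is a minimal example rather than a reference client: it assumes `llama-server` is reachable at `http://localhost:8080` (an assumed address) and uses its OpenAI-compatible `/v1/embeddings` endpoint. The helper names `prepare_inputs` and `embed` are illustrative.

```python
import json
import urllib.request

# Prefix from this card; applied to queries only, never to documents/code.
QUERY_PREFIX = "Represent this query for searching relevant code: "


def prepare_inputs(texts, is_query):
    """Return texts ready to send: queries get the prefix, documents pass through."""
    return [QUERY_PREFIX + t if is_query else t for t in texts]


def embed(texts, is_query, base_url="http://localhost:8080"):
    """POST texts to the server and return one embedding per input, in input order.

    Assumption: base_url points at a llama-server started with --embeddings.
    """
    payload = json.dumps({"input": prepare_inputs(texts, is_query)}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Sort by the returned index so the output order matches the input order.
    rows = sorted(body["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]
```

Typical use would be `embed(["reverse a linked list"], is_query=True)` for queries and `embed([source_code], is_query=False)` for the corpus, so the two sides are never prefixed the same way by accident.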
## Quality: retrieval-lossless, not just high cosine
Cosine ≈ 0.998 is necessary but not sufficient: it does not prove *ranking* is preserved. We validated on a **delexicalized** code-retrieval benchmark (N≈1170 graded queries with the lexical overlap stripped against the deployed tokenizer, so only semantic matching can recover the gold), measuring nDCG@10 across a difficulty curve L0 (original query) → Lall (every content token paraphrased away):
| model | L0 | L1 | L3 | Lall | drop (L0→Lall) |
|---|---|---|---|---|---|
| BM25 (deployed tokenizer) | 0.608 | 0.512 | 0.462 | 0.456 | 0.152 |
| FP32 (sentence-transformers, reference) | 0.954 | 0.944 | 0.920 | 0.916 | 0.038 |
| **GGUF f16** | 0.954 | 0.944 | 0.921 | 0.916 | 0.037 |
| **GGUF Q8_0** | 0.954 | 0.943 | 0.921 | 0.916 | 0.038 |
Every difference from the FP32 reference is **≤ 0.0006 nDCG@10 at every level**, including at Lall where the lexical shortcut is entirely gone. The degradation **slope** (the quantity that would expose a real loss of code understanding) is flat: FP32 0.0379, f16 0.0374, Q8_0 0.0377 (Δslope −0.0002 for Q8_0 vs FP32). Quantizing to eight bits costs **no** retrieval quality on the queries hard enough to measure it.
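For readers who want to reproduce this kind of comparison on their own data, the metric and the drop column can be sketched as below. This is a generic nDCG@10 with linear gain and a log2 discount; the gain function and the benchmark harness actually used for the table are not specified here, so treat both as assumptions. The function names are illustrative.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevances (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance grades in the order the system returned documents.
    all_gains: relevance grades of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def drop(ndcg_l0, ndcg_lall):
    """Degradation from the original query (L0) to the fully paraphrased one (Lall)."""
    return ndcg_l0 - ndcg_lall


# Drop recomputed from the rounded table values for the FP32 reference row.
fp32_drop = round(drop(0.954, 0.916), 3)
```

Comparing `drop` across models, rather than a single nDCG value, is what separates a model that leans on lexical overlap (BM25 in the table) from one that keeps ranking when the overlap is removed.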
## Efficiency: the GGUF flash-attention win
The full-precision model's eager attention has peak memory `O(batch × heads × seq²)`. The GGUF path discards that Python entirely and runs llama.cpp's own flash-attention kernel, which engages even for this bidirectional encoder. Measured on an RTX 3090 Ti (24 GB), 16 sequences × 2048-token context:
| Setting | Peak VRAM |
|---|---|
| `-fa off` (eager) | 11.9 GB |
| `-fa on` | 2.0 GB (**~5.8× less**) |
The same corpus that OOMs the FP32 reference at batch 64 embeds end-to-end in seconds on the GGUF.
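The `batch × heads × seq²` term is easy to put numbers on. The arithmetic below is a back-of-the-envelope sketch, not a model of the measurements above: it counts only one layer's attention-score tensor, and it assumes 12 attention heads and 4-byte FP32 scores (both assumptions; check the model config for the real values).

```python
def eager_scores_bytes(batch, heads, seq_len, bytes_per_element):
    """Bytes needed to materialize one layer's full attention-score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element


# 16 sequences x 2048 tokens, as in the measurement above.
# heads=12 and bytes_per_element=4 (FP32) are assumptions for illustration.
score_bytes = eager_scores_bytes(batch=16, heads=12, seq_len=2048, bytes_per_element=4)
print(f"{score_bytes / 2**30:.1f} GiB")  # prints "3.0 GiB"

# The quadratic term: halving the sequence length quarters the tensor.
half_context = eager_scores_bytes(batch=16, heads=12, seq_len=1024, bytes_per_element=4)
```

Under those assumptions a single score tensor is already 3 GiB before weights, softmax temporaries, or other activations are counted. A flash-attention kernel computes attention in tiles and never materializes the full `seq × seq` matrix, which is why the quadratic term stops dominating peak memory.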
## Quant choice & roadmap
We ship **f16** and **Q8_0**; both are retrieval-lossless on the benchmark above. No importance matrix (imatrix) is attached: `llama.cpp` consumes an imatrix only at ≤ Q6_K and ignores it at Q8_0 and above, so a "dynamic" Q8 and a uniform Q8 are byte-identical, and there is no per-tensor lever to pull at eight bits.
A lower-bit **dynamic Q4/Q3** (imatrix-driven, where the lever *does* exist) is a sensible next step for this encoder and is **not yet measured**; we expect to add and validate it here later. No claim is made about sub-Q8 quality for this model in the meantime.
## Build provenance
Built from the original FP32 `nomic-ai/CodeRankEmbed` safetensors with a current-master `llama.cpp` (`convert_hf_to_gguf.py` → `llama-quantize`), June 2026. Every metadata field above was checked against the source; cosine parity was proven against the FP32 sentence-transformers model, and the Q8_0 matches the community `awhiteside/CodeRankEmbed-Q8_0-GGUF` artifact to 1.000000.
**One converter fix was required.** Dense (non-MoE) NomicBERT models trip a bug in the converter's BERT path: `NomicBertModel.modify_tensors` reads MoE expert hparams unconditionally, which a dense model doesn't have.
```diff
# conversion/bert.py
- n_experts = self.find_hparam(["num_local_experts", "num_experts"])
+ n_experts = self.find_hparam(["num_local_experts", "num_experts"]) if self.is_moe else 0
```
(This likely explains why pre-made CodeRankEmbed GGUFs exist at all: whoever built them hit and worked around the same thing.)
## License & citation
**MIT**, inherited from [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed). MIT requires attribution (retaining the license/notice, satisfied by the credit here), **not** citation. The CodeRankEmbed authors (the CoRNStack team) additionally *request* citation; please cite their work:
```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
year={2025},
eprint={2412.01007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.01007},
}
```
## Related
- **Base model:** [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)
- **The other flash-attention path** (PyTorch, native `flash_attn` varlen): [`handwoven8588/CodeRankEmbed-flash-attn`](https://huggingface.co/handwoven8588/CodeRankEmbed-flash-attn)