CodeRankEmbed-GGUF / README.md

handwoven8588

Add CodeRankEmbed f16 + Q8_0 GGUFs and model card

14be410 verified 10 days ago


	---
	license: mit
	base_model:
	- nomic-ai/CodeRankEmbed
	base_model_relation: quantized
	library_name: gguf
	pipeline_tag: feature-extraction
	language:
	- code
	tags:
	- gguf
	- llama.cpp
	- flash-attention
	- code-retrieval
	- embeddings
	- nomic-bert
	---

	# CodeRankEmbed-GGUF

	GGUF quantizations of [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed) — a 137M-parameter `nomic-bert` code-embedding model (768-dim, CLS-pooled, trained to 2048 tokens).

	`CodeRankEmbed`'s stock attention is eager-only: its `trust_remote_code` modeling file has no flash-attention or SDPA path, so peak memory grows as `O(batch × heads × seq²)` and the model can run out of memory at high batch sizes despite having only 137M parameters. This repo is one of two ways we give it flash attention: the llama.cpp / GGUF runtime path. Converting to GGUF discards the eager Python implementation entirely; llama.cpp re-implements the architecture in its own graph and runs its own flash-attention kernel. The result, built from the original FP32 safetensors and verified end to end: faithful metadata, flash attention that engages, retrieval-lossless output versus the full-precision reference, and ~5.8× lower peak VRAM.

	## Two paths to flash attention

	The same eager-only model, fixed two different ways:

	| Path | Repo | How | Runtime & deps | Reach for it when |
	|---|---|---|---|---|
	| PyTorch native varlen | [`handwoven8588/CodeRankEmbed-flash-attn`](https://huggingface.co/handwoven8588/CodeRankEmbed-flash-attn) | `flash_attn` varlen baked into `modeling_hf_nomic_bert.py` (same weights → identical embeddings) | sentence-transformers / PyTorch; needs `flash_attn` + a CUDA half-precision GPU | you serve through PyTorch/ST and want full/half-precision embeddings without changing stack |
	| GGUF + llama.cpp (this repo) | `handwoven8588/CodeRankEmbed-GGUF` | convert to GGUF; inherit llama.cpp's own ggml flash-attention kernel | llama.cpp (`llama-server`); no torch / `flash_attn` / triton; quantized, small | you serve through llama.cpp, want to drop `trust_remote_code`, or want a small (CPU-capable) artifact |

	## Files

	| File | Quant | Size | Cosine vs FP32 sentence-transformers |
	|---|---|---|---|
	| `CodeRankEmbed-f16.gguf` | F16 | 274 MB | 0.99999 (very high quality) |
	| `CodeRankEmbed-Q8_0.gguf` | Q8_0 | 146 MB | 0.998 (high quality) |

	Both are retrieval-lossless on our relatively simple delexicalized code-search benchmark (below).
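
The cosine column compares each GGUF embedding with the FP32 embedding of the same text. A minimal sketch of such a parity check follows, assuming the reported figure is a mean over texts (the card does not say how it was aggregated). The toy three-dimensional vectors are hypothetical stand-ins for real 768-dim embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_cosine(reference, candidate):
    """Average per-text cosine between two embedding lists in the same order."""
    assert len(reference) == len(candidate)
    total = sum(cosine(r, c) for r, c in zip(reference, candidate))
    return total / len(reference)

# Hypothetical values: row i of each list embeds the same text.
fp32_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gguf_embeddings = [[0.999, 0.01, 0.0], [0.0, 1.0, 0.02]]

parity = mean_cosine(fp32_embeddings, gguf_embeddings)
print(f"mean cosine vs reference: {parity:.5f}")
```

In practice the two lists would come from sentence-transformers and from the GGUF server respectively, computed over the same texts.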

	## Quick start — serve as an embedder

	CodeRankEmbed is CLS-pooled and uses a query-only instruction prefix. Both are serve-time settings the GGUF does not carry — you must set them:

	```bash
	llama-server -m CodeRankEmbed-Q8_0.gguf --embeddings --pooling cls -c 2048 --embd-normalize 2 -fa on -ngl 99
	```

	- Pooling — `--pooling cls` (NOT mean). nomic-embed-text is mean-pooled; copy-pasting its serve command silently lowers recall. This is the single most likely mistake.
	- Query prefix — prepend `Represent this query for searching relevant code: ` to queries only, never to documents/code. Do it client-side; the GGUF doesn't carry the sentence-transformers prompt.
	- Normalization — `--embd-normalize 2` (L2) so downstream cosine is correct.
	- Context / RoPE — cap at the trained length (`-c 2048`); the unusual `rope.freq_base = 1000` is baked in and must not be overridden.
	- Flash attention — `-fa on` (or `-fa auto` on GPU) is what cuts peak VRAM ~5.8× (see below). On a CPU-only backend the fused kernel falls back and the win evaporates — but it still runs.
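
The client-side half of these settings can be sketched as below. This is an illustration rather than the author's client: it assumes the server started above listens on `http://localhost:8080` and exposes llama.cpp's OpenAI-compatible `/v1/embeddings` route. The helper names `prepare_inputs` and `embed` are hypothetical.

```python
import json
import urllib.request

QUERY_PREFIX = "Represent this query for searching relevant code: "

def prepare_inputs(texts, is_query):
    """Prefix queries only; documents and code pass through unchanged."""
    return [QUERY_PREFIX + text if is_query else text for text in texts]

def embed(texts, is_query, base_url="http://localhost:8080"):
    """POST texts to the embeddings endpoint (assumed URL) and return vectors."""
    payload = json.dumps({"input": prepare_inputs(texts, is_query)}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Sort by index so the output order matches the input order.
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With a server running, `embed(["binary search over a sorted list"], is_query=True)` would embed a query and `embed([source_code], is_query=False)` a document. Keeping the prefix logic in one helper makes it harder to prefix documents by accident.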

	## Quality — retrieval-lossless, not just high cosine

	Cosine ≈ 0.998 is necessary but not sufficient: it does not prove ranking is preserved. We validated on a delexicalized code-retrieval benchmark — N≈1170 graded queries with the lexical overlap stripped against the deployed tokenizer, so only semantic matching can recover the gold — measuring nDCG@10 across a difficulty curve L0 (original query) → Lall (every content token paraphrased away):

	| model | L0 | L1 | L3 | Lall | drop (L0→Lall) |
	|---|---|---|---|---|---|
	| BM25 (deployed tokenizer) | 0.608 | 0.512 | 0.462 | 0.456 | 0.152 |
	| FP32 (sentence-transformers, reference) | 0.954 | 0.944 | 0.920 | 0.916 | 0.038 |
	| GGUF f16 | 0.954 | 0.944 | 0.921 | 0.916 | 0.037 |
	| GGUF Q8_0 | 0.954 | 0.943 | 0.921 | 0.916 | 0.038 |

	Every difference from the FP32 reference is ≤ 0.0006 nDCG@10 at every level, including at Lall where the lexical shortcut is entirely gone. The degradation slope (the quantity that would expose a real loss of code understanding) is flat: FP32 0.0379, f16 0.0374, Q8_0 0.0377, so Δslope versus FP32 is −0.0005 for f16 and −0.0002 for Q8_0. Quantizing to eight bits costs no retrieval quality on the queries hard enough to measure it.
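
For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 and of the drop column. It assumes linear gain and that the relevance list covers every judged document for the query. The card does not state which nDCG variant the benchmark uses, so treat this as one common formulation rather than the exact scorer.

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def drop(l0, lall):
    """Degradation from the original query (L0) to the fully paraphrased one (Lall)."""
    return round(l0 - lall, 3)

# Hypothetical graded relevances, listed in the order the system ranked them.
example_score = ndcg([3, 2, 0, 1])
print(f"nDCG@10 for the example ranking: {example_score:.4f}")
print(f"BM25 drop from the table values: {drop(0.608, 0.456)}")
```

A model that leans on lexical overlap shows a large drop, as BM25 does. A flat drop across FP32, f16 and Q8_0 is what supports the lossless claim.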

	## Efficiency — the GGUF flash-attention win

	The full-precision model's eager attention has peak memory `O(batch × heads × seq²)`. The GGUF path discards that Python entirely and runs llama.cpp's own flash-attention kernel, which engages even for this bidirectional encoder. Measured on an RTX 3090 Ti (24 GB), 16 sequences × 2048-token context:

	| Setting | peak VRAM |
	|---|---|
	| `-fa off` (eager) | 11.9 GB |
	| `-fa on` | 2.0 GB (~5.8× less) |

	The same corpus that OOMs the FP32 reference at batch 64 embeds end-to-end in seconds on the GGUF.
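
The quadratic term can be made concrete with a back-of-the-envelope estimate. The sketch below sizes only the materialized attention-score tensor of a single layer. The head count (12) and the 2-byte element size are assumptions, not figures from this card, and the result is not a breakdown of the measured 11.9 GB.

```python
def eager_scores_bytes(batch, heads, seq_len, bytes_per_element=2):
    """Bytes for one layer's batch x heads x seq x seq attention-score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element

# 16 sequences x 2048 tokens as in the measurement; 12 heads and fp16 are assumed.
size = eager_scores_bytes(batch=16, heads=12, seq_len=2048)
print(f"{size / 2**30:.2f} GiB for one layer's score tensor")

# Doubling the sequence length quadruples the score tensor.
ratio = eager_scores_bytes(16, 12, 4096) / size
print(f"growth when seq_len doubles: {ratio:.0f}x")
```

A fused flash-attention kernel avoids materializing this tensor in full, which is why the saving grows with sequence length and batch size.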

	## Quant choice & roadmap

	We ship f16 and Q8_0; both are retrieval-lossless above. No importance matrix (imatrix) is attached: `llama.cpp` consumes an imatrix only at ≤ Q6_K and ignores it at Q8_0 and above, so a "dynamic" Q8 and a uniform Q8 are byte-identical — there is no per-tensor lever to pull at eight bits.
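
To illustrate why eight bits leaves so little room for error, here is a simplified sketch of Q8_0-style block quantization: 32 values share one scale and each value is stored as a signed 8-bit integer. It is not llama.cpp's code. The scale stays a Python float instead of fp16, rounding uses Python's `round`, and the size estimate assumes all 137M parameters are stored in the quantized format.

```python
import math

BLOCK_SIZE = 32  # values per Q8_0 block

def quantize_block(values):
    """Return (scale, int8 quants) for one block, with scale = max|x| / 127."""
    assert len(values) == BLOCK_SIZE
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    return scale, [round(v / scale) for v in values]

def dequantize_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]

# Hypothetical weights; real tensors come from the model file.
weights = [0.05 * math.sin(i) for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage: 32 one-byte quants plus a 2-byte fp16 scale = 34 bytes per 32 weights.
q8_0_megabytes = 137e6 * 34 / 32 / 1e6
f16_megabytes = 137e6 * 2 / 1e6
print(f"max round-trip error: {max_error:.2e} (scale {scale:.2e})")
print(f"estimated sizes: Q8_0 {q8_0_megabytes:.0f} MB, f16 {f16_megabytes:.0f} MB")
```

The per-value error is bounded by half the block scale, and the rough size estimates land close to the 146 MB and 274 MB listed under Files.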

	A lower-bit dynamic Q4/Q3 (imatrix-driven, where the lever does exist) is a sensible next step for this encoder and is not yet measured — we expect to add and validate it here later. No claim is made about sub-Q8 quality for this model in the meantime.

	## Build provenance

	Built from the original FP32 `nomic-ai/CodeRankEmbed` safetensors with a current-master `llama.cpp` (`convert_hf_to_gguf.py` → `llama-quantize`), June 2026. Every metadata field above was checked against the source; cosine parity was proven against the FP32 sentence-transformers model, and the Q8_0 matches the community `awhiteside/CodeRankEmbed-Q8_0-GGUF` artifact to 1.000000.

	One converter fix was required. Dense (non-MoE) NomicBERT models trip a bug in the converter's BERT path: `NomicBertModel.modify_tensors` reads MoE expert hparams unconditionally, which a dense model doesn't have.

	```diff
	# conversion/bert.py
	- n_experts = self.find_hparam(["num_local_experts", "num_experts"])
	+ n_experts = self.find_hparam(["num_local_experts", "num_experts"]) if self.is_moe else 0
	```

	(This likely explains why pre-made CodeRankEmbed GGUFs exist at all — whoever built them hit and worked around the same thing.)

	## License & citation

	MIT, inherited from [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed). MIT requires attribution — retaining the license/notice, satisfied by the credit here — not citation. The CodeRankEmbed authors (the CoRNStack team) additionally request citation; please cite their work:

	```bibtex
	@misc{suresh2025cornstackhighqualitycontrastivedata,
	  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
	  author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
	  year={2025},
	  eprint={2412.01007},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL},
	  url={https://arxiv.org/abs/2412.01007},
	}
	```

	## Related

	- Base model: [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)
	- The other flash-attention path (PyTorch, native `flash_attn` varlen): [`handwoven8588/CodeRankEmbed-flash-attn`](https://huggingface.co/handwoven8588/CodeRankEmbed-flash-attn)