CodeRankEmbed-GGUF
GGUF quantizations of nomic-ai/CodeRankEmbed — a 137M-parameter nomic-bert code-embedding model (768-dim, CLS-pooled, trained to 2048 tokens).
CodeRankEmbed's stock attention is eager-only — there is no flash-attention or SDPA path in its trust_remote_code modeling file, so peak memory grows as O(batch × heads × seq²) and it OOMs at high batch even at 137M params. This repo is one of two ways we give it flash attention — the llama.cpp / GGUF runtime path. Converting to GGUF discards the eager Python entirely; llama.cpp re-implements the architecture in its own graph and runs its own flash-attention kernel. The result, built from the original FP32 safetensors and verified end-to-end: faithful metadata, flash attention engages, retrieval-lossless vs the full-precision reference, and ~5.8× lower peak VRAM.
Two paths to flash attention
The same eager-only model, fixed two different ways:
| Path | Repo | How | Runtime & deps | Reach for it when |
|---|---|---|---|---|
| PyTorch native varlen | handwoven8588/CodeRankEmbed-flash-attn | flash_attn varlen baked into modeling_hf_nomic_bert.py (same weights → identical embeddings) | sentence-transformers / PyTorch; needs flash_attn + a CUDA half-precision GPU | you serve through PyTorch/ST and want full/half-precision embeddings without changing stack |
| GGUF + llama.cpp (this repo) | handwoven8588/CodeRankEmbed-GGUF | convert to GGUF; inherit llama.cpp's own ggml flash-attention kernel | llama.cpp (llama-server); no torch / flash_attn / triton; quantized, small | you serve through llama.cpp, want to drop trust_remote_code, or want a small (CPU-capable) artifact |
Files
| File | Quant | Size | Cosine vs FP32 sentence-transformers |
|---|---|---|---|
| CodeRankEmbed-f16.gguf | F16 | 274 MB | 0.99999 (very high quality) |
| CodeRankEmbed-Q8_0.gguf | Q8_0 | 146 MB | 0.998 (high quality) |
Both are retrieval-lossless on our fairly simple delexicalized code-search benchmark (below).
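The cosine column compares each GGUF's embeddings with the FP32 sentence-transformers embeddings of the same texts. The script used for that check is not included here; the following is only a minimal sketch of such a parity check, assuming both sets of vectors are already available as row-aligned lists. The names `cosine` and `parity` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def parity(reference, candidate):
    """Return (mean, min) cosine over row-aligned embedding pairs.

    reference: embeddings from the full-precision model.
    candidate: embeddings of the same texts from the quantized model.
    """
    sims = [cosine(r, c) for r, c in zip(reference, candidate)]
    return sum(sims) / len(sims), min(sims)


# Toy vectors for illustration only; real inputs would be 768-dim embeddings.
mean_sim, min_sim = parity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(mean_sim, min_sim)
```

Reporting the minimum alongside the mean is a design choice: a single badly drifted text can hide behind a high average.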
Quick start — serve as an embedder
CodeRankEmbed is CLS-pooled and uses a query-only instruction prefix. Both are serve-time settings the GGUF does not carry — you must set them:
llama-server -m CodeRankEmbed-Q8_0.gguf --embeddings --pooling cls -c 2048 --embd-normalize 2 -fa on -ngl 99
- Pooling: `--pooling cls` (NOT mean). nomic-embed-text is mean-pooled; copy-pasting its serve command silently lowers recall. This is the single most likely mistake.
- Query prefix: prepend `Represent this query for searching relevant code:` to queries only, never to documents/code. Do it client-side; the GGUF doesn't carry the sentence-transformers prompt.
- Normalization: `--embd-normalize 2` (L2) so downstream cosine is correct.
- Context / RoPE: cap at the trained length (`-c 2048`); the unusual `rope.freq_base = 1000` is baked in and must not be overridden.
- Flash attention: `-fa on` (or `-fa auto` on GPU) is what cuts peak VRAM ~5.8× (see below). On a CPU-only backend the fused kernel falls back and the win evaporates, but it still runs.
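On the client side, the query-only prefix could be handled roughly as follows. This is a sketch under stated assumptions, not a reference client: it assumes llama-server is running with the command above on its default `localhost:8080`, that it is reached through the OpenAI-compatible `/v1/embeddings` endpoint, and that the prefix is separated from the query by a single space (check the base model card). The helper names are illustrative.

```python
import json
import urllib.request

# Query-only instruction prefix. The trailing space is an assumption.
QUERY_PREFIX = "Represent this query for searching relevant code: "
# Assumed endpoint: llama-server's default host and port.
SERVER_URL = "http://localhost:8080/v1/embeddings"


def prepare_queries(queries):
    """Prefix queries only; documents/code are sent unchanged."""
    return [QUERY_PREFIX + q for q in queries]


def embed(texts, url=SERVER_URL):
    """POST texts to the embeddings endpoint and return vectors in input order."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    rows = sorted(body["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending dot product.

    With L2-normalized embeddings the dot product equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def demo():
    """Requires a running server; not called automatically."""
    docs = ["def add(a, b):\n    return a + b", "def read_text(path):\n    return open(path).read()"]
    query_vec = embed(prepare_queries(["sum two numbers"]))[0]
    print(rank(query_vec, embed(docs)))
```

Call `demo()` once the server is up. Documents go through `embed` directly, so only queries ever receive the prefix.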
Quality — retrieval-lossless, not just high cosine
Cosine ≈ 0.998 is necessary but not sufficient: it does not prove ranking is preserved. We validated on a delexicalized code-retrieval benchmark — N≈1170 graded queries with the lexical overlap stripped against the deployed tokenizer, so only semantic matching can recover the gold — measuring nDCG@10 across a difficulty curve L0 (original query) → Lall (every content token paraphrased away):
| model | L0 | L1 | L3 | Lall | drop (L0→Lall) |
|---|---|---|---|---|---|
| BM25 (deployed tokenizer) | 0.608 | 0.512 | 0.462 | 0.456 | 0.152 |
| FP32 (sentence-transformers, reference) | 0.954 | 0.944 | 0.920 | 0.916 | 0.038 |
| GGUF f16 | 0.954 | 0.944 | 0.921 | 0.916 | 0.037 |
| GGUF Q8_0 | 0.954 | 0.943 | 0.921 | 0.916 | 0.038 |
Every difference from the FP32 reference is ≤ 0.0006 nDCG@10 at every level, including at Lall where the lexical shortcut is entirely gone. The degradation slope, the quantity that would expose a real loss of code understanding, is flat: FP32 0.0379, f16 0.0374, Q8_0 0.0377 (Δslope −0.0002 for Q8_0). Quantizing to eight bits costs no retrieval quality on the queries hard enough to measure it.
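For readers unfamiliar with the metric, nDCG@10 and the L0→Lall drop can be computed along these lines. The benchmark harness itself is not shown in this README, so this is only a sketch of the standard formula, assuming linear gain and a log2 rank discount; the harness may differ in such details.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal ordering; 0.0 if nothing is relevant.

    relevances: graded labels of all judged documents for one query,
    listed in the order the model ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def drop(scores_by_level):
    """Degradation from the easiest level (L0) to the hardest (Lall)."""
    return scores_by_level["L0"] - scores_by_level["Lall"]


# nDCG@10 values copied from the table above.
fp32 = {"L0": 0.954, "L1": 0.944, "L3": 0.920, "Lall": 0.916}
q8_0 = {"L0": 0.954, "L1": 0.943, "L3": 0.921, "Lall": 0.916}
print(round(drop(fp32), 3), round(drop(q8_0), 3))  # prints: 0.038 0.038
```

A per-query score of 1.0 means the ranking is already ideal; the table reports the mean over all queries at each difficulty level.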
Efficiency — the GGUF flash-attention win
The full-precision model's eager attention has peak memory O(batch × heads × seq²). The GGUF path discards that Python entirely and runs llama.cpp's own flash-attention kernel, which engages even for this bidirectional encoder. Measured on an RTX 3090 Ti (24 GB), 16 sequences × 2048-token context:
| setting | peak VRAM |
|---|---|
| -fa off (eager) | 11.9 GB |
| -fa on | 2.0 GB (~5.8× less) |
The same corpus that OOMs the FP32 reference at batch 64 embeds end-to-end in seconds on the GGUF.
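A back-of-the-envelope calculation shows why the O(batch × heads × seq²) term dominates. The sketch below sizes a single materialized attention-score tensor, assuming 12 attention heads and 4-byte (fp32) scores; neither figure is stated in this README. It is an order-of-magnitude illustration, not a reproduction of the measured numbers above.

```python
GIB = 2**30
HEADS = 12  # assumption: typical for a BERT-base-sized encoder
BYTES_PER_SCORE = 4  # assumption: fp32 attention scores


def attention_score_bytes(batch, heads, seq_len, bytes_per_element):
    """Size of one full attention-score tensor: batch x heads x seq x seq."""
    return batch * heads * seq_len * seq_len * bytes_per_element


for batch in (16, 64):
    size = attention_score_bytes(batch, HEADS, 2048, BYTES_PER_SCORE)
    print(f"batch {batch}: {size / GIB:.1f} GiB per materialized score tensor")
# prints:
# batch 16: 3.0 GiB per materialized score tensor
# batch 64: 12.0 GiB per materialized score tensor
```

A fused flash-attention kernel never materializes this tensor in full, so this quadratic term drops out of peak memory.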
Quant choice & roadmap
We ship f16 and Q8_0; both are retrieval-lossless above. No importance matrix (imatrix) is attached: llama.cpp consumes an imatrix only at ≤ Q6_K and ignores it at Q8_0 and above, so a "dynamic" Q8 and a uniform Q8 are byte-identical — there is no per-tensor lever to pull at eight bits.
A lower-bit dynamic Q4/Q3 (imatrix-driven, where the lever does exist) is a sensible next step for this encoder and is not yet measured — we expect to add and validate it here later. No claim is made about sub-Q8 quality for this model in the meantime.
Build provenance
Built from the original FP32 nomic-ai/CodeRankEmbed safetensors with a current-master llama.cpp (convert_hf_to_gguf.py → llama-quantize), June 2026. Every metadata field above was checked against the source; cosine parity was proven against the FP32 sentence-transformers model, and the Q8_0 matches the community awhiteside/CodeRankEmbed-Q8_0-GGUF artifact to 1.000000.
One converter fix was required. Dense (non-MoE) NomicBERT models trip a bug in the converter's BERT path: NomicBertModel.modify_tensors reads MoE expert hparams unconditionally, which a dense model doesn't have.
# conversion/bert.py
- n_experts = self.find_hparam(["num_local_experts", "num_experts"])
+ n_experts = self.find_hparam(["num_local_experts", "num_experts"]) if self.is_moe else 0
(This likely explains why pre-made CodeRankEmbed GGUFs exist at all — whoever built them hit and worked around the same thing.)
License & citation
MIT, inherited from nomic-ai/CodeRankEmbed. MIT requires attribution — retaining the license/notice, satisfied by the credit here — not citation. The CodeRankEmbed authors (the CoRNStack team) additionally request citation; please cite their work:
@misc{suresh2025cornstackhighqualitycontrastivedata,
title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
year={2025},
eprint={2412.01007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.01007},
}
Related
- Base model: nomic-ai/CodeRankEmbed
- The other flash-attention path (PyTorch, native flash_attn varlen): handwoven8588/CodeRankEmbed-flash-attn