---
license: mit
base_model:
- nomic-ai/CodeRankEmbed
base_model_relation: quantized
library_name: gguf
pipeline_tag: feature-extraction
language:
- code
tags:
- gguf
- llama.cpp
- flash-attention
- code-retrieval
- embeddings
- nomic-bert
---

# CodeRankEmbed-GGUF

GGUF quantizations of [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed), a 137M-parameter `nomic-bert` code-embedding model (768-dim, CLS-pooled, trained to 2048 tokens).

`CodeRankEmbed`'s stock attention is **eager-only**: there is no flash-attention or SDPA path in its `trust_remote_code` modeling file, so peak memory grows as `O(batch × heads × seq²)` and it OOMs at high batch even at 137M params. **This repo is one of two ways we give it flash attention**: the **llama.cpp / GGUF runtime path**. Converting to GGUF discards the eager Python entirely; llama.cpp re-implements the architecture in its own graph and runs *its own* flash-attention kernel. The result, built from the original FP32 safetensors and verified end-to-end: faithful metadata, flash attention engages, **retrieval-lossless** vs the full-precision reference, and ~5.8× lower peak VRAM.

## Two paths to flash attention

The same eager-only model, fixed two different ways:

| Path | Repo | How | Runtime & deps | Reach for it when |
|---|---|---|---|---|
| **PyTorch native varlen** | [`handwoven8588/CodeRankEmbed-flash-attn`](https://huggingface.co/handwoven8588/CodeRankEmbed-flash-attn) | `flash_attn` varlen baked into `modeling_hf_nomic_bert.py` (same weights → identical embeddings) | sentence-transformers / PyTorch; needs `flash_attn` + a CUDA half-precision GPU | you serve through PyTorch/ST and want full/half-precision embeddings without changing stack |
| **GGUF + llama.cpp** *(this repo)* | `handwoven8588/CodeRankEmbed-GGUF` | convert to GGUF; inherit llama.cpp's own ggml flash-attention kernel | llama.cpp (`llama-server`); **no** torch / `flash_attn` / triton; quantized, small | you serve through llama.cpp, want to drop `trust_remote_code`, or want a small (CPU-capable) artifact |

## Files

| File | Quant | Size | Cosine vs FP32 sentence-transformers |
|---|---|---|---|
| `CodeRankEmbed-f16.gguf` | F16 | 274 MB | 0.99999 (very high quality) |
| `CodeRankEmbed-Q8_0.gguf` | Q8_0 | 146 MB | 0.998 (high quality) |

Both are **retrieval-lossless** on our fairly simple delexicalized code-search benchmark (below).
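
The "Cosine vs FP32" column compares each GGUF embedding with the reference embedding of the same input. The card does not say how the per-input values were aggregated, so the sketch below assumes a plain mean over paired vectors; `cosine` and `mean_paired_cosine` are hypothetical helper names, not part of this repo.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_paired_cosine(reference, candidate):
    """Average cosine between reference[i] and candidate[i] (assumed aggregation)."""
    if len(reference) != len(candidate):
        raise ValueError("both lists must hold one embedding per input")
    return sum(cosine(r, c) for r, c in zip(reference, candidate)) / len(reference)
```

Feed it the FP32 sentence-transformers vectors as `reference` and the GGUF vectors for the same texts as `candidate`.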
## Quick start: serve as an embedder

CodeRankEmbed is **CLS-pooled** and uses a **query-only instruction prefix**. Both are serve-time settings the GGUF does not carry; you must set them:

```bash
llama-server -m CodeRankEmbed-Q8_0.gguf --embeddings --pooling cls -c 2048 --embd-normalize 2 -fa on -ngl 99
```

- **Pooling: `--pooling cls`** (NOT mean). nomic-embed-*text* is mean-pooled; copy-pasting its serve command silently lowers recall. This is the single most likely mistake.
- **Query prefix:** prepend `Represent this query for searching relevant code: ` to **queries only**, never to documents/code. Do it client-side; the GGUF doesn't carry the sentence-transformers prompt.
- **Normalization: `--embd-normalize 2`** (L2) so downstream cosine is correct.
- **Context / RoPE:** cap at the trained length (`-c 2048`); the unusual `rope.freq_base = 1000` is baked in and must not be overridden.
- **Flash attention: `-fa on`** (or `-fa auto` on GPU) is what cuts peak VRAM ~5.8× (see below). On a CPU-only backend the fused kernel falls back and the win evaporates, but it still runs.
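
The client-side half of these settings (prefix on queries only, cosine on L2-normalized vectors) can be sketched as follows. This is a minimal sketch, not this repo's own client: it assumes the server from the command above is reachable at `http://127.0.0.1:8080` and exposes llama-server's OpenAI-compatible `/v1/embeddings` route, and the helper names (`with_query_prefix`, `embed`, `l2_normalize`, `rank_documents`) are hypothetical. Vectors are re-normalized on the client so the ranking does not depend on the server-side normalization setting.

```python
import json
import math
import urllib.request

QUERY_PREFIX = "Represent this query for searching relevant code: "


def with_query_prefix(query):
    """Prefix queries only; documents/code are embedded unchanged."""
    return QUERY_PREFIX + query


def embed(texts, base_url="http://127.0.0.1:8080"):
    """POST texts to the embeddings route (base_url is an assumption)."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    rows = sorted(body["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity, best first."""
    q = l2_normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

A typical call would embed `[with_query_prefix(q)]` and the raw code snippets separately, then pass the results to `rank_documents`.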
## Quality: retrieval-lossless, not just high cosine

Cosine ≈ 0.998 is necessary but not sufficient: it does not prove *ranking* is preserved. We validated on a **delexicalized** code-retrieval benchmark (N≈1170 graded queries with the lexical overlap stripped against the deployed tokenizer, so only semantic matching can recover the gold), measuring nDCG@10 across a difficulty curve from L0 (original query) to Lall (every content token paraphrased away):

| model | L0 | L1 | L3 | Lall | drop (L0→Lall) |
|---|---|---|---|---|---|
| BM25 (deployed tokenizer) | 0.608 | 0.512 | 0.462 | 0.456 | 0.152 |
| FP32 (sentence-transformers, reference) | 0.954 | 0.944 | 0.920 | 0.916 | 0.038 |
| **GGUF f16** | 0.954 | 0.944 | 0.921 | 0.916 | 0.037 |
| **GGUF Q8_0** | 0.954 | 0.943 | 0.921 | 0.916 | 0.038 |

Every difference from the FP32 reference is **≤ 0.0006 nDCG@10 at every level**, including at Lall where the lexical shortcut is entirely gone. The degradation **slope**, the quantity that would expose a real loss of code understanding, is flat: FP32 0.0379, f16 0.0374, Q8_0 0.0377 (Δslope ≈ 0.0002 for Q8_0). Quantizing to eight bits costs **no** retrieval quality on the queries hard enough to measure it.
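
For readers unfamiliar with the metric, nDCG@10 on graded judgments can be sketched as below. The card does not state which gain formulation the benchmark uses, so this sketch assumes the linear-gain variant (`rel / log2(rank + 1)`); the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: grades of the returned documents, in returned order.
    judged_relevances: grades of every judged document for the query.
    """
    ideal = sorted(judged_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging `ndcg_at_k` over all queries at one difficulty level gives a single cell of a table like the one above.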
## Efficiency: the GGUF flash-attention win

The full-precision model's eager attention has peak memory `O(batch × heads × seq²)`. The GGUF path discards that Python entirely and runs llama.cpp's own flash-attention kernel, which engages even for this bidirectional encoder. Measured on an RTX 3090 Ti (24 GB), 16 sequences × 2048-token context:

| | peak VRAM |
|---|---|
| `-fa off` (eager) | 11.9 GB |
| `-fa on` | 2.0 GB (**~5.8× less**) |

The same corpus that OOMs the FP32 reference at batch 64 embeds end-to-end in seconds on the GGUF.
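
To make the `O(batch × heads × seq²)` term concrete, the following back-of-the-envelope sketch sizes a single layer's materialized attention-score tensor. The head count (12) and element size (4 bytes, FP32) are assumptions not stated in this card, and the result is not meant to reproduce the measured 11.9 GB, which also includes weights, activations, and runtime buffers.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_element):
    """Size of one layer's batch x heads x seq x seq attention-score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element


GIB = 1024 ** 3

# Assumed values: 12 heads and FP32 scores; batch and context match the measurement above.
size = attention_score_bytes(batch=16, heads=12, seq_len=2048, bytes_per_element=4)
print(f"{size / GIB:.1f} GiB")  # prints "3.0 GiB"
```

Because the term is quadratic in sequence length, halving the context to 1024 tokens would cut this tensor to a quarter of that size. A fused flash-attention kernel avoids materializing the tensor at all.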
## Quant choice & roadmap

We ship **f16** and **Q8_0**; both are retrieval-lossless above. No importance matrix (imatrix) is attached: `llama.cpp` consumes an imatrix only at ≤ Q6_K and ignores it at Q8_0 and above, so a "dynamic" Q8 and a uniform Q8 are byte-identical; there is no per-tensor lever to pull at eight bits.

A lower-bit **dynamic Q4/Q3** (imatrix-driven, where the lever *does* exist) is a sensible next step for this encoder and is **not yet measured**; we expect to add and validate it here later. No claim is made about sub-Q8 quality for this model in the meantime.

## Build provenance

Built from the original FP32 `nomic-ai/CodeRankEmbed` safetensors with a current-master `llama.cpp` (`convert_hf_to_gguf.py` → `llama-quantize`), June 2026. Every metadata field above was checked against the source; cosine parity was proven against the FP32 sentence-transformers model, and the Q8_0 matches the community `awhiteside/CodeRankEmbed-Q8_0-GGUF` artifact to 1.000000.

**One converter fix was required.** Dense (non-MoE) NomicBERT models trip a bug in the converter's BERT path: `NomicBertModel.modify_tensors` reads MoE expert hparams unconditionally, which a dense model doesn't have.

```diff
# conversion/bert.py
- n_experts = self.find_hparam(["num_local_experts", "num_experts"])
+ n_experts = self.find_hparam(["num_local_experts", "num_experts"]) if self.is_moe else 0
```

(This likely explains why pre-made CodeRankEmbed GGUFs exist at all: whoever built them hit and worked around the same thing.)

## License & citation

**MIT**, inherited from [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed). MIT requires attribution (retaining the license/notice, satisfied by the credit here), **not** citation. The CodeRankEmbed authors (the CoRNStack team) additionally *request* citation; please cite their work:

```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
  year={2025},
  eprint={2412.01007},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.01007},
}
```

## Related

- **Base model:** [`nomic-ai/CodeRankEmbed`](https://huggingface.co/nomic-ai/CodeRankEmbed)
- **The other flash-attention path** (PyTorch, native `flash_attn` varlen): [`handwoven8588/CodeRankEmbed-flash-attn`](https://huggingface.co/handwoven8588/CodeRankEmbed-flash-attn)