|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- code |
|
|
library_name: transformers |
|
|
tags: |
|
|
- code |
|
|
- embeddings |
|
|
- retrieval |
|
|
- code-search |
|
|
- semantic-search |
|
|
- feature-extraction |
|
|
- sentence-transformers |
|
|
datasets: |
|
|
- code-rag-bench/cornstack |
|
|
- bigcode/stackoverflow |
|
|
- code_search_net |
|
|
pipeline_tag: feature-extraction |
|
|
base_model: Qwen/Qwen2.5-Coder-0.5B |
|
|
model-index: |
|
|
- name: CodeCompass-Embed |
|
|
results: |
|
|
- task: |
|
|
type: retrieval |
|
|
name: Code Retrieval |
|
|
dataset: |
|
|
type: CoIR-Retrieval/codetrans-dl |
|
|
name: CodeTrans-DL |
|
|
metrics: |
|
|
- type: ndcg@10 |
|
|
value: 0.3305 |
|
|
name: NDCG@10 |
|
|
- task: |
|
|
type: retrieval |
|
|
name: Code Retrieval |
|
|
dataset: |
|
|
type: CoIR-Retrieval/CodeSearchNet-python |
|
|
name: CodeSearchNet Python |
|
|
metrics: |
|
|
- type: ndcg@10 |
|
|
value: 0.9228 |
|
|
name: NDCG@10 |
|
|
--- |
|
|
|
|
|
# CodeCompass-Embed |
|
|
|
|
|
**CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks. |
|
|
|
|
|
## Model Highlights |
|
|
|
|
|
- #1 on CodeTrans-DL (code translation between frameworks)
- #4 on CodeSearchNet-Python (natural language to code search)
- 494M parameters, 896-dim embeddings
- Bidirectional attention (converted from a causal LLM)
- Mean pooling with L2 normalization
- Trained at 512 tokens; extrapolates to longer sequences via RoPE
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value | |
|
|
|----------|-------| |
|
|
| Base Model | Qwen2.5-Coder-0.5B | |
|
|
| Parameters | 494M | |
|
|
| Embedding Dimension | 896 | |
|
|
| Max Sequence Length | 512 (training) / 32K (inference) | |
|
|
| Pooling | Mean | |
|
|
| Normalization | L2 | |
|
|
| Attention | Bidirectional (all 24 layers) | |
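The pooling recipe above (mask-aware mean pooling followed by L2 normalization) can be sketched in isolation with random tensors; shapes are illustrative, but the hidden size of 896 matches the model:

```python
import torch
import torch.nn.functional as F

# Illustrative batch: 2 sequences, 5 token positions, hidden size 896
hidden = torch.randn(2, 5, 896)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

# Zero out padding positions, then average over valid tokens only
mask = attention_mask.unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2-normalize so cosine similarity reduces to a plain dot product
embeddings = F.normalize(embeddings, p=2, dim=-1)

print(embeddings.shape)  # torch.Size([2, 896])
```

Clamping the mask sum guards against division by zero for fully padded rows.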
|
|
|
|
|
## Benchmark Results (CoIR) |
|
|
|
|
|
Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (NDCG@10). Sorted by CSN-Python. |
|
|
|
|
|
| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps | |
|
|
|-------|--------|------------|--------------|----------|-------|-------|------| |
|
|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 | |
|
|
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 | |
|
|
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 | |
|
|
| **CodeCompass-Embed** | **494M** | **0.9228** | **0.3305** | **0.5673** | **0.6480** | **0.4080** | **0.1277** | |
|
|
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 | |
|
|
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 | |
|
|
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 | |
|
|
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 | |
|
|
|
|
|
*CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python.* |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import torch.nn.functional as F |
|
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True) |
|
|
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed") |
|
|
|
|
|
# Enable bidirectional attention |
|
|
for layer in model.layers: |
|
|
layer.self_attn.is_causal = False |
|
|
|
|
|
model.eval() |
|
|
|
|
|
def encode(texts, is_query=False):
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
|
|
|
|
|
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt") |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs, output_hidden_states=True) |
|
|
hidden = outputs.hidden_states[-1] |
|
|
mask = inputs["attention_mask"].unsqueeze(-1).float() |
|
|
embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9) |
|
|
embeddings = F.normalize(embeddings, p=2, dim=-1) |
|
|
|
|
|
return embeddings |
|
|
|
|
|
query_emb = encode(["sort a list"], is_query=True) |
|
|
code_embs = encode(["def sort(lst): return sorted(lst)"]) |
|
|
similarity = (query_emb @ code_embs.T).item() |
|
|
``` |
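Because the embeddings are L2-normalized, cosine similarity is just a dot product, so top-k retrieval over a corpus reduces to a single matrix multiply. A minimal sketch with stand-in unit vectors (in practice these come from `encode` above):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings; real ones come from the model (dim 896)
corpus = F.normalize(torch.randn(100, 896), p=2, dim=-1)
query = corpus[42:43].clone()  # make document 42 an exact match for the demo

# Dot product on unit vectors == cosine similarity
scores = query @ corpus.T                    # shape (1, 100)
top_scores, top_idx = scores.topk(k=5, dim=-1)

print(top_idx[0, 0].item())  # 42, the exact-match document
```

For large corpora the same dot-product scoring plugs directly into an approximate nearest-neighbor index.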
|
|
|
|
|
## Instruction Templates |
|
|
|
|
|
| Task | Template |
|------|----------|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |
|
|
|
|
|
Documents do not need instruction prefixes. |
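A small helper that applies these templates to queries. The task keys and the `apply_instruction` name are illustrative, not part of the model's API; the template strings mirror the table above:

```python
# Task -> instruction template, matching the table above
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}",
    "qa": "Instruct: Find the most relevant answer given the following question:\nQuery: {q}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}",
}

def apply_instruction(task: str, query: str) -> str:
    """Prefix a query with its task instruction; documents are encoded as-is."""
    return TEMPLATES[task].format(q=query)

print(apply_instruction("nl2code", "sort a list"))
```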
|
|
|
|
|
## Training |
|
|
|
|
|
- **Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet |
|
|
- **Loss**: InfoNCE (τ = 0.05) with 7 hard negatives per sample
|
|
- **Batch Size**: 1024 (via GradCache) |
|
|
- **Steps**: 950 |
|
|
- **Hardware**: NVIDIA H100 |
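The InfoNCE objective scores each query against its paired document (the positive) and its hard negatives. A minimal sketch with random unit vectors, using the temperature and negative count listed above; GradCache and in-batch negatives are omitted for clarity:

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE over one positive and K hard negatives per query.

    query:     (B, D) L2-normalized query embeddings
    positive:  (B, D) L2-normalized positive document embeddings
    negatives: (B, K, D) L2-normalized hard-negative embeddings
    """
    pos_logits = (query * positive).sum(-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", query, negatives)  # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)     # positive sits at index 0
    return F.cross_entropy(logits, labels)

B, K, D = 8, 7, 896
q = F.normalize(torch.randn(B, D), dim=-1)
p = F.normalize(torch.randn(B, D), dim=-1)
n = F.normalize(torch.randn(B, K, D), dim=-1)
loss = info_nce(q, p, n)
```

Dividing by a small temperature sharpens the softmax, which pushes the model harder to separate positives from near-duplicate negatives.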
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback) |
|
|
- Trained only on Python/JavaScript/Java/Go/PHP/Ruby; performance on other languages is untested
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{codecompass2026, |
|
|
author = {Faisal Mumtaz}, |
|
|
title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search}, |
|
|
year = {2026}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/faisalmumtaz/codecompass-embed} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|