File size: 5,974 Bytes
6b2f6ab ec357dc a4d0122 6b2f6ab 14632b6 b9fddd3 524e216 00888b9 524e216 e0b59fe 00888b9 b9fddd3 14632b6 b9fddd3 00888b9 b9fddd3 524e216 b9fddd3 524e216 b9fddd3 524e216 b9fddd3 524e216 b9fddd3 00888b9 b9fddd3 14632b6 d85679f |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 |
---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
library_name: transformers
tags:
- code
- sentence-transformers
pipeline_tag: feature-extraction
---
<div align="center" style="display: flex; justify-content: center; align-items: center; gap: 20px;">
<a href="https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/" style="display: flex; align-items: center; text-decoration: none; color: inherit;">
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="30" height="30" style="vertical-align: middle; margin-right: 8px;">
<span style="font-size: 1.5em; font-weight: bold;">CodeFuse-Embeddings</span>
</a>
</div>
# A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
[Paper](https://huggingface.co/papers/2512.21332) | [Code](https://github.com/codefuse-ai/CodeFuse-Embeddings)
**C2LLMs (Code Contrastive Large Language Models)** are powerful new models for generating code embeddings, designed to capture the deep semantics of source code.
#### Key Features
- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities.
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**. This allows the model to dynamically focus on the most critical parts of the code, creating a more informative and robust embedding.
- **Trained for Retrieval**: C2LLM was fine-tuned on a massive dataset of **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search tasks. Supporting Text2Code/Code2Code/Code2Text tasks.
C2LLM is designed to be a go-to model for tasks like code search and Retrieval-Augmented Generation (RAG). For more details, please see our [GitHub repository](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main).
# Model Details
# How to use
## Usage (**HuggingFace Transformers**)
```Python
from transformers import AutoModel, AutoTokenizer
import torch
model_path = "codefuse-ai/C2LLM-0.5B"
# Load the model
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
# Prepare your custom instruction
instruction = "xxxxx"
# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;
byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);
if (derived0.length != derived1.length) return false;
int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']
sentences = [instruction+sentence for sentence in sentences]
# Get the embeddings
embeddings = model.encode(sentences)
```
## Usage (**Sentence-Transformers**)
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("codefuse-ai/C2LLM-0.5B", trust_remote_code=True, tokenizer_kwargs={"padding_side":"left"})
# Prepare your custom instruction
instruction = "xxxxx"
# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;
byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);
if (derived0.length != derived1.length) return false;
int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']
sentences = [instruction+sentence for sentence in sentences]
# Get the embeddings
embeddings = model.encode(sentences)
```
## Evaluation (**MTEB**)
```python
from sentence_transformers import SentenceTransformer
from mteb.models import ModelMeta
from mteb.cache import ResultCache
model_name = "codefuse-ai/C2LLM-0.5B"
# Load the model
model = mteb.get_model(model_name) # if the model is not implemented in MTEB it will be eq. to SentenceTransformer(model_name)
# Select tasks
tasks = mteb.get_tasks(tasks=["AppsRetrieval", "CodeSearchNetCCRetrieval", "CodeEditSearchRetrieval","CodeSearchNetRetrieval","CodeFeedbackMT","CodeFeedbackST","CodeTransOceanContest","CodeTransOceanDL","COIRCodeSearchNetRetrieval","CosQA","StackOverflowQA","SyntheticText2SQL"])
# Cache the result
cache = ResultCache("./c2llm_results")
# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```
## Support Us
If you find this project helpful, please give it a star. It means a lot to us!
[](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main)
## Citation
@article{2025C2LLM,
title={C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling},
author={Jin Qin and Zihan Liao and Ziyin Zhang and Hang Yu and Peng Di and Rui Wang},
journal = {CoRR},
volume = {abs/2512.21332},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2512.21332},
doi = {10.48550/ARXIV.2512.21332},
eprinttype = {arXiv},
eprint = {2512.21332}
}
|