---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
library_name: transformers
tags:
- code
- sentence-transformers
pipeline_tag: feature-extraction
---
<div align="center" style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <a href="https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/" style="display: flex; align-items: center; text-decoration: none; color: inherit;">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="30" height="30" style="vertical-align: middle; margin-right: 8px;">
        <span style="font-size: 1.5em; font-weight: bold;">CodeFuse-Embeddings</span>
    </a>
</div>



# A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

[Paper](https://huggingface.co/papers/2512.21332) | [Code](https://github.com/codefuse-ai/CodeFuse-Embeddings)

**C2LLM (Code Contrastive Large Language Model)** is a family of powerful models for generating code embeddings, designed to capture the deep semantics of source code.

#### Key Features

- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities.
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**. This allows the model to dynamically focus on the most critical parts of the code, creating a more informative and robust embedding.
- **Trained for Retrieval**: C2LLM was fine-tuned on a massive dataset of **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search. It supports Text2Code, Code2Code, and Code2Text tasks.

C2LLM is designed to be a go-to model for tasks like code search and Retrieval-Augmented Generation (RAG). For more details, please see our [GitHub repository](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main). 

# How to use

## Usage (**HuggingFace Transformers**)

```python
from transformers import AutoModel
import torch

model_path = "codefuse-ai/C2LLM-0.5B"

# Load the model
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Prepare your custom instruction
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''	
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```

## Usage (**Sentence-Transformers**)

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("codefuse-ai/C2LLM-0.5B", trust_remote_code=True, tokenizer_kwargs={"padding_side":"left"})

# Prepare your custom instruction
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''	
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
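
For retrieval, the vectors returned by `model.encode` are typically ranked by cosine similarity against a query embedding. A minimal sketch with stand-in NumPy vectors (in real usage you would pass the model's output instead):

```python
import numpy as np

def cosine_rank(query_emb: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Return document indices sorted from most to least similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

# Stand-in embeddings: doc 2 points in the same direction as the query,
# doc 1 partially overlaps, doc 0 is orthogonal.
query = np.array([1.0, 0.0, 1.0])
docs = np.array([[0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0],
                 [2.0, 0.0, 2.0]])
print(cosine_rank(query, docs))  # [2 1 0]
```

The same ranking step is what a RAG pipeline would run over an index of pre-computed code embeddings.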

## Evaluation (**MTEB**)

```python
import mteb
from mteb.cache import ResultCache

model_name = "codefuse-ai/C2LLM-0.5B"

# Load the model
model = mteb.get_model(model_name)  # falls back to SentenceTransformer(model_name) if the model is not registered in MTEB

# Select tasks
tasks = mteb.get_tasks(tasks=[
    "AppsRetrieval", "CodeSearchNetCCRetrieval", "CodeEditSearchRetrieval",
    "CodeSearchNetRetrieval", "CodeFeedbackMT", "CodeFeedbackST",
    "CodeTransOceanContest", "CodeTransOceanDL", "COIRCodeSearchNetRetrieval",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
])

# Cache the result
cache = ResultCache("./c2llm_results")

# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```

## Support Us

If you find this project helpful, please give it a star. It means a lot to us!

[![GitHub stars](https://img.shields.io/github/stars/codefuse-ai/CodeFuse-Embeddings?style=social)](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main)

## Citation

```bibtex
@article{2025C2LLM,
  title      = {C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling},
  author     = {Jin Qin and Zihan Liao and Ziyin Zhang and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2512.21332},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2512.21332},
  doi        = {10.48550/ARXIV.2512.21332},
  eprinttype = {arXiv},
  eprint     = {2512.21332}
}
```