---
license: cc-by-nc-4.0
---


# CPRetriever-Code

**CPRetriever-Code** is a code embedding model trained via contrastive learning for **code-related retrieval tasks** in competitive programming. It achieves strong performance on tasks such as:

* **Text-to-Code** retrieval (problem description β†’ relevant code)
* **Code-to-Code** retrieval (find alternate solutions to the same problem)

This model is part of the [CPRet](https://github.com/coldchair/CPRet) suite for competitive programming retrieval research.

## πŸ”§ Usage

You can load this model using the `sentence-transformers` library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("coldchair16/CPRetriever-Code")
embeddings = model.encode([
    "def mex_query(arr):\n    n = len(arr)\n    seen = set()\n    for i in range(n):\n        seen.add(arr[i])\n    i = 0\n    while True:\n        if i not in seen:\n            return i\n        i += 1"
])
```
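Retrieval then reduces to nearest-neighbor search over the embeddings. A minimal sketch of ranking candidates by cosine similarity, assuming `query_emb` and `code_embs` would come from `model.encode` as above (plain NumPy stand-in vectors are used here so the snippet runs without downloading the model):

```python
import numpy as np

def rank_by_cosine(query_emb, code_embs):
    """Return candidate indices sorted by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return order, scores[order]

# Toy vectors standing in for model.encode(...) outputs.
query_emb = np.array([1.0, 0.0, 1.0])
code_embs = np.array([
    [1.0, 0.1, 0.9],   # similar direction to the query
    [-1.0, 0.5, 0.0],  # dissimilar direction
])
order, scores = rank_by_cosine(query_emb, code_embs)
```

For large candidate pools, the same scoring is typically delegated to a vector index (e.g. FAISS) instead of a dense matrix product.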

## πŸ’‘ Applications

This model is optimized for **code-level semantic retrieval** in competitive programming settings:

* **Text-to-Code**: Retrieve relevant code snippets given a natural language problem description.
* **Code-to-Code**: Retrieve alternative implementations of the same problem.

It is particularly effective for analyzing programming contest submissions, searching solution variants, and building educational tools for code understanding.

## πŸ“š Training and Evaluation

CPRetriever-Code is trained via **contrastive learning** using positive and hard negative code pairs derived from [CPRet-data](https://huggingface.co/datasets/coldchair16/CPRet-data).
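The objective described above can be sketched as an InfoNCE-style loss: each anchor is scored against its paired positive, with the other positives in the batch serving as negatives. This is an illustrative sketch under those assumptions, not the actual CPRet training code:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `positives` is the positive for anchor i;
    all other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal = positive pairs

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
loss_matched = info_nce_loss(anchors, anchors.copy())      # identical pairs
loss_random = info_nce_loss(anchors, rng.normal(size=(4, 8)))
```

With identical anchor/positive pairs the positives dominate the softmax and the loss is near zero; with random positives it is substantially higher, which is the gradient signal that pulls matching pairs together.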

For the training pipeline, see the full project:
πŸ‘‰ [CPRet on GitHub](https://github.com/coldchair/CPRet?tab=readme-ov-file)

## πŸ“¦ Model Card

* Architecture: `Salesforce/SFR-Embedding-Code-2B_R` (encoder backbone)
* Training: Contrastive objective on code/code and text/code pairs
* Format: Compatible with `sentence-transformers`