---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeLlama-7B-It-SupCon-CSN

## 📜 Model Description

This is a PEFT adapter for the **`meta-llama/CodeLlama-7b-Instruct-hf`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.

The model was trained using the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework, designed to generate high-quality vector embeddings for code snippets.
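As a rough illustration of what a contrastive objective optimizes, the sketch below computes an InfoNCE-style loss with in-batch negatives on toy vectors. It is not the `llm2vec` training code: the function names, the toy data, and the temperature value are all assumptions made for illustration, and the actual training recipe is defined in the `llm2vec` repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(query_vecs, code_vecs, temperature=0.05):
    """Mean InfoNCE-style loss where the positive for query i is code i.

    Every other code vector in the batch acts as a negative for query i.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, c) / temperature for c in code_vecs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical toy batch: two queries and two code vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # code i is close to query i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # code i is close to the other query

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each query sits closest to its own code snippet and large when the pairs are mismatched, which is the behavior that pulls matching query and code embeddings together during fine-tuning.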

## 🔬 Model Performance & Reproducibility

The table below lists this model's base model, fine-tuning method, evaluation scripts, and loading prerequisite. See the paper for the corresponding results.

| Attribute | Details |
| :--- | :--- |
| **Base Model** | `meta-llama/CodeLlama-7b-Instruct-hf` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This model must be loaded on top of an MNTP pre-trained model. |

---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**
```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. To load this model, first merge the MNTP adapter into the base model, then load the SupCon adapter on top of the merged model.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-SupCon-CSN" 

# --- 2. Load Base Model and MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the Supervised (this model) Adapter on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
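To turn the embeddings into search results, a common approach is to rank code snippets by cosine similarity to the query. The sketch below shows that step in plain Python on hypothetical stand-in vectors; `rank_snippets` and the toy data are illustrative assumptions, and in practice the inputs would be the encoded embeddings converted to lists of floats (for example with `.tolist()`).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_snippets(query_vec, code_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, c)) for i, c in enumerate(code_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in vectors; real ones would come from the encoder,
# e.g. query_embeddings[0].tolist() and code_embeddings.tolist().
query_vec = [0.2, 0.9, 0.1]
code_vecs = [
    [0.9, 0.1, 0.0],  # points in a different direction from the query
    [0.1, 0.8, 0.2],  # points in nearly the same direction as the query
    [0.0, 0.1, 0.9],  # points in a different direction from the query
]

ranking = rank_snippets(query_vec, code_vecs)
for index, score in ranking:
    print(f"snippet {index}: cosine similarity {score:.3f}")
```

For large candidate pools, the same ranking is usually computed as a single matrix product over normalized embeddings rather than a Python loop.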

---

## 📄 Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
    ```bibtex
    @article{chen2024decoder,
      title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
      author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
      journal={arXiv preprint arXiv:2410.22240},
      year={2024}
    }
    ```

**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
    ```bibtex
    @article{behnamghader2024llm2vec,
        title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
        author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
        journal={arXiv preprint arXiv:2404.05961},
        year={2024}
    }
    ```