---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codellama
- llm2vec
---
## Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:
➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**
---
# Model Card: DCS-CodeLlama-7B-It-SupCon-CSN
## Model Description
This is a PEFT adapter for the **`meta-llama/CodeLlama-7b-Instruct-hf`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.
The model was trained using the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework, designed to generate high-quality vector embeddings for code snippets.
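To give an intuition for the training objective: supervised contrastive fine-tuning pulls each query embedding toward its paired code snippet while pushing it away from the other snippets in the batch. The sketch below is an illustrative in-batch InfoNCE-style loss in pure Python, not the actual `llm2vec` training code; the toy vectors and the `temperature` value are arbitrary choices for demonstration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def in_batch_contrastive_loss(query_embs, code_embs, temperature=0.5):
    """InfoNCE-style loss: the code at index i is query i's positive;
    every other code in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, c) / temperature for c in code_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denom - logits[i])  # negative log-softmax of the positive
    return sum(losses) / len(losses)

# Toy batch: query 0 pairs with code 0, query 1 with code 1.
queries = [[1.0, 0.0], [0.0, 1.0]]
codes = [[0.9, 0.1], [0.1, 0.9]]
loss = in_batch_contrastive_loss(queries, codes)
```

Because each query is already closest to its own positive here, the loss comes out well below the chance-level value of ln(2); training drives it lower still by sharpening that separation.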
## Model Performance & Reproducibility
The table below provides details about this model, its corresponding results in our paper, and how to reproduce the evaluation.
| Attribute | Details |
| :------------------------- | :------------------------------------------------------------------------------------------------------------------------------ |
| **Base Model** | `meta-llama/CodeLlama-7b-Instruct-hf` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This model must be loaded on top of an MNTP pre-trained model. |
---
## How to Use (with `llm2vec`)
For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.
**1. Install Dependencies**
```bash
pip install llm2vec transformers torch peft accelerate
```
**2. Example Usage**
> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Therefore, loading requires first merging the MNTP weights before loading the SupCon adapter.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec
# --- 1. Define Model IDs ---
base_model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
mntp_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeLlama-7B-It-SupCon-CSN"
# --- 2. Load Base Model and MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
base_model_id,
trust_remote_code=True,
config=config,
torch_dtype=torch.bfloat16,
device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()
# --- 3. Load the Supervised (this model) Adapter on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)
# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)
print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
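Once query and code embeddings are computed, code search itself reduces to ranking snippets by cosine similarity against the query embedding. A minimal pure-Python sketch of that ranking step follows; the toy vectors stand in for real `l2v.encode(...)` outputs, and `rank_snippets` is a hypothetical helper, not part of the `llm2vec` API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_snippets(query_emb, code_embs):
    """Return snippet indices sorted by descending cosine similarity."""
    scores = [(cosine(query_emb, c), i) for i, c in enumerate(code_embs)]
    return [i for _, i in sorted(scores, reverse=True)]

# Toy embeddings standing in for l2v.encode(...) outputs.
query = [0.8, 0.1, 0.1]
snippets = [
    [0.1, 0.9, 0.0],  # unrelated snippet
    [0.7, 0.2, 0.1],  # close to the query
]
print(rank_snippets(query, snippets))  # most similar snippet first: [1, 0]
```

For large code corpora you would typically precompute and index the snippet embeddings (e.g. with an approximate nearest-neighbor library) rather than scoring every pair per query.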
---
## Citation
If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.
**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
```bibtex
@article{chen2024decoder,
title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
journal={arXiv preprint arXiv:2410.22240},
year={2024}
}
```
**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```