---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---
## πŸ“– Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.
In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.
For complete details on all our experiments, and to reproduce the full training/evaluation pipeline or use other models from the paper, please visit our official GitHub repository:
➑️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**
---
# Model Card: DCS-CodeGemma-7b-it-SupCon-CSN
## πŸ“œ Model Description
This is a PEFT adapter for the **`google/codegemma-7b-it`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.
The model was trained with the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to generate high-quality vector embeddings for natural-language queries and code snippets.
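For intuition, supervised contrastive training of embedding models typically optimizes an InfoNCE-style objective over matched query/code pairs, treating the other examples in a batch as negatives. The sketch below illustrates that objective only; it is not the exact training code from the paper or from `llm2vec`, and the temperature value is an assumption.
```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(query_emb: torch.Tensor,
                                code_emb: torch.Tensor,
                                temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive loss with in-batch negatives.

    Row i of `query_emb` and `code_emb` is a matching (query, code) pair;
    every other row in the batch serves as a negative. The temperature
    of 0.05 is illustrative, not the value used in the paper.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)
```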
## πŸ”¬ Model Performance & Reproducibility
The table below provides details about this model, its corresponding results in our paper, and how to reproduce the evaluation.
| Attribute | Details |
| :------------------------- | :------------------------------------------------------------------------------------------------------------------------------ |
| **Base Model** | `google/codegemma-7b-it` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | Must be loaded on top of an MNTP pre-trained model (e.g., `SYSUSELab/DCS-CodeGemma-7B-It-MNTP`); see usage below. |
---
## πŸš€ How to Use (with `llm2vec`)
For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.
**1. Install Dependencies**
```bash
pip install llm2vec transformers torch peft accelerate
```
**2. Example Usage**
> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. You must therefore merge the MNTP weights into the base model first, then load the SupCon adapter on top.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec
# --- 1. Define Model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-CSN-python"
# --- 2. Load Base Model and MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
base_model_id,
trust_remote_code=True,
config=config,
torch_dtype=torch.bfloat16,
device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()  # fold the MNTP weights into the base model
# --- 3. Load the Supervised (this model) Adapter on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)
# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)
print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
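With the embeddings in hand, code search reduces to ranking snippets by their similarity to the query. Below is a minimal follow-up sketch (cosine similarity via PyTorch; variable names continue from the example above):
```python
import torch.nn.functional as F

# Score every (query, snippet) pair by cosine similarity and pick the best match.
similarities = F.cosine_similarity(
    query_embeddings.unsqueeze(1),   # (num_queries, 1, dim)
    code_embeddings.unsqueeze(0),    # (1, num_snippets, dim)
    dim=-1,
)                                    # -> (num_queries, num_snippets)
best_match = similarities.argmax(dim=-1)
print("Best matching snippet index per query:", best_match.tolist())
```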
---
## πŸ“„ Citation
If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.
**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
```bibtex
@article{chen2024decoder,
title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
journal={arXiv preprint arXiv:2410.22240},
year={2024}
}
```
**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```