---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---

## 🔍 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeGemma-7b-it-SupCon-CSN

## 📖 Model Description

This is a PEFT adapter for the **`google/codegemma-7b-it`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.

The model was trained with the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to generate high-quality vector embeddings for natural-language queries and code snippets.

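Concretely, supervised contrastive training pulls paired query/code embeddings together while pushing apart mismatched pairs in the same batch. Below is a minimal sketch of such an InfoNCE-style objective with in-batch negatives; it illustrates the general technique only, not the exact loss implementation used by `llm2vec` or in our training runs (the `temperature` value is illustrative):

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(query_emb, code_emb, temperature=0.05):
    """InfoNCE-style loss: the i-th query's positive is the i-th code
    snippet; every other snippet in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=1)   # (B, D) unit-norm query embeddings
    c = F.normalize(code_emb, dim=1)    # (B, D) unit-norm code embeddings
    logits = q @ c.T / temperature      # (B, B) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```
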
## 🔬 Model Performance & Reproducibility

The table below details this model's training setup and how to reproduce its evaluation; the full results are reported in our paper.

| Attribute | Details |
| :------------------------- | :------------------------------------------------------------------------------------------------------------------------------ |
| **Base Model** | `google/codegemma-7b-it` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This model must be loaded on top of an MNTP pre-trained model. |
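
For context, code-search performance on these benchmarks is conventionally measured with Mean Reciprocal Rank (MRR): each query is scored by the reciprocal of the rank its ground-truth snippet achieves among all candidates. The linked scripts above are the authoritative evaluation; the sketch below only illustrates the metric, assuming one ground-truth snippet per query, aligned by index:

```python
import torch

def mean_reciprocal_rank(query_emb, code_emb):
    """MRR where the i-th query's ground-truth snippet is the i-th code
    embedding, and all code embeddings form the candidate pool."""
    scores = query_emb @ code_emb.T      # (N, N) similarity matrix
    gt = scores.diag().unsqueeze(1)      # (N, 1) score of each ground-truth pair
    ranks = (scores >= gt).sum(dim=1)    # (N,) 1-based rank of the ground truth
    return (1.0 / ranks.float()).mean().item()
```
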
---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**

```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Loading therefore requires merging the MNTP weights into the base model before loading the SupCon adapter.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "google/codegemma-7b-it"
mntp_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeGemma-7B-It-SupCon-CSN-java"

# --- 2. Load Base Model and Merge the MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the SupCon Adapter (this model) on Top of the MNTP-Merged Model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```

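With embeddings in hand, code search reduces to nearest-neighbor ranking in the embedding space. Continuing from the example above (and assuming, as in the official `llm2vec` examples, that `encode` returns a 2-D tensor), snippets can be ranked per query by cosine similarity:

```python
import torch.nn.functional as F

# Cosine similarity between every query and every code snippet.
q = F.normalize(query_embeddings, p=2, dim=1)
c = F.normalize(code_embeddings, p=2, dim=1)
scores = q @ c.T                     # (num_queries, num_snippets)

best = scores.argmax(dim=1)          # top-ranked snippet index per query
print("Top snippet for query 0:", code_snippets[best[0].item()])
```
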
---

## 📚 Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**

```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```

**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**

```bibtex
@article{behnamghader2024llm2vec,
  title={{LLM2Vec}: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```