---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- mistral
- llm2vec
---

## 🚀 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeMistral-7B-It-SupCon-CSN

## 📖 Model Description

This is a PEFT adapter for the **`uukuguy/speechless-code-mistral-7b-v1.0`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.

The model was trained with the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to produce high-quality vector embeddings for natural-language queries and code snippets.
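
For intuition, supervised contrastive training pulls each query toward its matching code snippet while pushing it away from the other snippets in the batch (in-batch negatives). The sketch below is a minimal InfoNCE-style illustration of this idea, not the exact loss or hyperparameters used in the paper; the `temperature` value is an assumption.

```python
import torch
import torch.nn.functional as F

def supcon_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss over a batch of matched (query, code) pairs.

    Row i of `query_emb` matches row i of `code_emb`; every other code
    embedding in the batch serves as an in-batch negative.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)         # positives lie on the diagonal
```
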
## 🔬 Model Performance & Reproducibility

The table below provides details about this model and how to reproduce the evaluation reported in our paper.

| Attribute | Details |
| :--- | :--- |
| **Base Model** | `uukuguy/speechless-code-mistral-7b-v1.0` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This adapter must be loaded on top of an MNTP pre-trained model. |
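
Code-search benchmarks such as CodeSearchNet (CSN) are commonly scored with Mean Reciprocal Rank (MRR). As a rough illustration of what such an evaluation computes (the linked scripts may differ in candidate pooling and batching), a minimal MRR over cosine similarities might look like this:

```python
import torch
import torch.nn.functional as F

def mean_reciprocal_rank(query_emb: torch.Tensor, code_emb: torch.Tensor) -> float:
    """MRR where the correct code for query i is snippet i (rank 1 = best)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(code_emb, dim=-1).T
    # Rank of the correct snippet = number of candidates scoring at least as high.
    ranks = (sims >= sims.diagonal().unsqueeze(1)).sum(dim=1)
    return (1.0 / ranks.float()).mean().item()
```
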
---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**
```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Therefore, loading requires first merging the MNTP weights into the base model before loading the SupCon adapter.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "uukuguy/speechless-code-mistral-7b-v1.0"
mntp_model_id = "SYSUSELab/DCS-CodeMistral-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeMistral-7B-It-SupCon-CSN"

# --- 2. Load Base Model and Merge the MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the SupCon Adapter (this model) on Top of the MNTP-Merged Model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
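
Once queries and code snippets are embedded, retrieval is a nearest-neighbor search in the embedding space. Continuing the example above (reusing `query_embeddings`, `code_embeddings`, and `code_snippets` from it), a cosine-similarity ranking could look like this:

```python
import torch.nn.functional as F

# Rank code snippets for each query by cosine similarity.
scores = F.normalize(query_embeddings, dim=-1) @ F.normalize(code_embeddings, dim=-1).T
best = scores.argmax(dim=-1)  # index of the top-scoring snippet per query
print("Top match for query 0:", code_snippets[best[0].item()])
```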

---

## 📝 Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```

**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```