---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- codegemma
- llm2vec
---

## 📖 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact of our research paper **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeMistral-7B-It-SupCon-CSN

## 📜 Model Description

This is a PEFT adapter for the **`uukuguy/speechless-code-mistral-7b-v1.0`** model, fine-tuned for **code search** as part of the research described above.

The adapter was trained with the **supervised contrastive learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework, and is designed to generate high-quality vector embeddings for natural-language queries and code snippets.

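For intuition, the supervised contrastive objective can be sketched as an in-batch, InfoNCE-style loss over (query, code) pairs: each query is pulled toward its matching code snippet and pushed away from the other snippets in the batch. This is an illustrative sketch only; the exact objective used in training (temperature, negative sampling, pooling) is defined by the `llm2vec` training code, and `supcon_loss` is a hypothetical helper name, not part of any library.

```python
import torch
import torch.nn.functional as F

def supcon_loss(query_embs: torch.Tensor, code_embs: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """Minimal in-batch contrastive loss sketch for (query, code) pairs.

    Row i of each tensor is a matching query/code pair; all other codes
    in the batch act as in-batch negatives.
    """
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(code_embs, dim=-1)
    logits = q @ c.T / temperature       # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))     # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Tiny example with random stand-in embeddings.
torch.manual_seed(0)
loss = supcon_loss(torch.randn(4, 8), torch.randn(4, 8))
print(float(loss))
```

Minimizing this loss drives matching query/code embeddings toward high cosine similarity, which is exactly the property the retrieval step relies on.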
## 🔬 Model Performance & Reproducibility

The table below summarizes this model, its corresponding results in our paper, and how to reproduce the evaluation.

| Attribute | Details |
| :--- | :--- |
| **Base Model** | `uukuguy/speechless-code-mistral-7b-v1.0` |
| **Fine-tuning Method** | Supervised contrastive learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This adapter must be loaded on top of an MNTP pre-trained model. |

---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**
```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Therefore, you must first merge the MNTP weights into the base model before loading the SupCon adapter.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "uukuguy/speechless-code-mistral-7b-v1.0"
mntp_model_id = "SYSUSELab/DCS-CodeMistral-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeMistral-7B-It-SupCon-CSN"

# --- 2. Load Base Model and Merge the MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the SupCon Adapter (this model) on top of the MNTP-merged model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
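Once you have query and code embeddings, retrieval reduces to ranking snippets by cosine similarity to the query. The sketch below uses dummy tensors in place of the `l2v.encode(...)` output; `rank_code_snippets` is a hypothetical helper for illustration, not part of `llm2vec`.

```python
import torch
import torch.nn.functional as F

def rank_code_snippets(query_emb: torch.Tensor, code_embs: torch.Tensor):
    """Rank code snippets by cosine similarity to one query embedding.

    query_emb: shape (d,); code_embs: shape (n, d).
    Returns (indices, scores) sorted best match first.
    """
    scores = F.cosine_similarity(query_emb.unsqueeze(0), code_embs, dim=-1)
    order = torch.argsort(scores, descending=True)
    return order, scores[order]

# Dummy embeddings standing in for l2v.encode(...) output.
query_emb = torch.tensor([1.0, 0.0, 0.0])
code_embs = torch.tensor([
    [0.0, 1.0, 0.0],   # orthogonal to the query -> low score
    [0.9, 0.1, 0.0],   # close to the query -> high score
])
order, scores = rank_code_snippets(query_emb, code_embs)
print(order.tolist())  # → [1, 0]
```

In practice you would encode the full snippet corpus once, cache `code_embs`, and only encode the query at search time.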

---

## 📄 Citation

If you use our model or work in your research, please cite our paper. As our method builds upon `llm2vec`, please also cite their foundational work.

**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```

**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
```bibtex
@article{behnamghader2024llm2vec,
  title={{LLM2Vec}: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```