---
license: apache-2.0
language:
- code
library_name: peft
tags:
- code-search
- text-embeddings
- decoder-only
- supervised-contrastive-learning
- mistral
- llm2vec
---

## 🚀 Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

This model is an official artifact from our research paper: **"[Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)"**.

In this work, we conduct a large-scale systematic evaluation of decoder-only Large Language Models for the task of code search and present a set of effective fine-tuning and optimization strategies.

For complete details on all our experiments, to reproduce the full training/evaluation pipeline, or to use other models from the paper, please visit our official GitHub repository:

➡️ **[GitHub: Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)**

---

# Model Card: DCS-CodeMistral-7B-It-SupCon-CSN

## 📖 Model Description

This is a PEFT adapter for the **`uukuguy/speechless-code-mistral-7b-v1.0`** model, fine-tuned for the task of **Code Search** as part of the research mentioned above.

The model was trained with the **Supervised Contrastive Learning** method proposed in the [llm2vec](https://github.com/McGill-NLP/llm2vec) framework and is designed to produce high-quality vector embeddings for natural-language queries and code snippets.
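
For intuition, supervised contrastive training pulls each query toward its matching code snippet while pushing it away from the other snippets in the batch (in-batch negatives). The sketch below is a minimal InfoNCE-style illustration of this idea, not the exact loss or hyperparameters used in the paper; the `temperature` value is an assumption.

```python
import torch
import torch.nn.functional as F

def supcon_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss over a batch of matched (query, code) pairs.

    Row i of `query_emb` matches row i of `code_emb`; every other code
    embedding in the batch serves as an in-batch negative.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)         # positives lie on the diagonal
```
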
## 🔬 Model Performance & Reproducibility

The table below provides details about this model and how to reproduce the evaluation reported in our paper.

| Attribute | Details |
| :--- | :--- |
| **Base Model** | `uukuguy/speechless-code-mistral-7b-v1.0` |
| **Fine-tuning Method** | Supervised Contrastive Learning via `llm2vec` |
| **Evaluation Script** | [CSN_Test_Finetuning_Decoder_Model.py](https://github.com/Georgepitt/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CSN_Test_Finetuning_Decoder_Model.py),<br>[CoSQA_Plus_Test_Finetuning_Decoder_Model.py](https://github.com/ChenyxEugene/DecoderLLMs-CodeSearch/blob/main/Fine-tuning/CoSQA_Plus_Test_Finetuning_Decoder_Model.py) |
| **Prerequisite Model** | This adapter must be loaded on top of an MNTP pre-trained model. |
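
Code-search benchmarks such as CodeSearchNet (CSN) are commonly scored with Mean Reciprocal Rank (MRR). As a rough illustration of what such an evaluation computes (the linked scripts may differ in candidate pooling and batching), a minimal MRR over cosine similarities might look like this:

```python
import torch
import torch.nn.functional as F

def mean_reciprocal_rank(query_emb: torch.Tensor, code_emb: torch.Tensor) -> float:
    """MRR where the correct code for query i is snippet i (rank 1 = best)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(code_emb, dim=-1).T
    # Rank of the correct snippet = number of candidates scoring at least as high.
    ranks = (sims >= sims.diagonal().unsqueeze(1)).sum(dim=1)
    return (1.0 / ranks.float()).mean().item()
```
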
---

## 🚀 How to Use (with `llm2vec`)

For best results, we strongly recommend using the official `llm2vec` wrapper to load and use this model.

**1. Install Dependencies**
```bash
pip install llm2vec transformers torch peft accelerate
```

**2. Example Usage**

> **Important**: The `llm2vec` supervised contrastive (SupCon) models are fine-tuned on top of **MNTP (Masked Next Token Prediction)** models. Therefore, loading requires first merging the MNTP weights into the base model before loading the SupCon adapter.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

# --- 1. Define Model IDs ---
base_model_id = "uukuguy/speechless-code-mistral-7b-v1.0"
mntp_model_id = "SYSUSELab/DCS-CodeMistral-7B-It-MNTP"
supcon_model_id = "SYSUSELab/DCS-CodeMistral-7B-It-SupCon-CSN"

# --- 2. Load Base Model and Merge the MNTP Adapter ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
config = AutoConfig.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)
model = PeftModel.from_pretrained(model, mntp_model_id)
model = model.merge_and_unload()

# --- 3. Load the SupCon Adapter (this model) on Top of the MNTP-Merged Model ---
model = PeftModel.from_pretrained(model, supcon_model_id)

# --- 4. Use the LLM2Vec Wrapper for Encoding ---
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

queries = ["how to read a file in Python?"]
code_snippets = ["with open('file.txt', 'r') as f:\n    content = f.read()"]
query_embeddings = l2v.encode(queries)
code_embeddings = l2v.encode(code_snippets)

print("Query Embedding Shape:", query_embeddings.shape)
# This usage example is adapted from the official llm2vec repository. Credits to the original authors.
```
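
Once queries and code snippets are embedded, retrieval is a nearest-neighbor search in the embedding space. Continuing the example above (reusing `query_embeddings`, `code_embeddings`, and `code_snippets` from it), a cosine-similarity ranking could look like this:

```python
import torch.nn.functional as F

# Rank code snippets for each query by cosine similarity.
scores = F.normalize(query_embeddings, dim=-1) @ F.normalize(code_embeddings, dim=-1).T
best = scores.argmax(dim=-1)  # index of the top-scoring snippet per query
print("Top match for query 0:", code_snippets[best[0].item()])
```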

---

## 📝 Citation

If you use our model or work in your research, please cite our paper. As our method is built upon `llm2vec`, please also cite their foundational work.

**Our Paper:**
* **Paper Link:** [Are Decoder-Only Large Language Models the Silver Bullet for Code Search?](https://arxiv.org/abs/2410.22240)
* **GitHub:** [https://github.com/Georgepitt/DecoderLLMs-CodeSearch](https://github.com/Georgepitt/DecoderLLMs-CodeSearch)
* **BibTeX:**
```bibtex
@article{chen2024decoder,
  title={Are Decoder-Only Large Language Models the Silver Bullet for Code Search?},
  author={Chen, Yuxuan and Liu, Mingwei and Ou, Guangsheng and Li, Anji and Dai, Dekun and Wang, Yanlin and Zheng, Zibin},
  journal={arXiv preprint arXiv:2410.22240},
  year={2024}
}
```

**llm2vec (Foundational Work):**
* **Paper Link:** [LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders](https://arxiv.org/abs/2404.05961)
* **GitHub:** [https://github.com/McGill-NLP/llm2vec](https://github.com/McGill-NLP/llm2vec)
* **BibTeX:**
```bibtex
@article{behnamghader2024llm2vec,
  title={LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders},
  author={BehnamGhader, Parishad and Adlakha, Vaibhav and Mosbach, Marius and Bahdanau, Dzmitry and Chapados, Nicolas and Reddy, Siva},
  journal={arXiv preprint arXiv:2404.05961},
  year={2024}
}
```