Peverell
/

cocosoda_ruby

Model card Files Files and versions

cocosoda_ruby / README.md

SalazarPevelll

model

51c57f8 almost 2 years ago

|

history blame contribute delete

3.18 kB

	# CoCoSoDa: Effective Contrastive Learning for Code Search

	Our approach adopts the pre-trained model as the base code/query encoder and optimizes it using multimodal contrastive learning and soft data augmentation.

	![1](Figure/CoCoSoDa.png)

	CoCoSoDa is comprised of the following four components:
	* Pre-trained code/query encoder captures the semantic information of a code snippet or a natural language query and maps it into a high-dimensional embedding space.
	as the code/query encoder.
	* Momentum code/query encoder encodes the samples (code snippets or queries) of current and previous mini-batches to enrich the negative samples.

	* Soft data augmentation is to dynamically mask or replace some tokens in a sample (code/query) to generate a similar sample as a form of data augmentation.

	* Multimodal contrastive learning loss function is used as the optimization objective and consists of inter-modal and intra-modal contrastive learning loss. They are used to minimize the distance of the representations of similar samples and maximize the distance of different samples in the embedding space.



	## Source code
	### Environment
	```
	conda create -n CoCoSoDa python=3.6 -y
	conda activate CoCoSoDa
	pip install torch==1.10 transformers==4.12.5 seaborn==0.11.2 fast-histogram nltk==3.6.5 networkx==2.5.1 tree_sitter tqdm prettytable gdown more-itertools tensorboardX sklearn
	```
	### Data

	```
	cd dataset
	bash get_data.sh
	```

	Data statistic is shown in this Table.

	\| PL \| Training \| Validation \| Test \| Candidate Codes\|
	\| :--------- \| :------: \| :----: \| :----: \|:----: \|
	\| Ruby \| 24,927 \| 1,400 \| 1,261 \|4,360\|
	\| JavaScript \| 58,025 \| 3,885 \| 3,291 \|13,981\|
	\| Java \| 164,923 \| 5,183 \| 10,955 \|40,347\|
	\| Go \| 167,288 \| 7,325 \| 8,122 \|28,120\|
	\| PHP \| 241,241 \| 12,982 \| 14,014 \|52,660\|
	\| Python \| 251,820 \| 13,914 \| 14,918 \|43,827\|

	It will take about 10min.

	### Training and Evaualtion

	We have uploaded the pre-trained model to [huggingface](https://huggingface.co/). You can directly download [DeepSoftwareAnalytics/CoCoSoDa](https://huggingface.co/DeepSoftwareAnalytics/CoCoSoDa) and fine-tune it.
	#### Pre-training (Optional)

	```
	bash run_cocosoda.sh $lang
	```
	The optimized model is saved in `./saved_models/cocosoda/`. You can upload them to [huggingface](https://huggingface.co/).

	It will take about 3 days.

	#### Fine-tuning


	```
	lang=java
	bash run_fine_tune.sh $lang
	```
	#### Zero-shot running

	```
	lang=python
	bash run_zero-shot.sh $lang
	```


	### Results

	#### The Model Evaluated with MRR

	\| Model \| Ruby \| Javascript \| Go \| Python \| Java \| PHP \| Avg. \|
	\| -------------- \| :-------: \| :--------: \| :-------: \| :-------: \| :-------: \| :-------: \| :-------: \|
	\| CoCoSoDa \| 0.818\| 0.764\| 0.921 \|0.757\| 0.763\| 0.703 \|0.788\|

	## Appendix

	The description of baselines, addtional experimetal results and discussion are shown in `Appendix/Appendix.pdf`.


	## Contact
	Feel free to contact Ensheng Shi (enshengshi@qq.com) if you have any further questions or no response to github issue for more than 1 day.