|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- code-understanding |
|
|
- unixcoder |
|
|
pipeline_tag: feature-extraction |
|
|
--- |
|
|
|
|
|
# RepoSim4Py |
|
|
|
|
|
An embedding-based tool for comparing the semantic similarity of Python repositories, using multiple sources of information from each repository.
|
|
|
|
|
## Model Details |
|
|
|
|
|
**RepoSim4Py** is a pipeline, built on the Hugging Face platform, for generating embeddings for specified GitHub Python repositories.
|
|
For each Python repository, it generates embeddings at different levels based on the source code, code documentation, requirements, and README files within the repository. |
|
|
By taking the mean of these embeddings, a repository-level mean embedding is generated. |
|
|
These embeddings can be used to compute semantic similarities at different levels, for example by using cosine similarity to compare repositories.
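As a minimal sketch of the comparison step, cosine similarity between two mean embeddings can be computed as follows (the two example vectors are hypothetical, not real pipeline output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical repository-level mean embeddings
repo_a = [0.1, 0.3, -0.2]
repo_b = [0.2, 0.1, -0.1]
print(cosine_similarity(repo_a, repo_b))  # value in [-1, 1]; closer to 1 = more similar
```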
|
|
|
|
|
### Model Description |
|
|
|
|
|
The model used by **RepoSim4Py** is **UniXcoder** fine-tuned on [code search task](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search), using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset. |
|
|
|
|
|
- **Pipeline developed by:** [Henry65](https://huggingface.co/Henry65) |
|
|
- **Repository:** [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py) |
|
|
- **Model type:** **code understanding** |
|
|
- **Language(s):** **Python** |
|
|
- **License:** **MIT** |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) |
|
|
- **Paper:** [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf) |
|
|
|
|
|
## Uses |
|
|
|
|
|
Below is an example of how to use the RepoSim pipeline to easily generate embeddings for GitHub Python repositories. |
|
|
|
|
|
First, initialise the pipeline: |
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
model = pipeline(model="Lazyhope/RepoSim", trust_remote_code=True) |
|
|
``` |
|
|
Then specify one repository (or multiple repositories as a tuple) as input and get the result as a list of dictionaries:
|
|
```python |
|
|
repo_infos = model("lazyhope/python-hello-world") |
|
|
print(repo_infos) |
|
|
``` |
|
|
Output (long tensor values are omitted):
|
|
```python |
|
|
[{'name': 'lazyhope/python-hello-world', |
|
|
'topics': [], |
|
|
'license': 'MIT', |
|
|
'stars': 0, |
|
|
'code_embeddings': [["def main():\n print('Hello World!')", |
|
|
[-2.0755109786987305, |
|
|
2.813878297805786, |
|
|
2.352170467376709, ...]]], |
|
|
'mean_code_embedding': [-2.0755109786987305, |
|
|
2.813878297805786, |
|
|
2.352170467376709, ...], |
|
|
'doc_embeddings': [['Prints hello world', |
|
|
[-2.3749449253082275, |
|
|
0.5409570336341858, |
|
|
2.2958014011383057, ...]]], |
|
|
'mean_doc_embedding': [-2.3749449253082275, |
|
|
0.5409570336341858, |
|
|
2.2958014011383057, ...]}] |
|
|
``` |
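Building on the output format shown above, a hedged sketch of comparing two repositories via their `mean_code_embedding` fields (the second repository name is a hypothetical placeholder):

```python
import numpy as np

def repo_similarity(info_a, info_b, key="mean_code_embedding"):
    """Cosine similarity between the chosen embedding field of two repo info dicts."""
    a = np.asarray(info_a[key], dtype=float)
    b = np.asarray(info_b[key], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a live pipeline (second repository name is hypothetical):
# infos = model(("lazyhope/python-hello-world", "owner/another-python-repo"))
# print(repo_similarity(infos[0], infos[1]))
```

The same function can be pointed at `mean_doc_embedding` via the `key` argument to compare repositories by their documentation instead of their code.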
|
|
|
|
|
## Training Details |
|
|
|
|
|
Please follow the original [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) page for details of fine-tuning it on the code search task.
|
|
|
|
|
## Evaluation |
|
|
|
|
|
We used the [awesome-python](https://github.com/vinta/awesome-python) list, which contains over 500 Python repositories categorised into different topics, to label similar repositories.
|
|
The evaluation metrics and results can be found in the RepoSim repository, under the [notebooks](https://github.com/RepoAnalysis/RepoSim/tree/main/notebooks) folder. |
|
|
|
|
|
## Acknowledgements |
|
|
Many thanks to the authors of the UniXcoder model and the AdvTest dataset, as well as the awesome-python list, for providing a useful baseline.
|
|
- **UniXcoder** (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) |
|
|
- **AdvTest** (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) |
|
|
- **awesome-python** (https://github.com/vinta/awesome-python) |
|
|
|
|
|
## Authors |
|
|
- **Zihao Li** (https://github.com/lazyhope) |
|
|
- **Rosa Filgueira** (https://www.rosafilgueira.com) |