Spaces:
Configuration error
Configuration error
| license: bsd-3-clause | |
| tags: | |
| - codet5 | |
| datasets: | |
| - code_search_net | |
| inference: true | |
| # CodeT5-base for Code Summarization | |
| [CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data in a multi-lingual training setting ( | |
| Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in this EMNLP 2021 | |
| paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859) | |
| by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. Please check out more | |
| at [this repository](https://github.com/salesforce/CodeT5). | |
| ## How to use | |
| Here is how to use this model: | |
| ```python | |
| from transformers import RobertaTokenizer, T5ForConditionalGeneration | |
| if __name__ == '__main__': | |
| tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum') | |
| model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum') | |
| text = """def svg_to_image(string, size=None): | |
| if isinstance(string, unicode): | |
| string = string.encode('utf-8') | |
| renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string)) | |
| if not renderer.isValid(): | |
| raise ValueError('Invalid SVG data.') | |
| if size is None: | |
| size = renderer.defaultSize() | |
| image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32) | |
| painter = QtGui.QPainter(image) | |
| renderer.render(painter) | |
| return image""" | |
| input_ids = tokenizer(text, return_tensors="pt").input_ids | |
| generated_ids = model.generate(input_ids, max_length=20) | |
| print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) | |
| # this prints: "Convert a SVG string to a QImage." | |
| ``` | |
| ## Fine-tuning data | |
| We employ the filtered version of CodeSearchNet data [[Husain et al., 2019](https://arxiv.org/abs/1909.09436)] | |
| from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on | |
| code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can | |
| prepare text (or code) for the model using RobertaTokenizer with the vocab files from [codet5-base](https://huggingface.co/Salesforce/codet5-base). | |
| ### Data statistic | |
| | Programming Language | Training | Dev | Test | | |
| | :------------------- | :------: | :----: | :----: | | |
| | Python | 251,820 | 13,914 | 14,918 | | |
| | PHP | 241,241 | 12,982 | 14,014 | | |
| | Go | 167,288 | 7,325 | 8,122 | | |
| | Java | 164,923 | 5,183 | 10,955 | | |
| | JavaScript | 58,025 | 3,885 | 3,291 | | |
| | Ruby | 24,927 | 1,400 | 1,261 | | |
| ## Training procedure | |
| We fine-tune codet5-base on these six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) in the multi-task learning setting. We employ the | |
| balanced sampling to avoid biasing towards high-resource tasks. Please refer to the [paper](https://arxiv.org/abs/2109.00859) for more details. | |
| ## Evaluation results | |
| Unlike the paper allowing to select different best checkpoints for different programming languages (PLs), here we employ one checkpoint for | |
| all PLs. Besides, we remove the task control prefix to specify the PL in training and inference. The results on the test set are shown as below: | |
| | Model | Ruby | Javascript | Go | Python | Java | PHP | Overall | | |
| | ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: | | |
| | Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 | | |
| | Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 | | |
| | [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 | | |
| | [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 | | |
| | [PLBART](https://aclanthology.org/2021.naacl-main.211.pdf) | 14.11 |15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 | | |
| | [CodeT5-small](https://arxiv.org/abs/2109.00859) |14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 | | |
| | [CodeT5-base](https://arxiv.org/abs/2109.00859) | **15.24** | 16.16 | 19.56 | 20.01 | **20.31** | 26.03 | 19.55 | | |
| | [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | 20.26 | **26.10** | **19.69** | | |
| ## Citation | |
| ```bibtex | |
| @inproceedings{ | |
| wang2021codet5, | |
| title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation}, | |
| author={Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi}, | |
| booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021}, | |
| year={2021}, | |
| } | |
| ``` |