---
datasets:
- Mir-2002/python-google-style-docstrings
language:
- en
metrics:
- bleu
- rouge
base_model:
- Salesforce/codet5p-220m-bimodal
pipeline_tag: summarization
tags:
- code
---

# Overview

This is a fine-tuned CodeT5+ (220M) bimodal model, trained on a dataset of 59,000 Python code-docstring pairs. The docstrings are in the Google style format.
A Google-style docstring is formatted as follows:
```
<Description of the code>

Args:
    <var1> (<data-type>): <description of var1>
    <var2> (<data-type>): <description of var2>

Returns:
    <data-type>: <description of the return value>

Raises:
    <ExceptionType>: <description of when it is raised>
```
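As a concrete illustration (a made-up example, not taken from the training data), here is a function documented in that style:

```python
def divide(a, b):
    """Divide one number by another.

    Args:
        a (float): The dividend.
        b (float): The divisor.

    Returns:
        float: The quotient a / b.

    Raises:
        ZeroDivisionError: If b is zero.
    """
    return a / b
```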

For more information on the dataset, see the referenced dataset above.

You can test the model with the following snippet:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Mir-2002/codet5p-google-style-docstrings"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

code = """
def calculate_sum(a, b):
    return a + b
"""

inputs = tokenizer.encode(code, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Calculate the sum of two numbers.
#
# Args:
#     a (int): The first number.
#     b (int): The second number.
```

# Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| MAX_SOURCE_LENGTH | 256 |
| MAX_TARGET_LENGTH | 128 |
| BATCH_SIZE | 16 |
| NUM_EPOCHS | 35 |
| LEARNING_RATE | 3e-5 |
| GRADIENT_ACCUMULATION_STEPS | 4 |
| EARLY_STOPPING_PATIENCE | 2 |
| WEIGHT_DECAY | 0.01 |
| OPTIMIZER | Adafactor |
| LR_SCHEDULER | linear |

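Because gradient accumulation was used, the effective batch size per optimizer step is the per-device batch size multiplied by the accumulation steps. A quick sketch of that arithmetic:

```python
BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 4

# Number of examples contributing to each optimizer update
effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
print(effective_batch_size)  # 64
```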
# Loss

At the 35th (final) epoch, the model achieved the following loss:

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 35 | 0.894800 | 1.268536 |


# BLEU and ROUGE Scores

| BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| 35.40 | 58.55 | 39.46 | 52.43 |