---
datasets:
- Mir-2002/python-google-style-docstrings
language:
- en
metrics:
- bleu
- rouge
base_model:
- Salesforce/codet5p-220m-bimodal
pipeline_tag: summarization
tags:
- code
---

# Overview

This is a fine-tuned CodeT5+ (220M) bimodal model, trained on a dataset of 59,000 Python code-docstring pairs. The docstrings follow the Google style format, which looks like this:
```
<Description of the code>

Args:
    <var1> (<data-type>): <description of var1>
    <var2> (<data-type>): <description of var2>

Returns:
    <data-type>: <description of the return value>

Raises:
    <ExceptionType>: <description of when it is raised>
```
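For example, a function documented in this style might look like the following (a hand-written illustration, not model output):

```python
def divide(a, b):
    """Divide one number by another.

    Args:
        a (float): The dividend.
        b (float): The divisor.

    Returns:
        float: The quotient of a and b.

    Raises:
        ZeroDivisionError: If b is zero.
    """
    return a / b
```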

For more information on the dataset, see the dataset linked in the metadata above.

You can test the model with the following snippet:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Mir-2002/codet5p-google-style-docstrings"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

code = """
def calculate_sum(a, b):
    return a + b
"""

inputs = tokenizer.encode(code, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Calculate the sum of two numbers.
#
# Args:
#     a (int): The first number.
#     b (int): The second number.
```

# Hyperparameters

| Hyperparameter | Value |
| ----------- | ----------- |
| MAX_SOURCE_LENGTH | 256 |
| MAX_TARGET_LENGTH | 128 |
| BATCH_SIZE | 16 |
| NUM_EPOCHS | 35 |
| LEARNING_RATE | 3e-5 |
| GRADIENT_ACCUMULATION_STEPS | 4 |
| EARLY_STOPPING_PATIENCE | 2 |
| WEIGHT_DECAY | 0.01 |
| OPTIMIZER | Adafactor |
| LR_SCHEDULER | Linear |
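The configuration above maps roughly onto `Seq2SeqTrainingArguments` from a recent version of `transformers`. This is a sketch of that mapping, not the exact training script; the output directory is a placeholder, and the max source/target lengths are applied at tokenization time rather than here:

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Mirrors the hyperparameter table; "adafactor" and "linear" are the
# optimizer and LR-scheduler identifiers accepted by the Trainer API.
training_args = Seq2SeqTrainingArguments(
    output_dir="codet5p-google-docstrings",  # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=35,
    learning_rate=3e-5,
    weight_decay=0.01,
    optim="adafactor",
    lr_scheduler_type="linear",
    predict_with_generate=True,
    generation_max_length=128,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

# Early stopping with patience 2, passed to Seq2SeqTrainer via callbacks=[...]
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```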

# Loss

On the 35th epoch, the model achieved the following loss:

| Epoch | Training Loss | Validation Loss |
| ----------- | ----------- | ----------- |
| 35 | 0.894800 | 1.268536 |


# BLEU and ROUGE Scores

| BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ----------- | ----------- | ----------- | ----------- |
| 35.40 | 58.55 | 39.46 | 52.43 |
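As a reference point for the ROUGE-L column: ROUGE-L scores the longest common subsequence of tokens between a generated docstring and the reference. The table above was presumably produced with a standard scorer; this minimal pure-Python sketch shows the core F1 computation, skipping stemming and tokenization details:

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over token lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l_f1(prediction, reference):
    # ROUGE-L F1: harmonic mean of LCS precision and recall over tokens.
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("calculate the sum of two numbers",
                       "compute the sum of two numbers"), 2))  # → 0.83
```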