# CodeT5 for Code Comment Generation
This is a CodeT5 model fine-tuned from Salesforce/codet5-base to generate natural language comments from Python code snippets. It maps code to descriptive comments and can be used for automated code documentation, code understanding, or educational purposes.
# Model Details
**Model Description**
- **Model Type:** Sequence-to-Sequence Transformer
- **Base Model:** Salesforce/codet5-base
- **Maximum Sequence Length:** 128 tokens (input and output)
- **Output:** Natural language comments describing the input code
- **Task:** Code-to-comment generation
# Model Sources
- **Documentation:** CodeT5 Documentation
- **Repository:** [CodeT5 on GitHub](https://github.com/salesforce/CodeT5)
- **Hugging Face:** [CodeT5 on Hugging Face](https://huggingface.co/Salesforce/codet5-base)
# Full Model Architecture
```
T5ForConditionalGeneration(
  (shared): Embedding(32100, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(32100, 768)
    (block): ModuleList(...)
    (final_layer_norm): LayerNorm((768,), eps=1e-12)
    (dropout): Dropout(p=0.1)
  )
  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
)
```
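One way to sanity-check the architecture locally is to load the checkpoint, print the module tree, and count parameters. A minimal sketch, assuming the checkpoint ID used in the usage example below:

```python
from transformers import T5ForConditionalGeneration

# Checkpoint ID from the usage example below; swap in your own fine-tuned model if needed
model = T5ForConditionalGeneration.from_pretrained("AventIQ-AI/t5_code_summarizer")

print(model)  # prints the module tree shown above

# Total parameter count (codet5-base is roughly 220M parameters)
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")
```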
# Usage
First install the required packages:
```bash
pip install -U transformers torch datasets
```
Then load the model and run inference:
```python
import torch
from transformers import T5ForConditionalGeneration, RobertaTokenizer

# Download from the 🤗 Hub
model_name = "AventIQ-AI/t5_code_summarizer"  # Update with your HF model ID
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Inference
code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device)
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=4,
    early_stopping=True,
)
comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_snippet}")
print(f"Comment: {comment}")
# Expected output: something close to "Concatenate elements of a list 'x' of multiple integers to a single integer"
```
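When documenting many snippets at once, batching is usually faster than looping over single inputs. A sketch that reuses the `model`, `tokenizer`, and `device` loaded above; the snippet list is purely illustrative:

```python
def generate_comments(snippets, batch_size=8):
    """Generate a comment for each code snippet, processing in batches."""
    comments = []
    for i in range(0, len(snippets), batch_size):
        batch = snippets[i : i + batch_size]
        # Dynamic padding to the longest snippet in the batch
        inputs = tokenizer(batch, max_length=128, truncation=True,
                           padding=True, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, max_length=128, num_beams=4,
                                 early_stopping=True)
        comments.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return comments

print(generate_comments(["int(''.join(map(str, x)))",
                         "sorted(d.items(), key=lambda kv: kv[1])"]))
```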
# Training Details
**Training Dataset**
- **Name:** janrauhl/conala
- **Size:** 2,300 training samples, 477 validation samples
- **Columns:** snippet (code), rewritten_intent (comment), intent, question_id
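The dataset can be inspected with the 🤗 datasets library. A minimal sketch, assuming janrauhl/conala exposes standard `train`/`validation` splits with the columns listed above:

```python
from datasets import load_dataset

# Assumes standard "train"/"validation" split names; adjust if the dataset differs
dataset = load_dataset("janrauhl/conala")
print(dataset)

example = dataset["train"][0]
print("Code:   ", example["snippet"])
print("Comment:", example["rewritten_intent"])
```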
# Approximate Statistics (based on inspection)
```
snippet:
  Type: string
  Min length: ~10 tokens
  Mean length: ~20-30 tokens (estimated)
  Max length: ~100 tokens (before truncation)
rewritten_intent:
  Type: string
  Min length: ~5 tokens
  Mean length: ~10-15 tokens (estimated)
  Max length: ~50 tokens (before truncation)
```
**Samples:**

| snippet | rewritten_intent |
|---|---|
| `sum(d * 10 ** i for i, d in enumerate(x[::-1]))` | "Concatenate elements of a list 'x' of multiple integers to a single integer" |
| `int(''.join(map(str, x)))` | "Convert a list of integers into a single integer" |
| `datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')` | "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'" |
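The approximate figures above can be reproduced with the model's own tokenizer. A rough sketch, assuming the `tokenizer` from the usage section and the `dataset` loaded in the previous snippet (empty or missing comments are skipped):

```python
import statistics

def length_stats(texts):
    """Return (min, mean, max) token lengths for a list of strings."""
    lengths = [len(tokenizer(t)["input_ids"]) for t in texts if t]
    return min(lengths), statistics.mean(lengths), max(lengths)

train = dataset["train"]
for column in ("snippet", "rewritten_intent"):
    lo, mean, hi = length_stats(train[column])
    print(f"{column}: min={lo}, mean={mean:.1f}, max={hi} tokens")
```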
# Training Hyperparameters
### Non-Default Hyperparameters
- **per_device_train_batch_size:** 4
- **per_device_eval_batch_size:** 4
- **gradient_accumulation_steps:** 2 (effective batch size = 8)
- **num_train_epochs:** 10
- **learning_rate:** 1e-4
- **fp16:** True
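These settings map directly onto 🤗 Transformers' `Seq2SeqTrainingArguments`. A minimal sketch of a matching configuration; the output directory is illustrative and any setting not listed above is left at its default:

```python
from transformers import Seq2SeqTrainingArguments

# Values mirror the non-default hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5-comment-gen",  # hypothetical output path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size = 8
    num_train_epochs=10,
    learning_rate=1e-4,
    fp16=True,
)
```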