AventIQ-AI/t5_code_summarizer

YashikaNagpal committed on Feb 28, 2025

Commit ee31b37

1 Parent(s): 5a99f95

Create README.md

README.md +111 -0

	@@ -0,0 +1,111 @@

+# CodeT5 for Code Comment Generation
+This model is Salesforce/codet5-base fine-tuned to generate natural language comments for Python code snippets. It can be used for automated code documentation, code understanding, or educational purposes.
+# Model Details
+**Model Description**
+**Model Type:** Sequence-to-Sequence Transformer
+**Base Model:** Salesforce/codet5-base
+**Maximum Sequence Length:** 128 tokens (input and output)
+**Output:** Natural language comments describing the input code
+**Task:** Code-to-comment generation
+# Model Sources
+**Documentation:** CodeT5 Documentation
+**Repository:** CodeT5 on GitHub
+**Hugging Face:** CodeT5 on Hugging Face
+# Full Model Architecture
+```
+T5ForConditionalGeneration(
+  (shared): Embedding(32100, 768)
+  (encoder): T5Stack(
+    (embed_tokens): Embedding(32100, 768)
+    (block): ModuleList(...)
+    (final_layer_norm): T5LayerNorm()
+    (dropout): Dropout(p=0.1)
+  )
+  (decoder): T5Stack(
+    (embed_tokens): Embedding(32100, 768)
+    (block): ModuleList(...)
+    (final_layer_norm): T5LayerNorm()
+    (dropout): Dropout(p=0.1)
+  )
+  (lm_head): Linear(in_features=768, out_features=32100, bias=False)
+)
+```
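The dimensions in the printout allow a quick size check of the embedding-related layers. The sketch below only multiplies out the shapes shown above (vocabulary size 32100, hidden size 768); it does not load the model, and the names `VOCAB_SIZE`, `HIDDEN_SIZE`, and `matrix_params` are illustrative, not part of any library.

```python
# Shapes copied from the architecture printout above.
VOCAB_SIZE = 32100
HIDDEN_SIZE = 768


def matrix_params(rows: int, cols: int) -> int:
    """Number of weights in a dense rows x cols matrix (no bias term)."""
    return rows * cols


# Embedding(32100, 768): one 768-dimensional vector per vocabulary entry.
embedding_params = matrix_params(VOCAB_SIZE, HIDDEN_SIZE)
# Linear(in_features=768, out_features=32100, bias=False): same size, transposed.
lm_head_params = matrix_params(HIDDEN_SIZE, VOCAB_SIZE)

print(f"shared embedding: {embedding_params:,} weights")
print(f"lm_head:          {lm_head_params:,} weights")
```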
+# Usage
+First, install the dependencies with `pip install -U transformers torch datasets`.
+Then, load the model and run inference:
+```
+import torch
+from transformers import T5ForConditionalGeneration, RobertaTokenizer
+# Download from the 🤗 Hub (replace with your model ID after uploading)
+model_name = "your-username/codet5-conala-comments"  # Update with your HF model ID
+tokenizer = RobertaTokenizer.from_pretrained(model_name)
+model = T5ForConditionalGeneration.from_pretrained(model_name)
+# Move to GPU if available
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+# Inference
+code_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
+inputs = tokenizer(code_snippet, max_length=128, truncation=True, padding="max_length", return_tensors="pt").to(device)
+outputs = model.generate(
+    input_ids=inputs["input_ids"],
+    attention_mask=inputs["attention_mask"],
+    max_length=128,
+    num_beams=4,
+    early_stopping=True
+)
+comment = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(f"Code: {code_snippet}")
+print(f"Comment: {comment}")
+# Expected output: Something close to "Concatenate elements of a list 'x' of multiple integers to a single integer"
+```
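To comment several snippets in one call, the single-snippet example above can be wrapped in a small helper. This is a sketch rather than part of the original model card: `generate_comments` is a hypothetical name, and it assumes `tokenizer` and `model` are the objects loaded above. Unlike the example above, it pads only to the longest snippet in the batch rather than to the full 128 tokens.

```python
def generate_comments(snippets, tokenizer, model, device="cpu",
                      max_length=128, num_beams=4):
    """Return one generated comment per code snippet (hypothetical helper)."""
    # Tokenize the whole batch; pad only up to the longest snippet.
    inputs = tokenizer(
        list(snippets),
        max_length=max_length,
        truncation=True,
        padding=True,
        return_tensors="pt",
    ).to(device)
    # Same generation settings as the single-snippet example.
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # Decode every sequence in the batch back to text.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the objects from the example above, a call would look like `generate_comments(["int(''.join(map(str, x)))"], tokenizer, model, device)`.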
+# Training Details
+**Training Dataset**
+**Name:** janrauhl/conala
+**Size:** 2,300 training samples, 477 validation samples
+**Columns:** snippet (code), rewritten_intent (comment), intent, question_id
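The columns listed above suggest a simple mapping for fine-tuning: `snippet` as the model input and `rewritten_intent` as the target. The sketch below shows that mapping on plain dictionaries. The helper name `to_pairs` and the choice to skip rows with a missing field are assumptions, not a description of the actual training pipeline.

```python
def to_pairs(rows):
    """Map dataset rows to (source_code, target_comment) pairs.

    Rows where either field is missing or empty are skipped (an assumption).
    """
    pairs = []
    for row in rows:
        code = row.get("snippet")
        comment = row.get("rewritten_intent")
        if code and comment:
            pairs.append((code.strip(), comment.strip()))
    return pairs


rows = [
    # Sample taken from this model card.
    {"snippet": "int(''.join(map(str, x)))",
     "rewritten_intent": "Convert a list of integers into a single integer"},
    # Made-up row with no rewritten intent, included to show the skip.
    {"snippet": "x.sort()", "rewritten_intent": None},
]
print(to_pairs(rows))
```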
+# Approximate Statistics (based on inspection):
+```
+snippet:
+  Type: string
+  Min length: ~10 tokens
+  Mean length: ~20-30 tokens (estimated)
+  Max length: ~100 tokens (before truncation)
+rewritten_intent:
+  Type: string
+  Min length: ~5 tokens
+  Mean length: ~10-15 tokens (estimated)
+  Max length: ~50 tokens (before truncation)
+Samples:
+  1. snippet: sum(d * 10 ** i for i, d in enumerate(x[::-1]))
+     rewritten_intent: "Concatenate elements of a list 'x' of multiple integers to a single integer"
+  2. snippet: int(''.join(map(str, x)))
+     rewritten_intent: "Convert a list of integers into a single integer"
+  3. snippet: datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')
+     rewritten_intent: "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'"
+```
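Length statistics like these can be reproduced with a few lines. The sketch below counts whitespace-separated tokens as a stand-in; the model's subword tokenizer would generally give larger counts, so the numbers are not directly comparable to the estimates above. `length_stats` is a hypothetical helper, and the inputs are the three sample comments listed above.

```python
from statistics import mean


def length_stats(texts):
    """Return (min, mean, max) of whitespace-token counts for the given texts."""
    lengths = [len(text.split()) for text in texts]
    return min(lengths), mean(lengths), max(lengths)


comments = [
    "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "Convert a list of integers into a single integer",
    "Convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'",
]
lo, avg, hi = length_stats(comments)
print(f"min={lo} mean={avg:.1f} max={hi}")
```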
+# Training Hyperparameters
+Non-Default Hyperparameters:
+**per_device_train_batch_size:** 4
+**per_device_eval_batch_size:** 4
+**gradient_accumulation_steps:** 2 (effective batch size = 8)
+**num_train_epochs:** 10
+**learning_rate:** 1e-4
+**fp16:** True
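The effective batch size follows from `per_device_train_batch_size` and `gradient_accumulation_steps`, and together with the training set size it gives a rough count of optimizer steps. The sketch below assumes a single device (the device count is not stated) and rounds the last partial batch up; the exact step count depends on how the training loop handles that remainder.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 10
num_devices = 1        # assumption: not stated in the model card
train_samples = 2300   # from the training dataset section

# Samples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Rough estimate: treat the final partial batch as one more update.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"approx. optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
```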
+# Citation
+```
+@inproceedings{wang2021codet5,
+    title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
+    author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C. H.},
+    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
+    year={2021},
+    url={https://arxiv.org/abs/2109.00859}
+}
+```