# CodeBERTa-Python-DocGen

## Overview

`CodeBERTa-Python-DocGen` is a RoBERTa-based model fine-tuned for code-related tasks, specifically **Code-to-Text Generation** (docstring synthesis) and **Code-Text Retrieval** (finding relevant code given a natural language query). The model is pre-trained on a large corpus of Python code from public repositories, with a focus on pairing function bodies with high-quality docstrings and descriptive comments.

It excels at understanding the semantic relationship between a Python function's implementation details and its natural language documentation.

## Model Architecture

* **Base Model:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
* **Pre-training:** Masked Language Modeling (MLM) on Python source code.
* **Fine-tuning Tasks (two-fold):**
  1. **Generation:** Conditional text generation where the input is the function body and the target is the docstring.
  2. **Retrieval:** Learning cross-modal embeddings between the function body and the docstring (using a contrastive loss).
* **Tokenization:** Byte-Pair Encoding (BPE) optimized for code syntax, including the special tokens `<START_CODE>`, `<END_CODE>`, `<START_DOCSTRING>`, and `<END_DOCSTRING>`.
* **Max Sequence Length:** 512 tokens.
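
The special tokens and the 512-token limit suggest a pairing format along the following lines. This is an illustrative sketch only; `format_pair` is a hypothetical helper, and the exact training layout is not documented in this card:

```python
MAX_SEQ_LEN = 512  # the model's maximum sequence length, in tokens


def format_pair(code: str, docstring: str) -> str:
    """Wrap a (code, docstring) pair in the model's special tokens.

    Hypothetical helper: the card documents the tokens themselves,
    not this exact layout.
    """
    return (f"<START_CODE> {code.strip()} <END_CODE> "
            f"<START_DOCSTRING> {docstring.strip()} <END_DOCSTRING>")


pair = format_pair(
    "def add(a, b):\n    return a + b",
    "Return the sum of a and b.",
)
print(pair)
```

Sequences longer than `MAX_SEQ_LEN` tokens would need truncation before being fed to the model.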

## Intended Use

* **Automated Docstring Generation:** Creating initial or full documentation summaries for new or existing Python functions.
* **Code Search Engine:** Ranking and retrieving the most relevant Python function body from a database given a user's natural language search query (e.g., "function to calculate L2 distance").
* **Code Comment Completion:** Suggesting descriptive inline comments within a function body.
* **Code-Text Similarity:** Measuring the semantic similarity between arbitrary code snippets and their descriptive summaries.
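
For the retrieval and similarity use cases, ranking typically reduces to cosine similarity between a query embedding and precomputed code embeddings. A minimal ranking sketch over toy embeddings follows; how the embeddings are obtained from this checkpoint (e.g., mean-pooling the last hidden state of `AutoModel.from_pretrained(...)`) is an assumption noted in the comments, not something the card specifies:

```python
import numpy as np


def rank_by_cosine(query_emb, code_embs):
    """Rank code snippets by cosine similarity to a query embedding.

    In practice the embeddings would come from the model, e.g. by
    mean-pooling the last hidden state over each input (an assumption;
    the card does not specify the pooling strategy).
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    scores = c @ q                    # cosine similarity per snippet
    return np.argsort(scores)[::-1]   # indices, best match first


# Toy embeddings: snippet 1 points in the same direction as the query
query = np.array([1.0, 0.0, 1.0])
snippets = np.array([
    [0.0, 1.0, 0.0],   # unrelated
    [2.0, 0.0, 2.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # partial overlap
])
print(rank_by_cosine(query, snippets))  # index 1 ranks first
```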

## Limitations

* **Hallucination in Docstrings:** While generally coherent, generated docstrings may sometimes misrepresent the actual logic of complex or subtle code due to the generative nature of the model.
* **Library Scope:** The model performs best on code utilizing common scientific and data science libraries present in the training data (e.g., `numpy`, `pandas`, `sklearn`). Performance can be lower for highly specialized, domain-specific libraries.
* **Complexity:** The quality of the docstring degrades rapidly for functions exceeding 100 lines of code or having very high cyclomatic complexity.
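
Given that degradation, it can help to screen out overly long functions before requesting a docstring. A hypothetical pre-filter using only the standard library (the 100-line cutoff mirrors the limitation above; the helper is illustrative, not part of the model's API):

```python
import ast


def is_docgen_friendly(source: str, max_lines: int = 100) -> bool:
    """Return True if every function in `source` is short enough to document.

    Illustrative helper only; the 100-line threshold comes from the
    limitations listed in this card.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                return False
    return True


print(is_docgen_friendly("def f(x):\n    return x * 2\n"))  # True
```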

## Example Code (PyTorch - Text Generation)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Code/CodeBERTa-Python-DocGen"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input: a Python function body, passed as plain text
code_input = """
def calculate_l2_norm(vector_a, vector_b):
    diff = np.array(vector_a) - np.array(vector_b)
    return np.sqrt(np.sum(diff ** 2))
"""

# Wrap the code in the model's special tokens to prompt docstring generation
prompt = f"<START_CODE> {code_input} <END_CODE> <START_DOCSTRING>"

input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate the docstring, stopping at the closing docstring token
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.convert_tokens_to_ids("<END_DOCSTRING>"),
)

# Decode, then extract the text between the docstring markers
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
docstring = generated_text.split("<START_DOCSTRING>")[1].split("<END_DOCSTRING>")[0].strip()

print(f"Generated Docstring:\n{docstring}")
# Expected output: Calculates the L2 (Euclidean) distance between two numerical vectors.
# :param vector_a: A list or numpy array. :param vector_b: A list or numpy array. :return: The L2 distance (float).
```