# CodeBERTa-Python-DocGen

## Overview

`CodeBERTa-Python-DocGen` is a RoBERTa-based model fine-tuned for code-related tasks, specifically **Code-to-Text Generation** (docstring synthesis) and **Code-Text Retrieval** (finding relevant code given a natural language query). The model is pre-trained on a large corpus of Python code from public repositories, with a focus on pairing function bodies with high-quality docstrings and descriptive comments.

It excels at understanding the semantic relationship between a Python function's implementation details and its natural language documentation.
## Model Architecture

* **Base Model:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
* **Pre-training:** Masked Language Modeling (MLM) on Python source code.
* **Fine-tuning Tasks:** Two-fold:
    1. **Generation:** Conditional text generation where the input is the function body and the target is the docstring.
    2. **Retrieval:** Learning cross-modal embeddings between the function body and the docstring with a contrastive loss (see the sketch after this list).
* **Tokenization:** Byte-Pair Encoding (BPE) optimized for code syntax, including the special tokens `<START_CODE>`, `<END_CODE>`, `<START_DOCSTRING>`, and `<END_DOCSTRING>`.
* **Max Sequence Length:** 512 tokens.
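
The retrieval objective is described only as a contrastive loss over cross-modal embeddings; the exact formulation is not documented here. The snippet below is a minimal sketch of one common in-batch (InfoNCE-style) variant, where the `temperature` value, the symmetric loss, and the assumption of pre-pooled embeddings are illustrative choices rather than the model's actual training recipe.

```python
# Minimal sketch of an in-batch contrastive (InfoNCE-style) objective for
# code-docstring retrieval. Temperature, pooling, and the symmetric loss are
# illustrative assumptions, not the model's documented training setup.
import torch
import torch.nn.functional as F

def contrastive_loss(code_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """code_emb, doc_emb: (batch, dim) pooled embeddings of paired code/docstring examples."""
    code_emb = F.normalize(code_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Similarity matrix: diagonal entries are the true (code, docstring) pairs.
    logits = code_emb @ doc_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match code -> docstring and docstring -> code.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```
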
## Intended Use

* **Automated Docstring Generation:** Creating initial or full documentation summaries for new or existing Python functions.
* **Code Search Engine:** Ranking and retrieving the most relevant Python function body from a database given a user's natural language search query (e.g., "function to calculate L2 distance").
* **Code Comment Completion:** Suggesting descriptive inline comments within a function body.
* **Code-Text Similarity:** Measuring the semantic similarity between arbitrary code snippets and their descriptive summaries (see the embedding sketch after this list).
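
For the similarity and search use cases, one plausible approach is to load the base encoder, mean-pool the token embeddings, and score code-query pairs by cosine similarity. This is a sketch only: the model id is taken from the generation example below, and the use of `AutoModel` with a mean-pooling readout is an assumption, not documented behaviour of this checkpoint.

```python
# Illustrative code-text similarity scoring with the base encoder.
# Mean pooling over non-padding tokens is an assumed readout strategy.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_name = "Code/CodeBERTa-Python-DocGen"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)         # mean-pooled embedding, (1, dim)

code = "def l2(a, b):\n    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5"
query = "function to calculate L2 distance"
score = F.cosine_similarity(embed(code), embed(query)).item()
print(f"similarity: {score:.3f}")
```

Ranking a database of functions for the code-search use case amounts to running the same scoring over every candidate and sorting by score.
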
## Limitations

* **Hallucination in Docstrings:** While generally coherent, generated docstrings may sometimes misrepresent the actual logic of complex or subtle code due to the generative nature of the model.
* **Library Scope:** The model performs best on code using common scientific and data science libraries present in the training data (e.g., `numpy`, `pandas`, `sklearn`). Performance can be lower for highly specialized, domain-specific libraries.
* **Complexity:** Docstring quality degrades rapidly for functions exceeding 100 lines of code or with very high cyclomatic complexity.
## Example Code (PyTorch - Text Generation)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Code/CodeBERTa-Python-DocGen"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input: a Python function body
code_input = """
def calculate_l2_norm(vector_a, vector_b):
    diff = np.array(vector_a) - np.array(vector_b)
    return np.sqrt(np.sum(diff ** 2))
"""

# Prepare the prompt for docstring generation using the model's special tokens
prompt = f"<START_CODE> {code_input} <END_CODE> <START_DOCSTRING>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate the docstring
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,          # generation budget, not counting the prompt
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
    # Look up the special token id directly; tokenizer.encode() may prepend a
    # BOS token, so encode(...)[0] would return the wrong id.
    eos_token_id=tokenizer.convert_tokens_to_ids("<END_DOCSTRING>"),
)

# Decode and extract the text between the docstring markers
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
docstring = generated_text.split("<START_DOCSTRING>")[1].split("<END_DOCSTRING>")[0].strip()
print(f"Generated Docstring:\n{docstring}")

# Expected output:
# Calculates the L2 (Euclidean) distance between two numerical vectors.
# :param vector_a: A list or numpy array.
# :param vector_b: A list or numpy array.
# :return: The L2 distance (float).
```