Tasfiya025 committed · verified · Commit e4dbdd4 · 1 parent: d63af7d

Create README.md

# CodeBERTa-Python-DocGen

## Overview
`CodeBERTa-Python-DocGen` is a RoBERTa-based model fine-tuned for code-related tasks, specifically **Code-to-Text Generation** (docstring synthesis) and **Code-Text Retrieval** (finding relevant code given a natural-language query). The model is pre-trained on a large corpus of Python code from public repositories, with a focus on pairing function bodies with high-quality docstrings and descriptive comments.

It excels at understanding the semantic relationship between a Python function's implementation details and its natural-language documentation.

## Model Architecture
* **Base Model:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
* **Pre-training:** Masked Language Modeling (MLM) on Python source code.
* **Fine-tuning Tasks:**
  1. **Generation:** Conditional text generation where the input is the function body and the target is the docstring.
  2. **Retrieval:** Learning cross-modal embeddings between the function body and the docstring (using a contrastive loss).
* **Tokenization:** Byte-Pair Encoding (BPE) optimized for code syntax, including the special tokens `<START_CODE>`, `<END_CODE>`, `<START_DOCSTRING>`, and `<END_DOCSTRING>`.
* **Max Sequence Length:** 512 tokens.
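The card does not publish the exact pre-processing around these special tokens; the sketch below shows one plausible way to assemble the generation prompt and the full fine-tuning sequence. The helper names and the exact spacing are our assumptions, not part of the model's API:

```python
# Special tokens as listed in the architecture section above
START_CODE, END_CODE = "<START_CODE>", "<END_CODE>"
START_DOC, END_DOC = "<START_DOCSTRING>", "<END_DOCSTRING>"

def build_generation_prompt(code: str) -> str:
    """Wrap a function body in the special tokens, leaving the
    docstring segment open so the model completes it."""
    return f"{START_CODE} {code.strip()} {END_CODE} {START_DOC}"

def build_training_pair(code: str, docstring: str) -> str:
    """Full sequence as it would appear during fine-tuning:
    code segment followed by the closed docstring segment."""
    return f"{build_generation_prompt(code)} {docstring.strip()} {END_DOC}"

prompt = build_generation_prompt("def add(a, b):\n    return a + b")
print(prompt)
# <START_CODE> def add(a, b):
#     return a + b <END_CODE> <START_DOCSTRING>
```

At inference time only the open-ended prompt is fed to the model, which then completes the docstring segment up to `<END_DOCSTRING>`.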
## Intended Use
* **Automated Docstring Generation:** Creating initial or full documentation summaries for new or existing Python functions.
* **Code Search Engine:** Ranking and retrieving the most relevant Python function body from a database given a user's natural-language search query (e.g., "function to calculate L2 distance").
* **Code Comment Completion:** Suggesting descriptive inline comments within a function body.
* **Code-Text Similarity:** Measuring the semantic similarity between arbitrary code snippets and their descriptive summaries.
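For the code-search and similarity use cases, retrieval embeddings are typically compared by cosine similarity. A minimal, dependency-free sketch with toy 3-d vectors standing in for the model's pooled embeddings (the `rank_snippets` helper and the vectors are illustrative, not the model's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_snippets(query_emb, snippet_embs):
    """Return snippet ids sorted by similarity to the query, best first."""
    scored = [(cosine(query_emb, emb), sid) for sid, emb in snippet_embs.items()]
    return [sid for _, sid in sorted(scored, reverse=True)]

# Toy embeddings: in practice these come from encoding the query and
# the candidate function bodies with the model.
query = [1.0, 0.0, 0.2]
snippets = {
    "l2_distance": [0.9, 0.1, 0.3],  # semantically close to the query
    "read_csv":    [0.0, 1.0, 0.0],  # unrelated
}
print(rank_snippets(query, snippets))  # → ['l2_distance', 'read_csv']
```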
## Limitations
* **Hallucination in Docstrings:** While generally coherent, generated docstrings may sometimes misrepresent the actual logic of complex or subtle code due to the generative nature of the model.
* **Library Scope:** The model performs best on code utilizing common scientific and data-science libraries present in the training data (e.g., `numpy`, `pandas`, `sklearn`). Performance can be lower for highly specialized, domain-specific libraries.
* **Complexity:** The quality of the docstring degrades rapidly for functions exceeding 100 lines of code or having very high cyclomatic complexity.
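Given the size limitation above, callers may want to gate inputs before spending a generation call on them. A trivial heuristic sketch (only the 100-line threshold comes from the card; the helper name and the blank-line handling are our choices):

```python
def within_supported_size(code: str, max_lines: int = 100) -> bool:
    """Return True if the function body is within the size range
    the model is reported to handle well (blank lines ignored)."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return len(lines) <= max_lines

print(within_supported_size("def f(x):\n    return x * 2"))           # → True
print(within_supported_size("\n".join("x = 1" for _ in range(150))))  # → False
```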
## Example Code (PyTorch - Text Generation)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Code/CodeBERTa-Python-DocGen"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Input: a Python function body
code_input = """
def calculate_l2_norm(vector_a, vector_b):
    diff = np.array(vector_a) - np.array(vector_b)
    return np.sqrt(np.sum(diff ** 2))
"""

# Prepare the prompt for docstring generation: wrap the code in the
# special tokens and leave the docstring segment open.
prompt = f"<START_CODE> {code_input} <END_CODE> <START_DOCSTRING>"

input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate the docstring. Look up the raw token id directly so the
# BOS/EOS tokens that `encode` prepends do not end generation early.
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=100,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        eos_token_id=tokenizer.convert_tokens_to_ids("<END_DOCSTRING>"),
    )

# Decode (keeping special tokens) and slice out the docstring segment
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
docstring = generated_text.split("<START_DOCSTRING>")[1].split("<END_DOCSTRING>")[0].strip()

print(f"Generated Docstring:\n{docstring}")
# Expected output: Calculates the L2 (Euclidean) distance between two numerical vectors.
# :param vector_a: A list or numpy array. :param vector_b: A list or numpy array. :return: The L2 distance (float).
```
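Once generated, the docstring still has to be spliced back into the source for the automated-documentation use case. A standard-library-only sketch (the `insert_docstring` helper is illustrative; it assumes a single function whose signature line ends in `:`):

```python
def insert_docstring(func_source: str, docstring: str) -> str:
    """Insert a generated one-line docstring after the `def` header,
    matching the indentation of the first body line."""
    lines = func_source.strip().splitlines()
    # First line ending in ':' is taken as the end of the signature
    header_end = next(i for i, ln in enumerate(lines) if ln.rstrip().endswith(":"))
    if header_end + 1 < len(lines):
        body = lines[header_end + 1]
        indent = body[: len(body) - len(body.lstrip())]
    else:
        indent = "    "  # fall back to 4 spaces for an empty body
    doc_line = f'{indent}"""{docstring}"""'
    return "\n".join(lines[: header_end + 1] + [doc_line] + lines[header_end + 1 :])

src = "def add(a, b):\n    return a + b"
print(insert_docstring(src, "Return the sum of a and b."))
# def add(a, b):
#     """Return the sum of a and b."""
#     return a + b
```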