outlander23 committed on
Commit 4a706e0 · verified · 1 Parent(s): 8d90a1e

Update README.md

Files changed (1)
  1. README.md +159 -25

README.md CHANGED
@@ -1,41 +1,175 @@
  ---
- license: apache-2.0
- language:
- - cpp
- metrics:
- - bleu
- library_name: transformers
- pipeline_tag: text-generation
- tags:
- - code-generation
- - code-completion
- - competitive-programming
  ---

- # CodeLanderAI Model

- This model is a fine-tuned version of `CodeT5`, designed for code completion in competitive programming. It was trained on a custom dataset of 12 million code samples derived from 2 million source-code files.

- ## Intended Use

- The model generates code completions from the context provided by the user. It supports only C++, the language most commonly used in competitive programming.

- ### Languages Supported
- - C++

- ### Metrics

- The model was evaluated using the following metrics:

- - **BLEU Score:** measures the quality of generated code against reference code.
- - **CodeBLEU:** a code-generation metric that also accounts for syntax and structure.
- - **Accuracy:** how often the model produces the correct completion.
- - **Perplexity:** how well the model predicts the next token in a sequence.

- ### Datasets

- The model was fine-tuned on a custom dataset of code samples from competitive programming platforms.
+ # 🚀 Codelander
+
+ ---
+
+ ## 📖 Overview
+
+ This specialized **CodeT5** model has been fine-tuned for **C++ code completion** tasks.
+ It excels at understanding **C++ syntax** and **common programming patterns** to provide intelligent code suggestions as you type.
+
+ ---
+
+ ## ✨ Key Features
+
+ - 🔹 Context-aware completions for C++ functions, classes, and control structures
+ - 🔹 Handles complex C++ syntax, including **templates, the STL, and modern C++ features**
+ - 🔹 Trained on **competitive programming solutions** from high-quality Codeforces submissions
+ - 🔹 Low latency, suitable for **real-time editor integration**
+
+ ---
+
+ ## 📊 Model Performance
+
+ | Metric             | Value  |
+ |--------------------|--------|
+ | Training Loss      | 1.2475 |
+ | Validation Loss    | 1.0016 |
+ | Training Epochs    | 3      |
+ | Training Steps     | 14010  |
+ | Samples per second | 6.275  |
+
  ---
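If the reported losses are mean token-level cross-entropy (an assumption, since the README does not state the loss type), a rough validation perplexity can be read off as exp(loss):

```python
import math

# Validation loss from the table above (assumed mean token cross-entropy)
validation_loss = 1.0016

# Perplexity is the exponential of cross-entropy
perplexity = math.exp(validation_loss)

print(round(perplexity, 3))  # ≈ 2.723
```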
+ ## ⚙️ Installation & Usage
+
+ ### 🔧 Direct Integration with HuggingFace Transformers
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ # Load model and tokenizer
+ model = AutoModelForSeq2SeqLM.from_pretrained("outlander23/codelander")
+ tokenizer = AutoTokenizer.from_pretrained("outlander23/codelander")
+
+ # Generate a completion for a C++ code prefix
+ def get_completion(code_prefix, max_new_tokens=100):
+     inputs = tokenizer(f"complete C++ code: {code_prefix}", return_tensors="pt")
+     outputs = model.generate(
+         inputs.input_ids,
+         max_new_tokens=max_new_tokens,
+         temperature=0.7,
+         top_p=0.9,
+         do_sample=True,
+     )
+     return tokenizer.decode(outputs[0], skip_special_tokens=True)
+ ```
+
  ---
+ ## 🏗️ Model Architecture
+
+ - Base Model: **Salesforce/codet5-base**
+ - Parameters: **220M**
+ - Context Window: **512 tokens**
+ - Fine-tuning: **Seq2Seq training on C++ code snippets**
+ - Training Time: ~**5 hours**
+
+ ---
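Because of the 512-token context window, editor integrations typically keep only the tail of a long prefix. A minimal sketch of that windowing, using a plain list in place of real token ids (the helper name and constant are illustrative, not part of this model's API):

```python
CONTEXT_WINDOW = 512  # matches the context window stated above

def clip_to_context(token_ids, max_len=CONTEXT_WINDOW):
    """Keep only the most recent max_len tokens so the prefix fits the model."""
    return token_ids[-max_len:]

long_prefix = list(range(600))   # pretend these are 600 token ids
clipped = clip_to_context(long_prefix)
print(len(clipped), clipped[0])  # 512 88
```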
+ ## 📂 Training Data
+
+ - Dataset: **open-r1/codeforces-submissions**
+ - Selection: **Accepted C++ solutions only**
+ - Size: **50,000+ code samples**
+ - Processing: **Prefix-suffix pairs with random splits**
+
+ ---
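The "prefix-suffix pairs with random splits" step can be sketched roughly as follows; the split-point bounds, seed, and function name are illustrative assumptions, since the actual preprocessing script is not published here:

```python
import random

def make_prefix_suffix_pair(source_code, rng):
    """Split one solution at a random point into (prefix, target suffix)."""
    # Avoid degenerate splits at the very start or end of the file
    split_at = rng.randint(1, len(source_code) - 1)
    return source_code[:split_at], source_code[split_at:]

rng = random.Random(42)  # fixed seed for reproducibility (illustrative)
code = "int main() {\n    return 0;\n}\n"
prefix, suffix = make_prefix_suffix_pair(code, rng)
assert prefix + suffix == code  # the pair always reassembles the original
```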
+ ## ⚠️ Limitations
+
+ - ❌ May generate syntactically correct but semantically incorrect code
+ - ❌ Limited knowledge of **domain-specific libraries** not present in the training data
+ - ❌ May occasionally produce **incomplete code fragments**
+
+ ---
+
+ ## 💻 Example Completions
+
+ ### ✅ Example 1: Factorial Function
+
+ **Input:**
+ ```cpp
+ int factorial(int n) {
+     if (n <= 1) {
+         return 1;
+     } else {
+ ```
+
+ **Completion:**
+ ```cpp
+         return n * factorial(n - 1);
+     }
+ }
+ ```
+
+ ---
+
+ ### ✅ Example 2: Merge Sort
+
+ **Input:**
+ ```cpp
+ vector<int> merge_sort(vector<int>& arr) {
+     if (arr.size() <= 1) return arr;
+
+     int mid = arr.size() / 2;
+     vector<int> left(arr.begin(), arr.begin() + mid);
+     vector<int> right(arr.begin() + mid, arr.end());
+
+     left = merge_sort(left);
+     right = merge_sort(right);
+ ```
+
+ **Completion:**
+ ```cpp
+     vector<int> result;
+     int i = 0, j = 0;
+
+     while (i < left.size() && j < right.size()) {
+         if (left[i] < right[j]) {
+             result.push_back(left[i++]);
+         } else {
+             result.push_back(right[j++]);
+         }
+     }
+
+     while (i < left.size()) result.push_back(left[i++]);
+     while (j < right.size()) result.push_back(right[j++]);
+
+     return result;
+ }
+ ```
+
+ ---
+
+ ## 📈 Training Details
+
+ - Training completed on: **2025-08-28 12:51:09 UTC**
+ - Training epochs: **3/3**
+ - Total steps: **14010**
+ - Training loss: **1.2475**
+
+ ### 📊 Epoch Performance
+
+ | Epoch | Training Loss | Validation Loss |
+ |-------|---------------|-----------------|
+ | 1     | 1.2638        | 1.1004          |
+ | 2     | 1.1551        | 1.0250          |
+ | 3     | 1.1081        | 1.0016          |
+
+ ---
+
+ ## 🖥️ Compatibility
+
+ - ✅ Compatible with **Transformers 4.30.0+**
+ - ✅ Optimized for **Python 3.8+**
+ - ✅ Supports both **CPU and GPU inference**
+
+ ---
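For the CPU/GPU claim above, device selection follows the standard PyTorch pattern; this sketch is generic and not specific to this model:

```python
import torch

# Pick the best available device (standard PyTorch pattern)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Model and inputs are then moved with .to(device) before generate(), e.g.:
#   model = model.to(device)
#   inputs = {k: v.to(device) for k, v in inputs.items()}
```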
+ ## ❤️ Credits
+
+ Made with ❤️ by **outlander23**
+
+ > "Good code is its own best documentation." – *Steve McConnell*
+
+ ---