kaushik-harsh-99 committed on
Commit bc56b51 · verified · 1 Parent(s): 95f644c

Create README.md

Files changed (1)
  1. README.md +380 -0
README.md ADDED
@@ -0,0 +1,380 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: mit
5
+ library_name: scikit-learn
6
+ tags:
7
+ - code-classification
8
+ - programming-language-detection
9
+ - source-code
10
+ - machine-learning
11
+ - fasttext
12
+ - modernbert
13
+ - classification
14
+ - nlp
15
+ - code-analysis
16
+ - software-engineering
17
+ pipeline_tag: text-classification
18
+ metrics:
19
+ - accuracy
20
+ - precision
21
+ - recall
22
+ - f1
23
+ model-index:
24
+ - name: SGD Logistic Regression
25
+ results:
26
+ - task:
27
+ type: text-classification
28
+ name: Programming Language Classification
29
+ dataset:
30
+ type: custom
31
+ name: Code Language Classification Dataset
32
+ metrics:
33
+ - type: accuracy
34
+ value: 91.1
35
+ name: Test Accuracy
36
+ - name: FastText
37
+ results:
38
+ - task:
39
+ type: text-classification
40
+ name: Programming Language Classification
41
+ dataset:
42
+ type: custom
43
+ name: Code Language Classification Dataset
44
+ metrics:
45
+ - type: accuracy
46
+ value: 95.5
47
+ name: Test Accuracy
48
+ datasets:
49
+ - kaushik-harsh-99/Code-Language-Classification
50
+ base_model:
51
+ - answerdotai/ModernBERT-base
52
+ ---
53
+ # Experiment Timeline
54
+
55
+ This project systematically compares approaches to programming language classification, from traditional machine learning methods to modern transformer architectures.
56
+
57
+ Rather than immediately training a large neural network, the project follows a progressive benchmarking strategy. Each model serves as a baseline for the next stage, allowing direct comparison of accuracy, model size, training cost, inference speed, and deployment complexity.
58
+
59
+ The experiments are designed to answer several questions:
60
+
61
+ - How far can classical machine learning be pushed on source code classification?
62
+ - How much improvement does FastText provide over linear models?
63
+ - How much additional performance can transformer architectures achieve?
64
+ - What is the optimal trade-off between accuracy and model size?
65
+ - Can large transformer models later be distilled into smaller deployable models?
66
+
67
+ ---
68
+
69
+ # Phase 1 — SGD Logistic Regression Baseline
70
+
71
+ ## Motivation
72
+
73
+ The first goal was to establish a strong classical machine learning baseline.
74
+
75
+ Programming languages contain many distinctive lexical and syntactic patterns:
76
+
77
+ ```text
78
+ #include
79
+ public class
80
+ def
81
+ fn
82
+ let
83
+ import
84
+ ```
85
+
86
+ Character n-gram models are known to perform surprisingly well for language identification tasks because they capture these patterns directly without requiring deep semantic understanding.
87
+
88
+ Because of this, a linear classifier using hashed character n-gram features was selected as the initial benchmark.
89
+
90
+ ---
91
+
92
+ ## Architecture
93
+
94
+ ### Feature Extraction
95
+
96
+ - HashingVectorizer
97
+ - Character-level features
98
+ - Character n-grams: `(2, 6)`
99
+ - 131,072 hashed dimensions
100
+ - No vocabulary storage
101
+ - Constant-memory feature extraction
102
+
103
+ ### Classifier
104
+
105
+ - SGDClassifier
106
+ - Logistic Regression objective (`log_loss`)
107
+ - Incremental training using `partial_fit`
108
+ - Streaming JSONL training pipeline
109
+
110
+ ---
111
+
112
+ ## Training Strategy
113
+
114
+ The entire dataset was streamed from disk in batches.
115
+
116
+ Benefits:
117
+
118
+ - Constant RAM usage
119
+ - Scalable to millions of samples
120
+ - No need to load the entire dataset into memory
121
+ - Fast experimentation
122
+
123
+ The classifier was trained for multiple epochs while evaluating both validation and test performance after every epoch.
124
+
125
+ ---
126
+
127
+ ## Results
128
+
129
+ ### Test Accuracy
130
+
131
+ **~91.1%**
132
+
133
+ ---
134
+
135
+ ## Observations
136
+
137
+ At ~91.1% test accuracy, the model performed better than expected for such a simple architecture.
138
+
139
+ ### Strengths
140
+
141
+ - Extremely fast training
142
+ - Fast inference
143
+ - Simple implementation
144
+ - Excellent scalability
145
+
146
+ ### Weaknesses
147
+
148
+ - Difficulty separating structurally similar languages
149
+ - Limited contextual understanding
150
+ - Large parameter matrix (131,072 weights per class)
151
+ - Performance ceiling reached relatively quickly
152
+
153
+ ### Common Confusion Pairs
154
+
155
+ - C ↔ C++
156
+ - JavaScript ↔ TypeScript
157
+ - HTML ↔ Markdown
158
+
159
+ ---
160
+
161
+ # Phase 2 — FastText
162
+
163
+ ## Motivation
164
+
165
+ After establishing the linear baseline, the next objective was to evaluate FastText.
166
+
167
+ FastText occupies an interesting position between classical machine learning and neural networks.
168
+
169
+ It introduces:
170
+
171
+ - Learned embeddings
172
+ - Character-level subword information
173
+ - Efficient training
174
+ - Low inference latency
175
+
176
+ while remaining dramatically smaller and faster than transformer models.
177
+
178
+ ---
179
+
180
+ ## Data Preparation
181
+
182
+ FastText's supervised mode expects one sample per line, with each label prefixed by `__label__`:
183
+
184
+ ```text
185
+ __label__Python print("hello")
186
+ ```
187
+
188
+ A dedicated conversion pipeline was created to transform JSONL datasets into FastText format.
189
+
190
+ ### Preventing Label Leakage
191
+
192
+ During preprocessing, special care was taken to prevent accidental label leakage.
193
+
194
+ Source code occasionally contained the token:
195
+
196
+ ```text
197
+ __label__
198
+ ```
199
+
200
+ which FastText interprets as a training label wherever it appears in a line.
201
+
202
+ To prevent this issue:
203
+
204
+ ```text
205
+ __label__ → __lbl__
206
+ ```
207
+
208
+ was applied during dataset conversion.
209
+
210
+ This eliminated spurious classes and ensured correct training.
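
The conversion step could look roughly like the sketch below. The function name is hypothetical. Collapsing whitespace so that each sample fits on a single line is an assumption about the pipeline. The `__label__` to `__lbl__` replacement is the step described above.

```python
def to_fasttext_line(text, label):
    """Format one sample as a single line of fastText supervised input."""
    # Neutralise the label prefix inside the code, as described above.
    safe = text.replace("__label__", "__lbl__")
    # Assumption: whitespace (including newlines) is collapsed to single
    # spaces so that each sample occupies exactly one line.
    safe = " ".join(safe.split())
    return f"__label__{label} {safe}"


line = to_fasttext_line('x = "__label__Java"\nprint(x)', "Python")
```

After conversion, the only `__label__` prefix left on the line is the real one at the start, so no spurious class can be created from the code itself.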
211
+
212
+ ---
213
+
214
+ ## Architecture
215
+
216
+ ### Configuration
217
+
218
+ ```text
219
+ dim = 50
220
+ wordNgrams = 3
221
+ minn = 2
222
+ maxn = 5
223
+ minCount = 100
224
+ bucket = 50000
225
+ loss = softmax
226
+ epoch = 25
227
+ lr = 0.7
228
+ ```
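
As a rough sketch, the configuration above maps onto the `fasttext` Python package as follows. The `train` wrapper and the idea of passing a file path are assumptions. The import is deferred so that the settings can be inspected without the package installed.

```python
# Values copied from the configuration above; "lr" is the fastText
# argument name for the learning rate.
FASTTEXT_PARAMS = {
    "dim": 50,
    "wordNgrams": 3,
    "minn": 2,
    "maxn": 5,
    "minCount": 100,
    "bucket": 50_000,
    "loss": "softmax",
    "epoch": 25,
    "lr": 0.7,
}


def train(train_path):
    """Train a supervised fastText model on a file in __label__ format."""
    import fasttext  # imported lazily; requires the fasttext package

    return fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)
```

Calling `train` with the path to a converted training file would return a model object whose `predict` method accepts a single line of text.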
229
+
230
+ ---
231
+
232
+ ## Hyperparameter Exploration
233
+
234
+ Experiments varied the following settings:
235
+
236
+ - Embedding dimension
237
+ - Character subword lengths
238
+ - Vocabulary size
239
+ - Bucket size
240
+ - Epoch count
241
+ - Learning rate
242
+ - Model size reduction
243
+
244
+ The goal was not merely to maximize accuracy, but also to produce a compact deployable model.
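
To see why bucket size and embedding dimension dominate model size, consider a back-of-the-envelope estimate. An unquantized fastText model stores one 32-bit float vector per vocabulary word and per hash bucket. In the sketch below, `bucket` and `dim` come from the configuration above, while the vocabulary size is a hypothetical placeholder because it is not reported here.

```python
def fasttext_input_matrix_bytes(vocab_size, bucket, dim, bytes_per_value=4):
    """Approximate size of the unquantized input embedding matrix."""
    return (vocab_size + bucket) * dim * bytes_per_value


# bucket and dim come from the configuration above; the vocabulary
# size is a hypothetical placeholder, since it is not reported here.
size = fasttext_input_matrix_bytes(vocab_size=20_000, bucket=50_000, dim=50)

# For comparison: the same vocabulary with fastText's default bucket count.
default_size = fasttext_input_matrix_bytes(
    vocab_size=20_000, bucket=2_000_000, dim=50
)
```

Under these assumptions the input matrix is about 14 MB, compared with about 404 MB for the default of 2,000,000 buckets.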
245
+
246
+ ---
247
+
248
+ ## Results
249
+
250
+ ### Test Accuracy
251
+
252
+ **~95.5%**
253
+
254
+ ### Improvement Over SGD
255
+
256
+ **+4.4 percentage points**
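
The same gain can also be read as a drop in error rate. A quick calculation using the approximate accuracies above:

```python
sgd_acc, fasttext_acc = 91.1, 95.5  # approximate test accuracies from above

absolute_gain = round(fasttext_acc - sgd_acc, 1)  # percentage points
sgd_err = 100 - sgd_acc       # about 8.9% of test samples misclassified
fasttext_err = 100 - fasttext_acc  # about 4.5% misclassified
relative_reduction = (sgd_err - fasttext_err) / sgd_err
```

Since both accuracies are approximate, the result is only indicative: FastText removes roughly half of the errors made by the SGD baseline.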
257
+
258
+ ---
259
+
260
+ ## Observations
261
+
262
+ FastText substantially outperformed the linear baseline.
263
+
264
+ ### Key Findings
265
+
266
+ - Character subwords are extremely powerful for source code.
267
+ - Many language-specific keywords are captured effectively.
268
+ - FastText dramatically reduced confusion between related languages.
269
+ - Training remained relatively fast despite the dataset scale.
270
+
271
+ FastText proved to be one of the strongest accuracy-to-compute trade-offs observed during the project.
272
+
273
+ ---
274
+
275
+ # Phase 3 — ModernBERT
276
+
277
+ ## Motivation
278
+
279
+ While FastText achieved strong results, it still relies primarily on local token and character patterns.
280
+
281
+ Modern transformer architectures can model:
282
+
283
+ - Long-range dependencies
284
+ - Structural relationships
285
+ - Contextual representations
286
+ - Semantic information
287
+
288
+ The next phase aims to determine the maximum achievable accuracy on the dataset.
289
+
290
+ ---
291
+
292
+ ## Architecture
293
+
294
+ ### Model
295
+
296
+ - ModernBERT-base
297
+
298
+ ### Task
299
+
300
+ - Sequence Classification
301
+
302
+ ### Training Features
303
+
304
+ - Mixed Precision Training
305
+ - Gradient Checkpointing
306
+ - Dynamic Padding
307
+ - Large Effective Batch Size
308
+ - Validation Tracking Throughout Training
309
+ - Automatic Best Checkpoint Selection
310
+
311
+ ---
312
+
313
+ ## Current Status
314
+
315
+ **Training In Progress**
316
+
317
+ The dataset contains approximately:
318
+
319
+ ```text
320
+ 1.6 million training samples
321
+ ```
322
+
323
+ Validation metrics are evaluated multiple times per epoch, and checkpoints are saved throughout training to enable detailed learning-curve analysis.
324
+
325
+ ---
326
+
327
+ ## Objectives
328
+
329
+ The ModernBERT experiments aim to answer:
330
+
331
+ 1. What is the maximum achievable accuracy on this dataset?
332
+ 2. Which language pairs remain difficult after FastText?
333
+ 3. How much improvement does contextual modeling provide?
334
+ 4. Is the improvement sufficient to justify the additional compute cost?
335
+
336
+ ---
337
+
338
+ # Planned Future Work
339
+
340
+ ## Knowledge Distillation
341
+
342
+ After training the ModernBERT teacher model:
343
+
344
+ ```text
345
+ ModernBERT Teacher
346
+ ↓
347
+ Student Model
348
+ ```
349
+
350
+ The goal is to transfer knowledge from the transformer into smaller models.
351
+
352
+ ### Potential Student Architectures
353
+
354
+ - Distilled ModernBERT variants
355
+ - Compact transformer models
356
+ - FastText students
357
+ - Lightweight deployment models
358
+
359
+ ---
360
+
361
+
362
+ # Current Benchmark Summary
363
+
364
+ | Model | Accuracy |
365
+ |---------|---------:|
366
+ | SGD Logistic Regression | ~91.1% |
367
+ | FastText | ~95.5% |
368
+ | ModernBERT-base | Training |
369
+
370
+ ---
371
+
372
+ # Key Takeaways So Far
373
+
374
+ - Character n-gram features provide a surprisingly strong baseline for programming language classification.
375
+ - FastText delivers a substantial performance improvement while maintaining practical training and inference costs.
376
+ - Careful preprocessing is critical, particularly when using FastText label prefixes.
377
+ - Source code classification benefits heavily from character-level information.
378
+ - Larger neural models should be evaluated not only on accuracy but also on deployment cost, memory footprint, and inference speed.
379
+
380
+ The project continues to evolve toward a high-accuracy, deployment-friendly code language classifier capable of operating efficiently at large scale.