---
language:
- en
license: mit
library_name: scikit-learn
tags:
- code-classification
- programming-language-detection
- source-code
- machine-learning
- fasttext
- modernbert
- classification
- nlp
- code-analysis
- software-engineering
pipeline_tag: text-classification
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: SGD Logistic Regression
  results:
  - task:
      type: text-classification
      name: Programming Language Classification
    dataset:
      type: custom
      name: Code Language Classification Dataset
    metrics:
    - type: accuracy
      value: 91.1
      name: SGD Test Accuracy
- name: FastText
  results:
  - task:
      type: text-classification
      name: Programming Language Classification
    dataset:
      type: custom
      name: Code Language Classification Dataset
    metrics:
    - type: accuracy
      value: 95.5
      name: FastText Test Accuracy
datasets:
- kaushik-harsh-99/Code-Language-Classification
---
# Experiment Timeline

The primary objective of this project is to systematically explore different approaches to programming language classification, ranging from traditional machine learning methods to modern transformer architectures.

Rather than immediately training a large neural network, the project follows a progressive benchmarking strategy. Each model serves as a baseline for the next stage, allowing direct comparison of accuracy, model size, training cost, inference speed, and deployment complexity.

The experiments are designed to answer several questions:

- How far can classical machine learning be pushed on source code classification?
- How much improvement does FastText provide over linear models?
- How much additional performance can transformer architectures achieve?
- What is the optimal trade-off between accuracy and model size?
- Can large transformer models later be distilled into smaller deployable models?

---

# Phase 1: SGD Logistic Regression Baseline

## Motivation

The first goal was to establish a strong classical machine learning baseline.

Programming languages contain many distinctive lexical and syntactic patterns:

```text
#include
public class
def
fn
let
import
```

Character n-gram models are known to perform surprisingly well for language identification tasks because they capture these patterns directly without requiring deep semantic understanding.

Because of this, a linear classifier using hashed character n-gram features was selected as the initial benchmark.

---

## Architecture

### Feature Extraction

- HashingVectorizer
- Character-level features
- Character n-grams: `(2, 6)`
- 131,072 hashed dimensions
- No vocabulary storage
- Constant-memory feature extraction

### Classifier

- SGDClassifier
- Logistic Regression objective (`log_loss`)
- Incremental training using `partial_fit`
- Streaming JSONL training pipeline

---

## Training Strategy

The entire dataset was streamed from disk in batches.

Benefits:

- Constant RAM usage
- Scalable to millions of samples
- No need to load the entire dataset into memory
- Fast experimentation

The classifier was trained for multiple epochs while evaluating both validation and test performance after every epoch.
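
The batching step can be sketched as a small generator. This is an illustrative outline only: the field names `code` and `language` and the batch size are hypothetical, and an in-memory stream stands in for the JSONL file on disk.

```python
import io
import json
from typing import Iterator, List, TextIO, Tuple


def iter_batches(handle: TextIO, batch_size: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (texts, labels) batches from a JSONL stream without loading it all."""
    texts, labels = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        texts.append(record["code"])       # hypothetical field name
        labels.append(record["language"])  # hypothetical field name
        if len(texts) == batch_size:
            yield texts, labels
            texts, labels = [], []
    if texts:
        yield texts, labels  # final partial batch


# In-memory stand-in for a JSONL file on disk.
sample = io.StringIO(
    '{"code": "print(1)", "language": "Python"}\n'
    '{"code": "int x;", "language": "C"}\n'
    '{"code": "let x = 1;", "language": "JavaScript"}\n'
)
batches = list(iter_batches(sample, batch_size=2))
```

Each yielded batch can be vectorized and passed to `partial_fit`. For multi-epoch training, the file would be reopened at the start of every epoch so that only one batch is held in memory at a time.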

---

## Results

### Test Accuracy

**~91.1%**

---

## Observations

The model performed significantly better than expected for such a simple architecture.

### Strengths

- Extremely fast training
- Fast inference
- Simple implementation
- Excellent scalability

### Weaknesses

- Difficulty separating structurally similar languages
- Limited contextual understanding
- Large sparse parameter matrix
- Performance ceiling reached relatively quickly

### Common Confusion Pairs

- C ↔ C++
- JavaScript ↔ TypeScript
- HTML ↔ Markdown
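
One way to surface pairs like these is to count misclassifications per unordered label pair. The sketch below assumes true and predicted labels are available as two parallel lists; the function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_confusions(
    y_true: Iterable[str], y_pred: Iterable[str], k: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Count errors per unordered label pair and return the k most frequent."""
    counts = Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth != guess:
            # Sort so that (C, C++) and (C++, C) fall into the same bucket.
            counts[tuple(sorted((truth, guess)))] += 1
    return counts.most_common(k)


# Hypothetical predictions, for illustration only.
y_true = ["C", "C++", "C", "JavaScript", "TypeScript", "HTML"]
y_pred = ["C++", "C", "C", "TypeScript", "TypeScript", "HTML"]
pairs = top_confusions(y_true, y_pred)
```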

---

# Phase 2: FastText

## Motivation

After establishing the linear baseline, the next objective was to evaluate FastText.

FastText occupies an interesting position between classical machine learning and neural networks.

It introduces:

- Learned embeddings
- Character-level subword information
- Efficient training
- Low inference latency

while remaining dramatically smaller and faster than transformer models.

---

## Data Preparation

FastText requires a custom supervised text format:

```text
__label__Python print("hello")
```

A dedicated conversion pipeline was created to transform JSONL datasets into FastText format.

### Preventing Label Leakage

During preprocessing, special care was taken to prevent accidental label leakage.

Source code occasionally contained the token:

```text
__label__
```

which FastText treats as its label prefix: any token that starts with it is read as a training label.

To prevent this, the following substitution was applied to the source text during dataset conversion:

```text
__label__ → __lbl__
```

This eliminated spurious classes and ensured correct training.
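
The conversion and the sanitization step can be sketched together as follows. The field names `code` and `language`, the whitespace collapsing, and the handling of multi-word label names are assumptions; the original pipeline may differ in these details.

```python
import re

LABEL_PREFIX = "__label__"
SAFE_TOKEN = "__lbl__"


def to_fasttext_line(record: dict) -> str:
    """Convert one JSONL record into a single-line fastText training example."""
    # Neutralise the label prefix inside the code so it cannot be read as a label.
    code = record["code"].replace(LABEL_PREFIX, SAFE_TOKEN)  # hypothetical field name
    # fastText reads one example per line, so collapse newlines and whitespace runs
    # (an assumption about how multi-line source files were flattened).
    code = re.sub(r"\s+", " ", code).strip()
    # Labels must be single tokens; replacing spaces is an assumption.
    label = record["language"].replace(" ", "_")  # hypothetical field name
    return f"{LABEL_PREFIX}{label} {code}"


record = {"code": 'x = "__label__Java"\nprint(x)', "language": "Python"}
line = to_fasttext_line(record)
```

Applying the substitution before the real label is prepended ensures that the only `__label__` token left on each line is the intended one.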

---

## Architecture

### Configuration

```text
dim = 50
wordNgrams = 3
minn = 2
maxn = 5
minCount = 100
bucket = 50000
loss = softmax
epoch = 25
lr = 0.7
```
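
As a rough sketch, the configuration above maps onto the `fasttext` Python package as shown below, where the learning rate is passed as `lr`. The `train` helper and its file paths are hypothetical, and the import is deferred so the parameter dictionary can be inspected without the package installed.

```python
# Keyword arguments mirroring the configuration above, using the names
# accepted by fasttext.train_supervised.
FASTTEXT_PARAMS = {
    "dim": 50,
    "wordNgrams": 3,
    "minn": 2,
    "maxn": 5,
    "minCount": 100,
    "bucket": 50000,
    "loss": "softmax",
    "epoch": 25,
    "lr": 0.7,
}


def train(train_path: str, model_path: str):
    """Train a supervised fastText model and save it (paths are hypothetical)."""
    import fasttext  # imported lazily; requires the `fasttext` package

    model = fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)
    model.save_model(model_path)
    return model
```

A call such as `train("train.txt", "model.bin")` would then produce a model whose `test` method reports the sample count together with precision and recall at 1 on a held-out file in the same format.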

---

## Hyperparameter Exploration

A significant amount of experimentation was performed around:

- Embedding dimension
- Character subword lengths
- Vocabulary size
- Bucket size
- Epoch count
- Learning rate
- Model size reduction

The goal was not merely to maximize accuracy, but also to produce a compact deployable model.

---

## Results

### Test Accuracy

**~95.5%**

### Improvement Over SGD

**+4.4 percentage points**

---

## Observations

FastText substantially outperformed the linear baseline.

### Key Findings

- Character subwords are extremely powerful for source code.
- Many language-specific keywords are captured effectively.
- FastText dramatically reduced confusion between related languages.
- Training remained relatively fast despite the dataset scale.

FastText proved to be one of the strongest accuracy-to-compute trade-offs observed during the project.
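
To make the character-subword finding concrete, the sketch below enumerates the n-grams that a single token contributes with `minn = 2` and `maxn = 5`. It only lists the subwords; FastText additionally hashes them into the configured number of buckets, which is not reproduced here, and the function name is hypothetical.

```python
from typing import List


def char_subwords(token: str, minn: int = 2, maxn: int = 5) -> List[str]:
    """List the character n-grams of a token wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


subwords = char_subwords("def")
```

For the keyword `def`, the list includes boundary-aware pieces such as `<de` and `ef>`, which lets the model distinguish a standalone keyword from the same letters inside a longer identifier.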

---

# Phase 3: ModernBERT

## Motivation

After achieving strong results with FastText, the next stage of the project explored whether transformer architectures could further improve programming language classification performance.

Unlike FastText, transformer models can learn:

- Long-range dependencies
- Global context
- Structural relationships
- Context-aware representations

The goal was to determine whether additional model capacity translates into meaningful real-world gains for source code language identification.

---

## Architecture

### Model

- ModernBERT-base

### Task

- Sequence Classification
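
A sequence-classification setup of this kind could be sketched with the Hugging Face `transformers` library as below. The checkpoint id `answerdotai/ModernBERT-base`, the helper names, and the example label list are assumptions; the original training code is not shown in this document. The heavy import is deferred so the label mapping can be built without the library installed.

```python
from typing import Dict, List, Tuple


def build_label_maps(languages: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create the label <-> id mappings a sequence-classification head expects."""
    ordered = sorted(set(languages))
    label2id = {name: idx for idx, name in enumerate(ordered)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def load_classifier(languages: List[str], checkpoint: str = "answerdotai/ModernBERT-base"):
    """Load a tokenizer and a ModernBERT model with a fresh classification head."""
    # Imported lazily: requires a `transformers` release with ModernBERT support.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(languages)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
    )
    return tokenizer, model


# Hypothetical label list, for illustration only.
label2id, id2label = build_label_maps(["Python", "C", "Rust", "C"])
```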

---

## Results

### Approximate Test Accuracy

**~97–98%**

### Improvement Over FastText

**~1.5–2.5 percentage points**

---

## Observations

ModernBERT achieved the highest overall accuracy among all models tested.

However, experimentation revealed that the improvement over FastText was relatively small considering the large increase in computational requirements.

Compared with FastText:

- Training time increased dramatically
- GPU memory usage increased significantly
- Inference became substantially slower
- Model size increased considerably
- Deployment became more complex

Although ModernBERT achieved higher accuracy, the gain remained limited relative to the increase in compute.

---

## Key Finding

For programming language classification specifically:

> Transformer-based neural networks do not appear to be the most efficient solution for this task.

Programming languages contain strong lexical and structural signals that can already be captured extremely effectively using lightweight approaches.

FastText achieved performance surprisingly close to ModernBERT while requiring only a fraction of:

- Compute
- Training time
- Memory
- Storage
- Inference cost

---

# Current Benchmark Summary

| Model | Test Accuracy | Relative Compute |
|--------|--------------:|-----------------:|
| SGD Logistic Regression | ~91.1% | Very Low |
| FastText | ~95.5% | Low |
| ModernBERT-base | ~97–98% | Extremely High |
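
The step-to-step gains implied by the table can be checked with a few lines of arithmetic. The sketch below treats the ModernBERT figure as a range from 97.0 to 98.0 and the other two as point values; the variable and function names are illustrative.

```python
# Reported test accuracies in percent; ModernBERT is given as a (low, high) range.
accuracy = {
    "SGD Logistic Regression": (91.1, 91.1),
    "FastText": (95.5, 95.5),
    "ModernBERT-base": (97.0, 98.0),
}


def gain(previous: str, current: str) -> tuple:
    """Return the (min, max) accuracy gain in percentage points between two models."""
    prev_low, prev_high = accuracy[previous]
    cur_low, cur_high = accuracy[current]
    return round(cur_low - prev_high, 1), round(cur_high - prev_low, 1)


fasttext_gain = gain("SGD Logistic Regression", "FastText")
modernbert_gain = gain("FastText", "ModernBERT-base")
```

With the figures in the table, this gives 4.4 points for FastText over the SGD baseline and between 1.5 and 2.5 points for ModernBERT over FastText.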

---

# Current Conclusions

## 1. Classical machine learning remains surprisingly competitive

Character-level linear models establish a strong baseline even at large scale.

---

## 2. FastText provides the strongest accuracy-to-compute ratio

Current experiments indicate FastText delivers the best balance of:

- Accuracy
- Training speed
- Inference speed
- Memory efficiency
- Deployment simplicity

while remaining within only a few percentage points of transformer performance.

---