Commit 12f31d6 (verified) by kaushik-harsh-99 · 1 Parent(s): cf1471f

Update README.md

Files changed (1)
  1. README.md +78 -81
README.md CHANGED
@@ -21,35 +21,32 @@ metrics:
  - recall
  - f1
  model-index:
- - name: SGD Logistic Regression
- results:
- - task:
- type: text-classification
- name: Programming Language Classification
- dataset:
- type: custom
- name: Code Language Classification Dataset
- metrics:
- - type: accuracy
- value: 91.1
- name: SGD Test Accuracy
-
- - name: FastText
- results:
- - task:
- type: text-classification
- name: Programming Language Classification
- dataset:
- type: custom
- name: Code Language Classification Dataset
- metrics:
- - type: accuracy
- value: 95.5
- name: FastText Test Accuracy
  datasets:
  - kaushik-harsh-99/Code-Language-Classification
- base_model:
- - answerdotai/ModernBERT-base
  ---
  # Experiment Timeline

@@ -277,16 +274,16 @@ FastText proved to be one of the strongest accuracy-to-compute trade-offs observ

  ## Motivation

- While FastText achieved strong results, it still relies primarily on local token and character patterns.

- Modern transformer architectures can model:

  - Long-range dependencies
  - Structural relationships
- - Contextual representations
- - Semantic information

- The next phase aims to determine the maximum achievable accuracy on the dataset.

  ---

@@ -300,82 +297,82 @@ The next phase aims to determine the maximum achievable accuracy on the dataset.

  - Sequence Classification

- ### Training Features

- - Mixed Precision Training
- - Gradient Checkpointing
- - Dynamic Padding
- - Large Effective Batch Size
- - Validation Tracking Throughout Training
- - Automatic Best Checkpoint Selection

- ---

- ## Current Status

- **Training In Progress**

- The dataset contains approximately:

- ```text
- 1.6 million training samples
- ```

- Validation metrics are evaluated multiple times per epoch and checkpoints are saved throughout training to enable detailed learning curve analysis.

- ---

- ## Objectives

- The ModernBERT experiments aim to answer:

- 1. What is the maximum achievable accuracy on this dataset?
- 2. Which language pairs remain difficult after FastText?
- 3. How much improvement does contextual modeling provide?
- 4. Is the improvement sufficient to justify the additional compute cost?

  ---

- # Planned Future Work

- ## Knowledge Distillation

- After training the ModernBERT teacher model:

- ```text
- ModernBERT Teacher
-
- Student Model
- ```

- The goal is to transfer knowledge from the transformer into smaller models.

- ### Potential Student Architectures

- - Distilled ModernBERT variants
- - Compact transformer models
- - FastText students
- - Lightweight deployment models

  ---

- # Current Benchmark Summary

- | Model | Accuracy |
- |---------|---------:|
- | SGD Logistic Regression | ~91.1% |
- | FastText | ~95.5% |
- | ModernBERT-base | Training |

  ---

- # Key Takeaways So Far

- - Character n-gram features provide a surprisingly strong baseline for programming language classification.
- - FastText delivers a substantial performance improvement while maintaining practical training and inference costs.
- - Careful preprocessing is critical, particularly when using FastText label prefixes.
- - Source code classification benefits heavily from character-level information.
- - Larger neural models should be evaluated not only on accuracy but also on deployment cost, memory footprint, and inference speed.

- The project continues to evolve toward a high-accuracy, deployment-friendly code language classifier capable of operating efficiently at large scale.

  - recall
  - f1
  model-index:
+ - name: SGD Logistic Regression
+ results:
+ - task:
+ type: text-classification
+ name: Programming Language Classification
+ dataset:
+ type: custom
+ name: Code Language Classification Dataset
+ metrics:
+ - type: accuracy
+ value: 91.1
+ name: SGD Test Accuracy
+ - name: FastText
+ results:
+ - task:
+ type: text-classification
+ name: Programming Language Classification
+ dataset:
+ type: custom
+ name: Code Language Classification Dataset
+ metrics:
+ - type: accuracy
+ value: 95.5
+ name: FastText Test Accuracy
  datasets:
  - kaushik-harsh-99/Code-Language-Classification
  ---
  # Experiment Timeline


  ## Motivation

+ After achieving strong results with FastText, the next stage of the project explored whether transformer architectures could further improve programming language classification performance.

+ Unlike FastText, transformer models can learn:

  - Long-range dependencies
+ - Global context
  - Structural relationships
+ - Context-aware representations

+ The goal was to determine whether additional model capacity translates into meaningful real-world gains for source code language identification.

  ---


  - Sequence Classification

+ ## Results

+ ### Approximate Test Accuracy

+ **~97–98%**

+ ### Improvement Over FastText

+ **~2–3 percentage points**

+ ---

+ ## Observations

+ ModernBERT achieved the highest overall accuracy among all models tested.

+ However, experimentation revealed that the improvement over FastText was relatively small considering the large increase in computational requirements.

+ Compared with FastText:

+ - Training time increased dramatically
+ - GPU memory usage increased significantly
+ - Inference became substantially slower
+ - Model size increased considerably
+ - Deployment became more complex
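One way to put a number on the inference-speed point is to time each predictor on the same code snippets. The sketch below is a minimal, hypothetical harness: `measure_latency` and the stand-in `dummy_predict` are illustrative names, not part of this repository. A real comparison would pass the FastText and ModernBERT prediction functions in place of the stand-in.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency(predict: Callable[[str], str],
                    samples: Sequence[str],
                    repeats: int = 3) -> float:
    """Return the median per-sample latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for snippet in samples:
            predict(snippet)
        elapsed = time.perf_counter() - start
        # Average over the batch, converted from seconds to milliseconds
        timings.append(elapsed / len(samples) * 1000.0)
    return statistics.median(timings)


def dummy_predict(snippet: str) -> str:
    # Hypothetical stand-in for a real model call (assumption, not the
    # project's classifier)
    return "python" if "def " in snippet else "unknown"


samples = ["def f():\n    return 1", "int main() { return 0; }"]
latency_ms = measure_latency(dummy_predict, samples)
print(f"median latency: {latency_ms:.4f} ms per sample")
```

Taking the median over several passes reduces the effect of one-off stalls such as cache warm-up, which matters when the two models differ by orders of magnitude.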

+ Although ModernBERT achieved higher accuracy, the gain remained limited relative to the increase in compute.

  ---

+ ## Key Finding

+ For programming language classification specifically:

+ > Transformer-based neural networks do not appear to be the most efficient solution for this task.

+ Programming languages contain strong lexical and structural signals that can already be captured extremely effectively using lightweight approaches.
+
+ FastText achieved performance surprisingly close to ModernBERT while requiring only a fraction of:

+ - Compute
+ - Training time
+ - Memory
+ - Storage
+ - Inference cost
+
+ ---

+ # Current Benchmark Summary

+ | Model | Test Accuracy | Relative Compute |
+ |--------|--------------:|-----------------:|
+ | SGD Logistic Regression | ~91.1% | Very Low |
+ | FastText | ~95.5% | Low |
+ | ModernBERT-base | ~97–98% | Extremely High |
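The accuracy gap in the table can be read two ways: as an absolute gain in percentage points, or as the share of FastText's remaining errors that ModernBERT removes. The sketch below works both out from the approximate figures in the table. The helper names are illustrative, and because the ModernBERT figure is a range, both ends are carried through. Taken at face value, the inputs give a gain of about 1.5 to 2.5 points, close to the rounded ~2–3 point figure quoted in the Results section.

```python
# Approximate accuracies from the benchmark table (percent).
fasttext_acc = 95.5
modernbert_range = (97.0, 98.0)  # reported as "~97–98%"


def gain_points(baseline: float, candidate: float) -> float:
    """Absolute accuracy gain in percentage points."""
    return candidate - baseline


def relative_error_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline's errors that the candidate removes."""
    baseline_err = 100.0 - baseline
    candidate_err = 100.0 - candidate
    return (baseline_err - candidate_err) / baseline_err


for acc in modernbert_range:
    print(f"ModernBERT at {acc:.1f}%: "
          f"+{gain_points(fasttext_acc, acc):.1f} points, "
          f"{relative_error_reduction(fasttext_acc, acc):.0%} fewer errors")
```

A small point gain near the top of the accuracy scale is still a sizeable cut in errors, so the relative figure is worth reporting next to the compute cost when judging the trade-off.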

  ---

+ # Current Conclusions

+ ## 1. Classical machine learning remains surprisingly competitive

+ Character-level linear models establish a strong baseline even at large scale.

  ---

+ ## 2. FastText provides the strongest accuracy-to-compute ratio
+
+ Current experiments indicate FastText delivers the best balance of:
+
+ - Accuracy
+ - Training speed
+ - Inference speed
+ - Memory efficiency
+ - Deployment simplicity

+ while remaining within only a few percentage points of transformer performance.

+ ---