kaushik-harsh-99/Code-Lang-Classifier

@@ -21,35 +21,32 @@ metrics:
 - recall
 - f1
 model-index:
-  - name: SGD Logistic Regression
-    results:
-      - task:
-          type: text-classification
-          name: Programming Language Classification
-        dataset:
-          type: custom
-          name: Code Language Classification Dataset
-        metrics:
-          - type: accuracy
-            value: 91.1
-            name: SGD Test Accuracy
-  - name: FastText
-    results:
-      - task:
-          type: text-classification
-          name: Programming Language Classification
-        dataset:
-          type: custom
-          name: Code Language Classification Dataset
-        metrics:
-          - type: accuracy
-            value: 95.5
-            name: FastText Test Accuracy
 datasets:
 - kaushik-harsh-99/Code-Language-Classification
-base_model:
-- answerdotai/ModernBERT-base
 ---
 # Experiment Timeline
@@ -277,16 +274,16 @@ FastText proved to be one of the strongest accuracy-to-compute trade-offs observ
 ## Motivation
-While FastText achieved strong results, it still relies primarily on local token and character patterns.
-Modern transformer architectures can model:
 - Long-range dependencies
 - Structural relationships
-- Contextual representations
-- Semantic information
-The next phase aims to determine the maximum achievable accuracy on the dataset.
 ---
@@ -300,82 +297,82 @@ The next phase aims to determine the maximum achievable accuracy on the dataset.
 - Sequence Classification
-### Training Features
-- Mixed Precision Training
-- Gradient Checkpointing
-- Dynamic Padding
-- Large Effective Batch Size
-- Validation Tracking Throughout Training
-- Automatic Best Checkpoint Selection
----
-## Current Status
-**Training In Progress**
-The dataset contains approximately:
-```text
-1.6 million training samples
-```
-Validation metrics are evaluated multiple times per epoch and checkpoints are saved throughout training to enable detailed learning curve analysis.
----
-## Objectives
-The ModernBERT experiments aim to answer:
-1. What is the maximum achievable accuracy on this dataset?
-2. Which language pairs remain difficult after FastText?
-3. How much improvement does contextual modeling provide?
-4. Is the improvement sufficient to justify the additional compute cost?
 ---
-# Planned Future Work
-## Knowledge Distillation
-After training the ModernBERT teacher model:
-```text
-ModernBERT Teacher
-        ↓
-Student Model
-```
-The goal is to transfer knowledge from the transformer into smaller models.
-### Potential Student Architectures
-- Distilled ModernBERT variants
-- Compact transformer models
-- FastText students
-- Lightweight deployment models
 ---
-# Current Benchmark Summary
-| Model | Accuracy |
-|---------|---------:|
-| SGD Logistic Regression | ~91.1% |
-| FastText | ~95.5% |
-| ModernBERT-base | Training |
 ---
-# Key Takeaways So Far
-- Character n-gram features provide a surprisingly strong baseline for programming language classification.
-- FastText delivers a substantial performance improvement while maintaining practical training and inference costs.
-- Careful preprocessing is critical, particularly when using FastText label prefixes.
-- Source code classification benefits heavily from character-level information.
-- Larger neural models should be evaluated not only on accuracy but also on deployment cost, memory footprint, and inference speed.
-The project continues to evolve toward a high-accuracy, deployment-friendly code language classifier capable of operating efficiently at large scale.

 - recall
 - f1
 model-index:
+- name: SGD Logistic Regression
+  results:
+  - task:
+      type: text-classification
+      name: Programming Language Classification
+    dataset:
+      type: custom
+      name: Code Language Classification Dataset
+    metrics:
+    - type: accuracy
+      value: 91.1
+      name: SGD Test Accuracy
+- name: FastText
+  results:
+  - task:
+      type: text-classification
+      name: Programming Language Classification
+    dataset:
+      type: custom
+      name: Code Language Classification Dataset
+    metrics:
+    - type: accuracy
+      value: 95.5
+      name: FastText Test Accuracy
 datasets:
 - kaushik-harsh-99/Code-Language-Classification
 ---
 # Experiment Timeline
 ## Motivation
+After achieving strong results with FastText, the next stage of the project explored whether transformer architectures could further improve programming language classification performance.
+Unlike FastText, transformer models can learn:
 - Long-range dependencies
+- Global context
 - Structural relationships
+- Context-aware representations
+The goal was to determine whether additional model capacity translates into meaningful real-world gains for source code language identification.
 ---
 - Sequence Classification
+## Results
+### Approximate Test Accuracy
+**~97–98%**
+### Improvement Over FastText
+**~1.5–2.5 percentage points** (from ~95.5% to ~97–98%)
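As a rough illustration, the gap between the approximate accuracies reported in this card (FastText ~95.5%, ModernBERT ~97–98%) can be expressed both as an absolute gain in percentage points and as a relative reduction in error rate. The helper name `accuracy_gain` is hypothetical, and the inputs are the rounded figures from this card rather than exact measurements.

```python
def accuracy_gain(baseline_acc, new_acc):
    """Return (absolute gain in percentage points, relative error reduction).

    Both accuracies are given in percent, e.g. 95.5 for 95.5%.
    """
    abs_gain = new_acc - baseline_acc
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    rel_err_reduction = (baseline_err - new_err) / baseline_err
    return abs_gain, rel_err_reduction


# Approximate figures taken from this model card (assumed, not exact).
fasttext_acc = 95.5
results = {}
for bert_acc in (97.0, 98.0):
    gain, rer = accuracy_gain(fasttext_acc, bert_acc)
    results[bert_acc] = (gain, rer)
    print(f"ModernBERT at {bert_acc:.1f}%: +{gain:.1f} pp, {rer:.0%} fewer errors")
```

Under these assumed figures, an absolute gain that looks small still corresponds to a noticeable drop in the error rate, which is worth weighing against the compute cost.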
+---
+## Observations
+ModernBERT achieved the highest overall accuracy among all models tested.
+However, the improvement over FastText was small relative to the large increase in computational requirements.
+Compared with FastText:
+- Training time increased dramatically
+- GPU memory usage increased significantly
+- Inference became substantially slower
+- Model size increased considerably
+- Deployment became more complex
+In short, ModernBERT's accuracy gain of a few percentage points came at a disproportionately higher compute cost.
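One way to put a number on the inference-speed difference is a small timing harness like the sketch below. It is not the measurement procedure used in these experiments. `measure_latency` and `dummy_predict` are hypothetical names, and the dummy function stands in for a real FastText or ModernBERT prediction call.

```python
import statistics
import time


def measure_latency(predict_fn, samples, warmup=3, repeats=10):
    """Return the median seconds per sample for predict_fn over samples."""
    # Warm-up runs so caches and lazy initialisation do not skew the timing.
    for _ in range(warmup):
        predict_fn(samples)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(samples)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings) / len(samples)


def dummy_predict(batch):
    """Hypothetical stand-in: replace with a real model's batch prediction."""
    return ["python" for _ in batch]


samples = ["print('hello')"] * 100
latency = measure_latency(dummy_predict, samples)
print(f"{latency * 1e6:.2f} microseconds per sample")
```

Running the same harness against each model on identical samples and hardware gives comparable per-sample latencies.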
 ---
+## Key Finding
+For programming language classification specifically:
+> Transformer-based neural networks do not appear to be the most efficient solution for this task.
+Programming languages contain strong lexical and structural signals that can already be captured extremely effectively using lightweight approaches.
+FastText achieved accuracy close to ModernBERT (~95.5% versus ~97–98%) while requiring only a fraction of:
+- Compute
+- Training time
+- Memory
+- Storage
+- Inference cost
+---
+# Current Benchmark Summary
+| Model | Test Accuracy | Relative Compute |
+|--------|--------------:|-----------------:|
+| SGD Logistic Regression | ~91.1% | Very Low |
+| FastText | ~95.5% | Low |
+| ModernBERT-base | ~97–98% | Extremely High |
 ---
+# Current Conclusions
+## 1. Classical machine learning remains surprisingly competitive
+Character-level linear models establish a strong baseline even at large scale.
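To illustrate why character-level features carry so much signal for this task, the sketch below counts character n-grams in two short snippets and lists those unique to each. It is a minimal standard-library illustration, not the feature pipeline used for the SGD model. The function name `char_ngrams` and the n-gram range of 1 to 3 are assumptions.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count all character n-grams of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


python_snippet = "def add(a, b):\n    return a + b\n"
c_snippet = "int add(int a, int b) {\n    return a + b;\n}\n"

py_feats = char_ngrams(python_snippet)
c_feats = char_ngrams(c_snippet)

# N-grams present in only one snippet hint at language-specific tokens.
only_python = set(py_feats) - set(c_feats)
only_c = set(c_feats) - set(py_feats)

print(sorted(g for g in only_python if len(g) == 3))
print(sorted(g for g in only_c if len(g) == 3))
```

Even on two one-line functions, n-grams such as `def` and `int` separate the snippets, which is the kind of lexical cue a linear model can weight directly.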
 ---
+## 2. FastText provides the strongest accuracy-to-compute ratio
+Current experiments indicate FastText delivers the best balance of:
+- Accuracy
+- Training speed
+- Inference speed
+- Memory efficiency
+- Deployment simplicity
+while remaining within only a few percentage points of transformer performance.
+---