kaushik-harsh-99/Code-Lang-Classifier
+---
+language:
+- en
+license: mit
+library_name: scikit-learn
+tags:
+- code-classification
+- programming-language-detection
+- source-code
+- machine-learning
+- fasttext
+- modernbert
+- classification
+- nlp
+- code-analysis
+- software-engineering
+pipeline_tag: text-classification
+metrics:
+- accuracy
+- precision
+- recall
+- f1
+model-index:
+- name: SGD Logistic Regression
+  results:
+  - task:
+      type: text-classification
+      name: Programming Language Classification
+    dataset:
+      type: custom
+      name: Code Language Classification Dataset
+    metrics:
+    - type: accuracy
+      value: 91.1
+      name: Test Accuracy
+- name: FastText
+  results:
+  - task:
+      type: text-classification
+      name: Programming Language Classification
+    dataset:
+      type: custom
+      name: Code Language Classification Dataset
+    metrics:
+    - type: accuracy
+      value: 95.5
+      name: Test Accuracy
+datasets:
+- kaushik-harsh-99/Code-Language-Classification
+base_model:
+- answerdotai/ModernBERT-base
+---
+# Experiment Timeline
+This project compares approaches to programming language classification, ranging from traditional machine learning methods to modern transformer architectures.
+Rather than immediately training a large neural network, the project follows a progressive benchmarking strategy. Each model serves as a baseline for the next stage, allowing direct comparison of accuracy, model size, training cost, inference speed, and deployment complexity.
+The experiments are designed to answer several questions:
+- How far can classical machine learning be pushed on source code classification?
+- How much improvement does FastText provide over linear models?
+- How much additional performance can transformer architectures achieve?
+- What is the optimal trade-off between accuracy and model size?
+- Can large transformer models later be distilled into smaller deployable models?
+---
+# Phase 1 — SGD Logistic Regression Baseline
+## Motivation
+The first goal was to establish a strong classical machine learning baseline.
+Programming languages contain many distinctive lexical and syntactic patterns:
+```text
+#include
+public class
+def
+fn
+let
+import
+```
+Character n-gram models perform well on language identification tasks because they capture these patterns directly, without requiring deep semantic understanding.
+For this reason, a linear classifier over hashed character n-gram features was selected as the initial benchmark.
+---
+## Architecture
+### Feature Extraction
+- HashingVectorizer
+- Character-level features
+- Character n-grams of length 2 to 6 (`ngram_range=(2, 6)`)
+- 131,072 (2^17) hashed dimensions
+- No vocabulary storage
+- Constant-memory feature extraction
+### Classifier
+- SGDClassifier
+- Logistic Regression objective (`log_loss`)
+- Incremental training using `partial_fit`
+- Streaming JSONL training pipeline
+---
+## Training Strategy
+The entire dataset was streamed from disk in batches.
+Benefits:
+- Constant RAM usage
+- Scalable to millions of samples
+- No need to load the entire dataset into memory
+- Fast experimentation
+The classifier was trained for multiple epochs, with validation and test performance evaluated after every epoch.
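The streaming approach can be sketched with a small generator that reads a JSONL file a batch at a time. The function name `stream_batches` and the field names `code` and `language` are hypothetical; the dataset's actual schema is not stated here.

```python
import json
from itertools import islice
from typing import Iterator, List, Tuple


def stream_batches(
    path: str,
    batch_size: int = 1000,
    text_key: str = "code",       # assumed field name
    label_key: str = "language",  # assumed field name
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (texts, labels) batches from a JSONL file without loading it whole."""
    with open(path, encoding="utf-8") as handle:
        while True:
            lines = list(islice(handle, batch_size))  # read at most batch_size lines
            if not lines:
                break  # end of file
            records = [json.loads(line) for line in lines if line.strip()]
            if records:
                yield (
                    [record[text_key] for record in records],
                    [record[label_key] for record in records],
                )
```

Each epoch would re-open the file and pass every yielded batch to the classifier's `partial_fit`, so peak memory depends on `batch_size` rather than on the dataset size.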
+---
+## Results
+### Test Accuracy
+**~91.1%**
+---
+## Observations
+The model reached roughly 91.1% test accuracy, better than expected for such a simple architecture.
+### Strengths
+- Extremely fast training
+- Fast inference
+- Simple implementation
+- Excellent scalability
+### Weaknesses
+- Difficulty separating structurally similar languages
+- Limited contextual understanding
+- Large parameter matrix (one weight per class for each of the 131,072 hashed features)
+- Performance ceiling reached relatively quickly
+### Common Confusion Pairs
+- C ↔ C++
+- JavaScript ↔ TypeScript
+- HTML ↔ Markdown
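Confusion pairs like these can be ranked directly from a model's predictions. The helper below is an illustrative sketch (the name `top_confusions` is hypothetical); it counts errors per unordered language pair, so C predicted as C++ and C++ predicted as C fall into the same bucket.

```python
from collections import Counter
from typing import List, Tuple


def top_confusions(
    y_true: List[str], y_pred: List[str], k: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Count misclassifications per unordered language pair and return the k largest."""
    counts: Counter = Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth != guess:
            pair = tuple(sorted((truth, guess)))  # direction of the error is ignored
            counts[pair] += 1
    return counts.most_common(k)
```

Running the same helper on each model's test predictions gives a like-for-like view of which pairs a later phase actually improves.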
+---
+# Phase 2 — FastText
+## Motivation
+After establishing the linear baseline, the next objective was to evaluate FastText.
+FastText occupies an interesting position between classical machine learning and neural networks.
+While remaining far smaller and faster than transformer models, it introduces:
+- Learned embeddings
+- Character-level subword information
+- Efficient training
+- Low inference latency
+---
+## Data Preparation
+FastText's supervised mode expects one training example per line, with each label marked by the `__label__` prefix:
+```text
+__label__Python print("hello")
+```
+A dedicated conversion pipeline was created to transform JSONL datasets into FastText format.
+### Preventing Label Leakage
+During preprocessing, special care was taken to prevent accidental label leakage.
+Source code occasionally contained the token:
+```text
+__label__
+```
+FastText treats any token that begins with this prefix as a training label, so such samples would introduce spurious classes.
+To prevent this, the following substitution was applied during dataset conversion:
+```text
+__label__ → __lbl__
+```
+This eliminated spurious classes and ensured correct training.
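A conversion step along these lines might look like the sketch below. The function names and the `code` / `language` field names are hypothetical, and two details are assumptions rather than the project's documented behavior: newlines are replaced with spaces (FastText reads one example per line), and spaces in label names are replaced with underscores.

```python
import json
import re

LABEL_PREFIX = "__label__"
SAFE_PREFIX = "__lbl__"


def to_fasttext_line(code: str, language: str) -> str:
    """Format one sample as a single FastText supervised training line."""
    escaped = code.replace(LABEL_PREFIX, SAFE_PREFIX)      # neutralize label-like tokens
    flattened = re.sub(r"[\r\n]+", " ", escaped).strip()   # assumption: one sample per line
    label = language.replace(" ", "_")                     # assumption: no spaces in labels
    return f"{LABEL_PREFIX}{label} {flattened}"


def convert_jsonl(
    src_path: str,
    dst_path: str,
    text_key: str = "code",       # assumed field name
    label_key: str = "language",  # assumed field name
) -> int:
    """Convert a JSONL dataset to FastText format and return the number of lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            dst.write(to_fasttext_line(record[text_key], record[label_key]) + "\n")
            written += 1
    return written
```

Applying the escape before the real label is prepended means the only `__label__` token left on each output line is the intended one.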
+---
+## Architecture
+### Configuration
+```text
+dim = 50
+wordNgrams = 3
+minn = 2
+maxn = 5
+minCount = 100
+bucket = 50000
+loss = softmax
+epoch = 25
+lr = 0.7
+```
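For reference, the values above could be passed to the official `fasttext` Python bindings as shown in this sketch. The `train` wrapper and its path arguments are hypothetical; the keyword names are those accepted by `fasttext.train_supervised`, where the learning rate is passed as `lr`.

```python
# Hyperparameters from the configuration above, keyed by the argument names
# used by fasttext.train_supervised.
FASTTEXT_PARAMS = {
    "dim": 50,
    "wordNgrams": 3,
    "minn": 2,
    "maxn": 5,
    "minCount": 100,
    "bucket": 50000,
    "loss": "softmax",
    "epoch": 25,
    "lr": 0.7,
}

# Size of the hashed n-gram table alone: one dim-sized vector per bucket.
BUCKET_PARAMETERS = FASTTEXT_PARAMS["bucket"] * FASTTEXT_PARAMS["dim"]


def train(train_path: str, model_path: str) -> None:
    """Train and save a supervised model; requires the `fasttext` package."""
    import fasttext  # imported lazily so the config can be inspected without it

    model = fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)
    model.save_model(model_path)
```

With `bucket = 50000` and `dim = 50`, the hashed n-gram table contributes 2.5 million parameters, which is one reason a small bucket count keeps the model compact.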
+---
+## Hyperparameter Exploration
+Hyperparameter experiments covered:
+- Embedding dimension
+- Character subword lengths
+- Vocabulary size
+- Bucket size
+- Epoch count
+- Learning rate
+- Model size reduction
+The goal was not merely to maximize accuracy, but also to produce a compact deployable model.
+---
+## Results
+### Test Accuracy
+**~95.5%**
+### Improvement Over SGD
+**+4.4 percentage points**
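The same comparison can be expressed as a reduction in error rate, which is often easier to interpret at high accuracies. The short calculation below uses only the two approximate test accuracies reported above, so the result is approximate as well.

```python
sgd_accuracy = 91.1       # percent, Phase 1 test accuracy
fasttext_accuracy = 95.5  # percent, Phase 2 test accuracy

absolute_gain = fasttext_accuracy - sgd_accuracy  # percentage points
sgd_error = 100.0 - sgd_accuracy
fasttext_error = 100.0 - fasttext_accuracy
relative_error_reduction = (sgd_error - fasttext_error) / sgd_error

print(f"absolute gain: {absolute_gain:.1f} pp")
print(f"error rate: {sgd_error:.1f}% -> {fasttext_error:.1f}%")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```

This prints an absolute gain of 4.4 pp, an error rate falling from 8.9% to 4.5%, and a relative error reduction of 49.4%, meaning FastText removes roughly half of the baseline's test errors.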
+---
+## Observations
+FastText substantially outperformed the linear baseline.
+### Key Findings
+- Character subwords are extremely powerful for source code.
+- Many language-specific keywords are captured effectively.
+- FastText dramatically reduced confusion between related languages.
+- Training remained relatively fast despite the dataset scale.
+FastText offered one of the strongest accuracy-to-compute trade-offs observed so far in the project.
+---
+# Phase 3 — ModernBERT
+## Motivation
+While FastText achieved strong results, it still relies primarily on local token and character patterns.
+Modern transformer architectures can model:
+- Long-range dependencies
+- Structural relationships
+- Contextual representations
+- Semantic information
+The next phase aims to determine the maximum achievable accuracy on the dataset.
+---
+## Architecture
+### Model
+- ModernBERT-base
+### Task
+- Sequence Classification
+### Training Features
+- Mixed Precision Training
+- Gradient Checkpointing
+- Dynamic Padding
+- Large Effective Batch Size
+- Validation Tracking Throughout Training
+- Automatic Best Checkpoint Selection
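Two of the features listed above, dynamic padding and a large effective batch size, can be illustrated without any deep learning framework. The sketch below is library-free and illustrative only; the function names and the example numbers are hypothetical. In a Hugging Face pipeline, padding of this kind is typically handled by `DataCollatorWithPadding`, though the project's exact setup is not stated.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad token-id sequences to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def effective_batch_size(per_device: int, accumulation_steps: int, devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * accumulation_steps * devices
```

Padding per batch rather than to a fixed maximum length avoids spending compute on padding tokens when most samples are short, and gradient accumulation lets a memory-limited GPU reach a large effective batch, for example `effective_batch_size(16, 8)` gives 128.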
+---
+## Current Status
+**Training In Progress**
+The dataset contains approximately 1.6 million training samples.
+Validation metrics are evaluated multiple times per epoch and checkpoints are saved throughout training to enable detailed learning curve analysis.
+---
+## Objectives
+The ModernBERT experiments aim to answer:
+1. What is the maximum achievable accuracy on this dataset?
+2. Which language pairs remain difficult after FastText?
+3. How much improvement does contextual modeling provide?
+4. Is the improvement sufficient to justify the additional compute cost?
+---
+# Planned Future Work
+## Knowledge Distillation
+After training the ModernBERT teacher model:
+```text
+ModernBERT Teacher
+        ↓
+Student Model
+```
+The goal is to transfer knowledge from the transformer into smaller models.
+### Potential Student Architectures
+- Distilled ModernBERT variants
+- Compact transformer models
+- FastText students
+- Lightweight deployment models
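One common way to implement the teacher-to-student transfer described above is a temperature-scaled distillation loss that mixes a soft-target term with the usual cross-entropy on the true label. The sketch below shows that standard formulation in plain Python; it is not the project's planned implementation, and the `temperature` and `alpha` defaults are arbitrary illustrative values.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    true_index: int,
    temperature: float = 2.0,  # illustrative value
    alpha: float = 0.5,        # illustrative weight on the soft-target term
) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student) if t > 0)
    hard_loss = -math.log(softmax(student_logits)[true_index])
    return alpha * temperature**2 * kl + (1 - alpha) * hard_loss
```

When the student's logits match the teacher's, the KL term vanishes and only the weighted cross-entropy remains; any disagreement with the teacher adds a positive penalty.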
+---
+# Current Benchmark Summary
+| Model | Test Accuracy |
+|-------|--------------:|
+| SGD Logistic Regression | ~91.1% |
+| FastText | ~95.5% |
+| ModernBERT-base | Training in progress |
+---
+# Key Takeaways So Far
+- Character n-gram features provide a surprisingly strong baseline for programming language classification.
+- FastText delivers a substantial performance improvement while maintaining practical training and inference costs.
+- Careful preprocessing is critical, particularly when using FastText label prefixes.
+- Source code classification benefits heavily from character-level information.
+- Larger neural models should be evaluated not only on accuracy but also on deployment cost, memory footprint, and inference speed.
+The project continues to evolve toward a high-accuracy, deployment-friendly code language classifier capable of operating efficiently at large scale.