How to use kaushik-harsh-99/Code-Lang-Classifier with fastText:

```python
from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("kaushik-harsh-99/Code-Lang-Classifier", "model.bin"))
```
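
Once the model is loaded, prediction could look roughly like the sketch below. It assumes the model emits labels in the `__label__<Language>` form shown under Data Preparation, and that inference input should be normalized the same way as the training data; both `prepare_snippet` and `predict_language` are hypothetical helpers, not part of this repository. The newline handling matters because the `fasttext` Python bindings reject input that contains a newline.

```python
def prepare_snippet(code):
    """Flatten a code snippet into the single-line form fastText expects."""
    # Assumption: mirror the training-time substitution of the label prefix.
    safe = code.replace("__label__", "__lbl__")
    # predict() raises on newlines, so collapse every whitespace run to one space.
    return " ".join(safe.split())


def predict_language(model, code, k=1):
    """Return the top-k (language, probability) pairs for one snippet."""
    labels, probs = model.predict(prepare_snippet(code), k=k)
    return [
        (label.removeprefix("__label__"), float(prob))
        for label, prob in zip(labels, probs)
    ]
```
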
```yaml
language:
- en
license: mit
library_name: scikit-learn
tags:
- code-classification
- programming-language-detection
- source-code
- machine-learning
- fasttext
- modernbert
- classification
- nlp
- code-analysis
- software-engineering
pipeline_tag: text-classification
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: SGD Logistic Regression
  results:
  - task:
      type: text-classification
      name: Programming Language Classification
    dataset:
      type: custom
      name: Code Language Classification Dataset
    metrics:
    - type: accuracy
      value: 91.1
      name: SGD Test Accuracy
- name: FastText
  results:
  - task:
      type: text-classification
      name: Programming Language Classification
    dataset:
      type: custom
      name: Code Language Classification Dataset
    metrics:
    - type: accuracy
      value: 95.5
      name: FastText Test Accuracy
datasets:
- kaushik-harsh-99/Code-Language-Classification
```
# Experiment Timeline

The primary objective of this project is to systematically explore different approaches to programming language classification, ranging from traditional machine learning methods to modern transformer architectures.

Rather than immediately training a large neural network, the project follows a progressive benchmarking strategy. Each model serves as a baseline for the next stage, allowing direct comparison of accuracy, model size, training cost, inference speed, and deployment complexity.

The experiments are designed to answer several questions:

- How far can classical machine learning be pushed on source code classification?
- How much improvement does FastText provide over linear models?
- How much additional performance can transformer architectures achieve?
- What is the optimal trade-off between accuracy and model size?
- Can large transformer models later be distilled into smaller deployable models?

---
# Phase 1: SGD Logistic Regression Baseline

## Motivation

The first goal was to establish a strong classical machine learning baseline.

Programming languages contain many distinctive lexical and syntactic patterns:

```text
#include
public class
def
fn
let
import
```

Character n-gram models are known to perform surprisingly well on language identification tasks because they capture these patterns directly, without requiring deep semantic understanding. For this reason, a linear classifier using hashed character n-gram features was selected as the initial benchmark.

---
## Architecture

### Feature Extraction

- `HashingVectorizer`
- Character-level features
- Character n-grams: `(2, 6)`
- 131,072 hashed dimensions
- No vocabulary storage
- Constant-memory feature extraction

### Classifier

- `SGDClassifier`
- Logistic Regression objective (`log_loss`)
- Incremental training using `partial_fit`
- Streaming JSONL training pipeline

---
## Training Strategy

The entire dataset was streamed from disk in batches.

Benefits:

- Constant RAM usage
- Scalable to millions of samples
- No need to load the entire dataset into memory
- Fast experimentation

The classifier was trained for multiple epochs while evaluating both validation and test performance after every epoch.

---
## Results

### Test Accuracy

**~91.1%**

---

## Observations

The model performed significantly better than expected for such a simple architecture.

### Strengths

- Extremely fast training
- Fast inference
- Simple implementation
- Excellent scalability

### Weaknesses

- Difficulty separating structurally similar languages
- Limited contextual understanding
- Large sparse parameter matrix
- Performance ceiling reached relatively quickly

### Common Confusion Pairs

- C ↔ C++
- JavaScript ↔ TypeScript
- HTML ↔ Markdown

---
# Phase 2: FastText

## Motivation

After establishing the linear baseline, the next objective was to evaluate FastText.

FastText occupies an interesting position between classical machine learning and neural networks. It introduces:

- Learned embeddings
- Character-level subword information
- Efficient training
- Low inference latency

It does so while remaining dramatically smaller and faster than transformer models.

---
## Data Preparation

FastText's supervised mode expects one training example per line, with each label prefixed by `__label__`:

```text
__label__Python print("hello")
```

A dedicated conversion pipeline was created to transform JSONL datasets into FastText format.

### Preventing Label Leakage

During preprocessing, special care was taken to prevent accidental label leakage. Source code occasionally contained the literal token `__label__`, which FastText interprets as the start of a training label. To prevent this, the following substitution was applied during dataset conversion:

```text
__label__ → __lbl__
```

This eliminated spurious classes and ensured correct training.

---
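
The conversion and the label-prefix substitution described above can be sketched as follows. The JSONL field names `code` and `language` are assumptions, as is the choice to collapse all whitespace (FastText reads one example per line, so newlines inside a snippet have to go somewhere; the source does not say how the original pipeline handled them). Labels containing spaces would also need extra handling that is not shown here.

```python
import json


def to_fasttext_line(code, language):
    """Format one sample as a single FastText supervised training line."""
    # Neutralize any literal label prefix that appears inside the source code.
    safe = code.replace("__label__", "__lbl__")
    # Assumption: flatten the snippet to one line by collapsing whitespace runs.
    safe = " ".join(safe.split())
    return f"__label__{language} {safe}"


def convert_jsonl(src_path, dst_path):
    """Stream a JSONL dataset into FastText's text format, line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                record = json.loads(line)
                # Hypothetical field names; adjust to the dataset schema.
                dst.write(to_fasttext_line(record["code"], record["language"]) + "\n")
```
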
## Architecture

### Configuration

```text
dim = 50
wordNgrams = 3
minn = 2
maxn = 5
minCount = 100
bucket = 50000
loss = softmax
epoch = 25
learning_rate = 0.7
```

---
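
Expressed through the `fasttext` Python bindings, the configuration above would map onto `train_supervised` roughly as shown below. This is a sketch under assumptions: the training file path is a placeholder, `train_fasttext` is a hypothetical wrapper, and the listed `learning_rate` is passed as `lr`, which is the name the Python API uses for that setting.

```python
# Values copied from the configuration listing above.
FASTTEXT_PARAMS = {
    "dim": 50,
    "wordNgrams": 3,
    "minn": 2,
    "maxn": 5,
    "minCount": 100,
    "bucket": 50000,
    "loss": "softmax",
    "epoch": 25,
    "lr": 0.7,  # listed as learning_rate above
}


def train_fasttext(train_path, model_path="model.bin"):
    """Train a supervised FastText model and save it to disk."""
    # Imported lazily so the parameters can be inspected without fasttext installed.
    import fasttext

    model = fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)
    model.save_model(model_path)
    return model
```
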
## Hyperparameter Exploration

Experiments varied the following settings:

- Embedding dimension
- Character subword lengths
- Vocabulary size
- Bucket size
- Epoch count
- Learning rate
- Model size reduction

The goal was not merely to maximize accuracy, but also to produce a compact deployable model.

---
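
To see why embedding dimension, vocabulary size, and bucket size dominate the size of an uncompressed FastText model, a back-of-the-envelope estimate helps. An unquantized supervised model stores an input matrix with one row per vocabulary word plus one row per hash bucket, and a softmax output matrix with one row per label, all in 32-bit floats. The sketch below ignores dictionary storage and any quantization, and the `n_words` and `n_labels` values used in the example call are hypothetical, since the source does not report them.

```python
def estimate_fasttext_bytes(n_words, n_labels, bucket=50_000, dim=50, bytes_per_float=4):
    """Approximate size of the dense matrices in an unquantized FastText model."""
    input_rows = n_words + bucket   # word vectors plus hashed n-gram buckets
    output_rows = n_labels          # one softmax row per label
    return (input_rows + output_rows) * dim * bytes_per_float


# Bucket rows alone, using bucket=50000 and dim=50 from the configuration.
bucket_only = estimate_fasttext_bytes(n_words=0, n_labels=0)

# Hypothetical vocabulary and label counts, for illustration only.
example_total = estimate_fasttext_bytes(n_words=20_000, n_labels=50)
```

With these settings the bucket rows alone account for 10,000,000 bytes, which is why halving `dim` or `bucket` roughly halves that part of the model.
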
## Results

### Test Accuracy

**~95.5%**

### Improvement Over SGD

**+4.4 percentage points**

---

## Observations

FastText substantially outperformed the linear baseline.

### Key Findings

- Character subwords are extremely powerful for source code.
- Many language-specific keywords are captured effectively.
- FastText dramatically reduced confusion between related languages.
- Training remained relatively fast despite the dataset scale.

FastText proved to be one of the strongest accuracy-to-compute trade-offs observed during the project.

---
# Phase 3: ModernBERT

## Motivation

After achieving strong results with FastText, the next stage of the project explored whether transformer architectures could further improve programming language classification performance.

Unlike FastText, transformer models can learn:

- Long-range dependencies
- Global context
- Structural relationships
- Context-aware representations

The goal was to determine whether additional model capacity translates into meaningful real-world gains for source code language identification.

---

## Architecture

### Model

- ModernBERT-base

### Task

- Sequence Classification

---
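
A sequence-classification setup of this kind is commonly built with the Hugging Face `transformers` library, roughly as sketched below. The source names only "ModernBERT-base", so the checkpoint id `answerdotai/ModernBERT-base`, the helper names, and the label-mapping scheme are all assumptions; the fine-tuning loop itself is not shown.

```python
def build_label_maps(languages):
    """Build id/label lookup tables from a collection of language names."""
    ordered = sorted(set(languages))
    id2label = dict(enumerate(ordered))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_classifier(languages, checkpoint="answerdotai/ModernBERT-base"):
    """Load a tokenizer and a model with a freshly initialized classification head."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(languages)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```
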
## Results

### Approximate Test Accuracy

**~97–98%**

### Improvement Over FastText

**~1.5–2.5 percentage points** (relative to FastText's ~95.5%)

---

## Observations

ModernBERT achieved the highest overall accuracy among all models tested. However, the improvement over FastText was small relative to the large increase in computational requirements.

Compared with FastText:

- Training time increased dramatically
- GPU memory usage increased significantly
- Inference became substantially slower
- Model size increased considerably
- Deployment became more complex

---
## Key Finding

For programming language classification specifically:

> Transformer-based neural networks do not appear to be the most efficient solution for this task.

Programming languages contain strong lexical and structural signals that can already be captured extremely effectively using lightweight approaches.

FastText achieved performance surprisingly close to ModernBERT while requiring only a fraction of:

- Compute
- Training time
- Memory
- Storage
- Inference cost

---
# Current Benchmark Summary

| Model | Test Accuracy | Relative Compute |
|--------|--------------:|-----------------:|
| SGD Logistic Regression | ~91.1% | Very Low |
| FastText | ~95.5% | Low |
| ModernBERT-base | ~97–98% | Extremely High |

---
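
The accuracy column can also be restated as error rates, which makes the size of each step easier to compare. The sketch below uses only the figures from the table; `relative_error_reduction` is a helper introduced here for illustration, and the calculation deliberately says nothing about the compute column.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed by the new model (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# SGD (91.1%) to FastText (95.5%): the error rate falls from 8.9% to 4.5%.
sgd_to_fasttext = relative_error_reduction(91.1, 95.5)

# FastText (95.5%) to ModernBERT, at both ends of the reported ~97-98% range.
fasttext_to_bert_low = relative_error_reduction(95.5, 97.0)
fasttext_to_bert_high = relative_error_reduction(95.5, 98.0)
```
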
# Current Conclusions

## 1. Classical machine learning remains surprisingly competitive

Character-level linear models establish a strong baseline even at large scale.

---

## 2. FastText provides the strongest accuracy-to-compute ratio

Current experiments indicate FastText delivers the best balance of:

- Accuracy
- Training speed
- Inference speed
- Memory efficiency
- Deployment simplicity

It does so while remaining within only a few percentage points of transformer performance.

---