---
language:
- en
license: mit
library_name: scikit-learn
tags:
- code-classification
- programming-language-detection
- source-code
- machine-learning
- fasttext
- modernbert
- classification
- nlp
- code-analysis
- software-engineering
pipeline_tag: text-classification
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: SGD Logistic Regression
results:
- task:
type: text-classification
name: Programming Language Classification
dataset:
type: custom
name: Code Language Classification Dataset
metrics:
- type: accuracy
value: 91.1
name: SGD Test Accuracy
- name: FastText
results:
- task:
type: text-classification
name: Programming Language Classification
dataset:
type: custom
name: Code Language Classification Dataset
metrics:
- type: accuracy
value: 95.5
name: FastText Test Accuracy
datasets:
- kaushik-harsh-99/Code-Language-Classification
---
# Experiment Timeline
The primary objective of this project is to systematically explore different approaches to programming language classification, ranging from traditional machine learning methods to modern transformer architectures.
Rather than immediately training a large neural network, the project follows a progressive benchmarking strategy. Each model serves as a baseline for the next stage, allowing direct comparison of accuracy, model size, training cost, inference speed, and deployment complexity.
The experiments are designed to answer several questions:
- How far can classical machine learning be pushed on source code classification?
- How much improvement does FastText provide over linear models?
- How much additional performance can transformer architectures achieve?
- What is the optimal trade-off between accuracy and model size?
- Can large transformer models later be distilled into smaller deployable models?
---
# Phase 1: SGD Logistic Regression Baseline
## Motivation
The first goal was to establish a strong classical machine learning baseline.
Programming languages contain many distinctive lexical and syntactic patterns:
```text
#include
public class
def
fn
let
import
```
Character n-gram models are known to perform surprisingly well for language identification tasks because they capture these patterns directly without requiring deep semantic understanding.
Because of this, a linear classifier using hashed character n-gram features was selected as the initial benchmark.
---
## Architecture
### Feature Extraction
- HashingVectorizer
- Character-level features
- Character n-grams: `(2, 6)`
- 131,072 hashed dimensions
- No vocabulary storage
- Constant-memory feature extraction
### Classifier
- SGDClassifier
- Logistic Regression objective (`log_loss`)
- Incremental training using `partial_fit`
- Streaming JSONL training pipeline
---
## Training Strategy
The entire dataset was streamed from disk in batches.
Benefits:
- Constant RAM usage
- Scalable to millions of samples
- No need to load the entire dataset into memory
- Fast experimentation
The classifier was trained for multiple epochs while evaluating both validation and test performance after every epoch.
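
A batch reader for this kind of streaming setup might look like the sketch below. The field names `code` and `language` are assumptions about the JSONL schema, and an in-memory `StringIO` stands in for the file on disk.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_batches(
    lines: Iterable[str], batch_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (texts, labels) batches from JSONL lines without loading them all."""
    # Generator expression: records are parsed lazily, one line at a time.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        texts = [r["code"] for r in batch]       # hypothetical field name
        labels = [r["language"] for r in batch]  # hypothetical field name
        yield texts, labels


# In-memory stand-in for a JSONL file opened with open(path, encoding="utf-8").
fake_file = io.StringIO(
    '{"code": "print(1)", "language": "Python"}\n'
    '{"code": "int x = 0;", "language": "C"}\n'
    '{"code": "let x = 1;", "language": "Rust"}\n'
)
batches = list(iter_batches(fake_file, batch_size=2))
print([len(texts) for texts, _ in batches])  # [2, 1]
```

Each yielded batch would be vectorized and passed to `partial_fit`. For multiple epochs, the file is simply reopened and streamed again, so memory use depends on the batch size rather than the dataset size.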
---
## Results
### Test Accuracy
**~91.1%**
---
## Observations
The model performed significantly better than expected for such a simple architecture.
### Strengths
- Extremely fast training
- Fast inference
- Simple implementation
- Excellent scalability
### Weaknesses
- Difficulty separating structurally similar languages
- Limited contextual understanding
- Large sparse parameter matrix
- Performance ceiling reached relatively quickly
### Common Confusion Pairs
- C ↔ C++
- JavaScript ↔ TypeScript
- HTML ↔ Markdown
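
One way to surface pairs like these is to count misclassifications per unordered pair of languages. The sketch below uses hypothetical predictions, not the project's evaluation output, and the helper name `top_confusions` is made up for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    y_true: Sequence[str], y_pred: Sequence[str], k: int = 3
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent unordered (language, language) error pairs."""
    pairs: Counter = Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth != guess:
            # Sort so that C->C++ and C++->C count toward the same pair.
            pairs[tuple(sorted((truth, guess)))] += 1
    return pairs.most_common(k)


# Hypothetical labels and predictions, for illustration only.
y_true = ["C", "C", "C++", "JavaScript", "TypeScript", "HTML", "Python"]
y_pred = ["C++", "C", "C", "TypeScript", "TypeScript", "Markdown", "Python"]

print(top_confusions(y_true, y_pred))
```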
---
# Phase 2: FastText
## Motivation
After establishing the linear baseline, the next objective was to evaluate FastText.
FastText occupies an interesting position between classical machine learning and neural networks.
It introduces:
- Learned embeddings
- Character-level subword information
- Efficient training
- Low inference latency
while remaining dramatically smaller and faster than transformer models.
---
## Data Preparation
FastText's supervised mode expects one example per line, with each label marked by the `__label__` prefix:
```text
__label__Python print("hello")
```
A dedicated conversion pipeline was created to transform JSONL datasets into FastText format.
### Preventing Label Leakage
During preprocessing, special care was taken to prevent accidental label leakage.
Source code occasionally contained the token:
```text
__label__
```
which FastText interprets as a valid training label.
To prevent this issue:
```text
__label__ → __lbl__
```
was applied during dataset conversion.
This eliminated spurious classes and ensured correct training.
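
A conversion step with this escaping could be sketched as below. The `code` and `language` field names are assumed, and collapsing whitespace is one possible way to keep each example on a single line; the source does not state how newlines were handled.

```python
import json

LABEL_PREFIX = "__label__"


def to_fasttext_line(record: dict) -> str:
    """Convert one JSONL record into a single fastText supervised line."""
    code = record["code"]  # hypothetical field name
    # Neutralise the label prefix inside source code so it cannot become a class.
    code = code.replace(LABEL_PREFIX, "__lbl__")
    # fastText reads one example per line, so collapse newlines and other
    # whitespace runs into single spaces (an assumption, see above).
    code = " ".join(code.split())
    return f"{LABEL_PREFIX}{record['language']} {code}"


record = json.loads(
    '{"language": "Python", "code": "x = \\"__label__C\\"\\nprint(x)"}'
)
print(to_fasttext_line(record))
# __label__Python x = "__lbl__C" print(x)
```

The escaping runs before the real label is prepended, so the only `__label__` token left on the line is the intended one.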
---
## Architecture
### Configuration
```text
dim = 50
wordNgrams = 3
minn = 2
maxn = 5
minCount = 100
bucket = 50000
loss = softmax
epoch = 25
lr = 0.7
```
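
Expressed as keyword arguments for the `fasttext` Python package, the configuration might look like the sketch below. The learning rate is passed as `lr` in that API. The `train` helper and its file paths are hypothetical, and the import is deferred so the settings can be inspected without the package installed.

```python
# Configuration values taken from the list above, keyed by the argument
# names that fasttext.train_supervised accepts.
FASTTEXT_PARAMS = {
    "dim": 50,           # embedding dimension
    "wordNgrams": 3,     # word n-grams up to length 3
    "minn": 2,           # shortest character n-gram
    "maxn": 5,           # longest character n-gram
    "minCount": 100,     # drop words seen fewer than 100 times
    "bucket": 50_000,    # hash buckets shared by all n-grams
    "loss": "softmax",
    "epoch": 25,
    "lr": 0.7,
}


def train(train_path: str, model_path: str):
    """Train and save a supervised model (hypothetical helper)."""
    import fasttext  # lazy import: only needed when training actually runs

    model = fasttext.train_supervised(input=train_path, **FASTTEXT_PARAMS)
    model.save_model(model_path)
    return model


print(sorted(FASTTEXT_PARAMS))
```

After training, `model.test(path)` returns the sample count together with precision and recall at 1, which gives a quick check against a held-out file in the same format.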
---
## Hyperparameter Exploration
A significant amount of experimentation was performed around:
- Embedding dimension
- Character subword lengths
- Vocabulary size
- Bucket size
- Epoch count
- Learning rate
- Model size reduction
The goal was not merely to maximize accuracy, but also to produce a compact deployable model.
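
To see why embedding dimension and bucket size dominate model size, a rough parameter count helps. The sketch below assumes an unquantized model stored as 32-bit floats and ignores the dictionary and file header. The vocabulary and label counts are placeholders, not figures from this project.

```python
def fasttext_param_count(n_words: int, n_labels: int, dim: int, bucket: int) -> int:
    """Approximate number of weights in an unquantized supervised model."""
    input_rows = n_words + bucket  # word vectors plus hashed n-gram buckets
    output_rows = n_labels         # one output vector per class with softmax
    return (input_rows + output_rows) * dim


# n_words and n_labels are hypothetical; dim and bucket match the configuration.
params = fasttext_param_count(n_words=20_000, n_labels=100, dim=50, bucket=50_000)
size_mb = params * 4 / 1e6  # 4 bytes per float32 weight

print(params, round(size_mb, 2))  # 3505000 14.02
```

Under these assumptions, the bucket table alone accounts for 2.5 million of the weights, so halving `bucket` or `dim` shrinks the model far more than trimming the label set.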
---
## Results
### Test Accuracy
**~95.5%**
### Improvement Over SGD
**+4.4 percentage points**
---
## Observations
FastText substantially outperformed the linear baseline.
### Key Findings
- Character subwords are extremely powerful for source code.
- Many language-specific keywords are captured effectively.
- FastText dramatically reduced confusion between related languages.
- Training remained relatively fast despite the dataset scale.
FastText proved to be one of the strongest accuracy-to-compute trade-offs observed during the project.
---
# Phase 3: ModernBERT
## Motivation
After achieving strong results with FastText, the next stage of the project explored whether transformer architectures could further improve programming language classification performance.
Unlike FastText, transformer models can learn:
- Long-range dependencies
- Global context
- Structural relationships
- Context-aware representations
The goal was to determine whether additional model capacity translates into meaningful real-world gains for source code language identification.
---
## Architecture
### Model
- ModernBERT-base
### Task
- Sequence Classification
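
A sequence-classification setup along these lines could be loaded with Hugging Face Transformers as sketched below. The checkpoint name `answerdotai/ModernBERT-base` is the public base model and is assumed here, since the source does not name the exact checkpoint. Both helper functions are hypothetical, and the `transformers` import is deferred so the label mapping can be checked on its own.

```python
from typing import Dict, List, Tuple

MODEL_NAME = "answerdotai/ModernBERT-base"  # assumed checkpoint


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build stable id <-> label mappings from the set of language names."""
    ordered = sorted(set(labels))
    id2label = dict(enumerate(ordered))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(labels: List[str]):
    """Load tokenizer and model with a fresh classification head (hypothetical helper)."""
    # Lazy import: requires a transformers release that includes ModernBERT.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


id2label, label2id = build_label_maps(["Python", "C", "Rust", "C"])
print(id2label)  # {0: 'C', 1: 'Python', 2: 'Rust'}
```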
## Results
### Approximate Test Accuracy
**~97–98%**
### Improvement Over FastText
**~1.5–2.5 percentage points**
---
## Observations
ModernBERT achieved the highest overall accuracy among all models tested.
However, experimentation revealed that the improvement over FastText was relatively small considering the large increase in computational requirements.
Compared with FastText:
- Training time increased dramatically
- GPU memory usage increased significantly
- Inference became substantially slower
- Model size increased considerably
- Deployment became more complex
The accuracy gain was therefore small relative to the added cost on every one of these axes.
---
## Key Finding
For programming language classification specifically:
> Transformer-based neural networks do not appear to be the most efficient solution for this task.
Programming languages contain strong lexical and structural signals that can already be captured extremely effectively using lightweight approaches.
FastText achieved performance surprisingly close to ModernBERT while requiring only a fraction of:
- Compute
- Training time
- Memory
- Storage
- Inference cost
---
# Current Benchmark Summary
| Model | Test Accuracy | Relative Compute |
|--------|--------------:|-----------------:|
| SGD Logistic Regression | ~91.1% | Very Low |
| FastText | ~95.5% | Low |
| ModernBERT-base | ~97–98% | Extremely High |
---
# Current Conclusions
## 1. Classical machine learning remains surprisingly competitive
Character-level linear models establish a strong baseline even at large scale.
---
## 2. FastText provides the strongest accuracy-to-compute ratio
Current experiments indicate FastText delivers the best balance of:
- Accuracy
- Training speed
- Inference speed
- Memory efficiency
- Deployment simplicity
while remaining within only a few percentage points of transformer performance.
---