---
language:
- code
tags:
- code
- programming-language
- classification
- bert
- text-classification
license: apache-2.0
datasets:
- kaushik-harsh-99/Code-Language-Classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: code-lang-bert-small
  results:
  - task:
      type: text-classification
      name: Programming Language Identification
    dataset:
      type: kaushik-harsh-99/Code-Language-Classification
      name: Code Language Classification
      split: test
    metrics:
    - type: accuracy
      value: 0.9663
    - type: f1 (macro)
      value: 0.9662
    - type: f1 (weighted)
      value: 0.9662
    - type: precision (macro)
      value: 0.9663
    - type: recall (macro)
      value: 0.9663
---
# Model Card for code-lang-bert-small
A fine-tuned BERT-small model for identifying programming languages from code snippets. The model classifies raw source code into one of 16 supported languages, reaching 96.63% accuracy on the held-out test set.
## Model Details
### Model Description
This model is a fine-tuned version of `prajjwal1/bert-small` (29M parameters) designed for the task of programming language identification. By analyzing the syntax, keywords, and structural patterns of source code, it accurately predicts the programming language of a given snippet.
- **Developed by:** Pankaj8922
- **Model type:** Encoder-only Transformer (BERT-small) for sequence classification
- **Language(s):** 16 programming and markup languages (see below)
- **License:** Apache 2.0
- **Finetuned from model:** [prajjwal1/bert-small](https://huggingface.co/prajjwal1/bert-small)
### Supported Languages
Rust, Java, Dart, Python, Go, HTML, JavaScript, TypeScript, C, CSS, C#, Markdown, Assembly, Lua, C++, Kotlin
## Uses
### Direct Use
The model is intended for classifying code snippets. It can be used directly with the Hugging Face `pipeline` API or integrated into applications for code tagging, automated documentation, or content filtering.
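```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Pankaj8922/code-lang-bert-small"
)

# The snippet is passed to the model as plain text; it is never executed.
code_snippet = """
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    mid = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + mid + quicksort(right)
"""

result = classifier(code_snippet)
print(result)
# Example output (exact score will vary):
# [{'label': 'Python', 'score': 0.99}]
```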
### Out-of-Scope Use
The model is trained to classify full files or substantial code snippets. It may not perform well on:
- Very short, ambiguous one-liners.
- Heavily obfuscated or minified code.
- Code containing multiple languages (e.g., a Python file with extensive embedded SQL).
- Languages not present in the 16 supported classes.
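
One way to soften the cases above is to treat low-confidence predictions as unknown instead of trusting the top label. The sketch below assumes the `[{'label': ..., 'score': ...}]` output shape shown in the Direct Use example; the helper name `label_or_unknown` and the 0.80 threshold are hypothetical and would need tuning on your own data. A threshold will not catch every unsupported language, since the model can still assign a high score to input it has never seen.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def label_or_unknown(predictions: List[Prediction], threshold: float = 0.80) -> str:
    """Return the top label, or "unknown" if its score is below `threshold`.

    `predictions` is a list of {'label': str, 'score': float} dicts, the shape
    printed in the Direct Use example. The threshold is an illustrative value,
    not one recommended by the model author.
    """
    if not predictions:
        return "unknown"
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else "unknown"


# Hand-written example outputs (not real model outputs).
confident = [{"label": "Python", "score": 0.99}]
ambiguous = [{"label": "JavaScript", "score": 0.55}]

print(label_or_unknown(confident))  # Python
print(label_or_unknown(ambiguous))  # unknown
```

In practice the input would be `classifier(code_snippet)` rather than a hand-written list.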
## Bias, Risks, and Limitations
The model may exhibit biases present in the training data distribution. Languages with syntactically similar constructs (e.g., C and C++, JavaScript and TypeScript) are the most common sources of confusion, as reflected in the confusion matrix. Performance on code from very niche or domain-specific libraries may be lower.
## Training Details
### Training Data
The model was trained on the [Code-Language-Classification](https://huggingface.co/datasets/kaushik-harsh-99/Code-Language-Classification) dataset. The official `train`, `validation`, and `test` splits were used.
- **Train samples:** 1,600,000
- **Validation samples:** 32,000
- **Test samples:** 32,000
- **Classes:** 16 (perfectly balanced; 2,000 samples per class in the test set)
### Training Procedure
The BERT-small model was fine-tuned on 2 x T4 GPUs with dynamic padding for efficiency. Training was configured for 5 epochs with early stopping, but it was stopped manually after 4 epochs because the model had already converged.
- **Batch size:** 256 (128 per device x 2 GPUs)
- **Learning rate:** 3e-5
- **Optimizer:** AdamW (weight decay: 0.01)
- **Max sequence length:** 512 tokens
- **Early stopping patience:** 2 epochs
- **Checkpointing:** Best model based on validation accuracy saved to the Hub.
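
As a rough sketch of what the settings above imply, the snippet below collects them in a plain dictionary and derives the number of optimizer steps. The key names are borrowed from `transformers.TrainingArguments` as an assumption; the actual training script is not shown in this card, and gradient accumulation is assumed to be 1.

```python
import math

# Hyperparameters copied from the list above. Key names follow
# transformers.TrainingArguments by assumption; the real script may differ.
train_config = {
    "per_device_train_batch_size": 128,
    "learning_rate": 3e-5,
    "weight_decay": 0.01,
    "num_train_epochs": 5,
}
NUM_GPUS = 2
TRAIN_SAMPLES = 1_600_000
EPOCHS_COMPLETED = 4  # training was stopped manually after 4 epochs

# Effective batch size is the per-device batch times the number of GPUs.
effective_batch_size = train_config["per_device_train_batch_size"] * NUM_GPUS
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch_size)
total_steps = steps_per_epoch * EPOCHS_COMPLETED

print(effective_batch_size)  # 256
print(steps_per_epoch)       # 6250
print(total_steps)           # 25000
```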
## Evaluation
The evaluation was performed on the held-out test set of 32,000 samples using the official script provided in the repository.
### Testing Metrics
| Metric | Value |
|------------------|----------|
| Accuracy | 96.63% |
| Macro F1 | 96.62% |
| Weighted F1 | 96.62% |
| Macro Precision | 96.63% |
| Macro Recall | 96.63% |
| Eval Loss | 0.1147 |
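
For readers unfamiliar with the averaging schemes in the table, here is a minimal pure-Python sketch of accuracy, macro F1, and weighted F1. The labels are toy data invented for illustration, not model outputs, and the function name `f1_report` is hypothetical. It also illustrates why macro and weighted F1 coincide on a class-balanced test set: every class has the same support, so both averages weight the classes equally.

```python
from collections import Counter
from typing import Dict, List


def f1_report(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute accuracy, macro F1, and support-weighted F1."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    f1_per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_per_class[c] = 2 * tp / denom if denom else 0.0

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro: unweighted mean over classes. Weighted: mean weighted by support.
    macro_f1 = sum(f1_per_class.values()) / len(classes)
    weighted_f1 = sum(f1_per_class[c] * support[c] for c in classes) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy, class-balanced labels (4 samples per class), for illustration only.
y_true = ["C"] * 4 + ["C++"] * 4 + ["Lua"] * 4
y_pred = ["C", "C", "C", "C++"] + ["C++", "C++", "C++", "C"] + ["Lua"] * 4

report = f1_report(y_true, y_pred)
print(report)
```

On this toy input the per-class F1 scores are 0.75 (C), 0.75 (C++), and 1.0 (Lua), so macro and weighted F1 both come out to about 0.833.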
### Per-Class Performance
| Language | Precision | Recall | F1-Score |
|------------|-----------|--------|----------|
| Rust | 0.9885 | 0.9925 | 0.9905 |
| Java | 0.9731 | 0.9785 | 0.9758 |
| Dart | 0.9772 | 0.9850 | 0.9811 |
| Python | 0.9890 | 0.9880 | 0.9885 |
| Go | 0.9859 | 0.9800 | 0.9829 |
| HTML | 0.9279 | 0.8885 | 0.9078 |
| JavaScript | 0.8859 | 0.8930 | 0.8894 |
| TypeScript | 0.9466 | 0.9580 | 0.9523 |
| C | 0.9566 | 0.9375 | 0.9470 |
| CSS | 0.9728 | 0.9845 | 0.9786 |
| C# | 0.9895 | 0.9870 | 0.9882 |
| Markdown | 0.9671 | 0.9695 | 0.9683 |
| Assembly | 0.9935 | 0.9945 | 0.9940 |
| Lua | 0.9885 | 0.9915 | 0.9900 |
| C++ | 0.9770 | 0.9760 | 0.9765 |
| Kotlin | 0.9840 | 0.9870 | 0.9855 |
### Key Observations
- The model performs exceptionally well on most languages, with 11 of 16 classes achieving an F1-score of 97% or higher.
- **JavaScript** (F1: 0.89) and **HTML** (F1: 0.91) are the most challenging classes, commonly confused with each other and with TypeScript/CSS.
- The model is highly accurate on structurally distinctive languages such as **Assembly** (F1: 0.994) and **Python** (F1: 0.989).
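
To see which languages are confused with each other, one can count the (true, predicted) pairs among misclassified samples. The sketch below uses invented toy labels, not the actual test-set predictions, and `top_confusions` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Tuple


def top_confusions(
    y_true: List[str], y_pred: List[str], n: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the `n` most frequent (true, predicted) pairs among errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Toy labels for illustration only.
y_true = ["JavaScript", "JavaScript", "JavaScript", "HTML", "HTML", "C"]
y_pred = ["TypeScript", "TypeScript", "HTML", "JavaScript", "HTML", "C++"]

for (true_label, predicted), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {predicted}: {count}")
```

Running the same helper on real test-set predictions would give a ranked list of the confusions summarized above.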
## Environmental Impact
- **Hardware Type:** 2 x NVIDIA T4 GPUs
- **Hours used:** Not specified (training ran for approx. 4 epochs)
- **Cloud Provider:** Not specified
- **Compute Region:** Not specified
*Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).*
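
As a back-of-the-envelope sketch of this kind of estimate, emissions can be approximated as energy used times the carbon intensity of the grid. The 70 W figure is the NVIDIA T4's rated board power; the training duration and carbon intensity below are placeholders, since this card does not report hours or region, and the function name `estimate_co2_kg` is hypothetical.

```python
def estimate_co2_kg(num_gpus: int, gpu_watts: float, hours: float,
                    kg_co2_per_kwh: float) -> float:
    """Rough estimate: GPU energy (kWh) times grid carbon intensity.

    Assumes the GPUs run at their rated power for the whole job and ignores
    CPU, cooling, and other datacenter overhead.
    """
    energy_kwh = num_gpus * gpu_watts * hours / 1000.0
    return energy_kwh * kg_co2_per_kwh


# Placeholder inputs: hours and carbon intensity are NOT from the model card.
estimate = estimate_co2_kg(num_gpus=2, gpu_watts=70.0, hours=10.0, kg_co2_per_kwh=0.4)
print(f"{estimate:.2f} kg CO2eq")
```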