	---
	language:
	- code
	tags:
	- code
	- programming-language
	- classification
	- bert
	- text-classification
	license: apache-2.0
	datasets:
	- kaushik-harsh-99/Code-Language-Classification
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: code-lang-bert-small
	  results:
	  - task:
	      type: text-classification
	      name: Programming Language Identification
	    dataset:
	      type: kaushik-harsh-99/Code-Language-Classification
	      name: Code Language Classification
	      split: test
	    metrics:
	    - type: accuracy
	      value: 0.9663
	    - type: f1 (macro)
	      value: 0.9662
	    - type: f1 (weighted)
	      value: 0.9662
	    - type: precision (macro)
	      value: 0.9663
	    - type: recall (macro)
	      value: 0.9663
	---

	# Model Card for code-lang-bert-small

	A fine-tuned BERT-small model for identifying programming languages from code snippets. The model classifies raw source code into one of 16 supported languages and reaches 96.63% accuracy on the held-out test set.

	## Model Details

	### Model Description

	This model is a fine-tuned version of `prajjwal1/bert-small` (29M parameters) designed for the task of programming language identification. By analyzing the syntax, keywords, and structural patterns of source code, it accurately predicts the programming language of a given snippet.

	- Developed by: Pankaj8922
	- Model type: Encoder-only Transformer (BERT-small) for sequence classification
	- Language(s): 16 programming and markup languages (see below)
	- License: Apache 2.0
	- Finetuned from model: [prajjwal1/bert-small](https://huggingface.co/prajjwal1/bert-small)

	### Supported Languages

	Rust, Java, Dart, Python, Go, HTML, JavaScript, TypeScript, C, CSS, C#, Markdown, Assembly, Lua, C++, Kotlin

	## Uses

	### Direct Use

	The model is intended for classifying code snippets. It can be used directly with the Hugging Face `pipeline` API or integrated into applications for code tagging, automated documentation, or content filtering.

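	```python
	from transformers import pipeline

	# Load the fine-tuned classifier from the Hugging Face Hub.
	classifier = pipeline(
	    "text-classification",
	    model="Pankaj8922/code-lang-bert-small",
	)

	# The snippet is passed to the model as plain text; it is never executed.
	code_snippet = """
	def quicksort(arr):
	    if len(arr) <= 1:
	        return arr
	    pivot = arr[len(arr) // 2]
	    left = [x for x in arr if x < pivot]
	    mid = [x for x in arr if x == pivot]
	    right = [x for x in arr if x > pivot]
	    return quicksort(left) + mid + quicksort(right)
	"""

	result = classifier(code_snippet)
	print(result)
	# Example output: [{'label': 'Python', 'score': 0.99}]
	```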

	### Out-of-Scope Use

	The model is trained to classify full files or substantial code snippets. It may not perform well on:
	- Very short, ambiguous one-liners.
	- Heavily obfuscated or minified code.
	- Code containing multiple languages (e.g., a Python file with extensive embedded SQL).
	- Languages not present in the 16 supported classes.
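
Because the classifier always returns one of the 16 labels, inputs of the kinds listed above still receive a prediction. One possible mitigation is to treat low-score predictions as unknown. The sketch below is only an illustration: `label_or_unknown` and the 0.8 threshold are hypothetical and not part of this repository, and the input is assumed to have the shape the `pipeline` call returns for a single snippet.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def label_or_unknown(predictions: List[Prediction], threshold: float = 0.8) -> str:
    """Return the top label, or "unknown" when its score is below `threshold`.

    `predictions` is assumed to look like [{"label": ..., "score": ...}].
    The 0.8 default is an illustrative value, not a tuned one.
    """
    if not predictions:
        return "unknown"
    best = max(predictions, key=lambda p: p["score"])
    return str(best["label"]) if best["score"] >= threshold else "unknown"


print(label_or_unknown([{"label": "Python", "score": 0.99}]))  # Python
print(label_or_unknown([{"label": "C", "score": 0.41}]))       # unknown
```

A high score does not guarantee that an input is in scope, so a threshold like this reduces rather than eliminates mislabeled out-of-scope inputs.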

	## Bias, Risks, and Limitations

	The model may exhibit biases present in the training data distribution. Languages with syntactically similar constructs (e.g., C and C++, JavaScript and TypeScript) are the most common sources of confusion, as reflected in the confusion matrix. Performance on code from very niche or domain-specific libraries may be lower.
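
To see which language pairs are confused most often, one option is to rank the off-diagonal cells of a confusion matrix. The sketch below assumes rows are true labels and columns are predicted labels; `top_confusions` is a hypothetical helper, and the counts are made up for illustration rather than taken from this model's evaluation.

```python
from typing import List, Tuple


def top_confusions(
    matrix: List[List[int]], labels: List[str], k: int = 3
) -> List[Tuple[str, str, int]]:
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    pairs = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]


# Made-up counts for illustration only; rows are true labels, columns are predictions.
labels = ["C", "C++", "JavaScript", "TypeScript"]
matrix = [
    [90, 8, 1, 1],
    [5, 93, 1, 1],
    [0, 1, 85, 14],
    [1, 0, 9, 90],
]
print(top_confusions(matrix, labels, k=2))
# [('JavaScript', 'TypeScript', 14), ('TypeScript', 'JavaScript', 9)]
```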

	## Training Details

	### Training Data

	The model was trained on the [Code-Language-Classification](https://huggingface.co/datasets/kaushik-harsh-99/Code-Language-Classification) dataset. The official `train`, `validation`, and `test` splits were used.
	- Train samples: 1,600,000
	- Validation samples: 32,000
	- Test samples: 32,000
	- Classes: 16 (perfectly balanced; 2,000 samples per class in the test set)

	### Training Procedure

	The BERT-small model was fine-tuned on 2 x T4 GPUs with dynamic padding for efficiency. Training was configured for 5 epochs with early stopping, but was manually stopped after 4 epochs as the model had already converged.
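
Dynamic padding pads each batch only to its own longest sequence instead of always padding to the maximum length. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper, the padding id of 0 is an assumption about the tokenizer, and the truncation is simplified (it does not re-append a closing special token). In Transformers this job is usually done by `DataCollatorWithPadding`, though the card does not say which mechanism was used.

```python
from typing import List

PAD_ID = 0        # assumed padding token id
MAX_LENGTH = 512  # max sequence length from the training configuration


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Truncate to MAX_LENGTH, then pad every sequence to the batch's longest."""
    truncated = [ids[:MAX_LENGTH] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]


batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_batch(batch)
print(padded)          # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(len(padded[0]))  # 4, rather than 512 with fixed-length padding
```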

	- Batch size: 256 (128 per device x 2 GPUs)
	- Learning rate: 3e-5
	- Optimizer: AdamW (weight decay: 0.01)
	- Max sequence length: 512 tokens
	- Early stopping patience: 2 epochs
	- Checkpointing: Best model based on validation accuracy saved to the Hub.
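
The settings above imply a rough step count, and the patience rule can be written out in a few lines. The sketch below assumes no gradient accumulation and that early stopping monitors validation accuracy once per epoch; `stop_epoch` is a hypothetical helper and the accuracy curve is invented, not taken from the actual run.

```python
import math
from typing import List

TRAIN_SAMPLES = 1_600_000
PER_DEVICE_BATCH = 128
NUM_GPUS = 2

effective_batch = PER_DEVICE_BATCH * NUM_GPUS
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
print(effective_batch, steps_per_epoch, 4 * steps_per_epoch)  # 256 6250 25000


def stop_epoch(val_accuracy: List[float], patience: int = 2) -> int:
    """Return the 1-based epoch at which early stopping would trigger.

    Stops once `patience` consecutive epochs fail to beat the best
    validation accuracy so far; returns len(val_accuracy) if never triggered.
    """
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracy)


# Hypothetical accuracy curve, not the run's actual values.
print(stop_epoch([0.950, 0.962, 0.961, 0.960, 0.959]))  # 4
```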

	## Evaluation

	The evaluation was performed on the held-out test set of 32,000 samples using the official script provided in the repository.

	### Testing Metrics

	| Metric | Value |
	|------------------|----------|
	| Accuracy | 96.63% |
	| Macro F1 | 96.62% |
	| Weighted F1 | 96.62% |
	| Macro Precision | 96.63% |
	| Macro Recall | 96.63% |
	| Eval Loss | 0.1147 |
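
Macro F1 and weighted F1 are identical in the table, which is what one would expect when every class has the same number of test samples. The sketch below shows why, using three illustrative per-class scores rather than the full results; `macro_average` and `weighted_average` are hypothetical helpers and the skewed support counts are invented for contrast.

```python
from typing import List


def macro_average(scores: List[float]) -> float:
    """Unweighted mean over classes."""
    return sum(scores) / len(scores)


def weighted_average(scores: List[float], support: List[int]) -> float:
    """Mean over classes weighted by each class's number of test samples."""
    return sum(s * n for s, n in zip(scores, support)) / sum(support)


# Three illustrative per-class F1 values (not the full table).
f1 = [0.99, 0.89, 0.95]

balanced = [2000, 2000, 2000]
skewed = [5000, 500, 500]

print(round(macro_average(f1), 4))               # 0.9433
print(round(weighted_average(f1, balanced), 4))  # 0.9433
print(round(weighted_average(f1, skewed), 4))    # 0.9783
```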

	### Per-Class Performance

	| Language | Precision | Recall | F1-Score |
	|------------|-----------|--------|----------|
	| Rust | 0.9885 | 0.9925 | 0.9905 |
	| Java | 0.9731 | 0.9785 | 0.9758 |
	| Dart | 0.9772 | 0.9850 | 0.9811 |
	| Python | 0.9890 | 0.9880 | 0.9885 |
	| Go | 0.9859 | 0.9800 | 0.9829 |
	| HTML | 0.9279 | 0.8885 | 0.9078 |
	| JavaScript | 0.8859 | 0.8930 | 0.8894 |
	| TypeScript | 0.9466 | 0.9580 | 0.9523 |
	| C | 0.9566 | 0.9375 | 0.9470 |
	| CSS | 0.9728 | 0.9845 | 0.9786 |
	| C# | 0.9895 | 0.9870 | 0.9882 |
	| Markdown | 0.9671 | 0.9695 | 0.9683 |
	| Assembly | 0.9935 | 0.9945 | 0.9940 |
	| Lua | 0.9885 | 0.9915 | 0.9900 |
	| C++ | 0.9770 | 0.9760 | 0.9765 |
	| Kotlin | 0.9840 | 0.9870 | 0.9855 |
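
Each F1-score in the table is the harmonic mean of that row's precision and recall. As a quick check, the sketch below recomputes three rows from the precision and recall values listed above; `f1_score` is a small helper written for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Rust": (0.9885, 0.9925),
    "HTML": (0.9279, 0.8885),
    "JavaScript": (0.8859, 0.8930),
}
for language, (p, r) in rows.items():
    print(language, round(f1_score(p, r), 4))
# Rust 0.9905
# HTML 0.9078
# JavaScript 0.8894
```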

	### Key Observations
	- The model performs exceptionally well on most languages, with 11 of 16 classes achieving an F1-score of 97% or higher.
	- JavaScript (F1: 0.89) and HTML (F1: 0.91) are the most challenging classes, commonly confused with each other and with TypeScript/CSS.
	- The model is most reliable on structurally distinctive languages such as Assembly (F1: 0.994) and Python (F1: 0.989).
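
The counts behind these observations can be reproduced from the per-class F1 column. The short sketch below copies those values into a dictionary (the variable names are chosen for this illustration) and derives the 97% tally and the weakest and strongest classes.

```python
# F1-scores copied from the per-class table.
f1_by_language = {
    "Rust": 0.9905, "Java": 0.9758, "Dart": 0.9811, "Python": 0.9885,
    "Go": 0.9829, "HTML": 0.9078, "JavaScript": 0.8894, "TypeScript": 0.9523,
    "C": 0.9470, "CSS": 0.9786, "C#": 0.9882, "Markdown": 0.9683,
    "Assembly": 0.9940, "Lua": 0.9900, "C++": 0.9765, "Kotlin": 0.9855,
}

at_least_97 = [lang for lang, f1 in f1_by_language.items() if f1 >= 0.97]
ranked = sorted(f1_by_language, key=f1_by_language.get)  # lowest F1 first

print(len(at_least_97), "of", len(f1_by_language))  # 11 of 16
print(ranked[:2])   # ['JavaScript', 'HTML']
print(ranked[-1])   # Assembly
```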

	## Environmental Impact

	- Hardware Type: 2 x NVIDIA T4 GPUs
	- Hours used: Not reported (training ran for about 4 epochs)
	- Cloud Provider: Not specified
	- Compute Region: Not specified

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
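
For a back-of-the-envelope figure, emissions can be approximated as energy used times the grid's carbon intensity. The sketch below is a simplification of that idea, not the calculator's exact method: `estimate_co2_kg` is a hypothetical helper, the 70 W value is the T4's rated board power, and the hours and carbon intensity are placeholders because this card does not report them.

```python
def estimate_co2_kg(
    gpu_power_watts: float,
    num_gpus: int,
    hours: float,
    carbon_intensity_kg_per_kwh: float,
) -> float:
    """Energy (kWh) times grid carbon intensity (kg CO2eq per kWh)."""
    energy_kwh = gpu_power_watts * num_gpus * hours / 1000.0
    return energy_kwh * carbon_intensity_kg_per_kwh


# 70 W is the T4's rated board power; hours and intensity are placeholders
# because the card does not report them.
print(round(estimate_co2_kg(70, 2, hours=10, carbon_intensity_kg_per_kwh=0.4), 3))
# 0.56
```

This treats the GPUs as running at full rated power for the whole run and ignores CPU, memory, and data-center overhead, so it is only a rough estimate.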