---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- python
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Python Code Comment Classification
results:
- task:
type: text-classification
name: Multi-label Text Classification
dataset:
name: NLBSE Code Comment Classification Dataset (Python)
type: NLBSE/nlbse26-code-comment-classification
split: test
metrics:
- type: f1
name: Macro F1
value: 0.6385
- type: f1
name: Micro F1
value: 0.6781
- type: precision
name: Macro Precision
value: 0.5900
- type: recall
name: Macro Recall
value: 0.7061
- type: accuracy
name: Subset Accuracy
value: 0.5690
---
# Transformer Model (CodeBERT) for Python Code Comment Classification
## Model Details
- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Python (code comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0
### Description
This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment.
The classifier operates on the project’s `combo` field (concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector.
### Label Set
For Python, the model predicts the following 5 categories (fixed order in the classifier head):
1. `Usage`
2. `Parameters`
3. `DevelopmentNotes`
4. `Expand`
5. `Summary`
Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
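The decision rule can be sketched in plain Python (the logits below are illustrative values, not real model outputs):

```python
import math

LABELS = ["Usage", "Parameters", "DevelopmentNotes", "Expand", "Summary"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative logits for one comment sentence (not real model output).
logits = [2.1, -0.3, -1.5, 0.4, 1.2]

probs = [sigmoid(z) for z in logits]               # per-label probabilities
decisions = [1 if p > 0.5 else 0 for p in probs]   # independent 0/1 per label

predicted = [lab for lab, d in zip(LABELS, decisions) if d == 1]
```

Because each label is thresholded independently, any subset of the five categories (including the empty set) is a valid prediction.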
---
## Intended Use
The model is intended for:
- research on **code comment classification** in Python projects,
- mining and analysis of Python documentation comments,
- tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support).
It is designed for **Python code comments** in English or English-like technical language.
### Out-of-Scope Uses
- Generic natural language classification outside software engineering.
- Non-English comments without additional fine-tuning or adaptation.
- Use in safety- or life-critical decision making.
---
## Data
### Training Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Python train split
- **Size (train):** ~1.4k original training examples (with optional supersampled expansion to ~2k examples, depending on the configuration)
- **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`)
- **Preprocessing:**
- Comments extracted from open-source Python projects.
- Each instance represented via the `combo` field: `"<comment_sentence> | <class_context>"`.
- The project’s preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`. The metrics reported here correspond to the current transformer configuration logged in MLflow for Python.
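The `combo` field construction can be sketched as follows (`build_combo` is a hypothetical helper name; the separator format comes from the card's description above):

```python
def build_combo(comment_sentence: str, class_context: str) -> str:
    """Concatenate a comment sentence with its compact context string,
    following the card's combo format: "<comment_sentence> | <class_context>".
    """
    return f"{comment_sentence} | {class_context}"

# Illustrative example input (not taken from the dataset).
combo = build_combo("Returns the parsed configuration.", "ConfigLoader")
```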
### Evaluation Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Python test split
- **Size (test):** ~300 comment sentences
- **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match).
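The difference between subset accuracy and per-label accuracy can be illustrated with a toy sketch (the label matrices below are made up for illustration):

```python
# Toy multi-label targets and predictions over the 5 Python categories.
y_true = [[1, 0, 0, 0, 1],
          [0, 1, 0, 0, 1],
          [0, 0, 1, 1, 0]]
y_pred = [[1, 0, 0, 0, 1],   # exact match
          [0, 1, 0, 0, 0],   # one label missed
          [0, 0, 1, 0, 0]]   # one label missed

# Subset accuracy: a row counts only if ALL 5 decisions are correct.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-label (micro) accuracy: each of the 15 decisions counts on its own.
flat = [(ti, pi) for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p)]
micro_acc = sum(ti == pi for ti, pi in flat) / len(flat)
```

Here `subset_acc` is 1/3 while `micro_acc` is 13/15, which mirrors why the card reports a subset accuracy (0.5690) well below the per-label micro accuracy (0.8441).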
---
## Metrics
### Core Evaluation Metrics (Python, test split)
From the training/evaluation run logged in MLflow:
| Language | Category         | Precision | Recall | F1   |
|----------|------------------|-----------|--------|------|
| python   | Usage            | 0.80      | 0.76   | 0.78 |
| python   | Parameters       | 0.74      | 0.86   | 0.79 |
| python   | DevelopmentNotes | 0.41      | 0.50   | 0.45 |
| python   | Expand           | 0.49      | 0.67   | 0.57 |
| python   | Summary          | 0.63      | 0.82   | 0.71 |
- **Micro F1:** 0.6781
- **Macro F1:** 0.6385
- **Micro Precision:** 0.6230
- **Micro Recall:** 0.7438
- **Macro Precision:** 0.5900
- **Macro Recall:** 0.7061
- **Subset Accuracy (exact match):** 0.5690
- **Micro Accuracy (per-label):** 0.8441
- **Eval Loss (BCE with logits):** 0.6727
- **Train Loss (final epoch):** 0.2937
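The gap between the macro and micro aggregates above follows from how the two averages are formed. A minimal sketch with made-up per-class counts (one frequent label, one rare label):

```python
# Illustrative (TP, FP, FN) counts for a frequent and a rare label.
counts = {"Usage": (40, 10, 10), "DevelopmentNotes": (2, 3, 2)}

def f1(tp: int, fp: int, fn: int) -> float:
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

per_class = {lab: f1(*c) for lab, c in counts.items()}

# Macro F1: unweighted mean over labels -- the rare label drags it down.
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro F1: pool the counts first -- dominated by the frequent label.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)
```

As in the reported results (micro F1 0.6781 vs. macro F1 0.6385), pooling the counts favours the frequent label, so `micro_f1` exceeds `macro_f1`.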
### Benchmarking Metrics
Average performance for the Python transformer benchmark:
- **Average Macro F1:** 0.6385
- **Average Precision (macro):** 0.5900
- **Average Recall (macro):** 0.7061
- **Average Runtime:** ~0.94 seconds (benchmark configuration)
- **Average GFLOPs:** ~1823.25
These results indicate that the transformer captures useful patterns across all five Python comment categories, with stronger performance on frequent labels and reasonable performance on less frequent ones.
---
## Quantitative Analysis
The model is evaluated in a multi-label setting:
- **Micro metrics** emphasize the overall correctness across all label decisions.
- **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`).
Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable.
---
## Training Details
### Objective and Architecture
- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 5`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance.
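One common way to derive the per-label positive class weights from training label frequencies is the negatives-to-positives ratio; this is a sketch under that assumption (the project's exact formula may differ), with the resulting vector intended for `torch.nn.BCEWithLogitsLoss(pos_weight=...)`:

```python
# Toy train label matrix (rows = examples, columns = the 5 categories).
train_labels = [
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0],
]

n = len(train_labels)
pos = [sum(row[j] for row in train_labels) for j in range(5)]

# pos_weight[j] = negatives / positives: rare labels get a larger weight.
# A label with no positives would divide by zero; guard with max(..., 1).
pos_weight = [(n - p) / max(p, 1) for p in pos]
```

Rare labels such as `DevelopmentNotes` thus contribute more to the loss per positive example, which counteracts the imbalance the sampler also targets.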
### Hyperparameters
- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** Linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Threshold for prediction:** 0.5 (per-label on sigmoid probabilities)
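The warmup ratio of 0.1 means the learning rate ramps linearly from 0 to 2e-5 over the first 10% of optimizer steps, then decays linearly back to 0. A pure-Python sketch of that schedule shape (step counts are illustrative; in practice it is produced by the `transformers` linear scheduler):

```python
BASE_LR = 2e-5
TOTAL_STEPS = 100                       # illustrative; real value = epochs * steps_per_epoch
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)   # warmup ratio 0.1

def lr_at(step: int) -> float:
    """Linear warmup followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

peak = lr_at(WARMUP_STEPS)              # LR peaks right after warmup
```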
### Preprocessing and Balancing
- Training uses the **Python** split prepared by the project’s preprocessing pipeline.
- Optional supersampling (oversampling of underrepresented labels with a cap at the maximum label frequency) is available and can be enabled to improve macro performance.
- The test split remains unchanged and corresponds to the original NLBSE Python test partition.
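The capping idea can be sketched as follows. This is a deliberate simplification: it groups each toy example under a single dominant label, whereas real multi-label supersampling must handle examples carrying several labels at once.

```python
import random

random.seed(0)  # deterministic oversampling for the sketch

# Toy examples keyed by a (hypothetical) dominant label; counts are imbalanced.
examples = {
    "Summary": ["s1", "s2", "s3", "s4"],
    "DevelopmentNotes": ["d1"],
}

# Oversample each underrepresented label, capped at the max label frequency.
cap = max(len(rows) for rows in examples.values())
balanced = {
    lab: rows + random.choices(rows, k=cap - len(rows))
    for lab, rows in examples.items()
}

sizes = {lab: len(rows) for lab, rows in balanced.items()}
```

After balancing, every label bucket holds exactly `cap` examples; no label is oversampled beyond the most frequent one.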
### Hardware / Runtime
The reported runtime and GFLOPs are based on the project’s benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size.
---
## How to Use
Install `transformers` and `torch`:
```bash
pip install transformers torch
```
Then load the model and tokenizer (replace the model ID with your repository name):
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_ID = "se4ai2526-uniba/python-transformer" # replace with actual ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()
LABELS = [
"Usage",
"Parameters",
"DevelopmentNotes",
"Expand",
"Summary",
]
def predict_labels(texts, threshold: float = 0.5):
if isinstance(texts, str):
texts = [texts]
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt",
)
with torch.no_grad():
logits = model(**inputs).logits
probs = torch.sigmoid(logits)
preds = (probs > threshold).int().cpu().numpy()
results = []
for row in preds:
labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
results.append(labels)
return results
# Example
comments = [
"# Usage: call this function with a file path | module.py",
]
print(predict_labels(comments))
```
For full reproducibility consistent with the project, use the `ModelPredictor` wrapper and the same preprocessing used during training.
---
## Limitations and Biases
* **Domain-limited:** Trained only on Python code comments from open-source repositories.
* **Imbalanced labels:** Some categories are relatively underrepresented; performance on these labels can lag behind frequent ones.
* **Robustness:** Behavioral tests show that the current model:
* is deterministic and stable on duplicate inputs,
* aligns with several curated golden examples,
* remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced.
---
## Ethical Considerations
* The model reflects the style and biases of the open-source Python projects it was trained on.
* It does not filter offensive or inappropriate content in comments; it only predicts semantic categories.
* Outputs should be treated as assistive signals, not as authoritative judgements.
---
## Citation
If you use this model in academic work or derived systems, please cite:
> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.
BibTeX:
```bibtex
@misc{theclouds_nlbse26_code_comment_classification_python,
title = {NLBSE'26 Code Comment Classification: Python Model},
author = {TheClouds Team},
year = {2025},
note = {Model available on Hugging Face},
howpublished = {\url{To be published}}
}
```
Contact:
For questions, feedback, or collaboration requests related to this model, please contact:
> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it
Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
## Acknowledgements
This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and earlier SetFit and Random Forest baselines.