---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- python
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Python Code Comment Classification
results:
- task:
type: text-classification
name: Multi-label Text Classification
dataset:
name: NLBSE Code Comment Classification Dataset (Python)
type: NLBSE/nlbse26-code-comment-classification
split: test
metrics:
- type: f1
name: Macro F1
value: 0.6385
- type: f1
name: Micro F1
value: 0.6781
- type: precision
name: Macro Precision
value: 0.5900
- type: recall
name: Macro Recall
value: 0.7061
- type: accuracy
name: Subset Accuracy
value: 0.5690
---
# Transformer Model (CodeBERT) for Python Code Comment Classification
## Model Details
- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Python (code comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0
### Description
This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment.
The classifier operates on the project’s `combo` field (concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector.
### Label Set
For Python, the model predicts the following 5 categories (fixed order in the classifier head):
1. `Usage`
2. `Parameters`
3. `DevelopmentNotes`
4. `Expand`
5. `Summary`
Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
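The decision rule can be sketched in plain Python (the logits below are illustrative values, not real model outputs):

```python
import math

LABELS = ["Usage", "Parameters", "DevelopmentNotes", "Expand", "Summary"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative logits for one comment sentence (not real model output).
logits = [2.1, -0.3, -1.5, 0.4, 1.2]

probs = [sigmoid(z) for z in logits]               # per-label probabilities
decisions = [1 if p > 0.5 else 0 for p in probs]   # independent 0/1 per label

predicted = [lab for lab, d in zip(LABELS, decisions) if d == 1]
```

Because each label is thresholded independently, any subset of the five categories (including the empty set) is a valid prediction.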
---
## Intended Use
The model is intended for:
- research on **code comment classification** in Python projects,
- mining and analysis of Python documentation comments,
- tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support).
It is designed for **Python code comments** in English or English-like technical language.
### Out-of-Scope Uses
- Generic natural language classification outside software engineering.
- Non-English comments without additional fine-tuning or adaptation.
- Use in safety- or life-critical decision making.
---
## Data
### Training Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Python train split
- **Size (train):** ~1.4k original training examples (with optional supersampled expansion to ~2k examples, depending on the configuration)
- **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`)
- **Preprocessing:**
- Comments extracted from open-source Python projects.
- Each instance represented via the `combo` field: `"<comment_sentence> | <class_context>"`.
- The project’s preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`. The metrics reported here correspond to the current transformer configuration logged in MLflow for Python.
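The `combo` field construction can be sketched as follows (`build_combo` is a hypothetical helper name; the separator format comes from the card's description above):

```python
def build_combo(comment_sentence: str, class_context: str) -> str:
    """Concatenate a comment sentence with its compact context string,
    following the card's combo format: "<comment_sentence> | <class_context>".
    """
    return f"{comment_sentence} | {class_context}"

# Illustrative example input (not taken from the dataset).
combo = build_combo("Returns the parsed configuration.", "ConfigLoader")
```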
### Evaluation Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Python test split
- **Size (test):** ~300 comment sentences
- **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match).
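The difference between subset accuracy and per-label accuracy can be illustrated with a toy sketch (the label matrices below are made up for illustration):

```python
# Toy multi-label targets and predictions over the 5 Python categories.
y_true = [[1, 0, 0, 0, 1],
          [0, 1, 0, 0, 1],
          [0, 0, 1, 1, 0]]
y_pred = [[1, 0, 0, 0, 1],   # exact match
          [0, 1, 0, 0, 0],   # one label missed
          [0, 0, 1, 0, 0]]   # one label missed

# Subset accuracy: a row counts only if ALL 5 decisions are correct.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-label (micro) accuracy: each of the 15 decisions counts on its own.
flat = [(ti, pi) for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p)]
micro_acc = sum(ti == pi for ti, pi in flat) / len(flat)
```

Here `subset_acc` is 1/3 while `micro_acc` is 13/15, which mirrors why the card reports a subset accuracy (0.5690) well below the per-label micro accuracy (0.8441).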
---
## Metrics
### Core Evaluation Metrics (Python, test split)
From the training/evaluation run logged in MLflow:
| Language | Category         | Precision | Recall | F1   |
|----------|------------------|-----------|--------|------|
| python   | Usage            | 0.80      | 0.76   | 0.78 |
| python   | Parameters       | 0.74      | 0.86   | 0.79 |
| python   | DevelopmentNotes | 0.41      | 0.50   | 0.45 |
| python   | Expand           | 0.49      | 0.67   | 0.57 |
| python   | Summary          | 0.63      | 0.82   | 0.71 |
- **Micro F1:** 0.6781
- **Macro F1:** 0.6385
- **Micro Precision:** 0.6230
- **Micro Recall:** 0.7438
- **Macro Precision:** 0.5900
- **Macro Recall:** 0.7061
- **Subset Accuracy (exact match):** 0.5690
- **Micro Accuracy (per-label):** 0.8441
- **Eval Loss (BCE with logits):** 0.6727
- **Train Loss (final epoch):** 0.2937
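The gap between the macro and micro aggregates above follows from how the two averages are formed. A minimal sketch with made-up per-class counts (one frequent label, one rare label):

```python
# Illustrative (TP, FP, FN) counts for a frequent and a rare label.
counts = {"Usage": (40, 10, 10), "DevelopmentNotes": (2, 3, 2)}

def f1(tp: int, fp: int, fn: int) -> float:
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

per_class = {lab: f1(*c) for lab, c in counts.items()}

# Macro F1: unweighted mean over labels -- the rare label drags it down.
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro F1: pool the counts first -- dominated by the frequent label.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)
```

As in the reported results (micro F1 0.6781 vs. macro F1 0.6385), pooling the counts favours the frequent label, so `micro_f1` exceeds `macro_f1`.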
### Benchmarking Metrics
Average performance for the Python transformer benchmark:
- **Average Macro F1:** 0.6385
- **Average Precision (macro):** 0.5900
- **Average Recall (macro):** 0.7061
- **Average Runtime:** ~0.94 seconds (benchmark configuration)
- **Average GFLOPs:** ~1823.25
These results indicate that the transformer captures useful patterns across all five Python comment categories, with stronger performance on frequent labels and reasonable performance on less frequent ones.
---
## Quantitative Analysis
The model is evaluated in a multi-label setting:
- **Micro metrics** emphasize the overall correctness across all label decisions.
- **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`).
Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable.
---
## Training Details
### Objective and Architecture
- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 5`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance.
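One common way to derive the per-label positive class weights from training label frequencies is the negatives-to-positives ratio; this is a sketch under that assumption (the project's exact formula may differ), with the resulting vector intended for `torch.nn.BCEWithLogitsLoss(pos_weight=...)`:

```python
# Toy train label matrix (rows = examples, columns = the 5 categories).
train_labels = [
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0],
]

n = len(train_labels)
pos = [sum(row[j] for row in train_labels) for j in range(5)]

# pos_weight[j] = negatives / positives: rare labels get a larger weight.
# A label with no positives would divide by zero; guard with max(..., 1).
pos_weight = [(n - p) / max(p, 1) for p in pos]
```

Rare labels such as `DevelopmentNotes` thus contribute more to the loss per positive example, which counteracts the imbalance the sampler also targets.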
### Hyperparameters
- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** Linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Threshold for prediction:** 0.5 (per-label on sigmoid probabilities)
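The warmup ratio of 0.1 means the learning rate ramps linearly from 0 to 2e-5 over the first 10% of optimizer steps, then decays linearly back to 0. A pure-Python sketch of that schedule shape (step counts are illustrative; in practice it is produced by the `transformers` linear scheduler):

```python
BASE_LR = 2e-5
TOTAL_STEPS = 100                       # illustrative; real value = epochs * steps_per_epoch
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)   # warmup ratio 0.1

def lr_at(step: int) -> float:
    """Linear warmup followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

peak = lr_at(WARMUP_STEPS)              # LR peaks right after warmup
```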
### Preprocessing and Balancing
- Training uses the **Python** split prepared by the project’s preprocessing pipeline.
- Optional supersampling (oversampling of underrepresented labels with a cap at the maximum label frequency) is available and can be enabled to improve macro performance.
- The test split remains unchanged and corresponds to the original NLBSE Python test partition.
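The capping idea can be sketched as follows. This is a deliberate simplification: it groups each toy example under a single dominant label, whereas real multi-label supersampling must handle examples carrying several labels at once.

```python
import random

random.seed(0)  # deterministic oversampling for the sketch

# Toy examples keyed by a (hypothetical) dominant label; counts are imbalanced.
examples = {
    "Summary": ["s1", "s2", "s3", "s4"],
    "DevelopmentNotes": ["d1"],
}

# Oversample each underrepresented label, capped at the max label frequency.
cap = max(len(rows) for rows in examples.values())
balanced = {
    lab: rows + random.choices(rows, k=cap - len(rows))
    for lab, rows in examples.items()
}

sizes = {lab: len(rows) for lab, rows in balanced.items()}
```

After balancing, every label bucket holds exactly `cap` examples; no label is oversampled beyond the most frequent one.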
### Hardware / Runtime
The reported runtime and GFLOPs are based on the project’s benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size.
---
## How to Use
Install `transformers` and `torch`:
```bash
pip install transformers torch
```
Then load the model and tokenizer (replace the model ID with your repository name):
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_ID = "se4ai2526-uniba/python-transformer" # replace with actual ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()
LABELS = [
"Usage",
"Parameters",
"DevelopmentNotes",
"Expand",
"Summary",
]
def predict_labels(texts, threshold: float = 0.5):
if isinstance(texts, str):
texts = [texts]
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt",
)
with torch.no_grad():
logits = model(**inputs).logits
probs = torch.sigmoid(logits)
preds = (probs > threshold).int().cpu().numpy()
results = []
for row in preds:
labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
results.append(labels)
return results
# Example
comments = [
"# Usage: call this function with a file path | module.py",
]
print(predict_labels(comments))
```
For full reproducibility consistent with the project, use the `ModelPredictor` wrapper and the same preprocessing used during training.
---
## Limitations and Biases
* **Domain-limited:** Trained only on Python code comments from open-source repositories.
* **Imbalanced labels:** Some categories are relatively underrepresented; performance on these labels can lag behind frequent ones.
* **Robustness:** Behavioral tests show that the current model:
* is deterministic and stable on duplicate inputs,
* aligns with several curated golden examples,
* remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced.
---
## Ethical Considerations
* The model reflects the style and biases of the open-source Python projects it was trained on.
* It does not filter offensive or inappropriate content in comments; it only predicts semantic categories.
* Outputs should be treated as assistive signals, not as authoritative judgements.
---
## Citation
If you use this model in academic work or derived systems, please cite:
> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.
BibTeX:
```bibtex
@misc{theclouds_nlbse26_code_comment_classification_python,
title = {NLBSE'26 Code Comment Classification: Python Model},
author = {TheClouds Team},
year = {2025},
note = {Model available on Hugging Face},
howpublished = {\url{To be published}}
}
```
Contact:
For questions, feedback, or collaboration requests related to this model, please contact:
> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it
Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
## Acknowledgements
This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and earlier SetFit and Random Forest baselines.