|
|
--- |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- text-classification |
|
|
- code-comment-classification |
|
|
- transformers |
|
|
- codebert |
|
|
- python |
|
|
- software-engineering |
|
|
- multi-label |
|
|
license: mit |
|
|
datasets: |
|
|
- NLBSE/nlbse26-code-comment-classification |
|
|
metrics: |
|
|
- f1 |
|
|
- precision |
|
|
- recall |
|
|
- subset_accuracy |
|
|
- runtime |
|
|
- gflops |
|
|
pipeline_tag: text-classification |
|
|
library_name: transformers |
|
|
inference: false |
|
|
base_model: microsoft/codebert-base |
|
|
model-index: |
|
|
- name: CodeBERT Transformer for Python Code Comment Classification |
|
|
results: |
|
|
- task: |
|
|
type: text-classification |
|
|
name: Multi-label Text Classification |
|
|
dataset: |
|
|
name: NLBSE Code Comment Classification Dataset (Python) |
|
|
type: NLBSE/nlbse26-code-comment-classification |
|
|
split: test |
|
|
metrics: |
|
|
- type: f1 |
|
|
name: Macro F1 |
|
|
value: 0.6385 |
|
|
- type: f1 |
|
|
name: Micro F1 |
|
|
value: 0.6781 |
|
|
- type: precision |
|
|
name: Macro Precision |
|
|
value: 0.5900 |
|
|
- type: recall |
|
|
name: Macro Recall |
|
|
value: 0.7061 |
|
|
- type: accuracy |
|
|
name: Subset Accuracy |
|
|
value: 0.5690 |
|
|
--- |
|
|
|
|
|
# Transformer Model (CodeBERT) for Python Code Comment Classification |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type:** Transformer-based multi-label classifier (sequence classification head) |
|
|
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base) |
|
|
- **Language:** Python (code comments in English) |
|
|
- **License:** MIT |
|
|
- **Developed by:** TheClouds |
|
|
- **Model Date:** November 2025 |
|
|
- **Model Version:** 1.0 |
|
|
|
|
|
### Description |
|
|
|
|
|
This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment. |
|
|
|
|
|
The classifier operates on the project’s `combo` field (concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector. |
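A minimal sketch of how such a `combo` input can be assembled, assuming the `"<comment_sentence> | <class_context>"` format described in the Data section below (the helper name is hypothetical):

```python
def build_combo(comment_sentence: str, class_context: str) -> str:
    """Join a comment sentence with its compact context string.

    Follows the '<comment_sentence> | <class_context>' format described
    in this card; treat the exact separator as an assumption.
    """
    return f"{comment_sentence} | {class_context}"

# Example input for the classifier:
combo = build_combo("Returns the parsed configuration.", "config_loader.py")
```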
|
|
|
|
|
### Label Set |
|
|
|
|
|
For Python, the model predicts the following 5 categories (fixed order in the classifier head): |
|
|
|
|
|
1. `Usage` |
|
|
2. `Parameters` |
|
|
3. `DevelopmentNotes` |
|
|
4. `Expand` |
|
|
5. `Summary` |
|
|
|
|
|
Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default. |
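As a concrete illustration of this decision rule (the logits below are made up):

```python
import torch

# Hypothetical logits for one comment, ordered as the five labels above.
logits = torch.tensor([2.1, -0.3, 0.8, -1.5, 1.2])

# Sigmoid gives independent per-label probabilities; thresholding at 0.5
# turns them into the binary label vector.
probs = torch.sigmoid(logits)
prediction = (probs > 0.5).int()  # tensor([1, 0, 1, 0, 1])
```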
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
The model is intended for: |
|
|
|
|
|
- research on **code comment classification** in Python projects, |
|
|
- mining and analysis of Python documentation comments, |
|
|
- tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support). |
|
|
|
|
|
It is designed for **Python code comments** in English or English-like technical language. |
|
|
|
|
|
### Out-of-Scope Uses |
|
|
|
|
|
- Generic natural language classification outside software engineering. |
|
|
- Non-English comments without additional fine-tuning or adaptation. |
|
|
- Use in safety- or life-critical decision making. |
|
|
|
|
|
--- |
|
|
|
|
|
## Data |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset:** NLBSE Code Comment Classification Dataset – Python train split |
|
|
- **Size (train):** ~1.4k original training examples (with optional supersampled expansion to ~2k examples, depending on the configuration) |
|
|
- **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`) |
|
|
- **Preprocessing:** |
|
|
- Comments extracted from open-source Python projects. |
|
|
- Each instance represented via the `combo` field: `"<comment_sentence> | <class_context>"`. |
|
|
- The project’s preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`. The metrics reported here correspond to the current transformer configuration logged in MLflow for Python. |
|
|
|
|
|
### Evaluation Data |
|
|
|
|
|
- **Dataset:** NLBSE Code Comment Classification Dataset – Python test split |
|
|
- **Size (test):** ~300 comment sentences |
|
|
- **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match). |
|
|
|
|
|
--- |
|
|
|
|
|
## Metrics |
|
|
|
|
|
### Core Evaluation Metrics (Python, test split) |
|
|
|
|
|
From the training/evaluation run logged in MLflow: |
|
|
|
|
|
| Language | Category         | Precision | Recall | F1   |
|----------|------------------|-----------|--------|------|
| Python   | Usage            | 0.80      | 0.76   | 0.78 |
| Python   | Parameters       | 0.74      | 0.86   | 0.79 |
| Python   | DevelopmentNotes | 0.41      | 0.50   | 0.45 |
| Python   | Expand           | 0.49      | 0.67   | 0.57 |
| Python   | Summary          | 0.63      | 0.82   | 0.71 |
|
|
|
|
|
|
|
|
- **Micro F1:** 0.6781 |
|
|
- **Macro F1:** 0.6385 |
|
|
- **Micro Precision:** 0.6230 |
|
|
- **Micro Recall:** 0.7438 |
|
|
- **Macro Precision:** 0.5900 |
|
|
- **Macro Recall:** 0.7061 |
|
|
- **Subset Accuracy (exact match):** 0.5690 |
|
|
- **Micro Accuracy (per-label):** 0.8441 |
|
|
- **Eval Loss (BCE with logits):** 0.6727 |
|
|
- **Train Loss (final epoch):** 0.2937 |
|
|
|
|
|
### Benchmarking Metrics |
|
|
|
|
|
Average performance for the Python transformer benchmark: |
|
|
|
|
|
- **Average Macro F1:** 0.6385 |
|
|
- **Average Precision (macro):** 0.5900 |
|
|
- **Average Recall (macro):** 0.7061 |
|
|
- **Average Runtime:** ~0.94 seconds (benchmark configuration) |
|
|
- **Average GFLOPs:** ~1823.25 |
|
|
|
|
|
These results indicate that the transformer captures useful signal across all five Python comment categories: it performs strongest on the frequent `Usage`, `Parameters`, and `Summary` labels and weakest on `DevelopmentNotes` and `Expand`.
|
|
|
|
|
--- |
|
|
|
|
|
## Quantitative Analysis |
|
|
|
|
|
The model is evaluated in a multi-label setting: |
|
|
|
|
|
- **Micro metrics** emphasize the overall correctness across all label decisions. |
|
|
- **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`); the sketch below illustrates the difference.
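A small sketch (with made-up labels and predictions) showing how these aggregates relate, using scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions: 4 comments x 5 labels,
# columns ordered as (Usage, Parameters, DevelopmentNotes, Expand, Summary).
y_true = np.array([[1, 0, 0, 0, 1],
                   [0, 1, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [1, 1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 1],
                   [0, 1, 0, 1, 1],
                   [0, 0, 0, 0, 0],
                   [1, 1, 0, 0, 0]])

# Micro F1 pools all 20 label decisions; macro F1 averages the five
# per-label F1 scores, so a weak minority label drags it down.
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
print("Macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))

# Subset accuracy counts a comment as correct only if all 5 labels match.
print("Subset accuracy:", accuracy_score(y_true, y_pred))
```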
|
|
|
|
|
Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable. |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Objective and Architecture |
|
|
|
|
|
- **Base model:** `microsoft/codebert-base` |
|
|
- **Head:** linear classification head with `num_labels = 5` |
|
|
- **Problem type:** `multi_label_classification` |
|
|
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies. |
|
|
- **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance; a sketch of the weighted loss and sampler follows this list.
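A minimal sketch of how the weighted loss and sampler could be wired up. The exact weighting recipe is an assumption; this card only states that weights are derived from training label frequencies:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical (num_examples x 5) binary label matrix from the train split.
labels = torch.tensor([[1, 0, 0, 0, 1],
                       [0, 1, 0, 0, 1],
                       [0, 0, 1, 0, 0],
                       [1, 1, 0, 0, 0]], dtype=torch.float)

# Per-label positive weights: a common recipe is negatives / positives,
# so rare labels contribute more to the loss when missed.
pos_counts = labels.sum(dim=0).clamp(min=1.0)
neg_counts = labels.shape[0] - pos_counts
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=neg_counts / pos_counts)

# Per-example sampling weights: upweight examples carrying rare labels.
example_weights = (labels / pos_counts).sum(dim=1)
sampler = WeightedRandomSampler(
    example_weights, num_samples=len(example_weights), replacement=True
)
```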
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
- **Max sequence length:** 128 |
|
|
- **Batch size:** 16 |
|
|
- **Learning rate:** 2e-5 |
|
|
- **Optimizer:** AdamW |
|
|
- **Scheduler:** Linear warmup and decay (see the setup sketch after this list)
|
|
- **Warmup ratio:** 0.1 |
|
|
- **Number of epochs:** 5 |
|
|
- **Threshold for prediction:** 0.5 (per-label on sigmoid probabilities) |
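A sketch of the optimizer and scheduler setup these hyperparameters imply; the training-set size is approximate, and the model construction mirrors the architecture described above:

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=5,
    problem_type="multi_label_classification",
)

# ~1.4k training examples, batch size 16, 5 epochs (values from this card).
num_examples, batch_size, epochs = 1400, 16, 5
steps_per_epoch = (num_examples + batch_size - 1) // batch_size
total_steps = steps_per_epoch * epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup ratio 0.1
    num_training_steps=total_steps,
)
```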
|
|
|
|
|
### Preprocessing and Balancing |
|
|
|
|
|
- Training uses the **Python** split prepared by the project’s preprocessing pipeline. |
|
|
- Optional supersampling (oversampling of underrepresented labels, capped at the maximum label frequency) is available and can be enabled to improve macro performance; a rough sketch follows this list.
|
|
- The test split remains unchanged and corresponds to the original NLBSE Python test partition. |
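A rough sketch of the capped oversampling idea (the real pipeline writes balanced CSVs under `data/processed/transformer`; the function below is hypothetical and only illustrates the strategy):

```python
import random
from collections import Counter

def supersample(rows, get_labels, seed=0):
    """Duplicate rows carrying rare labels until each label's count
    approaches the most frequent label's count (the cap).

    Because a multi-label row raises several counts at once, the cap is
    approximate; this illustrates the strategy, not the exact pipeline.
    """
    rng = random.Random(seed)
    counts = Counter(label for row in rows for label in get_labels(row))
    cap = max(counts.values())
    out = list(rows)
    for label, count in counts.items():
        pool = [row for row in rows if label in get_labels(row)]
        while count < cap and pool:
            out.append(rng.choice(pool))
            count += 1
    return out
```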
|
|
|
|
|
### Hardware / Runtime |
|
|
|
|
|
The reported runtime and GFLOPs are based on the project’s benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
Install `transformers` and `torch`: |
|
|
|
|
|
```bash |
|
|
pip install transformers torch |
|
|
``` |
|
|
|
|
|
Then load the model and tokenizer (replace the model ID with your repository name): |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
|
|
|
MODEL_ID = "se4ai2526-uniba/python-transformer" # replace with actual ID |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) |
|
|
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID) |
|
|
model.eval() |
|
|
|
|
|
LABELS = [ |
|
|
"Usage", |
|
|
"Parameters", |
|
|
"DevelopmentNotes", |
|
|
"Expand", |
|
|
"Summary", |
|
|
] |
|
|
|
|
|
def predict_labels(texts, threshold: float = 0.5): |
|
|
if isinstance(texts, str): |
|
|
texts = [texts] |
|
|
|
|
|
inputs = tokenizer( |
|
|
texts, |
|
|
padding=True, |
|
|
truncation=True, |
|
|
max_length=128, |
|
|
return_tensors="pt", |
|
|
) |
|
|
    # Sigmoid yields independent per-label probabilities (multi-label setup).
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits)

    # Per-label decision: 1 when the probability exceeds the threshold.
    preds = (probs > threshold).int().cpu().numpy()
|
|
results = [] |
|
|
for row in preds: |
|
|
labels = [LABELS[i] for i, v in enumerate(row) if v == 1] |
|
|
results.append(labels) |
|
|
return results |
|
|
|
|
|
# Example |
|
|
comments = [ |
|
|
"# Usage: call this function with a file path | module.py", |
|
|
] |
|
|
print(predict_labels(comments)) |
|
|
``` |
|
|
|
|
|
To reproduce the project's results exactly, use its `ModelPredictor` wrapper and the same preprocessing applied during training.
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
* **Domain-limited:** Trained only on Python code comments from open-source repositories. |
|
|
* **Imbalanced labels:** Some categories are relatively underrepresented; performance on these labels can lag behind frequent ones. |
|
|
* **Robustness:** Behavioral tests show that the current model: |
|
|
|
|
|
* is deterministic and stable on duplicate inputs, |
|
|
* aligns with several curated golden examples, |
|
|
* remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced. |
|
|
|
|
|
--- |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
* The model reflects the style and biases of the open-source Python projects it was trained on. |
|
|
* It does not filter offensive or inappropriate content in comments; it only predicts semantic categories. |
|
|
* Outputs should be treated as assistive signals, not as authoritative judgements. |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in academic work or derived systems, please cite: |
|
|
|
|
|
> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025. |
|
|
|
|
|
BibTeX: |
|
|
|
|
|
```bibtex |
|
|
@misc{theclouds_nlbse26_code_comment_classification_python, |
|
|
title = {NLBSE'26 Code Comment Classification: Python Model}, |
|
|
author = {TheClouds Team}, |
|
|
year = {2025}, |
|
|
note = {Model available on Hugging Face}, |
|
|
howpublished = {\url{To be published}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact
|
|
|
|
|
For questions, feedback, or collaboration requests related to this model, please contact: |
|
|
> Giacomo Signorile: g.signorile14@studenti.uniba.it |
|
|
> Davide Pio Posa: d.posa3@studenti.uniba.it |
|
|
> Marco Lillo: m.lillo21@studenti.uniba.it |
|
|
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it |
|
|
> Adriano Gentile: a.gentile97@studenti.uniba.it
|
|
|
|
|
Issue tracker: https://github.com/se4ai2526-uniba/TheClouds |
|
|
|
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and earlier SetFit and Random Forest baselines. |
|
|
|