---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- pharo
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Pharo Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Pharo)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.5980
    - type: f1
      name: Micro F1
      value: 0.6720
    - type: precision
      name: Macro Precision
      value: 0.5234
    - type: recall
      name: Macro Recall
      value: 0.7157
    - type: accuracy
      name: Subset Accuracy
      value: 0.5096
---
# Transformer Model (CodeBERT) for Pharo Code Comment Classification
## Model Details
- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Pharo (code comments in English/technical English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0
### Description
This model fine-tunes **CodeBERT** on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe design intent and responsibilities of classes and methods.
The classifier operates on the `combo` field used in the project (comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.
### Label Set
For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):
1. `Keyimplementationpoints`
2. `Example`
3. `Responsibilities`
4. `Intent`
5. `Keymessages`
6. `Collaborators`
Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
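As a toy illustration of this decision rule (the probabilities below are made up, not real model outputs):
```python
# Hypothetical sigmoid outputs for one sentence, in the label order above
probs = [0.12, 0.81, 0.64, 0.33, 0.09, 0.55]
# Threshold at 0.5 to obtain the binary label vector
preds = [int(p > 0.5) for p in probs]  # -> [0, 1, 1, 0, 0, 1]
```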
---
## Intended Use
The model is intended for:
- research on **code comment and design documentation classification** in Pharo projects,
- mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
- tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).
It is designed for **Pharo code comments** written in English or English-like technical language.
### Out-of-Scope Uses
- Generic text classification outside software engineering and Pharo/Smalltalk ecosystems.
- Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
- Any safety- or life-critical decision-making context.
---
## Data
### Training Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo train split
- **Size (train):** ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
- **Label Space:** 6 multi-label categories (`Keyimplementationpoints`, `Example`, `Responsibilities`, `Intent`, `Keymessages`, `Collaborators`)
- **Preprocessing:**
- Comments extracted from real-world Pharo projects.
- Each sample is represented using the `combo` field: `"<comment_sentence> | <class_context>"` (or a similar contextual string; see the sketch after this list).
- For this transformer configuration, the training data come from `data/processed/transformer`, where a supersampling procedure is applied to reduce label imbalance.
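A minimal sketch of how such a `combo` string could be assembled; the field names below are illustrative assumptions, not the dataset's exact schema:
```python
# Hypothetical record; field names are assumptions for illustration
record = {
    "comment_sentence": "The intent of this class is to manage UI events",
    "class_context": "MyWidget class",
}
combo = f"{record['comment_sentence']} | {record['class_context']}"
# -> "The intent of this class is to manage UI events | MyWidget class"
```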
### Evaluation Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo test split
- **Size (test):** ~200 comment sentences
- **Evaluation Protocol:** multi-label classification with micro/macro metrics and subset accuracy (exact-match), evaluated on the original, non-supersampled test split.
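For illustration, a minimal sketch of this protocol using scikit-learn (the arrays below are toy values, not the actual predictions):
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy binary label matrices of shape (n_samples, 6); values are hypothetical
y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0, 0]])

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Subset accuracy: a sample counts only if all 6 label decisions match exactly
subset_accuracy = accuracy_score(y_true, y_pred)
```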
---
## Metrics
### Core Evaluation Metrics (Pharo, test split)
From the training/evaluation run logged in MLflow:
| Language | Category                | Precision | Recall | F1   |
|----------|-------------------------|-----------|--------|------|
| pharo    | Keyimplementationpoints | 0.47      | 0.68   | 0.56 |
| pharo    | Example                 | 0.89      | 0.83   | 0.86 |
| pharo    | Responsibilities        | 0.57      | 0.76   | 0.65 |
| pharo    | Intent                  | 0.83      | 0.90   | 0.86 |
| pharo    | Keymessages             | 0.47      | 0.73   | 0.57 |
| pharo    | Collaborators           | 0.33      | 0.57   | 0.42 |
- **Micro F1:** 0.6720
- **Macro F1:** 0.5980
- **Micro Precision:** 0.5964
- **Micro Recall:** 0.7696
- **Macro Precision:** 0.5234
- **Macro Recall:** 0.7157
- **Subset Accuracy (exact match):** 0.5096
- **Micro Accuracy (per-label):** 0.8694
- **Eval Loss (BCE with logits):** 0.5889
- **Train Loss (final epoch):** 0.2149
### Benchmarking Metrics
Average performance over Pharo transformer benchmarking runs:
- **Average Macro F1:** 0.5980
- **Average Precision (macro):** 0.5234
- **Average Recall (macro):** 0.7157
- **Average Runtime:** ~1.35 seconds (benchmark configuration)
- **Average GFLOPs:** ~1943.77
These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall; the lower macro precision and F1 reflect the dataset’s label imbalance and limited size.
---
## Quantitative Analysis
The evaluation is fully multi-label:
- **Micro metrics** reflect overall correctness across all label decisions.
- **Macro metrics** treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., `Collaborators`, `Keymessages`).
A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:
- Stronger performance on `Example` and `Intent` (F1 ≈ 0.86 for both).
- Weaker performance on `Keyimplementationpoints`, `Keymessages`, and especially `Collaborators`, which have fewer training examples.
---
## Training Details
### Objective and Architecture
- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 6`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training samples to partially correct for label imbalance.
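A minimal sketch of this loss and sampler setup, assuming a binary label matrix `y_train`; the per-sample weighting scheme below is an illustrative choice, not necessarily the project's exact one:
```python
import torch
from torch.utils.data import WeightedRandomSampler

# y_train: (n_samples, 6) binary label matrix; random values here for illustration
y_train = torch.randint(0, 2, (900, 6)).float()

# Per-label positive class weight: negatives / positives for each label
pos_counts = y_train.sum(dim=0).clamp(min=1.0)
pos_weight = (y_train.shape[0] - pos_counts) / pos_counts
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Per-sample weights: samples carrying rarer labels are drawn more often
label_rarity = 1.0 / pos_counts
sample_weights = (y_train * label_rarity).sum(dim=1).clamp(min=label_rarity.min().item())
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
```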
### Hyperparameters
- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** Linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Prediction threshold:** 0.5 (per-label on sigmoid probabilities)
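Under these hyperparameters, the optimizer and scheduler setup looks roughly like the following sketch (`steps_per_epoch` is a placeholder and `model` is the loaded classifier):
```python
import torch
from transformers import get_linear_schedule_with_warmup

num_epochs = 5
steps_per_epoch = 120  # placeholder: len(train_dataloader) in practice
total_steps = num_epochs * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup ratio 0.1
    num_training_steps=total_steps,
)
```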
### Preprocessing and Balancing
- Training data for Pharo are produced by the project’s preprocessing module, which:
- ensures a `combo` text field is present,
- parses the label strings into binary vectors,
- applies **supersampling** on the train split only (up to a cap at the maximum original label frequency).
- The test split is not modified and corresponds to the original NLBSE Pharo test data.
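A sketch of the supersampling idea (duplicating samples that carry under-represented labels until each label's count approaches the original maximum); this illustrates the described behaviour, not the preprocessing module's exact code:
```python
import random

def supersample(samples, labels, num_labels=6, seed=42):
    """Duplicate samples with rare labels up to the max original label count."""
    rng = random.Random(seed)
    counts = [sum(y[i] for y in labels) for i in range(num_labels)]
    cap = max(counts)
    out_samples, out_labels = list(samples), list(labels)
    for i in range(num_labels):
        pool = [(x, y) for x, y in zip(samples, labels) if y[i] == 1]
        while pool and counts[i] < cap:
            x, y = rng.choice(pool)
            out_samples.append(x)
            out_labels.append(y)
            # A duplicated multi-label sample also raises its other label counts
            for j in range(num_labels):
                counts[j] += y[j]
    return out_samples, out_labels
```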
### Hardware / Runtime
The runtime and GFLOPs figures are measured in the project’s benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.
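A rough way to reproduce a latency figure on your own hardware, assuming `model` and `tokenizer` are loaded as in the usage example below (the batch contents are placeholders):
```python
import time
import torch

batch = tokenizer(
    ["Some comment sentence | SomeClass"] * 16,  # placeholder inputs
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
with torch.no_grad():
    model(**batch)  # warm-up pass
    start = time.perf_counter()
    model(**batch)
    elapsed = time.perf_counter() - start
print(f"Batch latency: {elapsed:.3f}s")
```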
---
## How to Use
Install dependencies:
```bash
pip install transformers torch
```
Then load the model and tokenizer (replace the model ID with the actual repository):
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/pharo-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Fixed label order used by the classification head
LABELS = [
    "Keyimplementationpoints",
    "Example",
    "Responsibilities",
    "Intent",
    "Keymessages",
    "Collaborators",
]

def predict_labels(texts, threshold: float = 0.5):
    """Return the predicted label names for one or more comment sentences."""
    if isinstance(texts, str):
        texts = [texts]
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Sigmoid + per-label threshold, as described above
    probs = torch.sigmoid(logits)
    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example: comment sentence plus class context, joined with " | "
comments = [
    "\"The intent of this class is to manage UI events\" | MyWidget class",
]
print(predict_labels(comments))
```
For consistency with the rest of the project, you can also use the shared `ModelPredictor` wrapper and the same preprocessing normalization applied during training.
---
## Limitations and Biases
* **Limited data:** The Pharo split is relatively small compared to Java/Python; this constrains the model’s ability to generalize, especially for minority labels.
* **Imbalanced label distribution:** Despite supersampling and positive weights, some categories remain harder to predict reliably.
* **Sensitivity to perturbations:** Behavioral tests show:
* deterministic behaviour and stable predictions on duplicate inputs,
* alignment with several curated golden examples,
* sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.
---
## Ethical Considerations
* The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
* It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
* Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.
---
## Citation
If you use this model in academic work or derived systems, please cite:
> TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.
BibTeX:
```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
title = {NLBSE'26 Code Comment Classification: Pharo Model},
author = {TheClouds Team},
year = {2025},
note = {Model available on Hugging Face},
howpublished = {\url{To be published}}
}
```
Contact:
For questions, feedback, or collaboration requests related to this model, please contact:
> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it
Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
## Acknowledgements
This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE’26 competition data and building on earlier SetFit and Random Forest baselines.