---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- pharo
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Pharo Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Pharo)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.5980
    - type: f1
      name: Micro F1
      value: 0.6720
    - type: precision
      name: Macro Precision
      value: 0.5234
    - type: recall
      name: Macro Recall
      value: 0.7157
    - type: accuracy
      name: Subset Accuracy
      value: 0.5096
---

# Transformer Model (CodeBERT) for Pharo Code Comment Classification

## Model Details

- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** English (comments extracted from Pharo source code)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0

### Description

This model fine-tunes **CodeBERT** on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe design intent and responsibilities of classes and methods.

The classifier operates on the `combo` field used in the project (comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.

### Label Set

For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):

1. `Keyimplementationpoints`
2. `Example`
3. `Responsibilities`
4. `Intent`
5. `Keymessages`
6. `Collaborators`

Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
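The decision rule above can be sketched in plain Python (the logits below are illustrative values, not real model outputs):

```python
import math

LABELS = [
    "Keyimplementationpoints", "Example", "Responsibilities",
    "Intent", "Keymessages", "Collaborators",
]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode(logits, threshold: float = 0.5):
    """Turn raw logits into a 0/1 vector and the matching label names."""
    probs = [sigmoid(z) for z in logits]
    binary = [1 if p > threshold else 0 for p in probs]
    names = [LABELS[i] for i, b in enumerate(binary) if b]
    return binary, names

# Illustrative logits for one sentence: only "Intent" clears the threshold.
binary, names = decode([-2.1, -0.4, -1.3, 1.8, -3.0, -0.9])
print(binary, names)  # [0, 0, 0, 1, 0, 0] ['Intent']
```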

---

## Intended Use

The model is intended for:

- research on **code comment and design documentation classification** in Pharo projects,
- mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
- tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).

It is designed for **Pharo code comments** written in English or English-like technical language.

### Out-of-Scope Uses

- Generic text classification outside software engineering and Pharo/Smalltalk ecosystems.
- Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
- Any safety- or life-critical decision-making context.

---

## Data

### Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo train split
- **Size (train):** ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
- **Label Space:** 6 multi-label categories (`Keyimplementationpoints`, `Example`, `Responsibilities`, `Intent`, `Keymessages`, `Collaborators`)
- **Preprocessing:**
  - Comments extracted from real-world Pharo projects.
  - Each sample represented using the `combo` field: `"<comment_sentence> | <class_context>"` (or similar contextual string).
  - For this transformer configuration, the training data come from `data/processed/transformer`, where a supersampling procedure is applied to reduce label imbalance.
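A minimal sketch of building the `combo` field from its two parts, assuming the `" | "` separator shown above (the project's preprocessing pipeline may normalize the text further):

```python
def make_combo(comment_sentence: str, class_context: str) -> str:
    """Join a comment sentence with its class context, mirroring the
    '<comment_sentence> | <class_context>' format described above.
    The exact separator and normalization are assumptions for illustration."""
    return f"{comment_sentence.strip()} | {class_context.strip()}"

# Hypothetical example input, not taken from the dataset.
combo = make_combo(
    "I represent a button that triggers an action when clicked.",
    "PluggableButtonMorph class",
)
print(combo)
```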

### Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo test split
- **Size (test):** ~200 comment sentences
- **Evaluation Protocol:** multi-label classification with micro/macro metrics and subset accuracy (exact-match), evaluated on the original, non-supersampled test split.

---

## Metrics

### Core Evaluation Metrics (Pharo, test split)

From the training/evaluation run logged in MLflow:

| Language | Category                | Precision | Recall | F1   |
|----------|-------------------------|-----------|--------|------|
| Pharo    | Keyimplementationpoints | 0.47      | 0.68   | 0.56 |
| Pharo    | Example                 | 0.89      | 0.83   | 0.86 |
| Pharo    | Responsibilities        | 0.57      | 0.76   | 0.65 |
| Pharo    | Intent                  | 0.83      | 0.90   | 0.86 |
| Pharo    | Keymessages             | 0.47      | 0.73   | 0.57 |
| Pharo    | Collaborators           | 0.33      | 0.57   | 0.42 |

- **Micro F1:** 0.6720  
- **Macro F1:** 0.5980  
- **Micro Precision:** 0.5964  
- **Micro Recall:** 0.7696  
- **Macro Precision:** 0.5234  
- **Macro Recall:** 0.7157  
- **Subset Accuracy (exact match):** 0.5096  
- **Micro Accuracy (per-label):** 0.8694  
- **Eval Loss (BCE with logits):** 0.5889  
- **Train Loss (final epoch):** 0.2149  

### Benchmarking Metrics

Average performance over Pharo transformer benchmarking runs:

- **Average Macro F1:** 0.5980  
- **Average Precision (macro):** 0.5234  
- **Average Recall (macro):** 0.7157  
- **Average Runtime:** ~1.35 seconds (benchmark configuration)  
- **Average GFLOPs:** ~1943.77  

These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall, while macro precision and F1 reflect the dataset’s label imbalance and limited size.

---

## Quantitative Analysis

The evaluation is fully multi-label:

- **Micro metrics** reflect overall correctness across all label decisions.
- **Macro metrics** treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., `Collaborators`, `Keymessages`).

A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:

- Strong performance on `Example` and `Intent` (per-class F1 around 0.86).
- Weaker performance on `Collaborators`, `Keymessages`, and `Keyimplementationpoints`, the categories with fewer positive training examples.
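The micro/macro distinction, along with subset accuracy, can be made concrete on a toy multi-label example (synthetic data in pure Python, not the project's actual evaluation code):

```python
# Toy multi-label ground truth and predictions (3 samples, 3 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

n_labels = len(y_true[0])
per_label_f1 = []
tp_all = fp_all = fn_all = 0
for j in range(n_labels):
    tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
    fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
    fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
    per_label_f1.append(f1(tp, fp, fn))
    tp_all += tp; fp_all += fp; fn_all += fn

macro_f1 = sum(per_label_f1) / n_labels   # every label counts equally
micro_f1 = f1(tp_all, fp_all, fn_all)     # pooled over all label decisions
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro_f1, micro_f1, subset_acc)     # ~0.556  0.75  ~0.333
```

Here the rare third label drags the macro score well below the micro score, which is exactly the pattern visible in the Pharo results above.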

---

## Training Details

### Objective and Architecture

- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 6`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training samples to partially correct for label imbalance.
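A minimal sketch of how the per-label positive weights and per-sample sampling weights could be derived from binary label rows; the data and the weighting scheme below are illustrative assumptions, and the project's actual computation may differ in detail. The resulting lists would typically be passed as `torch.tensor(pos_weight)` to `BCEWithLogitsLoss(pos_weight=...)` and as the weights of a `WeightedRandomSampler`:

```python
# Illustrative binary label matrix (4 samples, 6 labels).
labels = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
n = len(labels)
n_labels = len(labels[0])

# pos_weight[j] = (#negatives / #positives): up-weights rare positive labels.
pos_counts = [sum(row[j] for row in labels) for j in range(n_labels)]
pos_weight = [(n - c) / c if c else 1.0 for c in pos_counts]

# Per-sample weight: average rarity of the sample's positive labels, so
# samples carrying rare labels are drawn more often by the sampler.
def sample_weight(row):
    rarities = [n / pos_counts[j] for j, v in enumerate(row) if v and pos_counts[j]]
    return sum(rarities) / len(rarities) if rarities else 1.0

sample_weights = [sample_weight(row) for row in labels]
print(pos_weight)       # [1.0, 3.0, 3.0, 3.0, 1.0, 1.0]
print(sample_weights)   # [3.0, 4.0, 4.0, 2.0]
```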

### Hyperparameters

- **Max sequence length:** 128  
- **Batch size:** 16  
- **Learning rate:** 2e-5  
- **Optimizer:** AdamW  
- **Scheduler:** Linear warmup and decay  
- **Warmup ratio:** 0.1  
- **Number of epochs:** 5  
- **Prediction threshold:** 0.5 (per-label on sigmoid probabilities)

### Preprocessing and Balancing

- Training data for Pharo are produced by the project’s preprocessing module, which:
  - ensures a `combo` text field is present,
  - parses the label strings into binary vectors,
  - applies **supersampling** on the train split only (up to a cap at the maximum original label frequency).
- The test split is not modified and corresponds to the original NLBSE Pharo test data.
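The label-parsing step can be sketched as follows; the comma-separated string format is an assumption for illustration, and the dataset's actual encoding may differ:

```python
LABELS = [
    "Keyimplementationpoints", "Example", "Responsibilities",
    "Intent", "Keymessages", "Collaborators",
]

def parse_labels(label_str: str) -> list:
    """Turn a comma-separated label string into a fixed-order binary vector.
    Assumes comma-separated label names; adapt to the dataset's real format."""
    present = {s.strip() for s in label_str.split(",") if s.strip()}
    return [1 if name in present else 0 for name in LABELS]

print(parse_labels("Intent, Responsibilities"))  # [0, 0, 1, 1, 0, 0]
```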

### Hardware / Runtime

The runtime and GFLOPs figures are measured in the project’s benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.

---

## How to Use

Install dependencies:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with the actual repository):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/pharo-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "Keyimplementationpoints",
    "Example",
    "Responsibilities",
    "Intent",
    "Keymessages",
    "Collaborators",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits)

    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "\"The intent of this class is to manage UI events\" | MyWidget class",
]
print(predict_labels(comments))
```

For consistency with the rest of the project, you can also use the shared `ModelPredictor` wrapper and the same preprocessing normalization applied during training.

---

## Limitations and Biases

* **Limited data:** The Pharo split is relatively small compared to Java/Python; this constrains the model’s ability to generalize, especially for minority labels.
* **Imbalanced label distribution:** Despite supersampling and positive weights, some categories remain harder to predict reliably.
* **Sensitivity to perturbations:** Behavioral tests show:

  * deterministic behaviour and stable predictions on duplicate inputs,
  * alignment with several curated golden examples,
  * sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.

---

## Ethical Considerations

* The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
* It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
* Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.

---

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
  title        = {NLBSE'26 Code Comment Classification: Pharo Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:
- Giacomo Signorile: g.signorile14@studenti.uniba.it
- Davide Pio Posa: d.posa3@studenti.uniba.it
- Marco Lillo: m.lillo21@studenti.uniba.it
- Rebecca Margiotta: m.margiotta5@studenti.uniba.it
- Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE’26 competition data and building on earlier SetFit and Random Forest baselines.