Sky-Blue-da-ba-dee committed on
Commit
9636971
·
1 Parent(s): b4c95e1

fixed a typo in the project name

models/model_cards/java/setfit/README.md ADDED
@@ -0,0 +1,231 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - text-classification
6
+ - code-comment-classification
7
+ - setfit
8
+ - java
9
+ - software-engineering
10
+ - multi-label
11
+ - sentence-transformers
12
+ - generated_from_setfit_trainer
13
+ license: mit
14
+ datasets:
15
+ - NLBSE/nlbse26-code-comment-classification
16
+ metrics:
17
+ - f1
18
+ - precision
19
+ - recall
20
+ pipeline_tag: text-classification
21
+ library_name: setfit
22
+ inference: false
23
+ widget:
24
+ - text: '@link FSNamesystem#readLock() | FSPermissionChecker.java'
25
+ - text: previous^checkpoint li | TestSaveNamespace.java
26
+ - text: // the file doesn't have anything | TaskLog.java
27
+ - text: " @param file the file the include directives point to\n\t * @param depth\
28
+ \ depth to which includes are followed, should be one of\n\t * {@link #DEPTH_ZERO}\
29
+ \ or {@link #DEPTH_INFINITE}\n\t * @return an array of include relations\n\t *\
30
+ \ @throws CoreException | IIndex.java"
31
+ - text: // quotes are removed | ScannerUtility.java
32
+ base_model: sentence-transformers/paraphrase-MiniLM-L6-v2
33
+ model-index:
34
+ - name: SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2
35
+ results:
36
+ - task:
37
+ type: text-classification
38
+ name: Text Classification
39
+ dataset:
40
+ name: NLBSE Code Comment Classification Dataset (Java)
41
+ type: NLBSE/nlbse26-code-comment-classification
42
+ split: test
43
+ metrics:
44
+ - type: accuracy
45
+ value: 0.7435
46
+ name: Accuracy
47
+ ---
48
+
49
+ # SetFit Model for Java Code Comment Classification
50
+
51
+ ## Model Details
52
+
53
+ - **Model Type:** SetFit (Sentence Transformer Fine-tuning)
54
+ - **Base Model:** [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
55
+ - **Language:** Java (Comments in English)
56
+ - **License:** MIT
57
+ - **Developed by:** TheClouds
58
+ - **Model Date:** November 4, 2025
59
+ - **Model Version:** 1.0
60
+ - **Contact:** For additional information, contact team TheClouds on GitHub.
61
+
62
+
63
+ ### Description
64
+ This model is a SetFit model trained on the **Java** subset of the **NLBSE Code Comment Classification Dataset**. It is designed to classify code comments into one or more of **7 categories** that describe the semantic purpose of the comment.
65
+
66
+ The model uses a multi-label classification approach, where a single comment can belong to multiple categories.
67
+
68
+ ## Intended Use
69
+ This model was created for the code comment classification task and trained specifically on code comments extracted from Java projects. It is therefore useful for research and development on code comment classification in Java (or other object-oriented languages), and for software documentation analysis tasks that perform supervised multi-label classification.
70
+
71
+ ### Out-of-Scope Use Cases
72
+ General text classification outside the domain of software engineering.
73
+
74
+ ## Factors
75
+
76
+ - **Programming Language:** The model is specifically trained on Java code comments.
77
+ - **Comment Types:** The model recognizes the following 7 categories specific to Java documentation:
78
+ 1. `summary`
79
+ 2. `Ownership`
80
+ 3. `Expand`
81
+ 4. `usage`
82
+ 5. `Pointer`
83
+ 6. `deprecation`
84
+ 7. `rational`
85
+
86
+ ## Metrics
87
+
88
+ - **Model Performance Measures:** The primary metrics used for evaluation are **Precision**, **Recall**, **F1-Score** and **Accuracy**.
89
+ - **Decision Threshold:** A probability threshold of 0.5 was used for classification.
90
+ - **Performance:** The model achieves an overall Accuracy of **0.7669** on the test set.
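As an illustration of the decision threshold (the probabilities below are invented for the sketch, not actual model outputs), a label is assigned whenever its probability exceeds 0.5, so one comment can receive several labels:

```python
# The 7 Java categories, in the order used by this card.
LABELS = ["summary", "Ownership", "Expand", "usage", "Pointer", "deprecation", "rational"]

# Toy per-label probabilities for a single comment (illustrative values only).
probs = [0.91, 0.02, 0.12, 0.64, 0.08, 0.03, 0.47]

THRESHOLD = 0.5

# Multi-label decision: keep every label whose probability clears the threshold.
predicted = [label for label, p in zip(LABELS, probs) if p > THRESHOLD]
print(predicted)  # ['summary', 'usage']
```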
91
+
92
+ ### Dataset Summary
93
+ The **NLBSE Code Comment Classification Dataset** is a collection of code comment sentences accompanied by multi-label category annotations.
94
+
95
+ - **Java Labels (7):** `summary`, `Ownership`, `Expand`, `usage`, `Pointer`, `deprecation`, `rational`.
96
+
97
+ Each entry corresponds to a comment sentence extracted from real projects.
98
+
99
+ ### Motivation
100
+ Use of this specific dataset was a requirement of the NLBSE'26 code comment classification challenge.
101
+
102
+ ## Training Data
103
+
104
+ - **Dataset:** NLBSE Code Comment Classification Dataset (Java train split).
105
+ - **Size:** 5,390 rows.
106
+ - **Label Distribution:** The dataset contains 7 categories with varying frequencies. Common categories include "summary" and "usage".
107
+
108
+ ## Evaluation Data
109
+
110
+ - **Dataset:** NLBSE Code Comment Classification Dataset (Java test split).
111
+ - **Size:** 1,200 rows.
112
+ - **Preprocessing:** Comments were extracted from real-world open-source Java projects, split into sentences, and manually classified.
113
+
114
+ ## Quantitative Analyses
115
+
116
+ | language | category | precision | recall | f1 |
117
+ |---|---|---|---|---|
118
+ | java | summary | 0.871224 | 0.886731 | 0.878909 |
119
+ | java | Ownership | 1.000000 | 1.000000 | 1.000000 |
120
+ | java | Expand | 0.330097 | 0.430380 | 0.373626 |
121
+ | java | usage | 0.883803 | 0.850847 | 0.867012 |
122
+ | java | Pointer | 0.775641 | 0.968000 | 0.861210 |
123
+ | java | deprecation | 0.875000 | 0.700000 | 0.777778 |
124
+ | java | rational | 0.311688 | 0.413793 | 0.355556 |
125
+
126
+ ## Ethical Considerations
127
+
128
+ - **Biases:** The dataset is drawn from open-source software projects. The comments reflect the writing styles and norms of the open-source community, which may not be representative of all software development environments (e.g., proprietary software).
129
+ - **Content:** Comments are user-generated content and may contain informal language or jargon specific to the projects they were extracted from.
130
+
131
+ ## Caveats and Recommendations
132
+
133
+ - **Language Specificity:** The label set is specific to Java.
134
+ - **Context:** The model relies on text-only comment sentences. Surrounding code context is not included, which may limit the model's ability to resolve ambiguous comments.
135
+ - **Class Imbalance:** Some categories (e.g., `deprecation`, `Ownership`) may be underrepresented compared to `summary` or `usage`.
136
+
137
+ ## How to Use
138
+
139
+ First install the SetFit library:
140
+
141
+ ```bash
142
+ pip install setfit
143
+ ```
144
+
145
+ Then you can load this model and run inference:
146
+
147
+ ```python
148
+ from setfit import SetFitModel
149
+
150
+ # Download from the 🤗 Hugging Face Hub
151
+ model = SetFitModel.from_pretrained("se4ai2526-uniba/setfit-java") # Replace with actual model ID if different
152
+
153
+ # Run inference
154
+ preds = model(["// quotes are removed | ScannerUtility.java"])
155
+ print(preds)
156
+ ```
157
+
158
+ ## Training Details
159
+
160
+ ### Training Hyperparameters
161
+ - batch_size: (32, 32)
162
+ - num_epochs: (2, 2)
163
+ - max_steps: -1
164
+ - sampling_strategy: oversampling
165
+ - num_iterations: 5
166
+ - body_learning_rate: (2e-05, 1e-05)
167
+ - head_learning_rate: 0.01
168
+ - loss: CosineSimilarityLoss
169
+ - distance_metric: cosine_distance
170
+ - margin: 0.25
171
+ - end_to_end: False
172
+ - use_amp: False
173
+ - warmup_proportion: 0.1
174
+ - l2_weight: 0.01
175
+ - seed: 42
176
+ - eval_max_steps: -1
177
+ - load_best_model_at_end: False
178
+ - probability_threshold: 0.5
179
+
180
+ ### Training Results
181
+ | Metric | Value |
182
+ |:-------|:------|
183
+ | **Accuracy** | 0.7669 |
184
+ | **Embedding Loss** | 0.0239 |
185
+ | **Training Loss** | 0.0587 |
186
+ | **Training Runtime** | 1515.40 s |
187
+ | **Training Samples/Sec** | 71.189 |
188
+ | **Training Steps/Sec** | 2.225 |
189
+
190
+ ### Framework Versions
191
+ - Python: 3.11.9
192
+ - SetFit: 1.1.2
193
+ - Sentence Transformers: 5.1.2
194
+ - Transformers: 4.57.1
195
+ - PyTorch: 2.7.1
196
+ - Datasets: 3.6.0
197
+ - Tokenizers: 0.22.1
198
+
199
+ ## Citation
200
+
201
+ If you use this model in academic work or derived systems, please cite:
202
+
203
+ > TheClouds Team. "NLBSE'26 Code Comment Classification – Java Model." 2025.
204
+
205
+ BibTeX:
206
+
207
+ ```bibtex
208
+ @misc{theclouds_nlbse26_code_comment_classification_java,
209
+ title = {NLBSE'26 Code Comment Classification: Java Model},
210
+ author = {TheClouds Team},
211
+ year = {2025},
212
+ note = {Model available on Hugging Face},
213
+ howpublished = {\url{To be published}}
214
+ }
215
+ ```
216
+
217
+ Contact:
218
+
219
+ For questions, feedback, or collaboration requests related to this model, please contact:
220
+ > Giacomo Signorile: g.signorile14@studenti.uniba.it
221
+ > Davide Pio Posa: d.posa3@studenti.uniba.it
222
+ > Marco Lillo: m.lillo21@studenti.uniba.it
223
+ > Rebecca Margiotta: m.margiotta5@studenti.uniba.it
224
+ > Adriano Gentile: a.gentile97@studenti.uniba.com
225
+
226
+ Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
227
+
228
+
229
+ ## Acknowledgements
230
+
231
+ This model was created for research in the context of **NLBSE (Natural Language-Based Software Engineering)**.
models/model_cards/java/transformer/README.md ADDED
@@ -0,0 +1,328 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - text-classification
6
+ - code-comment-classification
7
+ - transformers
8
+ - codebert
9
+ - java
10
+ - software-engineering
11
+ - multi-label
12
+ license: mit
13
+ datasets:
14
+ - NLBSE/nlbse26-code-comment-classification
15
+ metrics:
16
+ - f1
17
+ - precision
18
+ - recall
19
+ - subset_accuracy
20
+ - runtime
21
+ - gflops
22
+ pipeline_tag: text-classification
23
+ library_name: transformers
24
+ inference: false
25
+ base_model: microsoft/codebert-base
26
+ model-index:
27
+ - name: CodeBERT Transformer for Java Code Comment Classification
28
+ results:
29
+ - task:
30
+ type: text-classification
31
+ name: Multi-label Text Classification
32
+ dataset:
33
+ name: NLBSE Code Comment Classification Dataset (Java)
34
+ type: NLBSE/nlbse26-code-comment-classification
35
+ split: test
36
+ metrics:
37
+ - type: f1
38
+ name: Macro F1
39
+ value: 0.7457
40
+ - type: f1
41
+ name: Micro F1
42
+ value: 0.8364
43
+ - type: precision
44
+ name: Macro Precision
45
+ value: 0.7307
46
+ - type: recall
47
+ name: Macro Recall
48
+ value: 0.7658
49
+ - type: accuracy
50
+ name: Subset Accuracy
51
+ value: 0.8085
52
+ ---
53
+
54
+ # Transformer Model (CodeBERT) for Java Code Comment Classification
55
+
56
+ ## Model Details
57
+
58
+ - **Model Type:** Transformer-based multi-label classifier (sequence classification head)
59
+ - **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
60
+ - **Language:** Java (code comments in English)
61
+ - **License:** MIT
62
+ - **Developed by:** TheClouds
63
+ - **Model Date:** November 2025
64
+ - **Model Version:** 1.0
65
+
66
+ ### Description
67
+
68
+ This model fine-tunes **CodeBERT** on the **Java** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Java code comment sentence is mapped to one or more semantic categories describing the intent and role of the comment.
69
+
70
+ The classifier operates directly on the concatenated `combo` field used in the project (comment sentence plus file/method context string), and produces a 7-dimensional binary label vector.
71
+
72
+ ### Label Set
73
+
74
+ For Java, the model predicts the following 7 categories (fixed order in the classifier head):
75
+
76
+ 1. `summary`
77
+ 2. `Ownership`
78
+ 3. `Expand`
79
+ 4. `usage`
80
+ 5. `Pointer`
81
+ 6. `deprecation`
82
+ 7. `rational`
83
+
84
+ Each prediction is a length-7 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
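A minimal sketch of that decision rule (the logits are hand-picked for illustration, not real model outputs); note that thresholding the sigmoid at 0.5 is equivalent to checking whether the raw logit is positive:

```python
import math

LABELS = ["summary", "Ownership", "Expand", "usage", "Pointer", "deprecation", "rational"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative logits for one comment; real values come from the classification head.
logits = [2.3, -4.1, -0.7, 1.1, -2.0, -3.5, 0.2]

# Sigmoid maps each logit to an independent probability; thresholding at 0.5
# yields the length-7 binary label vector described above.
binary = [1 if sigmoid(z) > 0.5 else 0 for z in logits]
print(binary)  # [1, 0, 0, 1, 0, 0, 1]
```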
85
+
86
+ ---
87
+
88
+ ## Intended Use
89
+
90
+ The model is intended for:
91
+
92
+ - research on **code comment classification** in Java projects,
93
+ - analysis and mining of Java documentation comments,
94
+ - downstream tools that need a multi-label semantic categorization of comments (e.g., documentation quality checks, comment recommendation, refactoring assistants).
95
+
96
+ It is designed for **Java code comments** and similar documentation-style text from software projects.
97
+
98
+ ### Out-of-Scope Uses
99
+
100
+ - Generic natural language classification outside the software engineering domain.
101
+ - Non-English comments or comments from programming languages with substantially different documentation conventions, without additional fine-tuning.
102
+ - Safety- or life-critical decision making.
103
+
104
+ ---
105
+
106
+ ## Data
107
+
108
+ ### Training Data
109
+
110
+ - **Dataset:** NLBSE Code Comment Classification Dataset – Java train split
111
+ - **Size (train):** ~5.4k comment sentences
112
+ - **Label Space:** 7 multi-label categories (`summary`, `Ownership`, `Expand`, `usage`, `Pointer`, `deprecation`, `rational`)
113
+ - **Preprocessing:**
114
+ - Comments extracted from real-world open-source Java projects.
115
+ - Split into comment sentences and associated with class annotations.
116
+ - Project-specific preprocessing uses a `combo` field (`"<comment_sentence> | <class_context>"`).
117
+ - For this transformer model, training uses the preprocessed CSVs under `data/processed/transformer`, including synthetic oversampling (supersampling) for label balancing.
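The `combo` input can be sketched as a simple string join (a hedged illustration; `build_combo` is a hypothetical helper, and the project's exact preprocessing may differ):

```python
def build_combo(comment_sentence: str, class_context: str) -> str:
    # The classifier consumes "<comment_sentence> | <class_context>" as one string.
    return f"{comment_sentence} | {class_context}"

combo = build_combo("// quotes are removed", "ScannerUtility.java")
print(combo)  # // quotes are removed | ScannerUtility.java
```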
118
+
119
+ ### Evaluation Data
120
+
121
+ - **Dataset:** NLBSE Code Comment Classification Dataset – Java test split
122
+ - **Size (test):** ~1.2k comment sentences
123
+ - **Evaluation Protocol:** multi-label classification with micro and macro metrics; subset accuracy (exact match) is also reported.
124
+
125
+ ---
126
+
127
+ ## Metrics
128
+
129
+ ### Core Evaluation Metrics (Java, test split)
130
+
131
+ From the training/evaluation run logged in MLflow:
132
+
133
+ | language | category | precision | recall | f1 |
134
+ |------|-------------|-----------|---------|---------|
135
+ | java | summary | 0.88 | 0.92| 0.90|
136
+ | java | Ownership | 1.00 | 1.00| 1.00|
137
+ | java | Expand | 0.41 | 0.44| 0.42|
138
+ | java | usage | 0.89 | 0.85| 0.87|
139
+ | java | Pointer | 0.75 | 0.98| 0.85|
140
+ | java | deprecation | 0.89 | 0.80| 0.84|
141
+ | java | rational | 0.40 | 0.41| 0.41|
142
+
143
+
144
+
145
+ - **Micro F1:** 0.8364
146
+ - **Macro F1:** 0.7457
147
+ - **Micro Precision:** 0.8142
148
+ - **Micro Recall:** 0.8599
149
+ - **Macro Precision:** 0.7307
150
+ - **Macro Recall:** 0.7658
151
+ - **Subset Accuracy (exact match):** 0.8085
152
+ - **Micro Accuracy (per-label):** 0.9515
153
+ - **Eval Loss (BCE with logits):** 0.6207
154
+ - **Train Loss (final epoch):** 0.0291
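Subset accuracy is stricter than per-label (micro) accuracy; a toy sketch (invented predictions, not taken from the evaluation run) of the difference:

```python
# Rows = comments, columns = labels; 1 means the label applies.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

# Subset accuracy: a row counts only if every one of its label decisions is correct.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-label (micro) accuracy: each individual 0/1 decision counts on its own.
flat = [(tv, pv) for t, p in zip(y_true, y_pred) for tv, pv in zip(t, p)]
micro_acc = sum(tv == pv for tv, pv in flat) / len(flat)

print(subset_acc, micro_acc)  # 2/3 vs 8/9: one wrong label sinks a whole row
```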
155
+
156
+ ### Benchmarking Metrics
157
+
158
+ Average performance over the Java benchmarking runs:
159
+
160
+ - **Average Macro F1:** 0.7457
161
+ - **Average Precision (macro):** 0.7307
162
+ - **Average Recall (macro):** 0.7658
163
+ - **Average Runtime (sec per run):** 168.53
164
+ - **Average GFLOPs (inference benchmark):** 26118.05
165
+
166
+ These metrics indicate that the transformer model improves over earlier baselines in terms of both micro and macro F1, while maintaining reasonable runtime characteristics for research workloads.
167
+
168
+ ---
169
+
170
+ ## Quantitative Analysis
171
+
172
+ The model is evaluated in a strictly multi-label setting:
173
+
174
+ - **Micro metrics** emphasize overall correctness across all label decisions.
175
+ - **Macro metrics** average performance across labels, giving more visibility into underrepresented classes (e.g., `Ownership`, `deprecation`, `rational`).
176
+
177
+ Per-class precision/recall/F1 can be inspected in the saved classification report for the Java transformer run (logged as an artifact in MLflow). These results show good performance on frequent categories such as `summary`, `usage`, and `Pointer`, with weaker but still meaningful performance on minority labels.
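The micro/macro distinction can be made concrete on a toy label matrix (values invented for the sketch; the real evaluation uses the NLBSE test split):

```python
# Rows = comments, columns = 7 labels; 1 means the label applies.
y_true = [[1, 0, 0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0, 0, 1],
          [1, 0, 0, 0, 1, 0, 0]]
y_pred = [[1, 0, 0, 1, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 1],
          [1, 0, 0, 0, 0, 0, 0]]

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

n_labels = len(y_true[0])
per_label = []
for j in range(n_labels):
    tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
    fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
    fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
    per_label.append((tp, fp, fn))

# Micro: pool every label decision across all labels, then compute F1 once.
TP = sum(tp for tp, _, _ in per_label)
FP = sum(fp for _, fp, _ in per_label)
FN = sum(fn for _, _, fn in per_label)
micro_f1 = f1(TP, FP, FN)

# Macro: compute F1 per label, then average, so rare labels weigh as much as frequent ones.
macro_f1 = sum(f1(*counts) for counts in per_label) / n_labels

print(micro_f1, macro_f1)  # missed rare labels drag macro far below micro
```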
178
+
179
+ ---
180
+
181
+ ## Training Details
182
+
183
+ ### Objective and Architecture
184
+
185
+ - **Base model:** `microsoft/codebert-base`
186
+ - **Head:** linear classification head with `num_labels = 7`
187
+ - **Problem type:** `multi_label_classification`
188
+ - **Loss function:** `BCEWithLogitsLoss` with **per-label positive class weights** computed from training label frequencies.
189
+ - **Sampling:** `WeightedRandomSampler` on training instances to mitigate class imbalance.
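A hedged sketch of how per-label positive class weights can be derived from training label frequencies (the `neg/pos` ratio is the convention PyTorch's `BCEWithLogitsLoss(pos_weight=...)` expects; the project's exact computation may differ):

```python
# Toy training label matrix: rows = samples, columns = the 7 labels (illustrative only).
train_labels = [
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0],
]

n_samples = len(train_labels)
n_labels = len(train_labels[0])

pos_weight = []
for j in range(n_labels):
    pos = sum(row[j] for row in train_labels)
    neg = n_samples - pos
    # Upweight rare positive labels; fall back to 1.0 for labels with no positives.
    pos_weight.append(neg / pos if pos else 1.0)

print(pos_weight)
```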
190
+
191
+ ### Hyperparameters
192
+
193
+ - **Max sequence length:** 128
194
+ - **Batch size:** 16
195
+ - **Learning rate:** 2e-5
196
+ - **Optimizer:** AdamW
197
+ - **Scheduler:** Linear warmup and decay
198
+ - **Warmup ratio:** 0.1
199
+ - **Number of epochs:** 5
200
+ - **Threshold for prediction:** 0.5 (applied to sigmoid probabilities)
201
+
202
+ ### Preprocessing and Balancing
203
+
204
+ - Training uses the **preprocessed and supersampled** Java CSVs from `data/processed/transformer`.
205
+ - Supersampling is applied only to the training split to upsample underrepresented labels while capping each label’s frequency at the original maximum, to avoid extreme duplication.
206
+ - The test split remains untouched and corresponds to the original NLBSE Java test data.
207
+
208
+ ### Hardware / Runtime
209
+
210
+ The reported average runtime (~168.5 seconds) and average GFLOPs (~26k) refer to the evaluation/benchmarking setup used in the project (single GPU, typical research hardware). Exact throughput and latency depend on the deployment environment and batch size.
211
+
212
+ ---
213
+
214
+ ## How to Use
215
+
216
+ Install `transformers` and `torch`:
217
+
218
+ ```bash
219
+ pip install transformers torch
220
+ ```
221
+
222
+ Then load the model and tokenizer (replace the model ID with your repository name):
223
+
224
+ ```python
225
+ import torch
226
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
227
+
228
+ MODEL_ID = "se4ai2526-uniba/java-transformer" # replace with actual ID
229
+
230
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
231
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
232
+ model.eval()
233
+
234
+ LABELS = [
235
+ "summary",
236
+ "Ownership",
237
+ "Expand",
238
+ "usage",
239
+ "Pointer",
240
+ "deprecation",
241
+ "rational",
242
+ ]
243
+
244
+ def predict_labels(text, threshold: float = 0.5):
245
+ inputs = tokenizer(
246
+ text,
247
+ padding=True,
248
+ truncation=True,
249
+ max_length=128,
250
+ return_tensors="pt",
251
+ )
252
+ with torch.no_grad():
253
+ logits = model(**inputs).logits
254
+ probs = torch.sigmoid(logits)
255
+
256
+ preds = (probs > threshold).int().cpu().numpy()
257
+ results = []
258
+ for row in preds:
259
+ labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
260
+ results.append(labels)
261
+ return results
262
+
263
+ # Example
264
+ comments = [
265
+ "// quotes are removed | ScannerUtility.java",
266
+ ]
267
+ print(predict_labels(comments))
268
+ ```
269
+
270
+ If you want to reproduce the project’s behaviour end-to-end, you can wrap this transformer in the same `ModelPredictor` utility used by the codebase.
271
+
272
+ ---
273
+
274
+ ## Limitations and Biases
275
+
276
+ * **Domain specificity:** The model is trained only on Java code comments from open-source projects. It may not generalize perfectly to other languages, domains, or proprietary codebases.
277
+ * **Imbalanced labels:** Some categories are relatively rare; even with supersampling and positive class weights, performance on minority labels may be unstable compared to frequent ones.
278
+ * **Sensitivity to perturbations:** Behavioral tests show that the current model is:
279
+
280
+ * deterministic and stable on duplicate inputs,
281
+ * reasonably aligned with curated golden examples,
282
+ * still sensitive to certain benign changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is applied.
283
+
284
+ ---
285
+
286
+ ## Ethical Considerations
287
+
288
+ * The training data consists of comments from open-source repositories. These may reflect cultural norms, jargon, and biases of the corresponding communities.
289
+ * The model does not attempt to filter offensive or inappropriate content in comments; it only assigns category labels for documentation-related classes.
290
+ * Use in downstream applications should account for potential biases and limitations and avoid presenting outputs as authoritative or error-free.
291
+
292
+ ---
293
+
294
+ ## Citation
295
+
296
+ If you use this model in academic work or derived systems, please cite:
297
+
298
+ > TheClouds Team. "NLBSE'26 Code Comment Classification – Java Model." 2025.
299
+
300
+ BibTeX:
301
+
302
+ ```bibtex
303
+ @misc{theclouds_nlbse26_code_comment_classification_java,
304
+ title = {NLBSE'26 Code Comment Classification: Java Model},
305
+ author = {TheClouds Team},
306
+ year = {2025},
307
+ note = {Model available on Hugging Face},
308
+ howpublished = {\url{To be published}}
309
+ }
310
+ ```
311
+
312
+ Contact:
313
+
314
+ For questions, feedback, or collaboration requests related to this model, please contact:
315
+ > Giacomo Signorile: g.signorile14@studenti.uniba.it
316
+ > Davide Pio Posa: d.posa3@studenti.uniba.it
317
+ > Marco Lillo: m.lillo21@studenti.uniba.it
318
+ > Rebecca Margiotta: m.margiotta5@studenti.uniba.it
319
+ > Adriano Gentile: a.gentile97@studenti.uniba.com
320
+
321
+ Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
322
+
323
+ ---
324
+
325
+ ## Acknowledgements
326
+
327
+ This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and prior SetFit baselines.
328
+
models/model_cards/pharo/setfit/README.md ADDED
@@ -0,0 +1,223 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - setfit
6
+ - sentence-transformers
7
+ - text-classification
8
+ - generated_from_setfit_trainer
9
+ license: mit
11
+ datasets:
12
+ - NLBSE/nlbse26-code-comment-classification
13
+ metrics:
14
+ - f1
15
+ - precision
16
+ - recall
17
+ - accuracy
18
+ pipeline_tag: text-classification
19
+ library_name: setfit
20
+ inference: false
21
+ base_model: sentence-transformers/paraphrase-MiniLM-L6-v2
22
+ model-index:
23
+ - name: SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2
24
+ results:
25
+ - task:
26
+ type: text-classification
27
+ name: Text Classification
28
+ dataset:
29
+ name: NLBSE Code Comment Classification Dataset (Pharo)
30
+ type: NLBSE/nlbse26-code-comment-classification
31
+ split: test
32
+ metrics:
33
+ - type: accuracy
34
+ value: 0.5673
35
+ name: Accuracy
36
+ ---
37
+
38
+ # SetFit Model for Pharo Code Comment Classification
39
+
40
+ ## Model Details
41
+
42
+ - **Model Type:** SetFit (Sentence Transformer Fine-tuning)
43
+ - **Base Model:** [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
44
+ - **Language:** Pharo (Comments in English)
45
+ - **License:** MIT
46
+ - **Developed by:** TheClouds
47
+ - **Model Date:** November 17, 2025
48
+ - **Model Version:** 1.0
49
+ - **Maximum Sequence Length:** 128 tokens
50
+ - **Contact:** For questions or comments about this model, please contact us via GitHub or email.
51
+
52
+ ### Description
53
+ This model is a SetFit model trained on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset**. It is designed to classify code comments into one or more of **6 categories** that describe the semantic purpose of the comment.
54
+
55
+ The model uses a multi-label classification approach, where a single comment can belong to multiple categories.
56
+
57
+ ## Intended Use
58
+ This model was created for the code comment classification task and trained specifically on code comments extracted from Pharo projects. It is therefore useful for research and development on code comment classification in Pharo projects, and for software documentation analysis tasks.
59
+
60
+ ### Out-of-Scope Use Cases
61
+ General text classification outside the domain of software engineering (e.g., social media sentiment analysis) is out of scope.
62
+
63
+ ## Factors
64
+
65
+ - **Programming Language:** The model is specifically trained on Pharo code comments.
66
+ - **Comment Types:** The model recognizes the following 6 categories specific to Pharo documentation:
67
+ 1. `Keyimplementationpoints`
68
+ 2. `Example`
69
+ 3. `Responsibilities`
70
+ 4. `Intent`
71
+ 5. `Keymessages`
72
+ 6. `Collaborators`
73
+
74
+ ## Metrics
75
+
76
+ - **Model Performance Measures:** The primary metrics used for evaluation are **Precision**, **Recall**, and **F1-Score**.
77
+ - **Performance:** The model achieves an average F1-Score of 0.4628 on the test set.
78
+
79
+ ## Evaluation Data
80
+
81
+ - **Dataset:** NLBSE Code Comment Classification Dataset (Pharo test split).
82
+ - **Size:** 208 rows.
83
+ - **Preprocessing:** Comments were extracted from real-world open-source Pharo projects, split into sentences, and manually classified.
84
+
85
+ ## Training Data
86
+
87
+ - **Dataset:** NLBSE Code Comment Classification Dataset (Pharo train split).
88
+ - **Size:** 900 rows.
89
+ - **Label Distribution:** The dataset contains 6 categories with varying frequencies.
90
+
91
+ ### Dataset Summary
92
+ The **NLBSE Code Comment Classification Dataset** is a collection of code comment sentences accompanied by multi-label category annotations.
93
+
94
+ - **Pharo Labels (6):** `collaborators`, `example`, `intent`, `keyimplementationpoints`, `keymessages`, `responsibilities`.
95
+
96
+ Each entry corresponds to a comment sentence extracted from real projects.
97
+
98
+ ## Quantitative Analyses
99
+ The following table shows the performance breakdown per category on the Pharo test set:
100
+
101
+ | language | category | precision | recall | f1 |
102
+ | ----- | --------------------------- | --------- | -------- | -------- |
103
+ | pharo | **Keyimplementationpoints** | 0.562500 | 0.642857 | 0.600000 |
104
+ | pharo | **Example** | 0.886364 | 0.876404 | 0.881356 |
105
+ | pharo | **Responsibilities** | 0.632653 | 0.738095 | 0.681319 |
106
+ | pharo | **Intent** | 0.720000 | 0.857143 | 0.782609 |
107
+ | pharo | **Keymessages** | 0.478261 | 0.733333 | 0.578947 |
108
+ | pharo | **Collaborators** | 0.103448 | 0.428571 | 0.166667 |
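The `f1` column is the harmonic mean of precision and recall; as a sanity check, recomputing it for the `Keyimplementationpoints` row above:

```python
def f1_score(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the Keyimplementationpoints row of the table.
f1_kip = f1_score(0.562500, 0.642857)
print(round(f1_kip, 6))  # matches the tabulated 0.600000 (up to rounding)
```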
109
+
110
+ ## Ethical Considerations
111
+
112
+ - **Biases:** The dataset is drawn from open-source software projects. The comments reflect the writing styles and norms of the open-source community, which may not be representative of all software development environments (e.g., proprietary software).
113
+ - **Content:** Comments are user-generated content and may contain informal language or jargon specific to the projects they were extracted from.
114
+
115
+ ## Caveats and Recommendations
116
+
117
+ - **Performance Variation:** The model performs well on `example` comments (F1 0.881) and `intent` comments (F1 0.782) but struggles significantly with all the other categories. Users should exercise caution when relying on the model for identifying development notes or rationale.
118
+ - **Context:** The model relies on text-only comment sentences. Surrounding code context is not included.
119
+
120
+ ## How to Use
121
+
122
+ First install the SetFit library:
123
+
124
+ ```bash
125
+ pip install setfit
126
+ ```
127
+
128
+ Then you can load this model and run inference:
129
+
130
+ ```python
131
+ from setfit import SetFitModel
132
+
133
+ # Download from the 🤗 Huggingface Hub
134
+ model = SetFitModel.from_pretrained("se4ai2526-uniba/setfit-pharo") # Replace with actual model ID if different
135
+
136
+ # Run inference
137
+ preds = model(["each phase knows about its start time and send a corresponding event once the phase is completed. | BlSpaceFramePhase"])
138
+ print(preds)
139
+ ```
140
+
141
+ ## Training Details
142
+
143
+ ### Training Hyperparameters
144
+
145
+ - batch_size: (32, 32)
146
+ - body_learning_rate: (2e-05, 1e-05)
147
+ - distance_metric: cosine_distance
148
+ - end_to_end: False
149
+ - eval_delay: False
150
+ - eval_max_steps: -1
151
+ - eval_steps: None
152
+ - eval_strategy: IntervalStrategy.NO
153
+ - evaluation_strategy: None
154
+ - greater_is_better: False
155
+ - head_learning_rate: 0.01
156
+ - l2_weight: 0.01
157
+ - load_best_model_at_end: False
158
+ - loss: CosineSimilarityLoss
159
+ - margin: 0.25
160
+ - max_length: None
161
+ - max_steps: -1
162
+ - metric_for_best_model: embedding_loss
163
+ - num_epochs: (2, 2)
164
+ - num_iterations: 5
165
+ - samples_per_label: 2
166
+ - sampling_strategy: oversampling
167
+ - save_steps: 500
168
+ - save_strategy: steps
169
+ - save_total_limit: 1
170
+ - seed: 42
171
+ - use_amp: False
172
+ - warmup_proportion: 0.1
173
+
174
+ ### Training Results
175
+
176
+ | Metric | Value |
177
+ | :----------------------- | :--------- |
178
+ | **Accuracy** | 0.5673 |
179
+ | **Embedding Loss** | 0.105 |
180
+ | **Training Loss** | 0.1566 |
181
+ | **Training Runtime** | 161.2121 s |
182
+ | **Training Samples/Sec** | 111.654 |
183
+ | **Training Steps/Sec** | 3.498 |
184
+
185
+ ### Framework Versions
186
+ - Python: 3.11.9
187
+ - SetFit: 1.1.2
188
+ - Sentence Transformers: 5.1.2
189
+ - Transformers: 4.57.1
190
+ - PyTorch: 2.7.1
191
+ - Datasets: 3.6.0
192
+ - Tokenizers: 0.22.1
193
+
194
+ ## Citation
195
+
196
+ If you use this model in academic work or derived systems, please cite:
197
+
198
+ > TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.
199
+
200
+ BibTeX:
201
+
202
+ ```bibtex
203
+ @misc{theclouds_nlbse26_code_comment_classification_pharo,
204
+ title = {NLBSE'26 Code Comment Classification: Pharo Model},
205
+ author = {TheClouds Team},
206
+ year = {2025},
207
+ note = {Model available on Hugging Face},
208
+ howpublished = {\url{To be published}}
209
+ }
210
+ ```
211
+
212
+ Contact:
213
+
214
+ For questions, feedback, or collaboration requests related to this model, please contact:
215
+ > Giacomo Signorile: g.signorile14@studenti.uniba.it
216
+ > Davide Pio Posa: d.posa3@studenti.uniba.it
217
+ > Marco Lillo: m.lillo21@studenti.uniba.it
218
+ > Rebecca Margiotta: m.margiotta5@studenti.uniba.it
219
+ > Adriano Gentile: a.gentile97@studenti.uniba.com
220
+
221
+ Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
222
+
223
models/model_cards/pharo/transformer/README.md ADDED
@@ -0,0 +1,329 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - text-classification
6
+ - code-comment-classification
7
+ - transformers
8
+ - codebert
9
+ - pharo
10
+ - software-engineering
11
+ - multi-label
12
+ license: mit
13
+ datasets:
14
+ - NLBSE/nlbse26-code-comment-classification
15
+ metrics:
16
+ - f1
17
+ - precision
18
+ - recall
19
+ - subset_accuracy
20
+ - runtime
21
+ - gflops
22
+ pipeline_tag: text-classification
23
+ library_name: transformers
24
+ inference: false
25
+ base_model: microsoft/codebert-base
26
+ model-index:
27
+ - name: CodeBERT Transformer for Pharo Code Comment Classification
28
+ results:
29
+ - task:
30
+ type: text-classification
31
+ name: Multi-label Text Classification
32
+ dataset:
33
+ name: NLBSE Code Comment Classification Dataset (Pharo)
34
+ type: NLBSE/nlbse26-code-comment-classification
35
+ split: test
36
+ metrics:
37
+ - type: f1
38
+ name: Macro F1
39
+ value: 0.5980
40
+ - type: f1
41
+ name: Micro F1
42
+ value: 0.6720
43
+ - type: precision
44
+ name: Macro Precision
45
+ value: 0.5234
46
+ - type: recall
47
+ name: Macro Recall
48
+ value: 0.7157
49
+ - type: accuracy
50
+ name: Subset Accuracy
51
+ value: 0.5096
52
+ ---
53
+
54
+ # Transformer Model (CodeBERT) for Pharo Code Comment Classification
55
+
56
+ ## Model Details
57
+
58
+ - **Model Type:** Transformer-based multi-label classifier (sequence classification head)
59
+ - **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
60
+ - **Language:** Pharo (code comments in English/technical English)
61
+ - **License:** MIT
62
+ - **Developed by:** TheClouds
63
+ - **Model Date:** November 2025
64
+ - **Model Version:** 1.0
65
+
66
+ ### Description
67
+
68
+ This model fine-tunes **CodeBERT** on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe design intent and responsibilities of classes and methods.
69
+
70
+ The classifier operates on the `combo` field used in the project (comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.
71
+
72
+ ### Label Set
73
+
74
+ For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):
75
+
76
+ 1. `Keyimplementationpoints`
77
+ 2. `Example`
78
+ 3. `Responsibilities`
79
+ 4. `Intent`
80
+ 5. `Keymessages`
81
+ 6. `Collaborators`
82
+
83
+ Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
84
+
85
+ ---
86
+
87
+ ## Intended Use
88
+
89
+ The model is intended for:
90
+
91
+ - research on **code comment and design documentation classification** in Pharo projects,
92
+ - mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
93
+ - tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).
94
+
95
+ It is designed for **Pharo code comments** written in English or English-like technical language.
96
+
97
+ ### Out-of-Scope Uses
98
+
99
+ - Generic text classification outside software engineering and Pharo/Smalltalk ecosystems.
100
+ - Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
101
+ - Any safety- or life-critical decision-making context.
102
+
103
+ ---
104
+
105
+ ## Data
106
+
107
+ ### Training Data
108
+
109
+ - **Dataset:** NLBSE Code Comment Classification Dataset – Pharo train split
110
+ - **Size (train):** ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
111
+ - **Label Space:** 6 multi-label categories (`Keyimplementationpoints`, `Example`, `Responsibilities`, `Intent`, `Keymessages`, `Collaborators`)
112
+ - **Preprocessing:**
113
+ - Comments extracted from real-world Pharo projects.
114
+ - Each sample represented using the `combo` field: `"<comment_sentence> | <class_context>"` (or similar contextual string).
115
+ - For this transformer configuration, the training data come from `data/processed/transformer`, where a supersampling procedure is applied to reduce label imbalance.
116
+
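For illustration, the `combo` representation described above can be reproduced with a one-line helper (the project's exact field construction may differ slightly; the format is taken from the examples in this card):

```python
def build_combo(comment_sentence: str, context: str) -> str:
    """Join a comment sentence with its compact context, as in the
    `combo` field: "<comment_sentence> | <class_context>"."""
    return f"{comment_sentence.strip()} | {context.strip()}"
```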
117
+ ### Evaluation Data
118
+
119
+ - **Dataset:** NLBSE Code Comment Classification Dataset – Pharo test split
120
+ - **Size (test):** ~200 comment sentences
121
+ - **Evaluation Protocol:** multi-label classification with micro/macro metrics and subset accuracy (exact-match), evaluated on the original, non-supersampled test split.
122
+
123
+ ---
124
+
125
+ ## Metrics
126
+
127
+ ### Core Evaluation Metrics (Pharo, test split)
128
+
129
+ From the training/evaluation run logged in MLflow:
130
+
131
+ | lan | cat | precision | recall | f1 |
132
+ |-------|------------------------|-----------|---------|---------|
133
+ | pharo | Keyimplementationpoints| 0.47 | 0.68| 0.56|
134
+ | pharo | Example | 0.89 | 0.83| 0.86|
135
+ | pharo | Responsibilities | 0.57 | 0.76| 0.65|
136
+ | pharo | Intent | 0.83 | 0.90| 0.86|
137
+ | pharo | Keymessages | 0.47 | 0.73| 0.57|
138
+ | pharo | Collaborators | 0.33 | 0.57| 0.42|
139
+
140
+ - **Micro F1:** 0.6720
141
+ - **Macro F1:** 0.5980
142
+ - **Micro Precision:** 0.5964
143
+ - **Micro Recall:** 0.7696
144
+ - **Macro Precision:** 0.5234
145
+ - **Macro Recall:** 0.7157
146
+ - **Subset Accuracy (exact match):** 0.5096
147
+ - **Micro Accuracy (per-label):** 0.8694
148
+ - **Eval Loss (BCE with logits):** 0.5889
149
+ - **Train Loss (final epoch):** 0.2149
150
+
151
+ ### Benchmarking Metrics
152
+
153
+ Average performance over Pharo transformer benchmarking runs:
154
+
155
+ - **Average Macro F1:** 0.5980
156
+ - **Average Precision (macro):** 0.5234
157
+ - **Average Recall (macro):** 0.7157
158
+ - **Average Runtime:** ~1.35 seconds (benchmark configuration)
159
+ - **Average GFLOPs:** ~1943.77
160
+
161
+ These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall, while macro precision and F1 reflect the dataset’s label imbalance and limited size.
162
+
163
+ ---
164
+
165
+ ## Quantitative Analysis
166
+
167
+ The evaluation is fully multi-label:
168
+
169
+ - **Micro metrics** reflect overall correctness across all label decisions.
170
+ - **Macro metrics** treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., `Collaborators`, `Keymessages`).
171
+
172
+ A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:
173
+
174
+ - Better performance on categories such as `Example` and `Intent` (both F1 ≈ 0.86).
175
+ - Weaker performance on `Keyimplementationpoints`, `Keymessages`, and `Collaborators` (F1 ≤ 0.57), which have fewer training examples.
176
+
177
+ ---
178
+
179
+ ## Training Details
180
+
181
+ ### Objective and Architecture
182
+
183
+ - **Base model:** `microsoft/codebert-base`
184
+ - **Head:** linear classification head with `num_labels = 6`
185
+ - **Problem type:** `multi_label_classification`
186
+ - **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
187
+ - **Sampling:** `WeightedRandomSampler` over training samples to partially correct for label imbalance.
188
+
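The per-label positive class weights mentioned above are typically the ratio of negative to positive examples per label; a minimal pure-Python sketch (helper names are illustrative — in training these weights would be passed to `BCEWithLogitsLoss(pos_weight=...)`):

```python
import math

def positive_class_weights(label_matrix):
    """pos_weight[j] = (#negatives for label j) / (#positives for label j)."""
    n, k = len(label_matrix), len(label_matrix[0])
    weights = []
    for j in range(k):
        pos = sum(row[j] for row in label_matrix)
        weights.append((n - pos) / max(pos, 1))
    return weights

def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean BCE-with-logits, scaling the positive term per label."""
    total, count = 0.0, 0
    for z_row, t_row in zip(logits, targets):
        for z, t, w in zip(z_row, t_row, pos_weight):
            log_p = -math.log1p(math.exp(-z))   # log sigmoid(z)
            log_not_p = -z + log_p              # log(1 - sigmoid(z))
            total += -(w * t * log_p + (1 - t) * log_not_p)
            count += 1
    return total / count
```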
189
+ ### Hyperparameters
190
+
191
+ - **Max sequence length:** 128
192
+ - **Batch size:** 16
193
+ - **Learning rate:** 2e-5
194
+ - **Optimizer:** AdamW
195
+ - **Scheduler:** Linear warmup and decay
196
+ - **Warmup ratio:** 0.1
197
+ - **Number of epochs:** 5
198
+ - **Prediction threshold:** 0.5 (per-label on sigmoid probabilities)
199
+
200
+ ### Preprocessing and Balancing
201
+
202
+ - Training data for Pharo are produced by the project’s preprocessing module, which:
203
+ - ensures a `combo` text field is present,
204
+ - parses the label strings into binary vectors,
205
+ - applies **supersampling** on the train split only (up to a cap at the maximum original label frequency).
206
+ - The test split is not modified and corresponds to the original NLBSE Pharo test data.
207
+
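The supersampling step can be sketched as follows (a simplified illustration of the idea, not the project's exact implementation): examples carrying under-represented labels are duplicated until each label's positive count reaches a cap at the maximum original label frequency.

```python
import random

def supersample(rows, labels, seed=42):
    """Duplicate examples carrying rare labels until every label's positive
    count reaches the cap (the most frequent label's original count)."""
    rng = random.Random(seed)
    k = len(labels[0])
    counts = [sum(l[j] for l in labels) for j in range(k)]
    cap = max(counts)
    out_rows, out_labels = list(rows), list(labels)
    for j in range(k):
        pool = [i for i, l in enumerate(labels) if l[j] == 1]
        while pool and counts[j] < cap:
            i = rng.choice(pool)
            out_rows.append(rows[i])
            out_labels.append(labels[i])
            for jj in range(k):  # a duplicate counts toward every label it carries
                counts[jj] += labels[i][jj]
    return out_rows, out_labels
```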
208
+ ### Hardware / Runtime
209
+
210
+ The runtime and GFLOPs figures are measured in the project’s benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.
211
+
212
+ ---
213
+
214
+ ## How to Use
215
+
216
+ Install dependencies:
217
+
218
+ ```bash
219
+ pip install transformers torch
220
+ ```
221
+
222
+ Then load the model and tokenizer (replace the model ID with the actual repository):
223
+
224
+ ```python
225
+ import torch
226
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
227
+
228
+ MODEL_ID = "se4ai2526-uniba/pharo-transformer" # replace with actual ID
229
+
230
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
231
+ model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
232
+ model.eval()
233
+
234
+ LABELS = [
235
+ "Keyimplementationpoints",
236
+ "Example",
237
+ "Responsibilities",
238
+ "Intent",
239
+ "Keymessages",
240
+ "Collaborators",
241
+ ]
242
+
243
+ def predict_labels(texts, threshold: float = 0.5):
244
+ if isinstance(texts, str):
245
+ texts = [texts]
246
+
247
+ inputs = tokenizer(
248
+ texts,
249
+ padding=True,
250
+ truncation=True,
251
+ max_length=128,
252
+ return_tensors="pt",
253
+ )
254
+ with torch.no_grad():
255
+ logits = model(**inputs).logits
256
+ probs = torch.sigmoid(logits)
257
+
258
+ preds = (probs > threshold).int().cpu().numpy()
259
+ results = []
260
+ for row in preds:
261
+ labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
262
+ results.append(labels)
263
+ return results
264
+
265
+ # Example
266
+ comments = [
267
+ "\"The intent of this class is to manage UI events\" | MyWidget class",
268
+ ]
269
+ print(predict_labels(comments))
270
+ ```
271
+
272
+ For consistency with the rest of the project, you can also use the shared `ModelPredictor` wrapper and the same preprocessing normalization applied during training.
273
+
274
+ ---
275
+
276
+ ## Limitations and Biases
277
+
278
+ * **Limited data:** The Pharo split is relatively small compared to Java/Python; this constrains the model’s ability to generalize, especially for minority labels.
279
+ * **Imbalanced label distribution:** Despite supersampling and positive weights, some categories remain harder to predict reliably.
280
+ * **Sensitivity to perturbations:** Behavioral tests show:
281
+
282
+ * deterministic behaviour and stable predictions on duplicate inputs,
283
+ * alignment with several curated golden examples,
284
+ * sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.
285
+
286
+ ---
287
+
288
+ ## Ethical Considerations
289
+
290
+ * The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
291
+ * It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
292
+ * Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.
293
+
294
+ ---
295
+
296
+ ## Citation
297
+
298
+ If you use this model in academic work or derived systems, please cite:
299
+
300
+ > TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.
301
+
302
+ BibTeX:
303
+
304
+ ```bibtex
305
+ @misc{theclouds_nlbse26_code_comment_classification_pharo,
306
+ title = {NLBSE'26 Code Comment Classification: Pharo Model},
307
+ author = {TheClouds Team},
308
+ year = {2025},
309
+ note = {Model available on Hugging Face},
310
+ howpublished = {\url{To be published}}
311
+ }
312
+ ```
313
+
314
+ Contact:
315
+
316
+ For questions, feedback, or collaboration requests related to this model, please contact:
317
+ > Giacomo Signorile: g.signorile14@studenti.uniba.it
318
+ > Davide Pio Posa: d.posa3@studenti.uniba.it
319
+ > Marco Lillo: m.lillo21@studenti.uniba.it
320
+ > Rebecca Margiotta: m.margiotta5@studenti.uniba.it
321
+ > Adriano Gentile: a.gentile97@studenti.uniba.it
322
+
323
+ Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
324
+
325
326
+ ## Acknowledgements
327
+
328
+ This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE’26 competition data and building on earlier SetFit and Random Forest baselines.
329
+
models/model_cards/python/setfit/README.md ADDED
@@ -0,0 +1,210 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - setfit
6
+ - sentence-transformers
7
+ - text-classification
8
+ - generated_from_setfit_trainer
9
+ license:
10
+ - mit
11
+ datasets:
12
+ - NLBSE/nlbse26-code-comment-classification
13
+ widget:
14
+ - text: dataright np^sin 2 np^pi 224 t | Audio
15
+ - text: robust way to ask the database for its current transaction state. | AtomicTests
16
+ - text: the string marking the beginning of a print statement. | Environment
17
+ - text: handled otherwise by a particular method. | StringMethods
18
+ - text: table. | PlotAccessor
19
+ metrics:
20
+ - accuracy
21
+ pipeline_tag: text-classification
22
+ library_name: setfit
23
+ inference: false
24
+ base_model: sentence-transformers/paraphrase-MiniLM-L6-v2
25
+ model-index:
26
+ - name: SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2
27
+ results:
28
+ - task:
29
+ type: text-classification
30
+ name: Text Classification
31
+ dataset:
32
+ name: NLBSE Code Comment Classification Dataset (Python)
33
+ type: NLBSE/nlbse26-code-comment-classification
34
+ split: test
35
+ metrics:
36
+ - type: accuracy
37
+ value: 0.4482758620689655
38
+ name: Accuracy
39
+ ---
40
+
41
+ # SetFit Model for Python Code Comment Classification
42
+
43
+ ## Model Details
44
+
45
+ - **Model Type:** SetFit (Sentence Transformer Fine-tuning)
46
+ - **Base Model:** [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
47
+ - **Language:** Python (Comments in English)
48
+ - **License:** MIT
49
+ - **Developed by:** TheClouds
50
+ - **Model Date:** November 17, 2025
51
+ - **Model Version:** 1.0
52
+ - **Maximum Sequence Length:** 128 tokens
53
+ - **Contact:** For questions or comments about this model, please contact us via GitHub or email.
54
+
55
+ ### Description
56
+ This model is a SetFit model trained on the **Python** subset of the **NLBSE Code Comment Classification Dataset**. It is designed to classify code comments into categories that describe the semantic purpose of the comment (e.g., Summary, Usage, Parameters).
57
+
58
+ The model uses a multi-label classification approach, where a single comment can belong to multiple categories.
59
+
60
+ ## Intended Use
61
+ This model was created for the Code Comment Classification task and trained specifically on code comments extracted from Python projects. It is therefore suited to research and development in code comment classification for Python projects and to software documentation analysis tasks.
62
+
63
+ ### Out-of-Scope Use Cases
64
+ General text classification outside the domain of software engineering (e.g., social media sentiment analysis) is out of scope.
65
+
66
+ ## Factors
67
+
68
+ - **Programming Language:** The model is specifically trained on Python code comments (including inline comments `#` and docstrings `"""`).
69
+ - **Comment Types:** The model has been evaluated on the following categories specific to software documentation:
70
+ 1. `Summary`
71
+ 2. `Usage`
72
+ 3. `Parameters`
73
+ 4. `Expand`
74
+ 5. `DevelopmentNotes`
75
+
76
+ ## Metrics
77
+
78
+ - **Model Performance Measures:** The primary metrics used for evaluation are **Precision**, **Recall**, **F1-Score**, and **Accuracy**.
79
+ - **Decision Thresholds:** A probability threshold of **0.5** was used for classification.
80
+ - **Global Performance:** The model achieves an overall Accuracy of **0.4483** on the test set.
81
+
82
+ ## Evaluation Data
83
+
84
+ - **Dataset:** NLBSE Code Comment Classification Dataset (Python test split).
85
+ - **Motivation:** This dataset was chosen because it is the established benchmark for the NLBSE (Natural Language-Based Software Engineering) workshop.
86
+ - **Size:** 290 rows.
87
+ - **Preprocessing:** Comments were extracted from real-world open-source Python projects, split into sentences, and manually classified.
88
+
89
+ ## Training Data
90
+
91
+ - **Dataset:** NLBSE Code Comment Classification Dataset (Python train split).
92
+ - **Dataset Stats:**
93
+ | Training set | Min | Median | Max |
94
+ |:-------------|:----|:--------|:----|
95
+ | Word count | 3 | 15.5217 | 299 |
96
+
97
+ ## Quantitative Analyses
98
+
99
+ The following table shows the performance breakdown per category on the Python test set:
100
+
101
+ | Language | Category | Precision | Recall | F1-Score |
102
+ |---|---|---|---|---|
103
+ | python | **Summary** | 0.6897 | 0.6557 | 0.6723 |
104
+ | python | **Usage** | 0.6667 | 0.6813 | 0.6739 |
105
+ | python | **Parameters** | 0.6882 | 0.7529 | 0.7191 |
106
+ | python | **Expand** | 0.4533 | 0.6667 | 0.5397 |
107
+ | python | **DevelopmentNotes** | 0.2192 | 0.5000 | 0.3048 |
108
+
109
+ ## Ethical Considerations
110
+
111
+ - **Biases:** The dataset is drawn from open-source software projects. The comments reflect the writing styles and norms of the open-source Python community.
112
+ - **Content:** Comments are user-generated content and may contain informal language or jargon.
113
+
114
+ ## Caveats and Recommendations
115
+
116
+ - **Performance Variation:** The model performs well on structural comments like `Parameters` (F1 0.72) but struggles significantly with `DevelopmentNotes` (F1 0.30). Users should exercise caution when relying on the model for identifying development notes or rationale.
117
+ - **Context:** The model relies on text-only comment sentences. Surrounding code context is not included.
118
+
119
+ ## How to Use
120
+
121
+ First install the SetFit library:
122
+
123
+ ```bash
124
+ pip install setfit
125
+ ```
126
+
127
+ Then you can load this model and run inference:
128
+
129
+ ```python
130
+ from setfit import SetFitModel
131
+
132
+ # Download from the 🤗 Hub
133
+ model = SetFitModel.from_pretrained("se4ai2526-uniba/setfit-python")
134
+
135
+ # Run inference
136
+ preds = model(["# yields the next value | generator.py"])
137
+ print(preds)
138
+ ```
139
+
140
+ ## Training Details
141
+
142
+ ### Training Hyperparameters
143
+ - batch_size: (32, 32)
144
+ - num_epochs: (2, 2)
145
+ - max_steps: -1
146
+ - sampling_strategy: oversampling
147
+ - num_iterations: 5
148
+ - body_learning_rate: (2e-05, 1e-05)
149
+ - head_learning_rate: 0.01
150
+ - loss: CosineSimilarityLoss
151
+ - distance_metric: cosine_distance
152
+ - margin: 0.25
153
+ - end_to_end: False
154
+ - use_amp: False
155
+ - warmup_proportion: 0.1
156
+ - l2_weight: 0.01
157
+ - seed: 42
158
+ - eval_max_steps: -1
159
+ - load_best_model_at_end: False
160
+
161
+ ### Training Results
162
+
163
+ | Metric | Value |
164
+ |:-------|:------|
165
+ | **Accuracy** | 0.4483 |
166
+ | **Embedding Loss** | 0.177 |
167
+ | **Training Loss** | 0.215 |
168
+ | **Training Runtime** | 137.40 s |
169
+ | **Training Samples/Sec** | 198.189 |
170
+ | **Training Steps/Sec** | 6.213 |
171
+
172
+ ### Framework Versions
173
+ - Python: 3.11.9
174
+ - SetFit: 1.1.2
175
+ - Sentence Transformers: 5.1.2
176
+ - Transformers: 4.57.1
177
+ - PyTorch: 2.7.1
178
+ - Datasets: 3.6.0
179
+ - Tokenizers: 0.22.1
180
+
181
+ ## Citation
182
+
183
+ If you use this model in academic work or derived systems, please cite:
184
+
185
+ > TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.
186
+
187
+ BibTeX:
188
+
189
+ ```bibtex
190
+ @misc{theclouds_nlbse26_code_comment_classification_python,
191
+ title = {NLBSE'26 Code Comment Classification: Python Model},
192
+ author = {TheClouds Team},
193
+ year = {2025},
194
+ note = {Model available on Hugging Face},
195
+ howpublished = {\url{To be published}}
196
+ }
197
+ ```
198
+
199
+ Contact:
200
+
201
+ For questions, feedback, or collaboration requests related to this model, please contact:
202
+ > Giacomo Signorile: g.signorile14@studenti.uniba.it
203
+ > Davide Pio Posa: d.posa3@studenti.uniba.it
204
+ > Marco Lillo: m.lillo21@studenti.uniba.it
205
+ > Rebecca Margiotta: m.margiotta5@studenti.uniba.it
206
+ > Adriano Gentile: a.gentile97@studenti.uniba.com
207
+
208
+ Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
209
+
210
models/model_cards/python/transformer/README.md ADDED
@@ -0,0 +1,322 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - text-classification
6
+ - code-comment-classification
7
+ - transformers
8
+ - codebert
9
+ - python
10
+ - software-engineering
11
+ - multi-label
12
+ license: mit
13
+ datasets:
14
+ - NLBSE/nlbse26-code-comment-classification
15
+ metrics:
16
+ - f1
17
+ - precision
18
+ - recall
19
+ - subset_accuracy
20
+ - runtime
21
+ - gflops
22
+ pipeline_tag: text-classification
23
+ library_name: transformers
24
+ inference: false
25
+ base_model: microsoft/codebert-base
26
+ model-index:
27
+ - name: CodeBERT Transformer for Python Code Comment Classification
28
+ results:
29
+ - task:
30
+ type: text-classification
31
+ name: Multi-label Text Classification
32
+ dataset:
33
+ name: NLBSE Code Comment Classification Dataset (Python)
34
+ type: NLBSE/nlbse26-code-comment-classification
35
+ split: test
36
+ metrics:
37
+ - type: f1
38
+ name: Macro F1
39
+ value: 0.6385
40
+ - type: f1
41
+ name: Micro F1
42
+ value: 0.6781
43
+ - type: precision
44
+ name: Macro Precision
45
+ value: 0.5900
46
+ - type: recall
47
+ name: Macro Recall
48
+ value: 0.7061
49
+ - type: accuracy
50
+ name: Subset Accuracy
51
+ value: 0.5690
52
+ ---
53
+
54
+ # Transformer Model (CodeBERT) for Python Code Comment Classification
55
+
56
+ ## Model Details
57
+
58
+ - **Model Type:** Transformer-based multi-label classifier (sequence classification head)
59
+ - **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
60
+ - **Language:** Python (code comments in English)
61
+ - **License:** MIT
62
+ - **Developed by:** TheClouds
63
+ - **Model Date:** November 2025
64
+ - **Model Version:** 1.0
65
+
66
+ ### Description
67
+
68
+ This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment.
69
+
70
+ The classifier operates on the project’s `combo` field (concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector.
71
+
72
+ ### Label Set
73
+
74
+ For Python, the model predicts the following 5 categories (fixed order in the classifier head):
75
+
76
+ 1. `Usage`
77
+ 2. `Parameters`
78
+ 3. `DevelopmentNotes`
79
+ 4. `Expand`
80
+ 5. `Summary`
81
+
82
+ Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
83
+
84
+ ---
85
+
86
+ ## Intended Use
87
+
88
+ The model is intended for:
89
+
90
+ - research on **code comment classification** in Python projects,
91
+ - mining and analysis of Python documentation comments,
92
+ - tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support).
93
+
94
+ It is designed for **Python code comments** in English or English-like technical language.
95
+
96
+ ### Out-of-Scope Uses
97
+
98
+ - Generic natural language classification outside software engineering.
99
+ - Non-English comments without additional fine-tuning or adaptation.
100
+ - Use in safety- or life-critical decision making.
101
+
102
+ ---
103
+
104
+ ## Data
105
+
106
+ ### Training Data
107
+
108
+ - **Dataset:** NLBSE Code Comment Classification Dataset – Python train split
109
+ - **Size (train):** ~1.4k original training examples (with optional supersampled expansion to ~2k examples, depending on the configuration)
110
+ - **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`)
111
+ - **Preprocessing:**
112
+ - Comments extracted from open-source Python projects.
113
+ - Each instance represented via the `combo` field: `"<comment_sentence> | <class_context>"`.
114
+ - The project’s preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`. The metrics reported here correspond to the current transformer configuration logged in MLflow for Python.
115
+
116
+ ### Evaluation Data
117
+
118
+ - **Dataset:** NLBSE Code Comment Classification Dataset – Python test split
119
+ - **Size (test):** ~300 comment sentences
120
+ - **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match).
121
+
122
+ ---
123
+
124
+ ## Metrics
125
+
126
+ ### Core Evaluation Metrics (Python, test split)
127
+
128
+ From the training/evaluation run logged in MLflow:
129
+
130
+ | lan | cat | precision | recall | f1 |
131
+ |--------|-----------------|-----------|---------|---------|
132
+ | python | Usage | 0.80 | 0.76| 0.78|
133
+ | python | Parameters | 0.74 | 0.86| 0.79|
134
+ | python | DevelopmentNotes| 0.41 | 0.50| 0.45|
135
+ | python | Expand | 0.49 | 0.67| 0.57|
136
+ | python | Summary | 0.63 | 0.82| 0.71|
137
+
138
+
139
+ - **Micro F1:** 0.6781
140
+ - **Macro F1:** 0.6385
141
+ - **Micro Precision:** 0.6230
142
+ - **Micro Recall:** 0.7438
143
+ - **Macro Precision:** 0.5900
144
+ - **Macro Recall:** 0.7061
145
+ - **Subset Accuracy (exact match):** 0.5690
146
+ - **Micro Accuracy (per-label):** 0.8441
147
+ - **Eval Loss (BCE with logits):** 0.6727
148
+ - **Train Loss (final epoch):** 0.2937
149
+
150
+ ### Benchmarking Metrics
151
+
152
+ Average performance for the Python transformer benchmark:
153
+
154
+ - **Average Macro F1:** 0.6385
155
+ - **Average Precision (macro):** 0.5900
156
+ - **Average Recall (macro):** 0.7061
157
+ - **Average Runtime:** ~0.94 seconds (benchmark configuration)
158
+ - **Average GFLOPs:** ~1823.25
159
+
160
+ These results indicate that the transformer captures useful patterns across all five Python comment categories, with stronger performance on frequent labels and reasonable performance on less frequent ones.
161
+
162
+ ---
163
+
164
+ ## Quantitative Analysis
165
+
166
+ The model is evaluated in a multi-label setting:
167
+
168
+ - **Micro metrics** emphasize the overall correctness across all label decisions.
169
+ - **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`).
170
+
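These metric variants can be computed directly from the binary prediction matrix; a minimal pure-Python sketch, equivalent to standard implementations:

```python
def _prf(tp, fp, fn):
    """Precision, recall, F1 from confusion counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_macro_f1(y_true, y_pred):
    """Micro F1 pools TP/FP/FN over all labels; macro F1 averages per-label F1."""
    k = len(y_true[0])
    tp_all = fp_all = fn_all = 0
    per_label_f1 = []
    for j in range(k):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
        per_label_f1.append(_prf(tp, fp, fn)[2])
    micro = _prf(tp_all, fp_all, fn_all)[2]
    macro = sum(per_label_f1) / k
    return micro, macro

def subset_accuracy(y_true, y_pred):
    """Exact-match ratio: every label of an example must be correct."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```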
171
+ Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable.
172
+
173
+ ---
174
+
175
+ ## Training Details
176
+
177
+ ### Objective and Architecture
178
+
179
+ - **Base model:** `microsoft/codebert-base`
180
+ - **Head:** linear classification head with `num_labels = 5`
181
+ - **Problem type:** `multi_label_classification`
182
+ - **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
183
+ - **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance.
184
+
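One plausible weighting scheme for the sampler (an illustration, not necessarily the project's exact formula) gives each example the mean inverse frequency of its positive labels, so comments carrying rare labels are drawn more often:

```python
import random

def example_weights(labels):
    """Per-example sampling weight: mean inverse frequency of positive labels."""
    k = len(labels[0])
    freq = [max(sum(l[j] for l in labels), 1) for j in range(k)]
    inv = [1.0 / f for f in freq]
    weights = []
    for l in labels:
        pos = [inv[j] for j in range(k) if l[j] == 1]
        weights.append(sum(pos) / len(pos) if pos else min(inv))
    return weights

# one "epoch" of indices drawn with replacement, as WeightedRandomSampler does
labels = [[1, 0], [1, 0], [1, 0], [0, 1]]
indices = random.choices(range(len(labels)), weights=example_weights(labels), k=len(labels))
```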
185
+ ### Hyperparameters
186
+
187
+ - **Max sequence length:** 128
188
+ - **Batch size:** 16
189
+ - **Learning rate:** 2e-5
190
+ - **Optimizer:** AdamW
191
+ - **Scheduler:** Linear warmup and decay
192
+ - **Warmup ratio:** 0.1
193
+ - **Number of epochs:** 5
194
+ - **Threshold for prediction:** 0.5 (per-label on sigmoid probabilities)
195
+
196
+ ### Preprocessing and Balancing
197
+
198
+ - Training uses the **Python** split prepared by the project’s preprocessing pipeline.
199
+ - Optional supersampling (oversampling of underrepresented labels with a cap at the maximum label frequency) is available and can be enabled to improve macro performance.
200
+ - The test split remains unchanged and corresponds to the original NLBSE Python test partition.
201
+
202
+ ### Hardware / Runtime
203
+
204
+ The reported runtime and GFLOPs are based on the project’s benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size.
205
+
206
+ ---
207
+
208
+ ## How to Use
209
+
210
+ Install `transformers` and `torch`:
211
+
212
+ ```bash
213
+ pip install transformers torch
214
+ ```
215
+
216
+ Then load the model and tokenizer (replace the model ID with your repository name):
217
+
218
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/python-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "Usage",
    "Parameters",
    "DevelopmentNotes",
    "Expand",
    "Summary",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits)

    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "# Usage: call this function with a file path | module.py",
]
print(predict_labels(comments))
```
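The reported precision, recall, and F1 can be recomputed from such binary prediction matrices. A minimal, dependency-free sketch (equivalent in spirit to `sklearn.metrics.precision_recall_fscore_support`, but not the project's evaluation code):

```python
def per_label_prf1(y_true, y_pred):
    """Per-label precision/recall/F1 plus macro-F1 for multi-label
    predictions given as lists of 0/1 rows (one row per example)."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    macro_f1 = sum(s["f1"] for s in scores) / n_labels  # unweighted mean
    return scores, macro_f1
```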

For full reproducibility consistent with the project, use the `ModelPredictor` wrapper and the same preprocessing used during training.

---

## Limitations and Biases

* **Domain-limited:** Trained only on Python code comments from open-source repositories.
* **Imbalanced labels:** Some categories are relatively underrepresented; performance on these labels can lag behind the frequent ones.
* **Robustness:** Behavioral tests show that the current model:
  * is deterministic and stable on duplicate inputs,
  * aligns with several curated golden examples,
  * remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced.

---
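One inexpensive mitigation for the whitespace and case sensitivity noted above is to normalize comments before inference; a hypothetical sketch (this pass is not part of the released model):

```python
import re

def normalize_comment(text: str) -> str:
    """Strip leading comment markers (#, //, /*, *), collapse runs of
    whitespace, and lowercase -- reducing benign surface variation
    before the text reaches the tokenizer."""
    text = re.sub(r"^[#/*\s]+", "", text)     # leading comment syntax
    text = re.sub(r"\s+", " ", text).strip()  # whitespace runs
    return text.lower()

print(normalize_comment("#   Usage:   Call  THIS   function"))
# -> "usage: call this function"
```

If such a pass is applied at inference time, the same normalization should also be applied to the training data, otherwise the train/test distributions diverge.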

## Ethical Considerations

* The model reflects the style and biases of the open-source Python projects it was trained on.
* It does not filter offensive or inappropriate content in comments; it only predicts semantic categories.
* Outputs should be treated as assistive signals, not as authoritative judgements.

---


## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_python,
  title        = {NLBSE'26 Code Comment Classification: Python Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```


Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and earlier SetFit and Random Forest baselines.