Commit 9636971 · Parent(s): b4c95e1
fixed a typo in the project name

Files changed:
- models/model_cards/java/setfit/README.md +231 -0
- models/model_cards/java/transformer/README.md +328 -0
- models/model_cards/pharo/setfit/README.md +223 -0
- models/model_cards/pharo/transformer/README.md +329 -0
- models/model_cards/python/setfit/README.md +210 -0
- models/model_cards/python/transformer/README.md +322 -0
models/model_cards/java/setfit/README.md
ADDED

---
language:
- en
tags:
- text-classification
- code-comment-classification
- setfit
- java
- software-engineering
- multi-label
- sentence-transformers
- generated_from_setfit_trainer
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
pipeline_tag: text-classification
library_name: setfit
inference: false
widget:
- text: '@link FSNamesystem#readLock() | FSPermissionChecker.java'
- text: previous^checkpoint li | TestSaveNamespace.java
- text: // the file doesn't have anything | TaskLog.java
- text: " @param file the file the include directives point to\n\t * @param depth\
    \ depth to which includes are followed, should be one of\n\t * {@link #DEPTH_ZERO}\
    \ or {@link #DEPTH_INFINITE}\n\t * @return an array of include relations\n\t *\
    \ @throws CoreException | IIndex.java"
- text: // quotes are removed | ScannerUtility.java
base_model: sentence-transformers/paraphrase-MiniLM-L6-v2
model-index:
- name: SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Java)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: accuracy
      value: 0.7435
      name: Accuracy
---

# SetFit Model for Java Code Comment Classification

## Model Details

- **Model Type:** SetFit (Sentence Transformer Fine-tuning)
- **Base Model:** [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
- **Language:** Java (comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 4, 2025
- **Model Version:** 1.0
- **Contact:** For additional information, contact team TheClouds on GitHub.

### Description

This model is a SetFit model trained on the **Java** subset of the **NLBSE Code Comment Classification Dataset**. It classifies code comments into one or more of **7 categories** that describe the semantic purpose of the comment.

The model uses a multi-label classification approach, so a single comment can belong to multiple categories.

## Intended Use

This model was created for the code comment classification task and trained specifically on code comments extracted from Java projects. It is therefore useful for research and development on code comment classification in Java (or other object-oriented languages), and for software documentation analysis tasks that perform supervised multi-label classification.

### Out-of-Scope Use Cases

General text classification outside the domain of software engineering.

## Factors

- **Programming Language:** The model is specifically trained on Java code comments.
- **Comment Types:** The model recognizes the following 7 categories specific to Java documentation:
  1. `summary`
  2. `Ownership`
  3. `Expand`
  4. `usage`
  5. `Pointer`
  6. `deprecation`
  7. `rational`

## Metrics

- **Model Performance Measures:** The primary evaluation metrics are **Precision**, **Recall**, **F1-Score**, and **Accuracy**.
- **Decision Threshold:** A probability threshold of 0.5 was used for classification.
- **Performance:** The model achieves an overall Accuracy of **0.7669** on the test set.

### Dataset Summary

The **NLBSE Code Comment Classification Dataset** is a collection of code comment sentences accompanied by multi-label category annotations.

- **Java Labels (7):** `summary`, `Ownership`, `Expand`, `usage`, `Pointer`, `deprecation`, `rational`.

Each entry corresponds to a comment sentence extracted from real projects.

### Motivation

The use of this specific dataset is a requirement of the NLBSE'26 code comment classification challenge.

## Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset (Java train split).
- **Size:** 5,390 rows.
- **Label Distribution:** The dataset contains 7 categories with varying frequencies. Common categories include `summary` and `usage`.

## Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset (Java test split).
- **Size:** 1,200 rows.
- **Preprocessing:** Comments were extracted from real-world open-source Java projects, split into sentences, and manually classified.

## Quantitative Analyses

| lan  | cat         | precision | recall   | f1       |
|------|-------------|-----------|----------|----------|
| java | summary     | 0.871224  | 0.886731 | 0.878909 |
| java | Ownership   | 1.000000  | 1.000000 | 1.000000 |
| java | Expand      | 0.330097  | 0.430380 | 0.373626 |
| java | usage       | 0.883803  | 0.850847 | 0.867012 |
| java | Pointer     | 0.775641  | 0.968000 | 0.861210 |
| java | deprecation | 0.875000  | 0.700000 | 0.777778 |
| java | rational    | 0.311688  | 0.413793 | 0.355556 |

## Ethical Considerations

- **Biases:** The dataset is drawn from open-source software projects. The comments reflect the writing styles and norms of the open-source community, which may not be representative of all software development environments (e.g., proprietary software).
- **Content:** Comments are user-generated content and may contain informal language or jargon specific to the projects they were extracted from.

## Caveats and Recommendations

- **Language Specificity:** The label set is specific to Java.
- **Context:** The model relies on text-only comment sentences. Surrounding code context is not included, which may limit the model's ability to resolve ambiguous comments.
- **Class Imbalance:** Some categories (e.g., `deprecation`, `Ownership`) may be underrepresented compared to `summary` or `usage`.

## How to Use

First install the SetFit library:

```bash
pip install setfit
```

Then load the model and run inference:

```python
from setfit import SetFitModel

# Download from the 🤗 Hugging Face Hub
model = SetFitModel.from_pretrained("se4ai2526-uniba/setfit-java")  # Replace with the actual model ID if different

# Run inference
preds = model(["// quotes are removed | ScannerUtility.java"])
print(preds)
```
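
The 0.5 decision threshold mentioned above is applied to per-label probabilities. As a minimal sketch of that step (the `LABELS` ordering here is an assumption and must match the trained classifier head; the `predict_proba` call is shown only in a comment), probabilities can be mapped to label names like this:

```python
# Assumed label order for the Java model; verify it against the trained head.
LABELS = ["summary", "Ownership", "Expand", "usage", "Pointer", "deprecation", "rational"]

def probs_to_labels(probs, threshold=0.5):
    """Map one comment's per-label probabilities to predicted label names."""
    return [lab for lab, p in zip(LABELS, probs) if p >= threshold]

# With a loaded SetFit model, probabilities would come from, e.g.:
#   probs = model.predict_proba(["// quotes are removed | ScannerUtility.java"])[0]
print(probs_to_labels([0.91, 0.02, 0.10, 0.64, 0.33, 0.01, 0.05]))  # -> ['summary', 'usage']
```

Raising the threshold trades recall for precision on the minority labels.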

## Training Details

### Training Hyperparameters

- batch_size: (32, 32)
- num_epochs: (2, 2)
- max_steps: -1
- sampling_strategy: oversampling
- num_iterations: 5
- body_learning_rate: (2e-05, 1e-05)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- l2_weight: 0.01
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: False
- probability_threshold: 0.5

### Training Results

| Metric | Value |
|:-------|:------|
| **Accuracy** | 0.7669 |
| **Embedding Loss** | 0.0239 |
| **Training Loss** | 0.0587 |
| **Training Runtime** | 1515.40 s |
| **Training Samples/Sec** | 71.189 |
| **Training Steps/Sec** | 2.225 |

### Framework Versions

- Python: 3.11.9
- SetFit: 1.1.2
- Sentence Transformers: 5.1.2
- Transformers: 4.57.1
- PyTorch: 2.7.1
- Datasets: 3.6.0
- Tokenizers: 0.22.1

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Java Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_java,
  title        = {NLBSE'26 Code Comment Classification: Java Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was created for research in the context of **NLBSE (Natural Language-Based Software Engineering)**.
models/model_cards/java/transformer/README.md
ADDED

---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- java
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Java Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Java)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.7457
    - type: f1
      name: Micro F1
      value: 0.8364
    - type: precision
      name: Macro Precision
      value: 0.7307
    - type: recall
      name: Macro Recall
      value: 0.7658
    - type: accuracy
      name: Subset Accuracy
      value: 0.8085
---

# Transformer Model (CodeBERT) for Java Code Comment Classification

## Model Details

- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Java (code comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0

### Description

This model fine-tunes **CodeBERT** on the **Java** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Java code comment sentence is mapped to one or more semantic categories describing the intent and role of the comment.

The classifier operates directly on the concatenated `combo` field used in the project (comment sentence plus file/method context string) and produces a 7-dimensional binary label vector.

### Label Set

For Java, the model predicts the following 7 categories (fixed order in the classifier head):

1. `summary`
2. `Ownership`
3. `Expand`
4. `usage`
5. `Pointer`
6. `deprecation`
7. `rational`

Each prediction is a length-7 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.

---

## Intended Use

The model is intended for:

- research on **code comment classification** in Java projects,
- analysis and mining of Java documentation comments,
- downstream tools that need a multi-label semantic categorization of comments (e.g., documentation quality checks, comment recommendation, refactoring assistants).

It is designed for **Java code comments** and similar documentation-style text from software projects.

### Out-of-Scope Uses

- Generic natural language classification outside the software engineering domain.
- Non-English comments, or comments from programming languages with substantially different documentation conventions, without additional fine-tuning.
- Safety- or life-critical decision making.

---

## Data

### Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Java train split
- **Size (train):** ~5.4k comment sentences
- **Label Space:** 7 multi-label categories (`summary`, `Ownership`, `Expand`, `usage`, `Pointer`, `deprecation`, `rational`)
- **Preprocessing:**
  - Comments extracted from real-world open-source Java projects.
  - Split into comment sentences and associated with class annotations.
  - Project-specific preprocessing uses a `combo` field (`"<comment_sentence> | <class_context>"`).
  - For this transformer model, training uses the preprocessed CSVs under `data/processed/transformer`, including synthetic oversampling (supersampling) for label balancing.

### Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Java test split
- **Size (test):** ~1.2k comment sentences
- **Evaluation Protocol:** multi-label classification with micro and macro metrics; subset accuracy (exact match) is also reported.

---

## Metrics

### Core Evaluation Metrics (Java, test split)

From the training/evaluation run logged in MLflow:

| lan  | cat         | precision | recall | f1   |
|------|-------------|-----------|--------|------|
| java | summary     | 0.88      | 0.92   | 0.90 |
| java | Ownership   | 1.00      | 1.00   | 1.00 |
| java | Expand      | 0.41      | 0.44   | 0.42 |
| java | usage       | 0.89      | 0.85   | 0.87 |
| java | Pointer     | 0.75      | 0.98   | 0.85 |
| java | deprecation | 0.89      | 0.80   | 0.84 |
| java | rational    | 0.40      | 0.41   | 0.41 |

- **Micro F1:** 0.8364
- **Macro F1:** 0.7457
- **Micro Precision:** 0.8142
- **Micro Recall:** 0.8599
- **Macro Precision:** 0.7307
- **Macro Recall:** 0.7658
- **Subset Accuracy (exact match):** 0.8085
- **Micro Accuracy (per-label):** 0.9515
- **Eval Loss (BCE with logits):** 0.6207
- **Train Loss (final epoch):** 0.0291
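
These aggregates follow the standard multi-label definitions. The pure-Python toy example below (illustrative values only, not taken from the actual run) shows how micro F1, subset accuracy, and per-label micro accuracy differ:

```python
# Toy multi-label matrices (rows = comment sentences, cols = the 7 Java labels).
y_true = [[1, 0, 0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0, 0, 0],
          [1, 0, 0, 0, 1, 0, 0]]
y_pred = [[1, 0, 0, 1, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0],
          [1, 0, 0, 0, 1, 0, 0]]

def micro_f1(y_true, y_pred):
    # Pool true positives / predicted positives / actual positives over all cells.
    tp = sum(t and p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    pred_pos = sum(p for row in y_pred for p in row)
    true_pos = sum(t for row in y_true for t in row)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def subset_accuracy(y_true, y_pred):
    # Exact match: all 7 label decisions for a comment must be correct.
    return sum(rt == rp for rt, rp in zip(y_true, y_pred)) / len(y_true)

def micro_accuracy(y_true, y_pred):
    # Per-label accuracy: fraction of individual 0/1 decisions that are correct.
    cells = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    return sum(t == p for t, p in cells) / len(cells)
```

Subset accuracy is the strictest of the three, which is why it sits well below the per-label micro accuracy in the table above.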

### Benchmarking Metrics

Average performance over the Java benchmarking runs:

- **Average Macro F1:** 0.7457
- **Average Precision (macro):** 0.7307
- **Average Recall (macro):** 0.7658
- **Average Runtime (sec per run):** 168.53
- **Average GFLOPs (inference benchmark):** 26118.05

These metrics indicate that the transformer model improves over earlier baselines in both micro and macro F1, while maintaining reasonable runtime characteristics for research workloads.

---

## Quantitative Analysis

The model is evaluated in a strictly multi-label setting:

- **Micro metrics** emphasize overall correctness across all label decisions.
- **Macro metrics** average performance across labels, giving more visibility into underrepresented classes (e.g., `Ownership`, `deprecation`, `rational`).

Per-class precision/recall/F1 can be inspected in the saved classification report for the Java transformer run (logged as an artifact in MLflow). These results show good performance on frequent categories such as `summary`, `usage`, and `Pointer`, with weaker but still meaningful performance on minority labels.

---

## Training Details

### Objective and Architecture

- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 7`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with **per-label positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` on training instances to mitigate class imbalance.
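
A minimal sketch of how such per-label weights and the instance sampler can be set up (the negatives-to-positives weighting rule and the per-row weight heuristic are assumptions; the project's exact computation may differ):

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy multi-hot label matrix: (num_samples, 7 Java labels).
y_train = torch.tensor([[1, 0, 0, 1, 0, 0, 0],
                        [1, 0, 0, 0, 0, 0, 0],
                        [0, 0, 1, 1, 0, 0, 0],
                        [1, 0, 0, 0, 1, 0, 0]], dtype=torch.float)

n = y_train.shape[0]
pos = y_train.sum(dim=0)                     # positives per label
pos_weight = (n - pos) / pos.clamp(min=1.0)  # negatives/positives, guarded against zero

# Rare-label positives contribute a larger loss term.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = loss_fn(torch.zeros(n, 7), y_train)   # stand-in logits

# Instance weights for the sampler: weight each row by its rarest label.
sample_w = (y_train * pos_weight).amax(dim=1).clamp(min=1.0)
sampler = WeightedRandomSampler(sample_w.tolist(), num_samples=n, replacement=True)
```

The sampler is then passed to the training `DataLoader` so minority-label rows are drawn more often per epoch.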

### Hyperparameters

- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Prediction threshold:** 0.5 (applied to sigmoid probabilities)

### Preprocessing and Balancing

- Training uses the **preprocessed and supersampled** Java CSVs from `data/processed/transformer`.
- Supersampling is applied only to the training split to upsample underrepresented labels while capping each label's frequency at the original maximum, to avoid extreme duplication.
- The test split remains untouched and corresponds to the original NLBSE Java test data.
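
The capped upsampling can be sketched as follows; this is a simplified, hypothetical re-implementation that assumes single-label rows, while the real pipeline must also handle rows carrying several labels:

```python
import random

random.seed(0)

# Toy training rows: (text, labels). "summary" is frequent, "deprecation" is rare.
rows = [("a", ["summary"])] * 8 + [("b", ["deprecation"])] * 2

def label_counts(rows):
    counts = {}
    for _, labels in rows:
        for lab in labels:
            counts[lab] = counts.get(lab, 0) + 1
    return counts

def supersample(rows):
    counts = label_counts(rows)
    cap = max(counts.values())  # the original maximum label frequency
    out = list(rows)
    for lab, c in counts.items():
        pool = [r for r in rows if lab in r[1]]
        while c < cap:          # duplicate rare-label rows only up to the cap
            out.append(random.choice(pool))
            c += 1
    return out
```

After balancing, each label's count reaches the cap without any label being duplicated past the original maximum.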

### Hardware / Runtime

The reported average runtime (~168.5 seconds) and average GFLOPs (~26k) refer to the evaluation/benchmarking setup used in the project (single GPU, typical research hardware). Exact throughput and latency depend on the deployment environment and batch size.

---

## How to Use

Install `transformers` and `torch`:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with your repository name):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/java-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "summary",
    "Ownership",
    "Expand",
    "usage",
    "Pointer",
    "deprecation",
    "rational",
]

def predict_labels(texts, threshold: float = 0.5):
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits)

    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "// quotes are removed | ScannerUtility.java",
]
print(predict_labels(comments))
```

If you want to reproduce the project's behaviour end-to-end, you can wrap this transformer in the same `ModelPredictor` utility used by the codebase.

---

## Limitations and Biases

* **Domain specificity:** The model is trained only on Java code comments from open-source projects. It may not generalize perfectly to other languages, domains, or proprietary codebases.
* **Imbalanced labels:** Some categories are relatively rare; even with supersampling and positive class weights, performance on minority labels may be unstable compared to frequent ones.
* **Sensitivity to perturbations:** Behavioral tests show that the current model is:
  * deterministic and stable on duplicate inputs,
  * reasonably aligned with curated golden examples,
  * still sensitive to certain benign changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is applied.

---

## Ethical Considerations

* The training data consists of comments from open-source repositories. These may reflect cultural norms, jargon, and biases of the corresponding communities.
* The model does not attempt to filter offensive or inappropriate content in comments; it only assigns category labels for documentation-related classes.
* Use in downstream applications should account for potential biases and limitations and avoid presenting outputs as authoritative or error-free.

---

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Java Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_java,
  title        = {NLBSE'26 Code Comment Classification: Java Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

---

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE'26 competition data and prior SetFit baselines.
models/model_cards/pharo/setfit/README.md
ADDED

---
language:
- en
tags:
- setfit
- sentence-transformers
- text-classification
- generated_from_setfit_trainer
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- accuracy
pipeline_tag: text-classification
library_name: setfit
inference: false
base_model: sentence-transformers/paraphrase-MiniLM-L6-v2
model-index:
- name: SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Pharo)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: accuracy
      value: 0.5673
      name: Accuracy
---
|
| 37 |
+
|
| 38 |
+
# SetFit Model for Pharo Code Comment Classification

## Model Details

- **Model Type:** SetFit (Sentence Transformer fine-tuning)
- **Base Model:** [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
- **Language:** Pharo (comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 17, 2025
- **Model Version:** 1.0
- **Maximum Sequence Length:** 128 tokens
- **Contact:** For questions or comments about this model, please contact us via GitHub or email.

### Description

This model is a SetFit model trained on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset**. It classifies code comments into one or more of **6 categories** that describe the semantic purpose of the comment.

The model uses a multi-label classification approach, so a single comment can belong to multiple categories.

## Intended Use

This model was created for the Code Comment Classification task and trained specifically on code comments extracted from Pharo projects. It is therefore useful for research and development in code comment classification for Pharo projects, or for software documentation analysis tasks.

### Out-of-Scope Use Cases

General text classification outside the domain of software engineering (e.g., social media sentiment analysis) is out of scope.

## Factors

- **Programming Language:** The model is trained specifically on Pharo code comments.
- **Comment Types:** The model recognizes the following 6 categories specific to Pharo documentation:
  1. `Keyimplementationpoints`
  2. `Example`
  3. `Responsibilities`
  4. `Intent`
  5. `Keymessages`
  6. `Collaborators`

## Metrics

- **Model Performance Measures:** The primary evaluation metrics are **Precision**, **Recall**, and **F1-Score**.
- **Performance:** The model achieves an average F1-Score of 0.4628 on the test set.

## Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset (Pharo test split).
- **Size:** 208 rows.
- **Preprocessing:** Comments were extracted from real-world open-source Pharo projects, split into sentences, and manually classified.

## Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset (Pharo train split).
- **Size:** 900 rows.
- **Label Distribution:** The dataset contains 6 categories with varying frequencies.

### Dataset Summary

The **NLBSE Code Comment Classification Dataset** is a collection of code comment sentences accompanied by multi-label category annotations.

- **Pharo Labels (6):** `collaborators`, `example`, `intent`, `keyimplementationpoints`, `keymessages`, `responsibilities`.

Each entry corresponds to a comment sentence extracted from real projects.

## Quantitative Analyses

The following table shows the performance breakdown per category on the Pharo test set:

| lan   | cat                         | precision | recall   | f1       |
| ----- | --------------------------- | --------- | -------- | -------- |
| pharo | **Keyimplementationpoints** | 0.562500  | 0.642857 | 0.600000 |
| pharo | **Example**                 | 0.886364  | 0.876404 | 0.881356 |
| pharo | **Responsibilities**        | 0.632653  | 0.738095 | 0.681319 |
| pharo | **Intent**                  | 0.720000  | 0.857143 | 0.782609 |
| pharo | **Keymessages**             | 0.478261  | 0.733333 | 0.578947 |
| pharo | **Collaborators**           | 0.103448  | 0.428571 | 0.166667 |

## Ethical Considerations

- **Biases:** The dataset is drawn from open-source software projects. The comments reflect the writing styles and norms of the open-source community, which may not be representative of all software development environments (e.g., proprietary software).
- **Content:** Comments are user-generated content and may contain informal language or jargon specific to the projects they were extracted from.

## Caveats and Recommendations

- **Performance Variation:** The model performs well on `Example` comments (F1 0.881) and `Intent` comments (F1 0.783), but is notably weaker on the remaining categories, above all `Collaborators` (F1 0.167). Exercise caution when relying on the model for these labels.
- **Context:** The model relies on text-only comment sentences. Surrounding code context is not included.

## How to Use

First install the SetFit library:

```bash
pip install setfit
```

Then you can load this model and run inference:

```python
from setfit import SetFitModel

# Download from the 🤗 Hugging Face Hub
model = SetFitModel.from_pretrained("se4ai2526-uniba/setfit-pharo")  # Replace with the actual model ID if different

# Run inference
preds = model(["each phase knows about its start time and send a corresponding event once the phase is completed. | BlSpaceFramePhase"])
print(preds)
```
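
Since this is a multi-label model, each prediction is a vector of 0/1 decisions in the fixed label order of the classifier head. A minimal sketch of mapping such a vector back to category names (the `decode` helper below is illustrative, not part of the SetFit API):

```python
# Hypothetical illustration: decoding a multi-label prediction vector into
# Pharo category names, assuming one 0/1 value per label in this fixed order.
LABELS = [
    "Keyimplementationpoints", "Example", "Responsibilities",
    "Intent", "Keymessages", "Collaborators",
]

def decode(vector):
    """Map a binary prediction vector to the active category names."""
    return [label for label, flag in zip(LABELS, vector) if flag == 1]

# Example: a comment tagged as both Responsibilities and Intent.
print(decode([0, 0, 1, 1, 0, 0]))  # ['Responsibilities', 'Intent']
```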

## Training Details

### Training Hyperparameters

- batch_size: (32, 32)
- body_learning_rate: (2e-05, 1e-05)
- distance_metric: cosine_distance
- end_to_end: False
- eval_delay: False
- eval_max_steps: -1
- eval_steps: None
- eval_strategy: IntervalStrategy.NO
- evaluation_strategy: None
- greater_is_better: False
- head_learning_rate: 0.01
- l2_weight: 0.01
- load_best_model_at_end: False
- loss: CosineSimilarityLoss
- margin: 0.25
- max_length: None
- max_steps: -1
- metric_for_best_model: embedding_loss
- num_epochs: (2, 2)
- num_iterations: 5
- samples_per_label: 2
- sampling_strategy: oversampling
- save_steps: 500
- save_strategy: steps
- save_total_limit: 1
- seed: 42
- use_amp: False
- warmup_proportion: 0.1
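
SetFit's contrastive stage trains the sentence-transformer body with `CosineSimilarityLoss` on sentence pairs derived from the labeled data. A simplified, single-label sketch of how such pairs can be generated (illustrative only; the actual SetFit sampler also handles oversampling and multi-label data):

```python
from itertools import combinations

# Illustrative sketch (not the SetFit internals): CosineSimilarityLoss is
# trained on sentence pairs labeled 1.0 when both sentences share a
# category and 0.0 otherwise.
def make_pairs(samples):
    """samples: list of (sentence, label). Returns (s1, s2, target) triples."""
    pairs = []
    for (s1, l1), (s2, l2) in combinations(samples, 2):
        pairs.append((s1, s2, 1.0 if l1 == l2 else 0.0))
    return pairs

data = [("returns the start time", "Intent"),
        ("see example below", "Example"),
        ("states the purpose", "Intent")]
pairs = make_pairs(data)
print(pairs[2])  # ('see example below', 'states the purpose', 0.0)
```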

### Training Results

| Metric                   | Value      |
| :----------------------- | :--------- |
| **Accuracy**             | 0.5673     |
| **Embedding Loss**       | 0.105      |
| **Training Loss**        | 0.1566     |
| **Training Runtime**     | 161.2121 s |
| **Training Samples/Sec** | 111.654    |
| **Training Steps/Sec**   | 3.498      |

### Framework Versions

- Python: 3.11.9
- SetFit: 1.1.2
- Sentence Transformers: 5.1.2
- Transformers: 4.57.1
- PyTorch: 2.7.1
- Datasets: 3.6.0
- Tokenizers: 0.22.1

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
  title        = {NLBSE'26 Code Comment Classification: Pharo Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
|
models/model_cards/pharo/transformer/README.md
ADDED
@@ -0,0 +1,329 @@
---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- pharo
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Pharo Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Pharo)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.5980
    - type: f1
      name: Micro F1
      value: 0.6720
    - type: precision
      name: Macro Precision
      value: 0.5234
    - type: recall
      name: Macro Recall
      value: 0.7157
    - type: accuracy
      name: Subset Accuracy
      value: 0.5096
---
# Transformer Model (CodeBERT) for Pharo Code Comment Classification

## Model Details

- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Pharo (code comments in English/technical English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0

### Description

This model fine-tunes **CodeBERT** on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe the design intent and responsibilities of classes and methods.

The classifier operates on the `combo` field used in the project (the comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.

### Label Set

For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):

1. `Keyimplementationpoints`
2. `Example`
3. `Responsibilities`
4. `Intent`
5. `Keymessages`
6. `Collaborators`

Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.

---

## Intended Use

The model is intended for:

- research on **code comment and design documentation classification** in Pharo projects,
- mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
- tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).

It is designed for **Pharo code comments** written in English or English-like technical language.

### Out-of-Scope Uses

- Generic text classification outside software engineering and the Pharo/Smalltalk ecosystem.
- Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
- Any safety- or life-critical decision-making context.

---

## Data

### Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo train split
- **Size (train):** ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
- **Label Space:** 6 multi-label categories (`Keyimplementationpoints`, `Example`, `Responsibilities`, `Intent`, `Keymessages`, `Collaborators`)
- **Preprocessing:**
  - Comments extracted from real-world Pharo projects.
  - Each sample represented using the `combo` field: `"<comment_sentence> | <class_context>"` (or a similar contextual string).
  - For this transformer configuration, the training data come from `data/processed/transformer`, where a supersampling procedure is applied to reduce label imbalance.

### Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo test split
- **Size (test):** ~200 comment sentences
- **Evaluation Protocol:** multi-label classification with micro/macro metrics and subset accuracy (exact match), evaluated on the original, non-supersampled test split.

---

## Metrics

### Core Evaluation Metrics (Pharo, test split)

From the training/evaluation run logged in MLflow:

| lan   | cat                     | precision | recall | f1   |
| ----- | ----------------------- | --------- | ------ | ---- |
| pharo | Keyimplementationpoints | 0.47      | 0.68   | 0.56 |
| pharo | Example                 | 0.89      | 0.83   | 0.86 |
| pharo | Responsibilities        | 0.57      | 0.76   | 0.65 |
| pharo | Intent                  | 0.83      | 0.90   | 0.86 |
| pharo | Keymessages             | 0.47      | 0.73   | 0.57 |
| pharo | Collaborators           | 0.33      | 0.57   | 0.42 |

- **Micro F1:** 0.6720
- **Macro F1:** 0.5980
- **Micro Precision:** 0.5964
- **Micro Recall:** 0.7696
- **Macro Precision:** 0.5234
- **Macro Recall:** 0.7157
- **Subset Accuracy (exact match):** 0.5096
- **Micro Accuracy (per-label):** 0.8694
- **Eval Loss (BCE with logits):** 0.5889
- **Train Loss (final epoch):** 0.2149

### Benchmarking Metrics

Average performance over Pharo transformer benchmarking runs:

- **Average Macro F1:** 0.5980
- **Average Precision (macro):** 0.5234
- **Average Recall (macro):** 0.7157
- **Average Runtime:** ~1.35 seconds (benchmark configuration)
- **Average GFLOPs:** ~1943.77

These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall, while macro precision and F1 reflect the dataset's label imbalance and limited size.

---

## Quantitative Analysis

The evaluation is fully multi-label:

- **Micro metrics** reflect overall correctness across all label decisions.
- **Macro metrics** treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., `Collaborators`, `Keymessages`).
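
The micro/macro distinction can be made concrete with a small, self-contained sketch (toy data, not the project's evaluation code):

```python
# Illustrative micro vs. macro F1 on toy multi-label data (2 labels, 4 samples).
# y_true / y_pred are lists of binary label vectors.
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def micro_macro_f1(y_true, y_pred):
    n_labels = len(y_true[0])
    per_label, totals = [], [0, 0, 0]  # tp, fp, fn summed over labels
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        per_label.append(f1(tp, fp, fn))
        totals = [a + b for a, b in zip(totals, (tp, fp, fn))]
    return f1(*totals), sum(per_label) / n_labels

y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 0], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
# The rare second label is missed entirely, so macro F1 (0.4) is far below
# micro F1 (~0.667) -- the same effect visible in the Pharo results above.
```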

A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:

- Stronger performance on `Example` and `Intent` (both F1 0.86), with solid results on `Responsibilities`.
- Weaker, more variable performance on `Keymessages` and `Collaborators`, which have fewer training examples.

---

## Training Details

### Objective and Architecture

- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 6`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training samples to partially correct for label imbalance.
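
The per-label positive weights can be sketched as follows. This assumes the common `pos_weight = negatives / positives` heuristic used with `BCEWithLogitsLoss`; it is illustrative, not the project's exact training code:

```python
# Illustrative computation of per-label positive class weights from binary
# label vectors: pos_weight[j] = (#negatives for label j) / (#positives).
def positive_weights(label_vectors):
    n = len(label_vectors)
    n_labels = len(label_vectors[0])
    weights = []
    for j in range(n_labels):
        pos = sum(v[j] for v in label_vectors)
        weights.append((n - pos) / pos if pos else 1.0)
    return weights

# Toy data: label 0 appears in 3/4 samples, label 1 in only 1/4.
labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
print(positive_weights(labels))  # rare labels get larger weights
```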

### Hyperparameters

- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Prediction threshold:** 0.5 (per label, on sigmoid probabilities)

### Preprocessing and Balancing

- Training data for Pharo are produced by the project's preprocessing module, which:
  - ensures a `combo` text field is present,
  - parses the label strings into binary vectors,
  - applies **supersampling** on the train split only (capped at the maximum original label frequency).
- The test split is not modified and corresponds to the original NLBSE Pharo test data.
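
A minimal sketch of such a supersampling step (illustrative; the project's actual implementation may differ in details such as cap handling and randomization):

```python
import random

# Illustrative supersampling: duplicate samples of under-represented labels
# until each label's count approaches the most frequent label's count (cap).
def supersample(samples, seed=42):
    """samples: list of (text, labels) where labels is a list of label names."""
    rng = random.Random(seed)
    counts = {}
    for _, labs in samples:
        for lab in labs:
            counts[lab] = counts.get(lab, 0) + 1
    cap = max(counts.values())
    out = list(samples)
    for lab, n in counts.items():
        pool = [s for s in samples if lab in s[1]]
        for _ in range(cap - n):
            out.append(rng.choice(pool))
    return out

data = [("a", ["Example"]), ("b", ["Example"]), ("c", ["Intent"])]
balanced = supersample(data)
# "Intent" samples are duplicated until their count matches "Example" (2).
```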

### Hardware / Runtime

The runtime and GFLOPs figures are measured in the project's benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.

---

## How to Use

Install dependencies:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with the actual repository):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/pharo-transformer"  # replace with the actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "Keyimplementationpoints",
    "Example",
    "Responsibilities",
    "Intent",
    "Keymessages",
    "Collaborators",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits)

    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "\"The intent of this class is to manage UI events\" | MyWidget class",
]
print(predict_labels(comments))
```

For consistency with the rest of the project, you can also use the shared `ModelPredictor` wrapper and the same preprocessing normalization applied during training.

---

## Limitations and Biases

- **Limited data:** The Pharo split is relatively small compared to Java/Python; this constrains the model's ability to generalize, especially for minority labels.
- **Imbalanced label distribution:** Despite supersampling and positive weights, some categories remain harder to predict reliably.
- **Sensitivity to perturbations:** Behavioral tests show:
  - deterministic behaviour and stable predictions on duplicate inputs,
  - alignment with several curated golden examples,
  - sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.

---

## Ethical Considerations

- The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
- It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
- Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.

---

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
  title        = {NLBSE'26 Code Comment Classification: Pharo Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE'26 competition data and building on earlier SetFit and Random Forest baselines.
models/model_cards/python/setfit/README.md
ADDED
@@ -0,0 +1,210 @@
---
language:
- en
tags:
- setfit
- sentence-transformers
- text-classification
- generated_from_setfit_trainer
license:
- mit
datasets:
- NLBSE/nlbse26-code-comment-classification
widget:
- text: dataright np^sin 2 np^pi 224 t | Audio
- text: robust way to ask the database for its current transaction state. | AtomicTests
- text: the string marking the beginning of a print statement. | Environment
- text: handled otherwise by a particular method. | StringMethods
- text: table. | PlotAccessor
metrics:
- accuracy
pipeline_tag: text-classification
library_name: setfit
inference: false
base_model: sentence-transformers/paraphrase-MiniLM-L6-v2
model-index:
- name: SetFit with sentence-transformers/paraphrase-MiniLM-L6-v2
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Python)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: accuracy
      value: 0.4482758620689655
      name: Accuracy
---
# SetFit Model for Python Code Comment Classification

## Model Details

- **Model Type:** SetFit (Sentence Transformer fine-tuning)
- **Base Model:** [sentence-transformers/paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
- **Language:** Python (comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 17, 2025
- **Model Version:** 1.0
- **Maximum Sequence Length:** 128 tokens
- **Contact:** For questions or comments about this model, please contact us via GitHub or email.

### Description

This model is a SetFit model trained on the **Python** subset of the **NLBSE Code Comment Classification Dataset**. It classifies code comments into categories that describe the semantic purpose of the comment (e.g., Summary, Usage, Parameters).

The model uses a multi-label classification approach, so a single comment can belong to multiple categories.

## Intended Use

This model was created for the Code Comment Classification task and trained specifically on code comments extracted from Python projects. It is therefore useful for research and development in code comment classification for Python projects, or for software documentation analysis tasks.

### Out-of-Scope Use Cases

General text classification outside the domain of software engineering (e.g., social media sentiment analysis) is out of scope.

## Factors

- **Programming Language:** The model is trained specifically on Python code comments (including inline comments `#` and docstrings `"""`).
- **Comment Types:** The model has been evaluated on the following categories specific to software documentation:
  1. `Summary`
  2. `Usage`
  3. `Parameters`
  4. `Expand`
  5. `DevelopmentNotes`

## Metrics

- **Model Performance Measures:** The primary evaluation metrics are **Precision**, **Recall**, **F1-Score**, and **Accuracy**.
- **Decision Thresholds:** A probability threshold of **0.5** was used for classification.
- **Global Performance:** The model achieves an overall Accuracy of **0.4483** on the test set.

## Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset (Python test split).
- **Motivation:** This dataset was chosen because it is the established benchmark for the NLBSE (Natural Language-Based Software Engineering) workshop.
- **Size:** 290 rows.
- **Preprocessing:** Comments were extracted from real-world open-source Python projects, split into sentences, and manually classified.

## Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset (Python train split).
- **Dataset Stats:**

| Training set | Min | Median  | Max |
| :----------- | :-- | :------ | :-- |
| Word count   | 3   | 15.5217 | 299 |

## Quantitative Analyses

The following table shows the performance breakdown per category on the Python test set:

| Language | Category             | Precision | Recall | F1-Score |
| -------- | -------------------- | --------- | ------ | -------- |
| python   | **Summary**          | 0.6897    | 0.6557 | 0.6723   |
| python   | **Usage**            | 0.6667    | 0.6813 | 0.6739   |
| python   | **Parameters**       | 0.6882    | 0.7529 | 0.7191   |
| python   | **Expand**           | 0.4533    | 0.6667 | 0.5397   |
| python   | **DevelopmentNotes** | 0.2192    | 0.5000 | 0.3048   |
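
Each reported F1 score is the harmonic mean of the precision and recall in the same row, which can be checked directly:

```python
# Verify that per-category F1 = harmonic mean of precision and recall,
# using values copied from the table above.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

rows = {
    "Summary":          (0.6897, 0.6557, 0.6723),
    "Parameters":       (0.6882, 0.7529, 0.7191),
    "DevelopmentNotes": (0.2192, 0.5000, 0.3048),
}
for cat, (p, r, reported_f1) in rows.items():
    assert abs(f1_score(p, r) - reported_f1) < 5e-4, cat
```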

## Ethical Considerations

- **Biases:** The dataset is drawn from open-source software projects. The comments reflect the writing styles and norms of the open-source Python community.
- **Content:** Comments are user-generated content and may contain informal language or jargon.

## Caveats and Recommendations

- **Performance Variation:** The model performs well on structural comments like `Parameters` (F1 0.72) but struggles significantly with `DevelopmentNotes` (F1 0.30). Users should exercise caution when relying on the model for identifying development notes or rationale.
- **Context:** The model relies on text-only comment sentences. Surrounding code context is not included.

## How to Use

First install the SetFit library:

```bash
pip install setfit
```

Then you can load this model and run inference:

```python
from setfit import SetFitModel

# Download from the 🤗 Hugging Face Hub
model = SetFitModel.from_pretrained("se4ai2526-uniba/setfit-python")

# Run inference
preds = model(["# yields the next value | generator.py"])
print(preds)
```
|
| 139 |
+
|
| 140 |
+
## Training Details
|
| 141 |
+
|
| 142 |
+
### Training Hyperparameters
|
| 143 |
+
- batch_size: (32, 32)
|
| 144 |
+
- num_epochs: (2, 2)
|
| 145 |
+
- max_steps: -1
|
| 146 |
+
- sampling_strategy: oversampling
|
| 147 |
+
- num_iterations: 5
|
| 148 |
+
- body_learning_rate: (2e-05, 1e-05)
|
| 149 |
+
- head_learning_rate: 0.01
|
| 150 |
+
- loss: CosineSimilarityLoss
|
| 151 |
+
- distance_metric: cosine_distance
|
| 152 |
+
- margin: 0.25
|
| 153 |
+
- end_to_end: False
|
| 154 |
+
- use_amp: False
|
| 155 |
+
- warmup_proportion: 0.1
|
| 156 |
+
- l2_weight: 0.01
|
| 157 |
+
- seed: 42
|
| 158 |
+
- eval_max_steps: -1
|
| 159 |
+
- load_best_model_at_end: False
|
| 160 |
+
|
| 161 |
+
### Training Results
|
| 162 |
+
|
| 163 |
+
| Metric | Value |
|
| 164 |
+
|:-------|:------|
|
| 165 |
+
| **Accuracy** | 0.4482758620689655 |
|
| 166 |
+
| **Embedding Loss** | 0.177 |
|
| 167 |
+
| **Training Loss** | 0.215 |
|
| 168 |
+
| **Training Runtime** | 137.40 s |
|
| 169 |
+
| **Training Samples/Sec** | 198.189 |
|
| 170 |
+
| **Training Steps/Sec** | 6.213 |
|
| 171 |
+
|
| 172 |
+
### Framework Versions
|
| 173 |
+
- Python: 3.11.9
|
| 174 |
+
- SetFit: 1.1.2
|
| 175 |
+
- Sentence Transformers: 5.1.2
|
| 176 |
+
- Transformers: 4.57.1
|
| 177 |
+
- PyTorch: 2.7.1
|
| 178 |
+
- Datasets: 3.6.0
|
| 179 |
+
- Tokenizers: 0.22.1
|
| 180 |
+
|
| 181 |
+
## Citation
|
| 182 |
+
|
| 183 |
+
If you use this model in academic work or derived systems, please cite:
|
| 184 |
+
|
| 185 |
+
> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.
|
| 186 |
+
|
| 187 |
+
BibTeX:
|
| 188 |
+
|
| 189 |
+
```bibtex
|
| 190 |
+
@misc{theclouds_nlbse26_code_comment_classification_python,
|
| 191 |
+
title = {NLBSE'26 Code Comment Classification: Python Model},
|
| 192 |
+
author = {TheClouds Team},
|
| 193 |
+
year = {2025},
|
| 194 |
+
note = {Model available on Hugging Face},
|
| 195 |
+
howpublished = {\url{To be published}}
|
| 196 |
+
}
|
| 197 |
+
```
|
| 198 |
+
|
| 199 |
+
Contact:
|
| 200 |
+
|
| 201 |
+
For questions, feedback, or collaboration requests related to this model, please contact:
|
| 202 |
+
> Giacomo Signorile: g.signorile14@studenti.uniba.it
|
| 203 |
+
> Davide Pio Posa: d.posa3@studenti.uniba.it
|
| 204 |
+
> Marco Lillo: m.lillo21@studenti.uniba.it
|
| 205 |
+
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
|
| 206 |
+
> Adriano Gentile: a.gentile97@studenti.uniba.com
|
| 207 |
+
|
| 208 |
+
Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
|
| 209 |
+
|
| 210 |
+
```
|
models/model_cards/python/transformer/README.md
ADDED
|
@@ -0,0 +1,322 @@
---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- python
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Python Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Python)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.6385
    - type: f1
      name: Micro F1
      value: 0.6781
    - type: precision
      name: Macro Precision
      value: 0.5900
    - type: recall
      name: Macro Recall
      value: 0.7061
    - type: accuracy
      name: Subset Accuracy
      value: 0.5690
---

# Transformer Model (CodeBERT) for Python Code Comment Classification

## Model Details

- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Python (code comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0

### Description

This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment.

The classifier operates on the project’s `combo` field (concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector.

### Label Set

For Python, the model predicts the following 5 categories (fixed order in the classifier head):

1. `Usage`
2. `Parameters`
3. `DevelopmentNotes`
4. `Expand`
5. `Summary`

Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
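That mapping from logits to a binary label vector can be sketched in plain Python; the logit values below are invented for illustration.

```python
import math

LABELS = ["Usage", "Parameters", "DevelopmentNotes", "Expand", "Summary"]

def sigmoid(x: float) -> float:
    # Standard logistic function, applied independently to each logit.
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw logits for one comment sentence.
logits = [2.1, -0.4, -3.0, 0.2, 1.5]

# Per-label decision: probability above 0.5 <=> logit above 0.
binary = [1 if sigmoid(z) > 0.5 else 0 for z in logits]

print(binary)  # -> [1, 0, 0, 1, 1]
```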
---

## Intended Use

The model is intended for:

- research on **code comment classification** in Python projects,
- mining and analysis of Python documentation comments,
- tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support).

It is designed for **Python code comments** in English or English-like technical language.

### Out-of-Scope Uses

- Generic natural language classification outside software engineering.
- Non-English comments without additional fine-tuning or adaptation.
- Use in safety- or life-critical decision making.

---

## Data

### Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Python train split
- **Size (train):** ~1.4k original training examples (with optional supersampled expansion to ~2k examples, depending on the configuration)
- **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`)
- **Preprocessing:**
  - Comments extracted from open-source Python projects.
  - Each instance represented via the `combo` field: `"<comment_sentence> | <class_context>"`.
  - The project’s preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`. The metrics reported here correspond to the current transformer configuration logged in MLflow for Python.

### Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Python test split
- **Size (test):** ~300 comment sentences
- **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match).

---

## Metrics

### Core Evaluation Metrics (Python, test split)

From the training/evaluation run logged in MLflow:

| Language | Category | Precision | Recall | F1-Score |
|--------|-----------------|-----------|---------|---------|
| python | Usage | 0.80 | 0.76 | 0.78 |
| python | Parameters | 0.74 | 0.86 | 0.79 |
| python | DevelopmentNotes | 0.41 | 0.50 | 0.45 |
| python | Expand | 0.49 | 0.67 | 0.57 |
| python | Summary | 0.63 | 0.82 | 0.71 |

- **Micro F1:** 0.6781
- **Macro F1:** 0.6385
- **Micro Precision:** 0.6230
- **Micro Recall:** 0.7438
- **Macro Precision:** 0.5900
- **Macro Recall:** 0.7061
- **Subset Accuracy (exact match):** 0.5690
- **Micro Accuracy (per-label):** 0.8441
- **Eval Loss (BCE with logits):** 0.6727
- **Train Loss (final epoch):** 0.2937

### Benchmarking Metrics

Average performance for the Python transformer benchmark:

- **Average Macro F1:** 0.6385
- **Average Precision (macro):** 0.5900
- **Average Recall (macro):** 0.7061
- **Average Runtime:** ~0.94 seconds (benchmark configuration)
- **Average GFLOPs:** ~1823.25

These results indicate that the transformer captures useful patterns across all five Python comment categories, with stronger performance on frequent labels and reasonable performance on less frequent ones.

---

## Quantitative Analysis

The model is evaluated in a multi-label setting:

- **Micro metrics** emphasize the overall correctness across all label decisions.
- **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`).

Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable.
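The distinction between micro F1, macro F1, and subset accuracy can be illustrated with a tiny hand-computed example; the label matrices below are invented toy data, not model outputs.

```python
# Toy multi-label ground truth and predictions: 2 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

n_labels = len(y_true[0])

def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Macro F1: compute F1 per label, then average across labels.
per_label = []
for j in range(n_labels):
    tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
    fp = sum(p[j] and not t[j] for t, p in zip(y_true, y_pred))
    fn = sum(t[j] and not p[j] for t, p in zip(y_true, y_pred))
    per_label.append(f1(tp, fp, fn))
macro_f1 = sum(per_label) / n_labels

# Micro F1: pool all label decisions before computing F1.
TP = sum(t[j] and p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
FP = sum(p[j] and not t[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
FN = sum(t[j] and not p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
micro_f1 = f1(TP, FP, FN)

# Subset accuracy: the whole label vector must match exactly.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(macro_f1, 4), round(micro_f1, 4), subset_acc)  # -> 0.6667 0.6667 0.0
```

Note how one wrong label in each row drives subset accuracy to zero while both F1 variants stay around 0.67.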
---

## Training Details

### Objective and Architecture

- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 5`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance.
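One plausible way to derive such per-label positive weights and per-example sampling weights from a multi-hot label matrix is sketched below in pure Python; the toy matrix and the weighting scheme are illustrative assumptions, and the project's actual pipeline may compute them differently.

```python
# Toy multi-hot label matrix: 4 training examples, 2 labels.
labels = [
    [1, 0],
    [1, 1],
    [1, 0],
    [0, 1],
]
n = len(labels)

# pos_weight[j] = (#negatives / #positives) for label j: the usual way to
# scale the positive term of BCEWithLogitsLoss for rare labels.
pos_counts = [sum(row[j] for row in labels) for j in range(len(labels[0]))]
pos_weight = [(n - c) / c for c in pos_counts]

# One simple per-example sampling weight: the largest inverse frequency
# among the example's positive labels, so that rare-label examples are
# drawn more often by a weighted sampler.
inv_freq = [n / c for c in pos_counts]
sample_weights = [
    max(inv_freq[j] for j, v in enumerate(row) if v) for row in labels
]

print(pos_weight)      # -> [0.3333333333333333, 1.0]
print(sample_weights)  # -> [1.3333333333333333, 2.0, 1.3333333333333333, 2.0]
```

In a PyTorch pipeline, `pos_weight` would be passed to `BCEWithLogitsLoss` and `sample_weights` to a `WeightedRandomSampler`.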

### Hyperparameters

- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** Linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Threshold for prediction:** 0.5 (per-label on sigmoid probabilities)

### Preprocessing and Balancing

- Training uses the **Python** split prepared by the project’s preprocessing pipeline.
- Optional supersampling (oversampling of underrepresented labels with a cap at the maximum label frequency) is available and can be enabled to improve macro performance.
- The test split remains unchanged and corresponds to the original NLBSE Python test partition.
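The capped supersampling idea can be sketched as follows; the examples are toy data with a simplified single-label view, and the project's actual implementation may differ.

```python
import random

# Toy single-label view of the problem: (example id, label) pairs.
examples = [
    ("s1", "Summary"), ("s2", "Summary"), ("s3", "Summary"), ("s4", "Summary"),
    ("u1", "Usage"), ("u2", "Usage"),
    ("d1", "DevelopmentNotes"),
]

random.seed(0)  # deterministic duplication for the sketch

# Count examples per label; the cap is the majority label's frequency.
counts = {}
for _, label in examples:
    counts[label] = counts.get(label, 0) + 1
cap = max(counts.values())

# Duplicate random members of each minority pool until every label
# reaches the cap (the majority label is left untouched).
balanced = list(examples)
for label, count in counts.items():
    pool = [ex for ex in examples if ex[1] == label]
    balanced.extend(random.choice(pool) for _ in range(cap - count))

final_counts = {}
for _, label in balanced:
    final_counts[label] = final_counts.get(label, 0) + 1

print(final_counts)  # every label now appears `cap` (= 4) times
```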

### Hardware / Runtime

The reported runtime and GFLOPs are based on the project’s benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size.

---

## How to Use

Install `transformers` and `torch`:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with your repository name):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/python-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "Usage",
    "Parameters",
    "DevelopmentNotes",
    "Expand",
    "Summary",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits)

    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "# Usage: call this function with a file path | module.py",
]
print(predict_labels(comments))
```

For full reproducibility consistent with the project, use the `ModelPredictor` wrapper and the same preprocessing used during training.

---

## Limitations and Biases

- **Domain-limited:** Trained only on Python code comments from open-source repositories.
- **Imbalanced labels:** Some categories are relatively underrepresented; performance on these labels can lag behind frequent ones.
- **Robustness:** Behavioral tests show that the current model:
  - is deterministic and stable on duplicate inputs,
  - aligns with several curated golden examples,
  - remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced.

---

## Ethical Considerations

- The model reflects the style and biases of the open-source Python projects it was trained on.
- It does not filter offensive or inappropriate content in comments; it only predicts semantic categories.
- Outputs should be treated as assistive signals, not as authoritative judgements.

---

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_python,
  title        = {NLBSE'26 Code Comment Classification: Python Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and earlier SetFit and Random Forest baselines.