---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- pharo
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Pharo Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Pharo)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.5980
    - type: f1
      name: Micro F1
      value: 0.6720
    - type: precision
      name: Macro Precision
      value: 0.5234
    - type: recall
      name: Macro Recall
      value: 0.7157
    - type: accuracy
      name: Subset Accuracy
      value: 0.5096
---
# Transformer Model (CodeBERT) for Pharo Code Comment Classification
## Model Details
- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Pharo (code comments in English/technical English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0
### Description
This model fine-tunes **CodeBERT** on the **Pharo** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe design intent and responsibilities of classes and methods.
The classifier operates on the `combo` field used in the project (comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.
### Label Set
For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):
1. `Keyimplementationpoints`
2. `Example`
3. `Responsibilities`
4. `Intent`
5. `Keymessages`
6. `Collaborators`
Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
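As a toy illustration of this decision rule (the probabilities below are made up, not real model outputs):
```python
# Hypothetical sigmoid outputs for one sentence, in the label order above
probs = [0.12, 0.81, 0.64, 0.33, 0.09, 0.55]
# Threshold at 0.5 to obtain the binary label vector
preds = [int(p > 0.5) for p in probs]  # -> [0, 1, 1, 0, 0, 1]
```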
---
## Intended Use
The model is intended for:
- research on **code comment and design documentation classification** in Pharo projects,
- mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
- tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).
It is designed for **Pharo code comments** written in English or English-like technical language.
### Out-of-Scope Uses
- Generic text classification outside software engineering and Pharo/Smalltalk ecosystems.
- Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
- Any safety- or life-critical decision-making context.
---
## Data
### Training Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo train split
- **Size (train):** ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
- **Label Space:** 6 multi-label categories (`Keyimplementationpoints`, `Example`, `Responsibilities`, `Intent`, `Keymessages`, `Collaborators`)
- **Preprocessing:**
- Comments extracted from real-world Pharo projects.
- Each sample is represented using the `combo` field: `"<comment_sentence> | <class_context>"` (or a similar contextual string; see the sketch after this list).
- For this transformer configuration, the training data come from `data/processed/transformer`, where a supersampling procedure is applied to reduce label imbalance.
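A minimal sketch of how such a `combo` string could be assembled; the field names below are illustrative assumptions, not the dataset's exact schema:
```python
# Hypothetical record; field names are assumptions for illustration
record = {
    "comment_sentence": "The intent of this class is to manage UI events",
    "class_context": "MyWidget class",
}
combo = f"{record['comment_sentence']} | {record['class_context']}"
# -> "The intent of this class is to manage UI events | MyWidget class"
```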
### Evaluation Data
- **Dataset:** NLBSE Code Comment Classification Dataset – Pharo test split
- **Size (test):** ~200 comment sentences
- **Evaluation Protocol:** multi-label classification with micro/macro metrics and subset accuracy (exact-match), evaluated on the original, non-supersampled test split.
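For illustration, a minimal sketch of this protocol using scikit-learn (the arrays below are toy values, not the actual predictions):
```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy binary label matrices of shape (n_samples, 6); values are hypothetical
y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0, 0]])

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Subset accuracy: a sample counts only if all 6 label decisions match exactly
subset_accuracy = accuracy_score(y_true, y_pred)
```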
---
## Metrics
### Core Evaluation Metrics (Pharo, test split)
From the training/evaluation run logged in MLflow:
| Language | Category                | Precision | Recall | F1   |
|----------|-------------------------|-----------|--------|------|
| pharo    | Keyimplementationpoints | 0.47      | 0.68   | 0.56 |
| pharo    | Example                 | 0.89      | 0.83   | 0.86 |
| pharo    | Responsibilities        | 0.57      | 0.76   | 0.65 |
| pharo    | Intent                  | 0.83      | 0.90   | 0.86 |
| pharo    | Keymessages             | 0.47      | 0.73   | 0.57 |
| pharo    | Collaborators           | 0.33      | 0.57   | 0.42 |
- **Micro F1:** 0.6720
- **Macro F1:** 0.5980
- **Micro Precision:** 0.5964
- **Micro Recall:** 0.7696
- **Macro Precision:** 0.5234
- **Macro Recall:** 0.7157
- **Subset Accuracy (exact match):** 0.5096
- **Micro Accuracy (per-label):** 0.8694
- **Eval Loss (BCE with logits):** 0.5889
- **Train Loss (final epoch):** 0.2149
### Benchmarking Metrics
Average performance over Pharo transformer benchmarking runs:
- **Average Macro F1:** 0.5980
- **Average Precision (macro):** 0.5234
- **Average Recall (macro):** 0.7157
- **Average Runtime:** ~1.35 seconds (benchmark configuration)
- **Average GFLOPs:** ~1943.77
These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall; the lower macro precision and F1 reflect the dataset’s label imbalance and limited size.
---
## Quantitative Analysis
The evaluation is fully multi-label:
- **Micro metrics** reflect overall correctness across all label decisions.
- **Macro metrics** treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., `Collaborators`, `Keymessages`).
A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:
- Stronger performance on `Example` and `Intent` (F1 ≈ 0.86 for both).
- Weaker performance on `Keyimplementationpoints`, `Keymessages`, and especially `Collaborators`, which have fewer training examples.
---
## Training Details
### Objective and Architecture
- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 6`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training samples to partially correct for label imbalance.
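A minimal sketch of this loss and sampler setup, assuming a binary label matrix `y_train`; the per-sample weighting scheme below is an illustrative choice, not necessarily the project's exact one:
```python
import torch
from torch.utils.data import WeightedRandomSampler

# y_train: (n_samples, 6) binary label matrix; random values here for illustration
y_train = torch.randint(0, 2, (900, 6)).float()

# Per-label positive class weight: negatives / positives for each label
pos_counts = y_train.sum(dim=0).clamp(min=1.0)
pos_weight = (y_train.shape[0] - pos_counts) / pos_counts
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Per-sample weights: samples carrying rarer labels are drawn more often
label_rarity = 1.0 / pos_counts
sample_weights = (y_train * label_rarity).sum(dim=1).clamp(min=label_rarity.min().item())
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)
```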
### Hyperparameters
- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** Linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Prediction threshold:** 0.5 (per-label on sigmoid probabilities)
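Under these hyperparameters, the optimizer and scheduler setup looks roughly like the following sketch (`steps_per_epoch` is a placeholder and `model` is the loaded classifier):
```python
import torch
from transformers import get_linear_schedule_with_warmup

num_epochs = 5
steps_per_epoch = 120  # placeholder: len(train_dataloader) in practice
total_steps = num_epochs * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup ratio 0.1
    num_training_steps=total_steps,
)
```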
### Preprocessing and Balancing
- Training data for Pharo are produced by the project’s preprocessing module, which:
- ensures a `combo` text field is present,
- parses the label strings into binary vectors,
- applies **supersampling** on the train split only (up to a cap at the maximum original label frequency).
- The test split is not modified and corresponds to the original NLBSE Pharo test data.
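A sketch of the supersampling idea (duplicating samples that carry under-represented labels until each label's count approaches the original maximum); this illustrates the described behaviour, not the preprocessing module's exact code:
```python
import random

def supersample(samples, labels, num_labels=6, seed=42):
    """Duplicate samples with rare labels up to the max original label count."""
    rng = random.Random(seed)
    counts = [sum(y[i] for y in labels) for i in range(num_labels)]
    cap = max(counts)
    out_samples, out_labels = list(samples), list(labels)
    for i in range(num_labels):
        pool = [(x, y) for x, y in zip(samples, labels) if y[i] == 1]
        while pool and counts[i] < cap:
            x, y = rng.choice(pool)
            out_samples.append(x)
            out_labels.append(y)
            # A duplicated multi-label sample also raises its other label counts
            for j in range(num_labels):
                counts[j] += y[j]
    return out_samples, out_labels
```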
### Hardware / Runtime
The runtime and GFLOPs figures are measured in the project’s benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.
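A rough way to reproduce a latency figure on your own hardware, assuming `model` and `tokenizer` are loaded as in the usage example below (the batch contents are placeholders):
```python
import time
import torch

batch = tokenizer(
    ["Some comment sentence | SomeClass"] * 16,  # placeholder inputs
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
with torch.no_grad():
    model(**batch)  # warm-up pass
    start = time.perf_counter()
    model(**batch)
    elapsed = time.perf_counter() - start
print(f"Batch latency: {elapsed:.3f}s")
```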
---
## How to Use
Install dependencies:
```bash
pip install transformers torch
```
Then load the model and tokenizer (replace the model ID with the actual repository):
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/pharo-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Fixed label order used by the classification head
LABELS = [
    "Keyimplementationpoints",
    "Example",
    "Responsibilities",
    "Intent",
    "Keymessages",
    "Collaborators",
]

def predict_labels(texts, threshold: float = 0.5):
    """Return the predicted label names for one or more comment sentences."""
    if isinstance(texts, str):
        texts = [texts]
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Sigmoid + per-label threshold, as described above
    probs = torch.sigmoid(logits)
    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example: comment sentence plus class context, joined with " | "
comments = [
    "\"The intent of this class is to manage UI events\" | MyWidget class",
]
print(predict_labels(comments))
```
For consistency with the rest of the project, you can also use the shared `ModelPredictor` wrapper and the same preprocessing normalization applied during training.
---
## Limitations and Biases
* **Limited data:** The Pharo split is relatively small compared to Java/Python; this constrains the model’s ability to generalize, especially for minority labels.
* **Imbalanced label distribution:** Despite supersampling and positive weights, some categories remain harder to predict reliably.
* **Sensitivity to perturbations:** Behavioral tests show:
* deterministic behaviour and stable predictions on duplicate inputs,
* alignment with several curated golden examples,
* sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.
---
## Ethical Considerations
* The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
* It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
* Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.
---
## Citation
If you use this model in academic work or derived systems, please cite:
> TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.
BibTeX:
```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
title = {NLBSE'26 Code Comment Classification: Pharo Model},
author = {TheClouds Team},
year = {2025},
note = {Model available on Hugging Face},
howpublished = {\url{To be published}}
}
```
Contact:
For questions, feedback, or collaboration requests related to this model, please contact:
> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it
Issue tracker: https://github.com/se4ai2526-uniba/TheClouds
## Acknowledgements
This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE’26 competition data and building on earlier SetFit and Random Forest baselines.