---
language:
  - en
tags:
  - text-classification
  - code-comment-classification
  - transformers
  - codebert
  - pharo
  - software-engineering
  - multi-label
license: mit
datasets:
  - NLBSE/nlbse26-code-comment-classification
metrics:
  - f1
  - precision
  - recall
  - subset_accuracy
  - runtime
  - gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
  - name: CodeBERT Transformer for Pharo Code Comment Classification
    results:
      - task:
          type: text-classification
          name: Multi-label Text Classification
        dataset:
          name: NLBSE Code Comment Classification Dataset (Pharo)
          type: NLBSE/nlbse26-code-comment-classification
          split: test
        metrics:
          - type: f1
            name: Macro F1
            value: 0.598
          - type: f1
            name: Micro F1
            value: 0.672
          - type: precision
            name: Macro Precision
            value: 0.5234
          - type: recall
            name: Macro Recall
            value: 0.7157
          - type: accuracy
            name: Subset Accuracy
            value: 0.5096
---

# Transformer Model (CodeBERT) for Pharo Code Comment Classification

## Model Details

  • Model Type: Transformer-based multi-label classifier (sequence classification head)
  • Base Model: microsoft/codebert-base
  • Language: Pharo (code comments in English/technical English)
  • License: MIT
  • Developed by: TheClouds
  • Model Date: November 2025
  • Model Version: 1.0

## Description

This model fine-tunes CodeBERT on the Pharo subset of the NLBSE Code Comment Classification Dataset for multi-label classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe design intent and responsibilities of classes and methods.

The classifier operates on the combo field used in the project (comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.

## Label Set

For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):

  1. Keyimplementationpoints
  2. Example
  3. Responsibilities
  4. Intent
  5. Keymessages
  6. Collaborators

Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
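The decision rule can be sketched in plain Python (the logit values below are purely illustrative; the real model produces them from CodeBERT):

```python
import math

LABELS = ["Keyimplementationpoints", "Example", "Responsibilities",
          "Intent", "Keymessages", "Collaborators"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logits_to_labels(logits, threshold: float = 0.5):
    """Map raw logits to a 0/1 vector and the names of the active labels."""
    probs = [sigmoid(z) for z in logits]
    vector = [1 if p > threshold else 0 for p in probs]
    names = [LABELS[i] for i, v in enumerate(vector) if v]
    return vector, names

# Illustrative logits for one comment sentence
vec, names = logits_to_labels([-2.0, 1.3, 0.2, -0.5, 2.1, -3.0])
```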


## Intended Use

The model is intended for:

  • research on code comment and design documentation classification in Pharo projects,
  • mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
  • tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).

It is designed for Pharo code comments written in English or English-like technical language.

### Out-of-Scope Uses

  • Generic text classification outside software engineering and Pharo/Smalltalk ecosystems.
  • Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
  • Any safety- or life-critical decision-making context.

## Data

### Training Data

  • Dataset: NLBSE Code Comment Classification Dataset – Pharo train split
  • Size (train): ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
  • Label Space: 6 multi-label categories (Keyimplementationpoints, Example, Responsibilities, Intent, Keymessages, Collaborators)
  • Preprocessing:
    • Comments extracted from real-world Pharo projects.
    • Each sample represented using the combo field: "<comment_sentence> | <class_context>" (or similar contextual string).
    • For this transformer configuration, the training data come from data/processed/transformer, where a supersampling procedure is applied to reduce label imbalance.
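Assembling the combo string can be sketched as follows (the separator follows the `"<comment_sentence> | <class_context>"` format quoted above; the helper name is hypothetical):

```python
def make_combo(comment_sentence: str, class_context: str) -> str:
    """Join a comment sentence with its class context using the ' | '
    separator of the combo field. Helper name is hypothetical."""
    return f"{comment_sentence} | {class_context}"

combo = make_combo(
    "The intent of this class is to manage UI events",
    "MyWidget class",
)
```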

### Evaluation Data

  • Dataset: NLBSE Code Comment Classification Dataset – Pharo test split
  • Size (test): ~200 comment sentences
  • Evaluation Protocol: multi-label classification with micro/macro metrics and subset accuracy (exact-match), evaluated on the original, non-supersampled test split.

## Metrics

### Core Evaluation Metrics (Pharo, test split)

From the training/evaluation run logged in MLflow:

| language | category | precision | recall | f1 |
|----------|----------|-----------|--------|-----|
| pharo | Keyimplementationpoints | 0.47 | 0.68 | 0.56 |
| pharo | Example | 0.89 | 0.83 | 0.86 |
| pharo | Responsibilities | 0.57 | 0.76 | 0.65 |
| pharo | Intent | 0.83 | 0.90 | 0.86 |
| pharo | Keymessages | 0.47 | 0.73 | 0.57 |
| pharo | Collaborators | 0.33 | 0.57 | 0.42 |

  • Micro F1: 0.6720
  • Macro F1: 0.5980
  • Micro Precision: 0.5964
  • Micro Recall: 0.7696
  • Macro Precision: 0.5234
  • Macro Recall: 0.7157
  • Subset Accuracy (exact match): 0.5096
  • Micro Accuracy (per-label): 0.8694
  • Eval Loss (BCE with logits): 0.5889
  • Train Loss (final epoch): 0.2149

### Benchmarking Metrics

Average performance over Pharo transformer benchmarking runs:

  • Average Macro F1: 0.5980
  • Average Precision (macro): 0.5234
  • Average Recall (macro): 0.7157
  • Average Runtime: ~1.35 seconds (benchmark configuration)
  • Average GFLOPs: ~1943.77

These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall; the lower macro precision and F1 reflect the dataset’s label imbalance and limited size.


## Quantitative Analysis

The evaluation is fully multi-label:

  • Micro metrics reflect overall correctness across all label decisions.
  • Macro metrics treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., Collaborators, Keymessages).
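The distinction between the three aggregate scores can be made concrete with a small self-contained sketch (toy two-label data for illustration; the real evaluation uses the six Pharo labels):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def multilabel_scores(y_true, y_pred):
    """Micro F1 pools TP/FP/FN over all label decisions; macro F1 averages
    per-label F1; subset accuracy requires the whole vector to match exactly."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per-label [tp, fp, fn]
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    macro = sum(f1(*c) for c in counts) / n_labels
    subset = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return micro, macro, subset

# Toy example: the second prediction misses one positive label.
micro, macro, subset = multilabel_scores([[1, 0], [1, 1]], [[1, 0], [1, 0]])
```

Note how a single missed label drags macro F1 down much more than micro F1, which is exactly the pattern visible for the rarer Pharo categories.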

A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:

  • Stronger performance on Example and Intent (per-label F1 ≈ 0.86).
  • Weaker, more variable performance on Keyimplementationpoints, Keymessages, and Collaborators, which have fewer training examples and lower precision.

## Training Details

### Objective and Architecture

  • Base model: microsoft/codebert-base
  • Head: linear classification head with num_labels = 6
  • Problem type: multi_label_classification
  • Loss function: BCEWithLogitsLoss with per-label positive class weights computed from training label frequencies.
  • Sampling: WeightedRandomSampler over training samples to partially correct for label imbalance.
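As a sketch, per-label positive weights can be derived from training label frequencies as the negative-to-positive ratio, the usual form passed to `BCEWithLogitsLoss(pos_weight=...)`; the exact formula used in the project is an assumption:

```python
def positive_class_weights(label_matrix):
    """pos_weight[j] = (#negatives / #positives) for label j -- a common
    choice for BCEWithLogitsLoss; rare labels get weights above 1 so their
    positive examples contribute more to the loss."""
    n = len(label_matrix)
    weights = []
    for j in range(len(label_matrix[0])):
        pos = sum(row[j] for row in label_matrix)
        weights.append((n - pos) / pos if pos else 1.0)
    return weights

# Toy label matrix: label 0 is frequent (3/4), label 1 is rare (1/4).
w = positive_class_weights([[1, 0], [1, 1], [1, 0], [0, 0]])
```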

### Hyperparameters

  • Max sequence length: 128
  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Scheduler: Linear warmup and decay
  • Warmup ratio: 0.1
  • Number of epochs: 5
  • Prediction threshold: 0.5 (per-label on sigmoid probabilities)
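The learning-rate schedule implied by these settings can be sketched as follows (mirrors the behaviour of `get_linear_schedule_with_warmup` in transformers; the project's exact step accounting may differ):

```python
def linear_warmup_decay_lr(step, total_steps, warmup_ratio=0.1, base_lr=2e-5):
    """LR after `step` optimizer steps: linear warmup over the first 10%
    of steps, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```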

### Preprocessing and Balancing

  • Training data for Pharo are produced by the project’s preprocessing module, which:
    • ensures a combo text field is present,
    • parses the label strings into binary vectors,
    • applies supersampling on the train split only (up to a cap at the maximum original label frequency).
  • The test split is not modified and corresponds to the original NLBSE Pharo test data.
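A rough sketch of the supersampling idea: duplicate samples carrying under-represented labels until each label's count reaches the original maximum label frequency (the cap). The project's exact procedure and randomization may differ.

```python
import random

def supersample_to_cap(samples, seed=0):
    """Balance a multi-label train split by duplicating minority-label
    samples until every label reaches the original max label frequency."""
    rng = random.Random(seed)
    n_labels = len(samples[0]["labels"])
    counts = [sum(s["labels"][j] for s in samples) for j in range(n_labels)]
    cap = max(counts)
    out = list(samples)
    for j in range(n_labels):
        pool = [s for s in samples if s["labels"][j] == 1]
        while pool and counts[j] < cap:
            extra = rng.choice(pool)
            out.append(extra)
            # duplicating a sample increments every label it carries
            for k, v in enumerate(extra["labels"]):
                counts[k] += v
    return out

# Toy split: label 0 appears 3 times, label 1 only once (cap = 3).
balanced = supersample_to_cap([
    {"labels": [1, 0]}, {"labels": [1, 0]}, {"labels": [1, 0]}, {"labels": [0, 1]},
])
```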

### Hardware / Runtime

The runtime and GFLOPs figures are measured in the project’s benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.


## How to Use

Install dependencies:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with the actual repository):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/pharo-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "Keyimplementationpoints",
    "Example",
    "Responsibilities",
    "Intent",
    "Keymessages",
    "Collaborators",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits)

    # Per-label threshold on sigmoid probabilities (0.5 by default)
    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "\"The intent of this class is to manage UI events\" | MyWidget class",
]
print(predict_labels(comments))
```

For consistency with the rest of the project, you can also use the shared ModelPredictor wrapper and the same preprocessing normalization applied during training.


## Limitations and Biases

  • Limited data: The Pharo split is relatively small compared to Java/Python; this constrains the model’s ability to generalize, especially for minority labels.

  • Imbalanced label distribution: Despite supersampling and positive weights, some categories remain harder to predict reliably.

  • Sensitivity to perturbations: Behavioral tests show:

    • deterministic behaviour and stable predictions on duplicate inputs,
    • alignment with several curated golden examples,
    • sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.

## Ethical Considerations

  • The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
  • It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
  • Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.

## Citation

If you use this model in academic work or derived systems, please cite:

TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
  title        = {NLBSE'26 Code Comment Classification: Pharo Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

## Contact

For questions, feedback, or collaboration requests related to this model, please contact:

  • Giacomo Signorile: g.signorile14@studenti.uniba.it
  • Davide Pio Posa: d.posa3@studenti.uniba.it
  • Marco Lillo: m.lillo21@studenti.uniba.it
  • Rebecca Margiotta: m.margiotta5@studenti.uniba.it
  • Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE’26 competition data and building on earlier SetFit and Random Forest baselines.