---
language:
  - en
tags:
  - text-classification
  - code-comment-classification
  - transformers
  - codebert
  - pharo
  - software-engineering
  - multi-label
license: mit
datasets:
  - NLBSE/nlbse26-code-comment-classification
metrics:
  - f1
  - precision
  - recall
  - subset_accuracy
  - runtime
  - gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
  - name: CodeBERT Transformer for Pharo Code Comment Classification
    results:
      - task:
          type: text-classification
          name: Multi-label Text Classification
        dataset:
          name: NLBSE Code Comment Classification Dataset (Pharo)
          type: NLBSE/nlbse26-code-comment-classification
          split: test
        metrics:
          - type: f1
            name: Macro F1
            value: 0.598
          - type: f1
            name: Micro F1
            value: 0.672
          - type: precision
            name: Macro Precision
            value: 0.5234
          - type: recall
            name: Macro Recall
            value: 0.7157
          - type: accuracy
            name: Subset Accuracy
            value: 0.5096
---

# Transformer Model (CodeBERT) for Pharo Code Comment Classification

## Model Details

  • Model Type: Transformer-based multi-label classifier (sequence classification head)
  • Base Model: microsoft/codebert-base
  • Language: Pharo (code comments in English/technical English)
  • License: MIT
  • Developed by: TheClouds
  • Model Date: November 2025
  • Model Version: 1.0

## Description

This model fine-tunes CodeBERT on the Pharo subset of the NLBSE Code Comment Classification Dataset for multi-label classification. Each Pharo code comment sentence is mapped to one or more semantic categories that describe design intent and responsibilities of classes and methods.

The classifier operates on the combo field used in the project (comment sentence plus a compact context string) and produces a 6-dimensional binary label vector.

## Label Set

For Pharo, the model predicts the following 6 categories (fixed order in the classifier head):

  1. Keyimplementationpoints
  2. Example
  3. Responsibilities
  4. Intent
  5. Keymessages
  6. Collaborators

Each prediction is a length-6 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.
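The decision rule can be sketched in plain Python (the logit values below are purely illustrative; the real model produces them from CodeBERT):

```python
import math

LABELS = ["Keyimplementationpoints", "Example", "Responsibilities",
          "Intent", "Keymessages", "Collaborators"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logits_to_labels(logits, threshold: float = 0.5):
    """Map raw logits to a 0/1 vector and the names of the active labels."""
    probs = [sigmoid(z) for z in logits]
    vector = [1 if p > threshold else 0 for p in probs]
    names = [LABELS[i] for i, v in enumerate(vector) if v]
    return vector, names

# Illustrative logits for one comment sentence
vec, names = logits_to_labels([-2.0, 1.3, 0.2, -0.5, 2.1, -3.0])
```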


## Intended Use

The model is intended for:

  • research on code comment and design documentation classification in Pharo projects,
  • mining and analysis of Pharo method/class comments to extract design intent and responsibilities,
  • tools that need semantic tags for Smalltalk/Pharo comments (e.g., navigation, documentation quality checks, design overview tools).

It is designed for Pharo code comments written in English or English-like technical language.

### Out-of-Scope Uses

  • Generic text classification outside software engineering and Pharo/Smalltalk ecosystems.
  • Non-English comments, or comments from unrelated programming languages, without additional fine-tuning.
  • Any safety- or life-critical decision-making context.

## Data

### Training Data

  • Dataset: NLBSE Code Comment Classification Dataset – Pharo train split
  • Size (train): ~900 original training examples (expanded to ~1.9k via supersampling in the current configuration)
  • Label Space: 6 multi-label categories (Keyimplementationpoints, Example, Responsibilities, Intent, Keymessages, Collaborators)
  • Preprocessing:
    • Comments extracted from real-world Pharo projects.
    • Each sample represented using the combo field: "<comment_sentence> | <class_context>" (or similar contextual string).
    • For this transformer configuration, the training data come from data/processed/transformer, where a supersampling procedure is applied to reduce label imbalance.
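Assembling the combo string can be sketched as follows (the separator follows the `"<comment_sentence> | <class_context>"` format quoted above; the helper name is hypothetical):

```python
def make_combo(comment_sentence: str, class_context: str) -> str:
    """Join a comment sentence with its class context using the ' | '
    separator of the combo field. Helper name is hypothetical."""
    return f"{comment_sentence} | {class_context}"

combo = make_combo(
    "The intent of this class is to manage UI events",
    "MyWidget class",
)
```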

### Evaluation Data

  • Dataset: NLBSE Code Comment Classification Dataset – Pharo test split
  • Size (test): ~200 comment sentences
  • Evaluation Protocol: multi-label classification with micro/macro metrics and subset accuracy (exact-match), evaluated on the original, non-supersampled test split.

## Metrics

### Core Evaluation Metrics (Pharo, test split)

From the training/evaluation run logged in MLflow:

| language | category | precision | recall | f1 |
|----------|----------|-----------|--------|-----|
| pharo | Keyimplementationpoints | 0.47 | 0.68 | 0.56 |
| pharo | Example | 0.89 | 0.83 | 0.86 |
| pharo | Responsibilities | 0.57 | 0.76 | 0.65 |
| pharo | Intent | 0.83 | 0.90 | 0.86 |
| pharo | Keymessages | 0.47 | 0.73 | 0.57 |
| pharo | Collaborators | 0.33 | 0.57 | 0.42 |

  • Micro F1: 0.6720
  • Macro F1: 0.5980
  • Micro Precision: 0.5964
  • Micro Recall: 0.7696
  • Macro Precision: 0.5234
  • Macro Recall: 0.7157
  • Subset Accuracy (exact match): 0.5096
  • Micro Accuracy (per-label): 0.8694
  • Eval Loss (BCE with logits): 0.5889
  • Train Loss (final epoch): 0.2149

### Benchmarking Metrics

Average performance over Pharo transformer benchmarking runs:

  • Average Macro F1: 0.5980
  • Average Precision (macro): 0.5234
  • Average Recall (macro): 0.7157
  • Average Runtime: ~1.35 seconds (benchmark configuration)
  • Average GFLOPs: ~1943.77

These results indicate that the model captures the main Pharo comment categories reasonably well, with particularly strong recall; the lower macro precision and F1 reflect the dataset’s label imbalance and limited size.


## Quantitative Analysis

The evaluation is fully multi-label:

  • Micro metrics reflect overall correctness across all label decisions.
  • Macro metrics treat each of the six labels equally, exposing weaknesses on rarer categories (e.g., Collaborators, Keymessages).
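The distinction between the three aggregate scores can be made concrete with a small self-contained sketch (toy two-label data for illustration; the real evaluation uses the six Pharo labels):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def multilabel_scores(y_true, y_pred):
    """Micro F1 pools TP/FP/FN over all label decisions; macro F1 averages
    per-label F1; subset accuracy requires the whole vector to match exactly."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per-label [tp, fp, fn]
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    macro = sum(f1(*c) for c in counts) / n_labels
    subset = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return micro, macro, subset

# Toy example: the second prediction misses one positive label.
micro, macro, subset = multilabel_scores([[1, 0], [1, 1]], [[1, 0], [1, 0]])
```

Note how a single missed label drags macro F1 down much more than micro F1, which is exactly the pattern visible for the rarer Pharo categories.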

A detailed per-class breakdown (precision/recall/F1 per label) is available in the classification report artifact for the Pharo transformer run in MLflow. High-level observations include:

  • Stronger performance on Example and Intent (per-label F1 ≈ 0.86).
  • Weaker, more variable performance on Keyimplementationpoints, Keymessages, and Collaborators, which have fewer training examples and lower precision.

## Training Details

### Objective and Architecture

  • Base model: microsoft/codebert-base
  • Head: linear classification head with num_labels = 6
  • Problem type: multi_label_classification
  • Loss function: BCEWithLogitsLoss with per-label positive class weights computed from training label frequencies.
  • Sampling: WeightedRandomSampler over training samples to partially correct for label imbalance.
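As a sketch, per-label positive weights can be derived from training label frequencies as the negative-to-positive ratio, the usual form passed to `BCEWithLogitsLoss(pos_weight=...)`; the exact formula used in the project is an assumption:

```python
def positive_class_weights(label_matrix):
    """pos_weight[j] = (#negatives / #positives) for label j -- a common
    choice for BCEWithLogitsLoss; rare labels get weights above 1 so their
    positive examples contribute more to the loss."""
    n = len(label_matrix)
    weights = []
    for j in range(len(label_matrix[0])):
        pos = sum(row[j] for row in label_matrix)
        weights.append((n - pos) / pos if pos else 1.0)
    return weights

# Toy label matrix: label 0 is frequent (3/4), label 1 is rare (1/4).
w = positive_class_weights([[1, 0], [1, 1], [1, 0], [0, 0]])
```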

### Hyperparameters

  • Max sequence length: 128
  • Batch size: 16
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • Scheduler: Linear warmup and decay
  • Warmup ratio: 0.1
  • Number of epochs: 5
  • Prediction threshold: 0.5 (per-label on sigmoid probabilities)
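The learning-rate schedule implied by these settings can be sketched as follows (mirrors the behaviour of `get_linear_schedule_with_warmup` in transformers; the project's exact step accounting may differ):

```python
def linear_warmup_decay_lr(step, total_steps, warmup_ratio=0.1, base_lr=2e-5):
    """LR after `step` optimizer steps: linear warmup over the first 10%
    of steps, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```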

### Preprocessing and Balancing

  • Training data for Pharo are produced by the project’s preprocessing module, which:
    • ensures a combo text field is present,
    • parses the label strings into binary vectors,
    • applies supersampling on the train split only (up to a cap at the maximum original label frequency).
  • The test split is not modified and corresponds to the original NLBSE Pharo test data.
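A rough sketch of the supersampling idea: duplicate samples carrying under-represented labels until each label's count reaches the original maximum label frequency (the cap). The project's exact procedure and randomization may differ.

```python
import random

def supersample_to_cap(samples, seed=0):
    """Balance a multi-label train split by duplicating minority-label
    samples until every label reaches the original max label frequency."""
    rng = random.Random(seed)
    n_labels = len(samples[0]["labels"])
    counts = [sum(s["labels"][j] for s in samples) for j in range(n_labels)]
    cap = max(counts)
    out = list(samples)
    for j in range(n_labels):
        pool = [s for s in samples if s["labels"][j] == 1]
        while pool and counts[j] < cap:
            extra = rng.choice(pool)
            out.append(extra)
            # duplicating a sample increments every label it carries
            for k, v in enumerate(extra["labels"]):
                counts[k] += v
    return out

# Toy split: label 0 appears 3 times, label 1 only once (cap = 3).
balanced = supersample_to_cap([
    {"labels": [1, 0]}, {"labels": [1, 0]}, {"labels": [1, 0]}, {"labels": [0, 1]},
])
```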

### Hardware / Runtime

The runtime and GFLOPs figures are measured in the project’s benchmarking environment (single GPU, standard research workstation). Actual performance in deployment will depend on hardware and batch size.


## How to Use

Install dependencies:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with the actual repository):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/pharo-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

LABELS = [
    "Keyimplementationpoints",
    "Example",
    "Responsibilities",
    "Intent",
    "Keymessages",
    "Collaborators",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits)

    # Per-label threshold on sigmoid probabilities (0.5 by default)
    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "\"The intent of this class is to manage UI events\" | MyWidget class",
]
print(predict_labels(comments))
```

For consistency with the rest of the project, you can also use the shared ModelPredictor wrapper and the same preprocessing normalization applied during training.


## Limitations and Biases

  • Limited data: The Pharo split is relatively small compared to Java/Python; this constrains the model’s ability to generalize, especially for minority labels.

  • Imbalanced label distribution: Despite supersampling and positive weights, some categories remain harder to predict reliably.

  • Sensitivity to perturbations: Behavioral tests show:

    • deterministic behaviour and stable predictions on duplicate inputs,
    • alignment with several curated golden examples,
    • sensitivity to some benign text changes (whitespace, case, punctuation, typos) unless additional normalization or data augmentation is introduced.

## Ethical Considerations

  • The model is trained on comments from open-source Pharo projects and inherits their style, norms, and potential biases.
  • It does not filter or sanitize comment content; it only categorizes comments into design/documentation categories.
  • Outputs should be used as assistive signals in tooling and analysis, not as authoritative judgements.

## Citation

If you use this model in academic work or derived systems, please cite:

TheClouds Team. "NLBSE'26 Code Comment Classification – Pharo Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_pharo,
  title        = {NLBSE'26 Code Comment Classification: Pharo Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

## Contact

For questions, feedback, or collaboration requests related to this model, please contact:

  • Giacomo Signorile: g.signorile14@studenti.uniba.it
  • Davide Pio Posa: d.posa3@studenti.uniba.it
  • Marco Lillo: m.lillo21@studenti.uniba.it
  • Rebecca Margiotta: m.margiotta5@studenti.uniba.it
  • Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, using the NLBSE’26 competition data and building on earlier SetFit and Random Forest baselines.