---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- python
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Python Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Python)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.6385
    - type: f1
      name: Micro F1
      value: 0.6781
    - type: precision
      name: Macro Precision
      value: 0.5900
    - type: recall
      name: Macro Recall
      value: 0.7061
    - type: accuracy
      name: Subset Accuracy
      value: 0.5690
---

# Transformer Model (CodeBERT) for Python Code Comment Classification

## Model Details

- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Python (code comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0

### Description

This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment.

The classifier operates on the project’s `combo` field (concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector.

### Label Set

For Python, the model predicts the following 5 categories (fixed order in the classifier head):

1. `Usage`
2. `Parameters`
3. `DevelopmentNotes`
4. `Expand`
5. `Summary`

Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.

---

## Intended Use

The model is intended for:

- research on **code comment classification** in Python projects,
- mining and analysis of Python documentation comments,
- tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support).

It is designed for **Python code comments** in English or English-like technical language.

### Out-of-Scope Uses

- Generic natural language classification outside software engineering.
- Non-English comments without additional fine-tuning or adaptation.
- Use in safety- or life-critical decision making.

---

## Data

### Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Python train split
- **Size (train):** ~1.4k original training examples (with optional supersampled expansion to ~2k examples, depending on the configuration)
- **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`)
- **Preprocessing:**
  - Comments extracted from open-source Python projects.
  - Each instance represented via the `combo` field: `"<comment_sentence> | <class_context>"`.
  - The project’s preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`. The metrics reported here correspond to the current transformer configuration logged in MLflow for Python.
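
The `combo` format described above can be sketched as a small helper. This is a minimal illustration, not the project's preprocessing code: the function name `build_combo` and the whitespace trimming are assumptions, and only the `"<comment_sentence> | <class_context>"` layout comes from this card.

```python
def build_combo(comment_sentence: str, class_context: str) -> str:
    """Join a comment sentence and its context as '<sentence> | <context>'."""
    # Hypothetical helper: separator and field order follow the format string
    # quoted in this card; trimming surrounding whitespace is an assumption.
    return f"{comment_sentence.strip()} | {class_context.strip()}"


example = build_combo("Returns the parsed configuration. ", "config_loader.py")
print(example)
```

Inputs passed to the model at inference time should follow the same layout as the training data, otherwise the classifier sees a format it was not fine-tuned on.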

### Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Python test split
- **Size (test):** ~300 comment sentences
- **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match).

---

## Metrics

### Core Evaluation Metrics (Python, test split)

From the training/evaluation run logged in MLflow:

| Language | Category         | Precision | Recall | F1   |
|----------|------------------|-----------|--------|------|
| python   | Usage            | 0.80      | 0.76   | 0.78 |
| python   | Parameters       | 0.74      | 0.86   | 0.79 |
| python   | DevelopmentNotes | 0.41      | 0.50   | 0.45 |
| python   | Expand           | 0.49      | 0.67   | 0.57 |
| python   | Summary          | 0.63      | 0.82   | 0.71 |


- **Micro F1:** 0.6781  
- **Macro F1:** 0.6385  
- **Micro Precision:** 0.6230  
- **Micro Recall:** 0.7438  
- **Macro Precision:** 0.5900  
- **Macro Recall:** 0.7061  
- **Subset Accuracy (exact match):** 0.5690  
- **Micro Accuracy (per-label):** 0.8441  
- **Eval Loss (BCE with logits):** 0.6727  
- **Train Loss (final epoch):** 0.2937  

### Benchmarking Metrics

Average performance for the Python transformer benchmark:

- **Average Macro F1:** 0.6385  
- **Average Precision (macro):** 0.5900  
- **Average Recall (macro):** 0.7061  
- **Average Runtime:** ~0.94 seconds (benchmark configuration)  
- **Average GFLOPs:** ~1823.25  

These results indicate that the transformer captures useful patterns across all five Python comment categories. Per-label F1 ranges from 0.45 (`DevelopmentNotes`) to 0.79 (`Parameters`), with the weakest scores on `DevelopmentNotes` and `Expand`.

---

## Quantitative Analysis

The model is evaluated in a multi-label setting:

- **Micro metrics** emphasize the overall correctness across all label decisions.
- **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`).
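
To make the difference between the two averaging modes concrete, the following sketch computes micro F1, macro F1, and subset accuracy on a toy set of 5-dimensional label vectors. It is a plain-Python illustration under stated assumptions: the function names and the toy data are hypothetical, and the project's own evaluation code may rely on a library implementation instead.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """Micro F1, macro F1, and subset accuracy for binary label vectors."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    exact = 0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(t_row == p_row)  # every label must match
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)
    # Macro: average the per-label F1 scores, so each label counts equally.
    macro = sum(f1_from_counts(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    return {"micro_f1": micro, "macro_f1": macro, "subset_accuracy": exact / len(y_true)}


# Toy data (illustrative only, not taken from the dataset).
y_true = [[1, 0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [1, 0, 0, 1, 0]]
y_pred = [[1, 0, 0, 0, 1], [0, 1, 0, 0, 1], [0, 0, 0, 0, 0], [1, 0, 0, 1, 0]]
scores = multilabel_scores(y_true, y_pred)
print(scores)
```

In this toy example the third label is missed entirely, which pulls macro F1 (about 0.73) well below micro F1 (about 0.83); the same effect explains why a weak minority class lowers the macro scores more than the micro ones.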

Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable.

---

## Training Details

### Objective and Architecture

- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 5`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance.
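
The two imbalance-handling steps above can be sketched as follows. The exact formulas used by the project are not stated in this card, so this is an assumption: positive weights are computed as the negative-to-positive ratio per label, and each example's sampling weight is the inverse frequency of its rarest label. Function names and toy data are hypothetical.

```python
def positive_class_weights(labels):
    """Per-label weight = (#negatives / #positives), an assumed convention."""
    n = len(labels)
    weights = []
    for j in range(len(labels[0])):
        pos = sum(row[j] for row in labels)
        weights.append((n - pos) / pos if pos else 1.0)
    return weights


def example_sampling_weights(labels):
    """Per-example weight = inverse frequency of the example's rarest label."""
    freq = [sum(row[j] for row in labels) for j in range(len(labels[0]))]
    weights = []
    for row in labels:
        inv = [1.0 / freq[j] for j, v in enumerate(row) if v == 1]
        # Examples with no positive label fall back to a uniform weight.
        weights.append(max(inv) if inv else 1.0 / len(labels))
    return weights


# Toy label matrix in the card's label order (illustrative only).
toy_labels = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
]
pos_weight = positive_class_weights(toy_labels)
sample_weights = example_sampling_weights(toy_labels)
print(pos_weight)
print(sample_weights)
```

In a PyTorch training loop, `pos_weight` would typically be converted to a tensor and passed to `torch.nn.BCEWithLogitsLoss(pos_weight=...)`, and `sample_weights` to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)`.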

### Hyperparameters

- **Max sequence length:** 128  
- **Batch size:** 16  
- **Learning rate:** 2e-5  
- **Optimizer:** AdamW  
- **Scheduler:** Linear warmup and decay  
- **Warmup ratio:** 0.1  
- **Number of epochs:** 5  
- **Threshold for prediction:** 0.5 (per-label on sigmoid probabilities)
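
As an illustration of the scheduler settings above, the sketch below computes the learning rate at a given optimizer step for a linear warmup followed by linear decay. The step count is an assumption derived from the approximate training-set size (~1.4k examples) and the listed batch size and epochs; the real number depends on the configuration (for example, whether supersampling is enabled). The function name is hypothetical.

```python
import math


def linear_warmup_decay_lr(step: int, total_steps: int,
                           base_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed sizes: ~1400 training examples, batch size 16, 5 epochs.
steps_per_epoch = math.ceil(1400 / 16)
total_steps = steps_per_epoch * 5

for step in (0, 22, 44, 242, total_steps):
    print(step, linear_warmup_decay_lr(step, total_steps))
```

Under these assumptions the peak learning rate of 2e-5 is reached after the first 10% of steps and then decreases to zero by the end of the fifth epoch.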

### Preprocessing and Balancing

- Training uses the **Python** split prepared by the project’s preprocessing pipeline.
- Optional supersampling (oversampling of underrepresented labels with a cap at the maximum label frequency) is available and can be enabled to improve macro performance.
- The test split remains unchanged and corresponds to the original NLBSE Python test partition.
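
One possible reading of the supersampling step is sketched below: examples carrying underrepresented labels are duplicated until each label reaches the frequency of the most common label, and a duplicate is skipped if it would push any label above that cap. This is an assumption about the procedure, not the project's implementation; the function names and the toy rows are hypothetical.

```python
import random


def label_counts(rows):
    """Number of positive examples per label for (text, label_vector) rows."""
    n_labels = len(rows[0][1])
    return [sum(labels[j] for _, labels in rows) for j in range(n_labels)]


def supersample(rows, seed: int = 0):
    """Duplicate examples of rare labels, capped at the maximum label frequency."""
    rng = random.Random(seed)
    counts = label_counts(rows)
    cap = max(counts)
    out = list(rows)
    # Visit labels from rarest to most frequent (order fixed before sampling).
    for j in sorted(range(len(counts)), key=lambda k: counts[k]):
        pool = [r for r in rows if r[1][j] == 1]
        while counts[j] < cap:
            # Only duplicate rows whose labels are all still below the cap.
            candidates = [
                r for r in pool
                if all(counts[k] < cap for k, v in enumerate(r[1]) if v == 1)
            ]
            if not candidates:
                break
            text, labels = rng.choice(candidates)
            out.append((text, labels))
            for k, v in enumerate(labels):
                counts[k] += v
    return out


# Toy training rows (illustrative only).
toy_rows = [
    ("a", [1, 0, 0, 0, 0]),
    ("b", [1, 0, 0, 0, 0]),
    ("c", [1, 0, 0, 0, 1]),
    ("d", [0, 1, 0, 0, 0]),
    ("e", [0, 0, 1, 0, 0]),
    ("f", [0, 0, 0, 1, 0]),
]
balanced = supersample(toy_rows)
print(label_counts(toy_rows), "->", label_counts(balanced))
```

In the toy data the last label co-occurs only with the most frequent label, so it cannot be oversampled without exceeding the cap; multi-label data generally cannot be balanced perfectly by duplication alone.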

### Hardware / Runtime

The reported runtime and GFLOPs are based on the project’s benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size.

---

## How to Use

Install `transformers` and `torch`:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with your repository name):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/python-transformer"  # replace with actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

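# Label order must match the classifier head.
LABELS = [
    "Usage",
    "Parameters",
    "DevelopmentNotes",
    "Expand",
    "Summary",
]

def predict_labels(texts, threshold: float = 0.5):
    """Return the list of predicted label names for each input text."""
    if isinstance(texts, str):
        texts = [texts]

    # Tokenize with the same maximum sequence length used in training (128).
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        # Independent per-label probabilities (multi-label, so no softmax).
        probs = torch.sigmoid(logits)

    # Keep every label whose probability exceeds the threshold.
    preds = (probs > threshold).int().cpu().tolist()
    return [
        [LABELS[i] for i, v in enumerate(row) if v == 1]
        for row in preds
    ]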

# Example
comments = [
    "# Usage: call this function with a file path | module.py",
]
print(predict_labels(comments))
```

For full reproducibility consistent with the project, use the `ModelPredictor` wrapper and the same preprocessing used during training.

---

## Limitations and Biases

* **Domain-limited:** Trained only on Python code comments from open-source repositories.
* **Imbalanced labels:** Some categories are relatively underrepresented; performance on these labels can lag behind frequent ones.
* **Robustness:** Behavioral tests show that the current model:

  * is deterministic and stable on duplicate inputs,
  * aligns with several curated golden examples,
  * remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced.
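
A light normalization step of the kind mentioned above could look like the following sketch. It is not part of the released pipeline: the function name is hypothetical, it handles only whitespace and (optionally) case, and it does not address typos. Lowercasing is off by default because the CodeBERT tokenizer is case-sensitive, so changing case also changes the tokens the model sees.

```python
import re


def normalize_comment(text: str, lowercase: bool = False) -> str:
    """Collapse runs of whitespace and optionally lowercase the text."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text


print(normalize_comment("  Returns   the\tvalue \n"))
print(normalize_comment("  Returns   the\tvalue \n", lowercase=True))
```

If such a step is adopted, it should be applied identically at training and inference time so that both see the same input distribution.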

---

## Ethical Considerations

* The model reflects the style and biases of the open-source Python projects it was trained on.
* It does not filter offensive or inappropriate content in comments; it only predicts semantic categories.
* Outputs should be treated as assistive signals, not as authoritative judgements.

---

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_python,
  title        = {NLBSE'26 Code Comment Classification: Python Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {To be published}
}
```

## Contact

For questions, feedback, or collaboration requests related to this model, please contact:

* Giacomo Signorile: g.signorile14@studenti.uniba.it
* Davide Pio Posa: d.posa3@studenti.uniba.it
* Marco Lillo: m.lillo21@studenti.uniba.it
* Rebecca Margiotta: m.margiotta5@studenti.uniba.it
* Adriano Gentile: a.gentile97@studenti.uniba.com

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE’26 competition data and earlier SetFit and Random Forest baselines.