---
language:
- en
tags:
- text-classification
- code-comment-classification
- transformers
- codebert
- python
- software-engineering
- multi-label
license: mit
datasets:
- NLBSE/nlbse26-code-comment-classification
metrics:
- f1
- precision
- recall
- subset_accuracy
- runtime
- gflops
pipeline_tag: text-classification
library_name: transformers
inference: false
base_model: microsoft/codebert-base
model-index:
- name: CodeBERT Transformer for Python Code Comment Classification
  results:
  - task:
      type: text-classification
      name: Multi-label Text Classification
    dataset:
      name: NLBSE Code Comment Classification Dataset (Python)
      type: NLBSE/nlbse26-code-comment-classification
      split: test
    metrics:
    - type: f1
      name: Macro F1
      value: 0.6385
    - type: f1
      name: Micro F1
      value: 0.6781
    - type: precision
      name: Macro Precision
      value: 0.5900
    - type: recall
      name: Macro Recall
      value: 0.7061
    - type: accuracy
      name: Subset Accuracy
      value: 0.5690
---

# Transformer Model (CodeBERT) for Python Code Comment Classification

## Model Details

- **Model Type:** Transformer-based multi-label classifier (sequence classification head)
- **Base Model:** [`microsoft/codebert-base`](https://huggingface.co/microsoft/codebert-base)
- **Language:** Python (code comments in English)
- **License:** MIT
- **Developed by:** TheClouds
- **Model Date:** November 2025
- **Model Version:** 1.0

### Description

This model fine-tunes **CodeBERT** on the **Python** subset of the **NLBSE Code Comment Classification Dataset** for **multi-label** classification. Each Python code comment sentence is mapped to one or more semantic categories describing the role and intent of the comment.

The classifier operates on the project's `combo` field (the concatenation of the comment sentence with a compact context string) and produces a 5-dimensional binary label vector.

### Label Set

For Python, the model predicts the following 5 categories (fixed order in the classifier head):

1. `Usage`
2. `Parameters`
3. `DevelopmentNotes`
4. `Expand`
5. `Summary`

Each prediction is a length-5 vector of 0/1 decisions, obtained by applying a sigmoid activation to the logits and thresholding at 0.5 by default.

---

## Intended Use

The model is intended for:

- research on **code comment classification** in Python projects,
- mining and analysis of Python documentation comments,
- tooling that needs semantic tags for comments (e.g., documentation quality inspection, comment recommendation, navigation support).

It is designed for **Python code comments** in English or English-like technical language.

### Out-of-Scope Uses

- Generic natural language classification outside software engineering.
- Non-English comments without additional fine-tuning or adaptation.
- Use in safety- or life-critical decision making.

---

## Data

### Training Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Python train split
- **Size (train):** ~1.4k original training examples (with an optional supersampled expansion to ~2k examples, depending on the configuration)
- **Label Space:** 5 multi-label categories (`Usage`, `Parameters`, `DevelopmentNotes`, `Expand`, `Summary`)
- **Preprocessing:**
  - Comments extracted from open-source Python projects.
  - Each instance is represented via the `combo` field: `<comment_sentence> | <context>`.
  - The project's preprocessing pipeline can generate balanced training CSVs (via supersampling) under `data/processed/transformer`.

The metrics reported here correspond to the current transformer configuration logged in MLflow for Python.

### Evaluation Data

- **Dataset:** NLBSE Code Comment Classification Dataset – Python test split
- **Size (test):** ~300 comment sentences
- **Evaluation Protocol:** multi-label classification with micro and macro metrics, plus subset accuracy (exact match).
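The evaluation protocol above can be sketched with `scikit-learn` on a toy label matrix (the arrays below are illustrative, not taken from the actual test split). Note that `accuracy_score` on a 2-D indicator matrix computes exactly the subset (exact-match) accuracy reported in this card:

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

# Toy ground truth and predictions for 4 comment sentences over the
# 5 labels (Usage, Parameters, DevelopmentNotes, Expand, Summary).
y_true = np.array([
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
])
y_pred = np.array([
    [1, 0, 0, 0, 1],  # exact match
    [0, 1, 0, 1, 1],  # one spurious label -> not an exact match
    [1, 1, 0, 0, 0],  # exact match
    [0, 0, 1, 0, 0],  # one missed label -> not an exact match
])

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Subset accuracy: a sample counts as correct only if all 5 decisions match.
subset_acc = accuracy_score(y_true, y_pred)

print(f"micro F1: {micro_f1:.4f}, macro F1: {macro_f1:.4f}, subset acc: {subset_acc:.2f}")
```

On this toy data the micro F1 stays high (only 2 of 20 label decisions are wrong) while the subset accuracy drops to 0.50, mirroring the gap between the per-label micro accuracy (0.8441) and subset accuracy (0.5690) reported below.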
---

## Metrics

### Core Evaluation Metrics (Python, test split)

From the training/evaluation run logged in MLflow:

| language | category         | precision | recall | F1   |
|----------|------------------|-----------|--------|------|
| python   | Usage            | 0.80      | 0.76   | 0.78 |
| python   | Parameters       | 0.74      | 0.86   | 0.79 |
| python   | DevelopmentNotes | 0.41      | 0.50   | 0.45 |
| python   | Expand           | 0.49      | 0.67   | 0.57 |
| python   | Summary          | 0.63      | 0.82   | 0.71 |

- **Micro F1:** 0.6781
- **Macro F1:** 0.6385
- **Micro Precision:** 0.6230
- **Micro Recall:** 0.7438
- **Macro Precision:** 0.5900
- **Macro Recall:** 0.7061
- **Subset Accuracy (exact match):** 0.5690
- **Micro Accuracy (per-label):** 0.8441
- **Eval Loss (BCE with logits):** 0.6727
- **Train Loss (final epoch):** 0.2937

### Benchmarking Metrics

Average performance for the Python transformer benchmark:

- **Average Macro F1:** 0.6385
- **Average Precision (macro):** 0.5900
- **Average Recall (macro):** 0.7061
- **Average Runtime:** ~0.94 seconds (benchmark configuration)
- **Average GFLOPs:** ~1823.25

These results indicate that the transformer captures useful patterns across all five Python comment categories, with stronger performance on frequent labels and reasonable performance on less frequent ones.

---

## Quantitative Analysis

The model is evaluated in a multi-label setting:

- **Micro metrics** emphasize the overall correctness across all label decisions.
- **Macro metrics** treat all labels equally, highlighting the behaviour on minority classes (e.g., `DevelopmentNotes`).

Per-class metrics (precision/recall/F1) can be inspected in the detailed classification report logged as an artifact in MLflow for the Python transformer run. In general, the model performs better on high-frequency labels such as `Usage` and `Summary`, while performance on rarer labels is more variable.
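The micro/macro distinction can be made concrete with a small pure-Python sketch. The per-label (TP, FP, FN) counts below are made up for illustration and are not the model's actual confusion counts:

```python
# Hypothetical per-label (TP, FP, FN) counts: one frequent label that is
# predicted well and one rare label that is predicted poorly.
counts = {
    "frequent": (90, 10, 10),
    "rare": (1, 4, 4),
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro F1: average the per-label F1 scores, so the rare label
# contributes as much as the frequent one.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool the counts first, so the frequent label dominates.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro F1: {macro_f1:.3f}, micro F1: {micro_f1:.3f}")
```

Here the micro F1 (~0.87) is dominated by the well-predicted frequent label, while the macro F1 (0.55) is dragged down by the rare one, which is why this card reports both and why `DevelopmentNotes` weighs heavily on the macro scores.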
---

## Training Details

### Objective and Architecture

- **Base model:** `microsoft/codebert-base`
- **Head:** linear classification head with `num_labels = 5`
- **Problem type:** `multi_label_classification`
- **Loss function:** `BCEWithLogitsLoss` with per-label **positive class weights** computed from training label frequencies.
- **Sampling:** `WeightedRandomSampler` over training examples to reduce the impact of label imbalance.

### Hyperparameters

- **Max sequence length:** 128
- **Batch size:** 16
- **Learning rate:** 2e-5
- **Optimizer:** AdamW
- **Scheduler:** linear warmup and decay
- **Warmup ratio:** 0.1
- **Number of epochs:** 5
- **Prediction threshold:** 0.5 (per label, on sigmoid probabilities)

### Preprocessing and Balancing

- Training uses the **Python** split prepared by the project's preprocessing pipeline.
- Optional supersampling (oversampling of underrepresented labels, capped at the maximum label frequency) is available and can be enabled to improve macro performance.
- The test split remains unchanged and corresponds to the original NLBSE Python test partition.

### Hardware / Runtime

The reported runtime and GFLOPs are based on the project's benchmarking setup (single GPU, standard research workstation). Actual latency and throughput depend on hardware and batch size.
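As a rough sketch of the imbalance handling described above, the snippet below derives per-label positive class weights (the usual negative/positive ratio convention for `BCEWithLogitsLoss(pos_weight=...)`) and one plausible per-example weighting scheme for `WeightedRandomSampler` from a toy binary label matrix. The exact formulas used by the project's pipeline may differ:

```python
import numpy as np

# Toy binary label matrix: 6 training examples x 5 labels
# (Usage, Parameters, DevelopmentNotes, Expand, Summary).
Y = np.array([
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
], dtype=float)

n = Y.shape[0]
pos = Y.sum(axis=0)  # number of positive examples per label

# Common convention for BCEWithLogitsLoss(pos_weight=...):
# weight each label's positives by its negative/positive ratio.
pos_weight = (n - pos) / pos

# One plausible scheme for WeightedRandomSampler: give each example
# the inverse frequency of its rarest positive label, so examples
# carrying rare labels are drawn more often.
label_freq = pos / n
sample_weight = np.array([
    1.0 / label_freq[row.astype(bool)].min() for row in Y
])

print("pos_weight:", pos_weight)
print("sample_weight:", sample_weight)
```

With this toy matrix, `DevelopmentNotes` and `Expand` (one positive each) get `pos_weight = 5.0`, and the examples carrying them get the largest sampling weights, which is the intended effect of both mechanisms.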
---

## How to Use

Install `transformers` and `torch`:

```bash
pip install transformers torch
```

Then load the model and tokenizer (replace the model ID with your repository name):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "se4ai2526-uniba/python-transformer"  # replace with the actual ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Label order must match the classifier head.
LABELS = [
    "Usage",
    "Parameters",
    "DevelopmentNotes",
    "Expand",
    "Summary",
]

def predict_labels(texts, threshold: float = 0.5):
    if isinstance(texts, str):
        texts = [texts]
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Independent per-label decisions: sigmoid, then threshold.
    probs = torch.sigmoid(logits)
    preds = (probs > threshold).int().cpu().numpy()
    results = []
    for row in preds:
        labels = [LABELS[i] for i, v in enumerate(row) if v == 1]
        results.append(labels)
    return results

# Example
comments = [
    "# Usage: call this function with a file path | module.py",
]
print(predict_labels(comments))
```

For full reproducibility consistent with the project, use the `ModelPredictor` wrapper and the same preprocessing used during training.

---

## Limitations and Biases

* **Domain-limited:** trained only on Python code comments from open-source repositories.
* **Imbalanced labels:** some categories are relatively underrepresented; performance on these labels can lag behind frequent ones.
* **Robustness:** behavioral tests show that the current model:
  * is deterministic and stable on duplicate inputs,
  * aligns with several curated golden examples,
  * remains sensitive to some benign text changes (extra whitespace, case changes, typos) unless additional normalization/augmentation is introduced.

---

## Ethical Considerations

* The model reflects the style and biases of the open-source Python projects it was trained on.
* It does not filter offensive or inappropriate content in comments; it only predicts semantic categories.
* Outputs should be treated as assistive signals, not as authoritative judgements.

---

## Citation

If you use this model in academic work or derived systems, please cite:

> TheClouds Team. "NLBSE'26 Code Comment Classification – Python Model." 2025.

BibTeX:

```bibtex
@misc{theclouds_nlbse26_code_comment_classification_python,
  title        = {NLBSE'26 Code Comment Classification: Python Model},
  author       = {TheClouds Team},
  year         = {2025},
  note         = {Model available on Hugging Face},
  howpublished = {\url{To be published}}
}
```

Contact:

For questions, feedback, or collaboration requests related to this model, please contact:

> Giacomo Signorile: g.signorile14@studenti.uniba.it
> Davide Pio Posa: d.posa3@studenti.uniba.it
> Marco Lillo: m.lillo21@studenti.uniba.it
> Rebecca Margiotta: m.margiotta5@studenti.uniba.it
> Adriano Gentile: a.gentile97@studenti.uniba.it

Issue tracker: https://github.com/se4ai2526-uniba/TheClouds

## Acknowledgements

This model was developed as part of research on **Natural Language-Based Software Engineering (NLBSE)** and the **Code Comment Classification** task, building on the NLBSE'26 competition data and earlier SetFit and Random Forest baselines.