---
language:
- en
tags:
- software-engineering
- multi-label-classification
- pull-requests
- skills
license: mit
---
# Dataset Card for SkillScope Dataset
## Dataset Details
- Name: SkillScope Dataset (NLBSE Tool Competition)
- Repository: NLBSE/SkillCompetition
- Version: 1.0 (Processed for Hopcroft Project)
- Type: Tabular / Text / Code
- Task: Multi-label Classification
- Maintainers: se4ai2526-uniba (Hopcroft Project Team)
## Intended Use
### Primary Intended Uses
- Training and evaluating multi-label classification models to predict required skills for resolving GitHub issues/Pull Requests.
- Analyzing the relationship between issue characteristics (title, body, code changes) and developer skills.
- Benchmarking feature extraction techniques (TF-IDF vs. Embeddings) in Software Engineering contexts.
### Out-of-Scope Use Cases
- Profiling individual developers (the dataset focuses on issues/PRs, not user profiling).
- General purpose code generation.
## Dataset Contents
The dataset consists of merged Pull Requests from 11 Java repositories.
- Total Samples (Raw): 7,245 merged PRs
- Source Files: 57,206
- Methods: 59,644
- Classes: 13,097
- Labels: 217 distinct skill labels (domain/sub-domain pairs)
### Schema
The data is stored in a SQLite database (`skillscope_data.db`) with the following main structures:
- `nlbse_tool_competition_data_by_issue`: Main table containing PR features (title, description, file paths) and skill labels.
- `vw_nlbse_tool_competition_data_by_file`: View providing file-level granularity.
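As a minimal sketch, the main table can be queried with Python's built-in `sqlite3`. Note that the column names below (`title`, `description`, `labels`) are illustrative assumptions; this card only documents the table and view names, and the example uses an in-memory database rather than the real `skillscope_data.db`:

```python
import sqlite3

# Illustrative in-memory stand-in for skillscope_data.db; the column
# names (title, description, labels) are assumptions, not confirmed
# by this card.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nlbse_tool_competition_data_by_issue "
    "(title TEXT, description TEXT, labels TEXT)"
)
conn.execute(
    "INSERT INTO nlbse_tool_competition_data_by_issue VALUES (?, ?, ?)",
    ("Fix NPE in parser", "Null check missing", "backend/parsing"),
)

# Fetch feature/label pairs for modeling.
rows = conn.execute(
    "SELECT title, labels FROM nlbse_tool_competition_data_by_issue"
).fetchall()
print(rows)  # → [('Fix NPE in parser', 'backend/parsing')]
```

Against the real database, the same `SELECT` would simply target `skillscope_data.db` via `sqlite3.connect("skillscope_data.db")`.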
## Context and Motivation
### Motivation
This dataset was created for the NLBSE (Natural Language-based Software Engineering) Tool Competition to foster research in automating skill identification in software maintenance. Accurately identifying required skills for an issue can help in automatic expert recommendation and task assignment.
### Context
The data is derived from open-source Java projects on GitHub. It represents real-world development scenarios where developers describe issues and implement fixes.
## Dataset Creation and Preprocessing
### Source Data
The raw data is downloaded from the Hugging Face Hub (NLBSE/SkillCompetition).
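A sketch of the download step using the `datasets` library (`pip install datasets`); the split and configuration names for this repository are not documented on this card and may differ:

```python
def load_skillscope(repo_id: str = "NLBSE/SkillCompetition"):
    """Download the raw competition data from the Hugging Face Hub.

    Requires the `datasets` package; split/config names for this
    repository are assumptions not confirmed by this card.
    """
    from datasets import load_dataset
    return load_dataset(repo_id)


if __name__ == "__main__":
    # Network access required; prints the available splits and features.
    print(load_skillscope())
```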
### Preprocessing Steps (Hopcroft Project)
To ensure data quality for modeling, the following preprocessing steps are applied (via `data_cleaning.py`):
- Duplicate Removal: ~6.5% of samples were identified as duplicates and removed.
- Conflict Resolution: ~8.9% of samples had conflicting labels for identical features; resolved using majority voting.
- Rare Label Removal: Labels with fewer than 5 occurrences were removed to ensure valid cross-validation.
- Feature Extraction:
  - Text Cleaning: Removal of URLs, HTML, Markdown, and normalization.
  - TF-IDF: Uni-grams and bi-grams (max 5,000 features).
  - Embeddings: Sentence embeddings using `all-MiniLM-L6-v2`.
- Splitting: 80/20 train/test split using `MultilabelStratifiedShuffleSplit` to maintain label distribution and prevent data leakage.
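The text-cleaning step can be sketched with standard-library regexes; the exact patterns used by `data_cleaning.py` are assumptions:

```python
import re

def clean_text(text: str) -> str:
    """Rough sketch of the cleaning step: strip URLs, HTML tags, and
    common Markdown syntax, then normalize whitespace and case.
    The actual rules in data_cleaning.py may differ."""
    text = re.sub(r"https?://\S+", " ", text)           # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"```.*?```", " ", text, flags=re.S)  # fenced code blocks
    text = re.sub(r"[#*`>_\[\]()]", " ", text)          # Markdown punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text("Fix **NPE** in <b>parser</b> see https://example.com"))
# → "fix npe in parser see"
```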
## Considerations
### Ethical Considerations
- Privacy: The data comes from public GitHub repositories. No private or sensitive personal information is explicitly included, though developer names/IDs might be present in metadata.
- Bias: The dataset is limited to Java repositories, so models may not generalize to other programming languages or ecosystems.
### Caveats and Recommendations
- Label Imbalance: The dataset is highly imbalanced (long-tail distribution of skills). Oversampling techniques such as MLSMOTE or ADASYN are recommended to mitigate the imbalance during training.
- Multi-label Nature: Most samples have multiple labels; evaluation metrics should account for this (e.g., Micro-F1).
- Text Noise: PR descriptions can be noisy or sparse; robust preprocessing is essential.
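Micro-F1 pools true positives, false positives, and false negatives across all labels before computing F1, which makes it robust under label imbalance. A minimal standard-library sketch over binary indicator matrices:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary indicator matrices
    (lists of equal-length 0/1 rows, one column per skill label)."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t & p              # predicted and true
            fp += (1 - t) & p        # predicted but not true
            fn += t & (1 - p)        # true but not predicted
    return 2 * tp / (2 * tp + fp + fn) if (tp or fp or fn) else 0.0

# Two samples, three labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_f1(y_true, y_pred))  # → 0.6666666666666666
```

In practice `sklearn.metrics.f1_score(y_true, y_pred, average="micro")` computes the same quantity.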