---
language:
- en
tags:
- software-engineering
- multi-label-classification
- pull-requests
- skills
license: mit
---
# Dataset Card for SkillScope Dataset
## Dataset Details
- Name: SkillScope Dataset (NLBSE Tool Competition)
- Repository: NLBSE/SkillCompetition
- Version: 1.0 (Processed for Hopcroft Project)
- Type: Tabular / Text / Code
- Task: Multi-label Classification
- Maintainers: se4ai2526-uniba (Hopcroft Project Team)
## Intended Use
### Primary Intended Uses
- Training and evaluating multi-label classification models to predict required skills for resolving GitHub issues/Pull Requests.
- Analyzing the relationship between issue characteristics (title, body, code changes) and developer skills.
- Benchmarking feature extraction techniques (TF-IDF vs. Embeddings) in Software Engineering contexts.
### Out-of-Scope Use Cases
- Profiling individual developers (the dataset focuses on issues/PRs, not user profiling).
- General purpose code generation.
## Dataset Contents
The dataset consists of merged Pull Requests from 11 Java repositories.
- Total Samples (Raw): 7,245 merged PRs
- Source Files: 57,206
- Methods: 59,644
- Classes: 13,097
- Labels: 217 distinct skill labels (domain/sub-domain pairs)
### Schema
The data is stored in a SQLite database (`skillscope_data.db`) with the following main structures:
- `nlbse_tool_competition_data_by_issue`: Main table containing PR features (title, description, file paths) and skill labels.
- `vw_nlbse_tool_competition_data_by_file`: View providing file-level granularity.
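As a minimal sketch, the main table can be queried with Python's built-in `sqlite3`. Note that the column names below (`title`, `description`, `labels`) are illustrative assumptions; this card only documents the table and view names, and the example uses an in-memory database rather than the real `skillscope_data.db`:

```python
import sqlite3

# Illustrative in-memory stand-in for skillscope_data.db; the column
# names (title, description, labels) are assumptions, not confirmed
# by this card.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nlbse_tool_competition_data_by_issue "
    "(title TEXT, description TEXT, labels TEXT)"
)
conn.execute(
    "INSERT INTO nlbse_tool_competition_data_by_issue VALUES (?, ?, ?)",
    ("Fix NPE in parser", "Null check missing", "backend/parsing"),
)

# Fetch feature/label pairs for modeling.
rows = conn.execute(
    "SELECT title, labels FROM nlbse_tool_competition_data_by_issue"
).fetchall()
print(rows)  # → [('Fix NPE in parser', 'backend/parsing')]
```

Against the real database, the same `SELECT` would simply target `skillscope_data.db` via `sqlite3.connect("skillscope_data.db")`.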
## Context and Motivation
### Motivation
This dataset was created for the NLBSE (Natural Language-based Software Engineering) Tool Competition to foster research in automating skill identification in software maintenance. Accurately identifying required skills for an issue can help in automatic expert recommendation and task assignment.
### Context
The data is derived from open-source Java projects on GitHub. It represents real-world development scenarios where developers describe issues and implement fixes.
## Dataset Creation and Preprocessing
### Source Data
The raw data is downloaded from the Hugging Face Hub (NLBSE/SkillCompetition).
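A sketch of the download step using the `datasets` library (`pip install datasets`); the split and configuration names for this repository are not documented on this card and may differ:

```python
def load_skillscope(repo_id: str = "NLBSE/SkillCompetition"):
    """Download the raw competition data from the Hugging Face Hub.

    Requires the `datasets` package; split/config names for this
    repository are assumptions not confirmed by this card.
    """
    from datasets import load_dataset
    return load_dataset(repo_id)


if __name__ == "__main__":
    # Network access required; prints the available splits and features.
    print(load_skillscope())
```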
### Preprocessing Steps (Hopcroft Project)
To ensure data quality for modeling, the following preprocessing steps are applied (via `data_cleaning.py`):
- Duplicate Removal: ~6.5% of samples were identified as duplicates and removed.
- Conflict Resolution: ~8.9% of samples had conflicting labels for identical features; resolved using majority voting.
- Rare Label Removal: Labels with fewer than 5 occurrences were removed to ensure valid cross-validation.
- Feature Extraction:
  - Text Cleaning: Removal of URLs, HTML, Markdown, and normalization.
  - TF-IDF: Uni-grams and bi-grams (max 5,000 features).
  - Embeddings: Sentence embeddings using `all-MiniLM-L6-v2`.
- Splitting: 80/20 train/test split using `MultilabelStratifiedShuffleSplit` to maintain label distribution and prevent data leakage.
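The text-cleaning step can be sketched with standard-library regexes; the exact patterns used by `data_cleaning.py` are assumptions:

```python
import re

def clean_text(text: str) -> str:
    """Rough sketch of the cleaning step: strip URLs, HTML tags, and
    common Markdown syntax, then normalize whitespace and case.
    The actual rules in data_cleaning.py may differ."""
    text = re.sub(r"https?://\S+", " ", text)           # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"```.*?```", " ", text, flags=re.S)  # fenced code blocks
    text = re.sub(r"[#*`>_\[\]()]", " ", text)          # Markdown punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text("Fix **NPE** in <b>parser</b> see https://example.com"))
# → "fix npe in parser see"
```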
## Considerations
### Ethical Considerations
- Privacy: The data comes from public GitHub repositories. No private or sensitive personal information is explicitly included, though developer names/IDs might be present in metadata.
- Bias: The dataset is limited to Java repositories, so models may not generalize to other programming languages or ecosystems.
### Caveats and Recommendations
- Label Imbalance: The dataset is highly imbalanced (long-tail distribution of skills). Oversampling techniques such as MLSMOTE or ADASYN are recommended to mitigate the imbalance during training.
- Multi-label Nature: Most samples have multiple labels; evaluation metrics should account for this (e.g., Micro-F1).
- Text Noise: PR descriptions can be noisy or sparse; robust preprocessing is essential.
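Micro-F1 pools true positives, false positives, and false negatives across all labels before computing F1, which makes it robust under label imbalance. A minimal standard-library sketch over binary indicator matrices:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary indicator matrices
    (lists of equal-length 0/1 rows, one column per skill label)."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t & p              # predicted and true
            fp += (1 - t) & p        # predicted but not true
            fn += t & (1 - p)        # true but not predicted
    return 2 * tp / (2 * tp + fp + fn) if (tp or fp or fn) else 0.0

# Two samples, three labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_f1(y_true, y_pred))  # → 0.6666666666666666
```

In practice `sklearn.metrics.f1_score(y_true, y_pred, average="micro")` computes the same quantity.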