---
language: en
license: mit
tags:
- multi-label-classification
- tfidf
- embeddings
- random-forest
- oversampling
- mlsmote
- software-engineering
datasets:
- NLBSE/SkillCompetition
model-index:
- name: random_forest_tfidf_gridsearch
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.595038375202279
      test_precision_micro: 0.690371373744215
      test_recall_micro: 0.5287455692919513
      test_f1_micro: 0.5988446098110252
    params:
      estimator__max_depth: '10'
      estimator__min_samples_split: '2'
      estimator__n_estimators: '200'
      feature_type: tfidf
      model_type: RandomForest + MultiOutput
      use_cleaned: 'True'
      oversampling: 'False'
    dvc:
      path: random_forest_tfidf_gridsearch.pkl
- name: random_forest_tfidf_gridsearch_smote
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.59092598557871
      test_precision_micro: 0.6923300238053766
      test_recall_micro: 0.5154318319356791
      test_f1_micro: 0.59092598557871
    params:
      feature_type: tfidf
      oversampling: MLSMOTE (RandomOverSampler fallback)
    dvc:
      path: random_forest_tfidf_gridsearch_smote.pkl
- name: random_forest_embedding_gridsearch
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.6012826418169578
      test_precision_micro: 0.703060266254212
      test_recall_micro: 0.5252460640075934
      test_f1_micro: 0.6012826418169578
    params:
      feature_type: embedding
      oversampling: 'False'
    dvc:
      path: random_forest_embedding_gridsearch.pkl
- name: random_forest_embedding_gridsearch_smote
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.5962084744755453
      test_precision_micro: 0.7031004709576139
      test_recall_micro: 0.5175288364319172
      test_f1_micro: 0.5962084744755453
    params:
      feature_type: embedding
      oversampling: MLSMOTE (RandomOverSampler fallback)
    dvc:
      path: random_forest_embedding_gridsearch_smote.pkl
---
# Model cards for committed models

## Overview

- This file documents four trained model artifacts available in the repository: two TF‑IDF based Random Forest models (baseline and with oversampling) and two embedding‑based Random Forest models (baseline and with oversampling).
- For dataset provenance and preprocessing details see `data/README.md`.
## random_forest_tfidf_gridsearch

### Model details

- Name: `random_forest_tfidf_gridsearch`
- Organization: Hopcroft (se4ai2526-uniba)
- Model type: `RandomForestClassifier` wrapped in `MultiOutputClassifier` for multi-label outputs
- Branch: `Milestone-4`
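As a sketch of this model type, the following builds the same wrapper with scikit‑learn on synthetic data. The hyperparameter values mirror those reported in the front‑matter; this is an illustration, not the project's training code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))          # stand-in feature matrix
Y = rng.integers(0, 2, size=(40, 3))  # 3 binary labels per sample

# The wrapper fits one RandomForestClassifier per label column.
model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
)
model.fit(X, Y)
preds = model.predict(X)
print(preds.shape)  # (n_samples, n_labels)
```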
### Intended use

- Suitable for research and benchmarking on multi-label skill prediction for GitHub PRs/issues. Not intended for automated high‑stakes decisions or profiling individuals without further validation.
### Training data and preprocessing

- Dataset: Processed SkillScope Dataset (NLBSE/SkillCompetition) as prepared for this project.
- Features: TF‑IDF (unigrams and bigrams), up to `MAX_TFIDF_FEATURES=5000`.
- Feature and label files are referenced via `get_feature_paths(feature_type='tfidf', use_cleaned=True)` in `config.py`.
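A minimal sketch of this TF‑IDF configuration; the corpus below is illustrative, not project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "fix null pointer exception in parser",
    "add unit tests for the REST controller",
    "refactor database connection pooling",
]
# Unigrams and bigrams, capped at 5000 features, matching the card above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_tfidf = vectorizer.fit_transform(docs)
print(X_tfidf.shape)  # (n_docs, n_features), with n_features <= 5000
```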
### Evaluation

- Reported metrics include micro‑precision, micro‑recall and micro‑F1 on a held‑out test split.
- Protocol: 80/20 multilabel‑stratified split; hyperparameters selected via 5‑fold cross‑validation optimizing `f1_micro`.
- MLflow run: `random_forest_tfidf_gridsearch` (see `hopcroft_skill_classification_tool_competition/config.py`).
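The selection protocol can be sketched as follows. The parameter names mirror the `estimator__*` keys in the front‑matter, but the exact grid values searched by the project are an assumption here, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))          # synthetic features
Y = rng.integers(0, 2, size=(60, 2))  # synthetic multi-label targets

# The estimator__ prefix routes parameters to the wrapped forest.
param_grid = {
    "estimator__n_estimators": [100, 200],
    "estimator__max_depth": [10, None],
    "estimator__min_samples_split": [2, 5],
}
search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid,
    scoring="f1_micro",  # micro-averaged F1, as in the training protocol
    cv=5,
)
search.fit(X, Y)
print(search.best_params_, search.best_score_)
```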
### Limitations and recommendations

- Trained on Java repositories; generalization to other languages is not ensured.
- Label imbalance affects rare labels; apply per‑label thresholds or further sampling strategies if required.
### Usage

- Artifact path: `models/random_forest_tfidf_gridsearch.pkl`.
- Example:

```python
import joblib

# X_tfidf: a TF-IDF feature matrix produced by the fitted vectorizer
model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')
y = model.predict(X_tfidf)
```
## random_forest_tfidf_gridsearch_smote

### Model details

- Name: `random_forest_tfidf_gridsearch_smote`
- Model type: `RandomForestClassifier` inside `MultiOutputClassifier`, trained with multi‑label oversampling
### Intended use

- Intended to improve recall for under‑represented labels by applying MLSMOTE (or a RandomOverSampler fallback) during training.
### Training and preprocessing

- Features: TF‑IDF (same configuration as the baseline).
- Oversampling: local MLSMOTE implementation when available; otherwise `RandomOverSampler`. Oversampling metadata (method and synthetic sample counts) is logged to MLflow.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py` (action `smote`).
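As an illustration of the fallback idea only (a hand‑rolled sketch, not the project's MLSMOTE or `RandomOverSampler` code), random oversampling for multi‑label data can duplicate samples that carry rare labels:

```python
import numpy as np

def oversample_rare_labels(X, Y, min_count=10, seed=0):
    """Duplicate rows carrying a rare label until it reaches min_count."""
    rng = np.random.default_rng(seed)
    X, Y = X.copy(), Y.copy()
    for label in range(Y.shape[1]):
        carriers = np.flatnonzero(Y[:, label] == 1)
        deficit = min_count - len(carriers)
        if deficit > 0 and len(carriers) > 0:
            picks = rng.choice(carriers, size=deficit, replace=True)
            X = np.vstack([X, X[picks]])
            Y = np.vstack([Y, Y[picks]])
    return X, Y

X = np.random.default_rng(1).normal(size=(30, 4))
Y = np.zeros((30, 2), dtype=int)
Y[:20, 0] = 1  # common label
Y[:3, 1] = 1   # rare label
X_res, Y_res = oversample_rare_labels(X, Y, min_count=10)
print(Y_res.sum(axis=0))  # rare label count raised to at least 10
```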
### Evaluation

- MLflow run: `random_forest_tfidf_gridsearch_smote`.
### Limitations and recommendations

- Synthetic samples may introduce distributional artifacts; validate synthetic examples and per‑label metrics before deployment.
### Usage

- Artifact path: `models/random_forest_tfidf_gridsearch_smote.pkl`.
## random_forest_embedding_gridsearch

### Model details

- Name: `random_forest_embedding_gridsearch`
- Features: sentence embeddings produced by `all-MiniLM-L6-v2` (see `config.EMBEDDING_MODEL_NAME`).
### Intended use

- Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.
### Training and preprocessing

- Embeddings are generated and stored via `get_feature_paths(feature_type='embedding', use_cleaned=True)`.
- Training script: see `hopcroft_skill_classification_tool_competition/modeling/train.py`.
### Evaluation

- MLflow run: `random_forest_embedding_gridsearch`.
### Limitations and recommendations

- Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.
### Usage

- Artifact path: `models/random_forest_embedding_gridsearch.pkl`.
- Example: `model.predict(X_embeddings)`
## random_forest_embedding_gridsearch_smote

### Model details

- Name: `random_forest_embedding_gridsearch_smote`
- Combines embedding features with multi‑label oversampling to address rare labels.
### Training and evaluation

- Oversampling: MLSMOTE preferred; `RandomOverSampler` fallback if MLSMOTE is unavailable.
- MLflow run: `random_forest_embedding_gridsearch_smote`.
### Limitations and recommendations

- Review synthetic examples and re‑evaluate on target data prior to deployment.
### Usage

- Artifact path: `models/random_forest_embedding_gridsearch_smote.pkl`.
## Publishing guidance for Hugging Face Hub

The YAML front‑matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:

- `README.md` (this file)
- model artifact(s) (`*.pkl`)
- vectorizer(s) and label map (e.g. `tfidf_vectorizer.pkl`, `label_names.pkl`)
- a minimal inference example or notebook
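A minimal inference example of the kind recommended above might look like the following sketch. For self‑containment it fits a tiny stand‑in vectorizer and model in memory; a real example would `joblib.load` the committed `.pkl` artifacts and `label_names.pkl` instead, and the label names below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier

label_names = ["testing", "databases"]  # stand-in for label_names.pkl
docs = ["add unit tests for parser", "tune database connection pool"]
Y = np.array([[1, 0], [0, 1]])

# In practice: vectorizer = joblib.load('tfidf_vectorizer.pkl'), etc.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = vectorizer.fit_transform(docs)
model = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)

# Inference: vectorize new text, predict, map indicator columns to names.
new_doc = ["write unit tests for the connection pool"]
pred = model.predict(vectorizer.transform(new_doc))[0]
predicted_labels = [name for name, flag in zip(label_names, pred) if flag]
print(predicted_labels)
```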
## Evaluation Data and Protocol

- Evaluation split: an 80/20 multilabel‑stratified train/test split was used for final evaluation.
- Cross-validation: hyperparameters were selected via 5‑fold cross‑validation optimizing `f1_micro`.
- Test metrics reported: micro precision, micro recall, micro F1 (reported in the YAML `model-index` for each model).
## Quantitative Analyses

- Reported per‑model results: micro‑precision, micro‑recall and micro‑F1 on the held‑out test split.
- Where available, `cv_best_f1_micro` is the best cross‑validation `f1_micro` recorded during training; when a CV value was not present in tracking, the test F1 is used as a proxy and noted in this README.
- Notes on comparability: TF‑IDF and embedding models are evaluated on the same held‑out split (only the features differ); reported metrics are comparable for broad benchmarking but not for per‑label fairness analyses.
## How Metrics Were Computed

- Metrics were computed using scikit‑learn's `precision_score`, `recall_score`, and `f1_score` with `average='micro'` and `zero_division=0` on the held‑out test labels and model predictions.
- Test feature and label files are available under `data/processed/tfidf/` and `data/processed/embedding/` (paths referenced from `hopcroft_skill_classification_tool_competition.config.get_feature_paths`).
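A sketch of that computation on toy multi‑label arrays:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Micro averaging pools true/false positives across all labels.
precision = precision_score(y_true, y_pred, average="micro", zero_division=0)
recall = recall_score(y_true, y_pred, average="micro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
print(precision, recall, f1)  # 1.0 0.6 0.75
```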
## Ethical Considerations and Caveats

- The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
- Label imbalance is present; oversampling (MLSMOTE or a RandomOverSampler fallback) was used in two variants to improve recall for rare labels. Inspect per‑label metrics before deploying.
- The models and README are intended for research and benchmarking. They are not validated for safety‑critical or high‑stakes automated decisioning.