---
language: en
license: mit
tags:
- multi-label-classification
- tfidf
- embeddings
- random-forest
- oversampling
- mlsmote
- software-engineering
datasets:
- NLBSE/SkillCompetition
model-index:
- name: random_forest_tfidf_gridsearch
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.595038375202279
      test_precision_micro: 0.690371373744215
      test_recall_micro: 0.5287455692919513
      test_f1_micro: 0.5988446098110252
    params:
      estimator__max_depth: '10'
      estimator__min_samples_split: '2'
      estimator__n_estimators: '200'
      feature_type: tfidf
      model_type: RandomForest + MultiOutput
      use_cleaned: 'True'
      oversampling: 'False'
    dvc:
      path: random_forest_tfidf_gridsearch.pkl
- name: random_forest_tfidf_gridsearch_smote
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.59092598557871
      test_precision_micro: 0.6923300238053766
      test_recall_micro: 0.5154318319356791
      test_f1_micro: 0.59092598557871
    params:
      feature_type: tfidf
      oversampling: 'MLSMOTE (RandomOverSampler fallback)'
    dvc:
      path: random_forest_tfidf_gridsearch_smote.pkl
- name: random_forest_embedding_gridsearch
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.6012826418169578
      test_precision_micro: 0.703060266254212
      test_recall_micro: 0.5252460640075934
      test_f1_micro: 0.6012826418169578
    params:
      feature_type: embedding
      oversampling: 'False'
    dvc:
      path: random_forest_embedding_gridsearch.pkl
- name: random_forest_embedding_gridsearch_smote
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.5962084744755453
      test_precision_micro: 0.7031004709576139
      test_recall_micro: 0.5175288364319172
      test_f1_micro: 0.5962084744755453
    params:
      feature_type: embedding
      oversampling: 'MLSMOTE (RandomOverSampler fallback)'
    dvc:
      path: random_forest_embedding_gridsearch_smote.pkl
---

Model cards for committed models

Overview

- This file documents four trained model artifacts available in the repository: two TF-IDF-based Random Forest models (baseline and with oversampling) and two embedding-based Random Forest models (baseline and with oversampling).
- For dataset provenance and preprocessing details see `data/README.md`.

1) random_forest_tfidf_gridsearch

Model details

- Name: `random_forest_tfidf_gridsearch`
- Organization: Hopcroft (se4ai2526-uniba)
- Model type: `RandomForestClassifier` wrapped in `MultiOutputClassifier` for multi-label outputs
- Branch: `Milestone-4`

Intended use

- Suitable for research and benchmarking on multi-label skill prediction for GitHub PRs/issues. Not intended for automated high-stakes decisions or profiling individuals without further validation.

Training data and preprocessing

- Dataset: Processed SkillScope Dataset (NLBSE/SkillCompetition) as prepared for this project.
- Features: TF-IDF (unigrams and bigrams), up to `MAX_TFIDF_FEATURES=5000`.
- Feature and label files are referenced via `get_feature_paths(feature_type='tfidf', use_cleaned=True)` in `config.py`.

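The TF-IDF configuration described above could be reproduced roughly as follows. This is a minimal sketch, assuming scikit-learn's `TfidfVectorizer`; the `texts` list is hypothetical, and the project's actual vectorizer settings (tokenization, stop words, cleaning) are not shown in this card and may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_TFIDF_FEATURES = 5000  # value quoted in this card

# Hypothetical issue/PR texts, for illustration only.
texts = [
    "Fix null pointer exception in JSON parser",
    "Add unit tests for thread pool executor",
    "Refactor database connection handling",
]

# Unigrams and bigrams, vocabulary capped at MAX_TFIDF_FEATURES terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=MAX_TFIDF_FEATURES)
X_tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per text

print(X_tfidf.shape)
```

At inference time the same fitted vectorizer has to be reused (`vectorizer.transform(...)`), since the model expects the column layout seen during training.
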
Evaluation

- Reported metrics include micro-precision, micro-recall and micro-F1 on a held-out test split.
- Protocol: 80/20 multilabel-stratified split; hyperparameters selected via 5-fold cross-validation optimizing `f1_micro`.
- MLflow run: `random_forest_tfidf_gridsearch` (see `hopcroft_skill_classification_tool_competition/config.py`).

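The protocol above (hold-out split, then a 5-fold grid search scored by `f1_micro`) can be sketched with scikit-learn. Everything below is an assumption for illustration: the data is synthetic, the grid is deliberately small, and a plain `train_test_split` stands in for the multilabel-stratified split, whose implementation is not reproduced in this card.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the project's feature and label matrices.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# Plain 80/20 split for brevity (NOT multilabel-stratified).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# One random forest per label; the `estimator__` prefix reaches the wrapped forest.
base = MultiOutputClassifier(RandomForestClassifier(random_state=0))
param_grid = {
    "estimator__n_estimators": [20, 50],
    "estimator__max_depth": [5, 10],
    "estimator__min_samples_split": [2],
}

# 5-fold cross-validation, selecting the configuration with the best micro-F1.
search = GridSearchCV(base, param_grid, scoring="f1_micro", cv=5)
search.fit(X_train, Y_train)

Y_pred = search.predict(X_test)
test_f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
print(search.best_params_, round(search.best_score_, 3), round(test_f1, 3))
```

Here `search.best_score_` plays the role of `cv_best_f1_micro` and `test_f1` the role of `test_f1_micro` in the front matter.
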
Limitations and recommendations

- Trained on Java repositories; generalization to other languages is not guaranteed.
- Label imbalance affects rare labels; apply per-label thresholds or further sampling strategies if required.

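Per-label thresholds, as recommended above, can be applied on top of `predict_proba`. The following is a sketch under stated assumptions: the data is synthetic, the threshold values are hypothetical, and `positive_probabilities` is an illustrative helper, not part of the project.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, Y)


def positive_probabilities(model, X):
    """Stack P(label = 1) for every label into an (n_samples, n_labels) array."""
    columns = []
    # MultiOutputClassifier.predict_proba returns one array per label.
    for estimator, proba in zip(model.estimators_, model.predict_proba(X)):
        classes = list(estimator.classes_)
        if 1 in classes:
            columns.append(proba[:, classes.index(1)])
        else:
            # Label never positive in training: no column for class 1.
            columns.append(np.zeros(proba.shape[0]))
    return np.column_stack(columns)


# Hypothetical thresholds: lower cut-offs for labels assumed to be rare.
thresholds = np.array([0.5, 0.5, 0.3, 0.3])

probs = positive_probabilities(model, X)
Y_pred = (probs >= thresholds).astype(int)
```

In practice the thresholds would be tuned per label on a validation split rather than fixed by hand.
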
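Usage

- Artifact path: `models/random_forest_tfidf_gridsearch.pkl`.
- Example:

```python
import joblib

# Load the pickled MultiOutputClassifier from the repository.
model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')
# X_tfidf: TF-IDF feature matrix built with the same fitted vectorizer used in training.
y = model.predict(X_tfidf)  # binary indicator matrix, one column per label
```
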
2) random_forest_tfidf_gridsearch_smote

Model details

- Name: `random_forest_tfidf_gridsearch_smote`
- Model type: `RandomForestClassifier` inside `MultiOutputClassifier` trained with multi-label oversampling

Intended use

- Intended to improve recall for under-represented labels by applying MLSMOTE (or RandomOverSampler fallback) during training.

Training and preprocessing

- Features: TF-IDF (same configuration as the baseline).
- Oversampling: local MLSMOTE implementation when available; otherwise `RandomOverSampler`. Oversampling metadata (method and synthetic sample counts) is logged to MLflow.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py` (action `smote`).

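To make the idea of multi-label oversampling concrete, here is a deliberately simplified sketch. It is not the project's MLSMOTE implementation and not `RandomOverSampler`: it only duplicates rows that carry a below-average-frequency label, whereas MLSMOTE interpolates new feature vectors between neighbouring minority samples. The function name `oversample_tail_labels` and the toy data are hypothetical.

```python
import numpy as np


def oversample_tail_labels(X, Y, random_state=0):
    """Duplicate rows that carry at least one below-average-frequency label.

    Returns the augmented X and Y plus the number of rows added.
    """
    rng = np.random.default_rng(random_state)
    label_counts = Y.sum(axis=0)
    tail_labels = label_counts < label_counts.mean()
    # Rows with at least one tail label are candidates for duplication.
    candidates = np.flatnonzero(Y[:, tail_labels].any(axis=1))
    if candidates.size == 0:
        return X, Y, 0
    picked = rng.choice(candidates, size=candidates.size, replace=True)
    X_new = np.vstack([X, X[picked]])
    Y_new = np.vstack([Y, Y[picked]])
    return X_new, Y_new, picked.size


# Toy data: label 0 is common (5 positives), label 1 is rare (2 positives).
X = np.arange(12, dtype=float).reshape(6, 2)
Y = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]])

X_res, Y_res, n_added = oversample_tail_labels(X, Y)
print(n_added, Y.sum(axis=0), Y_res.sum(axis=0))
```

Whatever method is used, oversampling should be applied to the training split only, so that the held-out test set keeps its original label distribution.
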
Evaluation

- MLflow run: `random_forest_tfidf_gridsearch_smote`.

Limitations and recommendations

- Synthetic samples may introduce distributional artifacts; validate synthetic examples and per-label metrics before deployment.

Usage

- Artifact path: `models/random_forest_tfidf_gridsearch_smote.pkl`.

3) random_forest_embedding_gridsearch

Model details

- Name: `random_forest_embedding_gridsearch`
- Features: sentence embeddings produced by `all-MiniLM-L6-v2` (see `config.EMBEDDING_MODEL_NAME`).

Intended use

- Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.

Training and preprocessing

- Embeddings generated and stored via `get_feature_paths(feature_type='embedding', use_cleaned=True)`.
- Training script: see `hopcroft_skill_classification_tool_competition/modeling/train.py`.

Evaluation

- MLflow run: `random_forest_embedding_gridsearch`.

Limitations and recommendations

- Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.

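Usage

- Artifact path: `models/random_forest_embedding_gridsearch.pkl`.
- Example:

```python
model = joblib.load('models/random_forest_embedding_gridsearch.pkl')  # needs `import joblib`
y = model.predict(X_embeddings)  # X_embeddings: all-MiniLM-L6-v2 sentence embeddings
```
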
4) random_forest_embedding_gridsearch_smote

Model details

- Name: `random_forest_embedding_gridsearch_smote`
- Combines embedding features with multi-label oversampling to address rare labels.

Training and evaluation

- Oversampling: MLSMOTE preferred; `RandomOverSampler` fallback if MLSMOTE is unavailable.
- MLflow run: `random_forest_embedding_gridsearch_smote`.

Limitations and recommendations

- Review synthetic examples and re-evaluate on target data prior to deployment.

Usage

- Artifact path: `models/random_forest_embedding_gridsearch_smote.pkl`.

Publishing guidance for Hugging Face Hub

- The YAML front-matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:
  - `README.md` (this file)
  - model artifact(s) (`*.pkl`)
  - vectorizer(s) and label map (e.g. `tfidf_vectorizer.pkl`, `label_names.pkl`)
  - a minimal inference example or notebook

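A minimal inference example needs one step beyond `predict`: mapping the binary output matrix back to label names. The helper below is a sketch; `decode_predictions`, the label names, and the prediction matrix are all hypothetical and only illustrate the role of the label map.

```python
import numpy as np


def decode_predictions(Y_pred, label_names):
    """Turn a binary (n_samples, n_labels) matrix into lists of label names."""
    Y_pred = np.asarray(Y_pred)
    return [
        [label_names[j] for j in np.flatnonzero(row)]
        for row in Y_pred
    ]


# Hypothetical label names and predictions, for illustration only.
label_names = ["Database", "Networking", "Testing"]
Y_pred = [[1, 0, 1], [0, 0, 0]]

print(decode_predictions(Y_pred, label_names))
# [['Database', 'Testing'], []]
```

In a published repository, `Y_pred` would come from `model.predict(...)` on features built with the shipped vectorizer, and `label_names` from the shipped label map.
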
Evaluation Data and Protocol

- Evaluation split: an 80/20 multilabel-stratified train/test split was used for final evaluation.
- Cross-validation: hyperparameters were selected via 5-fold cross-validation optimizing `f1_micro`.
- Test metrics reported: micro precision, micro recall, micro F1 (reported in the YAML `model-index` for each model).

Quantitative Analyses

- Reported unitary results: micro-precision, micro-recall and micro-F1 on the held-out test split for each model.
- Where available, `cv_best_f1_micro` is the best cross-validation `f1_micro` recorded during training; when no CV value was present in tracking, the test F1 is used as a proxy, which is why some `model-index` entries list identical `cv_best_f1_micro` and `test_f1_micro` values.
- Notes on comparability: TF-IDF and embedding models are evaluated on the same held-out splits (features differ); reported metrics are comparable for broad benchmarking but not for per-label fairness analyses.

How Metrics Were Computed

- Metrics were computed using scikit-learn's `precision_score`, `recall_score`, and `f1_score` with `average='micro'` and `zero_division=0` on the held-out test labels and model predictions.
- Test feature and label files used are available under `data/processed/tfidf/` and `data/processed/embedding/` (paths referenced from `hopcroft_skill_classification_tool_competition.config.get_feature_paths`).

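The metric calls described above look like this on a toy example. The arrays are made up for illustration; only the scikit-learn functions and their `average='micro'` and `zero_division=0` arguments come from this card.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy multi-label ground truth and predictions (3 samples, 3 labels).
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

# Micro-averaging pools true/false positives and negatives over all labels.
kwargs = {"average": "micro", "zero_division": 0}
precision = precision_score(Y_true, Y_pred, **kwargs)  # 3 TP / (3 TP + 1 FP)
recall = recall_score(Y_true, Y_pred, **kwargs)        # 3 TP / (3 TP + 2 FN)
f1 = f1_score(Y_true, Y_pred, **kwargs)                # 2*3 / (2*3 + 1 + 2)

print(precision, recall, round(f1, 4))
```

Because micro-averaging pools counts across labels, frequent labels dominate these scores, which is why per-label metrics are worth inspecting separately for rare labels.
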
Ethical Considerations and Caveats

- The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
- Label imbalance is present; oversampling (MLSMOTE or RandomOverSampler fallback) was used in two variants to improve recall for rare labels. Inspect per-label metrics before deploying.
- The models and README are intended for research and benchmarking. They are not validated for safety-critical or high-stakes automated decision-making.