---
language: en
license: mit
tags:
  - multi-label-classification
  - tfidf
  - embeddings
  - random-forest
  - oversampling
  - mlsmote
  - software-engineering
datasets:
  - NLBSE/SkillCompetition
model-index:
  - name: random_forest_tfidf_gridsearch
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.595038375202279
          test_precision_micro: 0.690371373744215
          test_recall_micro: 0.5287455692919513
          test_f1_micro: 0.5988446098110252
        params:
          estimator__max_depth: '10'
          estimator__min_samples_split: '2'
          estimator__n_estimators: '200'
          feature_type: tfidf
          model_type: RandomForest + MultiOutput
          use_cleaned: 'True'
          oversampling: 'False'
        dvc:
          path: random_forest_tfidf_gridsearch.pkl
  - name: random_forest_tfidf_gridsearch_smote
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.59092598557871
          test_precision_micro: 0.6923300238053766
          test_recall_micro: 0.5154318319356791
          test_f1_micro: 0.59092598557871
        params:
          feature_type: tfidf
          oversampling: 'MLSMOTE (RandomOverSampler fallback)'
        dvc:
          path: random_forest_tfidf_gridsearch_smote.pkl
  - name: random_forest_embedding_gridsearch
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.6012826418169578
          test_precision_micro: 0.703060266254212
          test_recall_micro: 0.5252460640075934
          test_f1_micro: 0.6012826418169578
        params:
          feature_type: embedding
          oversampling: 'False'
        dvc:
          path: random_forest_embedding_gridsearch.pkl
  - name: random_forest_embedding_gridsearch_smote
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.5962084744755453
          test_precision_micro: 0.7031004709576139
          test_recall_micro: 0.5175288364319172
          test_f1_micro: 0.5962084744755453
        params:
          feature_type: embedding
          oversampling: 'MLSMOTE (RandomOverSampler fallback)'
        dvc:
          path: random_forest_embedding_gridsearch_smote.pkl
---
# Model cards for committed models

## Overview

- This file documents four trained model artifacts available in the repository: two TF‑IDF based Random Forest models (baseline and with oversampling) and two embedding‑based Random Forest models (baseline and with oversampling).
- For dataset provenance and preprocessing details, see `data/README.md`.
## 1) random_forest_tfidf_gridsearch

### Model details

- Name: `random_forest_tfidf_gridsearch`
- Organization: Hopcroft (se4ai2526-uniba)
- Model type: `RandomForestClassifier` wrapped in `MultiOutputClassifier` for multi‑label outputs
- Branch: `Milestone-4`

### Intended use

- Suitable for research and benchmarking on multi‑label skill prediction for GitHub PRs/issues. Not intended for automated high‑stakes decisions or for profiling individuals without further validation.

### Training data and preprocessing

- Dataset: processed SkillScope dataset (NLBSE/SkillCompetition) as prepared for this project.
- Features: TF‑IDF (unigrams and bigrams), capped at `MAX_TFIDF_FEATURES=5000`.
- Feature and label files are referenced via `get_feature_paths(feature_type='tfidf', use_cleaned=True)` in `config.py`.
### Evaluation

- Reported metrics: micro‑precision, micro‑recall and micro‑F1 on a held‑out test split.
- Protocol: 80/20 multilabel‑stratified split; hyperparameters selected via 5‑fold cross‑validation optimizing `f1_micro`.
- MLflow run: `random_forest_tfidf_gridsearch` (see `hopcroft_skill_classification_tool_competition/config.py`).
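The selection protocol above can be sketched as follows. This is a minimal illustration, not the training script: the data is random toy data, and the parameter grid echoes the values recorded in the front matter (with a smaller `n_estimators` so the sketch runs quickly).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 8))               # stand-in feature matrix
Y = rng.integers(0, 2, (40, 3))       # stand-in multi-label indicator targets

# MultiOutputClassifier exposes the inner estimator's params as "estimator__<name>",
# which is why the logged params carry that prefix.
grid = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid={
        "estimator__n_estimators": [50],    # the real search used larger values, e.g. 200
        "estimator__max_depth": [10],
        "estimator__min_samples_split": [2],
    },
    scoring="f1_micro",  # micro-F1, as in the card's protocol
    cv=5,
)
grid.fit(X, Y)
print(grid.best_params_["estimator__max_depth"])  # -> 10
```

Note that with multilabel indicator targets, `GridSearchCV` falls back to plain `KFold`; the project's 80/20 split used multilabel stratification separately.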
### Limitations and recommendations

- Trained on Java repositories; generalization to other languages is not guaranteed.
- Rare labels suffer under label imbalance; apply per‑label decision thresholds or further sampling strategies if required.
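Per‑label thresholding, as recommended above, can look like the following sketch. The probability matrix and threshold values here are invented for illustration; in practice the probabilities would come from the per‑label `predict_proba` outputs and the thresholds from a validation sweep.

```python
import numpy as np

# Illustrative positive-class probabilities for 4 samples x 3 labels,
# e.g. stacked from each per-label predict_proba of a MultiOutputClassifier.
proba = np.array([
    [0.80, 0.10, 0.55],
    [0.30, 0.70, 0.40],
    [0.65, 0.20, 0.90],
    [0.10, 0.45, 0.35],
])

# One threshold per label, tuned (e.g. for per-label F1) on validation data;
# rare labels often benefit from thresholds below the default 0.5.
thresholds = np.array([0.5, 0.4, 0.6])

# Broadcasting compares each column against its own threshold.
Y_pred = (proba >= thresholds).astype(int)
print(Y_pred)
```

Lowering a rare label's threshold trades precision for recall on that label only, leaving the other labels untouched.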
### Usage

- Artifact path: `models/random_forest_tfidf_gridsearch.pkl`.
- Example:

```python
import joblib

# Load the committed model artifact.
model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')
# X_tfidf must be produced by the same fitted TF-IDF vectorizer used at
# training time (e.g. the committed tfidf_vectorizer.pkl).
y = model.predict(X_tfidf)
```
## 2) random_forest_tfidf_gridsearch_smote

### Model details

- Name: `random_forest_tfidf_gridsearch_smote`
- Model type: `RandomForestClassifier` inside `MultiOutputClassifier`, trained with multi‑label oversampling

### Intended use

- Intended to improve recall on under‑represented labels by applying MLSMOTE (or a RandomOverSampler fallback) during training.

### Training and preprocessing

- Features: TF‑IDF (same configuration as the baseline).
- Oversampling: a local MLSMOTE implementation when available; otherwise `RandomOverSampler`. Oversampling metadata (method and synthetic sample counts) is logged to MLflow.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py` (action `smote`).
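The fallback behaviour described above can be sketched as follows. This is purely illustrative: the project's MLSMOTE lives in its own module, and the fallback shown here is a naive random duplication of samples carrying rare labels (imblearn's `RandomOverSampler` itself targets single‑label data, so a multi‑label fallback has to work along these lines).

```python
import numpy as np

def oversample_rare_labels(X, Y, min_count=3, seed=0):
    """Naive multi-label fallback: duplicate random samples that carry a
    rare label until each label reaches at least min_count positives."""
    rng = np.random.default_rng(seed)
    X_parts, Y_parts = [X], [Y]
    for label in range(Y.shape[1]):
        idx = np.flatnonzero(Y[:, label])   # samples positive for this label
        deficit = min_count - len(idx)
        if len(idx) > 0 and deficit > 0:
            picks = rng.choice(idx, size=deficit, replace=True)
            X_parts.append(X[picks])
            Y_parts.append(Y[picks])
    return np.vstack(X_parts), np.vstack(Y_parts)

# Toy data: label 1 appears only once and gets duplicated up to min_count.
X = np.arange(12, dtype=float).reshape(6, 2)
Y = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 0], [1, 0]])
X_os, Y_os = oversample_rare_labels(X, Y, min_count=3)
print(Y_os.sum(axis=0))  # label 1 now has >= 3 positives
```

Unlike MLSMOTE, which synthesizes new feature vectors from nearest neighbours, duplication only reweights existing samples; the card's "synthetic sample counts" logging applies to the MLSMOTE path.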
### Evaluation

- MLflow run: `random_forest_tfidf_gridsearch_smote`.

### Limitations and recommendations

- Synthetic samples may introduce distributional artifacts; validate synthetic examples and per‑label metrics before deployment.

### Usage

- Artifact path: `models/random_forest_tfidf_gridsearch_smote.pkl`.
## 3) random_forest_embedding_gridsearch

### Model details

- Name: `random_forest_embedding_gridsearch`
- Features: sentence embeddings produced by `all-MiniLM-L6-v2` (see `config.EMBEDDING_MODEL_NAME`).

### Intended use

- Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.

### Training and preprocessing

- Embeddings are generated and stored via `get_feature_paths(feature_type='embedding', use_cleaned=True)`.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py`.
### Evaluation

- MLflow run: `random_forest_embedding_gridsearch`.

### Limitations and recommendations

- Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.
### Usage

- Artifact path: `models/random_forest_embedding_gridsearch.pkl`.
- Example:

```python
import joblib

model = joblib.load('models/random_forest_embedding_gridsearch.pkl')
# X_embeddings: sentence-embedding matrix produced with the same
# all-MiniLM-L6-v2 encoder used at training time.
y = model.predict(X_embeddings)
```
## 4) random_forest_embedding_gridsearch_smote

### Model details

- Name: `random_forest_embedding_gridsearch_smote`
- Combines embedding features with multi‑label oversampling to address rare labels.

### Training and evaluation

- Oversampling: MLSMOTE preferred; `RandomOverSampler` fallback when MLSMOTE is unavailable.
- MLflow run: `random_forest_embedding_gridsearch_smote`.

### Limitations and recommendations

- Review synthetic examples and re‑evaluate on target data prior to deployment.

### Usage

- Artifact path: `models/random_forest_embedding_gridsearch_smote.pkl`.
## Publishing guidance for Hugging Face Hub

- The YAML front matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:
  - `README.md` (this file)
  - model artifact(s) (`*.pkl`)
  - vectorizer(s) and label map (e.g. `tfidf_vectorizer.pkl`, `label_names.pkl`)
  - a minimal inference example or notebook
## Evaluation Data and Protocol

- Evaluation split: an 80/20 multilabel‑stratified train/test split was used for final evaluation.
- Cross‑validation: hyperparameters were selected via 5‑fold cross‑validation optimizing `f1_micro`.
- Test metrics reported: micro precision, micro recall and micro F1 (recorded in the YAML `model-index` for each model).
## Quantitative Analyses

- Reported per‑model results: micro‑precision, micro‑recall and micro‑F1 on the held‑out test split.
- Where available, `cv_best_f1_micro` is the best cross‑validation `f1_micro` recorded during training; when no CV value was present in tracking, the test F1 is used as a proxy and noted in this README.
- Comparability: TF‑IDF and embedding models are evaluated on the same held‑out splits (only the features differ), so the reported metrics are comparable for broad benchmarking but not sufficient for per‑label fairness analyses.
## How Metrics Were Computed

- Metrics were computed with scikit‑learn's `precision_score`, `recall_score` and `f1_score` using `average='micro'` and `zero_division=0` on the held‑out test labels and model predictions.
- The test feature and label files are available under `data/processed/tfidf/` and `data/processed/embedding/` (paths referenced from `hopcroft_skill_classification_tool_competition.config.get_feature_paths`).
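To make the micro averaging concrete, the sketch below shows what `average='micro'` computes on toy multi‑label data: true/false positives are pooled across every (sample, label) cell before computing a single precision, recall and F1. The label matrices are invented for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy multi-label ground truth and predictions (3 samples x 3 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

p = precision_score(y_true, y_pred, average='micro', zero_division=0)
r = recall_score(y_true, y_pred, average='micro', zero_division=0)
f1 = f1_score(y_true, y_pred, average='micro', zero_division=0)

# Pooled over all cells: TP = 3, FP = 0, FN = 2,
# so precision = 1.0, recall = 0.6, F1 = 0.75.
print(p, r, f1)
```

`zero_division=0` makes a label with no predicted (or no true) positives contribute 0 instead of raising a warning-laden NaN, which matters for rare labels.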
## Ethical Considerations and Caveats

- The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
- Label imbalance is present; oversampling (MLSMOTE, with a RandomOverSampler fallback) was used in two variants to improve recall on rare labels. Inspect per‑label metrics before deploying.
- These models and this README are intended for research and benchmarking; they are not validated for safety‑critical or high‑stakes automated decision‑making.