---
language: en
license: mit
tags:
- multi-label-classification
- tfidf
- embeddings
- random-forest
- oversampling
- mlsmote
- software-engineering
datasets:
- NLBSE/SkillCompetition
model-index:
- name: random_forest_tfidf_gridsearch
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.595038375202279
      test_precision_micro: 0.690371373744215
      test_recall_micro: 0.5287455692919513
      test_f1_micro: 0.5988446098110252
    params:
      estimator__max_depth: '10'
      estimator__min_samples_split: '2'
      estimator__n_estimators: '200'
      feature_type: tfidf
      model_type: RandomForest + MultiOutput
      use_cleaned: 'True'
      oversampling: 'False'
    dvc:
      path: random_forest_tfidf_gridsearch.pkl
- name: random_forest_tfidf_gridsearch_smote
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.59092598557871
      test_precision_micro: 0.6923300238053766
      test_recall_micro: 0.5154318319356791
      test_f1_micro: 0.59092598557871
    params:
      feature_type: tfidf
      oversampling: 'MLSMOTE (RandomOverSampler fallback)'
    dvc:
      path: random_forest_tfidf_gridsearch_smote.pkl
- name: random_forest_embedding_gridsearch
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.6012826418169578
      test_precision_micro: 0.703060266254212
      test_recall_micro: 0.5252460640075934
      test_f1_micro: 0.6012826418169578
    params:
      feature_type: embedding
      oversampling: 'False'
    dvc:
      path: random_forest_embedding_gridsearch.pkl
- name: random_forest_embedding_gridsearch_smote
  results:
  - status: success
    metrics:
      cv_best_f1_micro: 0.5962084744755453
      test_precision_micro: 0.7031004709576139
      test_recall_micro: 0.5175288364319172
      test_f1_micro: 0.5962084744755453
    params:
      feature_type: embedding
      oversampling: 'MLSMOTE (RandomOverSampler fallback)'
    dvc:
      path: random_forest_embedding_gridsearch_smote.pkl
---

 
Model cards for committed models

Overview
- This file documents four trained model artifacts available in the repository: two TF‑IDF based Random Forest models (baseline and with oversampling) and two embedding‑based Random Forest models (baseline and with oversampling).
- For dataset provenance and preprocessing details see `data/README.md`.

1) random_forest_tfidf_gridsearch

Model details
- Name: `random_forest_tfidf_gridsearch`
- Organization: Hopcroft (se4ai2526-uniba)
- Model type: `RandomForestClassifier` wrapped in `MultiOutputClassifier` for multi-label outputs
- Branch: `Milestone-4`

Intended use
- Suitable for research and benchmarking on multi-label skill prediction for GitHub PRs/issues. Not intended for automated high‑stakes decisions or profiling individuals without further validation.

Training data and preprocessing
- Dataset: Processed SkillScope Dataset (NLBSE/SkillCompetition) as prepared for this project.
- Features: TF‑IDF (unigrams and bigrams), up to `MAX_TFIDF_FEATURES=5000`.
- Feature and label files are referenced via `get_feature_paths(feature_type='tfidf', use_cleaned=True)` in `config.py`.

Evaluation
- Reported metrics include micro‑precision, micro‑recall and micro‑F1 on a held‑out test split.
- Protocol: 80/20 multilabel‑stratified split; hyperparameters selected via 5‑fold cross‑validation optimizing `f1_micro`.
- MLflow run: `random_forest_tfidf_gridsearch` (see `hopcroft_skill_classification_tool_competition/config.py`).
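
The selection protocol above can be sketched as follows. Toy data and a reduced grid are used for illustration; the actual search used larger values (e.g. `n_estimators=200`, see the `params` in the YAML front-matter) and a multilabel-stratified split rather than plain random data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 8))                # toy features
Y = rng.integers(0, 2, size=(60, 3))  # toy multi-label indicator targets

# Parameter names carry the `estimator__` prefix because the random forest
# sits inside MultiOutputClassifier.
param_grid = {
    'estimator__n_estimators': [10, 20],  # real search used larger values
    'estimator__max_depth': [5, 10],
    'estimator__min_samples_split': [2],
}
search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=42)),
    param_grid,
    scoring='f1_micro',  # the metric optimized during model selection
    cv=5,
)
search.fit(X, Y)
print(search.best_params_)
```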

Limitations and recommendations
- Trained on Java repositories; generalization to other languages is not ensured.
- Label imbalance affects rare labels; apply per‑label thresholds or further sampling strategies if required.
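
Applying per-label thresholds, as recommended above, can be sketched with predicted probabilities. The threshold values and toy data below are illustrative, not tuned values from this project:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(1)
X = rng.random((40, 5))
Y = rng.integers(0, 2, size=(40, 3))
clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
).fit(X, Y)

# predict_proba returns one (n_samples, 2) array per label; column 1 is P(label=1).
proba = np.column_stack([p[:, 1] for p in clf.predict_proba(X)])

thresholds = np.array([0.5, 0.3, 0.4])  # illustrative per-label cutoffs
Y_pred = (proba >= thresholds).astype(int)
print(Y_pred.shape)  # (40, 3)
```

Lowering the cutoff for a rare label trades precision for recall on that label only, leaving the other labels untouched.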

Usage
- Artifact path: `models/random_forest_tfidf_gridsearch.pkl`.
- Example (the vectorizer path is illustrative; the fitted vectorizer must be the one used at training time):
  ```python
  import joblib

  vectorizer = joblib.load('models/tfidf_vectorizer.pkl')
  model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')

  X_tfidf = vectorizer.transform(['Fix NullPointerException in connection pool'])
  y = model.predict(X_tfidf)
  ```

2) random_forest_tfidf_gridsearch_smote

Model details
- Name: `random_forest_tfidf_gridsearch_smote`
- Model type: `RandomForestClassifier` inside `MultiOutputClassifier` trained with multi‑label oversampling

Intended use
- Intended to improve recall for under‑represented labels by applying MLSMOTE (or RandomOverSampler fallback) during training.

Training and preprocessing
- Features: TF‑IDF (same configuration as the baseline).
- Oversampling: a local MLSMOTE implementation when available; otherwise `RandomOverSampler`. Oversampling metadata (method and synthetic sample counts) is logged to MLflow.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py` (action `smote`).
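
The fallback behaviour can be illustrated with a minimal sketch. This is not the repository's implementation: duplicating minority rows (what `RandomOverSampler` effectively does) stands in for MLSMOTE's synthetic-sample generation, and the tail-label criterion (count below the mean label count) is a simplification:

```python
import numpy as np

def oversample_tail_labels(X, Y, seed=0):
    """Duplicate rows carrying under-represented labels until each label
    approaches the mean label count (illustrative fallback, not MLSMOTE)."""
    rng = np.random.default_rng(seed)
    counts = Y.sum(axis=0)
    mean_count = counts.mean()
    extra_rows = []
    for j, c in enumerate(counts):
        if c < mean_count:
            candidates = np.flatnonzero(Y[:, j])  # rows that carry label j
            n_extra = int(mean_count - c)
            if n_extra > 0 and candidates.size > 0:
                extra_rows.extend(rng.choice(candidates, size=n_extra).tolist())
    if not extra_rows:
        return X, Y
    return np.vstack([X, X[extra_rows]]), np.vstack([Y, Y[extra_rows]])

X = np.arange(12, dtype=float).reshape(6, 2)
Y = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]])
Xo, Yo = oversample_tail_labels(X, Y)
print(Xo.shape, Yo.sum(axis=0))  # label 1 count rises from 2 to 3
```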

Evaluation
- MLflow run: `random_forest_tfidf_gridsearch_smote`.

Limitations and recommendations
- Synthetic samples may introduce distributional artifacts; validate synthetic examples and per‑label metrics before deployment.

Usage
- Artifact path: `models/random_forest_tfidf_gridsearch_smote.pkl`.

3) random_forest_embedding_gridsearch

Model details
- Name: `random_forest_embedding_gridsearch`
- Features: sentence embeddings produced by `all-MiniLM-L6-v2` (see `config.EMBEDDING_MODEL_NAME`).

Intended use
- Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.

Training and preprocessing
- Embeddings generated and stored via `get_feature_paths(feature_type='embedding', use_cleaned=True)`.
- Training script: see `hopcroft_skill_classification_tool_competition/modeling/train.py`.

Evaluation
- MLflow run: `random_forest_embedding_gridsearch`.

Limitations and recommendations
- Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.

Usage
- Artifact path: `models/random_forest_embedding_gridsearch.pkl`.
- Example (the encoder name follows `config.EMBEDDING_MODEL_NAME`):
  ```python
  import joblib
  from sentence_transformers import SentenceTransformer

  model = joblib.load('models/random_forest_embedding_gridsearch.pkl')
  X_embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(['Fix flaky test'])
  y = model.predict(X_embeddings)
  ```

4) random_forest_embedding_gridsearch_smote

Model details
- Name: `random_forest_embedding_gridsearch_smote`
- Combines embedding features with multi‑label oversampling to address rare labels.

Training and evaluation
- Oversampling: MLSMOTE preferred; `RandomOverSampler` fallback if MLSMOTE is unavailable.
- MLflow run: `random_forest_embedding_gridsearch_smote`.

Limitations and recommendations
- Review synthetic examples and re‑evaluate on target data prior to deployment.

Usage
- Artifact path: `models/random_forest_embedding_gridsearch_smote.pkl`.

Publishing guidance for Hugging Face Hub
- The YAML front‑matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:
  - `README.md` (this file)
  - model artifact(s) (`*.pkl`)
  - vectorizer(s) and label map (e.g. `tfidf_vectorizer.pkl`, `label_names.pkl`)
  - a minimal inference example or notebook

Evaluation Data and Protocol
- Evaluation split: an 80/20 multilabel‑stratified train/test split was used for final evaluation.
- Cross-validation: hyperparameters were selected via 5‑fold cross‑validation optimizing `f1_micro`.
- Test metrics reported: micro precision, micro recall, micro F1 (recorded in the YAML `model-index` for each model).

Quantitative Analyses
- Reported per‑model results: micro‑precision, micro‑recall and micro‑F1 on the held‑out test split.
- Where available, `cv_best_f1_micro` is the best cross‑validation `f1_micro` recorded during training; when a CV value was not present in tracking, the test F1 is used as a proxy and noted here.
- Notes on comparability: TF‑IDF and embedding models are evaluated on the same held‑out split (only the features differ); reported metrics are comparable for broad benchmarking but not for per‑label fairness analyses.

How Metrics Were Computed
- Metrics were computed using scikit‑learn's `precision_score`, `recall_score`, and `f1_score` with `average='micro'` and `zero_division=0` on the held‑out test labels and model predictions.
- Test feature and label files are available under `data/processed/tfidf/` and `data/processed/embedding/` (paths referenced from `hopcroft_skill_classification_tool_competition.config.get_feature_paths`).
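
The computation can be reproduced on a toy prediction matrix (the values below are illustrative, not taken from the actual test split):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])  # 5 positive labels overall
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])  # 3 predictions, all correct

p = precision_score(y_true, y_pred, average='micro', zero_division=0)
r = recall_score(y_true, y_pred, average='micro', zero_division=0)
f1 = f1_score(y_true, y_pred, average='micro', zero_division=0)
print(p, r, f1)  # 1.0 0.6 0.75
```

Micro averaging pools true/false positives and negatives across all labels before computing each metric, so frequent labels dominate the scores.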

Ethical Considerations and Caveats
- The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
- Label imbalance is present; oversampling (MLSMOTE or `RandomOverSampler` fallback) was used in two variants to improve recall for rare labels. Inspect per‑label metrics before deploying.
- The models and this README are intended for research and benchmarking. They are not validated for safety‑critical or high‑stakes automated decision‑making.