---
language: en
license: mit
tags:
  - multi-label-classification
  - tfidf
  - embeddings
  - random-forest
  - oversampling
  - mlsmote
  - software-engineering
datasets:
  - NLBSE/SkillCompetition
model-index:
  - name: random_forest_tfidf_gridsearch
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.595038375202279
          test_precision_micro: 0.690371373744215
          test_recall_micro: 0.5287455692919513
          test_f1_micro: 0.5988446098110252
        params:
          estimator__max_depth: '10'
          estimator__min_samples_split: '2'
          estimator__n_estimators: '200'
          feature_type: tfidf
          model_type: RandomForest + MultiOutput
          use_cleaned: 'True'
          oversampling: 'False'
        dvc:
          path: random_forest_tfidf_gridsearch.pkl
  - name: random_forest_tfidf_gridsearch_smote
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.59092598557871
          test_precision_micro: 0.6923300238053766
          test_recall_micro: 0.5154318319356791
          test_f1_micro: 0.59092598557871
        params:
          feature_type: tfidf
          oversampling: 'MLSMOTE (RandomOverSampler fallback)'
        dvc:
          path: random_forest_tfidf_gridsearch_smote.pkl
  - name: random_forest_embedding_gridsearch
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.6012826418169578
          test_precision_micro: 0.703060266254212
          test_recall_micro: 0.5252460640075934
          test_f1_micro: 0.6012826418169578
        params:
          feature_type: embedding
          oversampling: 'False'
        dvc:
          path: random_forest_embedding_gridsearch.pkl
  - name: random_forest_embedding_gridsearch_smote
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.5962084744755453
          test_precision_micro: 0.7031004709576139
          test_recall_micro: 0.5175288364319172
          test_f1_micro: 0.5962084744755453
        params:
          feature_type: embedding
          oversampling: 'MLSMOTE (RandomOverSampler fallback)'
        dvc:
          path: random_forest_embedding_gridsearch_smote.pkl
---
# Model cards for committed models

## Overview

- This file documents four trained model artifacts available in the repository: two TF‑IDF based Random Forest models (baseline and with oversampling) and two embedding‑based Random Forest models (baseline and with oversampling).
- For dataset provenance and preprocessing details, see `data/README.md`.
## 1) random_forest_tfidf_gridsearch

### Model details

- Name: `random_forest_tfidf_gridsearch`
- Organization: Hopcroft (se4ai2526-uniba)
- Model type: `RandomForestClassifier` wrapped in `MultiOutputClassifier` for multi‑label outputs
- Branch: `Milestone-4`

### Intended use

- Suitable for research and benchmarking on multi‑label skill prediction for GitHub PRs/issues. Not intended for automated high‑stakes decisions or for profiling individuals without further validation.

### Training data and preprocessing

- Dataset: processed SkillScope dataset (NLBSE/SkillCompetition) as prepared for this project.
- Features: TF‑IDF (unigrams and bigrams), capped at `MAX_TFIDF_FEATURES=5000`.
- Feature and label files are referenced via `get_feature_paths(feature_type='tfidf', use_cleaned=True)` in `config.py`.
### Evaluation

- Reported metrics: micro‑precision, micro‑recall and micro‑F1 on a held‑out test split.
- Protocol: 80/20 multilabel‑stratified split; hyperparameters selected via 5‑fold cross‑validation optimizing `f1_micro`.
- MLflow run: `random_forest_tfidf_gridsearch` (see `hopcroft_skill_classification_tool_competition/config.py`).
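The selection protocol above can be sketched as follows. This is a minimal illustration, not the training script: the data is random toy data, and the parameter grid echoes the values recorded in the front matter (with a smaller `n_estimators` so the sketch runs quickly).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 8))               # stand-in feature matrix
Y = rng.integers(0, 2, (40, 3))       # stand-in multi-label indicator targets

# MultiOutputClassifier exposes the inner estimator's params as "estimator__<name>",
# which is why the logged params carry that prefix.
grid = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid={
        "estimator__n_estimators": [50],    # the real search used larger values, e.g. 200
        "estimator__max_depth": [10],
        "estimator__min_samples_split": [2],
    },
    scoring="f1_micro",  # micro-F1, as in the card's protocol
    cv=5,
)
grid.fit(X, Y)
print(grid.best_params_["estimator__max_depth"])  # -> 10
```

Note that with multilabel indicator targets, `GridSearchCV` falls back to plain `KFold`; the project's 80/20 split used multilabel stratification separately.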
### Limitations and recommendations

- Trained on Java repositories; generalization to other languages is not guaranteed.
- Rare labels suffer under label imbalance; apply per‑label decision thresholds or further sampling strategies if required.
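Per‑label thresholding, as recommended above, can look like the following sketch. The probability matrix and threshold values here are invented for illustration; in practice the probabilities would come from the per‑label `predict_proba` outputs and the thresholds from a validation sweep.

```python
import numpy as np

# Illustrative positive-class probabilities for 4 samples x 3 labels,
# e.g. stacked from each per-label predict_proba of a MultiOutputClassifier.
proba = np.array([
    [0.80, 0.10, 0.55],
    [0.30, 0.70, 0.40],
    [0.65, 0.20, 0.90],
    [0.10, 0.45, 0.35],
])

# One threshold per label, tuned (e.g. for per-label F1) on validation data;
# rare labels often benefit from thresholds below the default 0.5.
thresholds = np.array([0.5, 0.4, 0.6])

# Broadcasting compares each column against its own threshold.
Y_pred = (proba >= thresholds).astype(int)
print(Y_pred)
```

Lowering a rare label's threshold trades precision for recall on that label only, leaving the other labels untouched.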
### Usage

- Artifact path: `models/random_forest_tfidf_gridsearch.pkl`.
- Example:

```python
import joblib

# Load the committed model artifact.
model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')
# X_tfidf must be produced by the same fitted TF-IDF vectorizer used at
# training time (e.g. the committed tfidf_vectorizer.pkl).
y = model.predict(X_tfidf)
```
## 2) random_forest_tfidf_gridsearch_smote

### Model details

- Name: `random_forest_tfidf_gridsearch_smote`
- Model type: `RandomForestClassifier` inside `MultiOutputClassifier`, trained with multi‑label oversampling

### Intended use

- Intended to improve recall on under‑represented labels by applying MLSMOTE (or a RandomOverSampler fallback) during training.

### Training and preprocessing

- Features: TF‑IDF (same configuration as the baseline).
- Oversampling: a local MLSMOTE implementation when available; otherwise `RandomOverSampler`. Oversampling metadata (method and synthetic sample counts) is logged to MLflow.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py` (action `smote`).
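The fallback behaviour described above can be sketched as follows. This is purely illustrative: the project's MLSMOTE lives in its own module, and the fallback shown here is a naive random duplication of samples carrying rare labels (imblearn's `RandomOverSampler` itself targets single‑label data, so a multi‑label fallback has to work along these lines).

```python
import numpy as np

def oversample_rare_labels(X, Y, min_count=3, seed=0):
    """Naive multi-label fallback: duplicate random samples that carry a
    rare label until each label reaches at least min_count positives."""
    rng = np.random.default_rng(seed)
    X_parts, Y_parts = [X], [Y]
    for label in range(Y.shape[1]):
        idx = np.flatnonzero(Y[:, label])   # samples positive for this label
        deficit = min_count - len(idx)
        if len(idx) > 0 and deficit > 0:
            picks = rng.choice(idx, size=deficit, replace=True)
            X_parts.append(X[picks])
            Y_parts.append(Y[picks])
    return np.vstack(X_parts), np.vstack(Y_parts)

# Toy data: label 1 appears only once and gets duplicated up to min_count.
X = np.arange(12, dtype=float).reshape(6, 2)
Y = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 0], [1, 0]])
X_os, Y_os = oversample_rare_labels(X, Y, min_count=3)
print(Y_os.sum(axis=0))  # label 1 now has >= 3 positives
```

Unlike MLSMOTE, which synthesizes new feature vectors from nearest neighbours, duplication only reweights existing samples; the card's "synthetic sample counts" logging applies to the MLSMOTE path.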
### Evaluation

- MLflow run: `random_forest_tfidf_gridsearch_smote`.

### Limitations and recommendations

- Synthetic samples may introduce distributional artifacts; validate synthetic examples and per‑label metrics before deployment.

### Usage

- Artifact path: `models/random_forest_tfidf_gridsearch_smote.pkl`.
## 3) random_forest_embedding_gridsearch

### Model details

- Name: `random_forest_embedding_gridsearch`
- Features: sentence embeddings produced by `all-MiniLM-L6-v2` (see `config.EMBEDDING_MODEL_NAME`).

### Intended use

- Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.

### Training and preprocessing

- Embeddings are generated and stored via `get_feature_paths(feature_type='embedding', use_cleaned=True)`.
- Training script: `hopcroft_skill_classification_tool_competition/modeling/train.py`.
### Evaluation

- MLflow run: `random_forest_embedding_gridsearch`.

### Limitations and recommendations

- Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.
### Usage

- Artifact path: `models/random_forest_embedding_gridsearch.pkl`.
- Example:

```python
import joblib

model = joblib.load('models/random_forest_embedding_gridsearch.pkl')
# X_embeddings: sentence-embedding matrix produced with the same
# all-MiniLM-L6-v2 encoder used at training time.
y = model.predict(X_embeddings)
```
## 4) random_forest_embedding_gridsearch_smote

### Model details

- Name: `random_forest_embedding_gridsearch_smote`
- Combines embedding features with multi‑label oversampling to address rare labels.

### Training and evaluation

- Oversampling: MLSMOTE preferred; `RandomOverSampler` fallback when MLSMOTE is unavailable.
- MLflow run: `random_forest_embedding_gridsearch_smote`.

### Limitations and recommendations

- Review synthetic examples and re‑evaluate on target data prior to deployment.

### Usage

- Artifact path: `models/random_forest_embedding_gridsearch_smote.pkl`.
## Publishing guidance for Hugging Face Hub

- The YAML front matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:
  - `README.md` (this file)
  - model artifact(s) (`*.pkl`)
  - vectorizer(s) and label map (e.g. `tfidf_vectorizer.pkl`, `label_names.pkl`)
  - a minimal inference example or notebook
## Evaluation Data and Protocol

- Evaluation split: an 80/20 multilabel‑stratified train/test split was used for final evaluation.
- Cross‑validation: hyperparameters were selected via 5‑fold cross‑validation optimizing `f1_micro`.
- Test metrics reported: micro precision, micro recall and micro F1 (recorded in the YAML `model-index` for each model).
## Quantitative Analyses

- Reported per‑model results: micro‑precision, micro‑recall and micro‑F1 on the held‑out test split.
- Where available, `cv_best_f1_micro` is the best cross‑validation `f1_micro` recorded during training; when no CV value was present in tracking, the test F1 is used as a proxy and noted in this README.
- Comparability: TF‑IDF and embedding models are evaluated on the same held‑out splits (only the features differ), so the reported metrics are comparable for broad benchmarking but not sufficient for per‑label fairness analyses.
## How Metrics Were Computed

- Metrics were computed with scikit‑learn's `precision_score`, `recall_score` and `f1_score` using `average='micro'` and `zero_division=0` on the held‑out test labels and model predictions.
- The test feature and label files are available under `data/processed/tfidf/` and `data/processed/embedding/` (paths referenced from `hopcroft_skill_classification_tool_competition.config.get_feature_paths`).
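To make the micro averaging concrete, the sketch below shows what `average='micro'` computes on toy multi‑label data: true/false positives are pooled across every (sample, label) cell before computing a single precision, recall and F1. The label matrices are invented for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy multi-label ground truth and predictions (3 samples x 3 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

p = precision_score(y_true, y_pred, average='micro', zero_division=0)
r = recall_score(y_true, y_pred, average='micro', zero_division=0)
f1 = f1_score(y_true, y_pred, average='micro', zero_division=0)

# Pooled over all cells: TP = 3, FP = 0, FN = 2,
# so precision = 1.0, recall = 0.6, F1 = 0.75.
print(p, r, f1)
```

`zero_division=0` makes a label with no predicted (or no true) positives contribute 0 instead of raising a warning-laden NaN, which matters for rare labels.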
## Ethical Considerations and Caveats

- The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
- Label imbalance is present; oversampling (MLSMOTE, with a RandomOverSampler fallback) was used in two variants to improve recall on rare labels. Inspect per‑label metrics before deploying.
- These models and this README are intended for research and benchmarking; they are not validated for safety‑critical or high‑stakes automated decision‑making.