language: en
license: mit
tags:
  - multi-label-classification
  - tfidf
  - embeddings
  - random-forest
  - oversampling
  - mlsmote
  - software-engineering
datasets:
  - NLBSE/SkillCompetition
model-index:
  - name: random_forest_tfidf_gridsearch
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.595038375202279
          test_precision_micro: 0.690371373744215
          test_recall_micro: 0.5287455692919513
          test_f1_micro: 0.5988446098110252
        params:
          estimator__max_depth: '10'
          estimator__min_samples_split: '2'
          estimator__n_estimators: '200'
          feature_type: tfidf
          model_type: RandomForest + MultiOutput
          use_cleaned: 'True'
          oversampling: 'False'
        dvc:
          path: random_forest_tfidf_gridsearch.pkl
  - name: random_forest_tfidf_gridsearch_smote
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.59092598557871
          test_precision_micro: 0.6923300238053766
          test_recall_micro: 0.5154318319356791
          test_f1_micro: 0.59092598557871
        params:
          feature_type: tfidf
          oversampling: MLSMOTE (RandomOverSampler fallback)
        dvc:
          path: random_forest_tfidf_gridsearch_smote.pkl
  - name: random_forest_embedding_gridsearch
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.6012826418169578
          test_precision_micro: 0.703060266254212
          test_recall_micro: 0.5252460640075934
          test_f1_micro: 0.6012826418169578
        params:
          feature_type: embedding
          oversampling: 'False'
        dvc:
          path: random_forest_embedding_gridsearch.pkl
  - name: random_forest_embedding_gridsearch_smote
    results:
      - status: success
        metrics:
          cv_best_f1_micro: 0.5962084744755453
          test_precision_micro: 0.7031004709576139
          test_recall_micro: 0.5175288364319172
          test_f1_micro: 0.5962084744755453
        params:
          feature_type: embedding
          oversampling: MLSMOTE (RandomOverSampler fallback)
        dvc:
          path: random_forest_embedding_gridsearch_smote.pkl

Model cards for committed models

Overview

  • This file documents four trained model artifacts available in the repository: two TF‑IDF based Random Forest models (baseline and with oversampling) and two embedding‑based Random Forest models (baseline and with oversampling).
  • For dataset provenance and preprocessing details see data/README.md.
  1. random_forest_tfidf_gridsearch

Model details

  • Name: random_forest_tfidf_gridsearch
  • Organization: Hopcroft (se4ai2526-uniba)
  • Model type: RandomForestClassifier wrapped in MultiOutputClassifier for multi-label outputs
  • Branch: Milestone-4

Intended use

  • Suitable for research and benchmarking on multi-label skill prediction for GitHub PRs/issues. Not intended for automated high‑stakes decisions or profiling individuals without further validation.

Training data and preprocessing

  • Dataset: Processed SkillScope Dataset (NLBSE/SkillCompetition) as prepared for this project.
  • Features: TF‑IDF (unigrams and bigrams), up to MAX_TFIDF_FEATURES=5000.
  • Feature and label files are referenced via get_feature_paths(feature_type='tfidf', use_cleaned=True) in config.py.
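The feature configuration above can be approximated with scikit-learn's TfidfVectorizer. This is a minimal sketch, assuming the project uses scikit-learn for vectorization; the example documents are invented and the remaining vectorizer settings are library defaults, not values confirmed by the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_TFIDF_FEATURES = 5000  # value quoted in this card

# Invented example texts standing in for cleaned PR/issue text.
docs = [
    "Fix null pointer exception in parser",
    "Add unit tests for the HTTP client",
    "Refactor database connection pool",
]

# ngram_range=(1, 2) keeps unigrams and bigrams; max_features caps the vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=MAX_TFIDF_FEATURES)
X_tfidf = vectorizer.fit_transform(docs)

print(X_tfidf.shape)
print("null pointer" in vectorizer.vocabulary_)
```

The fitted vectorizer has to be reused at inference time so that new text maps onto the same columns the model was trained on.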

Evaluation

  • Reported metrics include micro‑precision, micro‑recall and micro‑F1 on a held‑out test split.
  • Protocol: 80/20 multilabel‑stratified split; hyperparameters selected via 5‑fold cross‑validation optimizing f1_micro.
  • MLflow run: random_forest_tfidf_gridsearch (see hopcroft_skill_classification_tool_competition/config.py).
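The protocol above can be sketched with scikit-learn as follows. The data is synthetic, the grid values are illustrative, and a plain train_test_split stands in for the multilabel-stratified split, so this is an approximation under those assumptions and not the project's training script. Only the parameter names (estimator__n_estimators, estimator__max_depth, estimator__min_samples_split) come from the model-index.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data (assumption): 120 samples, 10 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
Y = (X[:, :3] + 0.5 * rng.normal(size=(120, 3)) > 0).astype(int)

# Plain 80/20 split; the card describes a multilabel-stratified split instead.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# 5-fold grid search over the wrapped forest, optimizing micro-F1.
search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid={
        "estimator__n_estimators": [10, 20],   # illustrative values
        "estimator__max_depth": [5, 10],
        "estimator__min_samples_split": [2],
    },
    scoring="f1_micro",
    cv=5,
)
search.fit(X_train, Y_train)

# Final score on the held-out 20%.
test_f1 = f1_score(Y_test, search.predict(X_test), average="micro", zero_division=0)
print(search.best_params_, round(search.best_score_, 3), round(test_f1, 3))
```

Here search.best_score_ plays the role of cv_best_f1_micro and test_f1 the role of test_f1_micro.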

Limitations and recommendations

  • Trained on Java repositories; generalization to other languages is not ensured.
  • Label imbalance affects rare labels; apply per‑label thresholds or further sampling strategies if required.
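Per‑label thresholds could be applied along these lines. This sketch assumes a matrix of positive‑class probabilities with one column per label is already available; MultiOutputClassifier.predict_proba returns one array per label, so those columns would need to be stacked first. The helper name and the threshold values are hypothetical.

```python
import numpy as np

def apply_label_thresholds(proba, thresholds):
    """Binarize an (n_samples, n_labels) probability matrix column by column."""
    proba = np.asarray(proba, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    if proba.ndim != 2 or proba.shape[1] != thresholds.shape[0]:
        raise ValueError("expected one threshold per label column")
    # Broadcasting compares every column with its own threshold.
    return (proba >= thresholds).astype(int)

# Invented probabilities for two samples and three labels.
proba = np.array([[0.60, 0.20, 0.35],
                  [0.40, 0.55, 0.10]])
thresholds = np.array([0.5, 0.5, 0.3])  # lower cut-off for the rare third label

predictions = apply_label_thresholds(proba, thresholds)
print(predictions)
```

Lowering the threshold for a rare label trades precision for recall on that label only, which is why the thresholds are best tuned on validation data rather than on the test split.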

Usage

  • Artifact path: models/random_forest_tfidf_gridsearch.pkl.
  • Example:
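    import joblib

    # Load the fitted MultiOutputClassifier and predict a binary label matrix.
    # X_tfidf must come from the same TF-IDF vectorizer that was used in training.
    model = joblib.load('models/random_forest_tfidf_gridsearch.pkl')
    y = model.predict(X_tfidf)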
    
  2. random_forest_tfidf_gridsearch_smote

Model details

  • Name: random_forest_tfidf_gridsearch_smote
  • Model type: RandomForestClassifier inside MultiOutputClassifier trained with multi‑label oversampling

Intended use

  • Intended to improve recall for under‑represented labels by applying MLSMOTE (or RandomOverSampler fallback) during training.

Training and preprocessing

  • Features: TF‑IDF (same configuration as the baseline).
  • Oversampling: local MLSMOTE implementation when available; otherwise RandomOverSampler. Oversampling metadata (method and synthetic sample counts) are logged to MLflow.
  • Training script: hopcroft_skill_classification_tool_competition/modeling/train.py (action smote).
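To illustrate the idea behind MLSMOTE, the sketch below follows the commonly described recipe: find labels whose imbalance ratio exceeds the mean, interpolate between a minority sample and one of its nearest neighbours, and assign labels that appear in at least half of the sample and its neighbours. It is a simplified illustration written for this card, not the repository's local implementation, and the function name and toy data are hypothetical.

```python
import numpy as np

def mlsmote_sketch(X, Y, k=5, seed=0):
    """Return (X, Y) extended with interpolated samples for minority labels."""
    rng = np.random.default_rng(seed)
    counts = Y.sum(axis=0)
    present = counts > 0
    irlbl = np.zeros(Y.shape[1])
    irlbl[present] = counts.max() / counts[present]   # imbalance ratio per label
    mean_ir = irlbl[present].mean()

    new_X, new_Y = [], []
    for label in np.flatnonzero(irlbl > mean_ir):      # minority labels only
        bag = np.flatnonzero(Y[:, label] == 1)         # samples carrying this label
        if len(bag) < 2:
            continue                                   # nothing to interpolate with
        for i in bag:
            others = bag[bag != i]
            dist = np.linalg.norm(X[others] - X[i], axis=1)
            neighbours = others[np.argsort(dist)[:k]]  # k nearest within the bag
            ref = rng.choice(neighbours)
            gap = rng.random()
            # Synthetic features lie on the segment between sample i and ref.
            new_X.append(X[i] + gap * (X[ref] - X[i]))
            # Keep labels present in at least half of {sample i} + neighbours.
            votes = Y[i] + Y[neighbours].sum(axis=0)
            new_Y.append((2 * votes >= len(neighbours) + 1).astype(int))
    if not new_X:
        return X, Y
    return np.vstack([X, np.array(new_X)]), np.vstack([Y, np.array(new_Y)])

# Toy data: label 0 is frequent (20 positives), label 1 is rare (4 positives).
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))
Y = np.zeros((30, 2), dtype=int)
Y[:20, 0] = 1
Y[18:22, 1] = 1

X_res, Y_res = mlsmote_sketch(X, Y, k=3)
print(X.shape, "->", X_res.shape, "rare-label positives:", int(Y_res[:, 1].sum()))
```

Oversampling of this kind should be applied to the training portion only, so that synthetic samples never reach the held‑out test split.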

Evaluation

  • MLflow run: random_forest_tfidf_gridsearch_smote.

Limitations and recommendations

  • Synthetic samples may introduce distributional artifacts; validate synthetic examples and per‑label metrics before deployment.
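Per‑label metrics can be inspected with scikit-learn's precision_recall_fscore_support using average=None. The label names and the small label matrices below are invented for illustration; real label names would come from the project's label map.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

label_names = ["label_a", "label_b", "label_c"]  # hypothetical names

# Invented ground truth and predictions: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# average=None returns one value per label instead of a pooled score.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

for name, p, r, f, s in zip(label_names, precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

A label with low support and low recall is the typical candidate for a closer look at its synthetic training examples.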

Usage

  • Artifact path: models/random_forest_tfidf_gridsearch_smote.pkl.
  3. random_forest_embedding_gridsearch

Model details

  • Name: random_forest_embedding_gridsearch
  • Features: sentence embeddings produced by all-MiniLM-L6-v2 (see config.EMBEDDING_MODEL_NAME).

Intended use

  • Uses semantic embeddings to capture contextual information from PR text; suitable for research and prototyping.

Training and preprocessing

  • Embeddings generated and stored via get_feature_paths(feature_type='embedding', use_cleaned=True).
  • Training script: see hopcroft_skill_classification_tool_competition/modeling/train.py.
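Embeddings of this kind are typically produced with the sentence-transformers package. The helper below is a sketch under that assumption; the function name is hypothetical, and whether the project's pipeline batches, normalizes, or caches embeddings differently is not stated here.

```python
def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode an iterable of strings into a 2-D NumPy array of sentence embeddings."""
    # Imported lazily so this module loads even without the optional dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads weights on first use
    return model.encode(list(texts), convert_to_numpy=True)
```

For all-MiniLM-L6-v2 each text maps to a 384‑dimensional vector, so a call such as embed_texts(["Fix parser crash"]) would return an array of shape (1, 384).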

Evaluation

  • MLflow run: random_forest_embedding_gridsearch.

Limitations and recommendations

  • Embeddings encode dataset biases; verify performance when transferring to other repositories or languages.

Usage

  • Artifact path: models/random_forest_embedding_gridsearch.pkl.
  • Example:
    import joblib
    y = joblib.load('models/random_forest_embedding_gridsearch.pkl').predict(X_embeddings)
    
  4. random_forest_embedding_gridsearch_smote

Model details

  • Name: random_forest_embedding_gridsearch_smote
  • Combines embedding features with multi‑label oversampling to address rare labels.

Training and evaluation

  • Oversampling: MLSMOTE preferred; RandomOverSampler fallback if MLSMOTE is unavailable.
  • MLflow run: random_forest_embedding_gridsearch_smote.
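The "preferred method with fallback" behaviour can be sketched as a small wrapper that also returns the method name and the number of added samples. Everything here is hypothetical: the wrapper, the two stand‑in callables, and the choice of ImportError as the signal that the preferred method is unavailable.

```python
def oversample_with_fallback(X, Y, primary, fallback):
    """Run `primary`; if it is missing or raises ImportError, run `fallback`."""
    n_before = len(X)
    chosen = fallback
    result = None
    if primary is not None:
        try:
            result = primary(X, Y)
            chosen = primary
        except ImportError:
            result = None  # preferred method is unavailable
    if result is None:
        result = fallback(X, Y)
        chosen = fallback
    X_res, Y_res = result
    # Metadata of the kind that could be recorded alongside a training run.
    meta = {"method": chosen.__name__, "n_synthetic": len(X_res) - n_before}
    return X_res, Y_res, meta

# Stand-in callables for illustration only.
def unavailable_mlsmote(X, Y):
    raise ImportError("MLSMOTE implementation not found")

def duplicate_rows(X, Y):
    return X + X, Y + Y  # naive duplication of every row

X = [[0.1, 0.2], [0.3, 0.4]]
Y = [[1, 0], [0, 1]]
X_res, Y_res, meta = oversample_with_fallback(X, Y, unavailable_mlsmote, duplicate_rows)
print(meta)
```

Returning the metadata from the wrapper keeps the record of which method actually ran next to the data it produced.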

Limitations and recommendations

  • Review synthetic examples and re‑evaluate on target data prior to deployment.

Usage

  • Artifact path: models/random_forest_embedding_gridsearch_smote.pkl.

Publishing guidance for Hugging Face Hub

  • The YAML front‑matter enables rendering on the Hugging Face Hub. Recommended repository contents for publishing:

    • README.md (this file)
    • model artifact(s) (*.pkl)
    • vectorizer(s) and label map (e.g. tfidf_vectorizer.pkl, label_names.pkl)
    • a minimal inference example or notebook
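A minimal inference example for such a bundle might look like the sketch below. It trains a tiny throwaway model so that it runs on its own; the texts, label names, and the model.pkl file name are invented, and the other file names follow the examples in the list above.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier

# Toy training data (invented) so the example is self-contained.
texts = ["fix parser crash", "add http client tests", "parser unit tests", "http retry fix"]
label_names = ["parsing", "networking"]  # hypothetical labels
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]

vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(texts)
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=20, random_state=0))
model.fit(vectorizer.transform(texts), Y)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Persist the three pieces a consumer needs: model, vectorizer, label map.
    joblib.dump(model, tmp / "model.pkl")
    joblib.dump(vectorizer, tmp / "tfidf_vectorizer.pkl")
    joblib.dump(label_names, tmp / "label_names.pkl")

    # Inference side: load everything back and predict for one new text.
    loaded_model = joblib.load(tmp / "model.pkl")
    loaded_vec = joblib.load(tmp / "tfidf_vectorizer.pkl")
    loaded_names = joblib.load(tmp / "label_names.pkl")
    pred = loaded_model.predict(loaded_vec.transform(["parser crash on startup"]))[0]

# Map the binary prediction row back to label names.
predicted_labels = [name for name, flag in zip(loaded_names, pred) if flag]
print(predicted_labels)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.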

Evaluation Data and Protocol

  • Evaluation split: an 80/20 multilabel‑stratified train/test split was used for final evaluation.
  • Cross-validation: hyperparameters were selected via 5‑fold cross‑validation optimizing f1_micro.
  • Test metrics reported: micro precision, micro recall, micro F1 (reported in the YAML model-index for each model).

Quantitative Analyses

  • Reported unitary results: micro‑precision, micro‑recall and micro‑F1 on the held‑out test split for each model.
  • Where available, cv_best_f1_micro is the best cross‑validation f1_micro recorded during training. When no CV value was present in tracking, the test F1 is used as a proxy, which is why cv_best_f1_micro and test_f1_micro are identical for three of the four model-index entries (all except random_forest_tfidf_gridsearch).
  • Notes on comparability: TF‑IDF and embedding models are evaluated on the same held‑out splits (features differ); reported metrics are comparable for broad benchmarking but not for per‑label fairness analyses.
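Because micro‑F1 is the harmonic mean of micro‑precision and micro‑recall, the reported numbers can be sanity‑checked with a few lines of arithmetic. The values below are copied from the model-index; the helper name is hypothetical.

```python
# (precision_micro, recall_micro, reported f1_micro) from the model-index.
reported = {
    "random_forest_tfidf_gridsearch":
        (0.690371373744215, 0.5287455692919513, 0.5988446098110252),
    "random_forest_tfidf_gridsearch_smote":
        (0.6923300238053766, 0.5154318319356791, 0.59092598557871),
    "random_forest_embedding_gridsearch":
        (0.703060266254212, 0.5252460640075934, 0.6012826418169578),
    "random_forest_embedding_gridsearch_smote":
        (0.7031004709576139, 0.5175288364319172, 0.5962084744755453),
}

def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed={harmonic_f1(p, r):.6f} reported={f1:.6f}")
```

All four recomputed values agree with the reported test_f1_micro to within rounding.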

How Metrics Were Computed

  • Metrics were computed using scikit‑learn's precision_score, recall_score, and f1_score with average='micro' and zero_division=0 on the held‑out test labels and model predictions.
  • Test feature and label files used are available under data/processed/tfidf/ and data/processed/embedding/ (paths referenced from hopcroft_skill_classification_tool_competition.config.get_feature_paths).
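The metric calls named above behave as follows on a small invented example. The label matrices are illustrative only; the manual counts are included to show what micro averaging pools.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Invented ground truth and predictions: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

kwargs = {"average": "micro", "zero_division": 0}
precision = precision_score(y_true, y_pred, **kwargs)
recall = recall_score(y_true, y_pred, **kwargs)
f1 = f1_score(y_true, y_pred, **kwargs)

# Micro averaging pools counts over every (sample, label) cell.
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

print(f"sklearn: P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
print(f"manual:  P={tp / (tp + fp):.4f} R={tp / (tp + fn):.4f} "
      f"F1={2 * tp / (2 * tp + fp + fn):.4f}")
```

Because counts are pooled across labels, frequent labels dominate micro‑averaged scores, which is one reason rare labels need separate inspection.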

Ethical Considerations and Caveats

  • The dataset contains examples from Java repositories; model generalization to other languages or domains is not guaranteed.
  • Label imbalance is present; oversampling (MLSMOTE or RandomOverSampler fallback) was used in two variants to improve recall for rare labels. Inspect per‑label metrics before deploying.
  • The models and README are intended for research and benchmarking. They are not validated for safety‑critical or high‑stakes automated decision-making.