# Model Card for {{ model_id | default("Model ID", true) }}
This is a fine-tuned version of the RandomForestEntr_BAG_L1 model for classification. It was fine-tuned on EricCRX/books-tabular-dataset, a dataset of physical measurements of books, and is used here for binary classification of books as softcover or hardcover.
## Model Details

### Model Description
This model is the RandomForestEntr_BAG_L1 model produced by AutoGluon, trained with accuracy as the primary metric; multiclass accuracy and cross-entropy were also tracked. Note that the `_BAG_L1` suffix refers to a bagged ensemble at stack level 1 of AutoGluon's pipeline, not to L1 regularization.
- Developed by: Devin DeCosmo
- Model type: Binary Classifier
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: RandomForestEntr_BAG_L1
## Uses

This model classifies books as softcover or hardcover based on their physical measurements.
### Out-of-Scope Use

This model has only been validated on the book measurements it was trained on. With an expanded dataset, it could be retrained to classify other types of books or handle a wider range of data.
## Bias, Risks, and Limitations

This model is trained on a small dataset of 30 original books plus 300 augmented rows. Such a limited training set leaves the model liable to overfitting, and additional data is required to make it more robust.
### Recommendations

Because of the small dataset, this model should not be expected to generalize well; validate it on new measurements before relying on its predictions.
## How to Get Started with the Model

Use the code below to get started with the model. This code is from the 24-679 lecture on tabular datasets.
```python
import shutil
import zipfile

import autogluon.tabular
import huggingface_hub
import pandas

# MODEL_REPO_ID, download_dir, df_synth_test, and TARGET_COL are defined
# earlier in the lecture notebook this snippet comes from.

# Download the zipped native predictor directory
zip_local_path = huggingface_hub.hf_hub_download(
    repo_id=MODEL_REPO_ID,
    repo_type="model",
    filename="autogluon_predictor_dir.zip",
    local_dir=str(download_dir),
    local_dir_use_symlinks=False,
)

# Unzip to a folder
native_dir = download_dir / "predictor_dir"
if native_dir.exists():
    shutil.rmtree(native_dir)
native_dir.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(zip_local_path, "r") as zf:
    zf.extractall(str(native_dir))

# Load the native predictor
predictor_native = autogluon.tabular.TabularPredictor.load(str(native_dir))

# Inference on the synthetic test split
X_test = df_synth_test.drop(columns=[TARGET_COL])
y_true = df_synth_test[TARGET_COL].reset_index(drop=True)
y_pred = predictor_native.predict(X_test).reset_index(drop=True)

# Combine results
results = pandas.DataFrame({"y_true": y_true, "y_pred": y_pred})
display(results)
```
## Training Details

### Training Data
EricCRX/books-tabular-dataset
This is the training dataset. It consists of 30 original measurements, used for validation, along with 300 synthetic rows used for training.
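The augmentation procedure behind the 300 synthetic rows is not documented in this card. As an illustration only, the sketch below shows one common way 30 original rows could be expanded tenfold by jittering numeric measurements; the column names and noise scale are assumptions, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 original book measurements (hypothetical column names for illustration)
originals = pd.DataFrame({
    "height_cm": rng.uniform(15.0, 30.0, 30),
    "width_cm": rng.uniform(10.0, 22.0, 30),
    "thickness_cm": rng.uniform(0.5, 6.0, 30),
})

# Make 10 jittered copies of each original row -> 300 synthetic rows
scale = originals.std().to_numpy()
copies = [
    originals + rng.normal(0.0, 0.02, originals.shape) * scale
    for _ in range(10)
]
synthetic = pd.concat(copies, ignore_index=True)

print(len(originals), len(synthetic))  # 30 300
```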
### Training Procedure

This model was trained with AutoGluon's AutoML process using accuracy as the main metric. A time_limit of 300 seconds was set to bound training time, and the "best_quality" preset was used to improve results.
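The training setup described above can be sketched as an AutoGluon configuration. This is a hedged reconstruction, not the exact training script: `df_train` and the `cover_type` label column are assumptions based on this card.

```python
# Sketch of the AutoGluon training call described above; assumes autogluon is
# installed and df_train holds the 300 synthetic training rows.
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label="cover_type",      # hypothetical target column: 1 = hardcover, 0 = softcover
    eval_metric="accuracy",  # accuracy as the main metric
).fit(
    df_train,                # synthetic training rows
    time_limit=300,          # cap training at 300 seconds
    presets="best_quality",  # preset used to improve results
)
```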
#### Training Hyperparameters
- Training regime: {{ training_regime | default("[More Information Needed]", true)}}
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
maryzhang/hw1-24679-image-dataset: the testing data was the 'original' split, the 30 original samples in this set.
#### Factors

Evaluation considers whether each book is a hardcover (label 1) or a softcover (label 0).
#### Metrics

The test metric was accuracy, to maximize correct classifications. Training time was also considered to ensure the final models were not computationally infeasible.
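Given the `results` DataFrame built in the quick-start code (with `y_true` and `y_pred` columns), accuracy is simply the fraction of matching labels. A minimal sketch with made-up labels, not real model outputs:

```python
import pandas as pd

# Hypothetical labels for illustration (1 = hardcover, 0 = softcover)
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "y_pred": [1, 0, 1, 0, 0, 0, 1, 0, 1, 1],
})

# Accuracy = fraction of rows where prediction matches the true label
accuracy = (results["y_true"] == results["y_pred"]).mean()
print(f"accuracy = {accuracy:.2f}")  # 9 of 10 correct -> 0.90
```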
### Results

After training on the initial dataset, this model reached 97% accuracy in validation, with an individual prediction time of 0.12 seconds, making it both fast and accurate.

This validation score should not be taken as a measure of robustness. Given the small dataset, performance on outside measurements cannot be confirmed; expanding the dataset could reveal issues with, or improvements to, this model.
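Per-prediction latency figures like the 0.12 s reported above are typically obtained by timing repeated single-row calls. A sketch using a stand-in function (`predict_one` and its argument are placeholders, not the model's API):

```python
import time

def predict_one(row):
    # Stand-in for a single-row call to the real predictor
    return 1

# Average over repeated calls to smooth out timer noise
n = 100
start = time.perf_counter()
for _ in range(n):
    predict_one({"height_cm": 21.0})
per_prediction = (time.perf_counter() - start) / n
print(f"mean per-prediction time: {per_prediction:.6f} s")
```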
#### Summary

This model reached high accuracy on the current dataset, but that performance cannot be assumed to hold more broadly because the dataset was very small.