| --- |
| |
| |
| {} |
| --- |
| |
| # Model Card for CIS5190FinalProj/RandomForest |
|
|
| <!-- Provide a quick summary of what the model is/does. --> |
|
|
| This model card was generated from [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1). |
|
|
| ## Model Details |
| This model classifies news headlines as either NBC or Fox News. |
|
|
| ### Model Description |
|
|
| <!-- Provide a longer summary of what this model is. --> |
|
|
|
|
|
|
| - **Developed by:** Jack Bader, Kaiyuan Wang, Pairan Xu |
| - **Task:** Binary classification (NBC News vs. Fox News) |
| - **Preprocessing:** TF-IDF vectorization applied to the text data |
|   - `stop_words = "english"` |
|   - `max_features = 1000` |
| - **Model type:** Random Forest |
| - **Framework:** Scikit-learn |
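
The setup listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' training script: the headlines, the label strings, and the Random Forest hyperparameters (`n_estimators`, `random_state`) are assumptions, and only the two `TfidfVectorizer` settings come from this card.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy headlines and labels, invented for illustration only.
headlines = [
    "Senate passes budget bill after long debate",
    "Storm brings heavy rain to the coast",
    "Markets rally as inflation cools",
    "Local team wins championship game",
]
labels = ["NBC", "FoxNews", "NBC", "FoxNews"]

# TF-IDF settings taken from the card: English stop words, at most 1000 features.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(headlines)

# Random Forest with assumed hyperparameters (the card does not list them).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# New headlines must pass through the same fitted vectorizer before prediction.
new_X = vectorizer.transform(["Budget debate continues in the Senate"])
prediction = clf.predict(new_X)
print(prediction)
```

Fitting the vectorizer once and reusing it at prediction time keeps the feature columns aligned with what the classifier saw during training, which is why the vectorizer is saved alongside the model.
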
| #### Metrics |
|
|
| <!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
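| - Accuracy score: the fraction of test headlines whose predicted source matches the true label. The evaluation script also prints per-class precision, recall, and F1 via `classification_report`. |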
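
As a small illustration of the metric, the sketch below computes accuracy with scikit-learn's `accuracy_score` and checks it against a manual count. The label lists are hypothetical and are not results from this model.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, invented for illustration only.
y_true = ["NBC", "FoxNews", "NBC", "FoxNews", "NBC"]
y_hat = ["NBC", "FoxNews", "FoxNews", "FoxNews", "NBC"]

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_hat)

# The same value computed by hand: 4 matches out of 5 predictions.
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

print(accuracy)
```
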
|
|
| ### Model Evaluation |
| ```python |
| import pandas as pd |
| import joblib |
| from huggingface_hub import hf_hub_download |
| from sklearn.feature_extraction.text import TfidfVectorizer |
| from sklearn.metrics import classification_report |
| |
| # Mount Google Drive (this script assumes a Google Colab notebook) |
| from google.colab import drive |
| drive.mount('/content/drive') |
| |
| # Load test set |
| test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252") |
| |
| # Log in with a Hugging Face token (the leading "!" runs a shell command, notebook only) |
| !huggingface-cli login |
| |
| # Download the model file; hf_hub_download returns its local path |
| model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="best_rf_model.pkl") |
| |
| # Download the vectorizer file; hf_hub_download returns its local path |
| vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="tfidf_vectorizer.pkl") |
| |
| # Load the trained model |
| pipeline = joblib.load(model_path) |
| |
| # Load the fitted vectorizer |
| tfidf_vectorizer = joblib.load(vectorizer_path) |
| |
| # Extract the headlines from the test set |
| X_test = test_df['title'] |
| |
| # Transform the headlines into TF-IDF features with the fitted vectorizer |
| X_test_transformed = tfidf_vectorizer.transform(X_test) |
| |
| # Make predictions using the pipeline |
| y_pred = pipeline.predict(X_test_transformed) |
| |
| # Extract the 'label' column as the target |
| y_test = test_df['label'] |
| |
| # Print classification report |
| print(classification_report(y_test, y_pred)) |