Spaces:

tkbarb10
/

ads505-app

Sleeping

ads505-app / tabs /predictive_model_tabs /pred_model_one.py

Taylor Kirk

Fresh deployment after moving datasets to hf datahub

5d4981c 4 months ago

11.8 kB

	import streamlit as st

	# include expected schema for uploaded data

	def render():

	st.markdown("""
	<style>
	@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600&display=swap');
	.custom-section {
	font-family: 'Inter', sans-serif;
	font-size: 16px;
	line-height: 1.4;
	}
	.custom-section strong {
	font-weight: 600;
	color: #2E86AB;
	display: inline-block;
	margin-bottom: 8px;
	}
	.custom-section ol,
	.custom-section ul {
	margin-top: 0 !important;
	margin-bottom: 10px !important;
	padding-top: !important;
	}
	</style>
	""", unsafe_allow_html=True)

	st.markdown("## 🤖 About Predictive Modeling")

	st.markdown("""
	<div class="custom-section">
	<strong>Project Goal</strong><br>
	Our objective was to build a model that predicts which customer reviews will be found helpful by others. By identifying helpful negative reviews, we can surface potential product or service issues worth investigating. Similarly, elevating helpful positive reviews highlights what customers value most.
	<br><br>

	<strong>Secondary Benefit</strong><br>
	Understanding the characteristics of helpful reviews enables reviewers to improve the quality of their feedback, making it more valuable for both businesses and consumers.

	<strong>What You'll Find Below</strong>
	<ul>
	<li><strong>Data Schema</strong> — Required format and fields needed to run predictions</li>
	<li><strong>Modeling Process</strong> — Step-by-step explanation of how the model works</li>
	<li><strong>Interactive Demo (Tab 2)</strong> — Hands-on walkthrough before applying the model to your own data on the User Page</li>
	</ul>
	</div>
	""", unsafe_allow_html=True)

	#st.divider()
	st.markdown("""
	<hr style='
	border: none;
	height: 2px;
	background: linear-gradient(to right, #2E86AB, #87ceeb, #2E86AB);
	margin: 20px 0;
	'>
	""", unsafe_allow_html=True)

	st.markdown("## Data Schema")

	st.markdown("""

	\| Column Name \| Data Type \| Description \|
	\|------------\|-----------\|-------------\|
	\| `lemma_title` \| string \| Lemmatized version of the review title \|
	\| `lemma_text` \| string \| Lemmatized version of the review text \|
	\| `images` \| boolen \| Binary indicator if the review includes an image or not \|
	\| `Review Length` \| integer \| Character count of the review text \|
	\| `Title Length` \| integer \| Character count of the review title \|

	Read more about lemmatization and the process used in our models [here](https://www.geeksforgeeks.org/python/python-lemmatization-with-nltk/)
	""")

	st.markdown("""
	<hr style='
	border: none;
	height: 2px;
	background: linear-gradient(to right, #2E86AB, #87ceeb, #2E86AB);
	margin: 20px 0;
	'>
	""", unsafe_allow_html=True)

	st.markdown("## Model Components")

	st.markdown("""
	<div class="custom-section">
	Our model uses a <strong>four-stage pipeline</strong> to predict review helpfulness. We trained on 60,000+ reviews
	and achieved <strong>71.7% accuracy</strong> and <strong>63.6% F1-macro score</strong>. The model outputs probability scores,
	allowing you to rank and prioritize reviews that are most likely to be found helpful by customers.
	</div><br>
	""", unsafe_allow_html=True)

	st.markdown("### The Pipeline")

	st.markdown("""
	<div class="custom-section">

	<strong>1. TF-IDF Vectorization</strong> — Extracting meaningful text patterns<br>
	We transform `lemma_title` and `lemma_text` into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).
	This approach identifies words and phrases that distinguish helpful reviews from unhelpful ones by balancing how often a term
	appears in a specific review against how common it is across all reviews. Words that appear frequently in helpful reviews but
	rarely elsewhere receive higher weights, making them strong predictive signals.
	<br>

	<strong>Why TF-IDF?</strong> It automatically downweights generic words while highlighting distinctive language patterns.
	We use 1-2 word phrases (bigrams) to capture meaningful combinations like "works great" or "poor quality."
	<br><br>

	<strong>2. Standard Scaler</strong> — Normalizing review metrics<br>
	Review length and title length are scaled to have mean=0 and standard deviation=1. This prevents longer reviews from
	dominating the model simply due to scale differences.
	<br>

	<strong>Known limitation:</strong> We discovered that helpfulness has a non-linear relationship with length. Very short and
	very long reviews both tend to receive fewer helpful votes, with medium-length reviews performing best. Our linear scaling
	doesn't fully capture this relationship, suggesting polynomial features or binning could improve future iterations.
	<br><br>

	<strong>3. Truncated SVD</strong> — Dimensionality reduction for efficiency<br>
	After TF-IDF, our feature space explodes to 200,000+ dimensions (one for each unique word/phrase). We use Truncated Singular
	Value Decomposition to compress this down to <strong>800 components</strong> while retaining <strong>70% of the variance</strong>.
	This dramatically speeds up training while maintaining predictive power.
	<br>

	<strong>Why Truncated SVD over PCA?</strong> It works directly with sparse matrices (TF-IDF produces mostly zeros), making it
	far more memory-efficient. We tuned the component count by balancing F1-macro score against model complexity.
	<br><br>

	<strong>4. Stochastic Gradient Descent Classifier (SGDC)</strong> — The final predictor<br>
	We compared five models: Decision Trees, K-Nearest Neighbors, Linear SVM, XGBoost, and SGDC. <strong>SGDC emerged as the best
	overall performer,</strong>narrowly beating XGBoost on the gains curve (a metric measuring how well the model prioritizes truly
	helpful reviews at the top of its predictions).
	<br><br>

	<strong>Key tuning decisions:</strong>
	<ul>
	<li><strong>class_weight='balanced'</strong>: Our data is imbalanced (80% of reviews have zero helpful votes), so we weighted
	the minority class to prevent the model from simply predicting "not helpful" for everything</li>
	<li><strong>loss='modified_huber'</strong>: Provides probability estimates (needed for ranking) while being robust to outliers</li>
	<li><strong>early_stopping=True</strong>: Prevents overfitting by monitoring validation performance</li>
	</ul>

	<strong>Why SGDC over XGBoost?</strong> While XGBoost had slightly better raw accuracy (72% vs 71.7%), SGDC showed better
	generalization, faster training, and superior performance on the gains curve, meaning it does a better job surfacing the
	<em>most</em> helpful reviews, which is what matters for practical use.
	</div>
	""", unsafe_allow_html=True)

	st.markdown("### Model Performance & Insights")

	st.markdown("""
	<div class="custom-section">
	<strong>What makes a review helpful?</strong> Our analysis revealed three key patterns:
	<ul>
	<li><strong>Including an image</strong> significantly increases helpfulness</li>
	<li><strong>Medium-length reviews</strong> (not too short, not too long) perform best</li>
	<li><strong>Specific vocabulary</strong> varies by product category — suggesting category-specific models could further improve accuracy</li>
	</ul>

	<strong>Practical application:</strong> The model outputs probability scores (0-1) that allow you to rank your reviews.
	Focus on high-probability <strong>negative</strong> reviews to identify product issues early, and elevate high-probability
	<strong>positive</strong> reviews to guide purchasing decisions.
	</div>
	""", unsafe_allow_html=True)

	st.markdown("### Limitations & Future Improvements")

	st.markdown("""
	<div class="custom-section">
	<strong>Current limitations to be aware of:</strong>
	<ul>
	<li><strong>Category-agnostic training</strong> — The model was trained across all product categories. Category-specific models
	would likely improve accuracy since "helpful" looks different for electronics vs. beauty products</li>
	<li><strong>Low helpfulness threshold</strong> — We defined "helpful" as 1+ votes due to computational constraints. A higher
	threshold (e.g., 5+ votes) would be more meaningful but requires training on larger datasets</li>
	<li><strong>Non-linear length relationships</strong> — As mentioned above, polynomial features could better capture the
	sweet spot for review length</li>
	</ul>

	<strong>What we'd do with more resources:</strong> Train separate models per category, use a higher helpfulness threshold,
	experiment with transformer-based models (BERT, etc.), and incorporate temporal features (how quickly reviews receive votes).
	</div>
	""", unsafe_allow_html=True)

	st.markdown("""
	Below you'll find the specific hyperparameters tuned using [Optuna](https://optuna.readthedocs.io/en/stable/index.html),
	an automated hyperparameter optimization framework. Click each section to see the final parameter values and learn more
	about the methods used.
	""")

	pre, pred = st.columns(2)

	with pre.expander("Preprocessing Steps"):
	col1, col2, col3, col4 = st.columns(4)

	with col1.popover("TF-IDF Title"):
	st.write("Learn more about tf-idf [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)")
	st.code("""
	{'max_df': 0.95,
	'min_df': 1,
	'ngram_range': (1, 2),
	'stop_words': 'english',
	'sublinear_tf': True}
	""")

	with col2.popover("TF-IDF Text"):
	st.write("""The sci-kit native english stop words argument was used here for convenience, however there are [known issues](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words)
	so a future iteration of this project might find improvement is using a more robust selection of stop words and including ones that
	are custom to the specific domain being modeled""")
	st.code("""
	{'max_df': 0.9,
	'min_df': 2,
	'ngram_range': (1, 2),
	'stop_words': 'english',
	'sublinear_tf': True}
	""")

	with col3.popover("Standard Scaler"):
	st.write("Default settings for [Standard Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) were used to scale review length and title length")

	with col4.popover("Truncated SVD"):
	st.write("The only parameter changed was `n_components`. Value used was 800. [Truncated SVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)")

	with pred.expander("Predictive Model"):
	st.write("Model used was Stochastic Gradient Descent Classifier [(SGDC)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)")
	st.code("""
	{'alpha': 0.0002,
	'class_weight': 'balanced',
	'early_stopping': True,
	'eta0': 0.001,
	'l1_ratio': 0.9,
	'learning_rate': 'adaptive',
	'loss': 'modified_huber',
	'max_iter': 500,
	'n_iter_no_change': 8,
	'penalty': 'elasticnet',
	'validation_fraction': 0.15}
	""")
	st.write("Most important parameters were the loss function, `class_weight` and `early_stopping`. Every other parameter tuned lead to marginal improvements")