---
language: en
license: mit
tags:
- tabular-classification
- book-ratings
- goodreads
- random-forest
metrics:
- accuracy
- r2
- mae
---
Predictive Analytics for the Publishing Industry: The Goodreads Project
Executive Summary
Can the success of a book be predicted before it ever hits the shelves? This project explores that question by building a comprehensive machine learning pipeline. We transformed a raw dataset of over 50,000 books into a high-precision tool capable of predicting user ratings with a 0.85 R² score and classifying potential "Hits" with 89% accuracy.
The Engineering Pipeline
1. Data Cleaning & Preprocessing
Our raw data was not model-ready. Our first task was to ensure every row was trustworthy and every feature was statistically sound. We also removed low-utility features that consumed space and could bias our analysis and model.
Handling Missing Values & Consistency
- Intelligent Imputation: Rather than dropping rows with missing values, we used median imputation for skewed numerical features (such as `number_of_pages`) to preserve the dataset's statistical power.
- Schema Standardization: We standardized all column headers to `snake_case` and stripped whitespace to prevent programmatic errors.
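A minimal sketch of these two cleaning steps, using a made-up two-column frame (the real pipeline runs on the full Goodreads table):

```python
import pandas as pd

# Toy frame with messy headers and a missing page count
df = pd.DataFrame({' Number of Pages ': [320, None, 180],
                   'Average Rating': [4.1, 3.9, 4.3]})

# Schema standardization: strip whitespace, lowercase, snake_case
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Median imputation for the skewed page-count feature
df['number_of_pages'] = df['number_of_pages'].fillna(df['number_of_pages'].median())
```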
Outlier Detection & Treatment
- The Logarithmic Shift: `rating_count` exhibited a long-tail distribution. We applied a log transformation (`rating_count_log`) to normalize this scale, preventing high-popularity outliers from dominating the model's weight distribution.
- Impossible Values: We filtered out impossible entries (e.g., 0-page books) and extreme edge cases (10,000+ page box sets) to focus the model on the standard retail book market.
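The shift and the filters can be sketched as follows; the toy rows and the exact cut-offs are illustrative, and `log1p` is used so a zero count does not break the transform:

```python
import numpy as np
import pandas as pd

# Toy rows: a 0-page entry, a normal book, and a 12,000-page box set
df = pd.DataFrame({'rating_count': [10, 1500, 2_000_000],
                   'number_of_pages': [0, 320, 12_000]})

# Log transform tames the long tail (log1p is safe for zero counts)
df['rating_count_log'] = np.log1p(df['rating_count'])

# Keep only plausible retail books
df = df[(df['number_of_pages'] > 0) & (df['number_of_pages'] < 10_000)]
```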
Figure 2: Boxplot analysis identifying and filtering statistical outliers.
2. Exploratory Data Analysis (EDA)
We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
Question 1: How are the book ratings distributed?
Figure 1: Identifying the "center" of the data to justify our classification threshold.
Insights: The most common average rating in the dataset centers around 4.0. This makes sense, since people tend to read and finish books they already expected to like.
Question 2: Does the "Hype" (number of reviews) correlate with the Score?
Figure 2: Checking if high-volume books (popular) are rated better than niche books.
Insights: The cloud of dots thickens as the rating count increases. Books with few ratings can show extreme average ratings, both high and low, because each individual review carries more weight, while popular books with many reviews settle between roughly 3.8 and 4.2.
Important note: The cloud is otherwise fairly even, indicating little correlation between a book's popularity and its rating.
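This "little correlation" claim can be checked numerically with a Pearson correlation; a sketch with illustrative numbers (the real check would use the full columns):

```python
import pandas as pd

# Illustrative values only, standing in for the real columns
df = pd.DataFrame({'rating_count_log': [2.3, 5.1, 9.7, 12.0],
                   'average_rating': [4.5, 3.7, 4.0, 4.1]})

# Pearson correlation between (log) popularity and rating
corr = df['rating_count_log'].corr(df['average_rating'])
```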
Question 3: Are longer books rated higher or lower?
Figure 3: Investigating if "Epic" length contributes to higher perceived quality.
Insights: The black trend line has a slight upward tilt, indicating a slight positive correlation between a book's length and its average rating. This makes sense, as readers of longer books (800+ pages) tend to be more invested, dedicated fans.
Question 4: Which genres dominate the high-rating charts?
Figure 4: Determining if 'Genre' is a strong predictor of success.
Insights: The best-rated genres are the Sequential books and, surprisingly, the Unknown genre, the label we assigned to rows where the genre data was missing.
This plot shows that genre has a strong influence on a book's rating.
3. Feature Engineering: The "Author Reputation" Signal
The most significant breakthrough came from engineering the Author Reputation Score. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
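A hedged sketch of how such a score can be computed with a pandas group-transform (toy data; the real feature is built from the full cleaned dataset):

```python
import pandas as pd

# Toy data: author A has two books, author B has one
df = pd.DataFrame({'author': ['A', 'A', 'B'],
                   'average_rating': [4.5, 4.1, 3.2]})

# Reputation = historical mean rating across each author's books
df['author_rep_score'] = df.groupby('author')['average_rating'].transform('mean')
```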
Figure 5: Importance of Author Reputation relative to other features.
4. Unsupervised Learning: Discovering Book "Personas"
Using K-Means Clustering, we identified four distinct "Personas" within the dataset.
- Cluster 0 (The Modern Epics): The "Mainstream Blockbusters." Thick books with massive popularity.
- Cluster 1 (The Standard Read): Defined by average length and average popularity. This is likely the largest group of standard fiction.
- Cluster 2 (The Purple Legacy): "The Classics." They are much older than the rest of the dataset. Because there aren't many 70-year-old books in a modern dataset, they appear as small "specks" in the PCA projection, but mathematically their age makes them a very distinct, elite group.
- Cluster 3 (The High-Quality Hidden Gems): Defined by a high `author_rep_score` and the highest `average_rating`, but the lowest `rating_count_log`. They have very few ratings (low hype), but the readers who do find them love them, and they are written by top-tier authors.
Figure 6: PCA projection of the 4-cluster K-Means model.
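A minimal sketch of the clustering-plus-projection step, using random stand-in features since the engineered matrix itself is not shown here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled feature matrix (pages, hype, age, author score)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)

# Project to 2-D for the persona plot
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```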
Model Performance & Evaluation
We tested multiple architectures, from Linear Regression to ensemble methods. Gradient Boosting emerged as the winner on the regression task for its ability to handle non-linear interactions between features.
Regression Metrics
| Metric | Baseline | Final Model | Improvement |
|---|---|---|---|
| R-squared ($R^2$) | 0.0735 | 0.8531 | +1,060% |
| MAE | 0.2441 | 0.0820 | -66% |
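Both metrics come straight from scikit-learn; a toy example showing where numbers like these originate (the values below are illustrative, not the project's):

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative true vs. predicted ratings
y_true = [4.0, 3.5, 4.2, 3.8]
y_pred = [3.9, 3.6, 4.1, 3.8]

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```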
Classification: Predicting the "Hit"
We converted the task into a binary classification (Hit vs. Standard) using a 4.0 rating threshold.
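A sketch of the binarization and classifier fit, with synthetic features standing in for the engineered ones; the 4.0 threshold comes from the text, while treating exactly 4.0 as a "Hit" (`>=`) is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the engineered features (pages, log hype, author score, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
avg_rating = 3.8 + 0.2 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Binarize the target: "Hit" if the average rating clears the 4.0 threshold
y = (avg_rating >= 4.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```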
Figure 7: Confusion Matrix for the winning Random Forest Classifier.
Why Precision Matters: In a business context, a False Positive (wrongly predicting a hit) is more costly than a False Negative (missing a hit). Our model achieves 90% Precision, making it a reliable risk-mitigation tool for publishers.
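Precision is TP / (TP + FP); a quick illustration with hypothetical labels:

```python
from sklearn.metrics import precision_score

# Hypothetical labels: 2 true positives, 1 false positive -> precision = 2/3
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision = precision_score(y_true, y_pred)
```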
Deployment
The final winning model has been serialized using pickle and is hosted here on the Hugging Face Model Hub.
How to use:
```python
import pickle

# Load the pre-trained classifier
with open('final_book_classifier.pkl', 'rb') as f:
    model = pickle.load(f)

# Predict using engineered features: [pages, log_hype, author_score, is_series, etc.]
# prediction = model.predict(new_book_data)
```
Tech Stack
- Language: Python 3.12
- Libraries: Pandas, Scikit-Learn, Matplotlib, Seaborn, NumPy
- Clustering: K-Means
- Dimensionality Reduction: PCA
- Models: Linear Regression, Random Forest, Gradient Boosting, KNN, Logistic Regression