Update README.md
README.md
CHANGED
@@ -21,19 +21,33 @@ Can the success of a book be predicted before it ever hits the shelves? This pro
## The Engineering Pipeline

## 1. Data Cleaning & Preprocessing

Our raw data was not model-ready. Our first task was to make sure that every row was trustworthy and every feature was statistically sound.

### Handling Missing Values & Consistency

* **Intelligent Imputation:** Rather than dropping rows with missing values, we used **Median Imputation** for skewed numerical features (like `number_of_pages`) to preserve the dataset's statistical power.
* **Schema Standardization:** We standardized all column headers to `snake_case` and stripped whitespace to prevent errors from inconsistently named columns.
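
The two cleaning steps above can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the helper names `to_snake_case` and `impute_median`, the `Title` column, and the sample rows are all assumptions made for the example. Only `number_of_pages` is named in this README.

```python
import re
from statistics import median


def to_snake_case(name: str) -> str:
    """Normalize a column header: trim, split camelCase, join words with underscores."""
    name = name.strip()
    # Insert an underscore at camelCase boundaries ("ratingCount" -> "rating_Count").
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def impute_median(rows, column):
    """Fill missing (None) values in `column` with the median of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical raw rows with messy headers and one missing page count.
raw = [
    {" Number Of Pages ": 320, "Title": "A"},
    {" Number Of Pages ": None, "Title": "B"},
    {" Number Of Pages ": 180, "Title": "C"},
    {" Number Of Pages ": 1200, "Title": "D"},
]

books = [{to_snake_case(key): value for key, value in row.items()} for row in raw]
books = impute_median(books, "number_of_pages")
```

The median is used instead of the mean because a single very long book (1200 pages here) would pull the mean upward, while the median stays near the typical value.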

### Outlier Detection & Treatment

* **The Logarithmic Shift:** `rating_count` exhibited a "Long Tail" distribution. We applied a **Log Transformation** (`rating_count_log`) to compress this scale, so that a handful of extremely popular titles does not dominate the feature.
* **Impossible Values:** We filtered out "impossible" entries (e.g., 0-page books) and extreme edge cases (10,000+ page box sets) to focus the model on the standard retail book market.

*Figure 2: Boxplot analysis identifying and filtering statistical outliers.*
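
As a rough sketch of the outlier treatment described above, the snippet below filters implausible page counts and derives `rating_count_log`. The exact thresholds and the use of `log1p` (the logarithm of one plus the count, which tolerates zero ratings) are assumptions; the README does not state which logarithm or cutoffs were used. The sample rows are made up.

```python
import math

MIN_PAGES = 1        # assumed lower bound: drops "impossible" 0-page entries
MAX_PAGES = 10_000   # assumed cutoff for the 10,000+ page box sets


def clean_outliers(rows):
    """Drop rows outside the plausible page range and add a log-scaled rating count."""
    cleaned = []
    for row in rows:
        pages = row["number_of_pages"]
        if not (MIN_PAGES <= pages < MAX_PAGES):
            continue
        # log1p(x) = log(1 + x), so a book with zero ratings maps to 0.0
        # instead of raising a math domain error.
        cleaned.append({**row, "rating_count_log": math.log1p(row["rating_count"])})
    return cleaned


# Made-up rows for illustration only.
sample = [
    {"number_of_pages": 0, "rating_count": 12},
    {"number_of_pages": 350, "rating_count": 0},
    {"number_of_pages": 420, "rating_count": 2_000_000},
    {"number_of_pages": 14_000, "rating_count": 900},
]
cleaned = clean_outliers(sample)
```

After the transform, a book with two million ratings sits at roughly 14.5 on the log scale rather than 2,000,000 on the raw scale, which keeps it comparable with less popular titles.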

### 2. Exploratory Data Analysis (EDA)

We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.

*Figure 1: Correlation between physical metadata and user ratings.*
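
For readers who want to reproduce this kind of check, here is a small sketch of the Pearson correlation coefficient, the usual statistic behind a correlation plot. It assumes Pearson correlation was the measure used, which the README does not state, and the numbers are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values; they are not taken from the project's dataset.
pages = [120, 250, 310, 480, 900]
ratings = [3.6, 3.9, 3.8, 4.1, 4.3]
r = pearson(pages, ratings)
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship, which is one way to read "only told a fraction of the story".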

### 3. Feature Engineering: The "Author Reputation" Signal

The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.

*Figure 2: Importance of Author Reputation relative to other features.*
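
One way to compute such a score is sketched below. The column names `author`, `average_rating`, and `author_reputation` and both helper functions are hypothetical. The sketch also assumes the averages are fitted on training rows only, with unseen authors falling back to the global mean. The README does not say how the project handled either point.

```python
from collections import defaultdict


def fit_author_reputation(train_rows):
    """Return (per-author mean rating, global mean rating) from training rows."""
    totals = defaultdict(lambda: [0.0, 0])  # author -> [sum of ratings, book count]
    for row in train_rows:
        entry = totals[row["author"]]
        entry[0] += row["average_rating"]
        entry[1] += 1
    per_author = {author: total / count for author, (total, count) in totals.items()}
    rating_sum = sum(total for total, _ in totals.values())
    book_count = sum(count for _, count in totals.values())
    return per_author, rating_sum / book_count


def add_author_reputation(rows, per_author, global_mean):
    """Attach `author_reputation`; authors not seen in training get the global mean."""
    return [
        {**row, "author_reputation": per_author.get(row["author"], global_mean)}
        for row in rows
    ]


# Invented example data.
train = [
    {"author": "A. Writer", "average_rating": 4.0},
    {"author": "A. Writer", "average_rating": 4.5},
    {"author": "B. Novelist", "average_rating": 3.0},
]
new_books = [{"author": "A. Writer"}, {"author": "C. Debut"}]

per_author, global_mean = fit_author_reputation(train)
scored = add_author_reputation(new_books, per_author, global_mean)
```

Fitting on training rows only matters because an author's average that includes the book being predicted would leak the target into the feature.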

### 4. Unsupervised Learning: Discovering Book "Personas"

Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
* **The Classics:** High-age, stable-rating books.
* **The Modern Epics:** High page count, high popularity.

@@ -61,7 +75,7 @@ We converted the task into a binary classification (Hit vs. Standard) using a **

*Figure 4: Confusion Matrix for the winning Random Forest Classifier.*

**Why Precision Matters:** In a business context, a **False Positive** (wrongly predicting a hit) is more costly than a **False Negative** (missing a hit). Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
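
To make the metric concrete, the sketch below computes precision and recall from confusion-matrix counts. The counts are invented so that precision comes out at 90%; they are not the values from the project's confusion matrix.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP): the share of predicted hits that were real hits."""
    predicted_hits = true_positives + false_positives
    if predicted_hits == 0:
        raise ValueError("precision is undefined when nothing is predicted as a hit")
    return true_positives / predicted_hits


def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the share of real hits that the model caught."""
    actual_hits = true_positives + false_negatives
    if actual_hits == 0:
        raise ValueError("recall is undefined when there are no real hits")
    return true_positives / actual_hits


# Invented counts: 100 books predicted as hits, 90 of them correctly.
tp, fp, fn = 90, 10, 60

model_precision = precision(tp, fp)  # false negatives do not enter this formula
model_recall = recall(tp, fn)        # false positives do not enter this formula
```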

---