Update README.md
README.md
CHANGED
|
@@ -38,14 +38,31 @@ Our raw data was not model ready. Our first mission was to ensure every row was
|
|
| 38 |
### 2. Exploratory Data Analysis (EDA)
|
| 39 |
We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
|
| 40 |
|
| 41 |
-
|
| 42 |
-
|
| 43 |
|
| 44 |
### 3. Feature Engineering: The "Author Reputation" Signal
|
| 45 |
The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
|
| 46 |
|
| 47 |

|
| 48 |
-
*Figure
|
| 49 |
|
| 50 |
### 4. Unsupervised Learning: Discovering Book "Personas"
|
| 51 |
Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
|
|
@@ -55,7 +72,7 @@ Using **K-Means Clustering**, we identified four distinct "Personas" within the
|
|
| 55 |
* **The Everyman Read:** Standard length and average popularity.
|
| 56 |
|
| 57 |

|
| 58 |
-
*Figure
|
| 59 |
|
| 60 |
---
|
| 61 |
|
|
@@ -73,7 +90,7 @@ We tested multiple architectures, from Linear Regression to Ensemble Methods. **
|
|
| 73 |
We converted the task into a binary classification (Hit vs. Standard) using a **4.0 rating threshold**.
|
| 74 |
|
| 75 |

|
| 76 |
-
*Figure
|
| 77 |
|
| 78 |
**Why Precision Matters:** In a business context, a **False Positive**, wrongly predicting a hit, is more costly than a **False Negative**, missing a hit. Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
|
| 79 |
|
|
|
|
| 38 |
### 2. Exploratory Data Analysis (EDA)
|
| 39 |
We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
|
| 40 |
|
| 41 |
+
**Question 1: How are the book ratings distributed?**
|
| 42 |
+
|
| 43 |
+

|
| 44 |
+
*Figure 1: Identifying the "center" of the data to justify our classification threshold.*
|
| 45 |
+
|
| 46 |
+
**Question 2: Does the "Hype" (number of reviews) correlate with the Score?**
|
| 47 |
+
|
| 48 |
+

|
| 49 |
+
*Figure 2: Checking if high-volume books (popular) are rated better than niche books.*
|
| 50 |
+
|
| 51 |
+
**Question 3: Are longer books rated higher or lower?**
|
| 52 |
+
|
| 53 |
+

|
| 54 |
+
*Figure 3: Investigating if "Epic" length contributes to higher perceived quality.*
|
| 55 |
+
|
| 56 |
+
**Question 4: Which genres dominate the high-rating charts?**
|
| 57 |
+
|
| 58 |
+

|
| 59 |
+
*Figure 4: Determining if 'Genre' is a strong predictor of success.*
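
Question 2 can also be checked numerically instead of only by eye. The snippet below is a minimal sketch that computes a Pearson correlation between log-scaled rating counts and average ratings. The values and the variable names (`ratings_count`, `average_rating`) are invented for illustration and do not come from the project's dataset; the log transform is likewise an assumption, chosen because review counts tend to be heavily skewed.

```python
import math
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy values, not taken from the real data.
ratings_count = [12, 340, 5200, 88000, 150]
average_rating = [3.4, 3.9, 4.1, 4.4, 3.7]

# Compress the long tail of review counts before correlating.
log_counts = [math.log10(c) for c in ratings_count]
r = pearson(log_counts, average_rating)
print(f"Pearson r (log10 count vs. rating): {r:.2f}")
```

A value near zero would suggest that "Hype" alone says little about the score, which is consistent with the claim that it tells only part of the story.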
|
| 60 |
|
| 61 |
### 3. Feature Engineering: The "Author Reputation" Signal
|
| 62 |
The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
|
| 63 |
|
| 64 |

|
| 65 |
+
*Figure 5: Importance of Author Reputation relative to other features.*
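
As a rough illustration of how such a score could be computed, the sketch below averages ratings per author using training rows only and falls back to the global mean for unseen authors. The function name, the fallback rule, and the sample rows are assumptions made for this example; the README does not show the project's actual implementation. Restricting the computation to training rows is one way to avoid leaking a book's own target rating into its feature.

```python
from collections import defaultdict


def author_reputation(train_rows):
    """Return ({author: mean rating}, global mean) from (author, rating) pairs."""
    if not train_rows:
        raise ValueError("train_rows must not be empty")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for author, rating in train_rows:
        totals[author] += rating
        counts[author] += 1
    scores = {author: totals[author] / counts[author] for author in totals}
    global_mean = sum(totals.values()) / sum(counts.values())
    return scores, global_mean


# Hypothetical training rows: (author, average rating of one book).
train = [("Author A", 4.2), ("Author A", 3.8), ("Author B", 3.5)]
scores, global_mean = author_reputation(train)


def reputation_for(author):
    """Look up an author's score; unseen authors get the global mean (assumption)."""
    return scores.get(author, global_mean)


print(reputation_for("Author A"), reputation_for("Unknown Author"))
```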
|
| 66 |
|
| 67 |
### 4. Unsupervised Learning: Discovering Book "Personas"
|
| 68 |
Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
|
|
|
|
| 72 |
* **The Everyman Read:** Standard length and average popularity.
|
| 73 |
|
| 74 |

|
| 75 |
+
*Figure 6: PCA projection of the 4-cluster K-Means model.*
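
For readers unfamiliar with K-Means, the following is a dependency-free sketch of the underlying procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not the project's code, which presumably relies on a library and uses four clusters. The two-feature toy points, the fixed starting centroids, and the choice of two clusters are assumptions to keep the example small and deterministic.

```python
import math


def nearest_index(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations; return final centroids and labels."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Move each centroid to its cluster mean; keep it in place if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest_index(p, centroids) for p in points]
    return centroids, labels


# Hypothetical scaled features: (page count, log rating count).
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)
```

In practice, features should be put on a comparable scale before clustering, since K-Means relies on distances and an unscaled feature such as raw rating counts would dominate.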
|
| 76 |
|
| 77 |
---
|
| 78 |
|
|
|
|
| 90 |
We converted the task into a binary classification (Hit vs. Standard) using a **4.0 rating threshold**.
|
| 91 |
|
| 92 |

|
| 93 |
+
*Figure 7: Confusion Matrix for the winning Random Forest Classifier.*
|
| 94 |
|
| 95 |
**Why Precision Matters:** In a business context, a **False Positive**, wrongly predicting a hit, is more costly than a **False Negative**, missing a hit. Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
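
To make the metric concrete, here is a small sketch of the labeling step and the precision calculation, where precision is TP / (TP + FP). The sample ratings and predictions are invented, and treating a rating of exactly 4.0 as a "Hit" (`>=`) is an assumption, since the README does not say whether the threshold is inclusive.

```python
def label_hit(rating, threshold=4.0):
    """True for a 'Hit', False for 'Standard'. Inclusive threshold is an assumption."""
    return rating >= threshold


def precision(y_true, y_pred):
    """Share of predicted hits that are real hits: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical ratings and model predictions, for illustration only.
ratings = [4.3, 3.9, 4.0, 3.2, 4.5]
y_true = [label_hit(r) for r in ratings]
y_pred = [True, True, True, False, False]

print(f"Precision: {precision(y_true, y_pred):.2f}")
```

In this toy case two of the three predicted hits are real, so one in three "hit" predictions would have been a costly false positive. The missed hit (the last book) lowers recall but does not affect precision.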
|
| 96 |
|