Jonathandav committed · Commit d10230c · verified · 1 Parent(s): 1bb2be5

Update README.md

Files changed (1)
  1. README.md +22 -5
README.md CHANGED
@@ -38,14 +38,31 @@ Our raw data was not model ready. Our first mission was to ensure every row was
38
  ### 2. Exploratory Data Analysis (EDA)
39
  We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
40
 
41
- ![EDA Correlation Heatmap](PASTE_LINK_OR_FILENAME_HERE)
42
- *Figure 1: Correlation between physical metadata and user ratings.*
43
 
44
  ### 3. Feature Engineering: The "Author Reputation" Signal
45
  The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
46
 
47
  ![Feature Importance Bar Chart](PASTE_LINK_OR_FILENAME_HERE)
48
- *Figure 2: Importance of Author Reputation relative to other features.*
49
 
50
  ### 4. Unsupervised Learning: Discovering Book "Personas"
51
  Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
@@ -55,7 +72,7 @@ Using **K-Means Clustering**, we identified four distinct "Personas" within the
55
  * **The Everyman Read:** Standard length and average popularity.
56
 
57
  ![PCA Cluster Visualization](PASTE_LINK_OR_FILENAME_HERE)
58
- *Figure 3: PCA projection of the 4-cluster K-Means model.*
59
 
60
  ---
61
 
@@ -73,7 +90,7 @@ We tested multiple architectures, from Linear Regression to Ensemble Methods. **
73
  We converted the task into a binary classification (Hit vs. Standard) using a **4.0 rating threshold**.
74
 
75
  ![Confusion Matrix Final](PASTE_LINK_OR_FILENAME_HERE)
76
- *Figure 4: Confusion Matrix for the winning Random Forest Classifier.*
77
 
78
  **Why Precision Matters:** In a business context, a **False Positive**, wrongly predicting a hit, is more costly than a **False Negative**, missing a hit. Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
79
 
 
38
  ### 2. Exploratory Data Analysis (EDA)
39
  We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
40
 
41
+ **Question 1: How are the book ratings distributed?**
42
+
43
+ ![EDA Q1](PASTE_LINK_OR_FILENAME_HERE)
44
+ *Figure 1: Identifying the "center" of the data to justify our classification threshold.*
45
+
46
+ **Question 2: Does the "Hype" (number of reviews) correlate with the Score?**
47
+
48
+ ![EDA Q2](PASTE_LINK_OR_FILENAME_HERE)
49
+ *Figure 2: Checking if high-volume books (popular) are rated better than niche books.*
50
+
51
+ **Question 3: Are longer books rated higher or lower?**
52
+
53
+ ![EDA Q3](PASTE_LINK_OR_FILENAME_HERE)
54
+ *Figure 3: Investigating if "Epic" length contributes to higher perceived quality.*
55
+
56
+ **Question 4: Which genres dominate the high-rating charts?**
57
+
58
+ ![EDA Q4](PASTE_LINK_OR_FILENAME_HERE)
59
+ *Figure 4: Determining if 'Genre' is a strong predictor of success.*
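A question like Question 2 is usually answered with a correlation coefficient. The following is a minimal, dependency-free sketch of a Pearson correlation between rating counts and scores. The sample values are invented for illustration, and the log transform of the counts is an assumption made here because popularity counts tend to be heavily skewed; the README does not say how the original analysis was computed.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical sample: rating counts ("Hype") and average ratings ("Score").
rating_counts = [10, 200, 3500, 42000, 900000]
ratings = [3.6, 3.9, 4.1, 4.0, 4.3]

# Assumption: compress the skewed counts with log1p before correlating.
log_counts = [math.log1p(c) for c in rating_counts]
r = pearson(log_counts, ratings)
print(f"Pearson r between log(rating count) and rating: {r:.2f}")
```

A value near 0 would suggest that popularity alone says little about the score, which is the kind of result that motivates engineered features.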
60
 
61
  ### 3. Feature Engineering: The "Author Reputation" Signal
62
  The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
63
 
64
  ![Feature Importance Bar Chart](PASTE_LINK_OR_FILENAME_HERE)
65
+ *Figure 5: Importance of Author Reputation relative to other features.*
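The idea of an author-level average can be sketched in a few lines. This is an illustrative, dependency-free version, not the project's actual code: the `(author, rating)` row shape, the function names, and the fallback to the global mean for unseen authors are all assumptions. The averages are computed from training rows only, since including a book's own rating in its author score would leak the target into the feature.

```python
from collections import defaultdict

def author_reputation(train_rows):
    """Return {author: mean rating} computed from (author, rating) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # author -> [rating sum, book count]
    for author, rating in train_rows:
        totals[author][0] += rating
        totals[author][1] += 1
    return {author: s / n for author, (s, n) in totals.items()}

def reputation_for(scores, author, fallback):
    """Look up an author's score, using the fallback for unseen authors."""
    return scores.get(author, fallback)

# Hypothetical training rows: (author, average rating of one book).
train_rows = [("A", 4.0), ("A", 4.4), ("B", 3.5)]
scores = author_reputation(train_rows)

# Assumption: authors absent from the training data get the global mean.
global_mean = sum(rating for _, rating in train_rows) / len(train_rows)
unseen = reputation_for(scores, "C", global_mean)
```

Here author "A" receives the mean of their two books, while the unseen author "C" falls back to the overall mean.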
66
 
67
  ### 4. Unsupervised Learning: Discovering Book "Personas"
68
  Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
 
72
  * **The Everyman Read:** Standard length and average popularity.
73
 
74
  ![PCA Cluster Visualization](PASTE_LINK_OR_FILENAME_HERE)
75
+ *Figure 6: PCA projection of the 4-cluster K-Means model.*
76
 
77
  ---
78
 
 
90
  We converted the task into a binary classification (Hit vs. Standard) using a **4.0 rating threshold**.
91
 
92
  ![Confusion Matrix Final](PASTE_LINK_OR_FILENAME_HERE)
93
+ *Figure 7: Confusion Matrix for the winning Random Forest Classifier.*
94
 
95
  **Why Precision Matters:** In a business context, a **False Positive**, wrongly predicting a hit, is more costly than a **False Negative**, missing a hit. Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
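To make the precision argument concrete, here is a small sketch of the labeling step and the metric itself. The sample ratings and predictions are invented, and treating a rating of exactly 4.0 as a Hit is an assumption, since the README does not state whether the threshold is inclusive.

```python
def label_hits(ratings, threshold=4.0):
    """Label each rating: 1 = Hit, 0 = Standard (inclusive threshold assumed)."""
    return [1 if r >= threshold else 0 for r in ratings]

def precision(y_true, y_pred):
    """Share of predicted Hits that really are Hits: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical ratings and model predictions.
ratings = [4.3, 3.8, 4.0, 3.2, 4.6]
y_true = label_hits(ratings)
y_pred = [1, 1, 1, 0, 0]

score = precision(y_true, y_pred)
print(f"Precision: {score:.2f}")
```

Note that the missed Hit in the last position (a False Negative) does not lower precision; only the wrongly predicted Hit in the second position does. That is why precision is the natural metric when False Positives are the costlier error.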
96