Jonathandav / GoodReads-Rating-Predictor

@@ -21,19 +21,33 @@ Can the success of a book be predicted before it ever hits the shelves? This pro
 ## The Engineering Pipeline
-### 1. Exploratory Data Analysis (EDA)
 We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
 ![EDA Correlation Heatmap](PASTE_LINK_OR_FILENAME_HERE)
 *Figure 1: Correlation between physical metadata and user ratings.*
-### 2. Feature Engineering: The "Author Reputation" Signal
 The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
 ![Feature Importance Bar Chart](PASTE_LINK_OR_FILENAME_HERE)
 *Figure 2: Importance of Author Reputation relative to other features.*
-### 3. Unsupervised Learning: Discovering Book "Personas"
 Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
 * **The Classics:** High-age, stable-rating books.
 * **The Modern Epics:** High page count, high popularity.
@@ -61,7 +75,7 @@ We converted the task into a binary classification (Hit vs. Standard) using a **
 ![Confusion Matrix Final](PASTE_LINK_OR_FILENAME_HERE)
 *Figure 4: Confusion Matrix for the winning Random Forest Classifier.*
-**Why Precision Matters:** In a business context, a **False Positive** (wrongly predicting a hit) is more costly than a **False Negative** (missing a hit). Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
 ---

 ## The Engineering Pipeline
+### 1. Data Cleaning & Preprocessing
+Our raw data was not model-ready. Our first mission was to ensure every row was trustworthy and every feature was statistically sound.
+#### Handling Missing Values & Consistency
+* **Intelligent Imputation:** Rather than dropping rows with missing values, we used **Median Imputation** for skewed numerical features (like `number_of_pages`), because the median is less sensitive to extreme values than the mean. This keeps those rows available for training.
+* **Schema Standardization:** We standardized all column headers to `snake_case` and stripped whitespace to prevent column-lookup errors in code.
+#### Outlier Detection & Treatment
+* **The Logarithmic Shift:** `rating_count` exhibited a long-tail distribution. We applied a **Log Transformation** (`rating_count_log`) to compress this scale, so that a handful of extremely popular books do not dominate the feature.
+* **Impossible Values:** We filtered out impossible entries (e.g., 0-page books) and extreme edge cases (10,000+ page box sets) to focus the model on the standard retail book market.
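The two cleaning steps above could look roughly like the following sketch. It assumes the data lives in a pandas DataFrame; the helper names `to_snake_case` and `clean_frame` and the toy values are hypothetical and are not taken from the repository.

```python
import re

import numpy as np
import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding whitespace, then convert a header to snake_case."""
    name = name.strip()
    # Insert an underscore at lower-to-upper boundaries, e.g. "ratingCount".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse any run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def clean_frame(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Standardize headers, then fill missing numeric values with the median."""
    out = df.rename(columns=to_snake_case)
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    return out


# Hypothetical toy data with messy headers and one missing page count.
raw = pd.DataFrame({
    "  Number of Pages ": [100.0, np.nan, 300.0, 5000.0],
    "ratingCount": [10, 20, 30, 40],
})
clean = clean_frame(raw, ["number_of_pages"])
print(clean)
```

In this toy frame the missing page count is filled with the median of the observed values (300) rather than the mean (1800), which the single 5000-page entry would pull upward.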
+![Outlier Detection Boxplot](PASTE_LINK_OR_FILENAME_HERE)
+*Figure 2: Boxplot analysis identifying and filtering statistical outliers.*
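A minimal sketch of the outlier treatment described in this section, under stated assumptions: `np.log1p` is used so that books with zero ratings stay defined (the README does not say which log variant was applied), and the 10,000-page cutoff is read off the "10,000+ page" wording. The function name `treat_outliers` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

MAX_PAGES = 10_000  # assumed cutoff, based on the "10,000+ page" wording


def treat_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop impossible or extreme page counts, then log-scale rating_count."""
    pages = df["number_of_pages"]
    kept = df[(pages > 0) & (pages < MAX_PAGES)].copy()
    # log1p(x) = log(1 + x), so a rating_count of 0 maps to 0 instead of -inf.
    kept["rating_count_log"] = np.log1p(kept["rating_count"])
    return kept


# Hypothetical rows: a 0-page entry, two ordinary books, and a box set.
books = pd.DataFrame({
    "number_of_pages": [0, 320, 12000, 450],
    "rating_count": [5, 0, 100, 2_000_000],
})
treated = treat_outliers(books)
print(treated)
```

After the transform, a book with two million ratings sits at roughly 14.5 on the log scale instead of being several orders of magnitude away from a typical title.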
+### 2. Exploratory Data Analysis (EDA)
 We began by uncovering the natural relationships in the data. Our analysis revealed that "Hype" (rating counts) and "Size" (page counts) were influential, but they only told a fraction of the story.
 ![EDA Correlation Heatmap](PASTE_LINK_OR_FILENAME_HERE)
 *Figure 1: Correlation between physical metadata and user ratings.*
+### 3. Feature Engineering: The "Author Reputation" Signal
 The most significant breakthrough came from engineering the **Author Reputation Score**. By calculating the historical average rating for each author, we gave the model a "human" insight into quality that raw metadata lacks.
 ![Feature Importance Bar Chart](PASTE_LINK_OR_FILENAME_HERE)
 *Figure 2: Importance of Author Reputation relative to other features.*
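One way the Author Reputation Score could be computed is sketched below. The README does not show the actual implementation, so this is an assumption throughout: the column names `author` and `average_rating`, the function `add_author_reputation`, and the choice to fit the averages on the training split only (so that test-set ratings do not leak into the feature) are all illustrative.

```python
import pandas as pd


def add_author_reputation(train, test, author_col="author",
                          rating_col="average_rating"):
    """Attach each author's mean training-set rating as a feature."""
    author_means = train.groupby(author_col)[rating_col].mean()
    global_mean = train[rating_col].mean()

    train = train.copy()
    test = test.copy()
    train["author_reputation"] = train[author_col].map(author_means)
    # Authors unseen in training fall back to the overall training mean.
    test["author_reputation"] = (
        test[author_col].map(author_means).fillna(global_mean)
    )
    return train, test


# Hypothetical toy split.
train_df = pd.DataFrame({
    "author": ["A", "A", "B"],
    "average_rating": [4.0, 4.4, 3.0],
})
test_df = pd.DataFrame({"author": ["A", "C"]})

train_out, test_out = add_author_reputation(train_df, test_df)
print(test_out)
```

Note that in this sketch each training row's own rating still contributes to its author's mean. A stricter variant would use leave-one-out or out-of-fold averages, which matters most for authors with only one or two books.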
+### 4. Unsupervised Learning: Discovering Book "Personas"
 Using **K-Means Clustering**, we identified four distinct "Personas" within the dataset.
 * **The Classics:** High-age, stable-rating books.
 * **The Modern Epics:** High page count, high popularity.
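The clustering step might be sketched as follows with scikit-learn. The feature set (book age, page count, log rating count), the synthetic data, and the scaling step are assumptions made for illustration; only the use of K-Means with four clusters comes from the README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: columns are (book_age, number_of_pages,
# rating_count_log). Four loose groups are generated on purpose.
centers = np.array([
    [80.0, 300.0, 8.0],
    [10.0, 900.0, 11.0],
    [5.0, 250.0, 4.0],
    [30.0, 400.0, 6.0],
])
X = np.vstack([
    c + rng.normal(scale=[3.0, 20.0, 0.3], size=(50, 3)) for c in centers
])

# K-Means is distance-based, so features are standardized first; otherwise
# page counts would dominate simply because their numeric range is larger.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(np.bincount(labels))
print(kmeans.cluster_centers_)
```

Persona names such as "The Classics" would then be assigned by inspecting each cluster's center, for example by checking which cluster has the highest average book age.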
 ![Confusion Matrix Final](PASTE_LINK_OR_FILENAME_HERE)
 *Figure 4: Confusion Matrix for the winning Random Forest Classifier.*
+**Why Precision Matters:** In a business context, a **False Positive**, wrongly predicting a hit, is more costly than a **False Negative**, missing a hit. Our model achieves **90% Precision**, making it a reliable risk-mitigation tool for publishers.
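For reference, precision is TP / (TP + FP): of all books the model flags as hits, the share that really are hits. The short sketch below computes it by hand on made-up labels. The labels are purely illustrative and are not the project's actual predictions.

```python
def precision(y_true, y_pred, positive=1):
    """Return TP / (TP + FP) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions were made
    return tp / (tp + fp)


# Illustrative labels only: 1 = "Hit", 0 = "Standard".
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

print(precision(y_true, y_pred))  # 9 true hits out of 10 predicted hits
```

Here ten books are predicted as hits and nine of them are, so precision is 0.9. The one real hit predicted as "Standard" is a false negative; it lowers recall but leaves precision unchanged.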
 ---