Vehicle Fuel Efficiency: From Raw Data to Machine Learning Insights

1. Dataset Overview

The foundation of this project is a comprehensive dataset detailing vehicle specifications, pricing, and fuel consumption metrics.

  • Size: The dataset contains over 11,000 records of individual car models.
  • Features: It includes a mix of categorical and numerical features, such as Make, Model, Year, Engine HP, Engine Cylinders, Transmission Type, Vehicle Style, and MSRP.
  • Target Variable: The primary target variable for my predictive modeling is Highway MPG (Miles Per Gallon), representing the vehicle's fuel efficiency on the highway.

2. Exploratory Data Analysis (EDA)

Before feeding data into machine learning algorithms, I conducted a thorough Exploratory Data Analysis (EDA) to understand the underlying mechanical and market distributions.

Key steps included:

  • Advanced Data Cleaning & Imputation: I systematically identified and handled missing values in critical columns. Instead of relying on naive global averages, I utilized Grouping Imputation (e.g., filling missing Engine HP or Cylinders based on the median values of similar vehicles grouped by Make and Model). I also stripped away redundant or highly sparse text features.
  • Outlier Detection & Treatment: I analyzed the spread of vehicle prices (MSRP) and engine capabilities, specifically hunting down and isolating extreme outliers (such as multi-million-dollar hypercars or anomalous data entries) that would otherwise heavily skew my baseline models.
  • Correlation Mapping: I identified strong multi-collinearity between specific features (e.g., Engine Cylinders and Engine HP), which directly informed my feature engineering strategy later in the project.
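The grouping-imputation step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the tiny DataFrame and the exact column names ("Make", "Model", "Engine HP") are assumptions based on the features listed earlier.

```python
# Hypothetical sketch of group-based imputation: fill missing Engine HP with
# the median of vehicles sharing the same Make/Model, falling back to the
# global median for groups that are entirely missing.
import pandas as pd

df = pd.DataFrame({
    "Make":      ["Ford", "Ford", "Ford", "BMW", "BMW"],
    "Model":     ["F-150", "F-150", "F-150", "M3", "M3"],
    "Engine HP": [300.0, None, 320.0, 425.0, None],
})

# Per-group median, broadcast back to every row of that group.
group_median = df.groupby(["Make", "Model"])["Engine HP"].transform("median")
df["Engine HP"] = df["Engine HP"].fillna(group_median)
# Global-median fallback for groups with no observed values at all.
df["Engine HP"] = df["Engine HP"].fillna(df["Engine HP"].median())
```

The advantage over a naive global fill is visible even here: the missing F-150 entry gets the F-150 median (310 HP) rather than a fleet-wide average dragged around by unrelated vehicles.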

3. Research Questions & Visual Insights

To guide my EDA, I formulated four specific research questions to explore the physical characteristics and market trends within the automotive industry.

Q1: Impact of Engine Power on Fuel Efficiency (City MPG)

I investigated the direct relationship between raw engine horsepower and fuel consumption in urban environments. My hypothesis was that higher engine power necessitates greater fuel intake, leading to lower City MPG. (See figure below)

Impact of Engine Power on Fuel Efficiency


Q2: Average Highway MPG by Top 10 Manufacturers

I wanted to compare the engineering focus of the leading automotive brands. This analysis helped me identify which of the top 10 manufacturers prioritize highway fuel efficiency in their vehicle lineups. (See figure below)

Average Highway MPG by Top 10 Manufacturers


Q3: Evolution of Average Engine Power (2000-2017)

To understand historical automotive trends, I tracked the average engine horsepower from 2000 to 2017. This revealed whether the industry has been shifting towards more powerful vehicles over time. (See figure below)

Evolution of Average Engine Power


Q4: Shift in Vehicle Style Production (2000-2017)

Finally, I examined how consumer preferences and manufacturing priorities have changed regarding vehicle body styles (e.g., the rise of SUVs vs. Sedans) between the years 2000 and 2017. (See figure below)

Shift in Vehicle Style Production



4. Baseline Model Training

Before diving into complex algorithms and advanced feature engineering, I established a baseline model. This serves as a benchmark to prove whether my subsequent, more complicated models actually provide genuine value.

Objective

The goal of the baseline model is to predict the Highway MPG (Miles Per Gallon) using a standard, un-engineered set of features.

Features Used

For this initial phase, I trained a basic Linear Regression model. To keep it simple and avoid data leakage, I dropped specific columns:

  • Target: highway MPG (removed from the training data, since it is the value being predicted).
  • Data Leakage Prevented: I removed city mpg, because in a real-world scenario, if you don't know the highway efficiency, you likely don't know the city efficiency either.
  • Text/ID Clutter Removed: I dropped Model and Market Category as they contain hundreds of unique, noisy text values that a simple linear model cannot process effectively without advanced encoding.

The model was trained entirely on the remaining raw physical and market specifications (such as Engine HP, Engine Cylinders, Year, Transmission Type, MSRP, etc.) utilizing standard One-Hot Encoding for categorical variables.
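A minimal sketch of this baseline setup, assuming a scikit-learn pipeline with One-Hot Encoding for the categorical columns; the toy DataFrame and its column names stand in for the real dataset:

```python
# Baseline sketch: one-hot encode categoricals, pass numeric columns through,
# and fit a plain LinearRegression on the result.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Engine HP":         [130, 300, 90, 250, 150, 110],
    "Engine Cylinders":  [4, 8, 3, 6, 4, 4],
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL",
                          "AUTOMATIC", "AUTOMATIC", "MANUAL"],
    "highway MPG":       [34, 22, 40, 26, 32, 36],
})

X = df.drop(columns=["highway MPG"])   # target removed from the features
y = df["highway MPG"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Transmission Type"])],
    remainder="passthrough",           # numeric columns pass through unchanged
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())])
model.fit(X, y)
preds = model.predict(X)
```

In the real project, `city mpg`, `Model`, and `Market Category` would already have been dropped before this step, per the list above.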

Baseline Results

The Linear Regression baseline performed decently, establishing the initial error margin and predictive accuracy that my future models needed to beat. (See the baseline evaluation results below)

Baseline Model Results - Evaluation

Baseline Model Results - Feature Importance


Baseline Feature Importance: What is the Model Thinking?

To understand how the baseline Linear Regression makes its predictions, I extracted its feature coefficients. In a linear model, these coefficients act as weights, showing exactly how much each feature pushes the predicted Highway MPG up (green) or down (red).

(See the feature importance chart below)

Baseline Feature Importance

Key Takeaways from the Baseline Model:

  • The Electric Dominance: Engine Fuel Type_electric has a massive positive impact. Electric vehicles operate on MPGe (miles per gallon equivalent), which is drastically higher than traditional combustion engines. Transmission Type_DIRECT_DRIVE, which is heavily associated with EVs, also shows a huge positive coefficient.
  • The Weight & Aerodynamic Penalties: Large, boxy, and heavy vehicle styles completely dominate the negative side of the chart. Passenger Van, Cargo Van, Cargo Minivan, and various Pickup body styles severely reduce the predicted Highway MPG.
  • Model Quirks & Limitations: Interestingly, the model assigned significant weights to highly specific, rare exotic brands (like Make_Bugatti, Make_HUMMER, and Make_Spyker). Instead of learning general mechanical rules, the simple linear model tried to "memorize" these specific outlier brands to fix its mathematical errors. This was a clear signal that I needed to engineer better mechanical features and transition to more advanced algorithms.

5. Advanced Feature Engineering

Recognizing the limitations of the baseline model, I implemented advanced feature engineering to provide the machine learning algorithms with a deeper, more structural understanding of the dataset.

Instead of relying solely on explicit labels (like Make or Vehicle Style), I used Unsupervised Learning to discover the inherent mechanical groupings of the vehicles based on their core specifications.

K-Means Clustering & PCA Visualization

I applied the K-Means Clustering algorithm to group the vehicles based on four critical numerical features: Engine HP, Engine Cylinders, MSRP, and Year. The goal was to let the algorithm naturally separate the cars into distinct "Mechanical Profiles" (e.g., separating hypercars from standard sedans without looking at their textual labels).

To visualize these multi-dimensional clusters, I used Principal Component Analysis (PCA) to project the 4-dimensional data down to a 2D scatter plot, showing that the algorithm created distinct, well-separated groups.

(See the PCA visualization of the distinct mechanical clusters below)

True Mechanical Separation using PCA
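The clustering-plus-projection step can be sketched like this. The synthetic "economy car" and "hypercar" blobs are stand-ins for the real four features (Engine HP, Engine Cylinders, MSRP, Year); scaling before K-Means is an assumption, but a necessary one given that MSRP dwarfs the other columns numerically.

```python
# Sketch: K-Means on scaled specs, then PCA to 2D for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic "mechanical profiles": economy cars vs hypercars,
# columns ~ (HP, cylinders, MSRP, year).
economy = rng.normal([150, 4, 25_000, 2012], [20, 0.5, 5_000, 3], (50, 4))
hyper   = rng.normal([900, 12, 1_500_000, 2014], [50, 1, 200_000, 2], (50, 4))
X = np.vstack([economy, hyper])

X_scaled = StandardScaler().fit_transform(X)   # comparable feature scales
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Project to 2D for the scatter plot of cluster separation.
coords = PCA(n_components=2).fit_transform(X_scaled)
```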

Extracting New Predictive Features

Creating the clusters wasn't just for visualization; I extracted two powerful new features to train my improved models:

  1. Cluster_ID: Every vehicle was assigned a category (0 through 4) representing its mechanical family. This essentially replaces hundreds of noisy brand and model names with five distinct, highly predictive categories.
  2. Distance_to_Centroid: This is a mathematically calculated feature representing how close a vehicle is to the mathematical center (the "average" or "prototype") of its assigned cluster.
    • Why is this important? A vehicle sitting exactly at the centroid is the perfect representation of that cluster's efficiency. A vehicle that is very far from the centroid (a high distance score) is an anomaly within its group, signaling to the model that its MPG might deviate significantly from the group's average.

These newly engineered features were designed to give tree-based algorithms clean, target-correlated split points to improve prediction accuracy.
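Deriving both features from a fitted K-Means model is straightforward; a minimal sketch on synthetic data (k=2 here rather than the project's five clusters):

```python
# Sketch: extract Cluster_ID and Distance_to_Centroid from fitted K-Means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(10, 1, (40, 4))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

cluster_id = km.labels_                 # Cluster_ID feature (0..k-1)
# Euclidean distance from each point to its OWN cluster's centroid:
# small = prototypical member, large = anomaly within its group.
dist_to_centroid = np.linalg.norm(
    X - km.cluster_centers_[cluster_id], axis=1
)
```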


6. Model Evaluation and Comparison

With the engineered features in place, I trained three additional regression models to predict Highway MPG: an Engineered Linear Regression, a Random Forest Regressor, and a Gradient Boosting Regressor.

I evaluated all models against the Baseline using three main metrics: R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

(See the evaluation charts below)

Model Performance Battle
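For reference, the three metrics can be computed as below (toy `y_true`/`y_pred` values; the real numbers come from the held-out predictions):

```python
# Sketch: the three regression metrics used for the comparison.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([30.0, 22.0, 35.0, 28.0])
y_pred = np.array([29.0, 23.0, 34.0, 30.0])

r2   = r2_score(y_true, y_pred)                     # variance explained
mae  = mean_absolute_error(y_true, y_pred)          # avg. error in MPG units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses
```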

Performance Analysis

Looking at the results, two main observations require technical explanation:

1. The Drop in Linear Regression Performance: The Engineered Linear Regression performed worse than the Baseline (R² dropped to 0.906, and MAE increased to 2.05). This is a textbook case of multicollinearity. By engineering features such as HP_per_Cylinder and incorporating Cluster_ID and Distance_to_Centroid alongside the original numeric features, I introduced highly correlated variables. When predictors overlap mathematically, ordinary least squares cannot assign accurate, stable coefficients, leading to higher error rates.

2. The Success of Tree-Based Ensembles: The Random Forest Regressor was the clear winner, achieving an R² of 0.987 and reducing the MAE to 0.58 MPG. Unlike linear equations, tree-based models make binary splits and are largely insensitive to multicollinearity. They excel at capturing non-linear relationships. The Random Forest algorithm effectively used the engineered unsupervised features (the clusters and centroid distances) as clear decision nodes to segment the vehicles, resulting in a highly accurate prediction with a minimal error margin.


How Different Models "Think": Feature Importance

To further analyze the performance differences, I examined the feature importance metrics for each model. This visualization reveals how different algorithms prioritize the underlying data.

(See the feature importance comparison below)

Feature Importance Comparison

Model Interpretation:

  1. Linear Regression (Left): This model evaluates data using static coefficients. The chart shows it assigned high weights to rare, low-production manufacturers (e.g., Bugatti, Spyker, Maybach). This indicates the linear model attempted to memorize specific outliers to correct its mathematical errors, rather than learning broad mechanical rules.
  2. Random Forest (Center): This ensemble model evaluates data through multiple decision trees, distributing its logic across various features. It prioritized core mechanical characteristics like Engine Cylinders and Engine HP, while also utilizing the engineered Distance_to_Centroid. This demonstrates an ability to generalize based on physical specifications rather than manufacturer labels.
  3. Gradient Boosting (Right): This algorithm builds trees sequentially, specifically to correct errors from previous iterations. It exhibits a highly concentrated feature importance profile, relying heavily on the most impactful variables (Engine Fuel Type_electric, Engine Cylinders, and Cluster_ID_Cluster 1) while ignoring less relevant data points.
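The tree-based importances referenced above are read straight off the fitted estimator. A minimal sketch on synthetic regression data (the feature count and model settings are illustrative):

```python
# Sketch: read feature_importances_ from a fitted tree ensemble.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One non-negative weight per feature; the weights sum to 1, so they are
# directly comparable across features (unlike raw linear coefficients).
importances = rf.feature_importances_
```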

7. Reframing the Problem: Regression to Classification

To approach the predictive modeling from a different angle, I reframed the original regression problem (predicting continuous Highway MPG) into a discrete classification task.

Target Conversion Strategy

I chose the Quantile Binning (3 Classes) strategy. Converting the continuous MPG target into exactly equal thirds (Low, Average, and High efficiency) prevents the creation of imbalanced classes, which is a common issue when using arbitrary business rule thresholds. This ensures my upcoming classification models will have sufficient and balanced examples to learn the distinct mechanical characteristics of each efficiency tier without bias towards a majority class.
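Quantile binning maps directly onto `pandas.qcut`. A minimal sketch with made-up MPG values (nine distinct values split into perfect thirds; the real data has ties at the boundaries, hence the minor deviations noted below):

```python
# Sketch: 3-class quantile binning of a continuous MPG target.
import pandas as pd

mpg = pd.Series([18, 21, 24, 27, 30, 33, 36, 39, 42])

# Equal-frequency bins: Low / Average / High efficiency tiers.
tier = pd.qcut(mpg, q=3, labels=["Low", "Average", "High"])
counts = tier.value_counts()
```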

Class Balance Verification

After applying the quantile conversion, I examined the resulting dataset to verify the integrity of the split. As expected with this methodology, the classes are well-balanced. Minor deviations from an exact 33.3% split are due to natural ties in the discrete numerical MPG values occurring exactly at the threshold boundaries.

(See the class distribution chart below)

Class Balance Distribution


Theoretical Evaluation Metrics

Before evaluating the models, I defined the critical success metrics based on the context of this dataset:

  • Precision vs. Recall: In the context of predicting vehicle fuel efficiency, Precision is more important than Recall. Prioritizing recall might correctly identify every efficient car, but it would also mislabel many average cars as highly efficient to cast a wider net. For consumers relying on these predictions to save money on fuel, the absolute reliability of the "High Efficiency" label (Precision) is much more critical.
  • False Positive vs. False Negative: A False Positive is significantly more critical. This occurs when the model classifies a gas guzzler as "High Efficiency," directly misleading the consumer and resulting in unexpectedly high fuel costs. A False Negative only means an efficient car was mistakenly labeled as average, causing no active financial damage to the user.

Classification Model Comparison

I trained three distinct ensemble classification models to categorize the vehicles: Extra Trees, HistGradientBoosting, and AdaBoost. Performance was measured using Accuracy and Macro F1-Score to ensure balanced evaluation across all three efficiency tiers.

(See the performance comparison below)

Classification Model Comparison

Performance Analysis:

  • Extra Trees (Best Performer): This model achieved the highest Macro F1-Score (0.972). By utilizing highly randomized node splits, Extra Trees effectively minimized variance and prevented overfitting. It successfully mapped the complex, non-linear relationships and engineered features (like Distance_to_Centroid) without memorizing noise.
  • HistGradientBoosting: Performing closely behind (0.951 Macro F1-Score), this model utilized histogram-based binning to process the continuous numerical features efficiently, proving to be a highly stable gradient boosting solution for this dataset.
  • AdaBoost (Underperformer): The AdaBoost model underperformed significantly (0.773 Macro F1-Score). AdaBoost builds models sequentially, heavily penalizing and focusing on previously misclassified instances. In a dataset containing dense boundaries between quantile bins and specific outliers, AdaBoost becomes highly sensitive to noise. It likely overfitted on the hard-to-classify vehicles situated exactly on the threshold lines between the Low, Average, and High efficiency classes. Additionally, standard AdaBoost relies on shallow decision "stumps," which lacked the depth required to capture the underlying mechanical groupings that the full trees utilized effectively.

8. Project Conclusion & Personal Takeaways

Building this project from raw data to predictive modeling was a profound learning experience. Rather than just treating machine learning algorithms as "black boxes," I gained a much deeper understanding of how these models actually operate under the hood and how to strategically apply them to real-world problems.

Key Takeaways:

  • Understanding Algorithm Logic: I learned how to interpret the decision-making processes of different models, seeing how linear models rely on static weights while tree-based models build dynamic, non-linear decision paths.
  • The Necessity of Feature Engineering: I saw firsthand how creating intelligent, mathematically sound features (like applying Unsupervised K-Means Clustering to find mechanical prototypes) is often much more impactful than the choice of the algorithm itself.
  • Complexity Isn't Always the Answer (The MLP Failure): One of my most valuable lessons came from a complete model failure. During the classification phase, I initially attempted to train a Multi-layer Perceptron (MLP) Neural Network. Because neural networks are extremely sensitive to unscaled data and varying numerical ranges, the model completely collapsed and failed to learn the efficiency patterns. This real-world failure taught me that for structured, tabular data, robust tree-based ensembles (like Extra Trees) are often far superior to forcing a deep learning approach.
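The scaling sensitivity described in the MLP takeaway has a standard remedy worth sketching, though whether it would have rescued the project's MLP is an assumption: wrap the network in a pipeline that standardizes features first.

```python
# Sketch: an MLP preceded by StandardScaler, so wildly different numeric
# ranges (e.g. MSRP in the millions vs. cylinder counts) are normalized
# before they reach the network.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1_000_000   # mimic an unscaled price-like column

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(max_iter=500, random_state=0))
mlp.fit(X, y)
acc = mlp.score(X, y)
```

Even with scaling in place, the broader lesson stands: on structured tabular data, the tree ensembles above matched or beat the neural approach with far less tuning.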

Ultimately, this project bridged the gap between theoretical data science and practical, machine learning application.
