Startup Funding Analysis – Regression & Classification
Project Presentation Video
Project Overview
This project analyzes startup funding data with the goal of understanding the drivers of startup success and building predictive models that estimate funding outcomes.
The project combines exploratory data analysis, feature engineering, clustering, and both regression and classification models to extract actionable insights for investors and analysts.
Project Objectives
- Explore and understand patterns in startup funding data.
- Engineer meaningful features that improve predictive performance.
- Use clustering to uncover latent startup profiles.
- Predict funding amounts using regression models.
- Reframe the problem as a classification task (high-funded vs. low-funded startups).
- Compare multiple classification models and select the best-performing one.
Dataset
The dataset contains information about startups, including:
- Funding rounds and total funding raised
- Company characteristics and operational history
- Temporal and aggregated funding features
Basic preprocessing and cleaning steps were applied before modeling.
Key Questions & Answers
Q1: What factors are most associated with higher startup funding?
Startups with more funding rounds, longer operating history, and later-stage investments tend to receive higher total funding.
Q2: Can startups be meaningfully grouped using unsupervised learning?
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.
Q3: Does feature engineering improve model performance?
Yes. Aggregated funding features and cluster-based features significantly improved both regression accuracy and classification F1-scores.
Q4: Is reframing the problem as classification useful?
Yes. Classifying startups into high-funded vs. low-funded provides a practical screening tool for investment decision-making.
Exploratory Data Analysis (EDA)
EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.
Funding Distribution
Key insights:
- Funding amounts are highly right-skewed.
- Log transformations are beneficial for regression modeling.
- A small subset of startups accounts for most of the total funding.
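The effect of the log transform on a skewed target can be sketched as follows. The data here is synthetic (a log-normal sample standing in for funding amounts); only the transform itself reflects the project's approach.

```python
import numpy as np

# Synthetic right-skewed "funding amounts"; real values come from the dataset.
rng = np.random.default_rng(42)
funding = rng.lognormal(mean=14, sigma=2, size=1000)

def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

# log1p compresses the long right tail, making the target closer to
# symmetric and easier for regression models to fit.
log_funding = np.log1p(funding)

raw_skew = skewness(funding)       # very large for a heavy right tail
log_skew = skewness(log_funding)   # close to zero after the transform
```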
Feature Engineering & Clustering
Feature Engineering
Key feature engineering steps included:
- Aggregating funding rounds and amounts.
- Creating temporal features (startup age, years active).
- Scaling numeric features.
- Encoding categorical variables.
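The steps above can be sketched with pandas and scikit-learn. The column names (`startup`, `round_amount`, `round_year`, `sector`) are illustrative placeholders, not the dataset's actual schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: one row per funding round per startup.
rounds = pd.DataFrame({
    "startup": ["A", "A", "B", "B", "B", "C"],
    "round_amount": [1e6, 5e6, 2e6, 4e6, 8e6, 0.5e6],
    "round_year": [2015, 2017, 2014, 2016, 2019, 2020],
    "sector": ["fintech", "fintech", "health", "health", "health", "retail"],
})

# Aggregate rounds per startup: totals, counts, and temporal span.
agg = rounds.groupby("startup").agg(
    total_funding=("round_amount", "sum"),
    n_rounds=("round_amount", "count"),
    first_year=("round_year", "min"),
    last_year=("round_year", "max"),
    sector=("sector", "first"),
)
agg["years_active"] = agg["last_year"] - agg["first_year"]

# Encode the categorical sector and scale the numeric features.
features = pd.get_dummies(
    agg.drop(columns=["first_year", "last_year"]), columns=["sector"]
)
numeric_cols = ["total_funding", "n_rounds", "years_active"]
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])
```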
Clustering
K-Means clustering was applied on scaled features to create a new categorical feature representing startup funding profiles.
Clustering Visualization
Clustering helped distinguish between early-stage, growth-stage, and mature startups.
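A minimal sketch of this step, using synthetic data with three planted profiles in place of the project's engineered features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with three distinct profiles, standing in for
# the engineered startup features (funding totals, round counts, age, ...).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 3)),  # early-stage-like profile
    rng.normal(3, 0.3, size=(50, 3)),  # growth-stage-like profile
    rng.normal(6, 0.3, size=(50, 3)),  # mature-like profile
])
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means on the scaled features and attach the cluster label
# as a new categorical feature for the supervised models.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_feature = km.fit_predict(X_scaled)
```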
Modeling
Regression
A regression model was trained to predict total funding amount.
The trained regression model was exported as a pickle file.
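The train-and-export flow can be sketched as below. The regressor choice and the synthetic data are assumptions for illustration; only the pickle export mirrors what the README describes.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and a (log-scale) funding target for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

reg = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Export the trained model as a pickle file.
with open("regression_model.pkl", "wb") as f:
    pickle.dump(reg, f)

# Reload and check that predictions match the in-memory model.
with open("regression_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
```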
Regression-to-Classification
The continuous funding target was converted into a binary classification problem using a median split:
- Class 0: Below median funding
- Class 1: At or above median funding
Because the split is at the median, the two classes are balanced by construction.
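The median split is a one-liner; the funding values below are made up for illustration:

```python
import numpy as np

# Hypothetical funding amounts; the real split uses the dataset's own median.
funding = np.array([1e5, 5e5, 1e6, 2e6, 8e6, 3e7, 1e8, 2e5])

median = np.median(funding)
# Class 1: at or above median funding; Class 0: below median.
label = (funding >= median).astype(int)
```

With an even number of samples this yields an exact 50/50 split, which is why no resampling was needed downstream.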
Classification Models Trained
Three different classification models were trained and evaluated using the same engineered features:
- Logistic Regression
- Random Forest
- Gradient Boosting
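A sketch of training the three models on the same feature matrix, using synthetic data in place of the engineered features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic features and median-split-style labels for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

# All three models see identical train/test splits and features,
# so their scores are directly comparable.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=2),
    "Gradient Boosting": GradientBoostingClassifier(random_state=2),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```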
Evaluation
Model performance was evaluated using:
- Precision
- Recall
- F1-score
- Confusion matrices
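These metrics can be computed with scikit-learn as below; the labels and predictions are hand-made toy values, not the project's actual results.

```python
import numpy as np
from sklearn.metrics import (
    precision_score, recall_score, f1_score, confusion_matrix
)

# Toy true labels and predictions for a binary high-/low-funded task.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
cm = confusion_matrix(y_true, y_pred)        # rows: true class, cols: predicted
```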
Gradient Boosting – Confusion Matrix
Key Findings
- Recall was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
- Gradient Boosting achieved the best balance between precision and recall.
- Most errors across models were false positives, aligning with the business goal.
Winner Model
Gradient Boosting was selected as the final classification model.
Reasons:
- Highest F1-score and recall.
- Strong handling of non-linear relationships.
- Best overall performance on unseen test data.
The trained classification model was exported as a pickle file.
Repository Contents
- notebook.ipynb – Full analysis and modeling pipeline
- README.md – Project documentation
- regression_model.pkl – Trained regression model
- classification_model.pkl – Winning classification model
- Plot images used in the README
Lessons Learned & Reflections
- Feature engineering had a greater impact than model choice alone.
- Clustering enriched downstream supervised models.
- Classification framing provided clearer business value than raw regression.
- Careful metric selection (recall & F1-score) is crucial for decision-oriented ML tasks.
Extra Work
- Cluster-based feature engineering
- Comparison between regression and classification formulations
- Business-oriented evaluation and interpretation