
πŸš€ Startup Funding Analysis – Regression & Classification

πŸŽ₯ Project Presentation Video

πŸ“Œ Project Overview

This project analyzes startup funding data with the goal of understanding the drivers of startup success and building predictive models that estimate funding outcomes.
The project combines exploratory data analysis, feature engineering, clustering, and both regression and classification models to extract actionable insights for investors and analysts.

🎯 Project Objectives

  • Explore and understand patterns in startup funding data.
  • Engineer meaningful features that improve predictive performance.
  • Use clustering to uncover latent startup profiles.
  • Predict funding amounts using regression models.
  • Reframe the problem as a classification task (high-funded vs. low-funded startups).
  • Compare multiple classification models and select the best-performing one.

πŸ“‚ Dataset

The dataset contains information about startups, including:

  • Funding rounds and total funding raised
  • Company characteristics and operational history
  • Temporal and aggregated funding features

Basic preprocessing and cleaning steps were applied before modeling.

❓ Key Questions & Answers

Q1: What factors are most associated with higher startup funding?
Startups with more funding rounds, longer operating history, and later-stage investments tend to receive higher total funding.

Q2: Can startups be meaningfully grouped using unsupervised learning?
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.

Q3: Does feature engineering improve model performance?
Yes. Aggregated funding features and cluster-based features significantly improved both regression performance and classification F1-scores.

Q4: Is reframing the problem as classification useful?
Absolutely. Classifying startups into high-funded vs. low-funded provides a practical screening tool for investment decision-making.

πŸ” Exploratory Data Analysis (EDA)

EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.

Funding Distribution

*(plot image: distribution of startup funding amounts)*

Key insights:

  • Funding amounts are highly right-skewed.
  • Log transformations are beneficial for regression modeling.
  • A small subset of startups accounts for most of the total funding.
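The log transformation mentioned above can be sketched as follows. This is a minimal illustration with made-up funding values (the real dataset's column names and amounts differ); `np.log1p` is used because it handles zero values safely.

```python
import numpy as np
import pandas as pd

# Hypothetical funding amounts in USD; like the real data, highly right-skewed
funding = pd.Series([50_000, 120_000, 300_000, 1_500_000, 40_000_000])

# log1p compresses the long right tail and is safe for zero values
log_funding = np.log1p(funding)

print(funding.skew())      # strongly positive skew on the raw scale
print(log_funding.skew())  # much closer to symmetric after the transform
```

Regression models trained on the log scale are far less dominated by the handful of very large raises.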

🧠 Feature Engineering & Clustering

Feature Engineering

Key feature engineering steps included:

  • Aggregating funding rounds and amounts.
  • Creating temporal features (startup age, years active).
  • Scaling numeric features.
  • Encoding categorical variables.
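The steps above can be sketched roughly as follows. Column names (`startup`, `round_amount`, `founded_year`, `round_year`) and the tiny example table are hypothetical, not the project's actual schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw funding-round table; one row per round
df = pd.DataFrame({
    "startup": ["A", "A", "B", "B", "B"],
    "round_amount": [1e5, 5e5, 2e5, 3e5, 1e6],
    "founded_year": [2015, 2015, 2018, 2018, 2018],
    "round_year": [2016, 2019, 2019, 2020, 2021],
})

# Aggregate per-startup funding features
agg = df.groupby("startup").agg(
    total_funding=("round_amount", "sum"),
    num_rounds=("round_amount", "count"),
    founded_year=("founded_year", "first"),
    last_round_year=("round_year", "max"),
).reset_index()

# Temporal feature: years active in funding markets
agg["years_active"] = agg["last_round_year"] - agg["founded_year"]

# Scale numeric features before clustering / modeling
num_cols = ["total_funding", "num_rounds", "years_active"]
agg[num_cols] = StandardScaler().fit_transform(agg[num_cols])
```

Categorical variables (industry, region, etc.) would additionally be one-hot or ordinal encoded before modeling.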

Clustering

K-Means clustering was applied to the scaled features to create a new categorical feature representing startup funding profiles.

Clustering Visualization

*(plot image: K-Means cluster visualization)*

Clustering helped distinguish between early-stage, growth-stage, and mature startups.
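A minimal sketch of the clustering step, using synthetic data in place of the real engineered features (the three-group structure here is constructed to mirror the early/growth/mature profiles the project found):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in features: [num_rounds, years_active, total_funding]
X = np.vstack([
    rng.normal([1, 1, 1e5], [0.5, 0.5, 2e4], size=(30, 3)),   # early-stage
    rng.normal([4, 5, 2e6], [1.0, 1.0, 5e5], size=(30, 3)),   # growth-stage
    rng.normal([8, 10, 5e7], [1.5, 2.0, 1e7], size=(30, 3)),  # mature
])
X_scaled = StandardScaler().fit_transform(X)

# Three clusters matching the profiles described above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# The cluster label becomes a new categorical feature for supervised models
print(np.bincount(labels))
```

Scaling before K-Means matters: without it, the funding column (in dollars) would dominate the Euclidean distances entirely.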

πŸ“ˆ Modeling

Regression

A regression model was trained to predict total funding amount.
The trained regression model was exported as a pickle file (`regression_model.pkl`).
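The train-and-export flow can be sketched as below. The choice of `RandomForestRegressor` and the synthetic features are illustrative assumptions, not necessarily the model used in the notebook; only the pickle export step is taken from the project description.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for engineered features and a (log-scale) funding target
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -0.5, 2.0, 0.3]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
print(f"R^2 on held-out data: {reg.score(X_test, y_test):.3f}")

# Export the trained model as a pickle file
with open("regression_model.pkl", "wb") as f:
    pickle.dump(reg, f)
```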

Regression-to-Classification

The continuous funding target was converted into a binary classification problem using a median split:

  • Class 0: Below median funding
  • Class 1: At or above median funding

This strategy ensured well-balanced classes.
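The median split can be expressed in a couple of lines; the funding values below are hypothetical:

```python
import pandas as pd

# Hypothetical total-funding column (USD)
funding = pd.Series([5e4, 2e5, 8e5, 3e6, 1e7, 4e7])

# Median split: class 1 = at or above median funding, class 0 = below
threshold = funding.median()
label = (funding >= threshold).astype(int)

print(label.value_counts())  # the split yields balanced classes by construction
```

Because the threshold is the median of the target itself, the two classes are balanced by construction, which removes the need for resampling or class weighting.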

βœ… Classification Models Trained

Three different classification models were trained and evaluated using the same engineered features:

  • Logistic Regression
  • Random Forest
  • Gradient Boosting
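A sketch of the three-model comparison, using scikit-learn's default implementations of each algorithm and a synthetic dataset standing in for the engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered startup features and binary target
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model on the same features and compare held-out F1-scores
for name, model in models.items():
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    print(f"{name}: F1 = {score:.3f}")
```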

πŸ“Š Evaluation

Model performance was evaluated using:

  • Precision
  • Recall
  • F1-score
  • Confusion matrices
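These metrics can all be produced with scikit-learn's reporting utilities. The labels and predictions below are made up purely to show the output shape:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions for the binary funding task
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]

# Rows = true classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred))

# Precision, recall, and F1-score per class
print(classification_report(y_true, y_pred))
```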

Gradient Boosting – Confusion Matrix

*(plot image: Gradient Boosting confusion matrix)*

Key Findings

  • Recall was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
  • Gradient Boosting achieved the best balance between precision and recall.
  • Most errors across models were false positives, aligning with the business goal.

πŸ† Winner Model

βœ… Gradient Boosting was selected as the final classification model.

Reasons:

  • Highest F1-score and recall.
  • Strong handling of non-linear relationships.
  • Best overall performance on unseen test data.

The trained classification model was exported as a pickle file (`classification_model.pkl`).

πŸ“¦ Repository Contents

  • notebook.ipynb – Full analysis and modeling pipeline
  • README.md – Project documentation
  • regression_model.pkl – Trained regression model
  • classification_model.pkl – Winning classification model
  • Plot images used in the README

πŸ” Lessons Learned & Reflections

  • Feature engineering had a greater impact than model choice alone.
  • Clustering enriched downstream supervised models.
  • Classification framing provided clearer business value than raw regression.
  • Careful metric selection (recall & F1-score) is crucial for decision-oriented ML tasks.

✨ Extra Work

  • Cluster-based feature engineering
  • Comparison between regression and classification formulations
  • Business-oriented evaluation and interpretation