# πŸš€ Startup Funding Analysis – Regression & Classification
## πŸŽ₯ Project Presentation Video
<video controls width="700">
<source src="https://huggingface.co/danadvash/startupfunding/resolve/main/Assignment%202%20-%20Dana%20Dvash%20Intro%20to%20Data%20Science%20(1).mp4" type="video/mp4" />
Your browser does not support the video tag.
</video>
## πŸ“Œ Project Overview
This project analyzes startup funding data with the goal of understanding the drivers of startup success and building predictive models that estimate funding outcomes.
The project combines **exploratory data analysis**, **feature engineering**, **clustering**, and both **regression** and **classification models** to extract actionable insights for investors and analysts.
## 🎯 Project Objectives
- Explore and understand patterns in startup funding data.
- Engineer meaningful features that improve predictive performance.
- Use **clustering** to uncover latent startup profiles.
- Predict funding amounts using regression models.
- Reframe the problem as a **classification task** (high-funded vs. low-funded startups).
- Compare multiple classification models and select the best-performing one.
## πŸ“‚ Dataset
The dataset contains information about startups, including:
- Funding rounds and total funding raised
- Company characteristics and operational history
- Temporal and aggregated funding features
Basic preprocessing and cleaning steps were applied before modeling.
## ❓ Key Questions & Answers
**Q1: What factors are most associated with higher startup funding?**
Startups with more funding rounds, longer operating history, and later-stage investments tend to receive higher total funding.
**Q2: Can startups be meaningfully grouped using unsupervised learning?**
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.
**Q3: Does feature engineering improve model performance?**
Yes. Aggregated funding features and cluster-based features significantly improved both regression accuracy and classification F1-scores.
**Q4: Is reframing the problem as classification useful?**
Absolutely. Classifying startups into high-funded vs. low-funded provides a practical screening tool for investment decision-making.
## πŸ” Exploratory Data Analysis (EDA)
EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.
### Funding Distribution
![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/nmblWzAYvGE1w66orTDyP.png)
Key insights:
- Funding amounts are highly right-skewed.
- Log transformations are beneficial for regression modeling.
- A small subset of startups accounts for most of the total funding.
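The log-transform insight above can be sketched as follows (the column name `total_funding` and the sample values are illustrative assumptions, not the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical funding amounts (USD) with the heavy right tail typical of this data
df = pd.DataFrame({"total_funding": [50_000, 120_000, 300_000, 1_500_000, 95_000_000]})

# log1p compresses the long right tail and handles zero values safely
df["log_funding"] = np.log1p(df["total_funding"])

# Skewness drops sharply after the transform
print(df["total_funding"].skew(), df["log_funding"].skew())
```

Regression models trained on the log-scaled target are less dominated by the few extreme outliers.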
## 🧠 Feature Engineering & Clustering
### Feature Engineering
Key feature engineering steps included:
- Aggregating funding rounds and amounts.
- Creating temporal features (startup age, years active).
- Scaling numeric features.
- Encoding categorical variables.
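The steps above can be sketched with pandas and scikit-learn (column names and values are assumptions standing in for the real schema):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical round-level records; the real dataset's columns may differ
rounds = pd.DataFrame({
    "company": ["A", "A", "B", "B", "B"],
    "amount":  [1e6, 4e6, 2e6, 3e6, 5e6],
    "year":    [2015, 2017, 2014, 2016, 2018],
})

# Aggregate funding rounds and amounts per company
agg = rounds.groupby("company").agg(
    n_rounds=("amount", "size"),
    total_funding=("amount", "sum"),
    first_year=("year", "min"),
    last_year=("year", "max"),
).reset_index()

# Temporal feature: years active
agg["years_active"] = agg["last_year"] - agg["first_year"]

# Scale numeric features for distance-based methods like K-Means
num_cols = ["n_rounds", "total_funding", "years_active"]
agg[num_cols] = StandardScaler().fit_transform(agg[num_cols])
print(agg)
```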
### Clustering
K-Means clustering was applied on scaled features to create a new categorical feature representing startup funding profiles.
### Clustering Visualization
![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/qT_PjNaX2W5HdaV_Wg60N.png)
Clustering helped distinguish between early-stage, growth-stage, and mature startups.
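A minimal sketch of the cluster-feature idea, assuming k=3 to match the three profiles described (the feature layout and synthetic data are illustrative, not the project's actual inputs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical engineered features: [n_rounds, log_total_funding, years_active]
X = np.vstack([
    rng.normal([1, 12, 1], 0.3, size=(50, 3)),    # early-stage profile
    rng.normal([4, 15, 5], 0.3, size=(50, 3)),    # growth-stage profile
    rng.normal([8, 18, 10], 0.3, size=(50, 3)),   # mature / highly funded profile
])

# Scale first so no single feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# The labels become a new categorical feature for the downstream supervised models
print(np.bincount(labels))
```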
## πŸ“ˆ Modeling
### Regression
A regression model was trained to predict total funding amount.
The trained regression model was exported as a pickle file.
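A sketch of the regression step on synthetic data (the README does not name the regressor, so the model choice and the feature/target construction here are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic features standing in for [n_rounds, years_active]
X = rng.uniform(0, 10, size=(300, 2))
# Log-scale funding target as a simple function of the features plus noise
y = 12 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.2, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", r2_score(y_te, reg.predict(X_te)))
```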
### Regression-to-Classification
The continuous funding target was converted into a binary classification problem using a **median split**:
- Class 0: Below median funding
- Class 1: At or above median funding
By construction, a median split yields approximately balanced classes.
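The median split can be sketched as (variable names and values are illustrative):

```python
import pandas as pd

# Hypothetical total-funding amounts (USD)
funding = pd.Series([1e5, 5e5, 1e6, 3e6, 8e6, 2e7])

# Class 1: at or above the median; Class 0: below it
threshold = funding.median()
y = (funding >= threshold).astype(int)

# A median split is balanced by construction
print(y.value_counts())
```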
## βœ… Classification Models Trained
Three different classification models were trained and evaluated using the same engineered features:
- Logistic Regression
- Random Forest
- Gradient Boosting
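The three-model comparison can be sketched as follows, with synthetic data standing in for the engineered feature matrix (hyperparameters here are defaults, not the project's tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and median-split labels
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model on the same split and compare F1 on held-out data
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```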
## πŸ“Š Evaluation
Model performance was evaluated using:
- Precision
- Recall
- F1-score
- Confusion matrices
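These metrics can be computed with scikit-learn; the toy labels below are purely for illustration:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy true/predicted labels for a high-funded (1) vs. low-funded (0) split
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 0.75
print("f1:       ", f1_score(y_true, y_pred))         # 0.75
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted class
```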
### Gradient Boosting – Confusion Matrix
![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/CXRohrGpv9sLBRWCxlCBE.png)
### Key Findings
- **Recall** was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
- Gradient Boosting achieved the best balance between precision and recall.
- Most errors across models were false positives rather than false negatives, which is acceptable under the recall-oriented business goal.
## πŸ† Winner Model
βœ… **Gradient Boosting** was selected as the final classification model.
Reasons:
- Highest F1-score and recall.
- Strong handling of non-linear relationships.
- Best overall performance on unseen test data.
The trained classification model was exported as a pickle file.
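Exporting and reloading the model via pickle can be sketched as follows (the filename matches the repository contents list; the training data here is a synthetic stand-in):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the engineered training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Serialize the trained model
with open("classification_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: reload and predict without retraining
with open("classification_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:5]))
```

For scikit-learn models, `joblib.dump`/`joblib.load` is a commonly used alternative to plain pickle.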
## πŸ“¦ Repository Contents
- `notebook.ipynb` – Full analysis and modeling pipeline
- `README.md` – Project documentation
- `regression_model.pkl` – Trained regression model
- `classification_model.pkl` – Winning classification model
- Plot images used in the README
## πŸ” Lessons Learned & Reflections
- Feature engineering had a greater impact than model choice alone.
- Clustering enriched downstream supervised models.
- Classification framing provided clearer business value than raw regression.
- Careful metric selection (recall & F1-score) is crucial for decision-oriented ML tasks.
## ✨ Extra Work
- Cluster-based feature engineering
- Comparison between regression and classification formulations
- Business-oriented evaluation and interpretation