# 🚀 Startup Funding Analysis – Regression & Classification

## 🎥 Project Presentation Video

## 📌 Project Overview

This project analyzes startup funding data to understand the drivers of startup success and to build predictive models that estimate funding outcomes. It combines **exploratory data analysis**, **feature engineering**, **clustering**, and both **regression** and **classification models** to extract actionable insights for investors and analysts.

## 🎯 Project Objectives

- Explore and understand patterns in startup funding data.
- Engineer meaningful features that improve predictive performance.
- Use **clustering** to uncover latent startup profiles.
- Predict funding amounts using regression models.
- Reframe the problem as a **classification task** (high-funded vs. low-funded startups).
- Compare multiple classification models and select the best-performing one.

## 📂 Dataset

The dataset contains information about startups, including:

- Funding rounds and total funding raised
- Company characteristics and operational history
- Temporal and aggregated funding features

Basic preprocessing and cleaning steps were applied before modeling.

## ❓ Key Questions & Answers

**Q1: What factors are most associated with higher startup funding?**
Startups with more funding rounds, longer operating histories, and later-stage investments tend to receive higher total funding.

**Q2: Can startups be meaningfully grouped using unsupervised learning?**
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.

**Q3: Does feature engineering improve model performance?**
Yes. Aggregated funding features and cluster-based features significantly improved both regression accuracy and classification F1-scores.

**Q4: Is reframing the problem as classification useful?**
Absolutely. Classifying startups as high-funded vs. low-funded provides a practical screening tool for investment decision-making.
## 🔍 Exploratory Data Analysis (EDA)

EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.

### Funding Distribution

![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/nmblWzAYvGE1w66orTDyP.png)

Key insights:

- Funding amounts are highly right-skewed.
- Log transformations are beneficial for regression modeling.
- A small subset of startups accounts for most of the total funding.

## 🧠 Feature Engineering & Clustering

### Feature Engineering

Key feature engineering steps included:

- Aggregating funding rounds and amounts.
- Creating temporal features (startup age, years active).
- Scaling numeric features.
- Encoding categorical variables.

### Clustering

K-Means clustering was applied to the scaled features to create a new categorical feature representing startup funding profiles.

### Clustering Visualization

![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/qT_PjNaX2W5HdaV_Wg60N.png)

Clustering helped distinguish between early-stage, growth-stage, and mature startups.

## 📈 Modeling

### Regression

A regression model was trained to predict the total funding amount. The trained regression model was exported as a pickle file.

### Regression-to-Classification

The continuous funding target was converted into a binary classification problem using a **median split**:

- Class 0: Below median funding
- Class 1: At or above median funding

This strategy ensured well-balanced classes.
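The scale-then-cluster step described above can be sketched as follows. This is a minimal illustration using synthetic data; the feature columns and `n_clusters=3` are assumptions for demonstration, not values taken from the project notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical numeric features standing in for the engineered ones
# (e.g. funding rounds, startup age, log of total funding)
X = rng.normal(size=(300, 3))

# Scale features so no single feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# K-Means assigns each startup a cluster label, which becomes
# a new categorical feature for the downstream supervised models
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
```

Scaling before K-Means matters because the algorithm relies on Euclidean distances; without it, features with larger raw ranges (such as funding amounts) would dominate the clustering.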
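The median-split reframing can be sketched in a few lines. The funding values below are synthetic (a log-normal draw mimicking the right-skewed distribution noted in the EDA); only the split logic reflects the approach described above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed funding amounts (log-normal stand-in)
funding = rng.lognormal(mean=15.0, sigma=1.5, size=1000)

threshold = np.median(funding)
# Class 1: at or above median funding; Class 0: below
labels = (funding >= threshold).astype(int)
```

Because the threshold is the sample median, the two classes are balanced by construction, which avoids the class-imbalance issues a fixed dollar cutoff could introduce.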
## ✅ Classification Models Trained

Three classification models were trained and evaluated using the same engineered features:

- Logistic Regression
- Random Forest
- Gradient Boosting

## 📊 Evaluation

Model performance was evaluated using:

- Precision
- Recall
- F1-score
- Confusion matrices

### Gradient Boosting – Confusion Matrix

![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/CXRohrGpv9sLBRWCxlCBE.png)

### Key Findings

- **Recall** was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
- Gradient Boosting achieved the best balance between precision and recall.
- Most errors across models were false positives, which aligns with the business goal.

## 🏆 Winner Model

✅ **Gradient Boosting** was selected as the final classification model.

Reasons:

- Highest F1-score and recall.
- Strong handling of non-linear relationships.
- Best overall performance on unseen test data.

The trained classification model was exported as a pickle file.

## 📦 Repository Contents

- `notebook.ipynb` – Full analysis and modeling pipeline
- `README.md` – Project documentation
- `regression_model.pkl` – Trained regression model
- `classification_model.pkl` – Winning classification model
- Plot images used in the README

## 🔁 Lessons Learned & Reflections

- Feature engineering had a greater impact than model choice alone.
- Clustering enriched downstream supervised models.
- Classification framing provided clearer business value than raw regression.
- Careful metric selection (recall and F1-score) is crucial for decision-oriented ML tasks.

## ✨ Extra Work

- Cluster-based feature engineering
- Comparison between regression and classification formulations
- Business-oriented evaluation and interpretation
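The train-compare-export workflow described above can be sketched as follows. This is a hedged example on synthetic data, not the project's actual pipeline: the feature matrix, hyperparameters, and the `classification_model.pkl` filename (which matches the repository contents list) are illustrative assumptions:

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered startup features and binary labels
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The three model families compared in the project
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and record F1-score and recall on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "f1": f1_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }

# Select the winner by F1-score and export it as a pickle file
best = max(scores, key=lambda n: scores[n]["f1"])
with open("classification_model.pkl", "wb") as f:
    pickle.dump(models[best], f)
```

Ranking by F1-score while also tracking recall mirrors the evaluation priorities above: recall guards against missing high-funded startups, while F1 keeps false positives in check.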