# πŸš€ Startup Funding Analysis – Regression & Classification
## πŸŽ₯ Project Presentation Video
<video controls width="700">
<source src="https://huggingface.co/danadvash/startupfunding/resolve/main/Assignment%202%20-%20Dana%20Dvash%20Intro%20to%20Data%20Science%20(1).mp4" type="video/mp4" />
Your browser does not support the video tag.
</video>
## πŸ“Œ Project Overview
This project analyzes startup funding data with the goal of understanding the drivers of startup success and building predictive models that estimate funding outcomes.
The project combines **exploratory data analysis**, **feature engineering**, **clustering**, and both **regression** and **classification models** to extract actionable insights for investors and analysts.
## 🎯 Project Objectives
- Explore and understand patterns in startup funding data.
- Engineer meaningful features that improve predictive performance.
- Use **clustering** to uncover latent startup profiles.
- Predict funding amounts using regression models.
- Reframe the problem as a **classification task** (high-funded vs. low-funded startups).
- Compare multiple classification models and select the best-performing one.
## πŸ“‚ Dataset
The dataset contains information about startups, including:
- Funding rounds and total funding raised
- Company characteristics and operational history
- Temporal and aggregated funding features
Basic preprocessing and cleaning steps were applied before modeling.
## ❓ Key Questions & Answers
**Q1: What factors are most associated with higher startup funding?**
Startups with more funding rounds, longer operating history, and later-stage investments tend to receive higher total funding.
**Q2: Can startups be meaningfully grouped using unsupervised learning?**
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.
**Q3: Does feature engineering improve model performance?**
Yes. Aggregated funding features and cluster-based features significantly improved both regression accuracy and classification F1-scores.
**Q4: Is reframing the problem as classification useful?**
Absolutely. Classifying startups into high-funded vs. low-funded provides a practical screening tool for investment decision-making.
## πŸ” Exploratory Data Analysis (EDA)
EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.
### Funding Distribution
![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/nmblWzAYvGE1w66orTDyP.png)
Key insights:
- Funding amounts are highly right-skewed.
- Log transformations are beneficial for regression modeling.
- A small subset of startups accounts for most of the total funding.
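The log-transform insight above can be sketched as follows (the column name `total_funding` and the sample values are illustrative assumptions, not the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical funding amounts (USD) with the heavy right tail typical of this data
df = pd.DataFrame({"total_funding": [50_000, 120_000, 300_000, 1_500_000, 95_000_000]})

# log1p compresses the long right tail and handles zero values safely
df["log_funding"] = np.log1p(df["total_funding"])

# Skewness drops sharply after the transform
print(df["total_funding"].skew(), df["log_funding"].skew())
```

Regression models trained on the log-scaled target are less dominated by the few extreme outliers.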
## 🧠 Feature Engineering & Clustering
### Feature Engineering
Key feature engineering steps included:
- Aggregating funding rounds and amounts.
- Creating temporal features (startup age, years active).
- Scaling numeric features.
- Encoding categorical variables.
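The steps above can be sketched with pandas and scikit-learn (column names and values are assumptions standing in for the real schema):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical round-level records; the real dataset's columns may differ
rounds = pd.DataFrame({
    "company": ["A", "A", "B", "B", "B"],
    "amount":  [1e6, 4e6, 2e6, 3e6, 5e6],
    "year":    [2015, 2017, 2014, 2016, 2018],
})

# Aggregate funding rounds and amounts per company
agg = rounds.groupby("company").agg(
    n_rounds=("amount", "size"),
    total_funding=("amount", "sum"),
    first_year=("year", "min"),
    last_year=("year", "max"),
).reset_index()

# Temporal feature: years active
agg["years_active"] = agg["last_year"] - agg["first_year"]

# Scale numeric features for distance-based methods like K-Means
num_cols = ["n_rounds", "total_funding", "years_active"]
agg[num_cols] = StandardScaler().fit_transform(agg[num_cols])
print(agg)
```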
### Clustering
K-Means clustering was applied on scaled features to create a new categorical feature representing startup funding profiles.
### Clustering Visualization
![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/qT_PjNaX2W5HdaV_Wg60N.png)
Clustering helped distinguish between early-stage, growth-stage, and mature startups.
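A minimal sketch of the cluster-feature idea, assuming k=3 to match the three profiles described (the feature layout and synthetic data are illustrative, not the project's actual inputs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical engineered features: [n_rounds, log_total_funding, years_active]
X = np.vstack([
    rng.normal([1, 12, 1], 0.3, size=(50, 3)),    # early-stage profile
    rng.normal([4, 15, 5], 0.3, size=(50, 3)),    # growth-stage profile
    rng.normal([8, 18, 10], 0.3, size=(50, 3)),   # mature / highly funded profile
])

# Scale first so no single feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# The labels become a new categorical feature for the downstream supervised models
print(np.bincount(labels))
```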
## πŸ“ˆ Modeling
### Regression
A regression model was trained to predict total funding amount.
The trained regression model was exported as a pickle file.
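A sketch of the regression step on synthetic data (the README does not name the regressor, so the model choice and the feature/target construction here are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic features standing in for [n_rounds, years_active]
X = rng.uniform(0, 10, size=(300, 2))
# Log-scale funding target as a simple function of the features plus noise
y = 12 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.2, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", r2_score(y_te, reg.predict(X_te)))
```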
### Regression-to-Classification
The continuous funding target was converted into a binary classification problem using a **median split**:
- Class 0: Below median funding
- Class 1: At or above median funding
By construction, a median split yields approximately balanced classes.
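The median split can be sketched as (variable names and values are illustrative):

```python
import pandas as pd

# Hypothetical total-funding amounts (USD)
funding = pd.Series([1e5, 5e5, 1e6, 3e6, 8e6, 2e7])

# Class 1: at or above the median; Class 0: below it
threshold = funding.median()
y = (funding >= threshold).astype(int)

# A median split is balanced by construction
print(y.value_counts())
```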
## βœ… Classification Models Trained
Three different classification models were trained and evaluated using the same engineered features:
- Logistic Regression
- Random Forest
- Gradient Boosting
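The three-model comparison can be sketched as follows, with synthetic data standing in for the engineered feature matrix (hyperparameters here are defaults, not the project's tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and median-split labels
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model on the same split and compare F1 on held-out data
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```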
## πŸ“Š Evaluation
Model performance was evaluated using:
- Precision
- Recall
- F1-score
- Confusion matrices
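These metrics can be computed with scikit-learn; the toy labels below are purely for illustration:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy true/predicted labels for a high-funded (1) vs. low-funded (0) split
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 0.75
print("f1:       ", f1_score(y_true, y_pred))         # 0.75
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted class
```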
### Gradient Boosting – Confusion Matrix
![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/CXRohrGpv9sLBRWCxlCBE.png)
### Key Findings
- **Recall** was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
- Gradient Boosting achieved the best balance between precision and recall.
- Most errors across models were false positives rather than false negatives, which is acceptable under the recall-oriented business goal.
## πŸ† Winner Model
βœ… **Gradient Boosting** was selected as the final classification model.
Reasons:
- Highest F1-score and recall.
- Strong handling of non-linear relationships.
- Best overall performance on unseen test data.
The trained classification model was exported as a pickle file.
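Exporting and reloading the model via pickle can be sketched as follows (the filename matches the repository contents list; the training data here is a synthetic stand-in):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the engineered training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Serialize the trained model
with open("classification_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: reload and predict without retraining
with open("classification_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:5]))
```

For scikit-learn models, `joblib.dump`/`joblib.load` is a commonly used alternative to plain pickle.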
## πŸ“¦ Repository Contents
- `notebook.ipynb` – Full analysis and modeling pipeline
- `README.md` – Project documentation
- `regression_model.pkl` – Trained regression model
- `classification_model.pkl` – Winning classification model
- Plot images used in the README
## πŸ” Lessons Learned & Reflections
- Feature engineering had a greater impact than model choice alone.
- Clustering enriched downstream supervised models.
- Classification framing provided clearer business value than raw regression.
- Careful metric selection (recall & F1-score) is crucial for decision-oriented ML tasks.
## ✨ Extra Work
- Cluster-based feature engineering
- Comparison between regression and classification formulations
- Business-oriented evaluation and interpretation