# Startup Funding Analysis – Regression & Classification
## Project Presentation Video

<video controls width="700">
  <source src="https://huggingface.co/danadvash/startupfunding/resolve/main/Assignment%202%20-%20Dana%20Dvash%20Intro%20to%20Data%20Science%20(1).mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>
## Project Overview

This project analyzes startup funding data with the goal of understanding the drivers of startup success and building predictive models that estimate funding outcomes.

The project combines **exploratory data analysis**, **feature engineering**, **clustering**, and both **regression** and **classification models** to extract actionable insights for investors and analysts.
## Project Objectives

- Explore and understand patterns in startup funding data.
- Engineer meaningful features that improve predictive performance.
- Use **clustering** to uncover latent startup profiles.
- Predict funding amounts using regression models.
- Reframe the problem as a **classification task** (high-funded vs. low-funded startups).
- Compare multiple classification models and select the best-performing one.
## Dataset

The dataset contains information about startups, including:

- Funding rounds and total funding raised
- Company characteristics and operational history
- Temporal and aggregated funding features

Basic preprocessing and cleaning steps were applied before modeling.
## Key Questions & Answers

**Q1: What factors are most associated with higher startup funding?**
Startups with more funding rounds, longer operating history, and later-stage investments tend to receive higher total funding.

**Q2: Can startups be meaningfully grouped using unsupervised learning?**
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.

**Q3: Does feature engineering improve model performance?**
Yes. Aggregated funding features and cluster-based features significantly improved both regression accuracy and classification F1-scores.

**Q4: Is reframing the problem as classification useful?**
Yes. Classifying startups into high-funded vs. low-funded groups provides a practical screening tool for investment decision-making.
## Exploratory Data Analysis (EDA)

EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.

### Funding Distribution

Key insights:

- Funding amounts are highly right-skewed.
- Log transformations are beneficial for regression modeling.
- A small subset of startups accounts for most of the total funding.
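The effect of a log transform on skewed funding amounts can be sketched as follows. The values here are hypothetical stand-ins for the real funding column, not data from the project:

```python
import numpy as np
import pandas as pd

# Hypothetical funding amounts illustrating a heavy right tail
funding = pd.Series([50_000, 120_000, 300_000, 1_500_000, 95_000_000], dtype=float)

# log1p compresses the long right tail and handles zero amounts safely
log_funding = np.log1p(funding)

print(f"raw skew: {funding.skew():.2f}, log skew: {log_funding.skew():.2f}")
```

After the transform the distribution is much closer to symmetric, which is why log-scale targets tend to suit linear regression models better.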
## Feature Engineering & Clustering

### Feature Engineering

Key feature engineering steps included:

- Aggregating funding rounds and amounts.
- Creating temporal features (startup age, years active).
- Scaling numeric features.
- Encoding categorical variables.
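The steps above can be sketched with pandas and scikit-learn. The toy round-level table and its column names (`company`, `amount`, `year`, `category`) are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one row per funding round
rounds = pd.DataFrame({
    "company":  ["A", "A", "B", "B", "B"],
    "amount":   [1e6, 3e6, 5e5, 2e6, 4e6],
    "year":     [2015, 2017, 2014, 2016, 2019],
    "category": ["fintech", "fintech", "health", "health", "health"],
})

# Aggregate rounds and amounts per company
features = rounds.groupby("company").agg(
    total_funding=("amount", "sum"),
    n_rounds=("amount", "count"),
    first_year=("year", "min"),
    last_year=("year", "max"),
    category=("category", "first"),
)

# Temporal feature: years active between first and last round
features["years_active"] = features["last_year"] - features["first_year"]

# Scale numeric features and one-hot encode the categorical one
numeric_cols = ["total_funding", "n_rounds", "years_active"]
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])
features = pd.get_dummies(features, columns=["category"])
```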
### Clustering

K-Means clustering was applied on scaled features to create a new categorical feature representing startup funding profiles.
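A minimal sketch of this step, assuming three clusters and a small made-up feature matrix (total funding, number of rounds, company age):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric startup features: [total funding, rounds, age in years]
X = np.array([
    [1e5, 1, 1], [2e5, 1, 2], [3e6, 3, 5],
    [5e6, 4, 6], [8e7, 8, 12], [1e8, 9, 10],
])

# Scale first: K-Means uses Euclidean distance, so raw dollar amounts
# would dominate the other features
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means and attach the cluster label as a new categorical feature
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print(labels)  # one funding-profile label per startup
```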
### Clustering Visualization

Clustering helped distinguish between early-stage, growth-stage, and mature startups.
## Modeling

### Regression

A regression model was trained to predict total funding amount.

The trained regression model was exported as a pickle file.
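The export/reload round trip can be sketched as below. The estimator type and the tiny feature matrix are placeholders, since the README does not specify which regressor was used; the log-scale target follows the EDA note above:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features and a log-transformed funding target
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 4.0]])
y = np.log1p(np.array([1e5, 5e5, 2e6, 1e6]))

model = LinearRegression().fit(X, y)

# Export the trained estimator as a pickle file
with open("regression_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload it later and verify the round trip preserves predictions
with open("regression_model.pkl", "rb") as f:
    reloaded = pickle.load(f)
```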
### Regression-to-Classification

The continuous funding target was converted into a binary classification problem using a **median split**:

- Class 0: Below median funding
- Class 1: At or above median funding

This strategy ensured well-balanced classes by construction.
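The median split can be sketched in a few lines; the funding values here are hypothetical:

```python
import pandas as pd

# Hypothetical total funding per startup
funding = pd.Series([90_000, 100_000, 500_000, 2_000_000, 8_000_000, 30_000_000])

median = funding.median()
label = (funding >= median).astype(int)  # 1 = at or above median, 0 = below

print(label.value_counts())  # roughly half the rows in each class
```

Because the threshold is the median itself, the two classes split near 50/50, which avoids the class-imbalance issues a fixed dollar cutoff could introduce.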
## Classification Models Trained

Three different classification models were trained and evaluated using the same engineered features:

- Logistic Regression
- Random Forest
- Gradient Boosting
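A minimal comparison loop over these three models might look like the following; `make_classification` stands in for the project's engineered feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered startup features and binary target
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model on the same split and compare F1-scores on held-out data
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = f1_score(y_test, clf.predict(X_test))

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```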
## Evaluation

Model performance was evaluated using:

- Precision
- Recall
- F1-score
- Confusion matrices
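These metrics come straight from scikit-learn; a small worked example with made-up labels (1 = high-funded):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions for eight startups
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print("f1:       ", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted
```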
### Gradient Boosting – Confusion Matrix

### Key Findings

- **Recall** was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
- Gradient Boosting achieved the best balance between precision and recall.
- Most errors across models were false positives, which is consistent with prioritizing recall.
## Winner Model

**Gradient Boosting** was selected as the final classification model.

Reasons:

- Highest F1-score and recall.
- Strong handling of non-linear relationships.
- Best overall performance on unseen test data.

The trained classification model was exported as a pickle file.
## Repository Contents

- `notebook.ipynb` – Full analysis and modeling pipeline
- `README.md` – Project documentation
- `regression_model.pkl` – Trained regression model
- `classification_model.pkl` – Winning classification model
- Plot images used in the README
## Lessons Learned & Reflections

- Feature engineering had a greater impact than model choice alone.
- Clustering enriched downstream supervised models.
- Classification framing provided clearer business value than raw regression.
- Careful metric selection (recall & F1-score) is crucial for decision-oriented ML tasks.
## Extra Work

- Cluster-based feature engineering
- Comparison between regression and classification formulations
- Business-oriented evaluation and interpretation