# 🚀 Startup Funding Analysis – Regression & Classification

## 🎥 Project Presentation Video

## 📌 Project Overview

This project analyzes startup funding data to understand the drivers of startup success and to build predictive models that estimate funding outcomes. It combines **exploratory data analysis**, **feature engineering**, **clustering**, and both **regression** and **classification models** to extract actionable insights for investors and analysts.

## 🎯 Project Objectives

- Explore and understand patterns in startup funding data.
- Engineer meaningful features that improve predictive performance.
- Use **clustering** to uncover latent startup profiles.
- Predict funding amounts using regression models.
- Reframe the problem as a **classification task** (high-funded vs. low-funded startups).
- Compare multiple classification models and select the best-performing one.

## 📂 Dataset

The dataset contains information about startups, including:

- Funding rounds and total funding raised
- Company characteristics and operational history
- Temporal and aggregated funding features

Basic preprocessing and cleaning steps were applied before modeling.

## ❓ Key Questions & Answers

**Q1: What factors are most associated with higher startup funding?**
Startups with more funding rounds, longer operating histories, and later-stage investments tend to receive higher total funding.

**Q2: Can startups be meaningfully grouped using unsupervised learning?**
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.

**Q3: Does feature engineering improve model performance?**
Yes. Aggregated funding features and cluster-based features significantly improved both regression accuracy and classification F1-scores.

**Q4: Is reframing the problem as classification useful?**
Absolutely. Classifying startups as high-funded vs. low-funded provides a practical screening tool for investment decision-making.
## 🔍 Exploratory Data Analysis (EDA)

EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.

### Funding Distribution

![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/nmblWzAYvGE1w66orTDyP.png)

Key insights:

- Funding amounts are highly right-skewed.
- Log transformations are beneficial for regression modeling.
- A small subset of startups accounts for most of the total funding.

## 🧠 Feature Engineering & Clustering

### Feature Engineering

Key feature engineering steps included:

- Aggregating funding rounds and amounts.
- Creating temporal features (startup age, years active).
- Scaling numeric features.
- Encoding categorical variables.

### Clustering

K-Means clustering was applied to the scaled features to create a new categorical feature representing startup funding profiles.

### Clustering Visualization

![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/qT_PjNaX2W5HdaV_Wg60N.png)

Clustering helped distinguish between early-stage, growth-stage, and mature startups.

## 📈 Modeling

### Regression

A regression model was trained to predict the total funding amount. The trained regression model was exported as a pickle file.

### Regression-to-Classification

The continuous funding target was converted into a binary classification problem using a **median split**:

- Class 0: Below median funding
- Class 1: At or above median funding

This strategy ensured well-balanced classes.
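The scale-then-cluster step described above can be sketched as follows. This is a minimal illustration using synthetic data; the feature columns and `n_clusters=3` are assumptions for demonstration, not values taken from the project notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical numeric features standing in for the engineered ones
# (e.g. funding rounds, startup age, log of total funding)
X = rng.normal(size=(300, 3))

# Scale features so no single feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# K-Means assigns each startup a cluster label, which becomes
# a new categorical feature for the downstream supervised models
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
```

Scaling before K-Means matters because the algorithm relies on Euclidean distances; without it, features with larger raw ranges (such as funding amounts) would dominate the clustering.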
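The median-split reframing can be sketched in a few lines. The funding values below are synthetic (a log-normal draw mimicking the right-skewed distribution noted in the EDA); only the split logic reflects the approach described above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed funding amounts (log-normal stand-in)
funding = rng.lognormal(mean=15.0, sigma=1.5, size=1000)

threshold = np.median(funding)
# Class 1: at or above median funding; Class 0: below
labels = (funding >= threshold).astype(int)
```

Because the threshold is the sample median, the two classes are balanced by construction, which avoids the class-imbalance issues a fixed dollar cutoff could introduce.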
## ✅ Classification Models Trained

Three classification models were trained and evaluated using the same engineered features:

- Logistic Regression
- Random Forest
- Gradient Boosting

## 📊 Evaluation

Model performance was evaluated using:

- Precision
- Recall
- F1-score
- Confusion matrices

### Gradient Boosting – Confusion Matrix

![image](https://cdn-uploads.huggingface.co/production/uploads/6911c99df0486574df1afe3d/CXRohrGpv9sLBRWCxlCBE.png)

### Key Findings

- **Recall** was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
- Gradient Boosting achieved the best balance between precision and recall.
- Most errors across models were false positives, which aligns with the business goal.

## 🏆 Winner Model

✅ **Gradient Boosting** was selected as the final classification model.

Reasons:

- Highest F1-score and recall.
- Strong handling of non-linear relationships.
- Best overall performance on unseen test data.

The trained classification model was exported as a pickle file.

## 📦 Repository Contents

- `notebook.ipynb` – Full analysis and modeling pipeline
- `README.md` – Project documentation
- `regression_model.pkl` – Trained regression model
- `classification_model.pkl` – Winning classification model
- Plot images used in the README

## 🔁 Lessons Learned & Reflections

- Feature engineering had a greater impact than model choice alone.
- Clustering enriched downstream supervised models.
- Classification framing provided clearer business value than raw regression.
- Careful metric selection (recall and F1-score) is crucial for decision-oriented ML tasks.

## ✨ Extra Work

- Cluster-based feature engineering
- Comparison between regression and classification formulations
- Business-oriented evaluation and interpretation
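The train-compare-export workflow described above can be sketched as follows. This is a hedged example on synthetic data, not the project's actual pipeline: the feature matrix, hyperparameters, and the `classification_model.pkl` filename (which matches the repository contents list) are illustrative assumptions:

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered startup features and binary labels
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The three model families compared in the project
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and record F1-score and recall on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "f1": f1_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }

# Select the winner by F1-score and export it as a pickle file
best = max(scores, key=lambda n: scores[n]["f1"])
with open("classification_model.pkl", "wb") as f:
    pickle.dump(models[best], f)
```

Ranking by F1-score while also tracking recall mirrors the evaluation priorities above: recall guards against missing high-funded startups, while F1 keeps false positives in check.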