
πŸš€ Startup Funding Analysis – Regression & Classification

πŸŽ₯ Project Presentation Video

πŸ“Œ Project Overview

This project analyzes startup funding data with the goal of understanding the drivers of startup success and building predictive models that estimate funding outcomes.
The project combines exploratory data analysis, feature engineering, clustering, and both regression and classification models to extract actionable insights for investors and analysts.

🎯 Project Objectives

  • Explore and understand patterns in startup funding data.
  • Engineer meaningful features that improve predictive performance.
  • Use clustering to uncover latent startup profiles.
  • Predict funding amounts using regression models.
  • Reframe the problem as a classification task (high-funded vs. low-funded startups).
  • Compare multiple classification models and select the best-performing one.

πŸ“‚ Dataset

The dataset contains information about startups, including:

  • Funding rounds and total funding raised
  • Company characteristics and operational history
  • Temporal and aggregated funding features

Basic preprocessing and cleaning steps were applied before modeling.

❓ Key Questions & Answers

Q1: What factors are most associated with higher startup funding?
Startups with more funding rounds, longer operating history, and later-stage investments tend to receive higher total funding.

Q2: Can startups be meaningfully grouped using unsupervised learning?
Yes. Clustering revealed clear groups representing early-stage, growth-stage, and highly funded startups.

Q3: Does feature engineering improve model performance?
Yes. Aggregated funding features and cluster-based features significantly improved both regression performance and classification F1-scores.

Q4: Is reframing the problem as classification useful?
Absolutely. Classifying startups into high-funded vs. low-funded provides a practical screening tool for investment decision-making.

πŸ” Exploratory Data Analysis (EDA)

EDA focused on understanding the distribution of funding amounts, identifying skewness, and examining relationships between key variables.

Funding Distribution

*(plot image: distribution of startup funding amounts)*

Key insights:

  • Funding amounts are highly right-skewed.
  • Log transformations are beneficial for regression modeling.
  • A small subset of startups accounts for most of the total funding.
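The log transformation mentioned above can be sketched as follows. This is a minimal illustration with made-up funding values (the real dataset's column names and amounts differ); `np.log1p` is used because it handles zero values safely.

```python
import numpy as np
import pandas as pd

# Hypothetical funding amounts in USD; like the real data, highly right-skewed
funding = pd.Series([50_000, 120_000, 300_000, 1_500_000, 40_000_000])

# log1p compresses the long right tail and is safe for zero values
log_funding = np.log1p(funding)

print(funding.skew())      # strongly positive skew on the raw scale
print(log_funding.skew())  # much closer to symmetric after the transform
```

Regression models trained on the log scale are far less dominated by the handful of very large raises.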

🧠 Feature Engineering & Clustering

Feature Engineering

Key feature engineering steps included:

  • Aggregating funding rounds and amounts.
  • Creating temporal features (startup age, years active).
  • Scaling numeric features.
  • Encoding categorical variables.
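The steps above can be sketched roughly as follows. Column names (`startup`, `round_amount`, `founded_year`, `round_year`) and the tiny example table are hypothetical, not the project's actual schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw funding-round table; one row per round
df = pd.DataFrame({
    "startup": ["A", "A", "B", "B", "B"],
    "round_amount": [1e5, 5e5, 2e5, 3e5, 1e6],
    "founded_year": [2015, 2015, 2018, 2018, 2018],
    "round_year": [2016, 2019, 2019, 2020, 2021],
})

# Aggregate per-startup funding features
agg = df.groupby("startup").agg(
    total_funding=("round_amount", "sum"),
    num_rounds=("round_amount", "count"),
    founded_year=("founded_year", "first"),
    last_round_year=("round_year", "max"),
).reset_index()

# Temporal feature: years active in funding markets
agg["years_active"] = agg["last_round_year"] - agg["founded_year"]

# Scale numeric features before clustering / modeling
num_cols = ["total_funding", "num_rounds", "years_active"]
agg[num_cols] = StandardScaler().fit_transform(agg[num_cols])
```

Categorical variables (industry, region, etc.) would additionally be one-hot or ordinal encoded before modeling.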

Clustering

K-Means clustering was applied to the scaled features to create a new categorical feature representing startup funding profiles.

Clustering Visualization

*(plot image: K-Means cluster visualization)*

Clustering helped distinguish between early-stage, growth-stage, and mature startups.
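A minimal sketch of the clustering step, using synthetic data in place of the real engineered features (the three-group structure here is constructed to mirror the early/growth/mature profiles the project found):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in features: [num_rounds, years_active, total_funding]
X = np.vstack([
    rng.normal([1, 1, 1e5], [0.5, 0.5, 2e4], size=(30, 3)),   # early-stage
    rng.normal([4, 5, 2e6], [1.0, 1.0, 5e5], size=(30, 3)),   # growth-stage
    rng.normal([8, 10, 5e7], [1.5, 2.0, 1e7], size=(30, 3)),  # mature
])
X_scaled = StandardScaler().fit_transform(X)

# Three clusters matching the profiles described above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# The cluster label becomes a new categorical feature for supervised models
print(np.bincount(labels))
```

Scaling before K-Means matters: without it, the funding column (in dollars) would dominate the Euclidean distances entirely.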

πŸ“ˆ Modeling

Regression

A regression model was trained to predict total funding amount.
The trained regression model was exported as a pickle file (`regression_model.pkl`).
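The train-and-export flow can be sketched as below. The choice of `RandomForestRegressor` and the synthetic features are illustrative assumptions, not necessarily the model used in the notebook; only the pickle export step is taken from the project description.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for engineered features and a (log-scale) funding target
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -0.5, 2.0, 0.3]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
print(f"R^2 on held-out data: {reg.score(X_test, y_test):.3f}")

# Export the trained model as a pickle file
with open("regression_model.pkl", "wb") as f:
    pickle.dump(reg, f)
```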

Regression-to-Classification

The continuous funding target was converted into a binary classification problem using a median split:

  • Class 0: Below median funding
  • Class 1: At or above median funding

This strategy ensured well-balanced classes.
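The median split can be expressed in a couple of lines; the funding values below are hypothetical:

```python
import pandas as pd

# Hypothetical total-funding column (USD)
funding = pd.Series([5e4, 2e5, 8e5, 3e6, 1e7, 4e7])

# Median split: class 1 = at or above median funding, class 0 = below
threshold = funding.median()
label = (funding >= threshold).astype(int)

print(label.value_counts())  # the split yields balanced classes by construction
```

Because the threshold is the median of the target itself, the two classes are balanced by construction, which removes the need for resampling or class weighting.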

βœ… Classification Models Trained

Three different classification models were trained and evaluated using the same engineered features:

  • Logistic Regression
  • Random Forest
  • Gradient Boosting
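A sketch of the three-model comparison, using scikit-learn's default implementations of each algorithm and a synthetic dataset standing in for the engineered features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered startup features and binary target
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Train each model on the same features and compare held-out F1-scores
for name, model in models.items():
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    print(f"{name}: F1 = {score:.3f}")
```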

πŸ“Š Evaluation

Model performance was evaluated using:

  • Precision
  • Recall
  • F1-score
  • Confusion matrices
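These metrics can all be produced with scikit-learn's reporting utilities. The labels and predictions below are made up purely to show the output shape:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions for the binary funding task
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]

# Rows = true classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred))

# Precision, recall, and F1-score per class
print(classification_report(y_true, y_pred))
```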

Gradient Boosting – Confusion Matrix

*(plot image: Gradient Boosting confusion matrix)*

Key Findings

  • Recall was prioritized, as missing a high-funded startup is more costly than investigating a false positive.
  • Gradient Boosting achieved the best balance between precision and recall.
  • Most errors across models were false positives, aligning with the business goal.

πŸ† Winner Model

βœ… Gradient Boosting was selected as the final classification model.

Reasons:

  • Highest F1-score and recall.
  • Strong handling of non-linear relationships.
  • Best overall performance on unseen test data.

The trained classification model was exported as a pickle file (`classification_model.pkl`).

πŸ“¦ Repository Contents

  • notebook.ipynb – Full analysis and modeling pipeline
  • README.md – Project documentation
  • regression_model.pkl – Trained regression model
  • classification_model.pkl – Winning classification model
  • Plot images used in the README

πŸ” Lessons Learned & Reflections

  • Feature engineering had a greater impact than model choice alone.
  • Clustering enriched downstream supervised models.
  • Classification framing provided clearer business value than raw regression.
  • Careful metric selection (recall & F1-score) is crucial for decision-oriented ML tasks.

✨ Extra Work

  • Cluster-based feature engineering
  • Comparison between regression and classification formulations
  • Business-oriented evaluation and interpretation