title: Bharat Tech Atlas
emoji: 🗺️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
tags:
  - ml-intern

Bharat Tech Atlas v3.3

A mapping platform for India's startup ecosystem, built on a curated dataset of startups, SMEs, college E-Cells, incubators, and accelerators. It now includes ML-powered insights, ETL data pipelines, MLOps monitoring, and security hardening.

Data disclaimer: This is a curated subset. India has 223,000+ DPIIT-registered startups (as of April 2026). Stats shown here reflect only what's in our database. Sources: DPIIT, Tracxn, Crunchbase, LinkedIn.

What's New in v3.3

Security hardening: Centralized security.py with XSS detection, prompt injection scanning (27 patterns), URL validation (SSRF prevention), audit logging, rate limiting, CSP generation
Input validation: Stricter bounds on coordinates, funding amounts, years, startup names, query lengths
Frontend security: Removed dangerouslySetInnerHTML from social icons, added SafeSocialIcon React component, URL validation on all social links
Chat safety: Prompt injection detection rejects DAN/jailbreak attempts; output sanitization strips scripts
Audit logging: Every request gets a request_id; security events are logged with severity levels
CORS tightened: Configurable origins, limited methods (GET/POST/HEAD/OPTIONS)
Security headers: X-Content-Type-Options, X-Frame-Options, Referrer-Policy, X-Request-ID, HSTS + CSP in production
Rate limits: Chat/Agent = 30 req/min per IP, General API = 120/min
Query timeouts: 10 s lock-wait limit (PRAGMA busy_timeout) on all entity queries
Tests: 15 security tests added (injection, XSS, CORS, rate limits, coordinate validation)
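
As an illustration of the URL validation (SSRF prevention) item above, here is a minimal sketch using only the standard library. The function name `is_safe_url`, the allowed schemes, and the length cap are assumptions made for this example; the actual rules in security.py are not shown in this README and may differ.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web links are accepted


def is_safe_url(url: str, max_length: int = 2048) -> bool:
    """Return True if `url` passes a few basic safety rules (illustrative only)."""
    if not url or len(url) > max_length:
        return False
    try:
        parts = urlsplit(url.strip())
    except ValueError:  # e.g. malformed IPv6 literal
        return False
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False  # rejects javascript:, data:, file:, ...
    host = parts.hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name. A stricter check would also resolve it and
        # re-validate the resulting addresses before fetching.
        return True
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

A check like this only covers literal IP hosts and obvious local names. It does not protect against DNS rebinding, which needs validation at connection time.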

What's New in v3.2

Chat AI fix: 🤖 button now functional. Keyword-based responses work instantly, and LLM mode auto-loads when transformers is available
Search Agent: Web search for startup analysis (gracefully degrades when duckduckgo-search is not installed)
ML keyword fallback: All ML endpoints work without GPU/transformers — sector classification via keyword rules
Bug fixes: Fixed double-execution middleware, DB connection pool, missing router mounts, dead enrichment code
Schema migrations: Auto-migrate DB schema on startup, seed guard for faster restarts

What's New in v3.0

ML Inference API: NLP sector classification + startup growth prediction
ETL Pipeline: Automated Extract → Transform → Load from DPIIT, Tracxn, Crunchbase
MLOps Monitoring: Data drift detection, model health tracking, CI/CD pipelines
Model Serving: ONNX-optimized inference, caching, TorchServe/Triton-ready
125 Unicorns: Full Indian unicorn dataset with funding & investors

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        FRONTEND (React + MapLibre)                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐             │
│  │  Map UI  │ │ Sidebar  │ │ Analytics│ │  Search    │             │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └─────┬──────┘             │
│       │             │            │              │                     │
│       └─────────────┴────────────┴──────────────┘                    │
│                         Async API Calls (Fetch)                       │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
┌──────────────────────────────────▼──────────────────────────────────┐
│                      BACKEND (FastAPI + Uvicorn)                      │
│                                                                       │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │  /api/entities  │  │    /api/ml      │  │   /api/mlops    │      │
│  │  GeoJSON, CRUD  │  │  Classification │  │  Drift, Monitor │      │
│  │  Clusters, Fmt  │  │  Growth Predict │  │  Model Registry │      │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘      │
│           │                     │                     │               │
│  ┌────────▼─────────────────────▼─────────────────────▼────────┐     │
│  │              DATABASE LAYER (SQLite + R-Tree)                 │     │
│  │   Spatial indexing · WAL mode · Optimized queries             │     │
│  └──────────────────────────────────────────────────────────────┘     │
│                                                                       │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │                    ETL PIPELINE                                │     │
│  │   DPIIT API → ┐                                              │     │
│  │   Tracxn API → ├→ Transform (geocode, normalize) → Load DB   │     │
│  │   Crunchbase → ┘                                              │     │
│  └──────────────────────────────────────────────────────────────┘     │
│                                                                       │
│  ┌──────────────────────────────────────────────────────────────┐     │
│  │                    ML MODEL SERVER                             │     │
│  │   Sector Classifier (BART/ONNX) + Growth Predictor (GBT)     │     │
│  │   LRU Cache · Batch Inference · Health Monitoring             │     │
│  └──────────────────────────────────────────────────────────────┘     │
└───────────────────────────────────────────────────────────────────────┘
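
The database layer in the diagram pairs SQLite with an R-Tree spatial index. The sketch below shows how a bounding-box lookup over such an index might work, using Python's built-in sqlite3 module. The table and column names (`entities`, `entities_rtree`) and the sample rows are hypothetical; the project's real schema is not shown here. It also assumes the SQLite build includes the R*Tree module, which is the case for most Python distributions.

```python
import sqlite3

# In-memory database for the demo. WAL mode applies to file-backed databases,
# so it is not enabled here.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA busy_timeout = 10000")  # wait up to 10 s on a locked database

conn.execute(
    "CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lng REAL)"
)
# One R-Tree row per entity; a point is stored as a zero-area rectangle.
conn.execute(
    "CREATE VIRTUAL TABLE entities_rtree "
    "USING rtree(id, min_lng, max_lng, min_lat, max_lat)"
)


def add_entity(entity_id: int, name: str, lat: float, lng: float) -> None:
    """Insert an entity and keep the spatial index in sync."""
    conn.execute("INSERT INTO entities VALUES (?, ?, ?, ?)", (entity_id, name, lat, lng))
    conn.execute(
        "INSERT INTO entities_rtree VALUES (?, ?, ?, ?, ?)",
        (entity_id, lng, lng, lat, lat),
    )


def entities_in_bbox(west: float, south: float, east: float, north: float) -> list[str]:
    """Return names of entities whose point lies inside the bounding box."""
    rows = conn.execute(
        "SELECT e.name FROM entities_rtree AS r "
        "JOIN entities AS e ON e.id = r.id "
        "WHERE r.min_lng >= ? AND r.max_lng <= ? "
        "AND r.min_lat >= ? AND r.max_lat <= ? "
        "ORDER BY e.name",
        (west, east, south, north),
    ).fetchall()
    return [name for (name,) in rows]


add_entity(1, "Bengaluru entity", 12.9716, 77.5946)
add_entity(2, "Mumbai entity", 19.0760, 72.8777)
add_entity(3, "Delhi entity", 28.6139, 77.2090)

print(entities_in_bbox(west=74, south=8, east=81, north=15))
```

The R-Tree stores coordinates as 32-bit floats, so exact coordinates are better read from the main table, as done here through the join.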

1. Data Pipelines & API Architecture

ETL Pipeline (`backend/etl/`)

Stage	Module	Description
Extract	`extract.py`	Async API callers for DPIIT, Tracxn, Crunchbase with rate limiting
Transform	`transform.py`	Geocoding, sector normalization, deduplication, slug generation
Load	`load.py`	Batch upserts to SQLite, R-Tree spatial index maintenance
Orchestrator	`pipeline.py`	Full/incremental runs, scheduling, error recovery

API Endpoints — ETL Management

GET /api/etl/status — Pipeline status + DB stats
POST /api/etl/run — Trigger full ETL pipeline
POST /api/etl/run/incremental?since_hours=24 — Incremental update
GET /api/etl/history — Pipeline run history
GET /api/etl/sources — Configured data sources info

Data Flow

DPIIT Portal ──┐
Tracxn API ────┼──→ ETLExtractor (async, rate-limited)
Crunchbase ────┘         │
                         ▼
                  ETLTransformer
                  - Clean names (remove Pvt/Ltd/Inc)
                  - Normalize sectors → standard taxonomy
                  - Geocode addresses → lat/lng
                  - Deduplicate across sources
                         │
                         ▼
                    ETLLoader
                  - Batch INSERT/UPDATE
                  - R-Tree spatial index
                  - Stale record deactivation
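
The transform steps in the flow above (name cleaning, slug generation, cross-source deduplication) can be sketched in a few lines. This is not the code in transform.py; the function names, the suffix list, and the merge rule (first non-empty value wins) are assumptions chosen to keep the example small.

```python
import re
import unicodedata

# Assumed list of legal suffixes to strip; the real pipeline may use a longer one.
_SUFFIX_RE = re.compile(r"[\s,]+(?:pvt|private|ltd|limited|inc|llp)\.?$", re.IGNORECASE)


def clean_name(raw: str) -> str:
    """Collapse whitespace and strip trailing legal suffixes such as 'Pvt. Ltd.'."""
    name = " ".join(raw.split())
    while True:
        stripped = _SUFFIX_RE.sub("", name)
        if stripped == name:
            return name
        name = stripped  # repeat so 'Pvt. Ltd.' is removed in two passes


def slugify(name: str) -> str:
    """Lowercase ASCII slug with hyphens between words."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def deduplicate(records: list[dict]) -> list[dict]:
    """Merge records that share a slug; the first non-empty value per field wins."""
    merged: dict[str, dict] = {}
    for rec in records:
        name = clean_name(rec["name"])
        key = slugify(name)
        target = merged.setdefault(key, {"slug": key, "name": name})
        for field, value in rec.items():
            if field != "name" and value is not None and field not in target:
                target[field] = value
    return list(merged.values())


rows = [
    {"name": "Acme Labs Pvt. Ltd.", "source": "dpiit", "city": None},
    {"name": "ACME  Labs", "source": "tracxn", "city": "Pune"},
]
print(deduplicate(rows))
```

Matching on slugs alone will miss near-duplicates with different spellings. A production pipeline would likely add fuzzy matching or a shared registry ID.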

2. Backend Frameworks & ML APIs

FastAPI Backend (Production-Grade)

Async request handling — concurrent ML inference without blocking
Rate limiting — 60 req/min general, 10 req/min for ML endpoints
Input sanitization — SQL injection prevention, query length limits
Auto-generated docs — Swagger UI at /docs
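
To make the per-client rate limiting concrete, here is a small sliding-window limiter in plain Python. The Tech Stack lists slowapi for this job, so the class below is only a stand-in that shows the idea; the class name and the injectable clock are assumptions for the example, and the limit is whatever value the endpoint is configured with.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Example: a stricter limiter for ML endpoints, a looser one for everything else.
ml_limiter = SlidingWindowLimiter(limit=10)
general_limiter = SlidingWindowLimiter(limit=60)
```

In a FastAPI app, a rejected request would typically be answered with HTTP 429. State kept in process memory like this is lost on restart and is not shared across workers.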

ML API Endpoints (`/api/ml/`)

GET /api/ml/classify/sector?description=...&top_k=3 — NLP sector classification
POST /api/ml/classify/sector/batch — Batch classification (up to 50)
GET /api/ml/predict/growth/{slug} — Growth prediction for an entity
GET /api/ml/predict/growth?sector=fintech&state=Karnataka — Ranked predictions
GET /api/ml/health — Model server health
GET /api/ml/sectors/taxonomy — Full sector taxonomy
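
A client for the single-item classification endpoint might look like the sketch below, using only the standard library. The base URL matches the local port used elsewhere in this README. The helper names are hypothetical, and the response is returned as parsed JSON without assuming its fields, since the response schema is not documented here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:7860"  # local development server


def classify_url(description: str, top_k: int = 3, base_url: str = BASE_URL) -> str:
    """Build the GET URL for /api/ml/classify/sector with an encoded query string."""
    query = urlencode({"description": description, "top_k": top_k}, quote_via=quote)
    return f"{base_url}/api/ml/classify/sector?{query}"


def classify_sector_remote(description: str, top_k: int = 3, timeout: float = 10.0):
    """Call the endpoint and return the decoded JSON body (schema not assumed)."""
    with urlopen(classify_url(description, top_k), timeout=timeout) as resp:
        return json.load(resp)


print(classify_url("AI-powered fraud detection for banks"))
```

`classify_sector_remote` needs the backend running locally. `classify_url` can be used on its own to check how a description is encoded.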

3. Pre-Existing Models & Serving

Sector Classifier (`backend/ml/classifier.py`)

Mode	Model	Latency	Accuracy
Zero-shot	`facebook/bart-large-mnli`	~200ms	~85%
ONNX	Exported BART	~50ms	~85%
Keyword fallback	Rule-based	~1ms	~70%
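
The keyword fallback row can be illustrated with a rule-based scorer like the one below. The keyword lists and the scoring rule (share of keyword hits) are assumptions made for this sketch; the rules in classifier.py are not reproduced here.

```python
# Hypothetical keyword lists; a real taxonomy would be larger.
SECTOR_KEYWORDS = {
    "fintech": ("bank", "payment", "lending", "fraud", "insurance", "upi"),
    "edtech": ("learning", "student", "course", "tutor", "exam"),
    "agritech": ("farm", "crop", "agri", "soil"),
    "healthtech": ("health", "clinic", "patient", "diagnos"),
}


def classify_sector(description: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank sectors by keyword hits; scores are each sector's share of all hits."""
    text = description.lower()
    scores = {}
    for sector, keywords in SECTOR_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in text)  # substring match
        if hits:
            scores[sector] = hits
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sector, hits / total) for sector, hits in ranked[:top_k]]


print(classify_sector("AI-powered fraud detection for banks"))
```

Substring matching is fast but blunt: a short keyword can match inside an unrelated word, which is one reason a fallback like this trades accuracy for latency.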

Growth Predictor (`backend/ml/predictor.py`)

Features used:

Funding signal — log-normalized funding amount
Team signal — log-normalized employee count
Sector momentum — market trend score (AI=0.95, EdTech=0.65)
Ecosystem score — city startup density (Bengaluru=0.95)
Age signal — sweet spot 2-7 years
Recognition signals — DPIIT, NSA, unicorn status
Investor quality — presence of top-tier VCs
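
The feature list above can be turned into numbers along the following lines. Only the three quoted scores (AI=0.95, EdTech=0.65, Bengaluru=0.95) and the 2-7 year sweet spot come from this README. The caps, the default score, the field names, and the equal-weight average are assumptions for illustration, and the average is a stand-in for the trained model, not the predictor itself.

```python
import math

SECTOR_MOMENTUM = {"ai": 0.95, "edtech": 0.65}   # values quoted in the list above
ECOSYSTEM_SCORE = {"bengaluru": 0.95}            # value quoted in the list above
DEFAULT_SCORE = 0.5        # assumption: fallback for unlisted sectors and cities
FUNDING_CAP_USD = 1e9      # assumption: funding at or above this maps to 1.0
EMPLOYEE_CAP = 10_000      # assumption: headcount at or above this maps to 1.0


def log_norm(value: float, cap: float) -> float:
    """Log-scale a non-negative value into [0, 1]."""
    if value <= 0:
        return 0.0
    return min(math.log1p(value) / math.log1p(cap), 1.0)


def age_signal(age_years: float) -> float:
    """1.0 inside the 2-7 year sweet spot, tapering off on either side."""
    if age_years < 0:
        return 0.0
    if age_years < 2:
        return age_years / 2
    if age_years <= 7:
        return 1.0
    return max(0.0, 1.0 - (age_years - 7) / 10)


def build_features(entity: dict, current_year: int) -> dict:
    """Map a raw entity record (hypothetical field names) to the seven signals."""
    flags = ("dpiit_recognized", "nsa_winner", "is_unicorn")
    return {
        "funding": log_norm(entity.get("funding_usd", 0), FUNDING_CAP_USD),
        "team": log_norm(entity.get("employees", 0), EMPLOYEE_CAP),
        "sector_momentum": SECTOR_MOMENTUM.get(entity.get("sector", "").lower(), DEFAULT_SCORE),
        "ecosystem": ECOSYSTEM_SCORE.get(entity.get("city", "").lower(), DEFAULT_SCORE),
        "age": age_signal(current_year - entity["founded_year"]),
        "recognition": sum(bool(entity.get(f)) for f in flags) / len(flags),
        "investor_quality": 1.0 if entity.get("has_top_tier_investor") else 0.0,
    }


def growth_score(entity: dict, current_year: int) -> float:
    """Equal-weight average of the signals; a placeholder for the real model."""
    features = build_features(entity, current_year)
    return sum(features.values()) / len(features)
```

Keeping every signal in the range 0 to 1 makes the features comparable and easy to feed into either a weighted score or a tree-based model.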

Model Serving (`backend/ml/serving.py`)

Production-ready serving with:

LRU Response Cache — avoid redundant inference
Batch processing — group requests for GPU efficiency
TorchServe adapter — for PyTorch model deployment at scale
NVIDIA Triton adapter — multi-model, dynamic batching
ONNX Runtime — 3-5x speedup on CPU
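
The LRU response cache idea can be sketched with an `OrderedDict`. The class name, the default size, and the key normalization are assumptions for this example and are not taken from serving.py.

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUResponseCache:
    """Keep the most recently used inference results, evicting the oldest."""

    def __init__(self, max_size: int = 1024) -> None:
        self.max_size = max_size
        self.hits = 0
        self.misses = 0
        self._store: OrderedDict = OrderedDict()

    def get_or_compute(self, key: str, compute: Callable[[str], Any]) -> Any:
        """Return the cached value for `key`, calling `compute` only on a miss."""
        key = key.strip().lower()  # assumption: treat case/whitespace variants as equal
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(key)
        self._store[key] = value
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
        return value


cache = LRUResponseCache(max_size=2)
calls = []


def fake_model(text: str) -> str:
    """Stand-in for an expensive model call."""
    calls.append(text)
    return f"sector-for-{text}"


for query in ["a", "b", "A ", "c", "b"]:
    cache.get_or_compute(query, fake_model)
print(calls, cache.hits, cache.misses)
```

For a pure function with hashable arguments, `functools.lru_cache` gives the same behavior in one line. An explicit class is useful when hit and miss counters need to be exported as metrics.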

4. MLOps & Agile Integration

MLOps Module (`backend/mlops/`)

Component	Description
Data Drift Detection	PSI (categorical) + KS-test (numerical)
Model Monitor	Latency, errors, confidence tracking, auto-alerts
Model Registry	Version control, A/B comparison, promotion
CI/CD Pipeline	GitHub Actions workflow (test → validate → train → deploy)
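
The two drift statistics named in the table can be computed with the standard library alone. The sketch below gives a Population Stability Index for categorical columns and the two-sample Kolmogorov-Smirnov statistic for numerical ones. Function names and the smoothing constant are assumptions, and the KS function returns only the statistic, not a p-value.

```python
import math
from bisect import bisect_right
from collections import Counter
from typing import Sequence


def psi(expected: Sequence[str], actual: Sequence[str], eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical samples."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(e_counts) | set(a_counts):
        # Floor each share at eps so unseen categories do not cause log(0).
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


baseline = ["fintech"] * 50 + ["ai"] * 50
current = ["fintech"] * 20 + ["ai"] * 80
print(round(psi(baseline, current), 3))
print(ks_statistic([1, 2, 3], [10, 11, 12]))
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, though the right alert threshold depends on the data and should be tuned.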

MLOps API Endpoints (`/api/mlops/`)

GET /api/mlops/drift/check — Run drift detection on current data
GET /api/mlops/drift/history — Historical drift reports
GET /api/mlops/monitor/metrics — Real-time model metrics
GET /api/mlops/monitor/alerts — Recent alerts (latency, degradation)
GET /api/mlops/registry/models — All model versions
GET /api/mlops/registry/compare?model=...&v1=...&v2=... — Version diff
GET /api/mlops/cicd/workflow — GitHub Actions YAML
POST /api/mlops/cicd/trigger-retrain — Manual retraining trigger

Agile Sprint Structure

Sprint	Deliverable	Status
Sprint 1	Basic map UI + FastAPI backend + static data	✅ Done
Sprint 2	ETL pipeline (DPIIT/Tracxn/Crunchbase) + real data	✅ Done
Sprint 3	ML models (sector classifier + growth predictor)	✅ Done
Sprint 4	MLOps (drift detection, monitoring, CI/CD)	✅ Done
Sprint 5	Security hardening, XSS prevention, audit logging, rate limiting	✅ Done
Sprint 6	Scale: ONNX optimization, GPU serving, load testing	🔜 Next

CI/CD Pipeline

# Triggered on: push to backend/ml/**, scheduled weekly
test → data-validation → train → validate-performance → deploy

DVC for data version control
GitHub Actions for automated pipeline
HuggingFace Hub for model storage
Prometheus/Grafana compatible metrics

Tech Stack

Layer	Technology
Frontend	React 18 + Vite + MapLibre GL JS + Tailwind CSS
Backend	FastAPI + slowapi rate limiting
Database	SQLite with R-Tree spatial index
ML Inference	HuggingFace Transformers + ONNX Runtime
ETL	Async Python (aiohttp) + custom pipeline
MLOps	Data drift (PSI/KS) + Model registry + CI/CD
Map Tiles	CARTO Dark Matter
Deployment	Docker on Hugging Face Spaces

Map Features

Map Modes

Clusters (default): Server-side numbered bubble clusters at low zoom
Points: All entities as colored dots
Heatmap: Funding-weighted density visualization
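
One simple way to produce server-side clusters is to bucket points into a grid whose cell size shrinks as zoom increases. The sketch below assumes that approach; the README does not say which clustering method the backend uses, and the cell-size formula and function name are illustrative choices.

```python
import math
from collections import defaultdict


def cluster_points(points: list[tuple[float, float]], zoom: int) -> list[dict]:
    """Group (lat, lng) points into grid cells and return one bubble per cell."""
    cell = 360.0 / (2 ** zoom)  # assumption: cell width in degrees halves per zoom level
    buckets: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)
    for lat, lng in points:
        buckets[(math.floor(lat / cell), math.floor(lng / cell))].append((lat, lng))

    clusters = []
    for members in buckets.values():
        clusters.append(
            {
                # Place the bubble at the mean position of its members.
                "lat": sum(lat for lat, _ in members) / len(members),
                "lng": sum(lng for _, lng in members) / len(members),
                "count": len(members),
            }
        )
    return sorted(clusters, key=lambda c: -c["count"])


sample = [(12.97, 77.59), (12.93, 77.62), (28.61, 77.21)]
print(cluster_points(sample, zoom=4))
```

At low zoom the two nearby points merge into one bubble with a count of 2; at high zoom they fall into separate cells, which is the behavior the numbered bubbles rely on.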

Filtering

Entity types, sectors, DPIIT categories, business models, stage, location
Special filters: unicorns, women-led, rural impact, campus startups, NSA winners
State filter with fly-to

Data

Curated dataset across India:

125 Unicorns: Flipkart, PhonePe, Zerodha, Zomato, Swiggy, CRED, Zepto, Neysa + more
Women-led startups: Nykaa, Mamaearth, Sugar Cosmetics, OPEN Financial + more
Rural impact: DeHaat, CropIn, Stellapps, Aye Finance + more
Campus startups: Pixxel, TartanSense, Zepto + more
NSA winners: Tagged with award categories
DPIIT recognized: Government-recognized startups

Quick Start (Local Development)

# Backend
pip install -r requirements.txt
uvicorn backend.main:app --reload --port 7860

# Frontend (separate terminal)
cd frontend && npm install && npm run dev

# Run ETL pipeline
python -c "import asyncio; from backend.etl import ETLPipeline; asyncio.run(ETLPipeline({}).run())"

# Test ML classification
curl "http://localhost:7860/api/ml/classify/sector?description=AI-powered%20fraud%20detection%20for%20banks"

# Check data drift
curl "http://localhost:7860/api/mlops/drift/check"

# Run tests
pytest tests/test_api.py -v