metadata
title: Bharat Tech Atlas
emoji: 🗺️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
tags:
  - ml-intern

Bharat Tech Atlas v3.3

Mapping platform for India's startup ecosystem: a curated dataset of startups, SMEs, college E-Cells, incubators, and accelerators. Now with ML-powered insights, ETL data pipelines, MLOps monitoring, and security hardening.

Data disclaimer: This is a curated subset. India has 223,000+ DPIIT-registered startups (as of April 2026). Stats shown here reflect only what's in our database. Sources: DPIIT, Tracxn, Crunchbase, LinkedIn.

What's New in v3.3

  • Security hardening: Centralized security.py with XSS detection, prompt injection scanning (27 patterns), URL validation (SSRF prevention), audit logging, rate limiting, CSP generation
  • Input validation: Stricter bounds on coordinates, funding amounts, years, startup names, query lengths
  • Frontend security: Removed dangerouslySetInnerHTML from social icons, added SafeSocialIcon React component, URL validation on all social links
  • Chat safety: Prompt injection detection rejects DAN/jailbreak attempts, output sanitization strips scripts
  • Audit logging: Every request gets a request_id, security events logged with severity levels
  • CORS tightened: Configurable origins, limited methods (GET/POST/HEAD/OPTIONS)
  • Security headers: X-Content-Type-Options, X-Frame-Options, Referrer-Policy, X-Request-ID, HSTS + CSP in production
  • Rate limits: Chat/Agent = 30 req/min per IP, General API = 120/min
  • Query timeouts: PRAGMA busy_timeout of 10 s on all entity queries (bounds how long a query waits on a locked database)
  • Tests: 15 security tests added (injection, XSS, CORS, rate limits, coordinate validation)
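
The URL validation item above can be illustrated with a small standard-library check. This is a minimal sketch, not the contents of security.py: the function name `is_safe_url` and the allowed-scheme set are assumptions made for illustration, and a stricter check would also resolve DNS names before trusting them.

```python
import ipaddress
from urllib.parse import urlparse

# Assumption: only plain web links are acceptable for social profiles.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url: str) -> bool:
    """Return True for http(s) URLs that do not point at a local or private host."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parsed.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name, not an IP literal. A stricter check would also
        # resolve the name and re-test the resulting addresses.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)
```

Rejecting non-http schemes covers `javascript:` links in social icons, while the IP checks cover the usual SSRF targets such as loopback and cloud metadata addresses.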

What's New in v3.2

  • Chat AI fix: 🤖 button now functional; keyword-based responses work instantly, and LLM mode auto-loads when transformers is available
  • Search Agent: Web search for startup analysis (gracefully degrades when duckduckgo-search not installed)
  • ML keyword fallback: All ML endpoints work without GPU/transformers, using keyword rules for sector classification
  • Bug fixes: Fixed double-execution middleware, DB connection pool, missing router mounts, dead enrichment code
  • Schema migrations: Auto-migrate DB schema on startup, seed guard for faster restarts

What's New in v3.0

  • ML Inference API: NLP sector classification + startup growth prediction
  • ETL Pipeline: Automated Extract → Transform → Load from DPIIT, Tracxn, Crunchbase
  • MLOps Monitoring: Data drift detection, model health tracking, CI/CD pipelines
  • Model Serving: ONNX-optimized inference, caching, TorchServe/Triton-ready
  • 125 Unicorns: Full Indian unicorn dataset with funding & investors

Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                        FRONTEND (React + MapLibre)                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐                │
│  │  Map UI  │ │ Sidebar  │ │ Analytics│ │  Search    │                │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └─────┬──────┘                │
│       │            │            │             │                       │
│       └────────────┴────────────┴─────────────┘                       │
│                         Async API Calls (Fetch)                       │
└──────────────────────────────────┬────────────────────────────────────┘
                                   │
┌──────────────────────────────────▼────────────────────────────────────┐
│                      BACKEND (FastAPI + Uvicorn)                      │
│                                                                       │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │  /api/entities  │  │    /api/ml      │  │   /api/mlops    │        │
│  │  GeoJSON, CRUD  │  │  Classification │  │  Drift, Monitor │        │
│  │  Clusters, Fmt  │  │  Growth Predict │  │  Model Registry │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                 │
│  ┌────────▼────────────────────▼────────────────────▼────────────┐    │
│  │              DATABASE LAYER (SQLite + R-Tree)                 │    │
│  │   Spatial indexing · WAL mode · Optimized queries             │    │
│  └───────────────────────────────────────────────────────────────┘    │
│                                                                       │
│  ┌───────────────────────────────────────────────────────────────┐    │
│  │                    ETL PIPELINE                               │    │
│  │   DPIIT API  → ┐                                              │    │
│  │   Tracxn API → ├→ Transform (geocode, normalize) → Load DB    │    │
│  │   Crunchbase → ┘                                              │    │
│  └───────────────────────────────────────────────────────────────┘    │
│                                                                       │
│  ┌───────────────────────────────────────────────────────────────┐    │
│  │                    ML MODEL SERVER                            │    │
│  │   Sector Classifier (BART/ONNX) + Growth Predictor (GBT)      │    │
│  │   LRU Cache · Batch Inference · Health Monitoring             │    │
│  └───────────────────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────────────┘
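
The database layer lists WAL mode, and the v3.3 notes mention a 10 s PRAGMA busy_timeout. A minimal sketch of how a connection could be opened with both settings follows; the helper name `open_db` and its parameters are assumptions for illustration, not the project's actual code.

```python
import sqlite3


def open_db(path: str, busy_timeout_ms: int = 10_000) -> sqlite3.Connection:
    """Open a SQLite connection in WAL mode with a lock-wait timeout."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # busy_timeout is in milliseconds: wait up to this long on a locked database.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn
```

WAL mode only applies to file-backed databases, so an in-memory database would report a different journal mode.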

1. Data Pipelines & API Architecture

ETL Pipeline (backend/etl/)

| Stage | Module | Description |
| --- | --- | --- |
| Extract | extract.py | Async API callers for DPIIT, Tracxn, Crunchbase with rate limiting |
| Transform | transform.py | Geocoding, sector normalization, deduplication, slug generation |
| Load | load.py | Batch upserts to SQLite, R-Tree spatial index maintenance |
| Orchestrator | pipeline.py | Full/incremental runs, scheduling, error recovery |

API Endpoints: ETL Management

  • GET /api/etl/status - Pipeline status + DB stats
  • POST /api/etl/run - Trigger full ETL pipeline
  • POST /api/etl/run/incremental?since_hours=24 - Incremental update
  • GET /api/etl/history - Pipeline run history
  • GET /api/etl/sources - Configured data sources info

Data Flow

DPIIT Portal ──┐
Tracxn API ────┼──→ ETLExtractor (async, rate-limited)
Crunchbase ────┘         │
                         ▼
                  ETLTransformer
                  - Clean names (remove Pvt/Ltd/Inc)
                  - Normalize sectors → standard taxonomy
                  - Geocode addresses → lat/lng
                  - Deduplicate across sources
                         │
                         ▼
                     ETLLoader
                  - Batch INSERT/UPDATE
                  - R-Tree spatial index
                  - Stale record deactivation
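
The name cleaning, slug generation, and cross-source deduplication steps in the transform stage can be sketched as below. This is an illustration under stated assumptions, not the code in transform.py: the suffix list, the helper names, and the "first source wins" rule are all hypothetical.

```python
import re

# Assumption: legal suffixes to strip, following the "remove Pvt/Ltd/Inc" note.
_SUFFIX_RE = re.compile(r"\b(?:pvt|private|ltd|limited|inc)\b\.?", re.IGNORECASE)


def clean_name(raw: str) -> str:
    """Drop legal suffixes, collapse whitespace, trim stray punctuation."""
    name = _SUFFIX_RE.sub(" ", raw)
    name = re.sub(r"\s+", " ", name)
    return name.strip(" ,.")


def slugify(name: str) -> str:
    """Lowercase the name and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def deduplicate(records: list) -> list:
    """Keep the first record seen for each slug (assumed rule: first source wins)."""
    seen = {}
    for rec in records:
        name = clean_name(rec["name"])
        slug = slugify(name)
        seen.setdefault(slug, {**rec, "name": name, "slug": slug})
    return list(seen.values())
```

Keying on the slug means "Flipkart Pvt Ltd" from one source and "flipkart" from another collapse into a single record.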

2. Backend Frameworks & ML APIs

FastAPI Backend (Production-Grade)

  • Async request handling: concurrent ML inference without blocking
  • Rate limiting: 60 req/min general, 10 req/min for ML endpoints
  • Input sanitization: SQL injection prevention, query length limits
  • Auto-generated docs: Swagger UI at /docs
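
The Tech Stack section names slowapi for rate limiting. To show the idea behind a per-IP limit such as 10 req/min without depending on that library, here is a plain-Python sliding-window sketch. The class `SlidingWindowLimiter` is hypothetical and is not how slowapi is implemented or configured.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would answer with HTTP 429
        hits.append(now)
        return True
```

The `now` parameter exists only to make the behavior easy to test; in a request handler it would be left as `None`.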

ML API Endpoints (/api/ml/)

  • GET /api/ml/classify/sector?description=...&top_k=3 - NLP sector classification
  • POST /api/ml/classify/sector/batch - Batch classification (up to 50)
  • GET /api/ml/predict/growth/{slug} - Growth prediction for an entity
  • GET /api/ml/predict/growth?sector=fintech&state=Karnataka - Ranked predictions
  • GET /api/ml/health - Model server health
  • GET /api/ml/sectors/taxonomy - Full sector taxonomy

3. Pre-Existing Models & Serving

Sector Classifier (backend/ml/classifier.py)

| Mode | Model | Latency | Accuracy |
| --- | --- | --- | --- |
| Zero-shot | facebook/bart-large-mnli | ~200ms | ~85% |
| ONNX | Exported BART | ~50ms | ~85% |
| Keyword fallback | Rule-based | ~1ms | ~70% |
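
The keyword fallback mode can be pictured as a simple token lookup. The sketch below is illustrative only: the keyword sets and the function `classify_sector` are hypothetical, and the real taxonomy is whatever /api/ml/sectors/taxonomy returns.

```python
import re
from collections import Counter

# Hypothetical keyword rules, chosen for illustration.
SECTOR_KEYWORDS = {
    "fintech": {"payments", "lending", "loans", "bank", "banks", "fraud", "upi"},
    "edtech": {"learning", "students", "courses", "tutoring"},
    "agritech": {"farmers", "crop", "agriculture", "dairy"},
    "healthtech": {"health", "clinic", "diagnostics", "telemedicine"},
}


def classify_sector(description: str, top_k: int = 3) -> list:
    """Rank sectors by keyword hits; scores are each sector's share of all hits."""
    tokens = re.findall(r"[a-z]+", description.lower())
    scores = Counter()
    for sector, keywords in SECTOR_KEYWORDS.items():
        hits = sum(1 for tok in tokens if tok in keywords)
        if hits:
            scores[sector] = hits
    total = sum(scores.values())
    return [(sector, hits / total) for sector, hits in scores.most_common(top_k)]
```

A description with no matching keyword yields an empty list, which a caller could map to an "unknown" sector.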

Growth Predictor (backend/ml/predictor.py)

Features used:

  • Funding signal: log-normalized funding amount
  • Team signal: log-normalized employee count
  • Sector momentum: market trend score (AI=0.95, EdTech=0.65)
  • Ecosystem score: city startup density (Bengaluru=0.95)
  • Age signal: sweet spot 2-7 years
  • Recognition signals: DPIIT, NSA, unicorn status
  • Investor quality: presence of top-tier VCs
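
A few of the features listed above can be sketched as plain functions. Only the quoted scores (AI=0.95, EdTech=0.65, Bengaluru=0.95) and the 2-7 year sweet spot come from this README; the caps, the neutral default, the linear decay outside the sweet spot, and all function names are assumptions for illustration.

```python
import math

# Scores quoted in the README; other keys fall back to an assumed neutral value.
SECTOR_MOMENTUM = {"ai": 0.95, "edtech": 0.65}
ECOSYSTEM_SCORE = {"bengaluru": 0.95}
NEUTRAL = 0.5  # assumption


def log_normalize(value: float, cap: float) -> float:
    """Map a non-negative value to [0, 1] on a log scale, clipping at `cap`."""
    if value <= 0:
        return 0.0
    return min(1.0, math.log1p(value) / math.log1p(cap))


def age_signal(age_years: float) -> float:
    """1.0 inside the 2-7 year sweet spot, with an assumed linear falloff outside."""
    if 2 <= age_years <= 7:
        return 1.0
    if age_years < 2:
        return max(0.0, age_years / 2)
    return max(0.0, 1.0 - (age_years - 7) / 10)


def build_features(funding_amount, employees, sector, city, age_years) -> dict:
    """Assemble a feature dict for a downstream model (caps are assumed values)."""
    return {
        "funding_signal": log_normalize(funding_amount, cap=1e9),
        "team_signal": log_normalize(employees, cap=10_000),
        "sector_momentum": SECTOR_MOMENTUM.get(sector.lower(), NEUTRAL),
        "ecosystem_score": ECOSYSTEM_SCORE.get(city.lower(), NEUTRAL),
        "age_signal": age_signal(age_years),
    }
```

Log scaling keeps very large funding rounds from dominating the feature range.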

Model Serving (backend/ml/serving.py)

Production-ready serving with:

  • LRU Response Cache: avoid redundant inference
  • Batch processing: group requests for GPU efficiency
  • TorchServe adapter: for PyTorch model deployment at scale
  • NVIDIA Triton adapter: multi-model, dynamic batching
  • ONNX Runtime: 3-5x speedup on CPU
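
The LRU response cache can be illustrated with `functools.lru_cache` from the standard library. In this sketch `run_model` is a hypothetical stand-in for an expensive inference call and the cache size is an assumed value; serving.py may implement its cache differently.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts real model invocations, for demonstration only


def run_model(description: str) -> str:
    """Hypothetical stand-in for an expensive inference call."""
    CALLS["n"] += 1
    return "fintech" if "bank" in description.lower() else "other"


@lru_cache(maxsize=1024)  # assumed size; least recently used entries are evicted
def cached_classify(description: str) -> str:
    """Return the cached result when the same description was seen recently."""
    return run_model(description)
```

Calling `cached_classify` twice with the same description invokes `run_model` once; `cached_classify.cache_info()` reports hits and misses.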

4. MLOps & Agile Integration

MLOps Module (backend/mlops/)

| Component | Description |
| --- | --- |
| Data Drift Detection | PSI (categorical) + KS-test (numerical) |
| Model Monitor | Latency, errors, confidence tracking, auto-alerts |
| Model Registry | Version control, A/B comparison, promotion |
| CI/CD Pipeline | GitHub Actions workflow (test → validate → train → deploy) |
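
The two drift measures named above can be written in a few lines of plain Python. This is a sketch of the standard definitions, not the project's mlops code: the function names and the epsilon smoothing are assumptions, and `ks_statistic` returns only the two-sample KS distance, without a p-value.

```python
import math
from bisect import bisect_right
from collections import Counter


def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of categorical values."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for cat in set(expected) | set(actual):
        # eps keeps the log finite when a category is missing from one sample.
        e = max(e_counts[cat] / len(expected), eps)
        a = max(a_counts[cat] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


def ks_statistic(x: list, y: list) -> float:
    """Largest gap between the empirical CDFs of two numeric samples."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap
```

Identical samples give 0 for both measures; the alert thresholds are left to the caller.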

MLOps API Endpoints (/api/mlops/)

  • GET /api/mlops/drift/check - Run drift detection on current data
  • GET /api/mlops/drift/history - Historical drift reports
  • GET /api/mlops/monitor/metrics - Real-time model metrics
  • GET /api/mlops/monitor/alerts - Recent alerts (latency, degradation)
  • GET /api/mlops/registry/models - All model versions
  • GET /api/mlops/registry/compare?model=...&v1=...&v2=... - Version diff
  • GET /api/mlops/cicd/workflow - GitHub Actions YAML
  • POST /api/mlops/cicd/trigger-retrain - Manual retraining trigger

Agile Sprint Structure

| Sprint | Deliverable | Status |
| --- | --- | --- |
| Sprint 1 | Basic map UI + FastAPI backend + static data | ✅ Done |
| Sprint 2 | ETL pipeline (DPIIT/Tracxn/Crunchbase) + real data | ✅ Done |
| Sprint 3 | ML models (sector classifier + growth predictor) | ✅ Done |
| Sprint 4 | MLOps (drift detection, monitoring, CI/CD) | ✅ Done |
| Sprint 5 | Security hardening, XSS prevention, audit logging, rate limiting | ✅ Done |
| Sprint 6 | Scale: ONNX optimization, GPU serving, load testing | 🔜 Next |

CI/CD Pipeline

# Triggered on: push to backend/ml/**, scheduled weekly
test → data-validation → train → validate-performance → deploy
  • DVC for data version control
  • GitHub Actions for automated pipeline
  • HuggingFace Hub for model storage
  • Prometheus/Grafana compatible metrics

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React 18 + Vite + MapLibre GL JS + Tailwind CSS |
| Backend | FastAPI + slowapi rate limiting |
| Database | SQLite with R-Tree spatial index |
| ML Inference | HuggingFace Transformers + ONNX Runtime |
| ETL | Async Python (aiohttp) + custom pipeline |
| MLOps | Data drift (PSI/KS) + Model registry + CI/CD |
| Map Tiles | CARTO Dark Matter |
| Deployment | Docker on Hugging Face Spaces |

Map Features

Map Modes

  • Clusters (default): Server-side numbered bubble clusters at low zoom
  • Points: All entities as colored dots
  • Heatmap: Funding-weighted density visualization
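
One simple way to produce numbered clusters on the server is to bucket points into a grid whose cell size shrinks as the zoom level grows. The sketch below shows that idea only; the README does not say which clustering method the backend uses, and `cluster_points` and its cell-size rule are assumptions.

```python
from collections import defaultdict


def cluster_points(points, zoom: int) -> list:
    """Bucket (lat, lng) points into a zoom-dependent grid and summarize each cell."""
    cell = 360.0 / (2 ** zoom)  # assumed rule: cell size in degrees halves per zoom level
    buckets = defaultdict(list)
    for lat, lng in points:
        buckets[(int(lat // cell), int(lng // cell))].append((lat, lng))
    clusters = []
    for members in buckets.values():
        n = len(members)
        clusters.append({
            "lat": sum(p[0] for p in members) / n,  # centroid of the cell's members
            "lng": sum(p[1] for p in members) / n,
            "count": n,  # the number shown on the bubble
        })
    return sorted(clusters, key=lambda c: c["count"], reverse=True)
```

At zoom 4, two points in Bengaluru fall into one cell and a point in Delhi into another, so the result holds one cluster of two and one of one.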

Filtering

  • Entity types, sectors, DPIIT categories, business models, stage, location
  • Special filters: unicorns, women-led, rural impact, campus startups, NSA winners
  • State filter with fly-to

Data

Curated dataset across India:

  • 125 Unicorns: Flipkart, PhonePe, Zerodha, Zomato, Swiggy, CRED, Zepto, Neysa + more
  • Women-led startups: Nykaa, Mamaearth, Sugar Cosmetics, OPEN Financial + more
  • Rural impact: DeHaat, CropIn, Stellapps, Aye Finance + more
  • Campus startups: Pixxel, TartanSense, Zepto + more
  • NSA winners: Tagged with award categories
  • DPIIT recognized: Government-recognized startups

Quick Start (Local Development)

# Backend
pip install -r requirements.txt
uvicorn backend.main:app --reload --port 7860

# Frontend (separate terminal)
cd frontend && npm install && npm run dev

# Run ETL pipeline
python -c "import asyncio; from backend.etl import ETLPipeline; asyncio.run(ETLPipeline({}).run())"

# Test ML classification
curl "http://localhost:7860/api/ml/classify/sector?description=AI-powered%20fraud%20detection%20for%20banks"

# Check data drift
curl "http://localhost:7860/api/mlops/drift/check"

# Run tests
pytest tests/test_api.py -v