---
title: Bharat Tech Atlas
emoji: 🗺️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
tags:
  - ml-intern
---

# Bharat Tech Atlas v3.3

Mapping platform for India's startup ecosystem: a curated dataset of startups, SMEs, college E-Cells, incubators, and accelerators. Now with **ML-powered insights**, **ETL data pipelines**, **MLOps monitoring**, and **security hardening**.

> **Data disclaimer**: This is a curated subset. India has 223,000+ DPIIT-registered startups (as of April 2026). Stats shown here reflect only what's in our database. Sources: DPIIT, Tracxn, Crunchbase, LinkedIn.

## What's New in v3.3

- **Security hardening**: Centralized `security.py` with XSS detection, prompt injection scanning (27 patterns), URL validation (SSRF prevention), audit logging, rate limiting, CSP generation
- **Input validation**: Stricter bounds on coordinates, funding amounts, years, startup names, query lengths
- **Frontend security**: Removed `dangerouslySetInnerHTML` from social icons, added `SafeSocialIcon` React component, URL validation on all social links
- **Chat safety**: Prompt injection detection rejects DAN/jailbreak attempts, output sanitization strips scripts
- **Audit logging**: Every request gets a `request_id`, security events logged with severity levels
- **CORS tightened**: Configurable origins, limited methods (`GET/POST/HEAD/OPTIONS`)
- **Security headers**: `X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy`, `X-Request-ID`, HSTS + CSP in production
- **Rate limits**: Chat/Agent = 30 req/min per IP, General API = 120/min
- **Query timeouts**: 10s PRAGMA busy_timeout on all entity queries
- **Tests**: 15 security tests added (injection, XSS, CORS, rate limits, coordinate validation)

## What's New in v3.2

- **Chat AI fix**: Chat button now functional; keyword-based responses work instantly, LLM mode auto-loads when transformers available
- **Search Agent**: Web search for startup analysis (gracefully
  degrades when duckduckgo-search not installed)
- **ML keyword fallback**: All ML endpoints work without GPU/transformers; sector classification via keyword rules
- **Bug fixes**: Fixed double-execution middleware, DB connection pool, missing router mounts, dead enrichment code
- **Schema migrations**: Auto-migrate DB schema on startup, seed guard for faster restarts

## What's New in v3.0

- **ML Inference API**: NLP sector classification + startup growth prediction
- **ETL Pipeline**: Automated Extract → Transform → Load from DPIIT, Tracxn, Crunchbase
- **MLOps Monitoring**: Data drift detection, model health tracking, CI/CD pipelines
- **Model Serving**: ONNX-optimized inference, caching, TorchServe/Triton-ready
- **125 Unicorns**: Full Indian unicorn dataset with funding & investors

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                     FRONTEND (React + MapLibre)                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐           │
│  │  Map UI  │  │ Sidebar  │  │ Analytics│  │   Search   │           │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └─────┬──────┘           │
│       │             │             │              │                  │
│       └─────────────┴─────────────┴──────────────┘                  │
│                 Async API Calls (Fetch)                             │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
┌──────────────────────────────────▼──────────────────────────────────┐
│                     BACKEND (FastAPI + Uvicorn)                     │
│                                                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │  /api/entities  │  │     /api/ml     │  │   /api/mlops    │      │
│  │  GeoJSON, CRUD  │  │ Classification  │  │ Drift, Monitor  │      │
│  │  Clusters, Fmt  │  │ Growth Predict  │  │ Model Registry  │      │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘      │
│           │                    │                    │               │
│  ┌────────▼────────────────────▼────────────────────▼────────┐      │
│  │             DATABASE LAYER (SQLite + R-Tree)              │      │
│  │      Spatial indexing · WAL mode · Optimized queries      │      │
│  └───────────────────────────────────────────────────────────┘      │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────┐      │
│  │                       ETL PIPELINE                        │      │
│  │  DPIIT API  → ┐                                           │      │
│  │  Tracxn API → ├→ Transform (geocode, normalize) → Load DB │      │
│  │  Crunchbase → ┘                                           │      │
│  └───────────────────────────────────────────────────────────┘      │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────┐      │
│  │                      ML MODEL SERVER                      │      │
│  │  Sector Classifier (BART/ONNX) + Growth Predictor (GBT)   │      │
│  │  LRU Cache · Batch Inference · Health Monitoring          │      │
│  └───────────────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 1. Data Pipelines & API Architecture

### ETL Pipeline (`backend/etl/`)

| Stage | Module | Description |
|-------|--------|-------------|
| **Extract** | `extract.py` | Async API callers for DPIIT, Tracxn, Crunchbase with rate limiting |
| **Transform** | `transform.py` | Geocoding, sector normalization, deduplication, slug generation |
| **Load** | `load.py` | Batch upserts to SQLite, R-Tree spatial index maintenance |
| **Orchestrator** | `pipeline.py` | Full/incremental runs, scheduling, error recovery |

### API Endpoints: ETL Management

- `GET /api/etl/status`: Pipeline status + DB stats
- `POST /api/etl/run`: Trigger full ETL pipeline
- `POST /api/etl/run/incremental?since_hours=24`: Incremental update
- `GET /api/etl/history`: Pipeline run history
- `GET /api/etl/sources`: Configured data sources info

### Data Flow

```
DPIIT Portal ──┐
Tracxn API ────┼──→ ETLExtractor (async, rate-limited)
Crunchbase ────┘         │
                         ▼
                    ETLTransformer
                    - Clean names (remove Pvt/Ltd/Inc)
                    - Normalize sectors → standard taxonomy
                    - Geocode addresses → lat/lng
                    - Deduplicate across sources
                         │
                         ▼
                    ETLLoader
                    - Batch INSERT/UPDATE
                    - R-Tree spatial index
                    - Stale record deactivation
```

---

## 2. Backend Frameworks & ML APIs

### FastAPI Backend (Production-Grade)

- **Async request handling**: concurrent ML inference without blocking
- **Rate limiting**: 60 req/min general, 10 req/min for ML endpoints
- **Input sanitization**: SQL injection prevention, query length limits
- **Auto-generated docs**: Swagger UI at `/docs`

### ML API Endpoints (`/api/ml/`)

- `GET /api/ml/classify/sector?description=...&top_k=3`: NLP sector classification
- `POST /api/ml/classify/sector/batch`: Batch classification (up to 50)
- `GET /api/ml/predict/growth/{slug}`: Growth prediction for an entity
- `GET /api/ml/predict/growth?sector=fintech&state=Karnataka`: Ranked predictions
- `GET /api/ml/health`: Model server health
- `GET /api/ml/sectors/taxonomy`: Full sector taxonomy

---

## 3. Pre-Existing Models & Serving

### Sector Classifier (`backend/ml/classifier.py`)

| Mode | Model | Latency | Accuracy |
|------|-------|---------|----------|
| Zero-shot | `facebook/bart-large-mnli` | ~200ms | ~85% |
| ONNX | Exported BART | ~50ms | ~85% |
| Keyword fallback | Rule-based | ~1ms | ~70% |

### Growth Predictor (`backend/ml/predictor.py`)

Features used:

- **Funding signal**: log-normalized funding amount
- **Team signal**: log-normalized employee count
- **Sector momentum**: market trend score (AI=0.95, EdTech=0.65)
- **Ecosystem score**: city startup density (Bengaluru=0.95)
- **Age signal**: sweet spot 2-7 years
- **Recognition signals**: DPIIT, NSA, unicorn status
- **Investor quality**: presence of top-tier VCs

### Model Serving (`backend/ml/serving.py`)

Production-ready serving with:

- **LRU Response Cache**: avoid redundant inference
- **Batch processing**: group requests for GPU efficiency
- **TorchServe adapter**: for PyTorch model deployment at scale
- **NVIDIA Triton adapter**: multi-model, dynamic batching
- **ONNX Runtime**: 3-5x speedup on CPU

---

## 4. MLOps & Agile Integration

### MLOps Module (`backend/mlops/`)

| Component | Description |
|-----------|-------------|
| **Data Drift Detection** | PSI (categorical) + KS-test (numerical) |
| **Model Monitor** | Latency, errors, confidence tracking, auto-alerts |
| **Model Registry** | Version control, A/B comparison, promotion |
| **CI/CD Pipeline** | GitHub Actions workflow (test → validate → train → deploy) |

### MLOps API Endpoints (`/api/mlops/`)

- `GET /api/mlops/drift/check`: Run drift detection on current data
- `GET /api/mlops/drift/history`: Historical drift reports
- `GET /api/mlops/monitor/metrics`: Real-time model metrics
- `GET /api/mlops/monitor/alerts`: Recent alerts (latency, degradation)
- `GET /api/mlops/registry/models`: All model versions
- `GET /api/mlops/registry/compare?model=...&v1=...&v2=...`: Version diff
- `GET /api/mlops/cicd/workflow`: GitHub Actions YAML
- `POST /api/mlops/cicd/trigger-retrain`: Manual retraining trigger

### Agile Sprint Structure

| Sprint | Deliverable | Status |
|--------|-------------|--------|
| Sprint 1 | Basic map UI + FastAPI backend + static data | ✅ Done |
| Sprint 2 | ETL pipeline (DPIIT/Tracxn/Crunchbase) + real data | ✅ Done |
| Sprint 3 | ML models (sector classifier + growth predictor) | ✅ Done |
| Sprint 4 | MLOps (drift detection, monitoring, CI/CD) | ✅ Done |
| Sprint 5 | Security hardening, XSS prevention, audit logging, rate limiting | ✅ Done |
| Sprint 6 | Scale: ONNX optimization, GPU serving, load testing | 🔜 Next |

### CI/CD Pipeline

```yaml
# Triggered on: push to backend/ml/**, scheduled weekly
test → data-validation → train → validate-performance → deploy
```

- **DVC** for data version control
- **GitHub Actions** for automated pipeline
- **HuggingFace Hub** for model storage
- **Prometheus/Grafana** compatible metrics

---

## Tech Stack

| Layer | Technology |
|-------|-----------|
| Frontend | React 18 + Vite + MapLibre GL JS + Tailwind CSS |
| Backend | FastAPI + slowapi rate limiting |
| Database | SQLite with R-Tree spatial index |
| ML Inference | HuggingFace Transformers + ONNX Runtime |
| ETL | Async Python (aiohttp) + custom pipeline |
| MLOps | Data drift (PSI/KS) + Model registry + CI/CD |
| Map Tiles | CARTO Dark Matter |
| Deployment | Docker on Hugging Face Spaces |

---

## Map Features

### Map Modes

- **Clusters** (default): Server-side numbered bubble clusters at low zoom
- **Points**: All entities as colored dots
- **Heatmap**: Funding-weighted density visualization

### Filtering

- Entity types, sectors, DPIIT categories, business models, stage, location
- Special filters: unicorns, women-led, rural impact, campus startups, NSA winners
- State filter with fly-to

---

## Data

Curated dataset across India:

- **125 Unicorns**: Flipkart, PhonePe, Zerodha, Zomato, Swiggy, CRED, Zepto, Neysa + more
- **Women-led startups**: Nykaa, Mamaearth, Sugar Cosmetics, OPEN Financial + more
- **Rural impact**: DeHaat, CropIn, Stellapps, Aye Finance + more
- **Campus startups**: Pixxel, TartanSense, Zepto + more
- **NSA winners**: Tagged with award categories
- **DPIIT recognized**: Government-recognized startups

---

## Quick Start (Local Development)

```bash
# Backend
pip install -r requirements.txt
uvicorn backend.main:app --reload --port 7860

# Frontend (separate terminal)
cd frontend && npm install && npm run dev

# Run ETL pipeline
python -c "import asyncio; from backend.etl import ETLPipeline; asyncio.run(ETLPipeline({}).run())"

# Test ML classification
curl "http://localhost:7860/api/ml/classify/sector?description=AI-powered%20fraud%20detection%20for%20banks"

# Check data drift
curl "http://localhost:7860/api/mlops/drift/check"

# Run tests
pytest tests/test_api.py -v
```
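
To make the drift check more concrete, here is a minimal sketch of the Population Stability Index (PSI) that the MLOps module lists for categorical features. It is not the code in `backend/mlops/`: the function name `psi_categorical`, the `eps` smoothing constant, and the sample sector labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def psi_categorical(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of category labels.

    Hypothetical helper for illustration; not taken from backend/mlops/.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    exp_counts, act_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(expected) | set(actual):
        # eps keeps the logarithm finite when a category is absent from one sample
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


# Made-up sector distributions: a baseline snapshot and a shifted current snapshot
baseline = ["fintech"] * 50 + ["edtech"] * 30 + ["agritech"] * 20
current = ["fintech"] * 30 + ["edtech"] * 30 + ["agritech"] * 40

print(round(psi_categorical(baseline, baseline), 4))  # identical samples give 0.0
print(round(psi_categorical(baseline, current), 4))   # about 0.2408
```

A commonly cited rule of thumb treats PSI below roughly 0.1 as stable and above roughly 0.25 as significant drift. The thresholds this project actually uses are not stated here, so check the response of `/api/mlops/drift/check` rather than relying on those numbers.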