metadata
title: Bharat Tech Atlas
emoji: 🗺️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
tags:
- ml-intern
Bharat Tech Atlas v3.3
A mapping platform for India's startup ecosystem: a curated dataset of startups, SMEs, college E-Cells, incubators, and accelerators. Now with ML-powered insights, ETL data pipelines, MLOps monitoring, and security hardening.
Data disclaimer: This is a curated subset. India has 223,000+ DPIIT-registered startups (as of April 2026). Stats shown here reflect only what's in our database. Sources: DPIIT, Tracxn, Crunchbase, LinkedIn.
What's New in v3.3
- Security hardening: Centralized `security.py` with XSS detection, prompt injection scanning (27 patterns), URL validation (SSRF prevention), audit logging, rate limiting, CSP generation
- Input validation: Stricter bounds on coordinates, funding amounts, years, startup names, query lengths
- Frontend security: Removed `dangerouslySetInnerHTML` from social icons, added `SafeSocialIcon` React component, URL validation on all social links
- Chat safety: Prompt injection detection rejects DAN/jailbreak attempts, output sanitization strips scripts
- Audit logging: Every request gets a `request_id`, security events logged with severity levels
- CORS tightened: Configurable origins, limited methods (`GET/POST/HEAD/OPTIONS`)
- Security headers: `X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy`, `X-Request-ID`, HSTS + CSP in production
- Rate limits: Chat/Agent = 30 req/min per IP, General API = 120/min
- Query timeouts: 10s PRAGMA busy_timeout on all entity queries
- Tests: 15 security tests added (injection, XSS, CORS, rate limits, coordinate validation)
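The URL validation item above can be illustrated with a small standard-library check. This is only a sketch of the general SSRF-prevention idea, not the contents of `security.py`: the function name `is_safe_url` and the allowed-scheme set are assumptions, and a fuller check would also resolve DNS names and re-test the resulting addresses.

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web links are accepted


def is_safe_url(url: str) -> bool:
    """Reject non-web schemes and IP literals that point at internal networks."""
    try:
        parsed = urlparse(url)
    except ValueError:  # e.g. a malformed IPv6 literal
        return False
    host = parsed.hostname
    if parsed.scheme not in ALLOWED_SCHEMES or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a DNS name; resolving and re-checking it is out of scope here
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)
```

Calling `is_safe_url` on each social link before rendering or fetching it would block targets such as `http://169.254.169.254/`, the usual cloud metadata address.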
What's New in v3.2
- Chat AI fix: Chat button now functional; keyword-based responses work instantly, and LLM mode auto-loads when transformers is available
- Search Agent: Web search for startup analysis (gracefully degrades when duckduckgo-search not installed)
- ML keyword fallback: All ML endpoints work without GPU/transformers, using keyword rules for sector classification
- Bug fixes: Fixed double-execution middleware, DB connection pool, missing router mounts, dead enrichment code
- Schema migrations: Auto-migrate DB schema on startup, seed guard for faster restarts
What's New in v3.0
- ML Inference API: NLP sector classification + startup growth prediction
- ETL Pipeline: Automated Extract → Transform → Load from DPIIT, Tracxn, Crunchbase
- MLOps Monitoring: Data drift detection, model health tracking, CI/CD pipelines
- Model Serving: ONNX-optimized inference, caching, TorchServe/Triton-ready
- 125 Unicorns: Full Indian unicorn dataset with funding & investors
Architecture
┌──────────────────────────────────────────────────────────────────┐
│                   FRONTEND (React + MapLibre)                    │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐      │
│  │  Map UI   │  │  Sidebar  │  │ Analytics │  │  Search   │      │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘      │
│        └──────────────┴──────────────┴──────────────┘            │
│                     Async API Calls (Fetch)                      │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
┌────────────────────────────────▼─────────────────────────────────┐
│                   BACKEND (FastAPI + Uvicorn)                    │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │ /api/entities   │  │ /api/ml         │  │ /api/mlops      │   │
│  │ GeoJSON, CRUD   │  │ Classification  │  │ Drift, Monitor  │   │
│  │ Clusters, Fmt   │  │ Growth Predict  │  │ Model Registry  │   │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘   │
│           │                    │                    │            │
│  ┌────────▼────────────────────▼────────────────────▼────────┐   │
│  │             DATABASE LAYER (SQLite + R-Tree)              │   │
│  │      Spatial indexing · WAL mode · Optimized queries      │   │
│  └───────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │                       ETL PIPELINE                        │   │
│  │ DPIIT API  ─┐                                             │   │
│  │ Tracxn API ─┼─→ Transform (geocode, normalize) → Load DB  │   │
│  │ Crunchbase ─┘                                             │   │
│  └───────────────────────────────────────────────────────────┘   │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │                      ML MODEL SERVER                      │   │
│  │  Sector Classifier (BART/ONNX) + Growth Predictor (GBT)   │   │
│  │  LRU Cache · Batch Inference · Health Monitoring          │   │
│  └───────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘
1. Data Pipelines & API Architecture
ETL Pipeline (backend/etl/)
| Stage | Module | Description |
|---|---|---|
| Extract | `extract.py` | Async API callers for DPIIT, Tracxn, Crunchbase with rate limiting |
| Transform | `transform.py` | Geocoding, sector normalization, deduplication, slug generation |
| Load | `load.py` | Batch upserts to SQLite, R-Tree spatial index maintenance |
| Orchestrator | `pipeline.py` | Full/incremental runs, scheduling, error recovery |
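To make the Transform row concrete, here is a minimal sketch of name cleaning, slug generation, and deduplication. It is not the code in `transform.py`: the helper names, the suffix list, and the rule of keeping the first record per slug are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: legal suffixes worth stripping before records are compared.
_SUFFIX = re.compile(r"\b(pvt|private|ltd|limited|inc)\b\.?", re.IGNORECASE)


def clean_name(name: str) -> str:
    """Drop common legal suffixes and collapse leftover whitespace."""
    stripped = _SUFFIX.sub("", name)
    return re.sub(r"\s+", " ", stripped).strip(" ,.")


def slugify(name: str) -> str:
    """Build a lowercase ASCII slug, e.g. 'Aye Finance' -> 'aye-finance'."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each slug, preserving input order."""
    seen: dict[str, dict] = {}
    for record in records:
        slug = slugify(clean_name(record["name"]))
        seen.setdefault(slug, {**record, "slug": slug})
    return list(seen.values())
```

Keeping the first record means source order decides which copy wins; a real pipeline might instead merge fields from every source.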
API Endpoints: ETL Management
- `GET /api/etl/status`: Pipeline status + DB stats
- `POST /api/etl/run`: Trigger full ETL pipeline
- `POST /api/etl/run/incremental?since_hours=24`: Incremental update
- `GET /api/etl/history`: Pipeline run history
- `GET /api/etl/sources`: Configured data sources info
Data Flow
DPIIT Portal ──┐
Tracxn API ────┼──→ ETLExtractor (async, rate-limited)
Crunchbase ────┘         │
                         ▼
                    ETLTransformer
                    - Clean names (remove Pvt/Ltd/Inc)
                    - Normalize sectors → standard taxonomy
                    - Geocode addresses → lat/lng
                    - Deduplicate across sources
                         │
                         ▼
                    ETLLoader
                    - Batch INSERT/UPDATE
                    - R-Tree spatial index
                    - Stale record deactivation
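The load step (batch upserts plus R-Tree index maintenance) can be sketched with Python's built-in `sqlite3` module. The table and column names below are hypothetical, the sample rows are placeholders, and the sketch assumes the local SQLite build has the R-Tree module and upsert syntax (SQLite 3.24 or later).

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA busy_timeout = 10000")  # wait up to 10 s on a locked database
    conn.executescript(
        """
        CREATE TABLE entities (
            id INTEGER PRIMARY KEY, slug TEXT UNIQUE, name TEXT, lat REAL, lng REAL
        );
        CREATE VIRTUAL TABLE entities_rtree
            USING rtree(id, min_lng, max_lng, min_lat, max_lat);
        """
    )
    return conn


def upsert_batch(conn, rows):
    """Insert or update (slug, name, lat, lng) rows and refresh their index entries."""
    with conn:  # one transaction per batch
        for slug, name, lat, lng in rows:
            conn.execute(
                "INSERT INTO entities (slug, name, lat, lng) VALUES (?, ?, ?, ?) "
                "ON CONFLICT(slug) DO UPDATE SET "
                "name = excluded.name, lat = excluded.lat, lng = excluded.lng",
                (slug, name, lat, lng),
            )
            (entity_id,) = conn.execute(
                "SELECT id FROM entities WHERE slug = ?", (slug,)
            ).fetchone()
            # A point is stored as a zero-area box; replace any stale entry.
            conn.execute("DELETE FROM entities_rtree WHERE id = ?", (entity_id,))
            conn.execute(
                "INSERT INTO entities_rtree VALUES (?, ?, ?, ?, ?)",
                (entity_id, lng, lng, lat, lat),
            )


def slugs_in_bbox(conn, west, south, east, north):
    """Return slugs of entities whose point lies inside the bounding box."""
    query = (
        "SELECT e.slug FROM entities_rtree r JOIN entities e ON e.id = r.id "
        "WHERE r.min_lng >= ? AND r.max_lng <= ? AND r.min_lat >= ? AND r.max_lat <= ? "
        "ORDER BY e.slug"
    )
    return [row[0] for row in conn.execute(query, (west, east, south, north))]
```

A map viewport query then becomes a single `slugs_in_bbox` call, with the R-Tree narrowing candidates before the join.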
2. Backend Frameworks & ML APIs
FastAPI Backend (Production-Grade)
- Async request handling: concurrent ML inference without blocking
- Rate limiting: 60 req/min general, 10 req/min for ML endpoints
- Input sanitization: SQL injection prevention, query length limits
- Auto-generated docs: Swagger UI at `/docs`
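The per-client limits listed above follow a pattern that can be sketched as a sliding-window counter. The project's stack names slowapi for this; the class below is a standard-library illustration of the idea only, with a hypothetical name and an injectable clock, and it does not use or mirror slowapi's API.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Forget timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Two instances, for example `SlidingWindowLimiter(60)` for general routes and `SlidingWindowLimiter(10)` for ML routes, keyed by client IP, would reproduce the limits quoted in the list.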
ML API Endpoints (/api/ml/)
- `GET /api/ml/classify/sector?description=...&top_k=3`: NLP sector classification
- `POST /api/ml/classify/sector/batch`: Batch classification (up to 50)
- `GET /api/ml/predict/growth/{slug}`: Growth prediction for an entity
- `GET /api/ml/predict/growth?sector=fintech&state=Karnataka`: Ranked predictions
- `GET /api/ml/health`: Model server health
- `GET /api/ml/sectors/taxonomy`: Full sector taxonomy
3. Pre-Existing Models & Serving
Sector Classifier (backend/ml/classifier.py)
| Mode | Model | Latency | Accuracy |
|---|---|---|---|
| Zero-shot | `facebook/bart-large-mnli` | ~200ms | ~85% |
| ONNX | Exported BART | ~50ms | ~85% |
| Keyword fallback | Rule-based | ~1ms | ~70% |
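The keyword fallback row can be pictured as a lookup over per-sector keyword lists. The rules below are invented for illustration (the project's taxonomy and keywords are not shown here), and scoring by share of matched keywords is an assumption, not the documented behaviour of `classifier.py`.

```python
# Hypothetical keyword rules; the real taxonomy and keyword lists are not shown here.
SECTOR_KEYWORDS = {
    "fintech": ["payment", "bank", "lending", "fraud", "insurance"],
    "edtech": ["learning", "course", "student", "tutor"],
    "agritech": ["farm", "crop", "agri", "dairy"],
}


def classify_sector(description: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank sectors by their share of matched keywords (substring match)."""
    text = description.lower()
    counts = {
        sector: sum(keyword in text for keyword in keywords)
        for sector, keywords in SECTOR_KEYWORDS.items()
    }
    total = sum(counts.values())
    if total == 0:
        return []  # nothing matched; the caller decides on a default sector
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sector, count / total) for sector, count in ranked[:top_k] if count > 0]
```

Because it needs no model weights, a rule set like this can answer in well under a millisecond, which is the trade-off the table describes: lower accuracy for near-zero latency.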
Growth Predictor (backend/ml/predictor.py)
Features used:
- Funding signal: log-normalized funding amount
- Team signal: log-normalized employee count
- Sector momentum: market trend score (AI=0.95, EdTech=0.65)
- Ecosystem score: city startup density (Bengaluru=0.95)
- Age signal: sweet spot of 2-7 years
- Recognition signals: DPIIT, NSA, unicorn status
- Investor quality: presence of top-tier VCs
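A few of these features can be sketched as plain functions. Only the three scores quoted in the list (AI, EdTech, Bengaluru) come from this README; the caps, the default score, the linear decay outside the 2-7 year window, and all function names are assumptions, and the sketch stops at feature construction rather than the predictor itself.

```python
import math

SECTOR_MOMENTUM = {"AI": 0.95, "EdTech": 0.65}  # values quoted in the feature list
ECOSYSTEM_SCORE = {"Bengaluru": 0.95}           # value quoted in the feature list
DEFAULT_SCORE = 0.5      # assumption: fallback for sectors and cities not listed
FUNDING_CAP_USD = 1e9    # assumption: funding at or above this maps to 1.0
TEAM_CAP = 10_000        # assumption: headcount at or above this maps to 1.0


def log_norm(value: float, cap: float) -> float:
    """Map a non-negative value onto [0, 1] on a log scale, clipping at the cap."""
    return min(math.log1p(max(value, 0.0)) / math.log1p(cap), 1.0)


def age_signal(age_years: float) -> float:
    """Return 1.0 inside the 2-7 year sweet spot, decaying linearly outside it."""
    if 2 <= age_years <= 7:
        return 1.0
    distance = 2 - age_years if age_years < 2 else age_years - 7
    return max(0.0, 1.0 - 0.2 * distance)  # assumption: lose 0.2 per year


def build_features(funding_usd, employees, sector, city, age_years) -> dict:
    """Assemble the numeric feature vector as a dict of named signals."""
    return {
        "funding": log_norm(funding_usd, FUNDING_CAP_USD),
        "team": log_norm(employees, TEAM_CAP),
        "sector_momentum": SECTOR_MOMENTUM.get(sector, DEFAULT_SCORE),
        "ecosystem": ECOSYSTEM_SCORE.get(city, DEFAULT_SCORE),
        "age": age_signal(age_years),
    }
```

The log scaling keeps a handful of very large funding rounds from dominating the feature range.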
Model Serving (backend/ml/serving.py)
Production-ready serving with:
- LRU Response Cache: avoid redundant inference
- Batch processing: group requests for GPU efficiency
- TorchServe adapter: for PyTorch model deployment at scale
- NVIDIA Triton adapter: multi-model, dynamic batching
- ONNX Runtime: 3-5x speedup on CPU
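The response cache in the list above is the easiest piece to illustrate. The class below is a generic least-recently-used cache built on `OrderedDict`; its name, capacity default, and `get_or_compute` method are assumptions rather than the interface of `serving.py`.

```python
from collections import OrderedDict


class LRUCache:
    """Cache inference results, evicting the least recently used entry when full."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(key)  # e.g. a call into the model
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # drop the oldest entry
        return value
```

Keying the cache on the normalized request text means repeated classification of the same description skips the model entirely.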
4. MLOps & Agile Integration
MLOps Module (backend/mlops/)
| Component | Description |
|---|---|
| Data Drift Detection | PSI (categorical) + KS-test (numerical) |
| Model Monitor | Latency, errors, confidence tracking, auto-alerts |
| Model Registry | Version control, A/B comparison, promotion |
| CI/CD Pipeline | GitHub Actions workflow (test → validate → train → deploy) |
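The two drift statistics named in the table can be written in a few lines each. This is a standard-library sketch of the textbook definitions, not the project's module: the function names and the `eps` floor are assumptions, and both functions expect non-empty samples.

```python
import math
from collections import Counter


def psi(expected: list[str], actual: list[str], eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical samples."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(e_counts) | set(a_counts):
        # Floor each share at eps so unseen categories do not produce log(0).
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


def ks_statistic(x: list[float], y: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        value = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= value:
            i += 1
        while j < len(ys) and ys[j] <= value:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap
```

A common rule of thumb treats a PSI above roughly 0.25 as significant drift; whether this project uses that threshold is not stated here.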
MLOps API Endpoints (/api/mlops/)
- `GET /api/mlops/drift/check`: Run drift detection on current data
- `GET /api/mlops/drift/history`: Historical drift reports
- `GET /api/mlops/monitor/metrics`: Real-time model metrics
- `GET /api/mlops/monitor/alerts`: Recent alerts (latency, degradation)
- `GET /api/mlops/registry/models`: All model versions
- `GET /api/mlops/registry/compare?model=...&v1=...&v2=...`: Version diff
- `GET /api/mlops/cicd/workflow`: GitHub Actions YAML
- `POST /api/mlops/cicd/trigger-retrain`: Manual retraining trigger
Agile Sprint Structure
| Sprint | Deliverable | Status |
|---|---|---|
| Sprint 1 | Basic map UI + FastAPI backend + static data | Done |
| Sprint 2 | ETL pipeline (DPIIT/Tracxn/Crunchbase) + real data | Done |
| Sprint 3 | ML models (sector classifier + growth predictor) | Done |
| Sprint 4 | MLOps (drift detection, monitoring, CI/CD) | Done |
| Sprint 5 | Security hardening, XSS prevention, audit logging, rate limiting | Done |
| Sprint 6 | Scale: ONNX optimization, GPU serving, load testing | Next |
CI/CD Pipeline
# Triggered on: push to backend/ml/**, scheduled weekly
test → data-validation → train → validate-performance → deploy
- DVC for data version control
- GitHub Actions for automated pipeline
- HuggingFace Hub for model storage
- Prometheus/Grafana compatible metrics
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | React 18 + Vite + MapLibre GL JS + Tailwind CSS |
| Backend | FastAPI + slowapi rate limiting |
| Database | SQLite with R-Tree spatial index |
| ML Inference | HuggingFace Transformers + ONNX Runtime |
| ETL | Async Python (aiohttp) + custom pipeline |
| MLOps | Data drift (PSI/KS) + Model registry + CI/CD |
| Map Tiles | CARTO Dark Matter |
| Deployment | Docker on Hugging Face Spaces |
Map Features
Map Modes
- Clusters (default): Server-side numbered bubble clusters at low zoom
- Points: All entities as colored dots
- Heatmap: Funding-weighted density visualization
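Server-side clustering at low zoom can be approximated with a simple grid. The algorithm actually used is not described here, so the sketch below is an assumption: it buckets points into square cells whose size halves with each zoom level and returns one bubble per cell.

```python
from collections import defaultdict


def cluster_points(points: list[tuple[float, float]], zoom: int) -> list[dict]:
    """Group (lat, lng) points into grid cells and return one bubble per cell."""
    cell_deg = 360.0 / (2 ** zoom)  # assumption: cell width matches a tile at this zoom
    cells = defaultdict(list)
    for lat, lng in points:
        cells[(int(lat // cell_deg), int(lng // cell_deg))].append((lat, lng))
    return [
        {
            "lat": sum(p[0] for p in members) / len(members),  # bubble sits at the centroid
            "lng": sum(p[1] for p in members) / len(members),
            "count": len(members),  # the number shown on the bubble
        }
        for members in cells.values()
    ]
```

Sending a handful of bubbles instead of every point keeps the low-zoom payload small, which is the usual reason for clustering on the server.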
Filtering
- Entity types, sectors, DPIIT categories, business models, stage, location
- Special filters: unicorns, women-led, rural impact, campus startups, NSA winners
- State filter with fly-to
Data
Curated dataset across India:
- 125 Unicorns: Flipkart, PhonePe, Zerodha, Zomato, Swiggy, CRED, Zepto, Neysa + more
- Women-led startups: Nykaa, Mamaearth, Sugar Cosmetics, OPEN Financial + more
- Rural impact: DeHaat, CropIn, Stellapps, Aye Finance + more
- Campus startups: Pixxel, TartanSense, Zepto + more
- NSA winners: Tagged with award categories
- DPIIT recognized: Government-recognized startups
Quick Start (Local Development)
# Backend
pip install -r requirements.txt
uvicorn backend.main:app --reload --port 7860
# Frontend (separate terminal)
cd frontend && npm install && npm run dev
# Run ETL pipeline
python -c "import asyncio; from backend.etl import ETLPipeline; asyncio.run(ETLPipeline({}).run())"
# Test ML classification
curl "http://localhost:7860/api/ml/classify/sector?description=AI-powered%20fraud%20detection%20for%20banks"
# Check data drift
curl "http://localhost:7860/api/mlops/drift/check"
# Run tests
pytest tests/test_api.py -v