---
title: Bharat Tech Atlas
emoji: 🗺️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
tags:
  - ml-intern
---
# Bharat Tech Atlas v3.3

A mapping platform for India's startup ecosystem, built on a curated dataset of startups, SMEs, college E-Cells, incubators, and accelerators. Now with **ML-powered insights**, **ETL data pipelines**, **MLOps monitoring**, and **security hardening**.

> **Data disclaimer**: This is a curated subset. India has 223,000+ DPIIT-registered startups (as of April 2026). Stats shown here reflect only what's in our database. Sources: DPIIT, Tracxn, Crunchbase, LinkedIn.
## What's New in v3.3

- **Security hardening**: Centralized `security.py` with XSS detection, prompt injection scanning (27 patterns), URL validation (SSRF prevention), audit logging, rate limiting, CSP generation
- **Input validation**: Stricter bounds on coordinates, funding amounts, years, startup names, query lengths
- **Frontend security**: Removed `dangerouslySetInnerHTML` from social icons, added `SafeSocialIcon` React component, URL validation on all social links
- **Chat safety**: Prompt injection detection rejects DAN/jailbreak attempts, output sanitization strips scripts
- **Audit logging**: Every request gets a `request_id`, security events logged with severity levels
- **CORS tightened**: Configurable origins, limited methods (`GET/POST/HEAD/OPTIONS`)
- **Security headers**: `X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy`, `X-Request-ID`, HSTS + CSP in production
- **Rate limits**: Chat/Agent = 30 req/min per IP, General API = 120 req/min
- **Query timeouts**: 10s `PRAGMA busy_timeout` on all entity queries
- **Tests**: 15 security tests added (injection, XSS, CORS, rate limits, coordinate validation)
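
The URL validation mentioned above can be illustrated with a minimal sketch. This is not the code in `security.py`: the function name `is_safe_url` and the allowed-scheme set are assumptions, and the check only covers literal IP addresses (a DNS name that resolves to an internal address would need a separate lookup).

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web links are accepted


def is_safe_url(url: str) -> bool:
    """Return True if the URL uses an allowed scheme and does not point at a
    literal private, loopback, link-local, or reserved address."""
    try:
        parsed = urlparse(url.strip())
        host = parsed.hostname
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parsed.scheme.lower() not in ALLOWED_SCHEMES or not host:
        return False
    if host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a DNS name; resolving and re-checking it is out of scope here
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)
```

A `javascript:` link or a cloud metadata address such as `http://169.254.169.254/` is rejected, while an ordinary `https://` profile link passes.
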
## What's New in v3.2

- **Chat AI fix**: 🤖 button now functional; keyword-based responses work instantly, and LLM mode auto-loads when `transformers` is available
- **Search Agent**: Web search for startup analysis (gracefully degrades when `duckduckgo-search` is not installed)
- **ML keyword fallback**: All ML endpoints work without GPU/transformers; sector classification falls back to keyword rules
- **Bug fixes**: Fixed double-execution middleware, DB connection pool, missing router mounts, dead enrichment code
- **Schema migrations**: Auto-migrate DB schema on startup, seed guard for faster restarts
## What's New in v3.0

- **ML Inference API**: NLP sector classification + startup growth prediction
- **ETL Pipeline**: Automated Extract → Transform → Load from DPIIT, Tracxn, Crunchbase
- **MLOps Monitoring**: Data drift detection, model health tracking, CI/CD pipelines
- **Model Serving**: ONNX-optimized inference, caching, TorchServe/Triton-ready
- **125 Unicorns**: Full Indian unicorn dataset with funding & investors

---
## Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                     FRONTEND (React + MapLibre)                      │
│     ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐        │
│     │  Map UI  │   │ Sidebar  │   │ Analytics│   │  Search  │        │
│     └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘        │
│          │              │              │              │              │
│          └──────────────┴──────────────┴──────────────┘              │
│                       Async API Calls (Fetch)                        │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
┌──────────────────────────────────▼───────────────────────────────────┐
│                     BACKEND (FastAPI + Uvicorn)                      │
│                                                                      │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│   │ /api/entities   │  │ /api/ml         │  │ /api/mlops      │      │
│   │ GeoJSON, CRUD   │  │ Classification  │  │ Drift, Monitor  │      │
│   │ Clusters, Fmt   │  │ Growth Predict  │  │ Model Registry  │      │
│   └────────┬────────┘  └────────┬────────┘  └────────┬────────┘      │
│            │                    │                    │               │
│   ┌────────▼────────────────────▼────────────────────▼────────┐      │
│   │             DATABASE LAYER (SQLite + R-Tree)              │      │
│   │      Spatial indexing · WAL mode · Optimized queries      │      │
│   └───────────────────────────────────────────────────────────┘      │
│                                                                      │
│   ┌───────────────────────────────────────────────────────────┐      │
│   │                       ETL PIPELINE                        │      │
│   │  DPIIT API  ─┐                                            │      │
│   │  Tracxn API ─┼─→ Transform (geocode, normalize) → Load DB │      │
│   │  Crunchbase ─┘                                            │      │
│   └───────────────────────────────────────────────────────────┘      │
│                                                                      │
│   ┌───────────────────────────────────────────────────────────┐      │
│   │                      ML MODEL SERVER                      │      │
│   │  Sector Classifier (BART/ONNX) + Growth Predictor (GBT)   │      │
│   │      LRU Cache · Batch Inference · Health Monitoring      │      │
│   └───────────────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────────────┘
```
---

## 1. Data Pipelines & API Architecture

### ETL Pipeline (`backend/etl/`)

| Stage | Module | Description |
|-------|--------|-------------|
| **Extract** | `extract.py` | Async API callers for DPIIT, Tracxn, Crunchbase with rate limiting |
| **Transform** | `transform.py` | Geocoding, sector normalization, deduplication, slug generation |
| **Load** | `load.py` | Batch upserts to SQLite, R-Tree spatial index maintenance |
| **Orchestrator** | `pipeline.py` | Full/incremental runs, scheduling, error recovery |
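
The Load stage maintains an R-Tree spatial index. The sketch below shows how such an index can answer bounding-box queries using Python's built-in `sqlite3`, assuming the SQLite build includes the R*Tree module. The table and column names (`entity_rtree`, `min_lng`, ...) and the helper `ids_in_bbox` are illustrative and not taken from `load.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA busy_timeout = 10000")  # 10 s, matching the v3.3 query timeout

# Hypothetical table: an R*Tree row stores an id plus a min/max pair per dimension.
conn.execute(
    "CREATE VIRTUAL TABLE entity_rtree "
    "USING rtree(id, min_lng, max_lng, min_lat, max_lat)"
)

# Sample points as (id, lng, lat): Bengaluru, Mumbai, New Delhi.
points = [(1, 77.5946, 12.9716), (2, 72.8777, 19.0760), (3, 77.2090, 28.6139)]
conn.executemany(
    "INSERT INTO entity_rtree VALUES (?, ?, ?, ?, ?)",
    # A point is stored as a degenerate box: min == max in each dimension.
    [(pid, lng, lng, lat, lat) for pid, lng, lat in points],
)


def ids_in_bbox(west: float, south: float, east: float, north: float) -> list:
    """Return ids of entities whose point lies inside the bounding box."""
    rows = conn.execute(
        "SELECT id FROM entity_rtree "
        "WHERE min_lng >= ? AND max_lng <= ? AND min_lat >= ? AND max_lat <= ? "
        "ORDER BY id",
        (west, east, south, north),
    )
    return [row[0] for row in rows]
```

A map viewport maps directly onto the four bounds, so `ids_in_bbox(77.0, 12.0, 78.0, 13.0)` selects only the Bengaluru point.
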
### API Endpoints: ETL Management

- `GET /api/etl/status`: Pipeline status + DB stats
- `POST /api/etl/run`: Trigger full ETL pipeline
- `POST /api/etl/run/incremental?since_hours=24`: Incremental update
- `GET /api/etl/history`: Pipeline run history
- `GET /api/etl/sources`: Configured data sources info
### Data Flow

```
DPIIT Portal ───┐
Tracxn API ─────┼──→ ETLExtractor (async, rate-limited)
Crunchbase ─────┘          │
                           ▼
                     ETLTransformer
                     - Clean names (remove Pvt/Ltd/Inc)
                     - Normalize sectors → standard taxonomy
                     - Geocode addresses → lat/lng
                     - Deduplicate across sources
                           │
                           ▼
                       ETLLoader
                     - Batch INSERT/UPDATE
                     - R-Tree spatial index
                     - Stale record deactivation
```
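
Three of the transform steps in the flow above (name cleaning, slug generation, cross-source deduplication) can be sketched as follows. This is an illustration under stated assumptions, not the contents of `transform.py`: the suffix list, the function names, and the "first record wins, later records fill gaps" merge rule are all hypothetical.

```python
import re
import unicodedata

# Assumed list of legal suffixes; the real transformer may use a different one.
_SUFFIXES = re.compile(r"\b(pvt|private|ltd|limited|inc|llp)\b\.?", re.IGNORECASE)


def clean_name(name: str) -> str:
    """Drop legal suffixes, collapse whitespace, trim trailing punctuation."""
    stripped = _SUFFIXES.sub(" ", name)
    collapsed = re.sub(r"\s+", " ", stripped)
    return re.sub(r"[\s,.]+$", "", collapsed).strip()


def slugify(name: str) -> str:
    """Lowercase ASCII slug with hyphens between alphanumeric runs."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one record per slug, filling its empty fields from later duplicates."""
    merged: dict[str, dict] = {}
    for record in records:
        name = clean_name(record["name"])
        slug = slugify(name)
        current = merged.setdefault(slug, {"slug": slug, "name": name})
        for key, value in record.items():
            if key != "name" and current.get(key) in (None, ""):
                current[key] = value
    return list(merged.values())
```

With this rule, `"Example Foods Pvt. Ltd."` from one source and `"Example Foods Limited"` from another collapse into a single `example-foods` record.
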
---

## 2. Backend Frameworks & ML APIs

### FastAPI Backend (Production-Grade)

- **Async request handling**: concurrent ML inference without blocking
- **Rate limiting**: 60 req/min general, 10 req/min for ML endpoints
- **Input sanitization**: SQL injection prevention, query length limits
- **Auto-generated docs**: Swagger UI at `/docs`
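
Per-IP rate limiting of the kind listed above can be pictured with a small sliding-window limiter. The project's tech stack names `slowapi` for this; the class below is a standalone stand-in written for illustration, and its name and interface are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within a rolling time window."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# One limiter per endpoint group, using the limits quoted in this section.
general_limiter = SlidingWindowLimiter(limit=60)
ml_limiter = SlidingWindowLimiter(limit=10)
```

A request handler would call `ml_limiter.allow(client_ip)` and answer with HTTP 429 when it returns `False`.
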
### ML API Endpoints (`/api/ml/`)

- `GET /api/ml/classify/sector?description=...&top_k=3`: NLP sector classification
- `POST /api/ml/classify/sector/batch`: Batch classification (up to 50)
- `GET /api/ml/predict/growth/{slug}`: Growth prediction for an entity
- `GET /api/ml/predict/growth?sector=fintech&state=Karnataka`: Ranked predictions
- `GET /api/ml/health`: Model server health
- `GET /api/ml/sectors/taxonomy`: Full sector taxonomy
---

## 3. Pre-Existing Models & Serving

### Sector Classifier (`backend/ml/classifier.py`)

| Mode | Model | Latency | Accuracy |
|------|-------|---------|----------|
| Zero-shot | `facebook/bart-large-mnli` | ~200ms | ~85% |
| ONNX | Exported BART | ~50ms | ~85% |
| Keyword fallback | Rule-based | ~1ms | ~70% |
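
The keyword fallback row can be pictured as a simple rule-based scorer. The sketch below is illustrative only: the keyword lists and the function name `classify_sector` are made up here, and the real taxonomy is whatever `/api/ml/sectors/taxonomy` returns.

```python
import re

# Hypothetical keyword prefixes per sector, for illustration only.
SECTOR_KEYWORDS = {
    "fintech": ("bank", "payment", "lending", "fraud", "upi", "insurance"),
    "edtech": ("learning", "student", "course", "tutoring", "exam"),
    "agritech": ("farmer", "crop", "dairy", "agri", "harvest"),
    "healthtech": ("patient", "clinic", "diagnostic", "telemedicine", "hospital"),
}


def classify_sector(description: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank sectors by the share of matching words in the description."""
    tokens = re.findall(r"[a-z]+", description.lower())
    counts = {
        # A token matches when it starts with any keyword ("banks" -> "bank").
        sector: sum(1 for token in tokens if token.startswith(prefixes))
        for sector, prefixes in SECTOR_KEYWORDS.items()
    }
    total = sum(counts.values())
    if total == 0:
        return []  # nothing matched; the caller decides on a default
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(sector, count / total) for sector, count in ranked[:top_k] if count > 0]
```

Because it needs no model weights, a scorer like this runs in well under a millisecond, which is consistent with the latency and accuracy trade-off in the table.
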
### Growth Predictor (`backend/ml/predictor.py`)

Features used:

- **Funding signal**: log-normalized funding amount
- **Team signal**: log-normalized employee count
- **Sector momentum**: market trend score (AI=0.95, EdTech=0.65)
- **Ecosystem score**: city startup density (Bengaluru=0.95)
- **Age signal**: sweet spot 2-7 years
- **Recognition signals**: DPIIT, NSA, unicorn status
- **Investor quality**: presence of top-tier VCs
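
To make the feature list concrete, here is a sketch of how such signals could be computed and combined. It is not the model in `predictor.py`: a plain weighted sum stands in for the gradient-boosted model, and the weights, caps, default scores, decay rate, and field names (`funding_usd`, `employees`, ...) are all assumptions. Only the three quoted scores (AI, EdTech, Bengaluru) come from the list above.

```python
import math

SECTOR_MOMENTUM = {"ai": 0.95, "edtech": 0.65}  # values quoted in the feature list
ECOSYSTEM_SCORE = {"bengaluru": 0.95}           # likewise
# Hypothetical weights, chosen only so that they sum to 1.0.
WEIGHTS = {"funding": 0.25, "team": 0.15, "sector": 0.20, "ecosystem": 0.15,
           "age": 0.10, "recognition": 0.10, "investors": 0.05}


def log_signal(value: float, cap: float) -> float:
    """Map a non-negative value to [0, 1] on a log scale, saturating at `cap`."""
    return min(1.0, math.log1p(max(value, 0.0)) / math.log1p(cap))


def age_signal(age_years: float) -> float:
    """1.0 inside the 2-7 year sweet spot, decaying linearly outside it."""
    if 2 <= age_years <= 7:
        return 1.0
    distance = 2 - age_years if age_years < 2 else age_years - 7
    return max(0.0, 1.0 - 0.2 * distance)


def growth_score(entity: dict) -> float:
    """Weighted sum of the signals; unknown sectors and cities score 0.5."""
    signals = {
        "funding": log_signal(entity.get("funding_usd", 0), cap=1e9),
        "team": log_signal(entity.get("employees", 0), cap=10_000),
        "sector": SECTOR_MOMENTUM.get(entity.get("sector", "").lower(), 0.5),
        "ecosystem": ECOSYSTEM_SCORE.get(entity.get("city", "").lower(), 0.5),
        "age": age_signal(entity.get("age_years", 0)),
        "recognition": 1.0 if entity.get("dpiit_recognized") else 0.0,
        "investors": 1.0 if entity.get("top_tier_investor") else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())
```

The log scaling keeps a single very large funding round from dominating the score, which is the usual reason for log-normalizing heavy-tailed features.
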
### Model Serving (`backend/ml/serving.py`)

Production-ready serving with:

- **LRU Response Cache**: avoid redundant inference
- **Batch processing**: group requests for GPU efficiency
- **TorchServe adapter**: for PyTorch model deployment at scale
- **NVIDIA Triton adapter**: multi-model, dynamic batching
- **ONNX Runtime**: 3-5x speedup on CPU
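
An LRU response cache of the kind listed above can be sketched in a few lines with `collections.OrderedDict`. The class name and the `get_or_compute` interface are assumptions made for this example, not the API of `serving.py`.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class LRUResponseCache:
    """Cache inference results, evicting the least recently used entry when full."""

    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], object]) -> object:
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()  # run the (expensive) inference only on a miss
        self._store[key] = value
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # drop the oldest entry
        return value
```

Keying the cache on the normalized request (for example the lowercased description plus `top_k`) lets repeated classification requests skip the model entirely.
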
---

## 4. MLOps & Agile Integration

### MLOps Module (`backend/mlops/`)

| Component | Description |
|-----------|-------------|
| **Data Drift Detection** | PSI (categorical) + KS-test (numerical) |
| **Model Monitor** | Latency, errors, confidence tracking, auto-alerts |
| **Model Registry** | Version control, A/B comparison, promotion |
| **CI/CD Pipeline** | GitHub Actions workflow (test → validate → train → deploy) |
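
The two drift statistics named in the table can be written out in plain Python. This is a minimal sketch, not the code in `backend/mlops/`: the function names are illustrative, only the KS statistic is computed (no p-value), and the `eps` floor for unseen categories is an assumption. In practice `scipy.stats.ks_2samp` provides the full test.

```python
import math
from bisect import bisect_right
from collections import Counter


def psi(expected: list[str], actual: list[str], eps: float = 1e-6) -> float:
    """Population Stability Index over the categories seen in either sample."""
    exp_counts, act_counts = Counter(expected), Counter(actual)
    score = 0.0
    for category in set(exp_counts) | set(act_counts):
        # Floor each share at eps so unseen categories do not produce log(0).
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        score += (a - e) * math.log(a / e)
    return score


def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

PSI suits categorical columns such as sector or state, while the KS statistic suits numeric columns such as funding amount. A common rule of thumb treats PSI above roughly 0.25 as significant drift, though the thresholds this project uses are not stated here.
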
### MLOps API Endpoints (`/api/mlops/`)

- `GET /api/mlops/drift/check`: Run drift detection on current data
- `GET /api/mlops/drift/history`: Historical drift reports
- `GET /api/mlops/monitor/metrics`: Real-time model metrics
- `GET /api/mlops/monitor/alerts`: Recent alerts (latency, degradation)
- `GET /api/mlops/registry/models`: All model versions
- `GET /api/mlops/registry/compare?model=...&v1=...&v2=...`: Version diff
- `GET /api/mlops/cicd/workflow`: GitHub Actions YAML
- `POST /api/mlops/cicd/trigger-retrain`: Manual retraining trigger
### Agile Sprint Structure

| Sprint | Deliverable | Status |
|--------|-------------|--------|
| Sprint 1 | Basic map UI + FastAPI backend + static data | ✅ Done |
| Sprint 2 | ETL pipeline (DPIIT/Tracxn/Crunchbase) + real data | ✅ Done |
| Sprint 3 | ML models (sector classifier + growth predictor) | ✅ Done |
| Sprint 4 | MLOps (drift detection, monitoring, CI/CD) | ✅ Done |
| Sprint 5 | Security hardening, XSS prevention, audit logging, rate limiting | ✅ Done |
| Sprint 6 | Scale: ONNX optimization, GPU serving, load testing | 🔜 Next |
### CI/CD Pipeline

```yaml
# Triggered on: push to backend/ml/**, scheduled weekly
test → data-validation → train → validate-performance → deploy
```

- **DVC** for data version control
- **GitHub Actions** for automated pipeline
- **HuggingFace Hub** for model storage
- **Prometheus/Grafana** compatible metrics
---

## Tech Stack

| Layer | Technology |
|-------|-----------|
| Frontend | React 18 + Vite + MapLibre GL JS + Tailwind CSS |
| Backend | FastAPI + slowapi rate limiting |
| Database | SQLite with R-Tree spatial index |
| ML Inference | HuggingFace Transformers + ONNX Runtime |
| ETL | Async Python (aiohttp) + custom pipeline |
| MLOps | Data drift (PSI/KS) + Model registry + CI/CD |
| Map Tiles | CARTO Dark Matter |
| Deployment | Docker on Hugging Face Spaces |
---

## Map Features

### Map Modes

- **Clusters** (default): Server-side numbered bubble clusters at low zoom
- **Points**: All entities as colored dots
- **Heatmap**: Funding-weighted density visualization
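
Server-side clustering for the default map mode can be approximated with a zoom-dependent grid. The sketch below is one simple way to do it and is not necessarily how the backend clusters: the function name, the cell-size formula, and the output fields are assumptions.

```python
from collections import defaultdict


def cluster_points(points: list[tuple[float, float]], zoom: int) -> list[dict]:
    """Bucket (lng, lat) points into a grid whose cell size halves per zoom level."""
    cell = 360.0 / (2 ** zoom)  # cell width in degrees at this zoom
    buckets = defaultdict(list)
    for lng, lat in points:
        buckets[(int(lng // cell), int(lat // cell))].append((lng, lat))
    clusters = []
    for members in buckets.values():
        n = len(members)
        clusters.append({
            "count": n,  # the number shown inside the bubble
            "lng": sum(p[0] for p in members) / n,  # bubble sits at the centroid
            "lat": sum(p[1] for p in members) / n,
        })
    return sorted(clusters, key=lambda c: c["count"], reverse=True)
```

At low zoom a whole region falls into one cell and yields a single numbered bubble; zooming in shrinks the cells, so the same points split into smaller clusters.
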
### Filtering

- Entity types, sectors, DPIIT categories, business models, stage, location
- Special filters: unicorns, women-led, rural impact, campus startups, NSA winners
- State filter with fly-to

---
## Data

Curated dataset across India:

- **125 Unicorns**: Flipkart, PhonePe, Zerodha, Zomato, Swiggy, CRED, Zepto, Neysa + more
- **Women-led startups**: Nykaa, Mamaearth, Sugar Cosmetics, OPEN Financial + more
- **Rural impact**: DeHaat, CropIn, Stellapps, Aye Finance + more
- **Campus startups**: Pixxel, TartanSense, Zepto + more
- **NSA winners**: Tagged with award categories
- **DPIIT recognized**: Government-recognized startups

---
## Quick Start (Local Development)

```bash
# Backend
pip install -r requirements.txt
uvicorn backend.main:app --reload --port 7860

# Frontend (separate terminal)
cd frontend && npm install && npm run dev

# Run ETL pipeline
python -c "import asyncio; from backend.etl import ETLPipeline; asyncio.run(ETLPipeline({}).run())"

# Test ML classification
curl "http://localhost:7860/api/ml/classify/sector?description=AI-powered%20fraud%20detection%20for%20banks"

# Check data drift
curl "http://localhost:7860/api/mlops/drift/check"

# Run tests
pytest tests/test_api.py -v
```