title: BharatGraph
emoji: 🔍
colorFrom: red
colorTo: green
sdk: docker
app_file: app.py
pinned: false

BharatGraph

India's open-source investigation intelligence platform



What is BharatGraph

BharatGraph aggregates 575+ official and verified government data sources into a Neo4j knowledge graph and runs 15 parallel AI investigators to surface corruption patterns, procurement irregularities, conflicts of interest, proxy ownership structures, and institutional risk indicators across Indian public life.

Every output includes a traceable citation to its primary source document. The system never makes legal findings or accusations -- it identifies structural patterns in public data.


Live


Who can use this

| User | Purpose |
|---|---|
| Investigative journalists | Search entities, generate sourced dossiers, export PDF |
| Academic researchers | Query the knowledge graph, run pattern detection |
| Civil society organisations | Monitor procurement, track affidavit wealth trajectories |
| Government agencies | Deploy internally with full audit trails and RBAC |
| Developers | Extend via the plugin system, add new data sources |
| Students | Study the codebase -- each component has documented theory |

Architecture

Browser (frontend/index.html -- vanilla HTML/CSS/JS, no build step)
    |
    v REST + WebSocket
FastAPI (api/main.py)
    |
    +-- 19 route modules
    +-- 15 parallel AI investigators (ThreadPoolExecutor)
    +-- 6-layer DeepInvestigator
    +-- ConnectionMapper (shortest path + WHY explanations)
    +-- RuntimeProfile (auto-detects hardware, assigns LOW/MEDIUM/HIGH)
    |
    v
Neo4j AuraDB / Custom Graph Engine (planned)
    20 node types, 12 relationship types, MERGE with SHA-256 IDs

Data Sources (575+)

Official Government Sources (India)

| Source | What it provides | Official URL |
|---|---|---|
| MyNeta / ECI | Candidate affidavits, assets, criminal cases | myneta.info |
| MCA21 | Company registrations, directorships, CIN | mca.gov.in |
| GeM | Government e-Marketplace contract awards | gem.gov.in |
| CAG | Comptroller audit reports and irregularities | cag.gov.in |
| PIB | Press Information Bureau press releases | pib.gov.in |
| Lok Sabha | Parliamentary questions, division votes | loksabha.nic.in |
| SEBI | Securities enforcement orders | sebi.gov.in |
| ED | Enforcement Directorate press releases | enforcementdirectorate.gov.in |
| CVC | Central Vigilance Commission circulars | cvc.gov.in |
| Electoral Bonds | Bond transaction data (post-SC order 2024) | eci.gov.in |
| IBBI | Insolvency and Bankruptcy Board orders | ibbi.gov.in |
| NGO Darpan | NGO registration and CSR receipts | ngodarpan.gov.in |
| CPPP | Central Public Procurement Portal tenders | eprocure.gov.in |
| NCRB | Crime statistics by state and year | ncrb.gov.in |
| LGD | Local Government Directory entity codes | lgdirectory.gov.in |
| DataGov | Unified government dataset portal | data.gov.in |
| NJDG | Court pendency statistics | njdg.ecourts.gov.in |
| RTI Online | Filed RTI applications and outcomes | rtionline.gov.in |
| SFIO | Serious Fraud Investigation Office orders | sfio.nic.in |
| DGGI | GST fraud enforcement cases | cbic.gov.in |
| RBI | Bank enforcement actions and NPA data | rbi.org.in |
| IRDAI | Insurance regulatory orders | irdai.gov.in |
| DRI | Directorate of Revenue Intelligence seizures | dri.nic.in |
| BENAMI | Benami Prohibition Unit attachment orders | incometaxindia.gov.in |
| RERA (28 states) | Real estate project registrations | varies by state |
| MOSPI | Statistical data via microdata portal | microdata.gov.in |
| Lok Sabha votes | Division vote records per MP | loksabha.nic.in/Loksabha/Divisions |
| ECI Form 24A | Audited party accounts | eci.gov.in/disclosure-of-accounts |
| CPGRAMS | Citizen grievances by ministry | pgportal.gov.in |
| Social Audit | MGNREGS wage theft, ghost worker data | socialaudit.mahatma.net.in |
| NDAP | NITI Aayog district SDG and scheme data | ndap.nic.in |
| MIB registry | Media channel license holders | mib.gov.in |
| TRAI | Telecom spectrum allocations | trai.gov.in |
| NIC eBhumi | Land records digitization | dilrmp.gov.in |
| PM Kisan | Scheme disbursement data | pmkisan.gov.in |
| MGNREGS | Wage disbursements and job cards | nrega.nic.in |
| PM Awas | Housing scheme completions | pmaymis.gov.in |

International / Cross-Border Sources (Official)

| Source | What it provides | Official URL |
|---|---|---|
| ICIJ Offshore Leaks | Panama Papers, Pandora Papers, HSBC files | offshoreleaks.icij.org |
| OpenSanctions | Global sanctions and PEP lists | opensanctions.org |
| World Bank India | Project disbursements and loan data | projects.worldbank.org/en/api |
| ADB India | Asian Development Bank project portfolio | adb.org/projects/india |
| FATF India | AML/CFT risk evaluation | fatf-gafi.org |
| Wikidata | Structured biographical data for politicians | wikidata.org |
| OpenTimestamps | Bitcoin-anchored document timestamps | opentimestamps.org |

Supplementary Sources

These sources are used for corroboration only and are marked as NON_OFFICIAL_RECORD in the graph. All findings from these sources require confirmation from at least one official source before being included in a HIGH-confidence finding.

| Source | What it provides | Note |
|---|---|---|
| TCPD / LokDhaba | Historical election data 1962-2024 | Academic research dataset |
| github.com/tcpd/ppi | Politician biographical records | Academic, crowdsourced |
| github.com/in-rolls | Indian politician bios (8000+) | Community dataset |
| github.com/datameet | Constituency-level election data | Community dataset |
| IndiaVotes.com | Constituency results 1952-2024 | Private aggregator |
| Praja.org | Legislator attendance and questions | Civil society monitoring |
| Global Witness India | Natural resource corruption reports | NGO reports |
| India Election Affidavits | Parsed affidavit data (github) | github.com/Vonter |
| dataforindia.com | Multi-source India indicators | Aggregator |
| Wayback Machine CDX | Historical snapshots of government pages | Archive.org |

Capabilities

Knowledge Graph

The graph models 20 node types and 12 relationship types with stable SHA-256-derived canonical IDs. All writes use MERGE semantics -- no duplicate nodes across pipeline runs.
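As a rough illustration of how a SHA-256-derived ID and MERGE can keep reruns from duplicating nodes, here is a minimal sketch. The function name `canonical_id`, the normalisation rules, the `id`/`name`/`cin` property names, and the sample company and CIN are all assumptions made up for this example; the actual loader in graph/loader.py may differ.

```python
import hashlib


def canonical_id(label: str, *key_parts: str) -> str:
    """Derive a stable node ID from a label and its identifying fields (hypothetical helper)."""
    # Assumption: lower-case and collapse whitespace so trivial spelling
    # differences between sources map to the same ID.
    normalised = [" ".join(part.lower().split()) for part in key_parts]
    payload = "|".join([label] + normalised)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical Cypher template. MERGE matches on the ID, so a second
# pipeline run updates the existing node instead of creating a new one.
MERGE_COMPANY = (
    "MERGE (c:Company {id: $id}) "
    "SET c.name = $name, c.cin = $cin"
)

# Made-up sample values: two spellings of the same company.
first = canonical_id("Company", "Example Infra Pvt Ltd", "U12345MH2010PTC000001")
second = canonical_id("Company", "  example infra  PVT ltd ", "u12345mh2010ptc000001")
print(first == second)
```

Including the label in the hashed payload keeps a Politician and a Company with the same name from colliding on one ID.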

Node types: Politician, Company, Contract, AuditReport, Ministry, Party, Scheme, PressRelease, Tender, RegulatoryOrder, EnforcementAction, ElectoralBond, InsolvencyOrder, NGO, ParliamentQuestion, VigilanceCircular, ICIJEntity, SanctionedEntity, CourtCase, LocalBody

Relationship types: MEMBER_OF, DIRECTOR_OF, CONTESTED_IN, WON_CONTRACT, AWARDED_BY, FLAGS, MENTIONS, ISSUED_BY, ASSOCIATED_WITH, TARGETS, AUDITS, SUBJECT_OF

AI Investigation Engine

15 parallel investigators each query the graph independently. A finding confirmed by three or more independent investigators is rated HIGH confidence; a finding confirmed by exactly two is rated MODERATE confidence.

| Investigator | Focus | Weight |
|---|---|---|
| Financial | Asset growth anomaly, Benford's Law | 0.12 |
| Political | Party-contract overlap, electoral proximity | 0.10 |
| Corporate | Director networks, shell company patterns | 0.10 |
| Judicial | Court cases, FIR patterns, PMLA exposure | 0.08 |
| Procurement | Bid rigging, cartel rotation, cover bids | 0.12 |
| Network | Graph centrality, bridge entities, Fiedler | 0.08 |
| Asset | Affidavit trajectory, Kalman filter | 0.10 |
| International | ICIJ links, sanctions, offshore jurisdictions | 0.10 |
| Media | PIB mentions, controversy timeline | 0.06 |
| Historical | 5-election career trajectory | 0.08 |
| Public Interest | RTI outcome, CPGRAMS complaint rate | 0.08 |
| Doubt | Forced counterevidence, adversarial probing | 0.08 |
| Math | Fourier FFT, spectral Fiedler, Benford | 0.08 |
| Affidavit | Kalman filter wealth trajectory | 0.10 |
| Benami | 5-factor proxy ownership score | 0.09 |
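The confirmation rule for investigator findings can be sketched as a simple count of how many investigators report the same finding. This is an illustration only: the function names and the finding keys are hypothetical, and the mapping of one confirmation to LOW and zero to INSUFFICIENT is an assumption (the README names those levels but does not define their thresholds).

```python
from collections import Counter


def confidence_level(confirming_investigators: int) -> str:
    """Map a confirmation count to a confidence label."""
    if confirming_investigators >= 3:
        return "HIGH"
    if confirming_investigators == 2:
        return "MODERATE"
    # Assumption: the source does not say how 0 or 1 confirmations are labelled.
    return "LOW" if confirming_investigators == 1 else "INSUFFICIENT"


def rate_findings(findings_by_investigator: dict[str, set[str]]) -> dict[str, str]:
    """Count how many investigators reported each finding and label it."""
    counts: Counter[str] = Counter()
    for findings in findings_by_investigator.values():
        # Sets ensure each investigator counts at most once per finding.
        counts.update(findings)
    return {finding: confidence_level(n) for finding, n in counts.items()}


# Hypothetical finding keys, for illustration only.
ratings = rate_findings({
    "Financial": {"asset_growth_anomaly", "contract_overlap"},
    "Procurement": {"asset_growth_anomaly", "contract_overlap"},
    "Corporate": {"asset_growth_anomaly"},
})
print(ratings)
```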

6-layer deep investigation:

  1. Direct evidence (depth 1 graph traversal)
  2. Relationship expansion (depth 2)
  3. Structural patterns (circular ownership, ghost company)
  4. Temporal investigation (timeline construction, burst detection)
  5. Network influence (betweenness, authority/hub, community)
  6. Cross-source validation (multi-dataset corroboration)

Forensic Modules

Benami proxy detection: Director age anomaly + surname clustering + address clustering + pre-contract formation + single-director structure. Score >= 65 = HIGH.
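To make the 5-factor score and its threshold concrete, the sketch below combines five signals into a 0-100 score. The README does not state how the factors are weighted, so the equal weighting here is purely an assumption, as are the function name, the factor keys, and the idea that each signal arrives as a value between 0 and 1.

```python
FACTORS = (
    "director_age_anomaly",
    "surname_clustering",
    "address_clustering",
    "pre_contract_formation",
    "single_director_structure",
)
HIGH_THRESHOLD = 65  # from the README: score >= 65 = HIGH


def benami_score(signals: dict[str, float]) -> tuple[float, bool]:
    """Return (score on a 0-100 scale, whether it reaches the HIGH threshold)."""
    weight = 100 / len(FACTORS)  # assumption: equal weight per factor
    score = 0.0
    for factor in FACTORS:
        # Missing factors count as 0; values are clamped to the range [0, 1].
        value = min(max(signals.get(factor, 0.0), 0.0), 1.0)
        score += weight * value
    return score, score >= HIGH_THRESHOLD


score, is_high = benami_score({
    "surname_clustering": 1.0,
    "address_clustering": 1.0,
    "pre_contract_formation": 1.0,
    "single_director_structure": 1.0,
})
print(score, is_high)
```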

Affidavit wealth trajectory: Kalman filter on 5-election asset series. Expected growth = 8% fixed-deposit (FD) return + 60% salary savings. Residual > 5x = VERY HIGH.

Procurement DNA: TF-IDF cosine >= 0.72 between bid documents from separate vendors = cover-bid signal. Cartel detection via award rotation.
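The cover-bid signal can be illustrated with a small standard-library TF-IDF and cosine similarity. This is a sketch under assumptions, not the project's implementation: the tokenizer, the smoothed IDF formula (the same form scikit-learn uses by default), and the sample bid texts are all chosen for this example. Only the 0.72 threshold comes from the README.

```python
import math
import re
from collections import Counter

COVER_BID_THRESHOLD = 0.72  # from the README


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenised = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF, so terms shared by every document keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenised
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up bid texts from three hypothetical vendors.
bids = [
    "Supply and installation of 500 LED street lights with five year maintenance",
    "Supply and installation of 500 LED street lights with 5 year maintenance",
    "Catering services for annual departmental conference",
]
vec_a, vec_b, vec_c = tfidf_vectors(bids)
flag_ab = cosine(vec_a, vec_b) >= COVER_BID_THRESHOLD
flag_ac = cosine(vec_a, vec_c) >= COVER_BID_THRESHOLD
print(flag_ab, flag_ac)
```

A high score between bids from separate vendors is only a prompt for closer review, since tender templates can make honest bids look alike.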

Revolving door: Government-to-private career moves within 365-day cooling-off window. Pre-employment benefit scoring.
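The cooling-off check reduces to a date difference. In this minimal sketch the function name, the inclusive boundary at exactly 365 days, and the sample dates are assumptions.

```python
from datetime import date

COOLING_OFF_DAYS = 365  # from the README


def within_cooling_off(left_government: date, joined_private: date) -> bool:
    """True if the private-sector move falls inside the cooling-off window."""
    gap_days = (joined_private - left_government).days
    # Assumption: the window is inclusive, and moves dated before the
    # government exit are not treated as revolving-door cases here.
    return 0 <= gap_days <= COOLING_OFF_DAYS


print(within_cooling_off(date(2023, 3, 31), date(2023, 9, 1)))
print(within_cooling_off(date(2023, 3, 31), date(2024, 6, 1)))
```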

TBML indicators: Contract price 2.5-sigma anomaly, subcontract loop detection, director changes within 90 days of award.
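The 2.5-sigma price check can be sketched as a z-score against a peer group of comparable contracts. How the project chooses that peer group is not stated, so the flat list of prices, the use of the sample standard deviation, and the function name are assumptions.

```python
from statistics import mean, stdev

SIGMA_THRESHOLD = 2.5  # from the README


def price_anomalies(prices: list[float]) -> list[int]:
    """Return indexes of prices more than 2.5 standard deviations from the mean."""
    if len(prices) < 2:
        return []
    mu = mean(prices)
    sigma = stdev(prices)  # sample standard deviation
    if sigma == 0:
        return []  # all prices identical, nothing stands out
    return [i for i, p in enumerate(prices) if abs(p - mu) / sigma > SIGMA_THRESHOLD]


# Made-up peer group: nineteen similar contracts and one far more expensive.
peer_prices = [100.0] * 19 + [1000.0]
print(price_anomalies(peer_prices))
```

With very small peer groups a single outlier inflates the standard deviation enough that no value can exceed 2.5 sigma, so a check like this needs a reasonably large comparison set.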

Linguistic fingerprinting: Burrows Delta authorship attribution, template reuse detection, ghost-writing similarity scoring.

Policy-benefit causal analysis: Granger causality (lags 1-6 months), transfer entropy, CACA cross-ministry benefit chain.

Dark pattern detection: PrefixSpan on administrative event sequences. 6 pre-defined high-risk sequence patterns.

Runtime Auto-Scaling

At startup, BharatGraph detects CPU cores, RAM, GPU availability, free disk space, Docker environment, and Neo4j URI location. It assigns one of three profiles:

| Profile | CPU | RAM | Workers | Batch | Depth |
|---|---|---|---|---|---|
| LOW | 1-2 | <8GB | 2 | 25 | 2 |
| MEDIUM | 4 | 8GB | 4 | 100 | 3 |
| HIGH | 8+ | 16GB | 8 | 500 | 5 |

Force a profile: BHARATGRAPH_PROFILE=low|medium|high
Check active profile: GET /runtime

Security

  • Sliding window rate limiter: 100/min search, 30/min investigation
  • IP addresses stored as SHA-256 hashes only -- never plain text
  • HTTP security headers: CSP, HSTS, X-Frame-Options
  • Input validator blocks Cypher injection patterns
  • Append-only SHA-256 hash-chained audit log at logs/audit.jsonl
  • CORS origins via CORS_ORIGINS environment variable
  • All outputs pass validate_language() -- no accusatory vocabulary

Multilingual

22 Indian scheduled languages supported across all API endpoints. Language auto-detection via Unicode block analysis. Helsinki-NLP translation models. Cross-script entity matching for all 22 languages.
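Unicode block analysis can be sketched by counting characters per script range and taking the most frequent one. Note that this identifies the script, not the language: several scheduled languages share Devanagari, so a real detector needs a further step. The function name and the subset of blocks listed are assumptions; the sample strings use escape sequences to stay within the project's ASCII-only rule for Python files.

```python
from collections import Counter

# Unicode block ranges (inclusive) for a subset of scripts used in India.
SCRIPT_BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
    "Ol Chiki": (0x1C50, 0x1C7F),
    "Meetei Mayek": (0xABC0, 0xABFF),
}


def dominant_script(text: str) -> str:
    """Return the script whose block contains the most characters of `text`."""
    counts: Counter[str] = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_BLOCKS.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return "Unknown"  # no character fell in any listed block
    return counts.most_common(1)[0][0]


# "Bharat" written in Devanagari, as escapes.
print(dominant_script("\u092d\u093e\u0930\u0924"))
```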


Quick Start

Prerequisites

Python 3.10+
Neo4j AuraDB account (free tier: 50K nodes / 175K relationships)

Local setup

git clone https://github.com/abinaze/bharatgraph.git
cd bharatgraph

pip install -r requirements.txt

# Copy environment template and fill in secrets
cp .env.example .env
# Edit .env: set NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD

# Start the API
uvicorn api.main:app --reload --port 8000

# Open the frontend (macOS; on Linux use xdg-open instead of open)
open frontend/index.html
# (or visit http://localhost:8000/docs for the API)

Seed sample data

curl -X POST http://localhost:8000/admin/seed

Run the full pipeline

curl -X POST http://localhost:8000/admin/pipeline
# Check status:
curl http://localhost:8000/admin/pipeline/status

API Reference

Core routes

| Method | Route | Description |
|---|---|---|
| GET | /search?q= | Full-text + label-scan search across all 20 node types |
| GET | /profile/{id} | Complete entity profile with all linked data |
| GET | /risk/{id} | Composite 0-100 risk score with factor breakdown |
| GET | /investigate/{id} | 6-layer deep investigation report |
| GET | /affidavit/{id} | Kalman filter wealth trajectory |
| GET | /benami/{id} | 5-factor proxy ownership score |
| GET | /debate/{id} | 7-agent structured debate |
| GET | /adversarial/{id} | Forced counterevidence report |
| GET | /connection-map?a=&b= | Shortest path with WHY explanations |
| GET | /export/pdf/{id} | Download SHA-256-signed PDF dossier |
| GET | /runtime | Hardware profile and active settings |
| GET | /health | Service health check |
| GET | /stats | Node and relationship counts |
| POST | /admin/seed | Load sample data |
| POST | /admin/pipeline | Trigger full 21-scraper pipeline |
| WS | /ws/feed | Real-time high-signal entity feed |

Search example

curl "https://abinazebinoly-bharatgraph.hf.space/search?q=Adani&limit=5"

Investigation example

curl "https://abinazebinoly-bharatgraph.hf.space/investigate/pol_001"

Project Structure

bharatgraph/
+-- api/
|   +-- main.py               # FastAPI app, middleware, WS feed
|   +-- routes/               # 19 route modules
|   +-- middleware/            # Rate limiter, security headers, audit logger
|   +-- models.py             # Pydantic response models
|   +-- dependencies.py       # Neo4j driver injection
+-- ai/
|   +-- multi_investigator.py # 15 parallel investigators + synthesis
|   +-- deep_investigator.py  # 6-layer recursive investigation
|   +-- risk_scorer.py        # Composite 0-100 risk score
|   +-- investigators/        # 15 specialist investigator modules
|   +-- forensics/            # Benami, TBML, cartel, linguistic, policy
|   +-- self_learning/        # Pattern learner, weight optimiser, case memory
|   +-- graph_analytics.py    # PageRank, Louvain, centrality
|   +-- explainer.py          # validate_language() enforcement
+-- config/
|   +-- settings.py           # Environment config
|   +-- runtime_profile.py    # Hardware detector + profile assignment
|   +-- model_selector.py     # Profile-aware model selection
+-- graph/
|   +-- loader.py             # Neo4j loader for all 20 node types
|   +-- schema.py             # Constraints and indexes
|   +-- seed.py               # Sample data for /admin/seed
+-- processing/
|   +-- pipeline.py           # Parallel orchestrator (21 scrapers)
|   +-- cleaner.py            # Indian name normalisation
|   +-- entity_resolver.py    # Jaccard deduplication
+-- scrapers/                 # 21 scrapers for all data sources
+-- frontend/
|   +-- index.html            # Single-page app entry point
|   +-- js/                   # app.js, api.js, components.js, timeline.js
|   +-- css/                  # main.css, themes
|   +-- sw.js                 # Service worker (cache-first)
+-- blockchain/
|   +-- audit_chain.py        # Append-only SHA-256 hash chain
+-- tests/                    # pytest test suite
+-- issues/                   # GitHub issue templates per phase
+-- .github/workflows/        # CI, daily scrape, weekly learning

Deployment

HuggingFace Spaces (production)

# Set these secrets in HuggingFace Space settings:
NEO4J_URI=neo4j+s://...
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
CORS_ORIGINS=https://abinaze.github.io

# Deploy
git remote add hf https://huggingface.co/spaces/abinazebinoly/bharatgraph
git push hf main --force

Docker

docker build -t bharatgraph .
docker run -p 8000:8000 \
  -e NEO4J_URI=... \
  -e NEO4J_USER=neo4j \
  -e NEO4J_PASSWORD=... \
  bharatgraph

Legal and Ethics

BharatGraph analyses structural patterns in official public data. It does not make legal findings, accusations, or moral judgements about any individual or organisation.

All outputs use neutral analytical language enforced programmatically by validate_language(). Forbidden words include: corrupt, guilty, criminal, fraud, accused (as a judgement), fraudster, and similar accusatory terms.
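A vocabulary check of this kind can be sketched with a word-boundary regular expression. This is not the project's validate_language(): the function name `find_forbidden_terms` is hypothetical, the term list contains only the words named above, and the sketch flags every occurrence of "accused" without judging whether it is used as a judgement.

```python
import re

# Only the terms named in the README; the real list includes "similar" terms.
FORBIDDEN_TERMS = ("corrupt", "guilty", "criminal", "fraud", "accused", "fraudster")

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in FORBIDDEN_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_forbidden_terms(text: str) -> list[str]:
    """Return the forbidden terms found in `text`, lower-cased and sorted."""
    return sorted({match.group(1).lower() for match in _PATTERN.finditer(text)})


draft = "The entity is corrupt and a Fraudster."
neutral = "Structural indicator: contract awards overlap with directorship dates."
print(find_forbidden_terms(draft))
print(find_forbidden_terms(neutral))
```

Because the match is on whole words, a term such as "corruption" is not caught by this sketch unless it is added to the list explicitly.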

Every finding is labelled as a "structural indicator" and includes:

  • The specific data sources consulted
  • The confidence level (HIGH/MODERATE/LOW/INSUFFICIENT)
  • A disclaimer that this is an analytical report and not a legal finding

Users are responsible for verifying all findings through independent research before publishing. The platform is designed to assist investigation, not replace journalistic verification.

Data sourced from official government portals is used under the National Data Sharing and Accessibility Policy (NDSAP) and Open Government Data (OGD) platform terms. International data (ICIJ, OpenSanctions) is used under their respective open-data licenses.


Contributing

See CONTRIBUTING.md for the full contribution guide.

Branch naming: feature/phase-N-description or fix/issue-description

Every phase has:

  1. A GitHub issue created first (template in issues/)
  2. A feature branch
  3. Commits with descriptive messages referencing the issue number
  4. A PR with the standard description template
  5. A version tag after merge

All Python files must be pure ASCII (no Unicode in comments or strings). Run python3 -m py_compile <file> before every commit.


License

MIT License. See LICENSE.


Reference documents


Developed by Abinaze Binoy