# BharatGraph -- Complete Phase Roadmap

All branches merge into `main`. Branch naming: `feature/phase-N-name` or `fix/description`.
Each phase has a GitHub Issue (see `issues/` directory) and a PR description template.

---

## COMPLETED PHASES (1-31)

### Phase 1 -- Data Collection
**Tag:** pre-v1 | 6 scrapers, 3,199+ records, base scraper with rate limiting and retry

### Phase 2 -- Data Processing
**Tag:** pre-v2 | Indian name normalisation, Jaccard entity resolution, parallel pipeline

### Phase 3 -- Graph Database
**Tag:** pre-v3 | Neo4j schema, 7 node types, stable MD5 IDs, 8 Cypher templates

### Phase 4 -- FastAPI Backend
**Tag:** v0.12.0 | FastAPI + Pydantic + Neo4j dependency injection, source citations

### Phase 5 -- Risk Scoring Engine
5-indicator composite score, validate_language() forbidden-word enforcement

### Phase 6 -- Expanded Data Sources (13 scrapers)
ICIJ, Wikidata, OpenSanctions, Lok Sabha, SEBI, Electoral Bonds added

### Phase 7 -- NLP Document Intelligence
spaCy NER, Benford's Law chi-squared, multilingual BERT NER, shadow draft detector

### Phase 8 -- Advanced Graph Analytics
NetworkX betweenness/PageRank/Louvain, circular ownership, ghost company scorer

### Phase 9 -- Eight New Indian Sources (21 total)
NJDG, ED, CVC, NCRB, LGD, IBBI, NGO Darpan, CPPP added with fallback samples

### Phase 10 -- Multi-Investigator AI Engine
**Tag:** v0.10.0 | 12 parallel investigators, SHA-256 report hash, synthesis engine

### Phase 11 -- Multilingual Platform (22 Languages)
All 22 Indian scheduled languages, auto-detection, Helsinki-NLP translation

### Phase 12 -- PDF Dossier Generator
Jinja2 + WeasyPrint, SHA-256 integrity hash, GET /export/pdf/{id}

### Phase 13 -- Production Frontend
Vanilla JS/HTML/CSS, D3.js force graph, 5 views, works offline from file://

### Phase 14 -- Zero Cold-Start Deployment
**Tag:** v0.14.0 | HuggingFace Spaces Docker, service worker cache, GitHub Pages CI/CD

### Phase 15 -- Mathematical Intelligence Engine
**Tag:** v0.15.0 | Spectral
Fiedler value, Fourier FFT, 13th investigator (math)

### Phase 16 -- Evidence Connection Map and Deep Investigation
**Tag:** v0.16.0 | 6-layer recursive investigation, connection mapper, WHY explanations

### Phase 17 -- Security Hardening and Provenance Layer
**Tag:** v0.17.0 | Rate limiter, CSP/HSTS headers, input validator, SHA-256 audit log

### Phase 18 -- Self-Learning System and Case Memory
Schema learner, pattern learner, weight optimiser (+-0.01 per 3 confirmed cases)

### Phase 19 -- Affidavit Wealth Trajectory Engine
**Tag:** v0.19.0 | Kalman filter, 5-election series, 14th investigator (affidavit)

### Phase 20 -- Biography Engine
Chronological timeline, 5 temporal convergence window types, neutral narrative

### Phase 21 -- Benami Entity Detection
5-factor proxy score, thresholds HIGH>=65 MODERATE>=40, 15th investigator

### Phase 22 -- Procurement DNA, Cartel Detection, Full Pipeline
TF-IDF cosine >=0.72, award rotation, co-bidding network, 21 scrapers

### Phase 23 -- Revolving Door and TBML Detection
365-day cooling-off, pre-employment benefit, 2.5-sigma TBML, subcontract loops

### Phase 24 -- Linguistic Fingerprinting
Burrows Delta authorship, template reuse detection, ghost-writing detection

### Phase 25 -- Policy-Benefit Causal Analysis
Granger causality (lags 1-6), transfer entropy, CACA cross-ministry chain

### Phase 26 -- Adversarial Counterevidence
Forced disproof, competing hypotheses, uncertainty propagation

### Phase 27 -- Multi-Agent Debate Engine
7-agent 3-round debate, iMAD hesitation detection, minority dissent preserved

### Phase 28 -- Dark Pattern Detection
PrefixSpan sequential mining, 6 pre-defined high-risk sequences

### Phase 29 -- UX Overhaul and i18n
Evidence panel (4 tabs), D3 graph redesign, 22-language UI, timeline view

### Phase 30 -- Bug Fix Sprint
**Tag:** v0.30.0 | 26 bugs resolved including BUG-1 (search crash), BUG-2 (7 missing loaders)

### Phase 31 -- Runtime Profile and Auto-Scaling
**Tag:** v0.31.0 | Hardware detector,
LOW/MEDIUM/HIGH profiles, GET /runtime endpoint

**Branch:** `feature/phase-31-runtime-profile`

**Files:** config/runtime_profile.py, config/model_selector.py, api/routes/runtime.py

**Tests:** 15 unit tests in tests/test_runtime_profile.py

**Profile assignment:** cpu*2 + ram*2 + gpu*2 + disk + docker + db_local (max 9)

---

## PLANNED PHASES

---

### Phase 32 -- Entity Resolution v2: Canonical Identity Engine
**Branch:** `feature/phase-32-entity-resolution`

**Priority:** CRITICAL -- fixes broken evidence chains across all phases

**Problem:** Jaccard token similarity misses transliteration variants, honorific
variations ("Sh. Ram Kumar" vs "Shri Ramkumar"), and cross-script name forms.
The same person stored under 3+ IDs = broken evidence chains.

**Algorithms:**
- Jaro-Winkler (weight 0.30) -- character-level typo and transliteration
- Jaccard token overlap (weight 0.20) -- word-order variations
- Sentence-transformers cosine (weight 0.35) -- multilingual name variants
- Exact PAN/CIN/GSTIN match (weight 1.0, overrides all) -- deterministic keys

**New files:**
- `processing/entity_resolver_v2.py` -- CanonicalIdentityEngine class
- `processing/canonical_id.py` -- stable SHA-256 ID generation functions
- `processing/alias_graph.py` -- AliasGraph: alias_name -> canonical_id lookup

**Indian name normalisation added:**
- Remove honorifics: Sh., Smt., Dr., Late, Sri, Shri, Er., Adv., Col.
- Normalise suffixes: Private Limited -> Pvt Ltd, LLP, Ltd
- Script-aware: Devanagari -> Latin transliteration for comparison

**Integration:** pipeline.py resolve_dataset() upgraded to use v2 engine

---

### Phase 33 -- Custom Graph Engine: Eliminate Neo4j 50K Limit
**Branch:** `feature/phase-33-custom-graph-engine`

**Priority:** HIGH -- AuraDB free tier caps at 50K nodes / 175K relationships

**Architecture:**
```
graph_engine/
+-- store.py           -- LevelDB key-value backing store
+-- hnsw.py            -- HNSW vector index (M=16, ef=200)
+-- query_planner.py   -- Cypher-to-native query translator
+-- temporal.py        -- Time-weighted edge decay by relationship type
+-- version_control.py -- Git-style diff log for graph mutations
+-- compat_layer.py    -- Translates all existing Cypher to native calls
```

**Temporal edge decay lambdas:**
- court_order: 0.00005 (slowest -- court records are permanent)
- cag_audit: 0.0002
- government_portal: 0.0005
- director_of: 0.0003
- member_of: 0.0005
- news_article: 0.001
- social_media: 0.01 (fastest decay)

**Version control:** Every graph mutation is recorded as a diff with before/after hashes.
Detects when government portals silently modify records post-publication.
Anti-forensics pattern: commit A -> commit B (change) -> commit C (reverts to A) = flag

---

### Phase 34 -- Vector Search and Hybrid Retrieval
**Branch:** `feature/phase-34-vector-search`

**Problem:** Keyword search misses semantically similar documents. Searching
"Maharashtra road contract irregularity" does not find CAG reports about
"highway construction irregularity in Pune" even though they are the same topic.
**Algorithms:**
- FAISS (cpu) or Qdrant for vector index
- BM25 for keyword ranking
- Reciprocal Rank Fusion (k=60): RRF = sum(1 / (60 + rank))
- Query classifier routes to appropriate retrieval strategy

**Query routing:**

| Query type | Keywords | Retrieval mix |
|-----------|---------|--------------|
| factual | who is, what is, when did | BM25 70% + vector 30% |
| relational | connected to, path from | Graph 80% + vector 20% |
| temporal | before, after, election, contract date | Graph 60% + BM25 40% |
| exploratory | similar to, pattern, cluster | Vector 60% + community 40% |

**Embedding model:** paraphrase-multilingual-MiniLM-L12-v2 (covers all 22 languages)

---

### Phase 35 -- Plugin System and YAML Enrichers
**Branch:** `feature/phase-35-plugins`

**Lazy-loading plugin architecture** -- new data sources added by dropping a YAML
file in `enrichers/` with no code changes.

**Plugin registry also covers algorithms** -- new detection algorithms registered
as plugins, enabling Phase 57 A/B testing.

---

### Phase 36 -- Sigma-Style YAML Rule Engine
**Branch:** `feature/phase-36-rule-engine`

**Problem:** Adding a new detection rule requires writing Python + Cypher.
Non-developer investigators cannot contribute detection logic.

**YAML -> Cypher compiler** -- a rule file specifies conditions, thresholds, and
actions. The engine compiles it to Cypher at startup.

**10 built-in rules shipped:**
1. `cartel_rotation.yaml` -- same vendor group rotates wins
2. `electoral_bond_proximity.yaml` -- bond + contract within 12 months (CRITICAL)
3. `family_directorship_web.yaml` -- politician's family = company director
4. `audit_contract_overlap.yaml` -- continued contracts after CAG audit flag
5. `shell_company_age_contract.yaml` -- company < 6 months old + large contract
6. `single_bidder_high_value.yaml` -- single bid above district average
7. `circular_ownership_3node.yaml` -- 3-node corporate ownership cycle
8. `revolving_door_365day.yaml` -- government to private within 1 year
9. `address_cluster_directors.yaml` -- 3+ companies same registered address
10. `pre_election_contract_surge.yaml` -- contract spend spike 90 days before poll

---

### Phase 37 -- Job Queue and Worker Pool
**Branch:** `feature/phase-37-job-queue`

**Redis-backed job queue** with state machine: INIT -> QUEUED -> RUNNING -> DONE

**Algorithm job priorities:**
- Priority 1 (immediate): entity_resolution, neurosymbolic_risk, rule_engine
- Priority 2 (30s): gnn_tbml, election_burst, shap_explanation, graphrag_summary
- Priority 3 (5min): corruption_dna, metapath_walk, community_detection, topic_modeling
- Priority 4 (off-peak): fingerprint_index, gcpal_pretraining, wayback_drift

---

### Phase 38 -- DeepSeek-R1 Chain-of-Thought Reasoning
**Branch:** `feature/phase-38-deepseek-r1`

**Problem:** Current synthesis logic (3+ investigators agreeing = HIGH) is a vote
count, not reasoning. No audit trail of how a conclusion was reached.

**DeepSeek-R1 integration:**
- Receives: graph findings + SHAP explanations + TruthChain evidence IDs
- Generates: step-by-step reasoning chain citing specific evidence node IDs
- Produces: 2 competing hypotheses with scores, then a final verdict
- Verdict levels: CONFIRMED (>=80), PROBABLE (>=50), WEAK (>=20), INSUFFICIENT

**Anti-hallucination enforcement:**
- Every R1 claim must cite a TruthChain node_id (format: [EVIDENCE-XXXX])
- Post-generation validation: regex check for invented node IDs
- Invalid citations are stripped before the report is returned

**Fallback:** When DeepSeek API is unavailable, the existing multi-investigator
synthesis provides the output. R1 augments -- it does not replace.

---

### Phase 38B -- GraphRAG: Graph-Guided LLM Retrieval (NEW)
**Branch:** `feature/phase-38b-graphrag`

**Problem:** R1 cannot answer global questions like "What are the main corruption
themes across all 5,000 CAG audit reports?" Standard RAG retrieves isolated chunks.
**GraphRAG approach:**
1. Run Leiden clustering over all scraped documents and graph nodes
2. For each community > 3 nodes, R1 generates a community summary
3. At query time: embed query -> retrieve top-k community summaries by cosine
4. Feed summaries + relevant subgraph as structured context to R1

**New files:**
- `ai/graphrag/community_indexer.py` -- builds community summaries offline
- `ai/graphrag/graphrag_retriever.py` -- query-time retrieval

**Integration with Phase 38:** R1 receives GraphRAG community summaries instead
of raw graph fragments -- dramatically reduces hallucination.

---

### Phase 39 -- DeepSeek-VL2 Visual Evidence Analysis
**Branch:** `feature/phase-39-deepseek-vl2`

Analyse scanned affidavit PDFs, audit report images, and newspaper clippings.
Signature mismatch detection. Document image authenticity via Shannon entropy.
OCR pipeline for non-digital government documents.

---

### Phase 40 -- DeepSeek-V3 Multilingual Dossier Generation
**Branch:** `feature/phase-40-deepseek-v3`

Generate full investigation reports in all 22 Indian languages.
CONFIRMED/PROBABLE/WEAK/INSUFFICIENT grading on every finding.
Length: 800-1200 words per report. Export to PDF with trilingual header.

---

### Phase 41 -- Legal Intelligence Pipeline
**Branch:** `feature/phase-41-legal`

**IPC Section Classifier:**
- Algorithm: TF-IDF + OneVsRestClassifier(LogisticRegression) -- multi-label
- 8 corruption-relevant IPC sections: 420, 409, 13, 7, 120B, 467, 468, 471
- Keyword fallback when model not trained

**Crime triple extractor:**
- Pattern: Subject -> Action -> Object from legal text
- Store as directed evidence edges: (Company)-[:BRIBED]->(Official)

**Semantic Role Labelling (SRL):**
- ARG0 (agent) -> entity who acted
- ARG2 (recipient) -> entity who benefited
- V (predicate) -> action type: BRIBED, APPROVED, AWARDED

**BK-tree** for out-of-vocabulary legal term repair.
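
The BK-tree repair step can be illustrated with a minimal sketch. It assumes Levenshtein distance as the metric and a small hypothetical vocabulary; the names `BKTree`, `repair`, and `LEGAL_TERMS` are illustrative only and are not taken from the project code.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (term, {distance_to_child: child_node})."""

    def __init__(self) -> None:
        self.root = None

    def add(self, term: str) -> None:
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            d = levenshtein(term, node[0])
            if d == 0:
                return  # term already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (term, {})
                return
            node = child

    def search(self, query: str, max_distance: int) -> list[tuple[int, str]]:
        results = []
        stack = [self.root] if self.root else []
        while stack:
            term, children = stack.pop()
            d = levenshtein(query, term)
            if d <= max_distance:
                results.append((d, term))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_distance, d + max_distance] can contain a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


def repair(term: str, tree: BKTree, max_distance: int = 2) -> str:
    """Return the closest known term, or the input unchanged if none is close."""
    matches = tree.search(term.lower(), max_distance)
    return matches[0][1] if matches else term


# Hypothetical vocabulary, for illustration only.
LEGAL_TERMS = ["misappropriation", "conspiracy", "forgery", "bribery", "cheating"]
tree = BKTree()
for t in LEGAL_TERMS:
    tree.add(t)
```

With this vocabulary, `repair("conspirasy", tree)` resolves to `"conspiracy"`, while a string far from every stored term is returned unchanged. The `max_distance` of 2 is an assumed default, not a value stated in the roadmap.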

---

### Phase 42 -- Forensic Content Intelligence
**Branch:** `feature/phase-42-forensic-content`

**Shannon entropy classifier:**

| Document type | Expected range |
|--------------|----------------|
| government_order | 3.8 -- 5.2 bits |
| cag_report | 4.0 -- 5.4 bits |
| tender_document | 3.5 -- 5.0 bits |
| court_order | 3.9 -- 5.3 bits |

Documents outside expected range flagged as SUSPICIOUS or LIKELY_FABRICATED.

**Perceptual hash (pHash)** for image-based document copy detection.

**PAN/CIN/Aadhaar regex extraction** from document text.

**Lexical diversity score** -- repetitive templates have diversity < 0.3.

---

### Phase 43 -- Pivot Recommendation Engine
**Branch:** `feature/phase-43-pivot`

**Problem:** After finding a suspicious entity, the next best investigation
target is unclear. The pivot engine scores all connected entities.

**6-factor scoring:**

| Factor | Weight | Description |
|--------|--------|-------------|
| pagerank | 0.20 | How central is this entity? |
| evidence_gap | 0.25 | How much do we NOT know? |
| risk_signals | 0.20 | log(risk_signals + 1) |
| connection_strength | 0.15 | Edge weight to current entity |
| temporal_recency | 0.10 | Recently active? |
| unexplored_depth | 0.10 | Unexplored 2-hop nodes |

**Route:** `GET /pivot/{entity_id}?already_investigated=id1,id2`

---

### Phase 44 -- Geospatial Verification via Satellite
**Branch:** `feature/phase-44-satellite`

Sentinel-2 L2A time series for project verification.
NDVI change detection for forest diversion claims.
NDBI (built-up index) for construction completion verification.
SAR (Sentinel-1) for flood infrastructure claims.
Compare contract completion claims vs satellite-observable progress.
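
The NDVI and NDBI checks above reduce to normalised band differences. The sketch below assumes per-pixel reflectance values have already been extracted from Sentinel-2 (NIR = band B8, red = band B4, SWIR = band B11); the function names and the sample pixels are hypothetical, and real use would operate on raster arrays rather than small lists.

```python
def normalised_difference(a: float, b: float) -> float:
    """(a - b) / (a + b); returns 0.0 when the denominator is zero."""
    denominator = a + b
    return 0.0 if denominator == 0 else (a - b) / denominator


def ndvi(nir: float, red: float) -> float:
    """Vegetation index: high over dense vegetation, low over bare ground."""
    return normalised_difference(nir, red)


def ndbi(swir: float, nir: float) -> float:
    """Built-up index: tends to rise when land is converted to construction."""
    return normalised_difference(swir, nir)


def mean_change(before, after, index_fn) -> float:
    """Mean per-pixel change of an index between two acquisitions.

    `before` and `after` are equal-length lists of band tuples in the
    argument order expected by `index_fn`.
    """
    deltas = [index_fn(*a) - index_fn(*b) for b, a in zip(before, after)]
    return sum(deltas) / len(deltas)


# Hypothetical (nir, red) reflectances for the same two pixels on two dates.
before_pixels = [(0.5, 0.1), (0.5, 0.1)]
after_pixels = [(0.3, 0.2), (0.3, 0.2)]
ndvi_drop = mean_change(before_pixels, after_pixels, ndvi)
```

A strongly negative `ndvi_drop` over a parcel would be consistent with vegetation loss, and a positive NDBI change with new construction. Any alert threshold would need calibration against known sites; none is specified in the roadmap.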

---

### Phase 45 -- W3C PROV-DM Provenance and TruthChain
**Branch:** `feature/phase-45-provenance`

**TruthChain algorithm:**
- Each evidence node has: SHA-256 ID, source_type, content_hash, timestamp, status
- Merkle tree over all evidence: root_hash changes if ANY evidence changes
- Temporal decay: weight(E,t) = base_weight * exp(-lambda_type * days)
- Status propagation: MODIFIED evidence propagates DEPENDS_ON_MODIFIED to descendants
- Aggregate confidence = active_weight / total_weight

**Decay rates by source:**
- court_order: 0.0001 (permanent)
- cag_audit: 0.0002
- government_portal: 0.0005
- news_article: 0.001
- social_media: 0.01

**Export:** JSON-LD using W3C PROV-DM ontology + Schema.org

**Blockchain anchor:** Merkle root stored in audit_chain.py (Bitcoin via OpenTimestamps)

---

### Phase 46 -- Source Drift and Historical Record Analysis
**Branch:** `feature/phase-46-source-drift`

**Wayback CDX API** to detect when government records are silently modified.

**7 fault types** (ISWC 2024 taxonomy):
- node_disappearance: entity removed from portal
- edge_rewiring: director change silently backdated
- attribute_drift: contract amount modified post-publication
- cluster_split: formerly linked entities disconnected
- cluster_merge: separate networks joined
- temporal_burst: sudden new relationship creation
- isolation: previously connected entity becomes isolated

**Anti-forensics detection:**
commit A -> commit B (change) -> commit C (reverts) = SUPPRESS_ATTEMPT

---

### Phase 47 -- Predictive Risk Trajectory
**Branch:** `feature/phase-47-predictive`

**ARIMA(2,1,1) risk prediction:**
- Fits on monthly risk score history (min 12 data points)
- Forecasts 6 months ahead with 80% confidence intervals
- Alert when predicted score crosses HIGH threshold

**GCPAL contrastive pre-training for label scarcity:**
- India's 1:707 confirmed-corruption ratio makes traditional supervised ML difficult
- GCPAL mines supervised signals from the unlabelled relationship graph
- Three
augmented views: node feature dropout + edge dropout + KNN view
- NT-Xent contrastive loss (temperature = 0.07)
- Fine-tunes on confirmed cases from case_memory (min 5 needed)

---

### Phase 48 -- Watchlist, Alerts, and ARIMA Prediction
**Branch:** `feature/phase-48-watchlist`

WebSocket push alerts when risk score changes for watched entities.
YAML alert rules (same format as Phase 36).
Webhook support for journalist notification systems.

---

### Phase 49 -- Observability and Reliability
**Branch:** `feature/phase-49-observability`

Prometheus /metrics endpoint. Stale-data alerts when pipeline has not run in >7 days.
Ingestion validator checks all 20 node types have recent data.
/health upgraded to return per-source freshness status.

---

### Phase 50 -- Security v2: RBAC and JWT
**Branch:** `feature/phase-50-security-v2`

Role-based access control: Lead Investigator, Contributor, Reviewer, Observer.
JWT authentication with refresh tokens. DPDP Act compliance (India Data Protection).
Entity-level access control for sensitive investigations.

---

### Phase 51 -- Electoral Bond Causal Graph Engine
**Branch:** `feature/phase-51-electoral-bond-causal`

**Critical missing feature.** The data exists but the causal chain is not mapped.

**Full graph path:**
Corporate donor -> ElectoralBond -> Party -> Ministry -> Policy -> Contract -> Company

**Algorithm:** Granger causality (from Phase 25) + Difference-in-Differences
to establish whether policy changes statistically follow bond purchases.

**New node type:** PolicyChange (date, ministry, beneficiaries)

**New relationship:** FOLLOWED_BOND (lag_days, p_value, granger_f_stat)

**New route:** `GET /electoral-bond/causal/{company_id}`

---

### Phase 52 -- Parliament Performance Analytics
**Branch:** `feature/phase-52-parliament`

**New data sources:** Lok Sabha division votes (loksabha.nic.in/Loksabha/Divisions),
Rajya Sabha Q&A archive, Praja.org legislator data.
**MP accountability score:**
- Attendance rate (0.30 weight)
- Questions asked per session (0.25 weight)
- Vote consistency with party line vs independent votes (0.20 weight)
- Bills sponsored (0.15 weight)
- Starred questions with substantive follow-up (0.10 weight)

**New route:** `GET /parliament/performance/{politician_id}`

**New node type:** DivisionVote, ParliamentSession

**New relationship:** VOTED_IN, ASKED_STARRED_QUESTION

---

### Phase 53 -- Media Ownership Graph
**Branch:** `feature/phase-53-media-ownership`

**New data sources:** MIB media license registry, TRAI spectrum allocations.

**Graph paths:**
- Channel -> Corporate parent -> Promoter -> Political donor
- Channel -> Editorial stance correlation (NLP) -> Political entity

**Editorial bias detection:** NLP sentiment analysis comparing coverage of
political entities across channels with known ownership structures.

**New node types:** MediaChannel, SpectrumLicense, EditorialEntity

**New route:** `GET /media/ownership/{channel_id}`

---

### Phase 54 -- Constituency Development Index
**Branch:** `feature/phase-54-constituency`

**Data sources:** NDAP district SDG scores, MGNREGS employment data,
PM Kisan disbursements, PM Awas completions, Swachh Bharat ODF data.

**Algorithm:** Regression analysis -- does the constituency improve during
the politician's tenure vs comparison period?

**Pre-election spending surge detection:** CUSUM on district spending
in 90 days before election vs annual baseline.

**New route:** `GET /constituency/{id}/development`

**Satellite verification:** Sentinel-2 images corroborate claimed completions.

---

### Phase 55 -- Family Dynasty and Nepotism Graph
**Branch:** `feature/phase-55-dynasty`

**Data source:** FAMILY_OF edges extracted from MyNeta affidavit declarations
("Spouse: X", "Dependent 1: Y"). Already partially available in existing data.
**Dynasty depth score:**
- Count of family members in government positions
- Count of family-controlled companies with government contracts
- Count of elections won by family members across generations
- Geographic concentration (same constituency or district)

**New relationship:** FAMILY_OF (role: spouse/child/sibling/parent)

**New route:** `GET /dynasty/{politician_id}`

---

### Phase 56 -- RTI Intelligence Engine
**Branch:** `feature/phase-56-rti`

**RTI auto-filer:** System detects evidence gaps in any investigation and
drafts the exact RTI application to fill them.

**Gap detection algorithm:**
- For each HIGH-risk finding: check if primary source data is available
- If data missing: identify the correct Public Information Officer
- Generate RTI draft citing the specific provisions (RTI Act 2005, Sections 6-8)

**RTI outcome tracker:** Index filed RTI applications from RTI Online portal.
Map outcomes to graph: PIOs who deny information for high-risk entities = flag.

**New route:** `GET /rti/draft/{entity_id}` (generates RTI text)

**New node type:** RTIApplication, PublicInformationOfficer

---

### Phase 57 -- A/B Algorithm Testing Framework (NEW)
**Branch:** `feature/phase-57-ab-testing`

**Multi-armed bandit (Thompson Sampling) for algorithm selection:**
- Each algorithm arm has Beta(alpha, beta) prior over performance
- alpha = times algorithm was "preferred" by human review
- beta = times algorithm was "not preferred"
- Select arm with highest sampled value at each request

**Use case:** When upgrading from static risk scorer -> ML ensemble -> NeuroSymbolic,
verify the new algorithm actually improves outcomes.

**New route:** `GET /admin/algorithm-performance`

---

### Phase 58 -- Real-Time Stream Processing (NEW)
**Branch:** `feature/phase-58-streaming`

**Problem:** Pipeline runs in batches. Breaking leads appear hours late.

**Redis Streams** (Kafka fallback) for real-time event ingestion.
**CUSUM online anomaly detection** on the stream (no batch needed).
**Sliding window aggregation** for real-time indicator updates.

**Events processed in real-time:**
- new_contract: immediate CUSUM check on contract value
- new_audit_report: check if any tracked entities are mentioned
- new_enforcement_action: update risk scores for named entities
- source_modification: detect when a scraped page changes

---

### Phase 59 -- CorruptionDNA Fingerprint (NEW)
**Branch:** `feature/phase-59-corruption-dna`

**Problem:** Two entities in the same corruption network may have no direct
graph edge -- different states, different directors, but identical patterns.

**512-dim fingerprint = concatenation of:**
- Node2Vec structural embedding (128d)
- TF-IDF document vector (128d)
- Benford's Law digit distribution (9d, padded to 16d)
- Temporal burst vector (64d)
- Linguistic fingerprint -- Burrows Delta (64d)
- Entity type one-hot (16d)
- Risk indicator vector (16d)
- CAG audit TF-IDF (64d)
- Institutional path vector (32d)

**MinHash LSH** for efficient similarity search (cosine > 0.82 = same network).

**New route:** `GET /dna/{entity_id}` and `GET /dna/similar/{entity_id}`

---

### Phase 60 -- ElectionProximityBurst Detector (NEW)
**Branch:** `feature/phase-60-election-burst`

**The only corruption detection algorithm that encodes the Indian electoral
calendar as a statistical regression variable.**

**Algorithm:**
1. Load full Indian electoral calendar (Lok Sabha + 28 state assemblies)
2. ARIMA(2,1,1) on monthly metric aggregates
3. PELT changepoint detection on ARIMA residuals
4. Match changepoints to election proximity (within 180 days)
5. CUSUM control chart with k=0.5, h=5.0
6. Granger causality: does election_proximity_days Granger-cause the metric?

**Output:** burst_score (0-100), election_burst_flags, cusum_alerts,
Granger p-value, interpretation in plain language.
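
Step 5 of the algorithm can be sketched as a two-sided tabular CUSUM. This is a minimal illustration, assuming the series is standardised against an in-control baseline window so that k=0.5 and h=5.0 are expressed in standard deviations; the function name `cusum_alerts`, the `baseline_n` parameter, and the reset-after-alert behaviour are assumptions, not the project's implementation.

```python
from statistics import mean, pstdev


def cusum_alerts(values, baseline_n, k=0.5, h=5.0):
    """Return the indices at which a two-sided CUSUM exceeds the decision limit h.

    The first `baseline_n` observations define the in-control mean and standard
    deviation. k is the allowance (slack) and h the decision interval, both in
    units of the baseline standard deviation.
    """
    baseline = values[:baseline_n]
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance; cannot standardise")

    s_hi = s_lo = 0.0
    alerts = []
    for i, x in enumerate(values):
        z = (x - mu) / sigma
        s_hi = max(0.0, s_hi + z - k)   # accumulates upward shifts
        s_lo = max(0.0, s_lo - z - k)   # accumulates downward shifts
        if s_hi > h or s_lo > h:
            alerts.append(i)
            s_hi = s_lo = 0.0           # restart after signalling (assumed policy)
    return alerts


# Hypothetical monthly contract spend: stable for 8 months, then a surge.
spend = [10, 11, 9, 10, 11, 9, 10, 10, 18, 19, 20, 21]
surge_months = cusum_alerts(spend, baseline_n=8)
```

In the detector described above, the input would be the ARIMA residuals rather than the raw metric, and each alert index would then be compared with the electoral calendar.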
**Integrated as 16th investigator** (temporal, weight 0.10)

---

### Phase 61 -- BenamiGNN: Heterogeneous Graph Neural Network (NEW)
**Branch:** `feature/phase-61-benami-gnn`

**Problem:** 5-factor heuristic misses multi-hop benami: politician's cousin
is director (not the politician), company has legitimate small contracts
before being used for a large fraudulent one.

**H-GNN architecture:**
- 8 relation types: DIRECTOR_OF, WON_CONTRACT, SHARES_ADDRESS, RELATED_TO,
  AWARDED_BY, FAMILY_MEMBER_OF, APPEARS_IN_AUDIT, SANCTIONED_BY
- Layer 0: Per-type linear projection to d=64
- Layer 1: Relation-aware message passing
- Layer 2: Entity-type attention
- Layer 3: Classification head -> benami_score in [0,1]

**Fallback:** Always falls back to existing 5-factor heuristic when:
- PyTorch not installed
- Subgraph has < 5 nodes
- Model not trained yet

**Training:** Fine-tunes on confirmed benami cases from case_memory.

---

### Phase 62 -- CartelDNA Sequential Mining (NEW)
**Branch:** `feature/phase-62-cartel-dna`

**Problem:** Current cartel detector checks single-tender award rotation.
Temporal cartels rotate wins across months and across ministries to avoid
statistical detection within any one ministry.

**CartelDNA = PrefixSpan + HITS + DBSCAN:**
1. PrefixSpan on bid event sequences (company, category, month, rank)
2. Detect alternating rank order patterns (length 2-6, min support 3)
3. HITS on co-bidding network: authority = real winners, hub = fake competitors
4. DBSCAN geographic clustering (epsilon = 50km, min_samples = 3)
5. Cartel confidence = 0.35*pattern + 0.25*alternation + 0.20*geo + 0.20*HITS

**New route:** `GET /cartel/dna/{entity_id}`

---

### Phase 63 -- SHAP and LIME Explainability Layer (NEW)
**Branch:** `feature/phase-63-explainability`

**Problem:** Every risk score has no explanation. Journalists cannot publish
"score: 67" without "why: politician_overlap drove +24 points."
**SHAP TreeExplainer** on the ML ensemble from Phase 19 upgrade:
- Feature contributions for each of the 5 indicators
- Counterfactual: "If contract_concentration were 0, score would be 43"
- Baseline score (expected value)

**LIME** locally linear approximation for non-tree models.

**New fields added to all risk responses:**
- shap_top_drivers: [{feature, shap_value, direction}]
- shap_counterfactual: plain-language minimum change to flip risk level
- shap_baseline: expected value before any features

**New route:** `GET /risk/explain/{entity_id}`

---

### Phase 64 -- Cross-Language Entity Disambiguation (NEW)
**Branch:** `feature/phase-64-cross-lingual`

**Problem:** "Modi" / "modi" / "modii" appear in 22 scripts -- potentially
stored as separate graph nodes. Cross-lingual entity linker maps all variants
to a single canonical node using Wikidata Q-numbers.

**XLM-RoBERTa** zero-shot entity linking.

**Wikidata SPARQL** for canonical Q-number lookup (existing scraper extended).

**Transliteration confidence score** per script pair.

---

### Phase 65 -- Knowledge Graph Completion (Missing Link Prediction) (NEW)
**Branch:** `feature/phase-65-kg-completion`

**TransE** link prediction: h + r = t in d-dimensional space.
Missing edge score: ||h + r - t|| (lower = more probable edge).

**Use cases:**
- (Politician, DIRECTOR_OF, ?) -- suggest companies likely controlled
- (?, RELATED_TO, KnownShellCompany) -- find hidden associates
- (Company, WON_CONTRACT, ?) -- predict future contract awards

**Output:** List of probable missing edges with confidence scores,
presented as "Suggested next investigation targets."

---

### Phase 66 -- LAS-GNN Temporal TBML Detection (NEW)
**Branch:** `feature/phase-66-las-gnn`

**Problem:** Current TBML detector uses threshold rules. Temporal money
laundering (pre-election scatter-gather, below-threshold smurfing) is
invisible to structural analysis.

**LAS-GNN:** LSTM aggregator on directed transaction graphs.
Learns sequential order of edges imposed by timestamps.
Detects motifs: scatter-gather, fan-in/fan-out, layering, pre-election burst.

**Indian-specific motifs:**
- Pre-election scatter: funds split to many accounts < 6 months before election
- Post-contract layering: payment -> N shell companies -> reconsolidated
- Smurfing below threshold: many transactions < Rs 2 lakh (PMLA threshold)
- Circular director rotation: A appoints X -> X at B -> B pays A

---

### Phase 67 -- NeuroSymbolic Risk Reasoning (NEW)
**Branch:** `feature/phase-67-neurosymbolic`

**Fuses three reasoning modes into one coherent system:**

Stage 1 -- DEDUCTIVE (Phase 36 YAML rules):
- Rules fire with certainty = 1.0 (logical certainty)
- CRITICAL rule match -> score forced >= 75

Stage 2 -- INDUCTIVE (Phase 19 ML ensemble + SHAP):
- GNN/ML soft score in [0,1]
- SHAP feature contributions

Stage 3 -- ABDUCTIVE (Phase 38 DeepSeek-R1):
- Chain-of-thought synthesis citing TruthChain evidence IDs
- 2 competing hypotheses with scores

Stage 4 -- Integration:
- final_score = 0.40*rule_certainty + 0.35*gnn_score + 0.25*r1_confidence
- Adversarial override: if adversarial engine finds contradicting evidence -> cap at PROBABLE

---

### Phase 68 -- InstitutionMetapath2Vec Embeddings (NEW)
**Branch:** `feature/phase-68-metapath`

**5 Indian-specific metapaths** for structured random walks:
1. politician_enrichment: Politician-DIRECTOR_OF-Company-WON_CONTRACT-Contract
2. circular_enrichment: Politician-MEMBER_OF-Party-CONTROLS-Ministry-...-DIRECTOR_OF-Politician
3. audit_flag_circular: Company-WON_CONTRACT-Contract-MENTIONED_IN-AuditReport-AUDITS-Ministry
4. shell_address_cluster: Director-DIRECTOR_OF-Company-SHARES_ADDRESS-Company
5. constituency_benefit: Politician-REPRESENTS-Constituency-LOCATED_IN-District-HAS_PROJECT-Contract

**128-dim entity embeddings** trained via Word2Vec skip-gram on guided walks.
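
A metapath-guided walk differs from an ordinary random walk only in that each step may follow one prescribed relationship type. The sketch below shows this for the politician_enrichment metapath on a hypothetical toy graph; `GRAPH`, `metapath_walk`, and `generate_walks` are illustrative names, and the adjacency-list layout is an assumption rather than the project's storage format.

```python
import random

# Hypothetical toy graph: node id -> list of (relationship, neighbour id).
GRAPH = {
    "politician:p1": [
        ("DIRECTOR_OF", "company:c1"),
        ("DIRECTOR_OF", "company:c2"),
        ("MEMBER_OF", "party:x"),
    ],
    "company:c1": [("WON_CONTRACT", "contract:k1"), ("SHARES_ADDRESS", "company:c2")],
    "company:c2": [("WON_CONTRACT", "contract:k2")],
}

# Politician-DIRECTOR_OF-Company-WON_CONTRACT-Contract
POLITICIAN_ENRICHMENT = ["DIRECTOR_OF", "WON_CONTRACT"]


def metapath_walk(graph, start, relations, rng):
    """Walk from `start`, following exactly the given relationship sequence."""
    walk = [start]
    for relation in relations:
        candidates = [dst for rel, dst in graph.get(walk[-1], []) if rel == relation]
        if not candidates:
            break  # dead end: the metapath cannot be completed from here
        walk.append(rng.choice(candidates))
    return walk


def generate_walks(graph, starts, relations, walks_per_node=10, seed=0):
    """Collect complete metapath walks to use as skip-gram training sentences."""
    rng = random.Random(seed)
    walks = []
    for start in starts:
        for _ in range(walks_per_node):
            walk = metapath_walk(graph, start, relations, rng)
            if len(walk) == len(relations) + 1:  # keep only complete walks
                walks.append(walk)
    return walks


walks = generate_walks(GRAPH, ["politician:p1"], POLITICIAN_ENRICHMENT)
```

Each walk is a list of node IDs, so the collection can be treated as a corpus of sentences for a skip-gram trainer. Edges that do not match the metapath, such as MEMBER_OF here, are never traversed.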
**find_similar_by_metapath()** finds entities with the same institutional
role across different states -- invisible to structural graph analysis.

---

### Phase 69 -- Geospatial Risk Clustering (NEW)
**Branch:** `feature/phase-69-geospatial`

**Moran's I** spatial autocorrelation on district-level risk scores.
I > 0 = spatial corruption hotspots cluster together.

**LISA** (Local Indicators of Spatial Association):
- High-High cluster: high-risk district surrounded by high-risk districts
- Low-High outlier: low-risk district in high-risk region (potential evasion)
- High-Low outlier: targeted corruption in otherwise clean district

**Output:** District-level choropleth with cluster classification.

**New route:** `GET /geospatial/risk-clusters`

---

### Phase 70 -- Dynamic Knowledge Graph Anomaly Detection (NEW)
**Branch:** `feature/phase-70-dynamic-kg`

Continuously monitors graph for unexpected structural changes.
7 fault types (ISWC 2024): node_disappearance, edge_rewiring, attribute_drift,
cluster_split, cluster_merge, temporal_burst, isolation.

**Contextual anomaly detection:** entity that was HIGH-risk 3 months ago
is now suddenly LOW-risk = possible evidence suppression.

---

### Phase 71 -- GCPAL Contrastive Pre-Training (NEW)
**Branch:** `feature/phase-71-gcpal`

**Label scarcity problem:** India has very few confirmed corruption cases
relative to the total number of entities (estimated 1:707 ratio). Standard
supervised ML cannot train on this imbalance.

**GCPAL solution:** NT-Xent contrastive loss on 3 augmented views:
- View 1: node feature dropout (20%)
- View 2: edge dropout (20%)
- View 3: KNN implicit interactions (k=5)

Pre-trains on unlabelled graph. Fine-tunes on case_memory confirmed cases.
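
The NT-Xent loss itself is compact enough to show directly. The sketch below is a dependency-free illustration for two views, using the temperature of 0.07 given in Phase 47; how the three views are combined (for example by averaging the loss over view pairs) is not stated in the roadmap and is left as an assumption. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.07):
    """Mean NT-Xent loss; view_a[i] and view_b[i] embed the same node."""
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same node
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        positive_logit = cosine(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp over every candidate except i itself.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)


# Hypothetical 2-d embeddings of two nodes under two augmentations.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each node stays close to itself
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # views swapped between nodes

low_loss = nt_xent(view_a, aligned)
high_loss = nt_xent(view_a, misaligned)
```

The loss is small when the two views of each node are closer to each other than to any other node, and large otherwise, which is the signal that lets pre-training proceed without labels.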

---

### Phase 72 -- Automated Source Credibility Scoring (NEW)
**Branch:** `feature/phase-72-source-credibility`

Bayesian credibility model per source:
- institutional_authority: government > NGO > news > social
- historical_accuracy: confirmed vs denied past claims
- methodology_transparency: does source explain collection method?
- timeliness: freshness decay
- cross_source_corroboration: independent corroboration count

Bayesian update after each confirmed/denied case.

---

### Phase 73 -- Investigative RAG Over Case Memory (NEW)
**Branch:** `feature/phase-73-rag-cases`

**RAG over all past investigation reports** in case_memory.

Query: "Past investigations involving electoral bonds and road contracts"
-> Dense retrieval -> Top-k case summaries as context
-> DeepSeek-R1 synthesizes commonalities and suggests strategy.

---

### Phase 74 -- Continuous Model Drift Detection (NEW)
**Branch:** `feature/phase-74-drift`

**Population Stability Index (PSI):**
- PSI < 0.10: stable
- PSI 0.10-0.25: monitor closely
- PSI > 0.25: retrain required

**ADWIN (Adaptive Windowing):** streaming concept drift detection.
Auto-triggers GCPAL retraining job when drift detected.

---

### Phase 75 -- Ethics and Bias Audit System (NEW)
**Branch:** `feature/phase-75-ethics`

**Fairness metrics:**
- Demographic parity: P(HIGH_RISK | party=A) approx= P(HIGH_RISK | party=B)
- Equal opportunity: TPR equal across entity types
- Predictive parity: PPV equal across geographic regions

**Bias detection:** chi-squared test, disparate impact ratio, SHAP fairness.
**Mitigation:** Reweighing, adversarial debiasing, calibration.
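
The demographic parity check and the disparate impact ratio can be computed from scored entities alone. The sketch below is illustrative: the record layout (`party`, `risk_level`), the function names, and the sample data are hypothetical, and the ratio is defined here as the lower HIGH-risk rate divided by the higher one, which is one common convention.

```python
def high_risk_rate(records, group_key, group):
    """Share of a group's entities that were scored HIGH."""
    members = [r for r in records if r[group_key] == group]
    if not members:
        raise ValueError(f"no records for {group_key}={group!r}")
    return sum(r["risk_level"] == "HIGH" for r in members) / len(members)


def demographic_parity_gap(records, group_key, group_a, group_b):
    """Absolute difference in HIGH-risk rates; 0 means exact parity."""
    rate_a = high_risk_rate(records, group_key, group_a)
    rate_b = high_risk_rate(records, group_key, group_b)
    return abs(rate_a - rate_b)


def disparate_impact_ratio(records, group_key, group_a, group_b):
    """Lower HIGH-risk rate divided by the higher one; 1.0 means parity."""
    rate_a = high_risk_rate(records, group_key, group_a)
    rate_b = high_risk_rate(records, group_key, group_b)
    higher = max(rate_a, rate_b)
    return 1.0 if higher == 0 else min(rate_a, rate_b) / higher


# Hypothetical scored entities, for illustration only.
records = [
    {"party": "A", "risk_level": "HIGH"},
    {"party": "A", "risk_level": "HIGH"},
    {"party": "A", "risk_level": "LOW"},
    {"party": "A", "risk_level": "MODERATE"},
    {"party": "B", "risk_level": "HIGH"},
    {"party": "B", "risk_level": "LOW"},
    {"party": "B", "risk_level": "LOW"},
    {"party": "B", "risk_level": "MODERATE"},
]

gap = demographic_parity_gap(records, "party", "A", "B")
ratio = disparate_impact_ratio(records, "party", "A", "B")
```

Here party A has a HIGH-risk rate of 0.5 and party B of 0.25, giving a gap of 0.25 and a ratio of 0.5. A gap alone does not prove bias, since base rates may genuinely differ, which is why the audit pairs it with the other metrics listed above.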
**New route:** `GET /admin/bias-audit`

---

## BRANCH WORKFLOW

```bash
# Before each new phase:
git checkout main && git pull origin main
git checkout -b feature/phase-N-name

# After all commits:
git push origin feature/phase-N-name
# Open PR on GitHub -> merge -> pull main -> tag

# Tag every completed phase (matches the v0.N.0 tags used in this roadmap):
git tag -a v0.N.0 -m "Phase N: description"
git push origin v0.N.0

# Deploy to HuggingFace after every merge:
git push hf main --force

# Reseed after every deploy:
curl -X POST https://abinazebinoly-bharatgraph.hf.space/admin/seed
```

---

## VERSION HISTORY

| Version | Phase | Key addition |
|---------|-------|--------------|
| v0.30.0 | 30 | Bug fix sprint -- 26 bugs resolved |
| v0.31.0 | 31 | Runtime profile auto-scaling |
| v0.32.0 | 32 | Entity resolution v2 (planned) |
| v0.33.0 | 33 | Custom graph engine (planned) |
| v0.40.0 | 40 | DeepSeek-V3 multilingual reports (planned) |
| v0.50.0 | 50 | Security v2 RBAC (planned) |
| v1.0.0 | 75 | Full production launch (planned) |

---

## Developed by Abinaze Binoy