
BharatGraph -- Complete Phase Roadmap

All branches merge into main. Branch naming: feature/phase-N-name or fix/description. Each phase has a GitHub Issue (see issues/ directory) and a PR description template.


COMPLETED PHASES (1-31)

Phase 1 -- Data Collection

Tag: pre-v1 | 6 scrapers, 3,199+ records, base scraper with rate limiting and retry

Phase 2 -- Data Processing

Tag: pre-v2 | Indian name normalisation, Jaccard entity resolution, parallel pipeline

Phase 3 -- Graph Database

Tag: pre-v3 | Neo4j schema, 7 node types, stable MD5 IDs, 8 Cypher templates

Phase 4 -- FastAPI Backend

Tag: v0.12.0 | FastAPI + Pydantic + Neo4j dependency injection, source citations

Phase 5 -- Risk Scoring Engine

5-indicator composite score, validate_language() forbidden-word enforcement

Phase 6 -- Expanded Data Sources (13 scrapers)

ICIJ, Wikidata, OpenSanctions, Lok Sabha, SEBI, Electoral Bonds added

Phase 7 -- NLP Document Intelligence

spaCy NER, Benford Law chi-squared, multilingual BERT NER, shadow draft detector

Phase 8 -- Advanced Graph Analytics

NetworkX betweenness/PageRank/Louvain, circular ownership, ghost company scorer

Phase 9 -- Eight New Indian Sources (21 total)

NJDG, ED, CVC, NCRB, LGD, IBBI, NGO Darpan, CPPP added with fallback samples

Phase 10 -- Multi-Investigator AI Engine

Tag: v0.10.0 | 12 parallel investigators, SHA-256 report hash, synthesis engine

Phase 11 -- Multilingual Platform (22 Languages)

All 22 Indian scheduled languages, auto-detection, Helsinki-NLP translation

Phase 12 -- PDF Dossier Generator

Jinja2 + WeasyPrint, SHA-256 integrity hash, GET /export/pdf/{id}

Phase 13 -- Production Frontend

Vanilla JS/HTML/CSS, D3.js force graph, 5 views, works offline from file://

Phase 14 -- Zero Cold-Start Deployment

Tag: v0.14.0 | HuggingFace Spaces Docker, service worker cache, GitHub Pages CI/CD

Phase 15 -- Mathematical Intelligence Engine

Tag: v0.15.0 | Spectral Fiedler value, Fourier FFT, 13th investigator (math)

Phase 16 -- Evidence Connection Map and Deep Investigation

Tag: v0.16.0 | 6-layer recursive investigation, connection mapper, WHY explanations

Phase 17 -- Security Hardening and Provenance Layer

Tag: v0.17.0 | Rate limiter, CSP/HSTS headers, input validator, SHA-256 audit log

Phase 18 -- Self-Learning System and Case Memory

Schema learner, pattern learner, weight optimiser (+-0.01 per 3 confirmed cases)

Phase 19 -- Affidavit Wealth Trajectory Engine

Tag: v0.19.0 | Kalman filter, 5-election series, 14th investigator (affidavit)

Phase 20 -- Biography Engine

Chronological timeline, 5 temporal convergence window types, neutral narrative

Phase 21 -- Benami Entity Detection

5-factor proxy score, thresholds HIGH>=65 MODERATE>=40, 15th investigator

Phase 22 -- Procurement DNA, Cartel Detection, Full Pipeline

TF-IDF cosine >=0.72, award rotation, co-bidding network, 21 scrapers

Phase 23 -- Revolving Door and TBML Detection

365-day cooling-off, pre-employment benefit, 2.5-sigma TBML, subcontract loops

Phase 24 -- Linguistic Fingerprinting

Burrows Delta authorship, template reuse detection, ghost-writing detection

Phase 25 -- Policy-Benefit Causal Analysis

Granger causality (lags 1-6), transfer entropy, CACA cross-ministry chain

Phase 26 -- Adversarial Counterevidence

Forced disproof, competing hypotheses, uncertainty propagation

Phase 27 -- Multi-Agent Debate Engine

7-agent 3-round debate, iMAD hesitation detection, minority dissent preserved

Phase 28 -- Dark Pattern Detection

PrefixSpan sequential mining, 6 pre-defined high-risk sequences

Phase 29 -- UX Overhaul and i18n

Evidence panel (4 tabs), D3 graph redesign, 22-language UI, timeline view

Phase 30 -- Bug Fix Sprint

Tag: v0.30.0 | 26 bugs resolved including BUG-1 (search crash), BUG-2 (7 missing loaders)

Phase 31 -- Runtime Profile and Auto-Scaling

Tag: v0.31.0 | Hardware detector, LOW/MEDIUM/HIGH profiles, GET /runtime endpoint
Branch: feature/phase-31-runtime-profile
Files: config/runtime_profile.py, config/model_selector.py, api/routes/runtime.py
Tests: 15 unit tests in tests/test_runtime_profile.py
Profile assignment: cpu*2 + ram*2 + gpu*2 + disk + docker + db_local (max 9)


PLANNED PHASES


Phase 32 -- Entity Resolution v2: Canonical Identity Engine

Branch: feature/phase-32-entity-resolution
Priority: CRITICAL -- fixes broken evidence chains across all phases

Problem: Jaccard token similarity misses transliteration variants, honorific variations ("Sh. Ram Kumar" vs "Shri Ramkumar"), and cross-script name forms. The same person stored under 3+ IDs = broken evidence chains.

Algorithms:

  • Jaro-Winkler (weight 0.30) -- character-level typo and transliteration
  • Jaccard token overlap (weight 0.20) -- word-order variations
  • Sentence-transformers cosine (weight 0.35) -- multilingual name variants
  • Exact PAN/CIN/GSTIN match (weight 1.0, overrides all) -- deterministic keys
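A minimal sketch of how the weighted blend and the deterministic-key override could fit together. The function names are hypothetical, the Jaro-Winkler and embedding similarities are assumed to be computed elsewhere and passed in, and because the three listed weights sum to 0.85 the sketch renormalises them, which is an assumption rather than something the roadmap specifies.

```python
from typing import Optional

# Weights from the list above. They sum to 0.85, so the blend below divides
# by their total (assumed behaviour, not stated in the roadmap).
WEIGHTS = {"jaro_winkler": 0.30, "jaccard": 0.20, "embedding_cosine": 0.35}


def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased whitespace tokens (word-order invariant)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match_score(similarities: dict[str, float],
                key_a: Optional[str] = None,
                key_b: Optional[str] = None) -> float:
    """Blend per-algorithm similarities in [0, 1].

    An exact PAN/CIN/GSTIN match (key_a == key_b) overrides everything else.
    """
    if key_a and key_b and key_a.strip().upper() == key_b.strip().upper():
        return 1.0
    total = sum(WEIGHTS.values())
    blended = sum(WEIGHTS[name] * similarities.get(name, 0.0) for name in WEIGHTS)
    return blended / total
```

For example, `match_score({"jaccard": jaccard_tokens("Ram Kumar", "Kumar Ram")})` scores a pair on token overlap alone, while passing two identical PAN strings returns 1.0 regardless of the name similarities.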

New files:

  • processing/entity_resolver_v2.py -- CanonicalIdentityEngine class
  • processing/canonical_id.py -- stable SHA-256 ID generation functions
  • processing/alias_graph.py -- AliasGraph: alias_name -> canonical_id lookup

Indian name normalisation added:

  • Remove honorifics: Sh., Smt., Dr., Late, Sri, Shri, Er., Adv., Col.
  • Normalise suffixes: Private Limited -> Pvt Ltd, LLP, Ltd
  • Script-aware: Devanagari -> Latin transliteration for comparison
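The honorific and suffix rules above can be sketched with two regular expressions. This is an illustration only: `normalise_name` is a hypothetical helper, it handles just the "Private Limited" suffix, and it leaves out the Devanagari transliteration step entirely.

```python
import re

HONORIFICS = ["Sh", "Smt", "Dr", "Late", "Sri", "Shri", "Er", "Adv", "Col"]

# Match an honorific as a whole word, with an optional trailing dot.
_HONORIFIC_RE = re.compile(
    r"\b(?:" + "|".join(HONORIFICS) + r")\b\.?\s*", re.IGNORECASE
)
_PRIVATE_LIMITED_RE = re.compile(r"\bprivate\s+limited\b", re.IGNORECASE)


def normalise_name(raw: str) -> str:
    """Strip honorifics, shorten 'Private Limited', collapse whitespace."""
    name = _HONORIFIC_RE.sub("", raw)
    name = _PRIVATE_LIMITED_RE.sub("Pvt Ltd", name)
    return " ".join(name.split())
```

Whole-word matching keeps "Shri" from being clipped to "ri", but it would also strip a legitimate leading word such as "Sri" in a company name, so a production version would likely need an entity-type check first.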

Integration: pipeline.py resolve_dataset() upgraded to use v2 engine


Phase 33 -- Custom Graph Engine: Eliminate Neo4j 50K Limit

Branch: feature/phase-33-custom-graph-engine
Priority: HIGH -- AuraDB free tier caps at 50K nodes / 175K relationships

Architecture:

graph_engine/
+-- store.py          -- LevelDB key-value backing store
+-- hnsw.py           -- HNSW vector index (M=16, ef=200)
+-- query_planner.py  -- Cypher-to-native query translator
+-- temporal.py       -- Time-weighted edge decay by relationship type
+-- version_control.py -- Git-style diff log for graph mutations
+-- compat_layer.py   -- Translates all existing Cypher to native calls

Temporal edge decay lambdas:

  • court_order: 0.00005 (slowest -- court records are permanent)
  • cag_audit: 0.0002
  • government_portal: 0.0005
  • director_of: 0.0003
  • member_of: 0.0005
  • news_article: 0.001
  • social_media: 0.01 (fastest decay)

Version control: Every graph mutation is recorded as a diff with before/after hashes. Detects when government portals silently modify records post-publication.
Anti-forensics pattern: commit A -> commit B (change) -> commit C (reverts to A) = flag
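The A -> B -> A pattern can be checked directly on the sequence of content hashes in the diff log. The sketch below assumes each commit stores a SHA-256 hash of the record content; `content_hash` and `find_reverts` are hypothetical names.

```python
import hashlib


def content_hash(record: str) -> str:
    """SHA-256 of a record's serialised content."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def find_reverts(hashes: list[str]) -> list[int]:
    """Return indices i where commit i restores the hash of commit i-2
    after an intervening change (A -> B -> A)."""
    flagged = []
    for i in range(2, len(hashes)):
        if hashes[i] == hashes[i - 2] and hashes[i] != hashes[i - 1]:
            flagged.append(i)
    return flagged
```

This only catches reverts across exactly one intermediate commit; detecting a return to any earlier state would need a lookup of all previously seen hashes.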


Phase 34 -- Vector Search and Hybrid Retrieval

Branch: feature/phase-34-vector-search

Problem: Keyword search misses semantically similar documents. Searching "Maharashtra road contract irregularity" does not find CAG reports about "highway construction irregularity in Pune" even though they are the same topic.

Algorithms:

  • FAISS (cpu) or Qdrant for vector index
  • BM25 for keyword ranking
  • Reciprocal Rank Fusion (k=60): RRF = sum(1 / (60 + rank))
  • Query classifier routes to appropriate retrieval strategy
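Reciprocal Rank Fusion with k=60 can be sketched in a few lines. The function name and the tie-break on document ID are assumptions; the inputs are ranked lists of document IDs, for example one from BM25 and one from the vector index.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document that appears in only one list still receives a score, so RRF rewards agreement between retrievers without discarding single-source hits.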

Query routing:

| Query type | Keywords | Retrieval mix |
|---|---|---|
| factual | who is, what is, when did | BM25 70% + vector 30% |
| relational | connected to, path from | Graph 80% + vector 20% |
| temporal | before, after, election, contract date | Graph 60% + BM25 40% |
| exploratory | similar to, pattern, cluster | Vector 60% + community 40% |
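A keyword-based classifier is one simple way to implement this routing. The sketch below is an assumption about how it might work: it uses plain substring matching, takes the first matching row, and falls back to the exploratory mix when nothing matches, none of which is specified in the roadmap.

```python
# (query type, trigger keywords, retrieval mix) in the order of the routing table.
ROUTES = [
    ("factual", ("who is", "what is", "when did"), {"bm25": 0.7, "vector": 0.3}),
    ("relational", ("connected to", "path from"), {"graph": 0.8, "vector": 0.2}),
    ("temporal", ("before", "after", "election", "contract date"), {"graph": 0.6, "bm25": 0.4}),
    ("exploratory", ("similar to", "pattern", "cluster"), {"vector": 0.6, "community": 0.4}),
]


def classify_query(query: str) -> tuple[str, dict[str, float]]:
    """Return (query_type, retrieval_mix) for the first row whose keyword matches."""
    q = query.lower()
    for query_type, keywords, mix in ROUTES:
        if any(keyword in q for keyword in keywords):
            return query_type, mix
    # Assumed default when no keyword matches.
    return ROUTES[-1][0], ROUTES[-1][2]
```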

Embedding model: paraphrase-multilingual-MiniLM-L12-v2 (covers all 22 languages)


Phase 35 -- Plugin System and YAML Enrichers

Branch: feature/phase-35-plugins

Lazy-loading plugin architecture -- new data sources added by dropping a YAML file in enrichers/ with no code changes.

Plugin registry also covers algorithms -- new detection algorithms registered as plugins, enabling Phase 57 A/B testing.


Phase 36 -- Sigma-Style YAML Rule Engine

Branch: feature/phase-36-rule-engine

Problem: Adding a new detection rule requires writing Python + Cypher. Non-developer investigators cannot contribute detection logic.

YAML -> Cypher compiler -- a rule file specifies conditions, thresholds, and actions. The engine compiles it to Cypher at startup.

10 built-in rules shipped:

  1. cartel_rotation.yaml -- same vendor group rotates wins
  2. electoral_bond_proximity.yaml -- bond + contract within 12 months (CRITICAL)
  3. family_directorship_web.yaml -- politician's family = company director
  4. audit_contract_overlap.yaml -- continued contracts after CAG audit flag
  5. shell_company_age_contract.yaml -- company < 6 months old + large contract
  6. single_bidder_high_value.yaml -- single bid above district average
  7. circular_ownership_3node.yaml -- 3-node corporate ownership cycle
  8. revolving_door_365day.yaml -- government to private within 1 year
  9. address_cluster_directors.yaml -- 3+ companies same registered address
  10. pre_election_contract_surge.yaml -- contract spend spike 90 days before poll

Phase 37 -- Job Queue and Worker Pool

Branch: feature/phase-37-job-queue

Redis-backed job queue with state machine: INIT -> QUEUED -> RUNNING -> DONE

Algorithm job priorities:

  • Priority 1 (immediate): entity_resolution, neurosymbolic_risk, rule_engine
  • Priority 2 (30s): gnn_tbml, election_burst, shap_explanation, graphrag_summary
  • Priority 3 (5min): corruption_dna, metapath_walk, community_detection, topic_modeling
  • Priority 4 (off-peak): fingerprint_index, gcpal_pretraining, wayback_drift
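The state machine and priority tiers can be illustrated with an in-memory heap standing in for Redis. `JobQueue`, `advance`, and the `TIERS` layout are hypothetical; the roadmap lists no failure state, so none is modelled here.

```python
import heapq
import itertools
from enum import Enum


class JobState(Enum):
    INIT = "INIT"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    DONE = "DONE"


_NEXT = {JobState.INIT: JobState.QUEUED,
         JobState.QUEUED: JobState.RUNNING,
         JobState.RUNNING: JobState.DONE}

TIERS = {
    1: ["entity_resolution", "neurosymbolic_risk", "rule_engine"],
    2: ["gnn_tbml", "election_burst", "shap_explanation", "graphrag_summary"],
    3: ["corruption_dna", "metapath_walk", "community_detection", "topic_modeling"],
    4: ["fingerprint_index", "gcpal_pretraining", "wayback_drift"],
}
PRIORITY = {name: tier for tier, names in TIERS.items() for name in names}


def advance(state: JobState) -> JobState:
    """Move one step along INIT -> QUEUED -> RUNNING -> DONE."""
    if state not in _NEXT:
        raise ValueError(f"{state.value} is a terminal state")
    return _NEXT[state]


class JobQueue:
    """In-memory stand-in for the Redis-backed queue."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # keeps FIFO order within a tier
        self.states: dict[str, JobState] = {}

    def submit(self, job_id: str, algorithm: str) -> None:
        self.states[job_id] = advance(JobState.INIT)
        heapq.heappush(self._heap, (PRIORITY[algorithm], next(self._counter), job_id))

    def pop(self) -> str:
        _, _, job_id = heapq.heappop(self._heap)
        self.states[job_id] = advance(self.states[job_id])
        return job_id

    def finish(self, job_id: str) -> None:
        if self.states[job_id] is not JobState.RUNNING:
            raise ValueError("only a RUNNING job can finish")
        self.states[job_id] = advance(self.states[job_id])
```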

Phase 38 -- DeepSeek-R1 Chain-of-Thought Reasoning

Branch: feature/phase-38-deepseek-r1

Problem: Current synthesis logic (3+ investigators agreeing = HIGH) is a vote count, not reasoning. No audit trail of how a conclusion was reached.

DeepSeek-R1 integration:

  • Receives: graph findings + SHAP explanations + TruthChain evidence IDs
  • Generates: step-by-step reasoning chain citing specific evidence node IDs
  • Produces: 2 competing hypotheses with scores, then a final verdict
  • Verdict levels: CONFIRMED (>=80), PROBABLE (>=50), WEAK (>=20), INSUFFICIENT

Anti-hallucination enforcement:

  • Every R1 claim must cite a TruthChain node_id (format: [EVIDENCE-XXXX])
  • Post-generation validation: regex check for invented node IDs
  • Invalid citations are stripped before the report is returned
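The post-generation check can be sketched as a regex pass over the generated text. The ID format is an assumption here (alphanumeric after `EVIDENCE-`), as are the function name and the choice to return the removed IDs for logging.

```python
import re

# Assumed citation shape: [EVIDENCE-<alphanumeric id>], with any leading
# whitespace captured so a stripped citation does not leave a stray gap.
_CITATION_RE = re.compile(r"\s*\[EVIDENCE-([A-Za-z0-9]+)\]")


def strip_invalid_citations(text: str, known_ids: set[str]) -> tuple[str, list[str]]:
    """Remove citations whose ID is not in known_ids; return (clean_text, removed_ids)."""
    removed: list[str] = []

    def _replace(match: re.Match) -> str:
        if match.group(1) in known_ids:
            return match.group(0)  # keep valid citation unchanged
        removed.append(match.group(1))
        return ""

    return _CITATION_RE.sub(_replace, text), removed
```

Stripping the citation leaves the claim itself in place, so a stricter variant might drop the whole sentence when its only citation turns out to be invalid.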

Fallback: When DeepSeek API is unavailable, the existing multi-investigator synthesis provides the output. R1 augments -- it does not replace.


Phase 38B -- GraphRAG: Graph-Guided LLM Retrieval (NEW)

Branch: feature/phase-38b-graphrag

Problem: R1 cannot answer global questions like "What are the main corruption themes across all 5,000 CAG audit reports?" Standard RAG retrieves isolated chunks.

GraphRAG approach:

  1. Run Leiden clustering over all scraped documents and graph nodes
  2. For each community > 3 nodes, R1 generates a community summary
  3. At query time: embed query -> retrieve top-k community summaries by cosine
  4. Feed summaries + relevant subgraph as structured context to R1

New files:

  • ai/graphrag/community_indexer.py -- builds community summaries offline
  • ai/graphrag/graphrag_retriever.py -- query-time retrieval

Integration with Phase 38: R1 receives GraphRAG community summaries instead of raw graph fragments, which is intended to reduce hallucination by grounding answers in pre-validated context.


Phase 39 -- DeepSeek-VL2 Visual Evidence Analysis

Branch: feature/phase-39-deepseek-vl2

Analyse scanned affidavit PDFs, audit report images, and newspaper clippings. Signature mismatch detection. Document image authenticity via Shannon entropy. OCR pipeline for non-digital government documents.


Phase 40 -- DeepSeek-V3 Multilingual Dossier Generation

Branch: feature/phase-40-deepseek-v3

Generate full investigation reports in all 22 Indian languages. CONFIRMED/PROBABLE/WEAK/INSUFFICIENT grading on every finding. Length: 800-1200 words per report. Export to PDF with trilingual header.


Phase 41 -- Legal Intelligence Pipeline

Branch: feature/phase-41-legal

IPC Section Classifier:

  • Algorithm: TF-IDF + OneVsRestClassifier(LogisticRegression) -- multi-label
  • 8 corruption-relevant sections: IPC 420, 409, 120B, 467, 468, 471 and Prevention of Corruption Act 7, 13
  • Keyword fallback when model not trained

Crime triple extractor:

  • Pattern: Subject -> Action -> Object from legal text
  • Store as directed evidence edges: (Company)-[:BRIBED]->(Official)

Semantic Role Labelling (SRL):

  • ARG0 (agent) -> entity who acted
  • ARG2 (recipient) -> entity who benefited
  • V (predicate) -> action type: BRIBED, APPROVED, AWARDED

BK-tree for out-of-vocabulary legal term repair.


Phase 42 -- Forensic Content Intelligence

Branch: feature/phase-42-forensic-content

Shannon entropy classifier:

| Document type | Expected range |
|---|---|
| government_order | 3.8 -- 5.2 bits |
| cag_report | 4.0 -- 5.4 bits |
| tender_document | 3.5 -- 5.0 bits |
| court_order | 3.9 -- 5.3 bits |

Documents outside expected range flagged as SUSPICIOUS or LIKELY_FABRICATED.
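A sketch of the entropy check, assuming entropy is measured per character of the extracted text. The roadmap does not say where SUSPICIOUS ends and LIKELY_FABRICATED begins, so the 0.5-bit `margin` below is purely an illustrative assumption, as are the function names and the WITHIN_RANGE label.

```python
import math
from collections import Counter

EXPECTED_RANGE = {
    "government_order": (3.8, 5.2),
    "cag_report": (4.0, 5.4),
    "tender_document": (3.5, 5.0),
    "court_order": (3.9, 5.3),
}


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def classify_document(text: str, doc_type: str, margin: float = 0.5) -> str:
    """Compare entropy with the expected range for doc_type."""
    low, high = EXPECTED_RANGE[doc_type]
    h = shannon_entropy(text)
    if low <= h <= high:
        return "WITHIN_RANGE"
    distance = low - h if h < low else h - high
    # Assumed split: a small miss is SUSPICIOUS, a large one LIKELY_FABRICATED.
    return "SUSPICIOUS" if distance <= margin else "LIKELY_FABRICATED"
```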

Perceptual hash (pHash) for image-based document copy detection. PAN/CIN/Aadhaar regex extraction from document text. Lexical diversity score -- repetitive templates have diversity < 0.3.


Phase 43 -- Pivot Recommendation Engine

Branch: feature/phase-43-pivot

Problem: After finding a suspicious entity, the next best investigation target is unclear. The pivot engine scores all connected entities.

6-factor scoring:

| Factor | Weight | Description |
|---|---|---|
| pagerank | 0.20 | How central is this entity? |
| evidence_gap | 0.25 | How much do we NOT know? |
| risk_signals | 0.20 | log(risk_signals + 1) |
| connection_strength | 0.15 | Edge weight to current entity |
| temporal_recency | 0.10 | Recently active? |
| unexplored_depth | 0.10 | Unexplored 2-hop nodes |

Route: GET /pivot/{entity_id}?already_investigated=id1,id2
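The 6-factor score is a weighted sum, sketched below. It assumes every factor except `risk_signals` is already normalised to [0, 1] and that `risk_signals` is a raw count passed through log(x + 1); `pivot_score` and `rank_pivots` are hypothetical names, with `already_investigated` mirroring the query parameter of the route.

```python
import math

WEIGHTS = {
    "pagerank": 0.20,
    "evidence_gap": 0.25,
    "risk_signals": 0.20,
    "connection_strength": 0.15,
    "temporal_recency": 0.10,
    "unexplored_depth": 0.10,
}


def pivot_score(features: dict[str, float]) -> float:
    """Weighted sum of the six factors; risk_signals is log-damped."""
    values = dict(features)
    values["risk_signals"] = math.log(features.get("risk_signals", 0.0) + 1.0)
    return sum(weight * values.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank_pivots(candidates: dict[str, dict[str, float]],
                already_investigated: set[str]) -> list[tuple[str, float]]:
    """Score connected entities, skipping ones already investigated."""
    scored = [(entity_id, pivot_score(features))
              for entity_id, features in candidates.items()
              if entity_id not in already_investigated]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because log(risk_signals + 1) is unbounded, an entity with many signals can dominate the ranking unless the count is capped or rescaled upstream.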


Phase 44 -- Geospatial Verification via Satellite

Branch: feature/phase-44-satellite

Sentinel-2 L2A time series for project verification. NDVI change detection for forest diversion claims. NDBI (built-up index) for construction completion verification. SAR (Sentinel-1) for flood infrastructure claims. Compare contract completion claims vs satellite-observable progress.


Phase 45 -- W3C PROV-DM Provenance and TruthChain

Branch: feature/phase-45-provenance

TruthChain algorithm:

  • Each evidence node has: SHA-256 ID, source_type, content_hash, timestamp, status
  • Merkle tree over all evidence: root_hash changes if ANY evidence changes
  • Temporal decay: weight(E,t) = base_weight * exp(-lambda_type * days)
  • Status propagation: MODIFIED evidence propagates DEPENDS_ON_MODIFIED to descendants
  • Aggregate confidence = active_weight / total_weight

Decay rates by source:

  • court_order: 0.0001 (permanent)
  • cag_audit: 0.0002
  • government_portal: 0.0005
  • news_article: 0.001
  • social_media: 0.01
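The decay formula, Merkle root, and aggregate confidence can be sketched together. Several details are assumptions: leaves are hex-encoded SHA-256 hashes, an odd node is paired with itself, and only evidence whose status is the string "ACTIVE" counts toward `active_weight`.

```python
import hashlib
import math

DECAY_RATES = {
    "court_order": 0.0001,
    "cag_audit": 0.0002,
    "government_portal": 0.0005,
    "news_article": 0.001,
    "social_media": 0.01,
}


def decayed_weight(base_weight: float, source_type: str, days: float) -> float:
    """weight(E, t) = base_weight * exp(-lambda_type * days)."""
    return base_weight * math.exp(-DECAY_RATES[source_type] * days)


def merkle_root(content_hashes: list[str]) -> str:
    """Fold hex-encoded SHA-256 leaf hashes pairwise into a single root."""
    if not content_hashes:
        raise ValueError("at least one evidence hash is required")
    level = [bytes.fromhex(h) for h in content_hashes]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself (assumed)
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


def aggregate_confidence(evidence: list[dict]) -> float:
    """active_weight / total_weight over a list of {'weight', 'status'} dicts."""
    total = sum(e["weight"] for e in evidence)
    if total == 0:
        return 0.0
    active = sum(e["weight"] for e in evidence if e["status"] == "ACTIVE")
    return active / total
```

With these rates a social media item loses about 63% of its weight after 100 days (exp(-1)), while a court order loses about 1% over the same period.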

Export: JSON-LD using W3C PROV-DM ontology + Schema.org
Blockchain anchor: Merkle root stored in audit_chain.py (Bitcoin via OpenTimestamps)


Phase 46 -- Source Drift and Historical Record Analysis

Branch: feature/phase-46-source-drift

Wayback CDX API to detect when government records are silently modified. 7 fault types (ISWC 2024 taxonomy):

  • node_disappearance: entity removed from portal
  • edge_rewiring: director change silently backdated
  • attribute_drift: contract amount modified post-publication
  • cluster_split: formerly linked entities disconnected
  • cluster_merge: separate networks joined
  • temporal_burst: sudden new relationship creation
  • isolation: previously connected entity becomes isolated

Anti-forensics detection: commit A -> commit B (change) -> commit C (reverts) = SUPPRESS_ATTEMPT


Phase 47 -- Predictive Risk Trajectory

Branch: feature/phase-47-predictive

ARIMA(2,1,1) risk prediction:

  • Fits on monthly risk score history (min 12 data points)
  • Forecasts 6 months ahead with 80% confidence intervals
  • Alert when predicted score crosses HIGH threshold

GCPAL contrastive pre-training for label scarcity:

  • India's 1:707 confirmed-corruption ratio makes traditional supervised ML difficult
  • GCPAL mines supervised signals from the unlabelled relationship graph
  • Three augmented views: node feature dropout + edge dropout + KNN view
  • NT-Xent contrastive loss (temperature = 0.07)
  • Fine-tunes on confirmed cases from case_memory (min 5 needed)

Phase 48 -- Watchlist, Alerts, and ARIMA Prediction

Branch: feature/phase-48-watchlist

WebSocket push alerts when risk score changes for watched entities. YAML alert rules (same format as Phase 36). Webhook support for journalist notification systems.


Phase 49 -- Observability and Reliability

Branch: feature/phase-49-observability

Prometheus /metrics endpoint. Stale-data alerts when pipeline has not run in >7 days. Ingestion validator checks all 20 node types have recent data. /health upgraded to return per-source freshness status.


Phase 50 -- Security v2: RBAC and JWT

Branch: feature/phase-50-security-v2

Role-based access control: Lead Investigator, Contributor, Reviewer, Observer. JWT authentication with refresh tokens. DPDP Act compliance (India Data Protection). Entity-level access control for sensitive investigations.


Phase 51 -- Electoral Bond Causal Graph Engine

Branch: feature/phase-51-electoral-bond-causal

Critical missing feature. The data exists but the causal chain is not mapped.

Full graph path: Corporate donor -> ElectoralBond -> Party -> Ministry -> Policy -> Contract -> Company

Algorithm: Granger causality (from Phase 25) + Difference-in-Differences to establish whether policy changes statistically follow bond purchases.

New node type: PolicyChange (date, ministry, beneficiaries)
New relationship: FOLLOWED_BOND (lag_days, p_value, granger_f_stat)

New route: GET /electoral-bond/causal/{company_id}


Phase 52 -- Parliament Performance Analytics

Branch: feature/phase-52-parliament

New data sources: Lok Sabha division votes (loksabha.nic.in/Loksabha/Divisions), Rajya Sabha Q&A archive, Praja.org legislator data.

MP accountability score:

  • Attendance rate (0.30 weight)
  • Questions asked per session (0.25 weight)
  • Vote consistency with party line vs independent votes (0.20 weight)
  • Bills sponsored (0.15 weight)
  • Starred questions with substantive follow-up (0.10 weight)

New route: GET /parliament/performance/{politician_id}
New node type: DivisionVote, ParliamentSession
New relationship: VOTED_IN, ASKED_STARRED_QUESTION


Phase 53 -- Media Ownership Graph

Branch: feature/phase-53-media-ownership

New data sources: MIB media license registry, TRAI spectrum allocations.

Graph paths:

  • Channel -> Corporate parent -> Promoter -> Political donor
  • Channel -> Editorial stance correlation (NLP) -> Political entity

Editorial bias detection: NLP sentiment analysis comparing coverage of political entities across channels with known ownership structures.

New node types: MediaChannel, SpectrumLicense, EditorialEntity
New route: GET /media/ownership/{channel_id}


Phase 54 -- Constituency Development Index

Branch: feature/phase-54-constituency

Data sources: NDAP district SDG scores, MGNREGS employment data, PM Kisan disbursements, PM Awas completions, Swachh Bharat ODF data.

Algorithm: Regression analysis -- does the constituency improve during the politician's tenure vs comparison period?

Pre-election spending surge detection: CUSUM on district spending in 90 days before election vs annual baseline.

New route: GET /constituency/{id}/development
Satellite verification: Sentinel-2 images corroborate claimed completions.


Phase 55 -- Family Dynasty and Nepotism Graph

Branch: feature/phase-55-dynasty

Data source: FAMILY_OF edges extracted from MyNeta affidavit declarations ("Spouse: X", "Dependent 1: Y"). Already partially available in existing data.

Dynasty depth score:

  • Count of family members in government positions
  • Count of family-controlled companies with government contracts
  • Count of elections won by family members across generations
  • Geographic concentration (same constituency or district)

New relationship: FAMILY_OF (role: spouse/child/sibling/parent)
New route: GET /dynasty/{politician_id}


Phase 56 -- RTI Intelligence Engine

Branch: feature/phase-56-rti

RTI auto-filer: System detects evidence gaps in any investigation and drafts the exact RTI application to fill them.

Gap detection algorithm:

  • For each HIGH-risk finding: check if primary source data is available
  • If data missing: identify the correct Public Information Officer
  • Generate RTI draft citing the specific provisions (RTI Act 2005, Sections 6-8)

RTI outcome tracker: Index filed RTI applications from RTI Online portal. Map outcomes to graph: PIOs who deny information for high-risk entities = flag.

New route: GET /rti/draft/{entity_id} (generates RTI text)
New node type: RTIApplication, PublicInformationOfficer


Phase 57 -- A/B Algorithm Testing Framework (NEW)

Branch: feature/phase-57-ab-testing

Multi-armed bandit (Thompson Sampling) for algorithm selection:

  • Each algorithm arm has Beta(alpha, beta) prior over performance
  • alpha = times algorithm was "preferred" by human review
  • beta = times algorithm was "not preferred"
  • Select arm with highest sampled value at each request
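Thompson Sampling over Beta posteriors can be sketched with the standard library alone. The class name and the uniform Beta(1, 1) starting prior are assumptions; the roadmap only states that alpha and beta count "preferred" and "not preferred" outcomes.

```python
import random
from typing import Optional


class ThompsonSelector:
    """Pick an algorithm arm by sampling from each arm's Beta(alpha, beta)."""

    def __init__(self, arms: list[str], seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior (assumed starting point).
        self.params = {arm: [1.0, 1.0] for arm in arms}

    def select(self) -> str:
        """Sample once per arm and return the arm with the highest draw."""
        samples = {arm: self._rng.betavariate(alpha, beta)
                   for arm, (alpha, beta) in self.params.items()}
        return max(samples, key=samples.get)

    def record(self, arm: str, preferred: bool) -> None:
        """Update alpha on 'preferred', beta on 'not preferred'."""
        self.params[arm][0 if preferred else 1] += 1.0
```

Early on the draws are noisy, so every arm gets traffic; as human reviews accumulate, the better-performing arm wins most selections without exploration ever being switched off.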

Use case: When upgrading from static risk scorer -> ML ensemble -> NeuroSymbolic, verify the new algorithm actually improves outcomes.

New route: GET /admin/algorithm-performance


Phase 58 -- Real-Time Stream Processing (NEW)

Branch: feature/phase-58-streaming

Problem: Pipeline runs in batches. Breaking leads appear hours late.

Redis Streams (Kafka fallback) for real-time event ingestion. CUSUM online anomaly detection on the stream (no batch needed). Sliding window aggregation for real-time indicator updates.

Events processed in real-time:

  • new_contract: immediate CUSUM check on contract value
  • new_audit_report: check if any tracked entities are mentioned
  • new_enforcement_action: update risk scores for named entities
  • source_modification: detect when a scraped page changes

Phase 59 -- CorruptionDNA Fingerprint (NEW)

Branch: feature/phase-59-corruption-dna

Problem: Two entities in the same corruption network may have no direct graph edge -- different states, different directors, but identical patterns.

512-dim fingerprint = concatenation of:

  • Node2Vec structural embedding (128d)
  • TF-IDF document vector (128d)
  • Benford's Law digit distribution (9d, padded to 16d)
  • Temporal burst vector (64d)
  • Linguistic fingerprint -- Burrows Delta (64d)
  • Entity type one-hot (16d)
  • Risk indicator vector (16d)
  • CAG audit TF-IDF (64d)
  • Institutional path vector (32d)

MinHash LSH for efficient similarity search (cosine > 0.82 = same network).
New route: GET /dna/{entity_id} and GET /dna/similar/{entity_id}


Phase 60 -- ElectionProximityBurst Detector (NEW)

Branch: feature/phase-60-election-burst

The only corruption detection algorithm that encodes the Indian electoral calendar as a statistical regression variable.

Algorithm:

  1. Load full Indian electoral calendar (Lok Sabha + 28 state assemblies)
  2. ARIMA(2,1,1) on monthly metric aggregates
  3. PELT changepoint detection on ARIMA residuals
  4. Match changepoints to election proximity (within 180 days)
  5. CUSUM control chart with k=0.5, h=5.0
  6. Granger causality: does election_proximity_days Granger-cause the metric?
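Step 5 can be illustrated with a one-sided tabular CUSUM. The sketch assumes k and h are expressed in standard deviations of the baseline, tracks upward shifts only, and resets the statistic after each alert; the ARIMA, PELT, and Granger steps are not covered here.

```python
def cusum_alerts(values: list[float], mean: float, std: float,
                 k: float = 0.5, h: float = 5.0) -> list[int]:
    """Return indices where the upper CUSUM statistic exceeds h.

    values are standardised against the baseline mean and std, k is the
    slack allowed per observation and h the decision threshold.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    alerts = []
    s_hi = 0.0
    for i, x in enumerate(values):
        z = (x - mean) / std
        s_hi = max(0.0, s_hi + z - k)
        if s_hi > h:
            alerts.append(i)
            s_hi = 0.0  # restart accumulation after an alert (assumed policy)
    return alerts
```

For instance, with a baseline of mean 0 and std 1, three consecutive observations at 3 standard deviations accumulate 2.5, 5.0, and 7.5, so the alert fires on the third one.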

Output: burst_score (0-100), election_burst_flags, cusum_alerts, Granger p-value, interpretation in plain language.

Integrated as 16th investigator (temporal, weight 0.10)


Phase 61 -- BenamiGNN: Heterogeneous Graph Neural Network (NEW)

Branch: feature/phase-61-benami-gnn

Problem: 5-factor heuristic misses multi-hop benami: politician's cousin is director (not the politician), company has legitimate small contracts before being used for a large fraudulent one.

H-GNN architecture:

  • 8 relation types: DIRECTOR_OF, WON_CONTRACT, SHARES_ADDRESS, RELATED_TO, AWARDED_BY, FAMILY_MEMBER_OF, APPEARS_IN_AUDIT, SANCTIONED_BY
  • Layer 0: Per-type linear projection to d=64
  • Layer 1: Relation-aware message passing
  • Layer 2: Entity-type attention
  • Layer 3: Classification head -> benami_score in [0,1]

Fallback: Always falls back to existing 5-factor heuristic when:

  • PyTorch not installed
  • Subgraph has < 5 nodes
  • Model not trained yet

Training: Fine-tunes on confirmed benami cases from case_memory.


Phase 62 -- CartelDNA Sequential Mining (NEW)

Branch: feature/phase-62-cartel-dna

Problem: Current cartel detector checks single-tender award rotation. Temporal cartels rotate wins across months and across ministries to avoid statistical detection within any one ministry.

CartelDNA = PrefixSpan + HITS + DBSCAN:

  1. PrefixSpan on bid event sequences (company, category, month, rank)
  2. Detect alternating rank order patterns (length 2-6, min support 3)
  3. HITS on co-bidding network: authority = real winners, hub = fake competitors
  4. DBSCAN geographic clustering (epsilon = 50km, min_samples = 3)
  5. Cartel confidence = 0.35*pattern + 0.25*alternation + 0.20*geo + 0.20*HITS

New route: GET /cartel/dna/{entity_id}


Phase 63 -- SHAP and LIME Explainability Layer (NEW)

Branch: feature/phase-63-explainability

Problem: Risk scores carry no explanation. Journalists cannot publish "score: 67" without "why: politician_overlap drove +24 points."

SHAP TreeExplainer on the ML ensemble from Phase 19 upgrade:

  • Feature contributions for each of the 5 indicators
  • Counterfactual: "If contract_concentration were 0, score would be 43"
  • Baseline score (expected value)

LIME locally linear approximation for non-tree models.

New fields added to all risk responses:

  • shap_top_drivers: [{feature, shap_value, direction}]
  • shap_counterfactual: plain-language minimum change to flip risk level
  • shap_baseline: expected value before any features

New route: GET /risk/explain/{entity_id}


Phase 64 -- Cross-Language Entity Disambiguation (NEW)

Branch: feature/phase-64-cross-lingual

Problem: "Modi" / "modi" / "modii" appear in 22 scripts -- potentially stored as separate graph nodes. Cross-lingual entity linker maps all variants to a single canonical node using Wikidata Q-numbers.

XLM-RoBERTa zero-shot entity linking. Wikidata SPARQL for canonical Q-number lookup (existing scraper extended). Transliteration confidence score per script pair.


Phase 65 -- Knowledge Graph Completion (Missing Link Prediction) (NEW)

Branch: feature/phase-65-kg-completion

TransE link prediction: h + r = t in d-dimensional space. Missing edge score: ||h + r - t|| (lower = more probable edge).
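The scoring rule can be sketched with plain lists standing in for trained embeddings. The function names are hypothetical and the L2 norm is an assumption, since TransE can also be used with the L1 norm; training the embeddings is out of scope here.

```python
import math


def transe_score(h: list[float], r: list[float], t: list[float]) -> float:
    """L2 distance ||h + r - t||; lower means a more plausible edge."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(h: list[float], r: list[float],
               candidates: dict[str, list[float]]) -> list[str]:
    """Order candidate tail entities for (h, r, ?) from most to least plausible."""
    return sorted(candidates, key=lambda name: transe_score(h, r, candidates[name]))
```

Ranking candidate tails for a query such as (Politician, DIRECTOR_OF, ?) then amounts to calling `rank_tails` with the politician and relation vectors and the embeddings of all company nodes.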

Use cases:

  • (Politician, DIRECTOR_OF, ?) -- suggest companies likely controlled
  • (?, RELATED_TO, KnownShellCompany) -- find hidden associates
  • (Company, WON_CONTRACT, ?) -- predict future contract awards

Output: List of probable missing edges with confidence scores, presented as "Suggested next investigation targets."


Phase 66 -- LAS-GNN Temporal TBML Detection (NEW)

Branch: feature/phase-66-las-gnn

Problem: Current TBML detector uses threshold rules. Temporal money laundering (pre-election scatter-gather, below-threshold smurfing) is invisible to structural analysis.

LAS-GNN: LSTM aggregator on directed transaction graphs. Learns sequential order of edges imposed by timestamps. Detects motifs: scatter-gather, fan-in/fan-out, layering, pre-election burst.

Indian-specific motifs:

  • Pre-election scatter: funds split to many accounts < 6 months before election
  • Post-contract layering: payment -> N shell companies -> reconsolidated
  • Smurfing below threshold: many transactions < Rs 2 lakh (PMLA threshold)
  • Circular director rotation: A appoints X -> X at B -> B pays A

Phase 67 -- NeuroSymbolic Risk Reasoning (NEW)

Branch: feature/phase-67-neurosymbolic

Fuses three reasoning modes into one coherent system:

Stage 1 -- DEDUCTIVE (Phase 36 YAML rules):

  • Rules fire with certainty = 1.0 (logical certainty)
  • CRITICAL rule match -> score forced >= 75

Stage 2 -- INDUCTIVE (Phase 19 ML ensemble + SHAP):

  • GNN/ML soft score in [0,1]
  • SHAP feature contributions

Stage 3 -- ABDUCTIVE (Phase 38 DeepSeek-R1):

  • Chain-of-thought synthesis citing TruthChain evidence IDs
  • 2 competing hypotheses with scores

Stage 4 -- Integration:

  • final_score = 0.40*rule_certainty + 0.35*gnn_score + 0.25*r1_confidence
  • Adversarial override: if adversarial engine finds contradicting evidence -> cap at PROBABLE
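A sketch of the integration stage under stated assumptions: the three inputs are in [0, 1], the result is scaled to 0-100 so that the "forced >= 75" rule and the Phase 38 verdict thresholds apply, and "cap at PROBABLE" is read as a ceiling just below the CONFIRMED threshold of 80. The constant 79.0 and the function names are illustrative, not taken from the roadmap.

```python
PROBABLE_CEILING = 79.0  # assumed cap: highest score that still maps to PROBABLE


def final_score(rule_certainty: float, gnn_score: float, r1_confidence: float,
                critical_rule_fired: bool = False,
                adversarial_contradiction: bool = False) -> float:
    """Blend the three stages into a 0-100 score, then apply the overrides."""
    score = 100.0 * (0.40 * rule_certainty + 0.35 * gnn_score + 0.25 * r1_confidence)
    if critical_rule_fired:
        score = max(score, 75.0)
    if adversarial_contradiction:
        score = min(score, PROBABLE_CEILING)
    return score


def verdict(score: float) -> str:
    """Verdict levels from Phase 38."""
    if score >= 80:
        return "CONFIRMED"
    if score >= 50:
        return "PROBABLE"
    if score >= 20:
        return "WEAK"
    return "INSUFFICIENT"
```

Applying the adversarial cap after the CRITICAL floor means a contradicted CRITICAL match lands between 75 and 79, so it stays PROBABLE rather than CONFIRMED.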

Phase 68 -- InstitutionMetapath2Vec Embeddings (NEW)

Branch: feature/phase-68-metapath

5 Indian-specific metapaths for structured random walks:

  1. politician_enrichment: Politician-DIRECTOR_OF-Company-WON_CONTRACT-Contract
  2. circular_enrichment: Politician-MEMBER_OF-Party-CONTROLS-Ministry-...-DIRECTOR_OF-Politician
  3. audit_flag_circular: Company-WON_CONTRACT-Contract-MENTIONED_IN-AuditReport-AUDITS-Ministry
  4. shell_address_cluster: Director-DIRECTOR_OF-Company-SHARES_ADDRESS-Company
  5. constituency_benefit: Politician-REPRESENTS-Constituency-LOCATED_IN-District-HAS_PROJECT-Contract

128-dim entity embeddings trained via Word2Vec skip-gram on guided walks. find_similar_by_metapath() finds entities with the same institutional role across different states -- invisible to structural graph analysis.


Phase 69 -- Geospatial Risk Clustering (NEW)

Branch: feature/phase-69-geospatial

Moran's I spatial autocorrelation on district-level risk scores. I > 0 = spatial corruption hotspots cluster together.
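Global Moran's I can be computed directly from district scores and a spatial weight matrix. The sketch below assumes a dense list-of-lists matrix where `weights[i][j]` is 1 for neighbouring districts and 0 otherwise; building that matrix from district boundaries is not shown.

```python
def morans_i(values: list[float], weights: list[list[float]]) -> float:
    """Global Moran's I.

    I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2
    where m is the mean of values and W the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    if denom == 0 or w_sum == 0:
        raise ValueError("Moran's I is undefined for constant values or empty weights")
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_sum) * num / denom
```

On four districts in a row, scores of 10, 10, 0, 0 give I = 1/3 (similar neighbours), while alternating scores of 10, 0, 10, 0 give I = -1 (dissimilar neighbours).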

LISA (Local Indicators of Spatial Association):

  • High-High cluster: high-risk district surrounded by high-risk districts
  • Low-High outlier: low-risk district in high-risk region (potential evasion)
  • High-Low outlier: targeted corruption in otherwise clean district

Output: District-level choropleth with cluster classification.
New route: GET /geospatial/risk-clusters


Phase 70 -- Dynamic Knowledge Graph Anomaly Detection (NEW)

Branch: feature/phase-70-dynamic-kg

Continuously monitors graph for unexpected structural changes. 7 fault types (ISWC 2024): node_disappearance, edge_rewiring, attribute_drift, cluster_split, cluster_merge, temporal_burst, isolation.

Contextual anomaly detection: entity that was HIGH-risk 3 months ago is now suddenly LOW-risk = possible evidence suppression.


Phase 71 -- GCPAL Contrastive Pre-Training (NEW)

Branch: feature/phase-71-gcpal

Label scarcity problem: India has very few confirmed corruption cases relative to the total number of entities (estimated 1:707 ratio). Standard supervised ML cannot train on this imbalance.

GCPAL solution: NT-Xent contrastive loss on 3 augmented views:

  • View 1: node feature dropout (20%)
  • View 2: edge dropout (20%)
  • View 3: KNN implicit interactions (k=5)

Pre-trains on unlabelled graph. Fine-tunes on case_memory confirmed cases.


Phase 72 -- Automated Source Credibility Scoring (NEW)

Branch: feature/phase-72-source-credibility

Bayesian credibility model per source:

  • institutional_authority: government > NGO > news > social
  • historical_accuracy: confirmed vs denied past claims
  • methodology_transparency: does source explain collection method?
  • timeliness: freshness decay
  • cross_source_corroboration: independent corroboration count

Bayesian update after each confirmed/denied case.


Phase 73 -- Investigative RAG Over Case Memory (NEW)

Branch: feature/phase-73-rag-cases

RAG over all past investigation reports in case_memory. Query: "Past investigations involving electoral bonds and road contracts" -> Dense retrieval -> Top-k case summaries as context -> DeepSeek-R1 synthesizes commonalities and suggests strategy.


Phase 74 -- Continuous Model Drift Detection (NEW)

Branch: feature/phase-74-drift

Population Stability Index (PSI):

  • PSI < 0.10: stable
  • PSI 0.10-0.25: monitor closely
  • PSI > 0.25: retrain required
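PSI compares the binned distribution of a feature or score at training time with the current one. The sketch assumes both inputs are pre-binned counts over the same bins and uses a small `eps` floor to avoid log(0); the status labels are hypothetical names for the three bands above.

```python
import math


def psi(expected_counts: list[float], actual_counts: list[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index: sum over bins of (a - e) * ln(a / e),
    where e and a are the bin proportions of the two samples."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / e_total, eps)
        pa = max(a / a_total, eps)
        total += (pa - pe) * math.log(pa / pe)
    return total


def drift_status(value: float) -> str:
    """Map a PSI value to the three bands listed above."""
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "monitor"
    return "retrain"
```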

ADWIN (Adaptive Windowing): streaming concept drift detection. Auto-triggers GCPAL retraining job when drift detected.


Phase 75 -- Ethics and Bias Audit System (NEW)

Branch: feature/phase-75-ethics

Fairness metrics:

  • Demographic parity: P(HIGH_RISK | party=A) approx= P(HIGH_RISK | party=B)
  • Equal opportunity: TPR equal across entity types
  • Predictive parity: PPV equal across geographic regions
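Demographic parity can be checked by comparing HIGH_RISK rates between groups. The sketch below uses hypothetical helper names and assumes each group is a list of risk labels; the disparate impact ratio is the lower rate divided by the higher one, and 0.8 is a commonly used screening threshold rather than a value from the roadmap.

```python
def high_risk_rate(labels: list[str]) -> float:
    """Share of entities in a group labelled HIGH."""
    if not labels:
        raise ValueError("group is empty")
    return sum(1 for label in labels if label == "HIGH") / len(labels)


def demographic_parity_gap(group_a: list[str], group_b: list[str]) -> float:
    """Absolute difference between P(HIGH | A) and P(HIGH | B)."""
    return abs(high_risk_rate(group_a) - high_risk_rate(group_b))


def disparate_impact_ratio(group_a: list[str], group_b: list[str]) -> float:
    """Lower HIGH rate divided by the higher one; 1.0 means parity."""
    rate_a, rate_b = high_risk_rate(group_a), high_risk_rate(group_b)
    if max(rate_a, rate_b) == 0:
        return 1.0  # neither group has any HIGH labels
    return min(rate_a, rate_b) / max(rate_a, rate_b)
```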

Bias detection: chi-squared test, disparate impact ratio, SHAP fairness.
Mitigation: Reweighing, adversarial debiasing, calibration.

New route: GET /admin/bias-audit


BRANCH WORKFLOW

# Before each new phase:
git checkout main && git pull origin main
git checkout -b feature/phase-N-name

# After all commits:
git push origin feature/phase-N-name
# Open PR on GitHub -> merge -> pull main -> tag

# Tag every completed phase:
git tag -a v0.N.0 -m "Phase N: description"
git push origin v0.N.0

# Deploy to HuggingFace after every merge:
git push hf main --force

# Reseed after every deploy:
curl -X POST https://abinazebinoly-bharatgraph.hf.space/admin/seed

VERSION HISTORY

| Version | Phase | Key addition |
|---|---|---|
| v0.30.0 | 30 | Bug fix sprint -- 26 bugs resolved |
| v0.31.0 | 31 | Runtime profile auto-scaling |
| v0.32.0 | 32 | Entity resolution v2 (planned) |
| v0.33.0 | 33 | Custom graph engine (planned) |
| v0.40.0 | 40 | DeepSeek-V3 multilingual reports (planned) |
| v0.50.0 | 50 | Security v2 RBAC (planned) |
| v1.0.0 | 75 | Full production launch (planned) |

Developed by Abinaze Binoy