BharatGraph -- Complete Phase Roadmap

All branches merge into main. Branch naming: feature/phase-N-name or fix/description. Each phase has a GitHub Issue (see issues/ directory) and a PR description template.

COMPLETED PHASES (1-31)

Phase 1 -- Data Collection

Tag: pre-v1 | 6 scrapers, 3,199+ records, base scraper with rate limiting and retry

Phase 2 -- Data Processing

Tag: pre-v2 | Indian name normalisation, Jaccard entity resolution, parallel pipeline

Phase 3 -- Graph Database

Tag: pre-v3 | Neo4j schema, 7 node types, stable MD5 IDs, 8 Cypher templates

Phase 4 -- FastAPI Backend

Tag: v0.12.0 | FastAPI + Pydantic + Neo4j dependency injection, source citations

Phase 5 -- Risk Scoring Engine

5-indicator composite score, validate_language() forbidden-word enforcement

Phase 6 -- Expanded Data Sources (13 scrapers)

ICIJ, Wikidata, OpenSanctions, Lok Sabha, SEBI, Electoral Bonds added

Phase 7 -- NLP Document Intelligence

spaCy NER, Benford Law chi-squared, multilingual BERT NER, shadow draft detector

Phase 8 -- Advanced Graph Analytics

NetworkX betweenness/PageRank/Louvain, circular ownership, ghost company scorer

Phase 9 -- Eight New Indian Sources (21 total)

NJDG, ED, CVC, NCRB, LGD, IBBI, NGO Darpan, CPPP added with fallback samples

Phase 10 -- Multi-Investigator AI Engine

Tag: v0.10.0 | 12 parallel investigators, SHA-256 report hash, synthesis engine

Phase 11 -- Multilingual Platform (22 Languages)

All 22 Indian scheduled languages, auto-detection, Helsinki-NLP translation

Phase 12 -- PDF Dossier Generator

Jinja2 + WeasyPrint, SHA-256 integrity hash, GET /export/pdf/{id}

Phase 13 -- Production Frontend

Vanilla JS/HTML/CSS, D3.js force graph, 5 views, works offline from file://

Phase 14 -- Zero Cold-Start Deployment

Tag: v0.14.0 | HuggingFace Spaces Docker, service worker cache, GitHub Pages CI/CD

Phase 15 -- Mathematical Intelligence Engine

Tag: v0.15.0 | Spectral Fiedler value, Fourier FFT, 13th investigator (math)

Phase 16 -- Evidence Connection Map and Deep Investigation

Tag: v0.16.0 | 6-layer recursive investigation, connection mapper, WHY explanations

Phase 17 -- Security Hardening and Provenance Layer

Tag: v0.17.0 | Rate limiter, CSP/HSTS headers, input validator, SHA-256 audit log

Phase 18 -- Self-Learning System and Case Memory

Schema learner, pattern learner, weight optimiser (+-0.01 per 3 confirmed cases)

Phase 19 -- Affidavit Wealth Trajectory Engine

Tag: v0.19.0 | Kalman filter, 5-election series, 14th investigator (affidavit)

Phase 20 -- Biography Engine

Chronological timeline, 5 temporal convergence window types, neutral narrative

Phase 21 -- Benami Entity Detection

5-factor proxy score, thresholds HIGH>=65 MODERATE>=40, 15th investigator

Phase 22 -- Procurement DNA, Cartel Detection, Full Pipeline

TF-IDF cosine >=0.72, award rotation, co-bidding network, 21 scrapers

Phase 23 -- Revolving Door and TBML Detection

365-day cooling-off, pre-employment benefit, 2.5-sigma TBML, subcontract loops

Phase 24 -- Linguistic Fingerprinting

Burrows Delta authorship, template reuse detection, ghost-writing detection

Phase 25 -- Policy-Benefit Causal Analysis

Granger causality (lags 1-6), transfer entropy, CACA cross-ministry chain

Phase 26 -- Adversarial Counterevidence

Forced disproof, competing hypotheses, uncertainty propagation

Phase 27 -- Multi-Agent Debate Engine

7-agent 3-round debate, iMAD hesitation detection, minority dissent preserved

Phase 28 -- Dark Pattern Detection

PrefixSpan sequential mining, 6 pre-defined high-risk sequences

Phase 29 -- UX Overhaul and i18n

Evidence panel (4 tabs), D3 graph redesign, 22-language UI, timeline view

Phase 30 -- Bug Fix Sprint

Tag: v0.30.0 | 26 bugs resolved including BUG-1 (search crash), BUG-2 (7 missing loaders)

Phase 31 -- Runtime Profile and Auto-Scaling

Tag: v0.31.0 | Hardware detector, LOW/MEDIUM/HIGH profiles, GET /runtime endpoint
Branch: feature/phase-31-runtime-profile
Files: config/runtime_profile.py, config/model_selector.py, api/routes/runtime.py
Tests: 15 unit tests in tests/test_runtime_profile.py
Profile assignment: cpu*2 + ram*2 + gpu*2 + disk + docker + db_local (max 9)

PLANNED PHASES

Phase 32 -- Entity Resolution v2: Canonical Identity Engine

Branch: feature/phase-32-entity-resolution
Priority: CRITICAL -- fixes broken evidence chains across all phases

Problem: Jaccard token similarity misses transliteration variants, honorific variations ("Sh. Ram Kumar" vs "Shri Ramkumar"), and cross-script name forms. The same person stored under 3+ IDs = broken evidence chains.

Algorithms:

Jaro-Winkler (weight 0.30) -- character-level typo and transliteration
Jaccard token overlap (weight 0.20) -- word-order variations
Sentence-transformers cosine (weight 0.35) -- multilingual name variants
Exact PAN/CIN/GSTIN match (weight 1.0, overrides all) -- deterministic keys
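
A minimal sketch of how these signals might be combined, assuming each component score has already been computed in [0, 1]. The function name `combined_match_score` is hypothetical, and because the three fuzzy weights listed above sum to 0.85, the sketch renormalises by their total; that renormalisation is an assumption, not a confirmed design choice.

```python
from typing import Optional

# Fuzzy-signal weights as listed above. They sum to 0.85, so this
# sketch divides by the total weight (an assumption).
WEIGHTS = {"jaro_winkler": 0.30, "jaccard": 0.20, "embedding_cosine": 0.35}


def combined_match_score(
    scores: dict[str, float],
    id_a: Optional[str] = None,
    id_b: Optional[str] = None,
) -> float:
    """Return a match score in [0, 1] for two entity records."""
    # Deterministic keys (PAN/CIN/GSTIN) override every fuzzy signal.
    if id_a and id_b and id_a.strip().upper() == id_b.strip().upper():
        return 1.0
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    return weighted / total_weight
```

A missing component score is treated as 0.0 here, which biases towards keeping records separate when a signal is unavailable.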

New files:

processing/entity_resolver_v2.py -- CanonicalIdentityEngine class
processing/canonical_id.py -- stable SHA-256 ID generation functions
processing/alias_graph.py -- AliasGraph: alias_name -> canonical_id lookup

Indian name normalisation added:

Remove honorifics: Sh., Smt., Dr., Late, Sri, Shri, Er., Adv., Col.
Normalise suffixes: Private Limited -> Pvt Ltd, LLP, Ltd
Script-aware: Devanagari -> Latin transliteration for comparison
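
The honorific and suffix rules above can be sketched as below. `normalise_name` is a hypothetical helper, lower-casing the result is an assumption, and Devanagari transliteration is left out because it needs a dedicated library.

```python
import re

# Honorifics from the list above, matched only as whole leading tokens.
HONORIFICS = ["Sh.", "Smt.", "Dr.", "Late", "Sri", "Shri", "Er.", "Adv.", "Col."]
_HONORIFIC_RE = re.compile(
    r"^(?:(?:" + "|".join(re.escape(h) for h in HONORIFICS) + r")\s+)+",
    flags=re.IGNORECASE,
)
_SUFFIX_RE = re.compile(r"\bPrivate\s+Limited\b", flags=re.IGNORECASE)


def normalise_name(raw: str) -> str:
    """Strip leading honorifics and normalise company suffixes."""
    name = " ".join(raw.split())            # collapse repeated whitespace
    name = _HONORIFIC_RE.sub("", name)      # drop one or more leading honorifics
    name = _SUFFIX_RE.sub("Pvt Ltd", name)  # Private Limited -> Pvt Ltd
    return name.lower()
```

"Sh. Ram Kumar" and "Shri Ramkumar" normalise to "ram kumar" and "ramkumar", so the remaining spacing difference still has to be caught by the character-level similarity measure.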

Integration: pipeline.py resolve_dataset() upgraded to use v2 engine

Phase 33 -- Custom Graph Engine: Eliminate Neo4j 50K Limit

Branch: feature/phase-33-custom-graph-engine
Priority: HIGH -- AuraDB free tier caps at 50K nodes / 175K relationships

Architecture:

graph_engine/
+-- store.py          -- LevelDB key-value backing store
+-- hnsw.py           -- HNSW vector index (M=16, ef=200)
+-- query_planner.py  -- Cypher-to-native query translator
+-- temporal.py       -- Time-weighted edge decay by relationship type
+-- version_control.py -- Git-style diff log for graph mutations
+-- compat_layer.py   -- Translates all existing Cypher to native calls

Temporal edge decay lambdas:

court_order: 0.00005 (slowest -- court records are permanent)
cag_audit: 0.0002
government_portal: 0.0005
director_of: 0.0003
member_of: 0.0005
news_article: 0.001
social_media: 0.01 (fastest decay)

Version control: Every graph mutation is recorded as a diff with before/after hashes. Detects when government portals silently modify records post-publication.
Anti-forensics pattern: commit A -> commit B (change) -> commit C (reverts to A) = flag
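
The A -> B -> A revert pattern could be detected roughly as follows. This sketch assumes the state after each mutation is available as a content hash in commit order; `find_revert_flags` is a hypothetical helper name.

```python
def find_revert_flags(hashes: list[str]) -> list[int]:
    """Return each index i where state i changes at i+1 and is restored at i+2."""
    flags = []
    for i in range(len(hashes) - 2):
        a, b, c = hashes[i], hashes[i + 1], hashes[i + 2]
        if a != b and a == c:
            flags.append(i)
    return flags
```

Only immediate reverts are caught; a revert spread over several intermediate commits would need a wider window.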

Phase 34 -- Vector Search and Hybrid Retrieval

Branch: feature/phase-34-vector-search

Problem: Keyword search misses semantically similar documents. Searching "Maharashtra road contract irregularity" does not find CAG reports about "highway construction irregularity in Pune" even though they are the same topic.

Algorithms:

FAISS (cpu) or Qdrant for vector index
BM25 for keyword ranking
Reciprocal Rank Fusion (k=60): RRF = sum(1 / (60 + rank))
Query classifier routes to appropriate retrieval strategy
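
Reciprocal Rank Fusion with k=60 can be sketched as below, assuming each retriever returns an ordered list of document IDs. The function name and the tie-break on document ID are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["d1", "d2", "d3"]
vector_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists (here `d2` and `d1`) rise above documents that only one retriever found.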

Query routing:

Query type	Keywords	Retrieval mix
factual	who is, what is, when did	BM25 70% + vector 30%
relational	connected to, path from	Graph 80% + vector 20%
temporal	before, after, election, contract date	Graph 60% + BM25 40%
exploratory	similar to, pattern, cluster	Vector 60% + community 40%

Embedding model: paraphrase-multilingual-MiniLM-L12-v2 (covers all 22 languages)

Phase 35 -- Plugin System and YAML Enrichers

Branch: feature/phase-35-plugins

Lazy-loading plugin architecture -- new data sources added by dropping a YAML file in enrichers/ with no code changes.

Plugin registry also covers algorithms -- new detection algorithms registered as plugins, enabling Phase 57 A/B testing.

Phase 36 -- Sigma-Style YAML Rule Engine

Branch: feature/phase-36-rule-engine

Problem: Adding a new detection rule requires writing Python + Cypher. Non-developer investigators cannot contribute detection logic.

YAML -> Cypher compiler -- a rule file specifies conditions, thresholds, and actions. The engine compiles it to Cypher at startup.

10 built-in rules shipped:

cartel_rotation.yaml -- same vendor group rotates wins
electoral_bond_proximity.yaml -- bond + contract within 12 months (CRITICAL)
family_directorship_web.yaml -- politician's family = company director
audit_contract_overlap.yaml -- continued contracts after CAG audit flag
shell_company_age_contract.yaml -- company < 6 months old + large contract
single_bidder_high_value.yaml -- single bid above district average
circular_ownership_3node.yaml -- 3-node corporate ownership cycle
revolving_door_365day.yaml -- government to private within 1 year
address_cluster_directors.yaml -- 3+ companies same registered address
pre_election_contract_surge.yaml -- contract spend spike 90 days before poll

Phase 37 -- Job Queue and Worker Pool

Branch: feature/phase-37-job-queue

Redis-backed job queue with state machine: INIT -> QUEUED -> RUNNING -> DONE

Algorithm job priorities:

Priority 1 (immediate): entity_resolution, neurosymbolic_risk, rule_engine
Priority 2 (30s): gnn_tbml, election_burst, shap_explanation, graphrag_summary
Priority 3 (5min): corruption_dna, metapath_walk, community_detection, topic_modeling
Priority 4 (off-peak): fingerprint_index, gcpal_pretraining, wayback_drift

Phase 38 -- DeepSeek-R1 Chain-of-Thought Reasoning

Branch: feature/phase-38-deepseek-r1

Problem: Current synthesis logic (3+ investigators agreeing = HIGH) is a vote count, not reasoning. No audit trail of how a conclusion was reached.

DeepSeek-R1 integration:

Receives: graph findings + SHAP explanations + TruthChain evidence IDs
Generates: step-by-step reasoning chain citing specific evidence node IDs
Produces: 2 competing hypotheses with scores, then a final verdict
Verdict levels: CONFIRMED (>=80), PROBABLE (>=50), WEAK (>=20), INSUFFICIENT

Anti-hallucination enforcement:

Every R1 claim must cite a TruthChain node_id (format: [EVIDENCE-XXXX])
Post-generation validation: regex check for invented node IDs
Invalid citations are stripped before the report is returned
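
A sketch of the post-generation check, assuming citation IDs are alphanumeric and that only the citation token (not the surrounding claim) is stripped. Both points are assumptions; `strip_invalid_citations` is a hypothetical name.

```python
import re

# Assumed citation shape: [EVIDENCE-XXXX] with an alphanumeric ID.
CITATION_RE = re.compile(r"\[EVIDENCE-([A-Za-z0-9]+)\]")


def strip_invalid_citations(text: str, known_ids: set[str]) -> tuple[str, list[str]]:
    """Remove citations whose ID is not a known TruthChain node; report them."""
    invalid: list[str] = []

    def _replace(match: re.Match) -> str:
        if match.group(1) in known_ids:
            return match.group(0)       # keep valid citations untouched
        invalid.append(match.group(1))  # remember the invented ID
        return ""

    cleaned = CITATION_RE.sub(_replace, text)
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)        # collapse leftover gaps
    cleaned = re.sub(r"[ \t]+([.,;:])", r"\1", cleaned)  # no space before punctuation
    return cleaned.strip(), invalid
```

Returning the list of invalid IDs alongside the cleaned text lets the caller log how often the model invents evidence.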

Fallback: When DeepSeek API is unavailable, the existing multi-investigator synthesis provides the output. R1 augments -- it does not replace.

Phase 38B -- GraphRAG: Graph-Guided LLM Retrieval (NEW)

Branch: feature/phase-38b-graphrag

Problem: R1 cannot answer global questions like "What are the main corruption themes across all 5,000 CAG audit reports?" Standard RAG retrieves isolated chunks.

GraphRAG approach:

Run Leiden clustering over all scraped documents and graph nodes
For each community > 3 nodes, R1 generates a community summary
At query time: embed query -> retrieve top-k community summaries by cosine
Feed summaries + relevant subgraph as structured context to R1

New files:

ai/graphrag/community_indexer.py -- builds community summaries offline
ai/graphrag/graphrag_retriever.py -- query-time retrieval

Integration with Phase 38: R1 receives GraphRAG community summaries instead of raw graph fragments -- dramatically reduces hallucination.

Phase 39 -- DeepSeek-VL2 Visual Evidence Analysis

Branch: feature/phase-39-deepseek-vl2

Analyse scanned affidavit PDFs, audit report images, and newspaper clippings. Signature mismatch detection. Document image authenticity via Shannon entropy. OCR pipeline for non-digital government documents.

Phase 40 -- DeepSeek-V3 Multilingual Dossier Generation

Branch: feature/phase-40-deepseek-v3

Generate full investigation reports in all 22 Indian languages. CONFIRMED/PROBABLE/WEAK/INSUFFICIENT grading on every finding. Length: 800-1200 words per report. Export to PDF with trilingual header.

Phase 41 -- Legal Intelligence Pipeline

Branch: feature/phase-41-legal

IPC Section Classifier:

Algorithm: TF-IDF + OneVsRestClassifier(LogisticRegression) -- multi-label
8 corruption-relevant sections: IPC 420, 409, 120B, 467, 468, 471 and Prevention of Corruption Act 7, 13
Keyword fallback when model not trained

Crime triple extractor:

Pattern: Subject -> Action -> Object from legal text
Store as directed evidence edges: (Company)-[:BRIBED]->(Official)

Semantic Role Labelling (SRL):

ARG0 (agent) -> entity who acted
ARG2 (recipient) -> entity who benefited
V (predicate) -> action type: BRIBED, APPROVED, AWARDED

BK-tree for out-of-vocabulary legal term repair.

Phase 42 -- Forensic Content Intelligence

Branch: feature/phase-42-forensic-content

Shannon entropy classifier:

Document type	Expected range
government_order	3.8 -- 5.2 bits
cag_report	4.0 -- 5.4 bits
tender_document	3.5 -- 5.0 bits
court_order	3.9 -- 5.3 bits

Documents outside expected range flagged as SUSPICIOUS or LIKELY_FABRICATED.
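
A sketch of the entropy check, assuming character-level entropy in bits per character. The `NORMAL` label and the 0.5-bit margin separating SUSPICIOUS from LIKELY_FABRICATED are assumptions, since the roadmap does not say how the two flags are told apart.

```python
import math
from collections import Counter

# Expected ranges (bits) from the table above.
EXPECTED_RANGES = {
    "government_order": (3.8, 5.2),
    "cag_report": (4.0, 5.4),
    "tender_document": (3.5, 5.0),
    "court_order": (3.9, 5.3),
}


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def classify_document(text: str, doc_type: str, margin: float = 0.5) -> str:
    low, high = EXPECTED_RANGES[doc_type]
    h = shannon_entropy(text)
    if low <= h <= high:
        return "NORMAL"
    distance = low - h if h < low else h - high
    # Assumed rule: slightly out of range is suspicious, far out is likely fabricated.
    return "SUSPICIOUS" if distance <= margin else "LIKELY_FABRICATED"
```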

Perceptual hash (pHash) for image-based document copy detection. PAN/CIN/Aadhaar regex extraction from document text. Lexical diversity score -- repetitive templates have diversity < 0.3.

Phase 43 -- Pivot Recommendation Engine

Branch: feature/phase-43-pivot

Problem: After finding a suspicious entity, the next best investigation target is unclear. The pivot engine scores all connected entities.

6-factor scoring:

Factor	Weight	Description
pagerank	0.20	How central is this entity?
evidence_gap	0.25	How much do we NOT know?
risk_signals	0.20	log(risk_signals + 1)
connection_strength	0.15	Edge weight to current entity
temporal_recency	0.10	Recently active?
unexplored_depth	0.10	Unexplored 2-hop nodes
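
The weighted scoring above could be sketched as follows. It assumes every factor except `risk_signals` arrives already scaled to [0, 1] and that `risk_signals` is a raw count; the helper names are hypothetical.

```python
import math

# Weights from the table above (they sum to 1.0).
PIVOT_WEIGHTS = {
    "pagerank": 0.20,
    "evidence_gap": 0.25,
    "risk_signals": 0.20,
    "connection_strength": 0.15,
    "temporal_recency": 0.10,
    "unexplored_depth": 0.10,
}


def pivot_score(factors: dict[str, float]) -> float:
    values = dict(factors)
    # risk_signals enters as log(count + 1), as in the table.
    values["risk_signals"] = math.log(values.get("risk_signals", 0) + 1)
    return sum(PIVOT_WEIGHTS[name] * values.get(name, 0.0) for name in PIVOT_WEIGHTS)


def rank_pivots(
    candidates: dict[str, dict[str, float]], already_investigated: set[str]
) -> list[tuple[str, float]]:
    """Score connected entities, skipping ones already investigated."""
    scored = [
        (entity_id, pivot_score(factors))
        for entity_id, factors in candidates.items()
        if entity_id not in already_investigated
    ]
    return sorted(scored, key=lambda pair: -pair[1])
```

The `already_investigated` set mirrors the query parameter on the route below.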

Route: GET /pivot/{entity_id}?already_investigated=id1,id2

Phase 44 -- Geospatial Verification via Satellite

Branch: feature/phase-44-satellite

Sentinel-2 L2A time series for project verification. NDVI change detection for forest diversion claims. NDBI (built-up index) for construction completion verification. SAR (Sentinel-1) for flood infrastructure claims. Compare contract completion claims vs satellite-observable progress.

Phase 45 -- W3C PROV-DM Provenance and TruthChain

Branch: feature/phase-45-provenance

TruthChain algorithm:

Each evidence node has: SHA-256 ID, source_type, content_hash, timestamp, status
Merkle tree over all evidence: root_hash changes if ANY evidence changes
Temporal decay: weight(E,t) = base_weight * exp(-lambda_type * days)
Status propagation: MODIFIED evidence propagates DEPENDS_ON_MODIFIED to descendants
Aggregate confidence = active_weight / total_weight
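
The Merkle-root property (the root changes if any evidence changes) can be illustrated with a short sketch. It assumes leaves are hex-encoded SHA-256 content hashes and duplicates the last node on odd-sized levels, which is one common convention rather than a confirmed project choice.

```python
import hashlib


def merkle_root(content_hashes: list[str]) -> str:
    """Compute a Merkle root over hex-encoded SHA-256 leaf hashes."""
    if not content_hashes:
        return hashlib.sha256(b"").hexdigest()
    level = [bytes.fromhex(h) for h in content_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```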

Decay rates by source:

court_order: 0.0001 (permanent)
cag_audit: 0.0002
government_portal: 0.0005
news_article: 0.001
social_media: 0.01
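
A sketch of the decay formula and aggregate confidence using the rates listed above. The `ACTIVE` status label and the dictionary field names are assumptions made for illustration.

```python
import math

# Phase 45 decay rates (per day) from the list above.
DECAY_RATES = {
    "court_order": 0.0001,
    "cag_audit": 0.0002,
    "government_portal": 0.0005,
    "news_article": 0.001,
    "social_media": 0.01,
}


def decayed_weight(base_weight: float, source_type: str, days: float) -> float:
    """weight(E, t) = base_weight * exp(-lambda_type * days)."""
    return base_weight * math.exp(-DECAY_RATES[source_type] * days)


def aggregate_confidence(evidence: list[dict]) -> float:
    """active_weight / total_weight over decayed evidence weights."""
    total = 0.0
    active = 0.0
    for item in evidence:
        w = decayed_weight(item["base_weight"], item["source_type"], item["days"])
        total += w
        if item["status"] == "ACTIVE":  # assumed label for unmodified evidence
            active += w
    return active / total if total else 0.0
```

At these rates a social media item loses half its weight in about 69 days (ln 2 / 0.01), while a court order takes roughly 6,900 days.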

Export: JSON-LD using W3C PROV-DM ontology + Schema.org
Blockchain anchor: Merkle root stored in audit_chain.py (Bitcoin via OpenTimestamps)

Phase 46 -- Source Drift and Historical Record Analysis

Branch: feature/phase-46-source-drift

Wayback CDX API to detect when government records are silently modified. 7 fault types (ISWC 2024 taxonomy):

node_disappearance: entity removed from portal
edge_rewiring: director change silently backdated
attribute_drift: contract amount modified post-publication
cluster_split: formerly linked entities disconnected
cluster_merge: separate networks joined
temporal_burst: sudden new relationship creation
isolation: previously connected entity becomes isolated

Anti-forensics detection: commit A -> commit B (change) -> commit C (reverts) = SUPPRESS_ATTEMPT

Phase 47 -- Predictive Risk Trajectory

Branch: feature/phase-47-predictive

ARIMA(2,1,1) risk prediction:

Fits on monthly risk score history (min 12 data points)
Forecasts 6 months ahead with 80% confidence intervals
Alert when predicted score crosses HIGH threshold

GCPAL contrastive pre-training for label scarcity:

India's 1:707 confirmed-corruption ratio makes traditional supervised ML difficult
GCPAL mines supervised signals from the unlabelled relationship graph
Three augmented views: node feature dropout + edge dropout + KNN view
NT-Xent contrastive loss (temperature = 0.07)
Fine-tunes on confirmed cases from case_memory (min 5 needed)

Phase 48 -- Watchlist, Alerts, and ARIMA Prediction

Branch: feature/phase-48-watchlist

WebSocket push alerts when risk score changes for watched entities. YAML alert rules (same format as Phase 36). Webhook support for journalist notification systems.

Phase 49 -- Observability and Reliability

Branch: feature/phase-49-observability

Prometheus /metrics endpoint. Stale-data alerts when pipeline has not run in >7 days. Ingestion validator checks all 20 node types have recent data. /health upgraded to return per-source freshness status.

Phase 50 -- Security v2: RBAC and JWT

Branch: feature/phase-50-security-v2

Role-based access control: Lead Investigator, Contributor, Reviewer, Observer. JWT authentication with refresh tokens. DPDP Act 2023 compliance (India's Digital Personal Data Protection Act). Entity-level access control for sensitive investigations.

Phase 51 -- Electoral Bond Causal Graph Engine

Branch: feature/phase-51-electoral-bond-causal

Critical missing feature. The data exists but the causal chain is not mapped.

Full graph path: Corporate donor -> ElectoralBond -> Party -> Ministry -> Policy -> Contract -> Company

Algorithm: Granger causality (from Phase 25) + Difference-in-Differences to establish whether policy changes statistically follow bond purchases.

New node type: PolicyChange (date, ministry, beneficiaries)
New relationship: FOLLOWED_BOND (lag_days, p_value, granger_f_stat)

New route: GET /electoral-bond/causal/{company_id}

Phase 52 -- Parliament Performance Analytics

Branch: feature/phase-52-parliament

New data sources: Lok Sabha division votes (loksabha.nic.in/Loksabha/Divisions), Rajya Sabha Q&A archive, Praja.org legislator data.

MP accountability score:

Attendance rate (0.30 weight)
Questions asked per session (0.25 weight)
Vote consistency with party line vs independent votes (0.20 weight)
Bills sponsored (0.15 weight)
Starred questions with substantive follow-up (0.10 weight)

New route: GET /parliament/performance/{politician_id}
New node types: DivisionVote, ParliamentSession
New relationships: VOTED_IN, ASKED_STARRED_QUESTION

Phase 53 -- Media Ownership Graph

Branch: feature/phase-53-media-ownership

New data sources: MIB media license registry, TRAI spectrum allocations.

Graph paths:

Channel -> Corporate parent -> Promoter -> Political donor
Channel -> Editorial stance correlation (NLP) -> Political entity

Editorial bias detection: NLP sentiment analysis comparing coverage of political entities across channels with known ownership structures.

New node types: MediaChannel, SpectrumLicense, EditorialEntity
New route: GET /media/ownership/{channel_id}

Phase 54 -- Constituency Development Index

Branch: feature/phase-54-constituency

Data sources: NDAP district SDG scores, MGNREGS employment data, PM Kisan disbursements, PM Awas completions, Swachh Bharat ODF data.

Algorithm: Regression analysis -- does the constituency improve during the politician's tenure vs comparison period?

Pre-election spending surge detection: CUSUM on district spending in 90 days before election vs annual baseline.

New route: GET /constituency/{id}/development
Satellite verification: Sentinel-2 images corroborate claimed completions.

Phase 55 -- Family Dynasty and Nepotism Graph

Branch: feature/phase-55-dynasty

Data source: FAMILY_OF edges extracted from MyNeta affidavit declarations ("Spouse: X", "Dependent 1: Y"). Already partially available in existing data.

Dynasty depth score:

Count of family members in government positions
Count of family-controlled companies with government contracts
Count of elections won by family members across generations
Geographic concentration (same constituency or district)

New relationship: FAMILY_OF (role: spouse/child/sibling/parent)
New route: GET /dynasty/{politician_id}

Phase 56 -- RTI Intelligence Engine

Branch: feature/phase-56-rti

RTI auto-filer: System detects evidence gaps in any investigation and drafts the exact RTI application to fill them.

Gap detection algorithm:

For each HIGH-risk finding: check if primary source data is available
If data missing: identify the correct Public Information Officer
Generate RTI draft citing the specific provisions (RTI Act 2005, Sections 6-8)

RTI outcome tracker: Index filed RTI applications from RTI Online portal. Map outcomes to graph: PIOs who deny information for high-risk entities = flag.

New route: GET /rti/draft/{entity_id} (generates RTI text)
New node types: RTIApplication, PublicInformationOfficer

Phase 57 -- A/B Algorithm Testing Framework (NEW)

Branch: feature/phase-57-ab-testing

Multi-armed bandit (Thompson Sampling) for algorithm selection:

Each algorithm arm has Beta(alpha, beta) prior over performance
alpha = times algorithm was "preferred" by human review
beta = times algorithm was "not preferred"
Select arm with highest sampled value at each request
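
A sketch of the arm selection step using the standard library's Beta sampler. Adding 1 to both counts (a uniform Beta(1, 1) prior, so that untested arms remain valid) is an assumption, and `select_arm` is a hypothetical name.

```python
import random


def select_arm(stats: dict[str, tuple[int, int]], rng: random.Random) -> str:
    """Thompson Sampling: draw from each arm's Beta posterior, pick the max.

    stats maps arm name -> (times preferred, times not preferred).
    """
    if not stats:
        raise ValueError("at least one algorithm arm is required")
    samples = {
        arm: rng.betavariate(preferred + 1, not_preferred + 1)
        for arm, (preferred, not_preferred) in stats.items()
    }
    return max(samples, key=samples.get)
```

Because each request draws fresh samples, an arm with little review history still gets selected occasionally, which is how the bandit keeps exploring.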

Use case: When upgrading from static risk scorer -> ML ensemble -> NeuroSymbolic, verify the new algorithm actually improves outcomes.

New route: GET /admin/algorithm-performance

Phase 58 -- Real-Time Stream Processing (NEW)

Branch: feature/phase-58-streaming

Problem: Pipeline runs in batches. Breaking leads appear hours late.

Redis Streams (Kafka fallback) for real-time event ingestion. CUSUM online anomaly detection on the stream (no batch needed). Sliding window aggregation for real-time indicator updates.

Events processed in real-time:

new_contract: immediate CUSUM check on contract value
new_audit_report: check if any tracked entities are mentioned
new_enforcement_action: update risk scores for named entities
source_modification: detect when a scraped page changes

Phase 59 -- CorruptionDNA Fingerprint (NEW)

Branch: feature/phase-59-corruption-dna

Problem: Two entities in the same corruption network may have no direct graph edge -- different states, different directors, but identical patterns.

512-dim fingerprint = concatenation of:

Node2Vec structural embedding (128d)
TF-IDF document vector (128d)
Benford's Law digit distribution (9d, padded to 16d)
Temporal burst vector (64d)
Linguistic fingerprint -- Burrows Delta (64d)
Entity type one-hot (16d)
Risk indicator vector (16d)
CAG audit TF-IDF (64d)
Institutional path vector (32d)

MinHash LSH for efficient similarity search (cosine > 0.82 = same network).
New routes: GET /dna/{entity_id} and GET /dna/similar/{entity_id}

Phase 60 -- ElectionProximityBurst Detector (NEW)

Branch: feature/phase-60-election-burst

The only corruption detection algorithm that encodes the Indian electoral calendar as a statistical regression variable.

Algorithm:

Load full Indian electoral calendar (Lok Sabha + 28 state assemblies)
ARIMA(2,1,1) on monthly metric aggregates
PELT changepoint detection on ARIMA residuals
Match changepoints to election proximity (within 180 days)
CUSUM control chart with k=0.5, h=5.0
Granger causality: does election_proximity_days Granger-cause the metric?
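
The CUSUM step with k=0.5 and h=5.0 can be sketched as a two-sided tabular CUSUM on a standardised series. The baseline mean and standard deviation are assumed to be supplied by the caller, and resetting after each alert is an illustrative choice.

```python
def cusum_alerts(
    values: list[float], mean: float, std: float, k: float = 0.5, h: float = 5.0
) -> list[int]:
    """Return indices where the upper or lower CUSUM statistic exceeds h."""
    if std <= 0:
        raise ValueError("std must be positive")
    s_hi = s_lo = 0.0
    alerts = []
    for i, x in enumerate(values):
        z = (x - mean) / std                # standardise against the baseline
        s_hi = max(0.0, s_hi + z - k)       # accumulates upward shifts
        s_lo = max(0.0, s_lo - z - k)       # accumulates downward shifts
        if s_hi > h or s_lo > h:
            alerts.append(i)
            s_hi = s_lo = 0.0               # restart after an alert
    return alerts
```

With k and h expressed in standard deviations, a sustained 3-sigma shift triggers an alert on its third observation.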

Output: burst_score (0-100), election_burst_flags, cusum_alerts, Granger p-value, interpretation in plain language.

Integrated as 16th investigator (temporal, weight 0.10)

Phase 61 -- BenamiGNN: Heterogeneous Graph Neural Network (NEW)

Branch: feature/phase-61-benami-gnn

Problem: 5-factor heuristic misses multi-hop benami: politician's cousin is director (not the politician), company has legitimate small contracts before being used for a large fraudulent one.

H-GNN architecture:

8 relation types: DIRECTOR_OF, WON_CONTRACT, SHARES_ADDRESS, RELATED_TO, AWARDED_BY, FAMILY_MEMBER_OF, APPEARS_IN_AUDIT, SANCTIONED_BY
Layer 0: Per-type linear projection to d=64
Layer 1: Relation-aware message passing
Layer 2: Entity-type attention
Layer 3: Classification head -> benami_score in [0,1]

Fallback: Always falls back to existing 5-factor heuristic when:

PyTorch not installed
Subgraph has < 5 nodes
Model not trained yet

Training: Fine-tunes on confirmed benami cases from case_memory.

Phase 62 -- CartelDNA Sequential Mining (NEW)

Branch: feature/phase-62-cartel-dna

Problem: Current cartel detector checks single-tender award rotation. Temporal cartels rotate wins across months and across ministries to avoid statistical detection within any one ministry.

CartelDNA = PrefixSpan + HITS + DBSCAN:

PrefixSpan on bid event sequences (company, category, month, rank)
Detect alternating rank order patterns (length 2-6, min support 3)
HITS on co-bidding network: authority = real winners, hub = fake competitors
DBSCAN geographic clustering (epsilon = 50km, min_samples = 3)
Cartel confidence = 0.35*pattern + 0.25*alternation + 0.20*geo + 0.20*HITS

New route: GET /cartel/dna/{entity_id}

Phase 63 -- SHAP and LIME Explainability Layer (NEW)

Branch: feature/phase-63-explainability

Problem: Risk scores are returned without any explanation. Journalists cannot publish "score: 67" without "why: politician_overlap drove +24 points."

SHAP TreeExplainer on the ML ensemble from Phase 19 upgrade:

Feature contributions for each of the 5 indicators
Counterfactual: "If contract_concentration were 0, score would be 43"
Baseline score (expected value)

LIME locally linear approximation for non-tree models.

New fields added to all risk responses:

shap_top_drivers: [{feature, shap_value, direction}]
shap_counterfactual: plain-language minimum change to flip risk level
shap_baseline: expected value before any features

New route: GET /risk/explain/{entity_id}

Phase 64 -- Cross-Language Entity Disambiguation (NEW)

Branch: feature/phase-64-cross-lingual

Problem: "Modi" / "modi" / "modii" appear in 22 scripts -- potentially stored as separate graph nodes. Cross-lingual entity linker maps all variants to a single canonical node using Wikidata Q-numbers.

XLM-RoBERTa zero-shot entity linking. Wikidata SPARQL for canonical Q-number lookup (existing scraper extended). Transliteration confidence score per script pair.

Phase 65 -- Knowledge Graph Completion (Missing Link Prediction) (NEW)

Branch: feature/phase-65-kg-completion

TransE link prediction: each relation is modelled as a translation, h + r approx= t, in d-dimensional space.
Missing edge score: ||h + r - t|| (lower = more probable edge).
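
The TransE score could be computed as below, assuming embeddings are already trained and using the L2 norm (TransE is also used with L1). The helper names and toy vectors are illustrative.

```python
import math


def transe_distance(h: list[float], r: list[float], t: list[float]) -> float:
    """||h + r - t|| under the L2 norm; lower means a more plausible edge."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(
    h: list[float], r: list[float], candidates: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Rank candidate tail entities for the query (h, r, ?)."""
    scored = [(name, transe_distance(h, r, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


ranking = rank_tails(
    h=[1.0, 0.0],
    r=[0.0, 1.0],
    candidates={"company_b": [4.0, 5.0], "company_a": [1.0, 1.0]},
)
```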

Use cases:

(Politician, DIRECTOR_OF, ?) -- suggest companies likely controlled
(?, RELATED_TO, KnownShellCompany) -- find hidden associates
(Company, WON_CONTRACT, ?) -- predict future contract awards

Output: List of probable missing edges with confidence scores, presented as "Suggested next investigation targets."

Phase 66 -- LAS-GNN Temporal TBML Detection (NEW)

Branch: feature/phase-66-las-gnn

Problem: Current TBML detector uses threshold rules. Temporal money laundering (pre-election scatter-gather, below-threshold smurfing) is invisible to structural analysis.

LAS-GNN: LSTM aggregator on directed transaction graphs. Learns sequential order of edges imposed by timestamps. Detects motifs: scatter-gather, fan-in/fan-out, layering, pre-election burst.

Indian-specific motifs:

Pre-election scatter: funds split to many accounts < 6 months before election
Post-contract layering: payment -> N shell companies -> reconsolidated
Smurfing below threshold: many transactions < Rs 2 lakh (PMLA threshold)
Circular director rotation: A appoints X -> X at B -> B pays A

Phase 67 -- NeuroSymbolic Risk Reasoning (NEW)

Branch: feature/phase-67-neurosymbolic

Fuses three reasoning modes into one coherent system:

Stage 1 -- DEDUCTIVE (Phase 36 YAML rules):

Rules fire with certainty = 1.0 (logical certainty)
CRITICAL rule match -> score forced >= 75

Stage 2 -- INDUCTIVE (Phase 19 ML ensemble + SHAP):

GNN/ML soft score in [0,1]
SHAP feature contributions

Stage 3 -- ABDUCTIVE (Phase 38 DeepSeek-R1):

Chain-of-thought synthesis citing TruthChain evidence IDs
2 competing hypotheses with scores

Stage 4 -- Integration:

final_score = 0.40*rule_certainty + 0.35*gnn_score + 0.25*r1_confidence
Adversarial override: if adversarial engine finds contradicting evidence -> cap at PROBABLE
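
A sketch of the integration stage. It assumes all three inputs lie in [0, 1], that the final score is reported on a 0-100 scale, and that "cap at PROBABLE" means staying below the CONFIRMED threshold of 80 from Phase 38. None of these scale choices are confirmed by the roadmap.

```python
def final_risk_score(
    rule_certainty: float,
    gnn_score: float,
    r1_confidence: float,
    critical_rule_matched: bool = False,
    adversarial_contradiction: bool = False,
) -> float:
    """Blend deductive, inductive and abductive signals into a 0-100 score."""
    score = 100.0 * (0.40 * rule_certainty + 0.35 * gnn_score + 0.25 * r1_confidence)
    if critical_rule_matched:
        score = max(score, 75.0)  # CRITICAL rule match forces score >= 75
    if adversarial_contradiction:
        score = min(score, 79.0)  # assumed cap: stay below CONFIRMED (>= 80)
    return score
```

Applying the adversarial cap last means contradicting evidence can limit the verdict even when a CRITICAL rule has fired.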

Phase 68 -- InstitutionMetapath2Vec Embeddings (NEW)

Branch: feature/phase-68-metapath

5 Indian-specific metapaths for structured random walks:

politician_enrichment: Politician-DIRECTOR_OF-Company-WON_CONTRACT-Contract
circular_enrichment: Politician-MEMBER_OF-Party-CONTROLS-Ministry-...-DIRECTOR_OF-Politician
audit_flag_circular: Company-WON_CONTRACT-Contract-MENTIONED_IN-AuditReport-AUDITS-Ministry
shell_address_cluster: Director-DIRECTOR_OF-Company-SHARES_ADDRESS-Company
constituency_benefit: Politician-REPRESENTS-Constituency-LOCATED_IN-District-HAS_PROJECT-Contract

128-dim entity embeddings trained via Word2Vec skip-gram on guided walks. find_similar_by_metapath() finds entities with the same institutional role across different states -- invisible to structural graph analysis.

Phase 69 -- Geospatial Risk Clustering (NEW)

Branch: feature/phase-69-geospatial

Moran's I spatial autocorrelation on district-level risk scores. I > 0 = spatial corruption hotspots cluster together.
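
Global Moran's I can be computed directly from its definition. The sketch below assumes a dense spatial weight matrix (for example, 1 for neighbouring districts and 0 otherwise) and at least some variation in the scores.

```python
def morans_i(values: list[float], weights: list[list[float]]) -> float:
    """I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Four districts in a row; each is a neighbour of the next one.
ADJACENCY = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 0, 0], ADJACENCY)    # similar scores sit together
alternating = morans_i([1, 0, 1, 0], ADJACENCY)  # neighbours always differ
```

The clustered layout gives a positive value and the alternating layout a negative one, matching the I > 0 reading above.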

LISA (Local Indicators of Spatial Association):

High-High cluster: high-risk district surrounded by high-risk districts
Low-High outlier: low-risk district in high-risk region (potential evasion)
High-Low outlier: targeted corruption in otherwise clean district

Output: District-level choropleth with cluster classification.
New route: GET /geospatial/risk-clusters

Phase 70 -- Dynamic Knowledge Graph Anomaly Detection (NEW)

Branch: feature/phase-70-dynamic-kg

Continuously monitors graph for unexpected structural changes. 7 fault types (ISWC 2024): node_disappearance, edge_rewiring, attribute_drift, cluster_split, cluster_merge, temporal_burst, isolation.

Contextual anomaly detection: entity that was HIGH-risk 3 months ago is now suddenly LOW-risk = possible evidence suppression.

Phase 71 -- GCPAL Contrastive Pre-Training (NEW)

Branch: feature/phase-71-gcpal

Label scarcity problem: India has very few confirmed corruption cases relative to the total number of entities (estimated 1:707 ratio). Standard supervised ML cannot train on this imbalance.

GCPAL solution: NT-Xent contrastive loss on 3 augmented views:

View 1: node feature dropout (20%)
View 2: edge dropout (20%)
View 3: KNN implicit interactions (k=5)

Pre-trains on unlabelled graph. Fine-tunes on case_memory confirmed cases.

Phase 72 -- Automated Source Credibility Scoring (NEW)

Branch: feature/phase-72-source-credibility

Bayesian credibility model per source:

institutional_authority: government > NGO > news > social
historical_accuracy: confirmed vs denied past claims
methodology_transparency: does source explain collection method?
timeliness: freshness decay
cross_source_corroboration: independent corroboration count

Bayesian update after each confirmed/denied case.

Phase 73 -- Investigative RAG Over Case Memory (NEW)

Branch: feature/phase-73-rag-cases

RAG over all past investigation reports in case_memory. Query: "Past investigations involving electoral bonds and road contracts" -> Dense retrieval -> Top-k case summaries as context -> DeepSeek-R1 synthesizes commonalities and suggests strategy.

Phase 74 -- Continuous Model Drift Detection (NEW)

Branch: feature/phase-74-drift

Population Stability Index (PSI):

PSI < 0.10: stable
PSI 0.10-0.25: monitor closely
PSI > 0.25: retrain required
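
PSI and the thresholds above can be sketched as follows, assuming the reference and current score distributions have already been binned into matching proportions. The small `eps` floor for empty bins and the status labels are illustrative.

```python
import math


def population_stability_index(
    expected: list[float], actual: list[float], eps: float = 1e-6
) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # avoid log(0) and division by zero on empty bins
        a = max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


def drift_status(psi: float) -> str:
    if psi < 0.10:
        return "stable"
    if psi <= 0.25:
        return "monitor"
    return "retrain"
```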

ADWIN (Adaptive Windowing): streaming concept drift detection. Auto-triggers GCPAL retraining job when drift detected.

Phase 75 -- Ethics and Bias Audit System (NEW)

Branch: feature/phase-75-ethics

Fairness metrics:

Demographic parity: P(HIGH_RISK | party=A) approx= P(HIGH_RISK | party=B)
Equal opportunity: TPR equal across entity types
Predictive parity: PPV equal across geographic regions

Bias detection: chi-squared test, disparate impact ratio, SHAP fairness. Mitigation: Reweighing, adversarial debiasing, calibration.

New route: GET /admin/bias-audit

BRANCH WORKFLOW

# Before each new phase:
git checkout main && git pull origin main
git checkout -b feature/phase-N-name

# After all commits:
git push origin feature/phase-N-name
# Open PR on GitHub -> merge -> pull main -> tag

# Tag every completed phase:
git tag -a v0.N.0 -m "Phase N: description"
git push origin v0.N.0

# Deploy to HuggingFace after every merge:
git push hf main --force

# Reseed after every deploy:
curl -X POST https://abinazebinoy-bharatgraph.hf.space/admin/seed

VERSION HISTORY

Version	Phase	Key addition
v0.30.0	30	Bug fix sprint -- 26 bugs resolved
v0.31.0	31	Runtime profile auto-scaling
v0.32.0	32	Entity resolution v2 (planned)
v0.33.0	33	Custom graph engine (planned)
v0.40.0	40	DeepSeek-V3 multilingual reports (planned)
v0.50.0	50	Security v2 RBAC (planned)
v1.0.0	75	Full production launch (planned)