BharatGraph -- Complete Phase Roadmap
All branches merge into main. Branch naming: feature/phase-N-name or fix/description.
Each phase has a GitHub Issue (see issues/ directory) and a PR description template.
COMPLETED PHASES (1-31)
Phase 1 -- Data Collection
Tag: pre-v1 | 6 scrapers, 3,199+ records, base scraper with rate limiting and retry
Phase 2 -- Data Processing
Tag: pre-v2 | Indian name normalisation, Jaccard entity resolution, parallel pipeline
Phase 3 -- Graph Database
Tag: pre-v3 | Neo4j schema, 7 node types, stable MD5 IDs, 8 Cypher templates
Phase 4 -- FastAPI Backend
Tag: v0.12.0 | FastAPI + Pydantic + Neo4j dependency injection, source citations
Phase 5 -- Risk Scoring Engine
5-indicator composite score, validate_language() forbidden-word enforcement
Phase 6 -- Expanded Data Sources (13 scrapers)
ICIJ, Wikidata, OpenSanctions, Lok Sabha, SEBI, Electoral Bonds added
Phase 7 -- NLP Document Intelligence
spaCy NER, Benford Law chi-squared, multilingual BERT NER, shadow draft detector
Phase 8 -- Advanced Graph Analytics
NetworkX betweenness/PageRank/Louvain, circular ownership, ghost company scorer
Phase 9 -- Eight New Indian Sources (21 total)
NJDG, ED, CVC, NCRB, LGD, IBBI, NGO Darpan, CPPP added with fallback samples
Phase 10 -- Multi-Investigator AI Engine
Tag: v0.10.0 | 12 parallel investigators, SHA-256 report hash, synthesis engine
Phase 11 -- Multilingual Platform (22 Languages)
All 22 Indian scheduled languages, auto-detection, Helsinki-NLP translation
Phase 12 -- PDF Dossier Generator
Jinja2 + WeasyPrint, SHA-256 integrity hash, GET /export/pdf/{id}
Phase 13 -- Production Frontend
Vanilla JS/HTML/CSS, D3.js force graph, 5 views, works offline from file://
Phase 14 -- Zero Cold-Start Deployment
Tag: v0.14.0 | HuggingFace Spaces Docker, service worker cache, GitHub Pages CI/CD
Phase 15 -- Mathematical Intelligence Engine
Tag: v0.15.0 | Spectral Fiedler value, Fourier FFT, 13th investigator (math)
Phase 16 -- Evidence Connection Map and Deep Investigation
Tag: v0.16.0 | 6-layer recursive investigation, connection mapper, WHY explanations
Phase 17 -- Security Hardening and Provenance Layer
Tag: v0.17.0 | Rate limiter, CSP/HSTS headers, input validator, SHA-256 audit log
Phase 18 -- Self-Learning System and Case Memory
Schema learner, pattern learner, weight optimiser (±0.01 per 3 confirmed cases)
Phase 19 -- Affidavit Wealth Trajectory Engine
Tag: v0.19.0 | Kalman filter, 5-election series, 14th investigator (affidavit)
Phase 20 -- Biography Engine
Chronological timeline, 5 temporal convergence window types, neutral narrative
Phase 21 -- Benami Entity Detection
5-factor proxy score, thresholds HIGH>=65 MODERATE>=40, 15th investigator
Phase 22 -- Procurement DNA, Cartel Detection, Full Pipeline
TF-IDF cosine >=0.72, award rotation, co-bidding network, 21 scrapers
Phase 23 -- Revolving Door and TBML Detection
365-day cooling-off, pre-employment benefit, 2.5-sigma TBML, subcontract loops
Phase 24 -- Linguistic Fingerprinting
Burrows Delta authorship, template reuse detection, ghost-writing detection
Phase 25 -- Policy-Benefit Causal Analysis
Granger causality (lags 1-6), transfer entropy, CACA cross-ministry chain
Phase 26 -- Adversarial Counterevidence
Forced disproof, competing hypotheses, uncertainty propagation
Phase 27 -- Multi-Agent Debate Engine
7-agent 3-round debate, iMAD hesitation detection, minority dissent preserved
Phase 28 -- Dark Pattern Detection
PrefixSpan sequential mining, 6 pre-defined high-risk sequences
Phase 29 -- UX Overhaul and i18n
Evidence panel (4 tabs), D3 graph redesign, 22-language UI, timeline view
Phase 30 -- Bug Fix Sprint
Tag: v0.30.0 | 26 bugs resolved including BUG-1 (search crash), BUG-2 (7 missing loaders)
Phase 31 -- Runtime Profile and Auto-Scaling
Tag: v0.31.0 | Hardware detector, LOW/MEDIUM/HIGH profiles, GET /runtime endpoint
Branch: feature/phase-31-runtime-profile
Files: config/runtime_profile.py, config/model_selector.py, api/routes/runtime.py
Tests: 15 unit tests in tests/test_runtime_profile.py
Profile assignment: cpu*2 + ram*2 + gpu*2 + disk + docker + db_local (max 9)
PLANNED PHASES
Phase 32 -- Entity Resolution v2: Canonical Identity Engine
Branch: feature/phase-32-entity-resolution
Priority: CRITICAL -- fixes broken evidence chains across all phases
Problem: Jaccard token similarity misses transliteration variants, honorific variations ("Sh. Ram Kumar" vs "Shri Ramkumar"), and cross-script name forms. The same person stored under 3+ IDs = broken evidence chains.
Algorithms:
- Jaro-Winkler (weight 0.30) -- character-level typo and transliteration
- Jaccard token overlap (weight 0.20) -- word-order variations
- Sentence-transformers cosine (weight 0.35) -- multilingual name variants
- Exact PAN/CIN/GSTIN match (weight 1.0, overrides all) -- deterministic keys
New files:
- processing/entity_resolver_v2.py -- CanonicalIdentityEngine class
- processing/canonical_id.py -- stable SHA-256 ID generation functions
- processing/alias_graph.py -- AliasGraph: alias_name -> canonical_id lookup
Indian name normalisation added:
- Remove honorifics: Sh., Smt., Dr., Late, Sri, Shri, Er., Adv., Col.
- Normalise suffixes: Private Limited -> Pvt Ltd, LLP, Ltd
- Script-aware: Devanagari -> Latin transliteration for comparison
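A minimal sketch of the honorific and suffix steps above, assuming honorifics appear only at the start of a name and that comparison is case-insensitive. The function name `normalise_name` is hypothetical, and Devanagari transliteration is left out because it needs an external library.

```python
import re

# Honorifics taken from the list above; trailing dots are treated as optional.
HONORIFICS = ["Sh.", "Smt.", "Dr.", "Late", "Sri", "Shri", "Er.", "Adv.", "Col."]

_HONORIFIC_RE = re.compile(
    r"^(?:(?:" + "|".join(re.escape(h.rstrip(".")) for h in HONORIFICS) + r")\.?\s+)+",
    re.IGNORECASE,
)


def normalise_name(raw: str) -> str:
    """Strip leading honorifics and normalise 'Private Limited' (hypothetical helper)."""
    name = " ".join(raw.split())                      # collapse whitespace
    name = _HONORIFIC_RE.sub("", name)                # drop one or more leading honorifics
    name = re.sub(r"\bprivate\s+limited\b", "Pvt Ltd", name, flags=re.IGNORECASE)
    return name.lower()


print(normalise_name("Sh. Ram Kumar"))               # ram kumar
print(normalise_name("Shri Ramkumar"))               # ramkumar
print(normalise_name("Acme Infra Private Limited"))  # acme infra pvt ltd
```

After normalisation "ram kumar" and "ramkumar" still share no tokens, so Jaccard alone scores them 0. This is the gap the character-level Jaro-Winkler component is meant to close.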
Integration: pipeline.py resolve_dataset() upgraded to use v2 engine
Phase 33 -- Custom Graph Engine: Eliminate Neo4j 50K Limit
Branch: feature/phase-33-custom-graph-engine
Priority: HIGH -- AuraDB free tier caps at 50K nodes / 175K relationships
Architecture:
graph_engine/
+-- store.py -- LevelDB key-value backing store
+-- hnsw.py -- HNSW vector index (M=16, ef=200)
+-- query_planner.py -- Cypher-to-native query translator
+-- temporal.py -- Time-weighted edge decay by relationship type
+-- version_control.py -- Git-style diff log for graph mutations
+-- compat_layer.py -- Translates all existing Cypher to native calls
Temporal edge decay lambdas:
- court_order: 0.00005 (slowest -- court records are permanent)
- cag_audit: 0.0002
- government_portal: 0.0005
- director_of: 0.0003
- member_of: 0.0005
- news_article: 0.001
- social_media: 0.01 (fastest decay)
Version control: Every graph mutation is recorded as a diff with before/after hashes. Detects when government portals silently modify records post-publication. Anti-forensics pattern: commit A -> commit B (change) -> commit C (reverts to A) = flag
Phase 34 -- Vector Search and Hybrid Retrieval
Branch: feature/phase-34-vector-search
Problem: Keyword search misses semantically similar documents. Searching "Maharashtra road contract irregularity" does not find CAG reports about "highway construction irregularity in Pune" even though they are the same topic.
Algorithms:
- FAISS (cpu) or Qdrant for vector index
- BM25 for keyword ranking
- Reciprocal Rank Fusion (k=60): RRF = sum(1 / (60 + rank))
- Query classifier routes to appropriate retrieval strategy
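The fusion step can be sketched as follows, assuming each retriever returns an ordered list of document IDs with rank starting at 1. The function name and the sample IDs are illustrative only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["cag-17", "tender-04", "news-88"]      # hypothetical keyword results
vector_hits = ["cag-21", "cag-17", "tender-04"]     # hypothetical vector results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print([doc_id for doc_id, _ in fused])
```

Documents found by both retrievers rise to the top even when neither ranks them first, which is why RRF needs no score calibration between BM25 and cosine similarity.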
Query routing:
| Query type | Keywords | Retrieval mix |
|---|---|---|
| factual | who is, what is, when did | BM25 70% + vector 30% |
| relational | connected to, path from | Graph 80% + vector 20% |
| temporal | before, after, election, contract date | Graph 60% + BM25 40% |
| exploratory | similar to, pattern, cluster | Vector 60% + community 40% |
Embedding model: paraphrase-multilingual-MiniLM-L12-v2 (covers all 22 languages)
Phase 35 -- Plugin System and YAML Enrichers
Branch: feature/phase-35-plugins
Lazy-loading plugin architecture -- new data sources added by dropping a YAML file in enrichers/ with no code changes.
Plugin registry also covers algorithms -- new detection algorithms registered as plugins, enabling Phase 57 A/B testing.
Phase 36 -- Sigma-Style YAML Rule Engine
Branch: feature/phase-36-rule-engine
Problem: Adding a new detection rule requires writing Python + Cypher. Non-developer investigators cannot contribute detection logic.
YAML -> Cypher compiler -- a rule file specifies conditions, thresholds, and actions. The engine compiles it to Cypher at startup.
10 built-in rules shipped:
- cartel_rotation.yaml -- same vendor group rotates wins
- electoral_bond_proximity.yaml -- bond + contract within 12 months (CRITICAL)
- family_directorship_web.yaml -- politician's family = company director
- audit_contract_overlap.yaml -- continued contracts after CAG audit flag
- shell_company_age_contract.yaml -- company < 6 months old + large contract
- single_bidder_high_value.yaml -- single bid above district average
- circular_ownership_3node.yaml -- 3-node corporate ownership cycle
- revolving_door_365day.yaml -- government to private within 1 year
- address_cluster_directors.yaml -- 3+ companies same registered address
- pre_election_contract_surge.yaml -- contract spend spike 90 days before poll
Phase 37 -- Job Queue and Worker Pool
Branch: feature/phase-37-job-queue
Redis-backed job queue with state machine: INIT -> QUEUED -> RUNNING -> DONE
Algorithm job priorities:
- Priority 1 (immediate): entity_resolution, neurosymbolic_risk, rule_engine
- Priority 2 (30s): gnn_tbml, election_burst, shap_explanation, graphrag_summary
- Priority 3 (5min): corruption_dna, metapath_walk, community_detection, topic_modeling
- Priority 4 (off-peak): fingerprint_index, gcpal_pretraining, wayback_drift
Phase 38 -- DeepSeek-R1 Chain-of-Thought Reasoning
Branch: feature/phase-38-deepseek-r1
Problem: Current synthesis logic (3+ investigators agreeing = HIGH) is a vote count, not reasoning. No audit trail of how a conclusion was reached.
DeepSeek-R1 integration:
- Receives: graph findings + SHAP explanations + TruthChain evidence IDs
- Generates: step-by-step reasoning chain citing specific evidence node IDs
- Produces: 2 competing hypotheses with scores, then a final verdict
- Verdict levels: CONFIRMED (>=80), PROBABLE (>=50), WEAK (>=20), INSUFFICIENT
Anti-hallucination enforcement:
- Every R1 claim must cite a TruthChain node_id (format: [EVIDENCE-XXXX])
- Post-generation validation: regex check for invented node IDs
- Invalid citations are stripped before the report is returned
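A possible shape for the post-generation check, assuming the XXXX part of a citation is alphanumeric and that "stripped" means removing the citation tag rather than the whole sentence. Both are assumptions, and `strip_invalid_citations` is a hypothetical name.

```python
import re

# Assumed citation format: [EVIDENCE-<alphanumeric id>]
CITATION_RE = re.compile(r"\[EVIDENCE-([A-Za-z0-9]+)\]")


def strip_invalid_citations(text, known_ids):
    """Remove citations whose node ID is not in known_ids; return (clean_text, invalid_ids)."""
    invalid = []

    def _check(match):
        node_id = match.group(1)
        if node_id in known_ids:
            return match.group(0)       # keep valid citation untouched
        invalid.append(node_id)
        return ""                       # drop invented citation

    cleaned = CITATION_RE.sub(_check, text)
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)      # collapse doubled spaces
    cleaned = re.sub(r"\s+([.,;])", r"\1", cleaned)   # no space before punctuation
    return cleaned.strip(), invalid


report = ("Contract awarded after donation [EVIDENCE-0012]. "
          "Director is a relative [EVIDENCE-9999].")
clean, invented = strip_invalid_citations(report, known_ids={"0012"})
print(clean)
print(invented)
```

Returning the list of invented IDs alongside the cleaned text lets the caller log them or downgrade the verdict when too many claims lose their evidence.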
Fallback: When DeepSeek API is unavailable, the existing multi-investigator synthesis provides the output. R1 augments -- it does not replace.
Phase 38B -- GraphRAG: Graph-Guided LLM Retrieval (NEW)
Branch: feature/phase-38b-graphrag
Problem: R1 cannot answer global questions like "What are the main corruption themes across all 5,000 CAG audit reports?" Standard RAG retrieves isolated chunks.
GraphRAG approach:
- Run Leiden clustering over all scraped documents and graph nodes
- For each community > 3 nodes, R1 generates a community summary
- At query time: embed query -> retrieve top-k community summaries by cosine
- Feed summaries + relevant subgraph as structured context to R1
New files:
- ai/graphrag/community_indexer.py -- builds community summaries offline
- ai/graphrag/graphrag_retriever.py -- query-time retrieval
Integration with Phase 38: R1 receives GraphRAG community summaries instead of raw graph fragments, which is intended to reduce hallucination by grounding answers in pre-summarised context.
Phase 39 -- DeepSeek-VL2 Visual Evidence Analysis
Branch: feature/phase-39-deepseek-vl2
Analyse scanned affidavit PDFs, audit report images, and newspaper clippings. Signature mismatch detection. Document image authenticity via Shannon entropy. OCR pipeline for non-digital government documents.
Phase 40 -- DeepSeek-V3 Multilingual Dossier Generation
Branch: feature/phase-40-deepseek-v3
Generate full investigation reports in all 22 Indian languages. CONFIRMED/PROBABLE/WEAK/INSUFFICIENT grading on every finding. Length: 800-1200 words per report. Export to PDF with trilingual header.
Phase 41 -- Legal Intelligence Pipeline
Branch: feature/phase-41-legal
IPC Section Classifier:
- Algorithm: TF-IDF + OneVsRestClassifier(LogisticRegression) -- multi-label
- 8 corruption-relevant IPC sections: 420, 409, 13, 7, 120B, 467, 468, 471
- Keyword fallback when model not trained
Crime triple extractor:
- Pattern: Subject -> Action -> Object from legal text
- Store as directed evidence edges: (Company)-[:BRIBED]->(Official)
Semantic Role Labelling (SRL):
- ARG0 (agent) -> entity who acted
- ARG2 (recipient) -> entity who benefited
- V (predicate) -> action type: BRIBED, APPROVED, AWARDED
BK-tree for out-of-vocabulary legal term repair.
Phase 42 -- Forensic Content Intelligence
Branch: feature/phase-42-forensic-content
Shannon entropy classifier:
| Document type | Expected range |
|---|---|
| government_order | 3.8 -- 5.2 bits |
| cag_report | 4.0 -- 5.4 bits |
| tender_document | 3.5 -- 5.0 bits |
| court_order | 3.9 -- 5.3 bits |
Documents outside expected range flagged as SUSPICIOUS or LIKELY_FABRICATED.
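A small sketch of the range check, assuming entropy is measured per character of the extracted text in bits. The roadmap does not say how SUSPICIOUS and LIKELY_FABRICATED are separated, so this example only reports whether a document falls inside its expected range.

```python
import math
from collections import Counter

# Expected ranges in bits per character, copied from the table above.
EXPECTED_RANGE = {
    "government_order": (3.8, 5.2),
    "cag_report": (4.0, 5.4),
    "tender_document": (3.5, 5.0),
    "court_order": (3.9, 5.3),
}


def shannon_entropy(text):
    """Character-level Shannon entropy in bits (assumed unit of analysis)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def in_expected_range(text, doc_type):
    low, high = EXPECTED_RANGE[doc_type]
    return low <= shannon_entropy(text) <= high


print(shannon_entropy("abcd"))                       # 2.0 (four equally likely symbols)
print(in_expected_range("aaaa", "tender_document"))  # False (entropy is 0)
```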
Perceptual hash (pHash) for image-based document copy detection. PAN/CIN/Aadhaar regex extraction from document text. Lexical diversity score -- repetitive templates have diversity < 0.3.
Phase 43 -- Pivot Recommendation Engine
Branch: feature/phase-43-pivot
Problem: After finding a suspicious entity, the next best investigation target is unclear. The pivot engine scores all connected entities.
6-factor scoring:
| Factor | Weight | Description |
|---|---|---|
| pagerank | 0.20 | How central is this entity? |
| evidence_gap | 0.25 | How much do we NOT know? |
| risk_signals | 0.20 | log(risk_signals + 1) |
| connection_strength | 0.15 | Edge weight to current entity |
| temporal_recency | 0.10 | Recently active? |
| unexplored_depth | 0.10 | Unexplored 2-hop nodes |
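The weighted sum in the table can be sketched like this, assuming every factor except risk_signals is already normalised to [0, 1] and that risk_signals is a raw count passed through log(x + 1). The function names and the sample entities are hypothetical.

```python
import math

# Weights copied from the table above; they sum to 1.0.
PIVOT_WEIGHTS = {
    "pagerank": 0.20,
    "evidence_gap": 0.25,
    "risk_signals": 0.20,
    "connection_strength": 0.15,
    "temporal_recency": 0.10,
    "unexplored_depth": 0.10,
}


def pivot_score(factors):
    """Weighted sum of the six factors; risk_signals is a raw count (assumption)."""
    values = dict(factors)
    values["risk_signals"] = math.log(values["risk_signals"] + 1)
    return sum(weight * values[name] for name, weight in PIVOT_WEIGHTS.items())


def rank_pivots(candidates, already_investigated=()):
    """Score connected entities, skipping ones already investigated."""
    skip = set(already_investigated)
    scored = [(eid, pivot_score(f)) for eid, f in candidates.items() if eid not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = {
    "company-a": {"pagerank": 1.0, "evidence_gap": 1.0, "risk_signals": 0,
                  "connection_strength": 0.0, "temporal_recency": 0.0, "unexplored_depth": 0.0},
    "company-b": {"pagerank": 0.0, "evidence_gap": 0.0, "risk_signals": 0,
                  "connection_strength": 0.0, "temporal_recency": 0.0, "unexplored_depth": 0.0},
}
print(rank_pivots(candidates, already_investigated=["company-b"]))
```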
Route: GET /pivot/{entity_id}?already_investigated=id1,id2
Phase 44 -- Geospatial Verification via Satellite
Branch: feature/phase-44-satellite
Sentinel-2 L2A time series for project verification. NDVI change detection for forest diversion claims. NDBI (built-up index) for construction completion verification. SAR (Sentinel-1) for flood infrastructure claims. Compare contract completion claims vs satellite-observable progress.
Phase 45 -- W3C PROV-DM Provenance and TruthChain
Branch: feature/phase-45-provenance
TruthChain algorithm:
- Each evidence node has: SHA-256 ID, source_type, content_hash, timestamp, status
- Merkle tree over all evidence: root_hash changes if ANY evidence changes
- Temporal decay: weight(E,t) = base_weight * exp(-lambda_type * days)
- Status propagation: MODIFIED evidence propagates DEPENDS_ON_MODIFIED to descendants
- Aggregate confidence = active_weight / total_weight
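To illustrate the Merkle property above, here is a minimal sketch that builds a root over SHA-256 content hashes. Duplicating the last node on odd levels is one common convention, chosen here as an assumption. The roadmap does not specify the tree layout.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(content_hashes):
    """Return the hex Merkle root of a list of hex-encoded SHA-256 hashes."""
    if not content_hashes:
        return _sha256(b"").hex()                     # root of an empty evidence set
    level = [bytes.fromhex(h) for h in content_hashes]
    while len(level) > 1:
        if len(level) % 2:                            # odd count: duplicate last node
            level.append(level[-1])
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


evidence = [hashlib.sha256(t.encode()).hexdigest()
            for t in ("court order text", "audit para", "tender notice")]
root = merkle_root(evidence)

tampered = list(evidence)
tampered[1] = hashlib.sha256(b"audit para (edited)").hexdigest()
print(root != merkle_root(tampered))                  # True: any change alters the root
```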
Decay rates by source:
- court_order: 0.0001 (permanent)
- cag_audit: 0.0002
- government_portal: 0.0005
- news_article: 0.001
- social_media: 0.01
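The decay formula and aggregate confidence can be sketched as below, using the rates listed above. The status value "ACTIVE" and the dictionary field names are assumptions; the roadmap only names MODIFIED and DEPENDS_ON_MODIFIED.

```python
import math

# Per-day decay rates copied from the list above.
DECAY_RATES = {
    "court_order": 0.0001,
    "cag_audit": 0.0002,
    "government_portal": 0.0005,
    "news_article": 0.001,
    "social_media": 0.01,
}


def decayed_weight(base_weight, source_type, days):
    """weight(E, t) = base_weight * exp(-lambda_type * days)."""
    return base_weight * math.exp(-DECAY_RATES[source_type] * days)


def aggregate_confidence(evidence):
    """active_weight / total_weight, counting only items with status 'ACTIVE' as active."""
    total = active = 0.0
    for item in evidence:
        w = decayed_weight(item["base_weight"], item["source_type"], item["days"])
        total += w
        if item["status"] == "ACTIVE":
            active += w
    return active / total if total else 0.0


evidence = [
    {"base_weight": 1.0, "source_type": "court_order", "days": 365, "status": "ACTIVE"},
    {"base_weight": 1.0, "source_type": "social_media", "days": 365, "status": "MODIFIED"},
]
print(round(aggregate_confidence(evidence), 3))
```

With these rates a news article loses half its weight after about ln(2) / 0.001 ≈ 693 days, while a social media post does so after about 69 days.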
Export: JSON-LD using W3C PROV-DM ontology + Schema.org
Blockchain anchor: Merkle root stored in audit_chain.py (Bitcoin via OpenTimestamps)
Phase 46 -- Source Drift and Historical Record Analysis
Branch: feature/phase-46-source-drift
Wayback CDX API to detect when government records are silently modified. 7 fault types (ISWC 2024 taxonomy):
- node_disappearance: entity removed from portal
- edge_rewiring: director change silently backdated
- attribute_drift: contract amount modified post-publication
- cluster_split: formerly linked entities disconnected
- cluster_merge: separate networks joined
- temporal_burst: sudden new relationship creation
- isolation: previously connected entity becomes isolated
Anti-forensics detection: commit A -> commit B (change) -> commit C (reverts) = SUPPRESS_ATTEMPT
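The A -> B -> A pattern can be detected on a sequence of snapshot content hashes. This sketch assumes snapshots are ordered by capture time and only checks three consecutive snapshots. The function name is hypothetical.

```python
def find_suppress_attempts(content_hashes):
    """Return index triples (a, b, c) where snapshot c reverts to a after a change at b."""
    flags = []
    for i in range(len(content_hashes) - 2):
        a, b, c = content_hashes[i:i + 3]
        if a == c and a != b:          # changed, then silently restored
            flags.append((i, i + 1, i + 2))
    return flags


snapshots = ["h1", "h2", "h1", "h1"]   # hypothetical hashes of one portal page over time
print(find_suppress_attempts(snapshots))   # [(0, 1, 2)]
```

A revert spread over more than three snapshots (A -> B -> B' -> A) would need a wider window, for example by remembering every hash seen so far.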
Phase 47 -- Predictive Risk Trajectory
Branch: feature/phase-47-predictive
ARIMA(2,1,1) risk prediction:
- Fits on monthly risk score history (min 12 data points)
- Forecasts 6 months ahead with 80% confidence intervals
- Alert when predicted score crosses HIGH threshold
GCPAL contrastive pre-training for label scarcity:
- India's 1:707 confirmed-corruption ratio makes traditional supervised ML difficult
- GCPAL mines supervised signals from the unlabelled relationship graph
- Three augmented views: node feature dropout + edge dropout + KNN view
- NT-Xent contrastive loss (temperature = 0.07)
- Fine-tunes on confirmed cases from case_memory (min 5 needed)
Phase 48 -- Watchlist, Alerts, and ARIMA Prediction
Branch: feature/phase-48-watchlist
WebSocket push alerts when risk score changes for watched entities. YAML alert rules (same format as Phase 36). Webhook support for journalist notification systems.
Phase 49 -- Observability and Reliability
Branch: feature/phase-49-observability
Prometheus /metrics endpoint. Stale-data alerts when pipeline has not run in >7 days. Ingestion validator checks all 20 node types have recent data. /health upgraded to return per-source freshness status.
Phase 50 -- Security v2: RBAC and JWT
Branch: feature/phase-50-security-v2
Role-based access control: Lead Investigator, Contributor, Reviewer, Observer. JWT authentication with refresh tokens. DPDP Act compliance (India Data Protection). Entity-level access control for sensitive investigations.
Phase 51 -- Electoral Bond Causal Graph Engine
Branch: feature/phase-51-electoral-bond-causal
Critical missing feature. The data exists but the causal chain is not mapped.
Full graph path: Corporate donor -> ElectoralBond -> Party -> Ministry -> Policy -> Contract -> Company
Algorithm: Granger causality (from Phase 25) + Difference-in-Differences to establish whether policy changes statistically follow bond purchases.
New node type: PolicyChange (date, ministry, beneficiaries)
New relationship: FOLLOWED_BOND (lag_days, p_value, granger_f_stat)
New route: GET /electoral-bond/causal/{company_id}
Phase 52 -- Parliament Performance Analytics
Branch: feature/phase-52-parliament
New data sources: Lok Sabha division votes (loksabha.nic.in/Loksabha/Divisions), Rajya Sabha Q&A archive, Praja.org legislator data.
MP accountability score:
- Attendance rate (0.30 weight)
- Questions asked per session (0.25 weight)
- Vote consistency with party line vs independent votes (0.20 weight)
- Bills sponsored (0.15 weight)
- Starred questions with substantive follow-up (0.10 weight)
New route: GET /parliament/performance/{politician_id}
New node type: DivisionVote, ParliamentSession
New relationship: VOTED_IN, ASKED_STARRED_QUESTION
Phase 53 -- Media Ownership Graph
Branch: feature/phase-53-media-ownership
New data sources: MIB media license registry, TRAI spectrum allocations.
Graph paths:
- Channel -> Corporate parent -> Promoter -> Political donor
- Channel -> Editorial stance correlation (NLP) -> Political entity
Editorial bias detection: NLP sentiment analysis comparing coverage of political entities across channels with known ownership structures.
New node types: MediaChannel, SpectrumLicense, EditorialEntity
New route: GET /media/ownership/{channel_id}
Phase 54 -- Constituency Development Index
Branch: feature/phase-54-constituency
Data sources: NDAP district SDG scores, MGNREGS employment data, PM Kisan disbursements, PM Awas completions, Swachh Bharat ODF data.
Algorithm: Regression analysis -- does the constituency improve during the politician's tenure vs comparison period?
Pre-election spending surge detection: CUSUM on district spending in 90 days before election vs annual baseline.
New route: GET /constituency/{id}/development
Satellite verification: Sentinel-2 images corroborate claimed completions.
Phase 55 -- Family Dynasty and Nepotism Graph
Branch: feature/phase-55-dynasty
Data source: FAMILY_OF edges extracted from MyNeta affidavit declarations ("Spouse: X", "Dependent 1: Y"). Already partially available in existing data.
Dynasty depth score:
- Count of family members in government positions
- Count of family-controlled companies with government contracts
- Count of elections won by family members across generations
- Geographic concentration (same constituency or district)
New relationship: FAMILY_OF (role: spouse/child/sibling/parent)
New route: GET /dynasty/{politician_id}
Phase 56 -- RTI Intelligence Engine
Branch: feature/phase-56-rti
RTI auto-filer: System detects evidence gaps in any investigation and drafts the exact RTI application to fill them.
Gap detection algorithm:
- For each HIGH-risk finding: check if primary source data is available
- If data missing: identify the correct Public Information Officer
- Generate RTI draft citing the specific provisions (RTI Act 2005, Sections 6-8)
RTI outcome tracker: Index filed RTI applications from RTI Online portal. Map outcomes to graph: PIOs who deny information for high-risk entities = flag.
New route: GET /rti/draft/{entity_id} (generates RTI text)
New node type: RTIApplication, PublicInformationOfficer
Phase 57 -- A/B Algorithm Testing Framework (NEW)
Branch: feature/phase-57-ab-testing
Multi-armed bandit (Thompson Sampling) for algorithm selection:
- Each algorithm arm has Beta(alpha, beta) prior over performance
- alpha = times algorithm was "preferred" by human review
- beta = times algorithm was "not preferred"
- Select arm with highest sampled value at each request
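A minimal Thompson Sampling sketch using only the standard library. It starts every arm at Beta(1, 1), since a Beta distribution needs positive parameters; that uniform prior, the class name, and the arm names are assumptions.

```python
import random


class ThompsonSelector:
    """Pick an algorithm arm by sampling from each arm's Beta posterior."""

    def __init__(self, arms, seed=None):
        self._rng = random.Random(seed)
        # [alpha, beta] per arm, starting from a uniform Beta(1, 1) prior (assumption).
        self.stats = {arm: [1, 1] for arm in arms}

    def select(self):
        samples = {arm: self._rng.betavariate(a, b) for arm, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)          # arm with highest sampled value

    def record(self, arm, preferred):
        # alpha counts "preferred" reviews, beta counts "not preferred" reviews.
        self.stats[arm][0 if preferred else 1] += 1


selector = ThompsonSelector(["static_scorer", "ml_ensemble"], seed=7)
for _ in range(50):
    selector.record("ml_ensemble", preferred=True)
    selector.record("static_scorer", preferred=False)
picks = [selector.select() for _ in range(200)]
print(picks.count("ml_ensemble"))
```

Because arms are chosen by sampling rather than by their mean, a weaker arm still gets occasional traffic while its posterior is wide, which keeps exploration going without a separate epsilon parameter.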
Use case: When upgrading from static risk scorer -> ML ensemble -> NeuroSymbolic, verify the new algorithm actually improves outcomes.
New route: GET /admin/algorithm-performance
Phase 58 -- Real-Time Stream Processing (NEW)
Branch: feature/phase-58-streaming
Problem: Pipeline runs in batches. Breaking leads appear hours late.
Redis Streams (Kafka fallback) for real-time event ingestion. CUSUM online anomaly detection on the stream (no batch needed). Sliding window aggregation for real-time indicator updates.
Events processed in real-time:
- new_contract: immediate CUSUM check on contract value
- new_audit_report: check if any tracked entities are mentioned
- new_enforcement_action: update risk scores for named entities
- source_modification: detect when a scraped page changes
Phase 59 -- CorruptionDNA Fingerprint (NEW)
Branch: feature/phase-59-corruption-dna
Problem: Two entities in the same corruption network may have no direct graph edge -- different states, different directors, but identical patterns.
512-dim fingerprint = concatenation of:
- Node2Vec structural embedding (128d)
- TF-IDF document vector (128d)
- Benford's Law digit distribution (9d, padded to 16d)
- Temporal burst vector (64d)
- Linguistic fingerprint -- Burrows Delta (64d)
- Entity type one-hot (16d)
- Risk indicator vector (16d)
- CAG audit TF-IDF (64d)
- Institutional path vector (32d)
MinHash LSH for efficient similarity search (cosine > 0.82 = same network).
New route: GET /dna/{entity_id} and GET /dna/similar/{entity_id}
Phase 60 -- ElectionProximityBurst Detector (NEW)
Branch: feature/phase-60-election-burst
The only corruption detection algorithm that encodes the Indian electoral calendar as a statistical regression variable.
Algorithm:
- Load full Indian electoral calendar (Lok Sabha + 28 state assemblies)
- ARIMA(2,1,1) on monthly metric aggregates
- PELT changepoint detection on ARIMA residuals
- Match changepoints to election proximity (within 180 days)
- CUSUM control chart with k=0.5, h=5.0
- Granger causality: does election_proximity_days Granger-cause the metric?
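The CUSUM step can be sketched as a two-sided tabular chart. Interpreting k and h in units of the baseline standard deviation and resetting the sums after each alert are assumptions, as are the sample figures.

```python
def cusum_alerts(values, baseline_mean, baseline_std, k=0.5, h=5.0):
    """Two-sided tabular CUSUM on standardised values; returns indices that raise an alert."""
    s_hi = s_lo = 0.0
    alerts = []
    for i, x in enumerate(values):
        z = (x - baseline_mean) / baseline_std
        s_hi = max(0.0, s_hi + z - k)      # accumulates upward shifts
        s_lo = max(0.0, s_lo - z - k)      # accumulates downward shifts
        if s_hi > h or s_lo > h:
            alerts.append(i)
            s_hi = s_lo = 0.0              # restart after signalling
    return alerts


# Hypothetical monthly contract spend, with a surge starting at index 4.
spend = [100, 102, 98, 101, 130, 135, 140, 138]
print(cusum_alerts(spend, baseline_mean=100, baseline_std=10))   # [5, 7]
```

The slack k keeps ordinary month-to-month noise from accumulating, so an alert reflects a sustained shift rather than a single outlier.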
Output: burst_score (0-100), election_burst_flags, cusum_alerts, Granger p-value, interpretation in plain language.
Integrated as 16th investigator (temporal, weight 0.10)
Phase 61 -- BenamiGNN: Heterogeneous Graph Neural Network (NEW)
Branch: feature/phase-61-benami-gnn
Problem: 5-factor heuristic misses multi-hop benami: politician's cousin is director (not the politician), company has legitimate small contracts before being used for a large fraudulent one.
H-GNN architecture:
- 8 relation types: DIRECTOR_OF, WON_CONTRACT, SHARES_ADDRESS, RELATED_TO, AWARDED_BY, FAMILY_MEMBER_OF, APPEARS_IN_AUDIT, SANCTIONED_BY
- Layer 0: Per-type linear projection to d=64
- Layer 1: Relation-aware message passing
- Layer 2: Entity-type attention
- Layer 3: Classification head -> benami_score in [0,1]
Fallback: Always falls back to existing 5-factor heuristic when:
- PyTorch not installed
- Subgraph has < 5 nodes
- Model not trained yet
Training: Fine-tunes on confirmed benami cases from case_memory.
Phase 62 -- CartelDNA Sequential Mining (NEW)
Branch: feature/phase-62-cartel-dna
Problem: Current cartel detector checks single-tender award rotation. Temporal cartels rotate wins across months and across ministries to avoid statistical detection within any one ministry.
CartelDNA = PrefixSpan + HITS + DBSCAN:
- PrefixSpan on bid event sequences (company, category, month, rank)
- Detect alternating rank order patterns (length 2-6, min support 3)
- HITS on co-bidding network: authority = real winners, hub = fake competitors
- DBSCAN geographic clustering (epsilon = 50km, min_samples = 3)
- Cartel confidence = 0.35*pattern + 0.25*alternation + 0.20*geo + 0.20*HITS
New route: GET /cartel/dna/{entity_id}
Phase 63 -- SHAP and LIME Explainability Layer (NEW)
Branch: feature/phase-63-explainability
Problem: Risk scores are returned without any explanation. Journalists cannot publish "score: 67" without "why: politician_overlap drove +24 points."
SHAP TreeExplainer on the ML ensemble from Phase 19 upgrade:
- Feature contributions for each of the 5 indicators
- Counterfactual: "If contract_concentration were 0, score would be 43"
- Baseline score (expected value)
LIME locally linear approximation for non-tree models.
New fields added to all risk responses:
- shap_top_drivers: [{feature, shap_value, direction}]
- shap_counterfactual: plain-language minimum change to flip risk level
- shap_baseline: expected value before any features
New route: GET /risk/explain/{entity_id}
Phase 64 -- Cross-Language Entity Disambiguation (NEW)
Branch: feature/phase-64-cross-lingual
Problem: "Modi" / "modi" / "modii" appear in 22 scripts -- potentially stored as separate graph nodes. Cross-lingual entity linker maps all variants to a single canonical node using Wikidata Q-numbers.
XLM-RoBERTa zero-shot entity linking. Wikidata SPARQL for canonical Q-number lookup (existing scraper extended). Transliteration confidence score per script pair.
Phase 65 -- Knowledge Graph Completion (Missing Link Prediction) (NEW)
Branch: feature/phase-65-kg-completion
TransE link prediction: h + r = t in d-dimensional space. Missing edge score: ||h + r - t|| (lower = more probable edge).
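As a toy illustration of the scoring rule, the sketch below ranks candidate tail entities by ||h + r - t||. The L2 norm, the two-dimensional vectors, and the entity names are assumptions; trained embeddings would come from a TransE model.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; lower means the edge is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates):
    """Order candidate tail entities from most to least plausible."""
    return sorted(candidates, key=lambda name: transe_score(head, relation, candidates[name]))


politician = [1.0, 0.0]                     # hypothetical embedding
director_of = [0.0, 1.0]                    # hypothetical relation vector
companies = {"company-x": [1.0, 1.0], "company-y": [3.0, 1.0]}

print(rank_tails(politician, director_of, companies))   # ['company-x', 'company-y']
```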
Use cases:
- (Politician, DIRECTOR_OF, ?) -- suggest companies likely controlled
- (?, RELATED_TO, KnownShellCompany) -- find hidden associates
- (Company, WON_CONTRACT, ?) -- predict future contract awards
Output: List of probable missing edges with confidence scores, presented as "Suggested next investigation targets."
Phase 66 -- LAS-GNN Temporal TBML Detection (NEW)
Branch: feature/phase-66-las-gnn
Problem: Current TBML detector uses threshold rules. Temporal money laundering (pre-election scatter-gather, below-threshold smurfing) is invisible to structural analysis.
LAS-GNN: LSTM aggregator on directed transaction graphs. Learns sequential order of edges imposed by timestamps. Detects motifs: scatter-gather, fan-in/fan-out, layering, pre-election burst.
Indian-specific motifs:
- Pre-election scatter: funds split to many accounts < 6 months before election
- Post-contract layering: payment -> N shell companies -> reconsolidated
- Smurfing below threshold: many transactions < Rs 2 lakh (PMLA threshold)
- Circular director rotation: A appoints X -> X at B -> B pays A
Phase 67 -- NeuroSymbolic Risk Reasoning (NEW)
Branch: feature/phase-67-neurosymbolic
Fuses three reasoning modes into one coherent system:
Stage 1 -- DEDUCTIVE (Phase 36 YAML rules):
- Rules fire with certainty = 1.0 (logical certainty)
- CRITICAL rule match -> score forced >= 75
Stage 2 -- INDUCTIVE (Phase 19 ML ensemble + SHAP):
- GNN/ML soft score in [0,1]
- SHAP feature contributions
Stage 3 -- ABDUCTIVE (Phase 38 DeepSeek-R1):
- Chain-of-thought synthesis citing TruthChain evidence IDs
- 2 competing hypotheses with scores
Stage 4 -- Integration:
- final_score = 0.40*rule_certainty + 0.35*gnn_score + 0.25*r1_confidence
- Adversarial override: if adversarial engine finds contradicting evidence -> cap at PROBABLE
Phase 68 -- InstitutionMetapath2Vec Embeddings (NEW)
Branch: feature/phase-68-metapath
5 Indian-specific metapaths for structured random walks:
- politician_enrichment: Politician-DIRECTOR_OF-Company-WON_CONTRACT-Contract
- circular_enrichment: Politician-MEMBER_OF-Party-CONTROLS-Ministry-...-DIRECTOR_OF-Politician
- audit_flag_circular: Company-WON_CONTRACT-Contract-MENTIONED_IN-AuditReport-AUDITS-Ministry
- shell_address_cluster: Director-DIRECTOR_OF-Company-SHARES_ADDRESS-Company
- constituency_benefit: Politician-REPRESENTS-Constituency-LOCATED_IN-District-HAS_PROJECT-Contract
128-dim entity embeddings trained via Word2Vec skip-gram on guided walks. find_similar_by_metapath() finds entities with the same institutional role across different states -- invisible to structural graph analysis.
Phase 69 -- Geospatial Risk Clustering (NEW)
Branch: feature/phase-69-geospatial
Moran's I spatial autocorrelation on district-level risk scores. I > 0 = spatial corruption hotspots cluster together.
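For reference, global Moran's I can be computed directly from district scores and a spatial weight matrix. The sketch assumes a binary adjacency matrix (1 where two districts share a border), and the four-district example is hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I: (n / W) * sum_ij w_ij * d_i * d_j / sum_i d_i^2, d = deviation from mean."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Four districts in a row: 0-1, 1-2 and 2-3 are neighbours.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = [80, 75, 20, 15]     # high-risk districts sit next to each other
alternating = [80, 15, 75, 20]   # high and low risk alternate

print(round(morans_i(clustered, adjacency), 3))     # positive: clustering
print(round(morans_i(alternating, adjacency), 3))   # negative: dispersion
```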
LISA (Local Indicators of Spatial Association):
- High-High cluster: high-risk district surrounded by high-risk districts
- Low-High outlier: low-risk district in high-risk region (potential evasion)
- High-Low outlier: targeted corruption in otherwise clean district
Output: District-level choropleth with cluster classification.
New route: GET /geospatial/risk-clusters
Phase 70 -- Dynamic Knowledge Graph Anomaly Detection (NEW)
Branch: feature/phase-70-dynamic-kg
Continuously monitors graph for unexpected structural changes. 7 fault types (ISWC 2024): node_disappearance, edge_rewiring, attribute_drift, cluster_split, cluster_merge, temporal_burst, isolation.
Contextual anomaly detection: entity that was HIGH-risk 3 months ago is now suddenly LOW-risk = possible evidence suppression.
Phase 71 -- GCPAL Contrastive Pre-Training (NEW)
Branch: feature/phase-71-gcpal
Label scarcity problem: India has very few confirmed corruption cases relative to the total number of entities (estimated 1:707 ratio). Standard supervised ML cannot train on this imbalance.
GCPAL solution: NT-Xent contrastive loss on 3 augmented views:
- View 1: node feature dropout (20%)
- View 2: edge dropout (20%)
- View 3: KNN implicit interactions (k=5)
Pre-trains on unlabelled graph. Fine-tunes on case_memory confirmed cases.
Phase 72 -- Automated Source Credibility Scoring (NEW)
Branch: feature/phase-72-source-credibility
Bayesian credibility model per source:
- institutional_authority: government > NGO > news > social
- historical_accuracy: confirmed vs denied past claims
- methodology_transparency: does source explain collection method?
- timeliness: freshness decay
- cross_source_corroboration: independent corroboration count
Bayesian update after each confirmed/denied case.
Phase 73 -- Investigative RAG Over Case Memory (NEW)
Branch: feature/phase-73-rag-cases
RAG over all past investigation reports in case_memory. Query: "Past investigations involving electoral bonds and road contracts" -> Dense retrieval -> Top-k case summaries as context -> DeepSeek-R1 synthesizes commonalities and suggests strategy.
Phase 74 -- Continuous Model Drift Detection (NEW)
Branch: feature/phase-74-drift
Population Stability Index (PSI):
- PSI < 0.10: stable
- PSI 0.10-0.25: monitor closely
- PSI > 0.25: retrain required
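PSI over binned score distributions can be sketched as below, using the standard definition sum((actual - expected) * ln(actual / expected)) over bin proportions. The epsilon floor for empty bins and the function names are assumptions.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms with identical bins."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_p = max(e / e_total, eps)        # floor avoids log(0) on empty bins
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total


def drift_status(value):
    """Map a PSI value onto the three bands listed above."""
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "monitor"
    return "retrain"


baseline = [25, 25, 25, 25]                # hypothetical risk-score histogram at training time
print(drift_status(psi(baseline, [25, 25, 25, 25])))   # stable
print(drift_status(psi(baseline, [10, 20, 30, 40])))   # monitor
print(drift_status(psi(baseline, [5, 5, 10, 80])))     # retrain
```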
ADWIN (Adaptive Windowing): streaming concept drift detection. Auto-triggers GCPAL retraining job when drift detected.
Phase 75 -- Ethics and Bias Audit System (NEW)
Branch: feature/phase-75-ethics
Fairness metrics:
- Demographic parity: P(HIGH_RISK | party=A) ≈ P(HIGH_RISK | party=B)
- Equal opportunity: TPR equal across entity types
- Predictive parity: PPV equal across geographic regions
Bias detection: chi-squared test, disparate impact ratio, SHAP fairness. Mitigation: Reweighing, adversarial debiasing, calibration.
New route: GET /admin/bias-audit
BRANCH WORKFLOW
# Before each new phase:
git checkout main && git pull origin main
git checkout -b feature/phase-N-name
# After all commits:
git push origin feature/phase-N-name
# Open PR on GitHub -> merge -> pull main -> tag
# Tag every completed phase:
git tag -a vN.0.0 -m "Phase N: description"
git push origin vN.0.0
# Deploy to HuggingFace after every merge:
git push hf main --force
# Reseed after every deploy:
curl -X POST https://abinazebinoly-bharatgraph.hf.space/admin/seed
VERSION HISTORY
| Version | Phase | Key addition |
|---|---|---|
| v0.30.0 | 30 | Bug fix sprint -- 26 bugs resolved |
| v0.31.0 | 31 | Runtime profile auto-scaling |
| v0.32.0 | 32 | Entity resolution v2 (planned) |
| v0.33.0 | 33 | Custom graph engine (planned) |
| v0.40.0 | 40 | DeepSeek-V3 multilingual reports (planned) |
| v0.50.0 | 50 | Security v2 RBAC (planned) |
| v1.0.0 | 75 | Full production launch (planned) |