	# BharatGraph -- Complete Phase Roadmap

	All branches merge into `main`. Branch naming: `feature/phase-N-name` or `fix/description`.
	Each phase has a GitHub Issue (see `issues/` directory) and a PR description template.

	---

	## COMPLETED PHASES (1-31)

	### Phase 1 -- Data Collection
	Tag: pre-v1 \| 6 scrapers, 3,199+ records, base scraper with rate limiting and retry

	### Phase 2 -- Data Processing
	Tag: pre-v2 \| Indian name normalisation, Jaccard entity resolution, parallel pipeline

	### Phase 3 -- Graph Database
	Tag: pre-v3 \| Neo4j schema, 7 node types, stable MD5 IDs, 8 Cypher templates

	### Phase 4 -- FastAPI Backend
	Tag: v0.12.0 \| FastAPI + Pydantic + Neo4j dependency injection, source citations

	### Phase 5 -- Risk Scoring Engine
	5-indicator composite score, validate_language() forbidden-word enforcement

	### Phase 6 -- Expanded Data Sources (13 scrapers)
	ICIJ, Wikidata, OpenSanctions, Lok Sabha, SEBI, Electoral Bonds added

	### Phase 7 -- NLP Document Intelligence
	spaCy NER, Benford's Law chi-squared, multilingual BERT NER, shadow draft detector

	### Phase 8 -- Advanced Graph Analytics
	NetworkX betweenness/PageRank/Louvain, circular ownership, ghost company scorer

	### Phase 9 -- Eight New Indian Sources (21 total)
	NJDG, ED, CVC, NCRB, LGD, IBBI, NGO Darpan, CPPP added with fallback samples

	### Phase 10 -- Multi-Investigator AI Engine
	Tag: v0.10.0 \| 12 parallel investigators, SHA-256 report hash, synthesis engine

	### Phase 11 -- Multilingual Platform (22 Languages)
	All 22 Indian scheduled languages, auto-detection, Helsinki-NLP translation

	### Phase 12 -- PDF Dossier Generator
	Jinja2 + WeasyPrint, SHA-256 integrity hash, GET /export/pdf/{id}

	### Phase 13 -- Production Frontend
	Vanilla JS/HTML/CSS, D3.js force graph, 5 views, works offline from file://

	### Phase 14 -- Zero Cold-Start Deployment
	Tag: v0.14.0 \| HuggingFace Spaces Docker, service worker cache, GitHub Pages CI/CD

	### Phase 15 -- Mathematical Intelligence Engine
	Tag: v0.15.0 \| Spectral Fiedler value, Fourier FFT, 13th investigator (math)

	### Phase 16 -- Evidence Connection Map and Deep Investigation
	Tag: v0.16.0 \| 6-layer recursive investigation, connection mapper, WHY explanations

	### Phase 17 -- Security Hardening and Provenance Layer
	Tag: v0.17.0 \| Rate limiter, CSP/HSTS headers, input validator, SHA-256 audit log

	### Phase 18 -- Self-Learning System and Case Memory
	Schema learner, pattern learner, weight optimiser (±0.01 per 3 confirmed cases)

	### Phase 19 -- Affidavit Wealth Trajectory Engine
	Tag: v0.19.0 \| Kalman filter, 5-election series, 14th investigator (affidavit)

	### Phase 20 -- Biography Engine
	Chronological timeline, 5 temporal convergence window types, neutral narrative

	### Phase 21 -- Benami Entity Detection
	5-factor proxy score, thresholds HIGH>=65 MODERATE>=40, 15th investigator

	### Phase 22 -- Procurement DNA, Cartel Detection, Full Pipeline
	TF-IDF cosine >=0.72, award rotation, co-bidding network, 21 scrapers

	### Phase 23 -- Revolving Door and TBML Detection
	365-day cooling-off, pre-employment benefit, 2.5-sigma TBML, subcontract loops

	### Phase 24 -- Linguistic Fingerprinting
	Burrows Delta authorship, template reuse detection, ghost-writing detection

	### Phase 25 -- Policy-Benefit Causal Analysis
	Granger causality (lags 1-6), transfer entropy, CACA cross-ministry chain

	### Phase 26 -- Adversarial Counterevidence
	Forced disproof, competing hypotheses, uncertainty propagation

	### Phase 27 -- Multi-Agent Debate Engine
	7-agent 3-round debate, iMAD hesitation detection, minority dissent preserved

	### Phase 28 -- Dark Pattern Detection
	PrefixSpan sequential mining, 6 pre-defined high-risk sequences

	### Phase 29 -- UX Overhaul and i18n
	Evidence panel (4 tabs), D3 graph redesign, 22-language UI, timeline view

	### Phase 30 -- Bug Fix Sprint
	Tag: v0.30.0 \| 26 bugs resolved including BUG-1 (search crash), BUG-2 (7 missing loaders)

	### Phase 31 -- Runtime Profile and Auto-Scaling
	Tag: v0.31.0 \| Hardware detector, LOW/MEDIUM/HIGH profiles, GET /runtime endpoint
	Branch: `feature/phase-31-runtime-profile`
	Files: config/runtime_profile.py, config/model_selector.py, api/routes/runtime.py
	Tests: 15 unit tests in tests/test_runtime_profile.py
	Profile assignment: score = `cpu*2 + ram*2 + gpu*2 + disk + docker + db_local` (max 9)

	---

	## PLANNED PHASES

	---

	### Phase 32 -- Entity Resolution v2: Canonical Identity Engine
	Branch: `feature/phase-32-entity-resolution`
	Priority: CRITICAL -- fixes broken evidence chains across all phases

	Problem: Jaccard token similarity misses transliteration variants, honorific
	variations ("Sh. Ram Kumar" vs "Shri Ramkumar"), and cross-script name forms.
	When the same person is stored under 3+ IDs, evidence chains break.

	Algorithms:
	- Jaro-Winkler (weight 0.30) -- character-level typo and transliteration
	- Jaccard token overlap (weight 0.20) -- word-order variations
	- Sentence-transformers cosine (weight 0.35) -- multilingual name variants
	- Exact PAN/CIN/GSTIN match (weight 1.0, overrides all) -- deterministic keys
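
The weights above could be combined into a single match score roughly as follows. This is a minimal sketch, not the planned `CanonicalIdentityEngine` API: the function name `combined_match_score`, the assumption that each similarity is already computed and lies in [0, 1], and the renormalisation by the weight sum (the three fuzzy weights total 0.85) are all illustrative choices.

```python
# Hypothetical sketch: only the weights come from the roadmap; the rest is assumed.
FUZZY_WEIGHTS = {"jaro_winkler": 0.30, "jaccard": 0.20, "embedding_cosine": 0.35}


def combined_match_score(similarities: dict, key_match: bool = False) -> float:
    """Return a match score in [0, 1] for a pair of entity records.

    `similarities` maps each fuzzy signal name to a similarity in [0, 1].
    `key_match` is True when a PAN/CIN/GSTIN is present on both records and equal.
    """
    if key_match:
        # Deterministic identifiers override every fuzzy signal.
        return 1.0
    total_weight = sum(FUZZY_WEIGHTS.values())
    weighted = sum(
        weight * similarities.get(name, 0.0) for name, weight in FUZZY_WEIGHTS.items()
    )
    # Renormalise (an assumption) so that three perfect fuzzy scores yield 1.0.
    return weighted / total_weight
```

A missing signal is treated as 0.0 here, which biases towards not merging; whether that is the right default depends on how costly a false merge is for an investigation.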

	New files:
	- `processing/entity_resolver_v2.py` -- CanonicalIdentityEngine class
	- `processing/canonical_id.py` -- stable SHA-256 ID generation functions
	- `processing/alias_graph.py` -- AliasGraph: alias_name -> canonical_id lookup

	Indian name normalisation added:
	- Remove honorifics: Sh., Smt., Dr., Late, Sri, Shri, Er., Adv., Col.
	- Normalise suffixes: Private Limited -> Pvt Ltd, LLP, Ltd
	- Script-aware: Devanagari -> Latin transliteration for comparison
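
The honorific and suffix rules above might look like this in code. It is a sketch under stated assumptions: `normalise_name` is a hypothetical helper, only the `Private Limited -> Pvt Ltd` mapping is stated in the roadmap (the `Limited -> Ltd` rule is assumed), and the punctuation stripping only suits Latin-script input, so Devanagari text would need transliteration first.

```python
import re

# Honorifics listed in the roadmap, compared as lower-case tokens without dots.
HONORIFICS = {"sh", "smt", "dr", "late", "sri", "shri", "er", "adv", "col"}

# Assumed suffix rules; applied in order so "private limited" wins over "limited".
SUFFIX_RULES = [
    (re.compile(r"\bprivate\s+limited\b"), "pvt ltd"),
    (re.compile(r"\blimited\b"), "ltd"),
]


def normalise_name(raw: str) -> str:
    """Lower-case, drop punctuation and honorifics, and normalise company suffixes."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = [tok for tok in text.split() if tok not in HONORIFICS]
    text = " ".join(tokens)
    for pattern, replacement in SUFFIX_RULES:
        text = pattern.sub(replacement, text)
    return text
```

After normalisation, "Sh. Ram Kumar" and "Shri Ramkumar" still differ by a space ("ram kumar" vs "ramkumar"), which is the gap the character-level Jaro-Winkler signal is meant to close.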

	Integration: pipeline.py resolve_dataset() upgraded to use v2 engine

	---

	### Phase 33 -- Custom Graph Engine: Eliminate Neo4j 50K Limit
	Branch: `feature/phase-33-custom-graph-engine`
	Priority: HIGH -- AuraDB free tier caps at 50K nodes / 175K relationships

	Architecture:
	```
	graph_engine/
	+-- store.py -- LevelDB key-value backing store
	+-- hnsw.py -- HNSW vector index (M=16, ef=200)
	+-- query_planner.py -- Cypher-to-native query translator
	+-- temporal.py -- Time-weighted edge decay by relationship type
	+-- version_control.py -- Git-style diff log for graph mutations
	+-- compat_layer.py -- Translates all existing Cypher to native calls
	```

	Temporal edge decay lambdas:
	- court_order: 0.00005 (slowest -- court records are permanent)
	- cag_audit: 0.0002
	- government_portal: 0.0005
	- director_of: 0.0003
	- member_of: 0.0005
	- news_article: 0.001
	- social_media: 0.01 (fastest decay)

	Version control: Every graph mutation is recorded as a diff with before/after
	hashes. Detects when government portals silently modify records post-publication.
	Anti-forensics pattern: commit A -> commit B (change) -> commit C (reverts to A) = flag
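
The A -> B -> A pattern can be detected from the sequence of content hashes alone. The sketch below assumes each mutation is stored with a SHA-256 hash of the record content; `content_hash` and `find_reverts` are hypothetical names, not functions from `version_control.py`.

```python
import hashlib


def content_hash(record: str) -> str:
    """SHA-256 hex digest of a record's serialised content."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def find_reverts(hashes: list) -> list:
    """Return indices i where version i restores the hash of version i-2
    after an intervening change (commit A -> commit B -> commit A)."""
    flagged = []
    for i in range(2, len(hashes)):
        if hashes[i] == hashes[i - 2] and hashes[i] != hashes[i - 1]:
            flagged.append(i)
    return flagged
```

This only catches a revert across exactly one intervening change; a longer detour (A -> B -> C -> A) would need a lookup of all earlier hashes rather than a fixed window.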

	---

	### Phase 34 -- Vector Search and Hybrid Retrieval
	Branch: `feature/phase-34-vector-search`

	Problem: Keyword search misses semantically similar documents. Searching
	"Maharashtra road contract irregularity" does not find CAG reports about
	"highway construction irregularity in Pune" even though they are the same topic.

	Algorithms:
	- FAISS (cpu) or Qdrant for vector index
	- BM25 for keyword ranking
	- Reciprocal Rank Fusion (k=60): RRF = sum(1 / (60 + rank))
	- Query classifier routes to appropriate retrieval strategy
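
Reciprocal Rank Fusion with k=60 can be illustrated in a few lines. This sketch assumes each retriever returns an ordered list of document IDs (best first); the function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Returns (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, fusing a BM25 list `["d1", "d2", "d3"]` with a vector list `["d3", "d1", "d4"]` ranks `d1` first, because it sits near the top of both lists, followed by `d3`, `d2`, and `d4`.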

	Query routing:
	\| Query type \| Keywords \| Retrieval mix \|
	\|-----------\|---------\|--------------\|
	\| factual \| who is, what is, when did \| BM25 70% + vector 30% \|
	\| relational \| connected to, path from \| Graph 80% + vector 20% \|
	\| temporal \| before, after, election, contract date \| Graph 60% + BM25 40% \|
	\| exploratory \| similar to, pattern, cluster \| Vector 60% + community 40% \|

	Embedding model: paraphrase-multilingual-MiniLM-L12-v2 (covers all 22 languages)

	---

	### Phase 35 -- Plugin System and YAML Enrichers
	Branch: `feature/phase-35-plugins`

	Lazy-loading plugin architecture -- new data sources added by dropping
	a YAML file in `enrichers/` with no code changes.

	Plugin registry also covers algorithms -- new detection algorithms
	registered as plugins, enabling Phase 57 A/B testing.

	---

	### Phase 36 -- Sigma-Style YAML Rule Engine
	Branch: `feature/phase-36-rule-engine`

	Problem: Adding a new detection rule requires writing Python + Cypher.
	Non-developer investigators cannot contribute detection logic.

	YAML -> Cypher compiler -- a rule file specifies conditions, thresholds,
	and actions. The engine compiles it to Cypher at startup.

	10 built-in rules shipped:
	1. `cartel_rotation.yaml` -- same vendor group rotates wins
	2. `electoral_bond_proximity.yaml` -- bond + contract within 12 months (CRITICAL)
	3. `family_directorship_web.yaml` -- politician's family = company director
	4. `audit_contract_overlap.yaml` -- continued contracts after CAG audit flag
	5. `shell_company_age_contract.yaml` -- company < 6 months old + large contract
	6. `single_bidder_high_value.yaml` -- single bid above district average
	7. `circular_ownership_3node.yaml` -- 3-node corporate ownership cycle
	8. `revolving_door_365day.yaml` -- government to private within 1 year
	9. `address_cluster_directors.yaml` -- 3+ companies same registered address
	10. `pre_election_contract_surge.yaml` -- contract spend spike 90 days before poll

	---

	### Phase 37 -- Job Queue and Worker Pool
	Branch: `feature/phase-37-job-queue`

	Redis-backed job queue with state machine: INIT -> QUEUED -> RUNNING -> DONE

	Algorithm job priorities:
	- Priority 1 (immediate): entity_resolution, neurosymbolic_risk, rule_engine
	- Priority 2 (30s): gnn_tbml, election_burst, shap_explanation, graphrag_summary
	- Priority 3 (5min): corruption_dna, metapath_walk, community_detection, topic_modeling
	- Priority 4 (off-peak): fingerprint_index, gcpal_pretraining, wayback_drift
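
The state machine and priority ordering can be sketched with an in-memory heap standing in for Redis. Everything here is illustrative: the `JobQueue` class, its method names, and the FIFO tie-break within a priority level are assumptions, not the planned worker-pool API.

```python
import heapq
import itertools
from enum import Enum


class JobState(Enum):
    INIT = "INIT"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    DONE = "DONE"


# Only the forward transitions of the roadmap's state machine are allowed.
NEXT_STATE = {
    JobState.INIT: JobState.QUEUED,
    JobState.QUEUED: JobState.RUNNING,
    JobState.RUNNING: JobState.DONE,
}


class JobQueue:
    """In-memory stand-in for the Redis-backed queue (lower number = higher priority)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority
        self.states = {}

    def _advance(self, job_id):
        self.states[job_id] = NEXT_STATE[self.states[job_id]]

    def submit(self, job_id, priority):
        self.states[job_id] = JobState.INIT
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))
        self._advance(job_id)  # INIT -> QUEUED

    def start_next(self):
        _, _, job_id = heapq.heappop(self._heap)
        self._advance(job_id)  # QUEUED -> RUNNING
        return job_id

    def finish(self, job_id):
        if self.states[job_id] is not JobState.RUNNING:
            raise ValueError(f"{job_id} is not running")
        self._advance(job_id)  # RUNNING -> DONE
```

Submitting `wayback_drift` at priority 4 and then `rule_engine` at priority 1 results in `rule_engine` being started first, regardless of submission order.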

	---

	### Phase 38 -- DeepSeek-R1 Chain-of-Thought Reasoning
	Branch: `feature/phase-38-deepseek-r1`

	Problem: Current synthesis logic (3+ investigators agreeing = HIGH) is a
	vote count, not reasoning. No audit trail of how a conclusion was reached.

	DeepSeek-R1 integration:
	- Receives: graph findings + SHAP explanations + TruthChain evidence IDs
	- Generates: step-by-step reasoning chain citing specific evidence node IDs
	- Produces: 2 competing hypotheses with scores, then a final verdict
	- Verdict levels: CONFIRMED (>=80), PROBABLE (>=50), WEAK (>=20), INSUFFICIENT

	Anti-hallucination enforcement:
	- Every R1 claim must cite a TruthChain node_id (format: [EVIDENCE-XXXX])
	- Post-generation validation: regex check for invented node IDs
	- Invalid citations are stripped before the report is returned
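
The post-generation check could be implemented as a regex pass over the model output. This is a sketch under assumptions: the ID part of `[EVIDENCE-XXXX]` is taken to be alphanumeric, `known_ids` stands for the set of node IDs actually present in TruthChain, and "stripped" is interpreted as removing the citation marker (not the surrounding claim).

```python
import re

# Assumed ID alphabet; the roadmap only gives the shape [EVIDENCE-XXXX].
CITATION = re.compile(r"\[EVIDENCE-([A-Za-z0-9]+)\]")


def strip_invalid_citations(text: str, known_ids: set) -> tuple:
    """Remove citations whose node ID is not in `known_ids`.

    Returns (cleaned_text, invalid_ids) so the caller can log what was removed.
    """
    invalid = []

    def _check(match):
        node_id = match.group(1)
        if node_id in known_ids:
            return match.group(0)  # keep valid citations untouched
        invalid.append(node_id)
        return ""

    cleaned = CITATION.sub(_check, text)
    # Collapse the double spaces left behind by removed markers.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return cleaned, invalid
```

A stricter variant might drop the whole sentence when its only citation is invalid, since an uncited claim would otherwise survive into the report.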

	Fallback: When DeepSeek API is unavailable, the existing multi-investigator
	synthesis provides the output. R1 augments -- it does not replace.

	---

	### Phase 38B -- GraphRAG: Graph-Guided LLM Retrieval (NEW)
	Branch: `feature/phase-38b-graphrag`

	Problem: R1 cannot answer global questions like "What are the main corruption
	themes across all 5,000 CAG audit reports?" Standard RAG retrieves isolated chunks.

	GraphRAG approach:
	1. Run Leiden clustering over all scraped documents and graph nodes
	2. For each community > 3 nodes, R1 generates a community summary
	3. At query time: embed query -> retrieve top-k community summaries by cosine
	4. Feed summaries + relevant subgraph as structured context to R1

	New files:
	- `ai/graphrag/community_indexer.py` -- builds community summaries offline
	- `ai/graphrag/graphrag_retriever.py` -- query-time retrieval

	Integration with Phase 38: R1 receives GraphRAG community summaries instead
	of raw graph fragments, which is intended to reduce hallucination.

	---

	### Phase 39 -- DeepSeek-VL2 Visual Evidence Analysis
	Branch: `feature/phase-39-deepseek-vl2`

	Analyse scanned affidavit PDFs, audit report images, and newspaper clippings.
	Signature mismatch detection. Document image authenticity via Shannon entropy.
	OCR pipeline for non-digital government documents.

	---

	### Phase 40 -- DeepSeek-V3 Multilingual Dossier Generation
	Branch: `feature/phase-40-deepseek-v3`

	Generate full investigation reports in all 22 Indian languages.
	CONFIRMED/PROBABLE/WEAK/INSUFFICIENT grading on every finding.
	Length: 800-1200 words per report. Export to PDF with trilingual header.

	---

	### Phase 41 -- Legal Intelligence Pipeline
	Branch: `feature/phase-41-legal`

	IPC Section Classifier:
	- Algorithm: TF-IDF + OneVsRestClassifier(LogisticRegression) -- multi-label
	- 8 corruption-relevant sections: IPC 420, 409, 120B, 467, 468, 471 and Prevention of Corruption Act 7, 13
	- Keyword fallback when model not trained

	Crime triple extractor:
	- Pattern: Subject -> Action -> Object from legal text
	- Store as directed evidence edges: (Company)-[:BRIBED]->(Official)

	Semantic Role Labelling (SRL):
	- ARG0 (agent) -> entity who acted
	- ARG2 (recipient) -> entity who benefited
	- V (predicate) -> action type: BRIBED, APPROVED, AWARDED

	BK-tree for out-of-vocabulary legal term repair.

	---

	### Phase 42 -- Forensic Content Intelligence
	Branch: `feature/phase-42-forensic-content`

	Shannon entropy classifier:
	\| Document type \| Expected range \|
	\|--------------\|----------------\|
	\| government_order \| 3.8 -- 5.2 bits \|
	\| cag_report \| 4.0 -- 5.4 bits \|
	\| tender_document \| 3.5 -- 5.0 bits \|
	\| court_order \| 3.9 -- 5.3 bits \|

	Documents outside expected range flagged as SUSPICIOUS or LIKELY_FABRICATED.
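
A range check of this kind could be sketched as below. The assumptions are that entropy is measured per character in bits and that the table gives inclusive bounds; `shannon_entropy` and `within_expected_range` are hypothetical names, and the split between SUSPICIOUS and LIKELY_FABRICATED is not modelled because the roadmap does not give those cut-offs.

```python
import math
from collections import Counter

# Expected ranges (bits per character) copied from the table above.
EXPECTED_RANGE = {
    "government_order": (3.8, 5.2),
    "cag_report": (4.0, 5.4),
    "tender_document": (3.5, 5.0),
    "court_order": (3.9, 5.3),
}


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of `text` in bits."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def within_expected_range(text: str, doc_type: str) -> bool:
    """True when the text's entropy falls inside the range for its document type."""
    low, high = EXPECTED_RANGE[doc_type]
    return low <= shannon_entropy(text) <= high
```

A string of one repeated character has entropy 0 and four equally frequent characters give exactly 2 bits, so both fall well below every range in the table.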

	Perceptual hash (pHash) for image-based document copy detection.
	PAN/CIN/Aadhaar regex extraction from document text.
	Lexical diversity score -- repetitive templates have diversity < 0.3.

	---

	### Phase 43 -- Pivot Recommendation Engine
	Branch: `feature/phase-43-pivot`

	Problem: After finding a suspicious entity, the next best investigation
	target is unclear. The pivot engine scores all connected entities.

	6-factor scoring:
	\| Factor \| Weight \| Description \|
	\|--------\|--------\|-------------\|
	\| pagerank \| 0.20 \| How central is this entity? \|
	\| evidence_gap \| 0.25 \| How much do we NOT know? \|
	\| risk_signals \| 0.20 \| log(risk_signals + 1) \|
	\| connection_strength \| 0.15 \| Edge weight to current entity \|
	\| temporal_recency \| 0.10 \| Recently active? \|
	\| unexplored_depth \| 0.10 \| Unexplored 2-hop nodes \|
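
The six factors combine into one score per candidate. In this sketch every factor except `risk_signals` is assumed to be pre-normalised to [0, 1], `risk_signals` is a raw count passed through log(x + 1) as the table states, and `pivot_score` / `rank_pivots` are hypothetical helper names.

```python
import math

# Weights copied from the table above; they sum to 1.0.
PIVOT_WEIGHTS = {
    "pagerank": 0.20,
    "evidence_gap": 0.25,
    "risk_signals": 0.20,
    "connection_strength": 0.15,
    "temporal_recency": 0.10,
    "unexplored_depth": 0.10,
}


def pivot_score(features: dict) -> float:
    """Weighted sum of the six pivot factors for one candidate entity."""
    values = dict(features)
    # risk_signals is a raw count; the table specifies log(risk_signals + 1).
    values["risk_signals"] = math.log1p(values.get("risk_signals", 0))
    return sum(w * values.get(name, 0.0) for name, w in PIVOT_WEIGHTS.items())


def rank_pivots(candidates: dict, already_investigated: set) -> list:
    """Return candidate IDs sorted by descending score, skipping investigated ones."""
    remaining = {cid: f for cid, f in candidates.items() if cid not in already_investigated}
    return sorted(remaining, key=lambda cid: pivot_score(remaining[cid]), reverse=True)
```

Because log(x + 1) is unbounded, an entity with many risk signals can outweigh the other factors; capping or rescaling that term would be a reasonable refinement.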

	Route: `GET /pivot/{entity_id}?already_investigated=id1,id2`

	---

	### Phase 44 -- Geospatial Verification via Satellite
	Branch: `feature/phase-44-satellite`

	Sentinel-2 L2A time series for project verification.
	NDVI change detection for forest diversion claims.
	NDBI (built-up index) for construction completion verification.
	SAR (Sentinel-1) for flood infrastructure claims.
	Compare contract completion claims vs satellite-observable progress.

	---

	### Phase 45 -- W3C PROV-DM Provenance and TruthChain
	Branch: `feature/phase-45-provenance`

	TruthChain algorithm:
	- Each evidence node has: SHA-256 ID, source_type, content_hash, timestamp, status
	- Merkle tree over all evidence: root_hash changes if ANY evidence changes
	- Temporal decay: weight(E,t) = base_weight * exp(-lambda_type * days)
	- Status propagation: MODIFIED evidence propagates DEPENDS_ON_MODIFIED to descendants
	- Aggregate confidence = active_weight / total_weight

	Decay rates by source:
	- court_order: 0.0001 (permanent)
	- cag_audit: 0.0002
	- government_portal: 0.0005
	- news_article: 0.001
	- social_media: 0.01
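
The decay formula and aggregate confidence can be made concrete as follows. The rates are the ones listed above; the `ACTIVE` status label and the dictionary layout of an evidence item are assumptions made for illustration.

```python
import math

# Per-day decay rates copied from the list above.
DECAY_RATE = {
    "court_order": 0.0001,
    "cag_audit": 0.0002,
    "government_portal": 0.0005,
    "news_article": 0.001,
    "social_media": 0.01,
}


def evidence_weight(base_weight: float, source_type: str, age_days: float) -> float:
    """weight(E, t) = base_weight * exp(-lambda_type * days)."""
    return base_weight * math.exp(-DECAY_RATE[source_type] * age_days)


def aggregate_confidence(evidence: list) -> float:
    """active_weight / total_weight over a list of evidence dicts.

    Each dict needs base_weight, source_type, age_days, and status; only items
    whose status is "ACTIVE" (an assumed label) count towards the numerator.
    """
    total = active = 0.0
    for item in evidence:
        w = evidence_weight(item["base_weight"], item["source_type"], item["age_days"])
        total += w
        if item["status"] == "ACTIVE":
            active += w
    return active / total if total else 0.0
```

With these rates the half-life is ln(2) / lambda: roughly 69 days for `social_media` and about 19 years for `court_order`.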

	Export: JSON-LD using W3C PROV-DM ontology + Schema.org
	Blockchain anchor: Merkle root stored in audit_chain.py (Bitcoin via OpenTimestamps)
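
A Merkle root over the evidence content hashes could be computed like this. It is a sketch, not the contents of `audit_chain.py`: the pairing of hex strings and the duplication of the last node on odd-sized levels are assumed conventions.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(content_hashes: list) -> str:
    """Fold a list of hex content hashes into a single root hash.

    Any change to any leaf changes the root. A single leaf is its own root,
    and an empty list maps to the hash of the empty byte string.
    """
    if not content_hashes:
        return sha256_hex(b"")
    level = list(content_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node (assumed convention)
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Only the root needs to be anchored externally; individual evidence nodes can later be verified against it with a logarithmic-size proof path.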

	---

	### Phase 46 -- Source Drift and Historical Record Analysis
	Branch: `feature/phase-46-source-drift`

	Wayback CDX API to detect when government records are silently modified.
	7 fault types (ISWC 2024 taxonomy):
	- node_disappearance: entity removed from portal
	- edge_rewiring: director change silently backdated
	- attribute_drift: contract amount modified post-publication
	- cluster_split: formerly linked entities disconnected
	- cluster_merge: separate networks joined
	- temporal_burst: sudden new relationship creation
	- isolation: previously connected entity becomes isolated

	Anti-forensics detection: commit A -> commit B (change) -> commit C (reverts) = SUPPRESS_ATTEMPT

	---

	### Phase 47 -- Predictive Risk Trajectory
	Branch: `feature/phase-47-predictive`

	ARIMA(2,1,1) risk prediction:
	- Fits on monthly risk score history (min 12 data points)
	- Forecasts 6 months ahead with 80% confidence intervals
	- Alert when predicted score crosses HIGH threshold

	GCPAL contrastive pre-training for label scarcity:
	- India's 1:707 confirmed-corruption ratio makes traditional supervised ML difficult
	- GCPAL mines supervised signals from the unlabelled relationship graph
	- Three augmented views: node feature dropout + edge dropout + KNN view
	- NT-Xent contrastive loss (temperature = 0.07)
	- Fine-tunes on confirmed cases from case_memory (min 5 needed)

	---

	### Phase 48 -- Watchlist, Alerts, and ARIMA Prediction
	Branch: `feature/phase-48-watchlist`

	WebSocket push alerts when risk score changes for watched entities.
	YAML alert rules (same format as Phase 36).
	Webhook support for journalist notification systems.

	---

	### Phase 49 -- Observability and Reliability
	Branch: `feature/phase-49-observability`

	Prometheus /metrics endpoint.
	Stale-data alerts when pipeline has not run in >7 days.
	Ingestion validator checks all 20 node types have recent data.
	/health upgraded to return per-source freshness status.

	---

	### Phase 50 -- Security v2: RBAC and JWT
	Branch: `feature/phase-50-security-v2`

	Role-based access control: Lead Investigator, Contributor, Reviewer, Observer.
	JWT authentication with refresh tokens.
	DPDP Act compliance (India's Digital Personal Data Protection Act, 2023).
	Entity-level access control for sensitive investigations.

	---

	### Phase 51 -- Electoral Bond Causal Graph Engine
	Branch: `feature/phase-51-electoral-bond-causal`

	Critical missing feature. The data exists but the causal chain is not mapped.

	Full graph path:
	Corporate donor -> ElectoralBond -> Party -> Ministry -> Policy -> Contract -> Company

	Algorithm: Granger causality (from Phase 25) + Difference-in-Differences
	to establish whether policy changes statistically follow bond purchases.

	New node type: PolicyChange (date, ministry, beneficiaries)
	New relationship: FOLLOWED_BOND (lag_days, p_value, granger_f_stat)

	New route: `GET /electoral-bond/causal/{company_id}`

	---

	### Phase 52 -- Parliament Performance Analytics
	Branch: `feature/phase-52-parliament`

	New data sources: Lok Sabha division votes (loksabha.nic.in/Loksabha/Divisions),
	Rajya Sabha Q&A archive, Praja.org legislator data.

	MP accountability score:
	- Attendance rate (0.30 weight)
	- Questions asked per session (0.25 weight)
	- Vote consistency with party line vs independent votes (0.20 weight)
	- Bills sponsored (0.15 weight)
	- Starred questions with substantive follow-up (0.10 weight)

	New route: `GET /parliament/performance/{politician_id}`
	New node types: DivisionVote, ParliamentSession
	New relationships: VOTED_IN, ASKED_STARRED_QUESTION

	---

	### Phase 53 -- Media Ownership Graph
	Branch: `feature/phase-53-media-ownership`

	New data sources: MIB media license registry, TRAI spectrum allocations.

	Graph paths:
	- Channel -> Corporate parent -> Promoter -> Political donor
	- Channel -> Editorial stance correlation (NLP) -> Political entity

	Editorial bias detection: NLP sentiment analysis comparing coverage of
	political entities across channels with known ownership structures.

	New node types: MediaChannel, SpectrumLicense, EditorialEntity
	New route: `GET /media/ownership/{channel_id}`

	---

	### Phase 54 -- Constituency Development Index
	Branch: `feature/phase-54-constituency`

	Data sources: NDAP district SDG scores, MGNREGS employment data,
	PM Kisan disbursements, PM Awas completions, Swachh Bharat ODF data.

	Algorithm: Regression analysis -- does the constituency improve during
	the politician's tenure vs comparison period?

	Pre-election spending surge detection: CUSUM on district spending in
	90 days before election vs annual baseline.

	New route: `GET /constituency/{id}/development`
	Satellite verification: Sentinel-2 images corroborate claimed completions.

	---

	### Phase 55 -- Family Dynasty and Nepotism Graph
	Branch: `feature/phase-55-dynasty`

	Data source: FAMILY_OF edges extracted from MyNeta affidavit declarations
	("Spouse: X", "Dependent 1: Y"). Already partially available in existing data.

	Dynasty depth score:
	- Count of family members in government positions
	- Count of family-controlled companies with government contracts
	- Count of elections won by family members across generations
	- Geographic concentration (same constituency or district)

	New relationship: FAMILY_OF (role: spouse/child/sibling/parent)
	New route: `GET /dynasty/{politician_id}`

	---

	### Phase 56 -- RTI Intelligence Engine
	Branch: `feature/phase-56-rti`

	RTI auto-filer: System detects evidence gaps in any investigation and
	drafts the exact RTI application to fill them.

	Gap detection algorithm:
	- For each HIGH-risk finding: check if primary source data is available
	- If data missing: identify the correct Public Information Officer
	- Generate RTI draft citing the specific provisions (RTI Act 2005, Sections 6-8)

	RTI outcome tracker: Index filed RTI applications from RTI Online portal.
	Map outcomes to graph: flag PIOs who deny information about high-risk entities.

	New route: `GET /rti/draft/{entity_id}` (generates RTI text)
	New node types: RTIApplication, PublicInformationOfficer

	---

	### Phase 57 -- A/B Algorithm Testing Framework (NEW)
	Branch: `feature/phase-57-ab-testing`

	Multi-armed bandit (Thompson Sampling) for algorithm selection:
	- Each algorithm arm has Beta(alpha, beta) prior over performance
	- alpha = times algorithm was "preferred" by human review
	- beta = times algorithm was "not preferred"
	- Select arm with highest sampled value at each request
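
Thompson Sampling over Beta posteriors needs only the standard library. In this sketch the `ThompsonSelector` class and its method names are hypothetical, and a uniform Beta(1, 1) prior is assumed by adding 1 to both counts so that an arm with no feedback can still be sampled.

```python
import random


class ThompsonSelector:
    """Pick an algorithm arm by sampling each arm's Beta posterior."""

    def __init__(self, arms, seed=None):
        self._rng = random.Random(seed)
        # alpha counts "preferred" reviews, beta counts "not preferred" reviews.
        self.stats = {arm: {"alpha": 0, "beta": 0} for arm in arms}

    def select(self) -> str:
        samples = {
            arm: self._rng.betavariate(s["alpha"] + 1, s["beta"] + 1)  # +1 = uniform prior
            for arm, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def record(self, arm: str, preferred: bool) -> None:
        self.stats[arm]["alpha" if preferred else "beta"] += 1
```

Early on, every arm is chosen often because the posteriors are wide; as reviews accumulate, traffic shifts towards the arm humans prefer while weaker arms still get occasional exploratory requests.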

	Use case: When upgrading from static risk scorer -> ML ensemble ->
	NeuroSymbolic, verify the new algorithm actually improves outcomes.

	New route: `GET /admin/algorithm-performance`

	---

	### Phase 58 -- Real-Time Stream Processing (NEW)
	Branch: `feature/phase-58-streaming`

	Problem: Pipeline runs in batches. Breaking leads appear hours late.

	Redis Streams (Kafka fallback) for real-time event ingestion.
	CUSUM online anomaly detection on the stream (no batch needed).
	Sliding window aggregation for real-time indicator updates.

	Events processed in real-time:
	- new_contract: immediate CUSUM check on contract value
	- new_audit_report: check if any tracked entities are mentioned
	- new_enforcement_action: update risk scores for named entities
	- source_modification: detect when a scraped page changes

	---

	### Phase 59 -- CorruptionDNA Fingerprint (NEW)
	Branch: `feature/phase-59-corruption-dna`

	Problem: Two entities in the same corruption network may have no direct
	graph edge -- different states, different directors, but identical patterns.

	512-dim fingerprint = concatenation of:
	- Node2Vec structural embedding (128d)
	- TF-IDF document vector (128d)
	- Benford's Law digit distribution (9d, padded to 16d)
	- Temporal burst vector (64d)
	- Linguistic fingerprint -- Burrows Delta (64d)
	- Entity type one-hot (16d)
	- Risk indicator vector (16d)
	- CAG audit TF-IDF (64d)
	- Institutional path vector (32d)

	MinHash LSH for efficient similarity search (cosine > 0.82 = same network).
	New routes: `GET /dna/{entity_id}` and `GET /dna/similar/{entity_id}`

	---

	### Phase 60 -- ElectionProximityBurst Detector (NEW)
	Branch: `feature/phase-60-election-burst`

	**The only corruption detection algorithm that encodes the Indian electoral
	calendar as a statistical regression variable.**

	Algorithm:
	1. Load full Indian electoral calendar (Lok Sabha + 28 state assemblies)
	2. ARIMA(2,1,1) on monthly metric aggregates
	3. PELT changepoint detection on ARIMA residuals
	4. Match changepoints to election proximity (within 180 days)
	5. CUSUM control chart with k=0.5, h=5.0
	6. Granger causality: does election_proximity_days Granger-cause the metric?
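
Step 5 can be illustrated with a two-sided tabular CUSUM using k=0.5 and h=5.0. The sketch assumes values are standardised against a separate in-control baseline and that the statistics reset after each alert; `cusum_alerts` is a hypothetical name, and the ARIMA, PELT, and Granger steps are not covered.

```python
from statistics import mean, stdev


def cusum_alerts(values, baseline, k=0.5, h=5.0):
    """Return the indices in `values` where a two-sided CUSUM signals.

    `baseline` is an in-control sample used to estimate mean and standard
    deviation; k (slack) and h (decision limit) are in standard-deviation units.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    s_hi = s_lo = 0.0
    alerts = []
    for i, x in enumerate(values):
        z = (x - mu) / sigma
        s_hi = max(0.0, s_hi + z - k)  # accumulates upward shifts
        s_lo = max(0.0, s_lo - z - k)  # accumulates downward shifts
        if s_hi > h or s_lo > h:
            alerts.append(i)
            s_hi = s_lo = 0.0  # restart after a signal (assumed policy)
    return alerts
```

With a baseline hovering around 10.5, a run of monthly values jumping to 14-15 trips the chart on the second elevated month, while ordinary fluctuation never accumulates because each small deviation is absorbed by the slack k.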

	Output: burst_score (0-100), election_burst_flags, cusum_alerts,
	Granger p-value, interpretation in plain language.

	Integrated as 16th investigator (temporal, weight 0.10)

	---

	### Phase 61 -- BenamiGNN: Heterogeneous Graph Neural Network (NEW)
	Branch: `feature/phase-61-benami-gnn`

	Problem: The 5-factor heuristic misses multi-hop benami arrangements, for
	example when a politician's cousin (not the politician) is the director, or
	when a company holds legitimate small contracts before being used for a
	large fraudulent one.

	H-GNN architecture:
	- 8 relation types: DIRECTOR_OF, WON_CONTRACT, SHARES_ADDRESS, RELATED_TO,
	AWARDED_BY, FAMILY_MEMBER_OF, APPEARS_IN_AUDIT, SANCTIONED_BY
	- Layer 0: Per-type linear projection to d=64
	- Layer 1: Relation-aware message passing
	- Layer 2: Entity-type attention
	- Layer 3: Classification head -> benami_score in [0,1]

	Fallback: Always falls back to existing 5-factor heuristic when:
	- PyTorch not installed
	- Subgraph has < 5 nodes
	- Model not trained yet

	Training: Fine-tunes on confirmed benami cases from case_memory.

	---

	### Phase 62 -- CartelDNA Sequential Mining (NEW)
	Branch: `feature/phase-62-cartel-dna`

	Problem: Current cartel detector checks single-tender award rotation.
	Temporal cartels rotate wins across months and across ministries to avoid
	statistical detection within any one ministry.

	CartelDNA = PrefixSpan + HITS + DBSCAN:
	1. PrefixSpan on bid event sequences (company, category, month, rank)
	2. Detect alternating rank order patterns (length 2-6, min support 3)
	3. HITS on co-bidding network: authority = real winners, hub = fake competitors
	4. DBSCAN geographic clustering (epsilon = 50km, min_samples = 3)
	5. Cartel confidence = `0.35*pattern + 0.25*alternation + 0.20*geo + 0.20*HITS`

	New route: `GET /cartel/dna/{entity_id}`

	---

	### Phase 63 -- SHAP and LIME Explainability Layer (NEW)
	Branch: `feature/phase-63-explainability`

	Problem: No risk score currently carries an explanation. Journalists cannot
	publish "score: 67" without "why: politician_overlap drove +24 points."

	SHAP TreeExplainer on the ML ensemble from Phase 19 upgrade:
	- Feature contributions for each of the 5 indicators
	- Counterfactual: "If contract_concentration were 0, score would be 43"
	- Baseline score (expected value)

	LIME locally linear approximation for non-tree models.

	New fields added to all risk responses:
	- shap_top_drivers: [{feature, shap_value, direction}]
	- shap_counterfactual: plain-language minimum change to flip risk level
	- shap_baseline: expected value before any features

	New route: `GET /risk/explain/{entity_id}`

	---

	### Phase 64 -- Cross-Language Entity Disambiguation (NEW)
	Branch: `feature/phase-64-cross-lingual`

	Problem: Name variants such as "Modi" / "modi" / "modii" can appear across
	22 scripts and may be stored as separate graph nodes. A cross-lingual entity
	linker maps all variants to a single canonical node using Wikidata Q-numbers.

	XLM-RoBERTa zero-shot entity linking.
	Wikidata SPARQL for canonical Q-number lookup (existing scraper extended).
	Transliteration confidence score per script pair.

	---

	### Phase 65 -- Knowledge Graph Completion (Missing Link Prediction) (NEW)
	Branch: `feature/phase-65-kg-completion`

	TransE link prediction: h + r ≈ t in d-dimensional space for a true triple (h, r, t).
	Missing edge score: \|\|h + r - t\|\| (lower = more probable edge).
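
The scoring step is simple once embeddings exist. This sketch assumes embeddings are plain lists of floats that have already been trained elsewhere; `transe_distance` and `rank_tails` are hypothetical names, and training itself is not shown.

```python
import math


def transe_distance(h, r, t) -> float:
    """L2 norm of (h + r - t); lower means the edge (h, r, t) is more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(h, r, candidates: dict) -> list:
    """Rank candidate tail entities for the query (h, r, ?), most plausible first.

    `candidates` maps an entity name to its embedding vector.
    """
    return sorted(candidates, key=lambda name: transe_distance(h, r, candidates[name]))
```

The same function answers head queries such as (?, RELATED_TO, KnownShellCompany) by fixing r and t and ranking candidate heads instead.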

	Use cases:
	- (Politician, DIRECTOR_OF, ?) -- suggest companies likely controlled
	- (?, RELATED_TO, KnownShellCompany) -- find hidden associates
	- (Company, WON_CONTRACT, ?) -- predict future contract awards

	Output: List of probable missing edges with confidence scores,
	presented as "Suggested next investigation targets."

	---

	### Phase 66 -- LAS-GNN Temporal TBML Detection (NEW)
	Branch: `feature/phase-66-las-gnn`

	Problem: Current TBML detector uses threshold rules. Temporal money
	laundering (pre-election scatter-gather, below-threshold smurfing) is
	invisible to structural analysis.

	LAS-GNN: LSTM aggregator on directed transaction graphs.
	Learns sequential order of edges imposed by timestamps.
	Detects motifs: scatter-gather, fan-in/fan-out, layering, pre-election burst.

	Indian-specific motifs:
	- Pre-election scatter: funds split to many accounts < 6 months before election
	- Post-contract layering: payment -> N shell companies -> reconsolidated
	- Smurfing below threshold: many transactions < Rs 2 lakh (PMLA threshold)
	- Circular director rotation: A appoints X -> X at B -> B pays A

	---

	### Phase 67 -- NeuroSymbolic Risk Reasoning (NEW)
	Branch: `feature/phase-67-neurosymbolic`

	Fuses three reasoning modes into one coherent system:

	Stage 1 -- DEDUCTIVE (Phase 36 YAML rules):
	- Rules fire with certainty = 1.0 (logical certainty)
	- CRITICAL rule match -> score forced >= 75

	Stage 2 -- INDUCTIVE (Phase 19 ML ensemble + SHAP):
	- GNN/ML soft score in [0,1]
	- SHAP feature contributions

	Stage 3 -- ABDUCTIVE (Phase 38 DeepSeek-R1):
	- Chain-of-thought synthesis citing TruthChain evidence IDs
	- 2 competing hypotheses with scores

	Stage 4 -- Integration:
	- final_score = `0.40*rule_certainty + 0.35*gnn_score + 0.25*r1_confidence`
	- Adversarial override: if adversarial engine finds contradicting evidence -> cap at PROBABLE
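
One way to read Stage 4 in code is shown below. Several points are assumptions: the three inputs are taken to lie in [0, 1] and the result is scaled to 0-100 so that the "score forced >= 75" rule from Stage 1 applies, and "cap at PROBABLE" is interpreted as keeping the score just under the CONFIRMED threshold of 80 from Phase 38.

```python
CONFIRMED_THRESHOLD = 80.0  # from the Phase 38 verdict levels
CRITICAL_FLOOR = 75.0       # from Stage 1: CRITICAL rule match forces score >= 75


def integrate_scores(rule_certainty, gnn_score, r1_confidence,
                     critical_rule_matched=False, contradicting_evidence=False):
    """Combine the three stage outputs into a final score on a 0-100 scale."""
    score = 100.0 * (0.40 * rule_certainty + 0.35 * gnn_score + 0.25 * r1_confidence)
    if critical_rule_matched:
        score = max(score, CRITICAL_FLOOR)
    if contradicting_evidence:
        # Assumed reading of "cap at PROBABLE": stay strictly below CONFIRMED.
        score = min(score, CONFIRMED_THRESHOLD - 0.1)
    return score
```

The cap is applied after the floor, so contradicting evidence always wins over a CRITICAL rule match; the opposite ordering would be an equally defensible design and is worth deciding explicitly.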

	---

	### Phase 68 -- InstitutionMetapath2Vec Embeddings (NEW)
	Branch: `feature/phase-68-metapath`

	5 Indian-specific metapaths for structured random walks:
	1. politician_enrichment: Politician-DIRECTOR_OF-Company-WON_CONTRACT-Contract
	2. circular_enrichment: Politician-MEMBER_OF-Party-CONTROLS-Ministry-...-DIRECTOR_OF-Politician
	3. audit_flag_circular: Company-WON_CONTRACT-Contract-MENTIONED_IN-AuditReport-AUDITS-Ministry
	4. shell_address_cluster: Director-DIRECTOR_OF-Company-SHARES_ADDRESS-Company
	5. constituency_benefit: Politician-REPRESENTS-Constituency-LOCATED_IN-District-HAS_PROJECT-Contract

	128-dim entity embeddings trained via Word2Vec skip-gram on guided walks.
	find_similar_by_metapath() finds entities with the same institutional role
	across different states -- invisible to structural graph analysis.

	---

	### Phase 69 -- Geospatial Risk Clustering (NEW)
	Branch: `feature/phase-69-geospatial`

	Moran's I spatial autocorrelation on district-level risk scores.
	I > 0 indicates that districts with similar risk scores cluster together spatially (hotspots).
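
Global Moran's I can be computed directly from a risk vector and a spatial weight matrix. The sketch uses a binary adjacency matrix as the weights, which is an assumption; `morans_i` and the toy `chain_adjacency` helper (districts laid out in a line) are illustrative names only.

```python
def morans_i(values, weights) -> float:
    """Global Moran's I.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2,
    where W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    denom = sum(d * d for d in dev)
    if w_total == 0 or denom == 0:
        raise ValueError("weights or values have no variation")
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_total) * (num / denom)


def chain_adjacency(n):
    """Toy weight matrix: district i borders only districts i-1 and i+1."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
```

On six districts in a line, scores of `[90, 85, 80, 20, 15, 10]` give a clearly positive I (similar neighbours), while the alternating pattern `[90, 10, 90, 10, 90, 10]` gives I = -1.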

	LISA (Local Indicators of Spatial Association):
	- High-High cluster: high-risk district surrounded by high-risk districts
	- Low-High outlier: low-risk district in high-risk region (potential evasion)
	- High-Low outlier: targeted corruption in otherwise clean district

	Output: District-level choropleth with cluster classification.
	New route: `GET /geospatial/risk-clusters`

	---

	### Phase 70 -- Dynamic Knowledge Graph Anomaly Detection (NEW)
	Branch: `feature/phase-70-dynamic-kg`

	Continuously monitors graph for unexpected structural changes.
	7 fault types (ISWC 2024): node_disappearance, edge_rewiring,
	attribute_drift, cluster_split, cluster_merge, temporal_burst, isolation.

	Contextual anomaly detection: an entity that was HIGH-risk 3 months ago
	and is now suddenly LOW-risk may indicate evidence suppression.

	---

	### Phase 71 -- GCPAL Contrastive Pre-Training (NEW)
	Branch: `feature/phase-71-gcpal`

	Label scarcity problem: India has very few confirmed corruption cases
	relative to the total number of entities (estimated 1:707 ratio).
	Standard supervised ML struggles to train on this imbalance.

	GCPAL solution: NT-Xent contrastive loss on 3 augmented views:
	- View 1: node feature dropout (20%)
	- View 2: edge dropout (20%)
	- View 3: KNN implicit interactions (k=5)

	Pre-trains on unlabelled graph. Fine-tunes on case_memory confirmed cases.

	---

	### Phase 72 -- Automated Source Credibility Scoring (NEW)
	Branch: `feature/phase-72-source-credibility`

	Bayesian credibility model per source:
	- institutional_authority: government > NGO > news > social
	- historical_accuracy: confirmed vs denied past claims
	- methodology_transparency: does source explain collection method?
	- timeliness: freshness decay
	- cross_source_corroboration: independent corroboration count

	Bayesian update after each confirmed/denied case.

	---

	### Phase 73 -- Investigative RAG Over Case Memory (NEW)
	Branch: `feature/phase-73-rag-cases`

	RAG over all past investigation reports in case_memory.
	Query: "Past investigations involving electoral bonds and road contracts"
	-> Dense retrieval -> Top-k case summaries as context
	-> DeepSeek-R1 synthesizes commonalities and suggests strategy.

	---

	### Phase 74 -- Continuous Model Drift Detection (NEW)
	Branch: `feature/phase-74-drift`

	Population Stability Index (PSI):
	- PSI < 0.10: stable
	- PSI 0.10-0.25: monitor closely
	- PSI > 0.25: retrain required
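
PSI compares the binned distribution of scores at training time with the current one. The sketch assumes both inputs are already bin proportions that sum to 1; the small `eps` floor for empty bins and the short status labels are illustrative choices.

```python
import math


def population_stability_index(expected, actual, eps=1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same number of bins")
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        psi += (a - e) * math.log(a / e)
    return psi


def drift_status(psi: float) -> str:
    """Map a PSI value onto the three bands listed above."""
    if psi < 0.10:
        return "stable"
    if psi <= 0.25:
        return "monitor"
    return "retrain"
```

For instance, a shift from four equal bins to proportions of 0.10, 0.20, 0.30, and 0.40 gives a PSI of about 0.23, which lands in the "monitor closely" band.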

	ADWIN (Adaptive Windowing): streaming concept drift detection.
	Auto-triggers GCPAL retraining job when drift detected.

	---

	### Phase 75 -- Ethics and Bias Audit System (NEW)
	Branch: `feature/phase-75-ethics`

	Fairness metrics:
	- Demographic parity: P(HIGH_RISK \| party=A) approx= P(HIGH_RISK \| party=B)
	- Equal opportunity: TPR equal across entity types
	- Predictive parity: PPV equal across geographic regions

	Bias detection: chi-squared test, disparate impact ratio, SHAP fairness.
	Mitigation: Reweighing, adversarial debiasing, calibration.

	New route: `GET /admin/bias-audit`

	---

	## BRANCH WORKFLOW

	```bash
	# Before each new phase:
	git checkout main && git pull origin main
	git checkout -b feature/phase-N-name

	# After all commits:
	git push origin feature/phase-N-name
	# Open PR on GitHub -> merge -> pull main -> tag

	# Tag every completed phase:
	git tag -a v0.N.0 -m "Phase N: description"
	git push origin v0.N.0

	# Deploy to HuggingFace after every merge:
	git push hf main --force

	# Reseed after every deploy:
	curl -X POST https://abinazebinoly-bharatgraph.hf.space/admin/seed
	```

	---

	## VERSION HISTORY

	\| Version \| Phase \| Key addition \|
	\|---------\|-------\|--------------\|
	\| v0.30.0 \| 30 \| Bug fix sprint -- 26 bugs resolved \|
	\| v0.31.0 \| 31 \| Runtime profile auto-scaling \|
	\| v0.32.0 \| 32 \| Entity resolution v2 (planned) \|
	\| v0.33.0 \| 33 \| Custom graph engine (planned) \|
	\| v0.40.0 \| 40 \| DeepSeek-V3 multilingual reports (planned) \|
	\| v0.50.0 \| 50 \| Security v2 RBAC (planned) \|
	\| v1.0.0 \| 75 \| Full production launch (planned) \|

	---

	## Developed by Abinaze Binoy