---
title: UIDAI Project Sentinel
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Data-Driven Innovation for Aadhaar
---

# 🛡️ Project Sentinel: AI-Powered Fraud Detection for UIDAI

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://huggingface.co/spaces/lovnishverma/UIDAI)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> **Context-Aware Anomaly Detection System for Aadhaar Enrolment Centers**  
> Team ID: UIDAI_4571 | Theme: Data-Driven Innovation for Aadhaar

---

## 🎯 Quick Links

- **πŸ“Š Live Notebook**: [Open in Google Colab](https://colab.research.google.com/drive/1YAQ4nfxltvG_cts3fmGc_zi2JQc4oPOT?usp=sharing)
- **πŸš€ Dashboard Demo**: [Hugging Face Spaces](https://huggingface.co/spaces/lovnishverma/UIDAI)
- **πŸ“– Documentation**: See `/docs` folder
- **πŸ’» Source Code**: Available in this repository

---

## 🎯 Overview

Project Sentinel is an innovative fraud detection system designed specifically for UIDAI Aadhaar enrolment centers. Unlike traditional global threshold-based systems, Sentinel uses **context-aware machine learning** with district-level normalization to identify fraudulent patterns while accounting for India's demographic diversity.

### The Problem We Solve

India's demographic diversity creates a unique challenge:
- 📊 Activities normal in Mumbai may be suspicious in tribal villages (and vice versa)
- ⚖️ Global thresholds either miss frauds or create false positives
- 🎯 Need: Regional baselines that adapt to local patterns

### Our Innovation

**District Normalization**: Each enrolment center is compared to its local district baseline, not a national average.

**Example**: In a tribal district where adult enrolments average 40%, a center with a 90% adult ratio is flagged for its deviation, even if its absolute numbers are lower than those of urban centers.
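
A minimal sketch of this district-baseline comparison, assuming a pandas DataFrame with hypothetical `district`, `pincode`, and `adult_ratio` columns (the project's real schema may differ). Five centers at 30% and one at 90% reproduce the 40% district average from the example:

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
centers = pd.DataFrame({
    "district": ["D1"] * 6 + ["D2"] * 2,
    "pincode": ["100001", "100002", "100003", "100004", "100005", "100006",
                "200001", "200002"],
    "adult_ratio": [0.30, 0.30, 0.30, 0.30, 0.30, 0.90, 0.85, 0.95],
})

# District baseline: mean adult ratio across all centers in the same district.
centers["district_avg"] = centers.groupby("district")["adult_ratio"].transform("mean")

# Signed deviation of each center from its own district baseline.
centers["ratio_deviation"] = centers["adult_ratio"] - centers["district_avg"]

# The center that deviates most from its local baseline.
most_deviant = centers.loc[centers["ratio_deviation"].idxmax()]
print(most_deviant[["pincode", "adult_ratio", "district_avg", "ratio_deviation"]])
```

In this toy data the D1 center at 90% deviates by about 0.50 from its district average, while the D2 center at 95% deviates by only about 0.05 because its district baseline is already high.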

---

## ✨ Key Features

### 🤖 Machine Learning Engine
- **Algorithm**: Isolation Forest (Unsupervised Learning)
- **Core Innovation**: Context-aware features with district baselines
- **Detection**: Ghost IDs, weekend fraud, data manipulation, coordinated operations

### 📊 Interactive Dashboard
- **Real-time KPIs**: 6 comprehensive metrics with trend indicators
- **Geographic Heatmap**: Risk visualization across India
- **Pattern Analysis**: Scatter plots, histograms, time series
- **Advanced Analytics**: Feature importance, correlation matrix, performance gauges

### 🔍 Smart Filtering
- Date range selection for temporal analysis
- Multi-select risk categories (Low/Medium/High/Critical)
- Dynamic state → district cascading
- Weekend-only anomaly toggle
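
The filters above can be sketched independently of the UI. This is an illustrative pandas version only; the column names (`date`, `state`, `district`, `risk_category`) and both helper functions are assumptions, not the dashboard's actual code:

```python
import pandas as pd

def districts_for(df, state):
    """District options to offer once a state has been selected."""
    return sorted(df.loc[df["state"] == state, "district"].unique())

def apply_filters(df, start, end, categories, state=None, district=None,
                  weekend_only=False):
    """Combine date range, risk category, location and weekend filters."""
    mask = df["date"].between(start, end) & df["risk_category"].isin(categories)
    if state is not None:
        mask &= (df["state"] == state)
    if district is not None:
        mask &= (df["district"] == district)
    if weekend_only:
        # dayofweek: Monday = 0 ... Saturday = 5, Sunday = 6
        mask &= (df["date"].dt.dayofweek >= 5)
    return df[mask]

# Hypothetical records (10 Jan 2025 is a Friday, 11 Jan a Saturday).
records = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-10", "2025-01-11", "2025-01-12", "2025-01-13"]),
    "state": ["S1", "S1", "S2", "S1"],
    "district": ["D1", "D2", "D3", "D1"],
    "risk_category": ["High", "Critical", "Low", "High"],
})

options = districts_for(records, "S1")
selected = apply_filters(
    records,
    pd.Timestamp("2025-01-01"), pd.Timestamp("2025-01-31"),
    categories=["High", "Critical"], state="S1", weekend_only=True,
)
```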

### 📥 Multiple Export Formats
- **CSV**: Field team verification lists
- **JSON**: API integration
- **TXT**: Investigation reports for management
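
As a rough illustration of the three formats, the sketch below serializes the same flagged cases with the standard library. The field names and helper functions are hypothetical; the dashboard's own export code may be organized differently:

```python
import csv
import io
import json

# Hypothetical flagged cases; field names are illustrative only.
cases = [
    {"pincode": "100006", "district": "D1", "risk_score": 92.5},
    {"pincode": "200002", "district": "D2", "risk_score": 71.0},
]

def to_csv(rows):
    """Verification list for field teams: one row per case."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json(rows):
    """Machine-readable payload for API integration."""
    return json.dumps(rows, indent=2)

def to_txt(rows):
    """Plain-text summary for management reports."""
    lines = [f"Flagged cases: {len(rows)}"]
    for row in rows:
        lines.append(
            f"- Pincode {row['pincode']} ({row['district']}): "
            f"risk {row['risk_score']:.1f}"
        )
    return "\n".join(lines)

print(to_txt(cases))
```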

---

## 🚀 Quick Start

### **Option 1: Google Colab (Fastest)**
Run the complete analysis in your browser without any setup:

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YAQ4nfxltvG_cts3fmGc_zi2JQc4oPOT?usp=sharing)

Click the badge above to open the notebook and run all cells to generate the analyzed data.

### **Option 2: Local Setup**

### Prerequisites
```bash
python --version   # should report Python 3.8 or newer
pip --version      # pip, the Python package manager
```

### Installation

1. **Clone the repository**
```bash
git clone https://huggingface.co/spaces/lovnishverma/UIDAI
cd UIDAI
```

2. **Install dependencies**
```bash
pip install -r requirements.txt
```

3. **Run the Jupyter Notebook** (Data Processing)
```bash
jupyter notebook project_sentinel_notebook.ipynb
```
Running all cells generates `analyzed_aadhaar_data.csv`.

4. **Launch the Dashboard**
```bash
streamlit run app.py
```

5. **Access the application**
```
http://localhost:8501
```

---

## 📁 Project Structure

```
UIDAI/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── Dockerfile                         # Docker configuration
├── project_sentinel_notebook.ipynb    # ML model & data processing
├── app.py                             # Streamlit dashboard
├── analyzed_aadhaar_data.csv          # Processed data (generated from Colab)
├── docs/
│   ├── Project_Sentinel_Analysis.docx
│   ├── Sentinel_Dashboard_Documentation.docx
│   └── Dashboard_Enhancements_Guide.docx
└── assets/
    └── screenshots/                   # Dashboard screenshots
```

---

## 🧠 Technical Architecture

### Data Pipeline
```
Raw Data (Biometric + Demographic + Enrolment)
    ↓
SmartLoader (Chunked CSV ingestion)
    ↓
Master Merge (Outer joins on date/state/district/pincode)
    ↓
ContextEngine (District normalization)
    ↓
Feature Engineering (4 context-aware features)
    ↓
Isolation Forest (Anomaly detection)
    ↓
Risk Scoring (0-100 scale)
    ↓
Dashboard Visualization
```
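
The first two stages of the pipeline can be sketched roughly as follows. This is not the project's `SmartLoader`; the `load_in_chunks` helper, the in-memory CSVs, and the `bio_updates` / `demo_updates` columns are assumptions chosen to show chunked reading followed by an outer join on the four keys:

```python
import io
import pandas as pd

KEYS = ["date", "state", "district", "pincode"]

def load_in_chunks(source, chunksize=1):
    """Read a CSV in fixed-size chunks and concatenate them into one frame."""
    chunks = pd.read_csv(source, chunksize=chunksize, dtype={"pincode": str})
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory stand-ins for the biometric and demographic files.
bio_csv = io.StringIO(
    "date,state,district,pincode,bio_updates\n"
    "2025-01-04,S1,D1,100001,20\n"
    "2025-01-05,S1,D1,100001,5\n"
)
demo_csv = io.StringIO(
    "date,state,district,pincode,demo_updates\n"
    "2025-01-04,S1,D1,100001,100\n"
    "2025-01-06,S1,D1,100002,7\n"
)

bio = load_in_chunks(bio_csv)
demo = load_in_chunks(demo_csv)

# Outer join keeps rows that appear in only one source; missing counts become 0.
merged = bio.merge(demo, on=KEYS, how="outer").fillna(
    {"bio_updates": 0, "demo_updates": 0}
)
print(merged)
```

An outer join is the natural fit here because a center may report biometric updates on a day with no demographic updates, and dropping such rows would hide exactly the mismatches the model looks for.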

### Core Features (ML Model)

| Feature | Description | Importance |
|---------|-------------|------------|
| **ratio_deviation** | Deviation from district avg adult ratio | 45% |
| **weekend_spike_score** | Activity spike on weekends/holidays | 25% |
| **mismatch_score** | Discrepancy between bio/demo updates | 20% |
| **total_activity** | Overall transaction volume | 10% |

### Technology Stack

- **Backend**: Python 3.8+, Pandas, NumPy, Scikit-learn
- **ML**: Isolation Forest (Unsupervised Anomaly Detection)
- **Frontend**: Streamlit (Web Framework)
- **Visualization**: Plotly Express, Plotly Graph Objects
- **Deployment**: Docker, Hugging Face Spaces

---

## 📊 Dashboard Overview

### Tab 1: Geographic Analysis
- **Interactive Map**: Risk heatmap with circle size = volume, color = risk
- **Top 5 Hotspots**: Color-coded cards showing riskiest locations
- **Risk Distribution**: Donut chart breakdown by category

### Tab 2: Pattern Analysis
- **Ghost ID Indicator**: Scatter plot with deviation thresholds
- **Risk Histogram**: Distribution concentration analysis
- **Time Series**: Dual-axis chart showing trends over time
- **Statistics**: Mean, median, std dev, 95th percentile

### Tab 3: Priority Cases
- **Adjustable Threshold**: Slider to filter by minimum risk score
- **Action Status**: Workflow tracking (Pending/Investigation/Resolved)
- **Enhanced Table**: Progress bars, formatted columns
- **Export Options**: CSV, JSON, TXT formats

### Tab 4: Advanced Analytics
- **Feature Importance**: Bar chart showing ML contributions
- **Performance Gauge**: Speedometer-style model accuracy
- **Correlation Heatmap**: Feature relationship matrix
- **Key Insights**: Contextual intelligence cards

---

## 🎨 Visual Design

### Professional Styling
- **Gradients**: Purple/blue for government portal aesthetic
- **Animations**: Pulsing alerts for critical cases
- **Typography**: Google Fonts (Inter) for modern look
- **Color Coding**: Risk levels with emoji indicators (🔴🟠🟡🟢)

### Responsive Layout
- **Wide Mode**: Maximum data density
- **Tabbed Interface**: Organized content reduces cognitive load
- **Adaptive Visualizations**: Charts adjust to filter context

---

## 🔧 Configuration

### Model Parameters
```python
Config.ML_FEATURES = [
    'ratio_deviation',      # Primary fraud indicator
    'weekend_spike_score',  # Unauthorized operations
    'mismatch_score',       # Data manipulation
    'total_activity'        # Volume context
]
Config.CONTAMINATION = 0.05  # 5% expected anomaly rate
Config.RANDOM_STATE = 42     # Reproducibility
```
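
A minimal sketch of how these parameters could be passed to scikit-learn's `IsolationForest`. The training data is synthetic, and the min-max mapping of anomaly scores onto a 0-100 scale is an assumption for illustration, not necessarily the scoring the project uses:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

ML_FEATURES = [
    "ratio_deviation",
    "weekend_spike_score",
    "mismatch_score",
    "total_activity",
]

# Synthetic stand-in for the engineered feature table.
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(200, 4)), columns=ML_FEATURES)

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(features)  # -1 = anomaly, 1 = normal

# score_samples is higher for more normal points, so negate it
# to make larger values mean "more anomalous".
raw = -model.score_samples(features)

# Assumed mapping: min-max rescale the anomaly scores to 0-100.
risk_score = 100 * (raw - raw.min()) / (raw.max() - raw.min())

print(f"Flagged {np.sum(labels == -1)} of {len(features)} rows")
```

With `contamination=0.05`, roughly 5% of the training rows are labelled as anomalies regardless of how unusual they are in absolute terms, so this value acts as a budget for investigations rather than an estimate of the true fraud rate.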

### Risk Thresholds
```python
RISK_CATEGORIES = {
    'Low': [0, 50],
    'Medium': [50, 70],
    'High': [70, 85],
    'Critical': [85, 100]
}
```
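
The ranges above share their boundary values (50, 70, 85), so a lookup has to pick a convention. The sketch below assumes the lower bound is inclusive and the upper bound exclusive, with a score of exactly 100 counted as Critical; `risk_category` is a hypothetical helper, not a function from the codebase:

```python
RISK_CATEGORIES = {
    "Low": [0, 50],
    "Medium": [50, 70],
    "High": [70, 85],
    "Critical": [85, 100],
}

def risk_category(score):
    """Map a 0-100 risk score to its category (low inclusive, high exclusive)."""
    if not 0 <= score <= 100:
        raise ValueError(f"risk score out of range: {score}")
    for name, (low, high) in RISK_CATEGORIES.items():
        if low <= score < high:
            return name
    # Only score == 100 reaches this point.
    return "Critical"

for value in (49.9, 50, 70, 85, 100):
    print(value, risk_category(value))
```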

---

## 📈 Use Cases

### 1. Ghost Identity Creation
**Pattern**: Abnormally high adult enrolment ratio  
**Detection**: High positive ratio_deviation  
**Example**: District avg 40%, center reports 90% → FLAGGED

### 2. Weekend/Holiday Fraud
**Pattern**: Activity spikes when centers should be closed  
**Detection**: High weekend_spike_score  
**Example**: 5x normal activity on Sunday → FLAGGED

### 3. Data Manipulation
**Pattern**: Discrepancies between biometric and demographic updates  
**Detection**: High mismatch_score  
**Example**: 100 demo updates, 20 bio updates → FLAGGED
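
The README does not give formulas for `weekend_spike_score` or `mismatch_score`, so the definitions below are assumptions chosen only to reproduce the two examples above: a weekend-to-weekday activity ratio and a relative gap between update counts.

```python
from datetime import date

def mismatch_score(demo_updates, bio_updates):
    """Relative gap between demographic and biometric update counts, in [0, 1]."""
    total = demo_updates + bio_updates
    return abs(demo_updates - bio_updates) / total if total else 0.0

def weekend_spike_score(daily_counts):
    """Ratio of mean weekend activity to mean weekday activity."""
    # weekday(): Monday = 0 ... Saturday = 5, Sunday = 6
    weekend = [n for day, n in daily_counts.items() if day.weekday() >= 5]
    weekday = [n for day, n in daily_counts.items() if day.weekday() < 5]
    if not weekend or not weekday:
        return 0.0
    weekday_mean = sum(weekday) / len(weekday)
    if weekday_mean == 0:
        return 0.0
    return (sum(weekend) / len(weekend)) / weekday_mean

# Hypothetical week: 20 transactions per weekday, 100 on Sunday 12 Jan 2025.
week = {date(2025, 1, day): 20 for day in range(6, 11)}
week[date(2025, 1, 12)] = 100

spike = weekend_spike_score(week)   # 5x normal activity
gap = mismatch_score(100, 20)       # 80 of 120 updates unmatched
print(spike, round(gap, 3))
```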

---

## 🚒 Deployment

### Docker Deployment
```bash
# Build image
docker build -t sentinel-dashboard .

# Run container
docker run -p 8501:8501 sentinel-dashboard
```

### Hugging Face Spaces
The app is automatically deployed when you push to the main branch.

### Environment Variables
```bash
STREAMLIT_SERVER_PORT=8501
STREAMLIT_SERVER_ADDRESS=0.0.0.0
STREAMLIT_SERVER_HEADLESS=true
```

---

## 📊 Performance Metrics

### Model Performance (Simulated)
- **Precision**: 89%
- **Recall**: 85%
- **F1-Score**: 87%
- **Accuracy**: 88%
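
As a quick consistency check on these simulated figures, F1 is the harmonic mean of precision and recall, and the reported values agree with that definition:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.89, 0.85)
print(f"{f1:.2%}")
```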

### System Performance
- **Data Points Processed**: 500K+ records
- **Processing Time**: <1 second (cached)
- **Dashboard Load Time**: ~2 seconds
- **Visualization Rendering**: <500ms per chart

---

## 🔒 Security Considerations

### Current Implementation
- ✅ Data caching for performance
- ✅ Input validation on filters
- ✅ Error handling for missing data
- ⚠️ Simulated coordinates (demo only)

### Production Requirements
- 🔐 SSO/OAuth authentication
- 🔐 Role-based access control (RBAC)
- 🔐 Audit logging for all actions
- 🔐 Data encryption (at rest & in transit)
- 🔐 Real geocoding with pincode master DB

---

## 🎯 Future Enhancements

### Short-term (1-3 months)
- [ ] Real geocoding integration
- [ ] SHAP values for explainability
- [ ] Feedback loop for model refinement
- [ ] PDF report generation
- [ ] Email/SMS alert system

### Long-term (3-6 months)
- [ ] Multi-level baselines (state, district, pincode)
- [ ] Network analysis for coordinated fraud
- [ ] Real-time streaming pipeline (Kafka)
- [ ] Ensemble methods (LOF + One-Class SVM)
- [ ] Mobile app for field officers

---

## 👥 Team

**Team ID**: UIDAI_4571  
**Theme**: Data-Driven Innovation for Aadhaar  
**Competition**: UIDAI Hackathon 2026

---

## 📄 Documentation

Comprehensive documentation available in `/docs`:
- **Project_Sentinel_Analysis.docx**: Technical analysis & code review
- **Sentinel_Dashboard_Documentation.docx**: Dashboard user guide
- **Dashboard_Enhancements_Guide.docx**: Enhancement details

---

## 🀝 Contributing

We welcome contributions! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

---

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- **UIDAI** for the hackathon opportunity and dataset
- **Anthropic** for AI assistance in development
- **Streamlit** for the amazing web framework
- **Plotly** for interactive visualizations

---

## 📧 Contact

For questions or support, please contact:
- **Email**: sentinel-support@example.com
- **Issues**: [GitHub Issues](https://github.com/lovnishverma/UIDAI/issues)
- **Discussions**: [GitHub Discussions](https://github.com/lovnishverma/UIDAI/discussions)

---

## 🌟 Star History

If you find this project useful, please consider giving it a ⭐!

---

<div align="center">
  <strong>Built with ❤️ for a safer Aadhaar ecosystem</strong>
  <br>
  <sub>© 2026 Project Sentinel. All rights reserved.</sub>
</div>