Update README.md

README.md (CHANGED)

@@ -17,395 +17,177 @@ short_description: Data-Driven Innovation for Aadhaar
[](https://www.python.org/downloads/)
[](https://opensource.org/licenses/MIT)

-
> **Context-Aware Anomaly Detection System for Aadhaar Enrolment Centers**
-
> Team ID: UIDAI_4571 | Theme: Data-Driven Innovation for Aadhaar

---

## 🎯 Quick Links

-
- **📓 Live Notebook**: [Open in Google Colab](https://colab.research.google.com/drive/1YAQ4nfxltvG_cts3fmGc_zi2JQc4oPOT?usp=sharing)
-
- **Dashboard
-
- **
- **💻 Source Code**: Available in this repository

---

## 🎯 Overview

-
Project S.A.T.A.R.K

-
### The Problem

-
-
-
-
-
- 🎯 Need: Regional baselines that adapt to local patterns
-
-
### Our Innovation
-
-
**District Normalization**: Each enrolment center is compared to its local district baseline, not a national average.
-
-
**Example**: In a tribal district with a 40% adult enrolment average, a center with a 90% adult ratio gets flagged for deviation, even if its absolute numbers are lower than urban centers'.

---

## ✨ Key Features

-
###
- **Algorithm**: Isolation Forest (Unsupervised Learning)
-
- **
-
- **
-
-
###
-
- **
-
- **
-
- **
-
-
-
-
-
- Multi-select risk categories (Low/Medium/High/Critical)
-
- Dynamic state → district cascading
-
- Weekend-only anomaly toggle
-
-
### 📥 Multiple Export Formats
-
- **CSV**: Field team verification lists
-
- **JSON**: API integration
-
- **TXT**: Investigation reports for management

---

## 🚀 Quick Start

-
### **Option 1: Google Colab
-

[](https://colab.research.google.com/drive/1YAQ4nfxltvG_cts3fmGc_zi2JQc4oPOT?usp=sharing)

-

-
### **Option 2: Local

-
-
```bash
-
Python 3.8+
-
pip (Python package manager)
-
```
-
-
### Installation

1. **Clone the repository**
-
```bash
-
git clone https://huggingface.co/spaces/lovnishverma/UIDAI
-
cd UIDAI
```

2. **Install dependencies**
```bash
pip install -r requirements.txt
-
```

-
3. **Run the Jupyter Notebook** (Data Processing)
-
```bash
-
jupyter notebook project_notebook.ipynb
```
-
This generates `analyzed_aadhaar_data.csv`

-
```bash
streamlit run app.py
-
```

-
5. **Access the application**
-
```
-
http://localhost:8501
```

-
---

-
-
-
```
-
UIDAI/
-
├── README.md                  # This file
-
├── requirements.txt           # Python dependencies
-
├── Dockerfile                 # Docker configuration
-
├── project_notebook.ipynb     # ML model & data processing
-
├── app.py                     # Streamlit dashboard
-
├── analyzed_aadhaar_data.csv  # Processed data (generated from Colab)
-
├── docs/
-
│   ├── Project_S.A.T.A.R.K_Analysis.docx
-
│   ├── S.A.T.A.R.K_Dashboard_Documentation.docx
-
│   └── Dashboard_Enhancements_Guide.docx
-
└── assets/
-
    └── screenshots/           # Dashboard screenshots
-
```
-

-
---
-

-
## 🔧 Technical Architecture
-

-
### Data Pipeline
-
```
-
Raw Data (Biometric + Demographic + Enrolment)
-
↓
-
SmartLoader (Chunked CSV ingestion)
-
↓
-
Master Merge (Outer joins on date/state/district/pincode)
-
↓
-
ContextEngine (District normalization)
-
↓
-
Feature Engineering (4 context-aware features)
-
↓
-
Isolation Forest (Anomaly detection)
-
↓
-
Risk Scoring (0-100 scale)
-
↓
-
Dashboard Visualization
-
```
-

-
### Core Features (ML Model)
-

-
| Feature | Description | Importance |
-
|---------|-------------|------------|
-
| **ratio_deviation** | Deviation from district avg adult ratio | 45% |
-
| **weekend_spike_score** | Activity spike on weekends/holidays | 25% |
-
| **mismatch_score** | Discrepancy between bio/demo updates | 20% |
-
| **total_activity** | Overall transaction volume | 10% |
-

-
### Technology Stack
-

-
- **Backend**: Python 3.8+, Pandas, NumPy, Scikit-learn
-
- **ML**: Isolation Forest (Unsupervised Anomaly Detection)
-
- **Frontend**: Streamlit (Web Framework)
-
- **Visualization**: Plotly Express, Plotly Graph Objects
-
- **Deployment**: Docker, Hugging Face Spaces

---

-
##
-

-
### Tab 1: Geographic Analysis
-
- **Interactive Map**: Risk heatmap with circle size = volume, color = risk
-
- **Top 5 Hotspots**: Color-coded cards showing riskiest locations
-
- **Risk Distribution**: Donut chart breakdown by category
-

-
### Tab 2: Pattern Analysis
-
- **Ghost ID Indicator**: Scatter plot with deviation thresholds
-
- **Risk Histogram**: Distribution concentration analysis
-
- **Time Series**: Dual-axis chart showing trends over time
-
- **Statistics**: Mean, median, std dev, 95th percentile
-

-
### Tab 3: Priority Cases
-
- **Adjustable Threshold**: Slider to filter by minimum risk score
-
- **Action Status**: Workflow tracking (Pending/Investigation/Resolved)
-
- **Enhanced Table**: Progress bars, formatted columns
-
- **Export Options**: CSV, JSON, TXT formats
-

-
### Tab 4: Advanced Analytics
-
- **Feature Importance**: Bar chart showing ML contributions
-
- **Performance Gauge**: Speedometer-style model accuracy
-
- **Correlation Heatmap**: Feature relationship matrix
-
- **Key Insights**: Contextual intelligence cards
-

-
---
-

-
## 🎨 Visual Design
-

-
### Professional Styling
-
- **Gradients**: Purple/blue for a government portal aesthetic
-
- **Animations**: Pulsing alerts for critical cases
-
- **Typography**: Google Fonts (Inter) for a modern look
-
- **Color Coding**: Risk levels with emoji indicators (🔴🟠🟡🟢)
-

-
### Responsive Layout
-
- **Wide Mode**: Maximum data density
-
- **Tabbed Interface**: Organized content reduces cognitive load
-
- **Adaptive Visualizations**: Charts adjust to filter context
-

-
---

-
## 🔧 Configuration
-

-
### Model Parameters
-
```python
-
Config.ML_FEATURES = [
-
    'ratio_deviation',      # Primary fraud indicator
-
    'weekend_spike_score',  # Unauthorized operations
-
    'mismatch_score',       # Data manipulation
-
    'total_activity'        # Volume context
-
]
-
Config.CONTAMINATION = 0.05  # 5% expected anomaly rate
-
Config.RANDOM_STATE = 42     # Reproducibility
```

-
### Risk Thresholds
-
```python
-
RISK_CATEGORIES = {
-
    'Low': [0, 50],
-
    'Medium': [50, 70],
-
    'High': [70, 85],
-
    'Critical': [85, 100]
-
}
```

---

-
##
-

-
### 1. Ghost Identity Creation
-
**Pattern**: Abnormally high adult enrolment ratio
-
**Detection**: High positive ratio_deviation
-
**Example**: District avg 40%, center reports 90% → FLAGGED
-

-
### 2. Weekend/Holiday Fraud
-
**Pattern**: Activity spikes when centers should be closed
-
**Detection**: High weekend_spike_score
-
**Example**: 5x normal activity on Sunday → FLAGGED
-

-
### 3. Data Manipulation
-
**Pattern**: Discrepancies between biometric and demographic updates
-
**Detection**: High mismatch_score
-
**Example**: 100 demo updates, 20 bio updates → FLAGGED
-

-
---
-

-
## 🚢 Deployment
-

-
### Docker Deployment
-
```bash
-
# Build image
-
docker build -t app .
-

-
# Run container
-
docker run -p 8501:8501 app
-
```
-

-
### Hugging Face Spaces
-
The app is automatically deployed when you push to the main branch.

-
###
-
```bash
-
STREAMLIT_SERVER_PORT=8501
-
STREAMLIT_SERVER_ADDRESS=0.0.0.0
-
STREAMLIT_SERVER_HEADLESS=true
-
```

-

-

-
-
-
-
-
-
-
### System Performance
-
- **Data Points Processed**: 500K+ records
-
- **Processing Time**: <1 second (cached)
-
- **Dashboard Load Time**: ~2 seconds
-
- **Visualization Rendering**: <500ms per chart

---

-
##
-

-
### Current Implementation
-
- ✅ Data caching for performance
-
- ✅ Input validation on filters
-
- ✅ Error handling for missing data
-
- ⚠️ Simulated coordinates (demo only)
-

-
### Production Requirements
-
- SSO/OAuth authentication
-
- Role-based access control (RBAC)
-
- Audit logging for all actions
-
- Data encryption (at rest & in transit)
-
- Real geocoding with pincode master DB
-

-
---

-

-
-
-
- [ ] SHAP values for explainability
-
- [ ] Feedback loop for model refinement
-
- [ ] PDF report generation
-
- [ ] Email/SMS alert system

-
###
-
- [ ] Multi-level baselines (state, district, pincode)
-
- [ ] Network analysis for coordinated fraud
-
- [ ] Real-time streaming pipeline (Kafka)
-
- [ ] Ensemble methods (LOF + One-Class SVM)
-
- [ ] Mobile app for field officers

-

-

-
-
**Theme**: Data-Driven Innovation for Aadhaar
-
**Competition**: UIDAI Hackathon 2026

---

-
##

-
-
- **Project_S.A.T.A.R.K_Analysis.docx**: Technical analysis & code review
-
- **Sentinel_Dashboard_Documentation.docx**: Dashboard user guide
-
- **Dashboard_Enhancements_Guide.docx**: Enhancement details

-

-

-

-
-
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
-
3. Commit your changes (`git commit -m 'Add AmazingFeature'`)
-
4. Push to the branch (`git push origin feature/AmazingFeature`)
-
5. Open a Pull Request

---

## 📄 License

-
This project is

---

-

-
- **UIDAI** for the hackathon opportunity and dataset
-
- **Anthropic** for AI assistance in development
-
- **Streamlit** for the web framework
-
- **Plotly** for interactive visualizations

-
---

-
## 📧 Contact

-
-
- **Email**: princelv84@gmail.com
-
- **Issues**: [GitHub Issues](https://github.com/lovnishverma/UIDAI/issues)
-
- **Discussions**: [GitHub Discussions](https://github.com/lovnishverma/UIDAI/discussions)

-
---
-
-
## 🌟 Star History

-
If you find this project useful, please consider giving it a ⭐!

-
---

-
-
-
<br>
-
<sub>© 2026 Project S.A.T.A.R.K. All rights reserved.</sub>
-
</div>

[](https://www.python.org/downloads/)
[](https://opensource.org/licenses/MIT)

+
> **Context-Aware Anomaly Detection System for Aadhaar Enrolment Centers**
> **Team ID:** UIDAI_4571 | **Theme:** Data-Driven Innovation for Aadhaar

---

## 🎯 Quick Links

+
- **📓 Live Analysis Notebook**: [Open in Google Colab](https://colab.research.google.com/drive/1YAQ4nfxltvG_cts3fmGc_zi2JQc4oPOT?usp=sharing)
+
- **📊 Live Dashboard**: [Hugging Face Spaces](https://huggingface.co/spaces/lovnishverma/UIDAI)
+
- **📄 Project Report**: [View PDF](Final-Project-Report.pdf)
- **💻 Source Code**: Available in this repository

---

## 🎯 Overview

+
**Project S.A.T.A.R.K** (Statistical Anomaly Tracking & Aadhaar Risk Kit) is a context-aware fraud detection system designed to resolve the critical "Accuracy vs. Fairness" trade-off in Aadhaar vigilance.

+
### The Problem
+
India's demographic diversity makes one-size-fits-all rules ineffective:
+
- ❌ **Strict rules** flag legitimate activity in tribal belts (false positives).
+
- ❌ **Lenient rules** miss sophisticated fraud in metropolitan areas (false negatives).

+
### Our Innovation: District Normalization
+
Instead of using a national average, S.A.T.A.R.K compares each enrolment center against its **local district baseline**.
+
- **Example:** In a tribal district where late enrolment is common (average: 40%), a center at 90% is flagged; in a city where 90% is normal, the same figure is marked safe.
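
The district-baseline idea can be sketched in a few lines of pandas. The column values and district names below are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical per-center adult-enrolment ratios in two districts.
df = pd.DataFrame({
    "district": ["Tribal-A", "Tribal-A", "Metro-B", "Metro-B"],
    "center": ["C1", "C2", "C3", "C4"],
    "adult_ratio": [0.40, 0.90, 0.88, 0.92],
})

# Baseline = mean adult ratio of each center's own district.
df["district_avg"] = df.groupby("district")["adult_ratio"].transform("mean")

# Deviation from the local baseline, not from a national average:
# C2 (0.90 in a 0.65-average district) stands out, while C3
# (0.88 in a 0.90-average district) does not.
df["ratio_deviation"] = df["adult_ratio"] - df["district_avg"]
```

A center is then scored on `ratio_deviation` rather than on its raw ratio, which is what lets the same 90% figure be anomalous in one district and normal in another.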

---

## ✨ Key Features

+
### 🧠 The "Context-Aware" AI Engine
- **Algorithm**: Isolation Forest (Unsupervised Learning)
+
- **Smart Logic**: Detects anomalies relative to local geography.
+
- **Capabilities**: Identifies "Ghost IDs", "Sunday Surges" (illegal camps), and "Mass Update Operations".
+

+
### 📊 The Vigilance Dashboard
+
- **Geospatial Intelligence**: Interactive heatmap of high-risk centers.
+
- **Actionable Insights**: "Priority Action List" exportable for field agents.
+
- **Evidence-Based**: Charts showing *why* a center was flagged (e.g., weekend vs. weekday activity).
+

+
### 📥 Smart Data Ingestion
+
- **Automated**: Recursively fetches and merges fragmented CSV chunks.
+
- **Robust**: Handles massive datasets without data loss using outer joins.

---

## 🚀 Quick Start

+
### **Option 1: Run Analysis (Google Colab)**
+
To see the feature engineering and model training in action:

[](https://colab.research.google.com/drive/1YAQ4nfxltvG_cts3fmGc_zi2JQc4oPOT?usp=sharing)

+
1. Open the notebook.
+
2. Run all cells to process the raw data.
+
3. Download the generated `analyzed_aadhaar_data.csv`.

+
### **Option 2: Run Dashboard (Local)**

+
**Prerequisites:** Python 3.8+, pip

1. **Clone the repository**
+
```bash
+
git clone https://huggingface.co/spaces/lovnishverma/UIDAI
+
cd UIDAI
+
```

2. **Install dependencies**
```bash
pip install -r requirements.txt
```

+
3. **Launch the App**
```bash
streamlit run app.py
```

+
4. **Access the Dashboard**
+
Open `http://localhost:8501` in your browser.

---

+
## 📁 Project Structure

```
+
UIDAI/
+
├── README.md                                  # This documentation
+
├── requirements.txt                           # Python dependencies
+
├── Dockerfile                                 # Container configuration
+
├── app.py                                     # Streamlit dashboard code
+
├── UIDAI_4571_(PROJECT_S_A_T_A_R_K_AI).ipynb  # Main analysis notebook
+
├── analyzed_aadhaar_data.csv                  # Processed data for the dashboard
+
├── Final-Project-Report.pdf                   # Complete project documentation
+
└── assets/                                    # Images and logos

```

---

+
## 🔧 Technical Architecture

+
### The Pipeline

+
1. **Ingestion**: `SmartLoader` class merges fragmented CSVs.
+
2. **Context Engine**: Calculates `ratio_deviation` (center vs. district).
+
3. **AI Model**: `IsolationForest` detects statistical outliers.
+
4. **Visualization**: Streamlit app renders the `RISK_SCORE` on maps.

+
### Core Risk Signals

+
| Feature | Logic | Detects |
+
| --- | --- | --- |
+
| **Ratio Deviation** | `(Center_Ratio - District_Avg)` | Ghost IDs |
+
| **Weekend Spike** | `Activity on Sunday / Normal Day` | Illegal Camps |
+
| **Mismatch Score** | `\|Bio - Demo\|` | Data Manipulation |
+
| **Volume Anomaly** | `Total_Activity > 99th Percentile` | Mass Operations |
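
For instance, the Weekend Spike signal from the table reduces to a simple ratio. The exact formula is not spelled out in this README, so treat this as an assumed sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical daily enrolment log for one center.
logs = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    "enrolments": [100, 110, 95, 105, 100, 120, 500],  # Sunday surge
})

weekday_avg = logs.loc[~logs["day"].isin(["Sat", "Sun"]), "enrolments"].mean()
sunday = logs.loc[logs["day"] == "Sun", "enrolments"].iloc[0]

# Roughly 4.9x normal activity on a day the center should be closed.
weekend_spike_score = sunday / weekday_avg
```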

---

+
## 📊 Dashboard Preview

+
### 1. Geographic Heatmap

+
Instantly spot high-risk clusters across India.
+
*(See `assets/` for screenshots)*

+
### 2. Priority Action List

+
Downloadable CSV for vigilance officers containing only the top 1% of critical cases.
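
The "top 1%" filter can be sketched with a percentile cutoff (synthetic scores and a hypothetical output filename):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
centers = pd.DataFrame({
    "center_id": np.arange(1000),
    "risk_score": rng.uniform(0, 100, size=1000),
})

# Keep only the riskiest 1% of centers for the field-team export.
cutoff = centers["risk_score"].quantile(0.99)
priority = centers[centers["risk_score"] >= cutoff]
priority.to_csv("priority_action_list.csv", index=False)
```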

+
### 3. AI Insights Panel

+
"Why is this flagged?" The AI explains its decision (e.g., *"Flagged due to a 500% spike in weekend activity"*).

---

+
## 👥 Team UIDAI_4571

+
**Team Leader:** Aman Choudhary (NIELIT Ropar)

+
**Team Member:** Prateek Dhar Dwivedi (NIELIT Ropar)

+
**Mentor:** Lovnish Verma (Project Engineer, NIELIT Ropar)

+
**Competition:** UIDAI Hackathon 2026

+
**Submission Date:** January 2026

---

## 📄 License

+
This project is open-source under the [MIT License](LICENSE).

---

+
<div align="center">
+
<strong>Project S.A.T.A.R.K.</strong>

+
<em>Statistical Anomaly Tracking & Aadhaar Risk Kit</em>

+
Built with ❤️ for a safer, inclusive Digital India.
+
</div>
|