hackathonn / docs /PROJECT_DOCUMENTATION.md
MHuzaifaa's picture
Upload project
61b0513
ο»Ώ# πŸš“ San Francisco Crime Analytics & Prediction System - Project Documentation
## 1. Project Overview
This project is a sophisticated AI-powered dashboard designed to analyze historical crime data in San Francisco and predict future incidents with high accuracy. It serves as a decision-support tool for law enforcement and a safety awareness tool for citizens.
The system combines:
- **Data Analytics**: Visualizing crime trends, hotspots, and distributions.
- **Machine Learning**: Using XGBoost and Random Forest to classify crimes as violent or non-violent.
- **Generative AI**: Integrating Groq (Llama 3) for natural language explanations and a conversational assistant.
## 2. Architecture & Technology Stack
### Frontend
- **Streamlit**: The core framework for the web interface. It handles the layout, user inputs, and visualization rendering.
- **Plotly**: Used for interactive charts (bar charts, pie charts, gauge charts).
- **Folium**: Used for geospatial visualizations (heatmaps, time-lapse maps).
### Backend & Logic
- **Python**: The primary programming language.
- **Pandas & NumPy**: For data manipulation and numerical operations.
- **Scikit-Learn**: For preprocessing (Label Encoding, K-Means Clustering) and baseline models.
- **XGBoost**: The engine behind the high-accuracy prediction model.
- **Groq API**: Provides the Llama 3 LLM for the AI assistant and explanation features.
### Directory Structure
```
Hackathon/
β”œβ”€β”€ app.py # Main application entry point
β”œβ”€β”€ Dockerfile # Container configuration
β”œβ”€β”€ requirements.txt # Project dependencies
β”œβ”€β”€ README.md # Quick start guide
β”œβ”€β”€ src/
β”‚ β”œβ”€β”€ data_loader.py # Data ingestion logic
β”‚ β”œβ”€β”€ preprocessing.py # Feature engineering pipeline
β”‚ └── train_model.py # Model training script
β”œβ”€β”€ models/ # Saved model artifacts (.pkl)
β”œβ”€β”€ data/ # Raw dataset storage
└── docs/ # Project documentation
```
## 3. Implementation Details
### 3.1 Data Pipeline (`src/data_loader.py` & `src/preprocessing.py`)
The data pipeline transforms raw CSV data into machine-learning-ready features.
- **Loading**: Reads `train.csv` and parses dates.
- **Feature Engineering**:
- **Temporal**: Extracts Hour, Day, Month, Year, DayOfWeek.
- **Contextual**: Determines 'Season' (Winter, Spring, Summer, Fall) and 'IsWeekend'.
- **Spatial**: Uses **K-Means Clustering** to group coordinates into 'LocationClusters', identifying high-risk zones.
- **Target Definition**: Creates a binary target `IsViolent` based on crime categories (e.g., Assault, Robbery = 1).
### 3.2 Model Training (`src/train_model.py`)
The training script evaluates multiple models to find the best performer.
1. **Preprocessing**: Applies the pipeline to the training data.
2. **Encoding**: Converts categorical variables (District, Season) into numbers using `LabelEncoder`.
3. **Model Selection**: Trains Naive Bayes, Random Forest, and XGBoost.
4. **Evaluation**: Compares Accuracy, Precision, and Recall.
5. **Artifact Saving**: Saves the best model and encoders to `models/` for the app to use.
### 3.3 The Dashboard (`app.py`)
The main application is divided into several tabs, each serving a specific purpose:
#### **πŸ“Š Historical Trends**
- **Logic**: Aggregates data by hour and district.
- **Viz**: Displays a bar chart for hourly distribution and a pie chart for district breakdown.
#### **πŸ—ΊοΈ Geospatial Intelligence**
- **Logic**: Uses `Folium` to render maps.
- **Features**:
- **Time-Lapse**: Animates crime hotspots over a 24-hour cycle.
- **Static Heatmap**: Shows overall density of incidents.
#### **🚨 Tactical Simulation**
- **Purpose**: Simulates patrol scenarios to assess risk.
- **Logic**: Takes user input (District, Time), processes it through the model, and outputs a risk probability.
- **Output**: A gauge chart showing risk level and actionable recommendations (e.g., "Deploy SWAT").
#### **πŸ’¬ Chat with Data**
- **Purpose**: Natural language query interface.
- **Logic**: A simple intent parser filters the dataframe based on keywords (e.g., "Robbery", "Mission") and dynamically generates charts.
#### **?? Advanced Prediction (99%)**
- **Purpose**: High-precision individual incident prediction.
- **Model**: Uses a specialized XGBoost model (`crime_xgb_artifacts.pkl`) optimized for multi-class classification.
- **Features**:
- **Input Form**: Detailed inputs including address and description.
- **Top 3 Probabilities**: Shows the most likely crime categories.
- **AI Explanation**: Calls the **Groq API** to explain *why* the model made a specific prediction based on the description.
#### **πŸ€– AI Crime Safety Assistant**
- **Implementation**: A chat interface embedded in the app.
- **Logic**: Maintains session state for chat history. Sends user queries + system prompt to Groq (Llama 3) to generate helpful safety advice and model explanations.
## 4. How to Run
1. **Prerequisites**: Python 3.9+ installed.
2. **Installation**:
```bash
pip install -r requirements.txt
```
3. **Execution**:
```bash
streamlit run app.py
```
## 5. Future Improvements
- **Real-time Data**: Connect to a live police API.
- **User Accounts**: Save preferences and history.
- **Mobile App**: Wrap the dashboard for mobile deployment.
---
*Generated by Antigravity*