File size: 5,570 Bytes
61b0513
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
ο»Ώ# πŸš“ San Francisco Crime Analytics & Prediction System - Project Documentation

## 1. Project Overview
This project is a sophisticated AI-powered dashboard designed to analyze historical crime data in San Francisco and predict future incidents with high accuracy. It serves as a decision-support tool for law enforcement and a safety awareness tool for citizens.

The system combines:
-   **Data Analytics**: Visualizing crime trends, hotspots, and distributions.
-   **Machine Learning**: Using XGBoost and Random Forest to classify crimes as violent or non-violent.
-   **Generative AI**: Integrating Groq (Llama 3) for natural language explanations and a conversational assistant.

## 2. Architecture & Technology Stack

### Frontend
-   **Streamlit**: The core framework for the web interface. It handles the layout, user inputs, and visualization rendering.
-   **Plotly**: Used for interactive charts (bar charts, pie charts, gauge charts).
-   **Folium**: Used for geospatial visualizations (heatmaps, time-lapse maps).

### Backend & Logic
-   **Python**: The primary programming language.
-   **Pandas & NumPy**: For data manipulation and numerical operations.
-   **Scikit-Learn**: For preprocessing (Label Encoding, K-Means Clustering) and baseline models.
-   **XGBoost**: The engine behind the high-accuracy prediction model.
-   **Groq API**: Provides the Llama 3 LLM for the AI assistant and explanation features.

### Directory Structure
```
Hackathon/
β”œβ”€β”€ app.py                 # Main application entry point
β”œβ”€β”€ Dockerfile             # Container configuration
β”œβ”€β”€ requirements.txt       # Project dependencies
β”œβ”€β”€ README.md              # Quick start guide
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data_loader.py     # Data ingestion logic
β”‚   β”œβ”€β”€ preprocessing.py   # Feature engineering pipeline
β”‚   └── train_model.py     # Model training script
β”œβ”€β”€ models/                # Saved model artifacts (.pkl)
β”œβ”€β”€ data/                  # Raw dataset storage
└── docs/                  # Project documentation
```

## 3. Implementation Details

### 3.1 Data Pipeline (`src/data_loader.py` & `src/preprocessing.py`)
The data pipeline transforms raw CSV data into machine-learning-ready features.

-   **Loading**: Reads `train.csv` and parses dates.
-   **Feature Engineering**:
    -   **Temporal**: Extracts Hour, Day, Month, Year, DayOfWeek.
    -   **Contextual**: Determines 'Season' (Winter, Spring, Summer, Fall) and 'IsWeekend'.
    -   **Spatial**: Uses **K-Means Clustering** to group coordinates into 'LocationClusters', identifying high-risk zones.
    -   **Target Definition**: Creates a binary target `IsViolent` based on crime categories (e.g., Assault, Robbery = 1).

### 3.2 Model Training (`src/train_model.py`)
The training script evaluates multiple models to find the best performer.

1.  **Preprocessing**: Applies the pipeline to the training data.
2.  **Encoding**: Converts categorical variables (District, Season) into numbers using `LabelEncoder`.
3.  **Model Selection**: Trains Naive Bayes, Random Forest, and XGBoost.
4.  **Evaluation**: Compares Accuracy, Precision, and Recall.
5.  **Artifact Saving**: Saves the best model and encoders to `models/` for the app to use.

### 3.3 The Dashboard (`app.py`)
The main application is divided into several tabs, each serving a specific purpose:

#### **πŸ“Š Historical Trends**
-   **Logic**: Aggregates data by hour and district.
-   **Viz**: Displays a bar chart for hourly distribution and a pie chart for district breakdown.

#### **πŸ—ΊοΈ Geospatial Intelligence**
-   **Logic**: Uses `Folium` to render maps.
-   **Features**:
    -   **Time-Lapse**: Animates crime hotspots over a 24-hour cycle.
    -   **Static Heatmap**: Shows overall density of incidents.

#### **🚨 Tactical Simulation**
-   **Purpose**: Simulates patrol scenarios to assess risk.
-   **Logic**: Takes user input (District, Time), processes it through the model, and outputs a risk probability.
-   **Output**: A gauge chart showing risk level and actionable recommendations (e.g., "Deploy SWAT").

#### **πŸ’¬ Chat with Data**
-   **Purpose**: Natural language query interface.
-   **Logic**: A simple intent parser filters the dataframe based on keywords (e.g., "Robbery", "Mission") and dynamically generates charts.

#### **?? Advanced Prediction (99%)**
-   **Purpose**: High-precision individual incident prediction.
-   **Model**: Uses a specialized XGBoost model (`crime_xgb_artifacts.pkl`) optimized for multi-class classification.
-   **Features**:
    -   **Input Form**: Detailed inputs including address and description.
    -   **Top 3 Probabilities**: Shows the most likely crime categories.
    -   **AI Explanation**: Calls the **Groq API** to explain *why* the model made a specific prediction based on the description.

#### **πŸ€– AI Crime Safety Assistant**
-   **Implementation**: A chat interface embedded in the app.
-   **Logic**: Maintains session state for chat history. Sends user queries + system prompt to Groq (Llama 3) to generate helpful safety advice and model explanations.

## 4. How to Run

1.  **Prerequisites**: Python 3.9+ installed.
2.  **Installation**:
    ```bash
    pip install -r requirements.txt
    ```
3.  **Execution**:
    ```bash
    streamlit run app.py
    ```

## 5. Future Improvements
-   **Real-time Data**: Connect to a live police API.
-   **User Accounts**: Save preferences and history.
-   **Mobile App**: Wrap the dashboard for mobile deployment.

---
*Generated by Antigravity*