# 🚓 San Francisco Crime Analytics & Prediction System - Project Documentation

## 1. Project Overview

This project is a sophisticated AI-powered dashboard designed to analyze historical crime data in San Francisco and predict future incidents with high accuracy. It serves as a decision-support tool for law enforcement and a safety awareness tool for citizens.

The system combines:

- **Data Analytics**: Visualizing crime trends, hotspots, and distributions.
- **Machine Learning**: Using XGBoost and Random Forest to classify crimes as violent or non-violent.
- **Generative AI**: Integrating Groq (Llama 3) for natural language explanations and a conversational assistant.

## 2. Architecture & Technology Stack

### Frontend

- **Streamlit**: The core framework for the web interface. It handles the layout, user inputs, and visualization rendering.
- **Plotly**: Used for interactive charts (bar charts, pie charts, gauge charts).
- **Folium**: Used for geospatial visualizations (heatmaps, time-lapse maps).

### Backend & Logic

- **Python**: The primary programming language.
- **Pandas & NumPy**: For data manipulation and numerical operations.
- **Scikit-Learn**: For preprocessing (Label Encoding, K-Means Clustering) and baseline models.
- **XGBoost**: The engine behind the high-accuracy prediction model.
- **Groq API**: Provides the Llama 3 LLM for the AI assistant and explanation features.

### Directory Structure

```
Hackathon/
├── app.py                 # Main application entry point
├── Dockerfile             # Container configuration
├── requirements.txt       # Project dependencies
├── README.md              # Quick start guide
├── src/
│   ├── data_loader.py     # Data ingestion logic
│   ├── preprocessing.py   # Feature engineering pipeline
│   └── train_model.py     # Model training script
├── models/                # Saved model artifacts (.pkl)
├── data/                  # Raw dataset storage
└── docs/                  # Project documentation
```

## 3. Implementation Details

### 3.1 Data Pipeline (`src/data_loader.py` & `src/preprocessing.py`)

The data pipeline transforms raw CSV data into machine-learning-ready features.

- **Loading**: Reads `train.csv` and parses dates.
- **Feature Engineering**:
  - **Temporal**: Extracts Hour, Day, Month, Year, DayOfWeek.
  - **Contextual**: Determines 'Season' (Winter, Spring, Summer, Fall) and 'IsWeekend'.
  - **Spatial**: Uses **K-Means Clustering** to group coordinates into 'LocationClusters', identifying high-risk zones.
- **Target Definition**: Creates a binary target `IsViolent` based on crime categories (e.g., Assault, Robbery = 1).

### 3.2 Model Training (`src/train_model.py`)

The training script evaluates multiple models to find the best performer.

1. **Preprocessing**: Applies the pipeline to the training data.
2. **Encoding**: Converts categorical variables (District, Season) into numbers using `LabelEncoder`.
3. **Model Selection**: Trains Naive Bayes, Random Forest, and XGBoost.
4. **Evaluation**: Compares Accuracy, Precision, and Recall.
5. **Artifact Saving**: Saves the best model and encoders to `models/` for the app to use.

### 3.3 The Dashboard (`app.py`)

The main application is divided into several tabs, each serving a specific purpose:

#### **📊 Historical Trends**

- **Logic**: Aggregates data by hour and district.
- **Viz**: Displays a bar chart for hourly distribution and a pie chart for district breakdown.

#### **🗺️ Geospatial Intelligence**

- **Logic**: Uses `Folium` to render maps.
- **Features**:
  - **Time-Lapse**: Animates crime hotspots over a 24-hour cycle.
  - **Static Heatmap**: Shows overall density of incidents.

#### **🚨 Tactical Simulation**

- **Purpose**: Simulates patrol scenarios to assess risk.
- **Logic**: Takes user input (District, Time), processes it through the model, and outputs a risk probability.
- **Output**: A gauge chart showing risk level and actionable recommendations (e.g., "Deploy SWAT").
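
The Tactical Simulation flow (user-selected time in, risk level out) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `temporal_features` and `risk_band`, the month-to-season mapping, and the 0.4 / 0.7 thresholds are all assumptions made for the example. The feature names follow the temporal and contextual features listed in Section 3.1.

```python
from datetime import datetime

# Assumed mapping (meteorological seasons); the project's real mapping
# is not shown in this document.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def temporal_features(ts: datetime) -> dict:
    """Derive the temporal and contextual features from one timestamp."""
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "Hour": ts.hour,
        "Day": ts.day,
        "Month": ts.month,
        "Year": ts.year,
        "DayOfWeek": day_of_week,
        "Season": SEASON_BY_MONTH[ts.month],
        "IsWeekend": int(day_of_week >= 5),  # Saturday or Sunday
    }


def risk_band(probability: float) -> str:
    """Map a model probability to a gauge label.

    The thresholds are hypothetical and chosen only for illustration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.7:
        return "High"
    if probability >= 0.4:
        return "Moderate"
    return "Low"


# Example: a Saturday-night patrol scenario.
features = temporal_features(datetime(2015, 5, 9, 23, 30))
print(features["Season"], features["IsWeekend"], risk_band(0.82))
```

In the app, the feature dictionary would be combined with the encoded district and location cluster before being passed to the saved model; that step is omitted here because the encoders and model artifacts are not described in enough detail to reproduce.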

#### **💬 Chat with Data**

- **Purpose**: Natural language query interface.
- **Logic**: A simple intent parser filters the dataframe based on keywords (e.g., "Robbery", "Mission") and dynamically generates charts.

#### **Advanced Prediction (99%)**

- **Purpose**: High-precision individual incident prediction.
- **Model**: Uses a specialized XGBoost model (`crime_xgb_artifacts.pkl`) optimized for multi-class classification.
- **Features**:
  - **Input Form**: Detailed inputs including address and description.
  - **Top 3 Probabilities**: Shows the most likely crime categories.
  - **AI Explanation**: Calls the **Groq API** to explain *why* the model made a specific prediction based on the description.

#### **🤖 AI Crime Safety Assistant**

- **Implementation**: A chat interface embedded in the app.
- **Logic**: Maintains session state for chat history. Sends user queries + system prompt to Groq (Llama 3) to generate helpful safety advice and model explanations.

## 4. How to Run

1. **Prerequisites**: Python 3.9+ installed.
2. **Installation**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Execution**:
   ```bash
   streamlit run app.py
   ```

## 5. Future Improvements

- **Real-time Data**: Connect to a live police API.
- **User Accounts**: Save preferences and history.
- **Mobile App**: Wrap the dashboard for mobile deployment.

---

*Generated by Antigravity*
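
As an illustration of the keyword-based intent parser described under "Chat with Data" in Section 3.3, the filtering step might look something like the sketch below. It uses plain lists of dictionaries instead of a pandas dataframe so that it runs without dependencies. The keyword vocabularies, the record keys `Category` and `District`, and the function names `parse_intent` and `filter_records` are assumptions for this example, and chart generation is left out.

```python
# Hypothetical vocabularies; the app's real keyword lists are not shown
# in this document.
KNOWN_CATEGORIES = {"robbery", "assault", "burglary"}
KNOWN_DISTRICTS = {"mission", "tenderloin", "richmond"}


def parse_intent(query: str) -> dict:
    """Pick out known category and district keywords from a free-text query."""
    # Lowercase each word and strip surrounding punctuation before matching.
    words = {w.strip(".,!?\"'").lower() for w in query.split()}
    return {
        "categories": sorted(words & KNOWN_CATEGORIES),
        "districts": sorted(words & KNOWN_DISTRICTS),
    }


def filter_records(records: list, intent: dict) -> list:
    """Keep the records that satisfy every non-empty filter in the intent."""
    def matches(rec: dict) -> bool:
        if intent["categories"] and rec["Category"].lower() not in intent["categories"]:
            return False
        if intent["districts"] and rec["District"].lower() not in intent["districts"]:
            return False
        return True

    return [rec for rec in records if matches(rec)]


# Example usage with a few made-up records.
sample = [
    {"Category": "ROBBERY", "District": "MISSION"},
    {"Category": "ASSAULT", "District": "MISSION"},
    {"Category": "ROBBERY", "District": "RICHMOND"},
]
intent = parse_intent("Show me robbery in the Mission!")
print(filter_records(sample, intent))
```

A set intersection keeps the parser simple, but it only matches single-word keywords; multi-word categories or district names would need phrase matching instead.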