hackathonn / docs /PROJECT_DOCUMENTATION.md
MHuzaifaa's picture
Upload project
61b0513

ο»Ώ# πŸš“ San Francisco Crime Analytics & Prediction System - Project Documentation

1. Project Overview

This project is a sophisticated AI-powered dashboard designed to analyze historical crime data in San Francisco and predict future incidents with high accuracy. It serves as a decision-support tool for law enforcement and a safety awareness tool for citizens.

The system combines:

  • Data Analytics: Visualizing crime trends, hotspots, and distributions.
  • Machine Learning: Using XGBoost and Random Forest to classify crimes as violent or non-violent.
  • Generative AI: Integrating Groq (Llama 3) for natural language explanations and a conversational assistant.

2. Architecture & Technology Stack

Frontend

  • Streamlit: The core framework for the web interface. It handles the layout, user inputs, and visualization rendering.
  • Plotly: Used for interactive charts (bar charts, pie charts, gauge charts).
  • Folium: Used for geospatial visualizations (heatmaps, time-lapse maps).

Backend & Logic

  • Python: The primary programming language.
  • Pandas & NumPy: For data manipulation and numerical operations.
  • Scikit-Learn: For preprocessing (Label Encoding, K-Means Clustering) and baseline models.
  • XGBoost: The engine behind the high-accuracy prediction model.
  • Groq API: Provides the Llama 3 LLM for the AI assistant and explanation features.

Directory Structure

Hackathon/
β”œβ”€β”€ app.py                 # Main application entry point
β”œβ”€β”€ Dockerfile             # Container configuration
β”œβ”€β”€ requirements.txt       # Project dependencies
β”œβ”€β”€ README.md              # Quick start guide
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data_loader.py     # Data ingestion logic
β”‚   β”œβ”€β”€ preprocessing.py   # Feature engineering pipeline
β”‚   └── train_model.py     # Model training script
β”œβ”€β”€ models/                # Saved model artifacts (.pkl)
β”œβ”€β”€ data/                  # Raw dataset storage
└── docs/                  # Project documentation

3. Implementation Details

3.1 Data Pipeline (src/data_loader.py & src/preprocessing.py)

The data pipeline transforms raw CSV data into machine-learning-ready features.

  • Loading: Reads train.csv and parses dates.
  • Feature Engineering:
    • Temporal: Extracts Hour, Day, Month, Year, DayOfWeek.
    • Contextual: Determines 'Season' (Winter, Spring, Summer, Fall) and 'IsWeekend'.
    • Spatial: Uses K-Means Clustering to group coordinates into 'LocationClusters', identifying high-risk zones.
    • Target Definition: Creates a binary target IsViolent based on crime categories (e.g., Assault, Robbery = 1).

3.2 Model Training (src/train_model.py)

The training script evaluates multiple models to find the best performer.

  1. Preprocessing: Applies the pipeline to the training data.
  2. Encoding: Converts categorical variables (District, Season) into numbers using LabelEncoder.
  3. Model Selection: Trains Naive Bayes, Random Forest, and XGBoost.
  4. Evaluation: Compares Accuracy, Precision, and Recall.
  5. Artifact Saving: Saves the best model and encoders to models/ for the app to use.

3.3 The Dashboard (app.py)

The main application is divided into several tabs, each serving a specific purpose:

πŸ“Š Historical Trends

  • Logic: Aggregates data by hour and district.
  • Viz: Displays a bar chart for hourly distribution and a pie chart for district breakdown.

πŸ—ΊοΈ Geospatial Intelligence

  • Logic: Uses Folium to render maps.
  • Features:
    • Time-Lapse: Animates crime hotspots over a 24-hour cycle.
    • Static Heatmap: Shows overall density of incidents.

🚨 Tactical Simulation

  • Purpose: Simulates patrol scenarios to assess risk.
  • Logic: Takes user input (District, Time), processes it through the model, and outputs a risk probability.
  • Output: A gauge chart showing risk level and actionable recommendations (e.g., "Deploy SWAT").

πŸ’¬ Chat with Data

  • Purpose: Natural language query interface.
  • Logic: A simple intent parser filters the dataframe based on keywords (e.g., "Robbery", "Mission") and dynamically generates charts.

?? Advanced Prediction (99%)

  • Purpose: High-precision individual incident prediction.
  • Model: Uses a specialized XGBoost model (crime_xgb_artifacts.pkl) optimized for multi-class classification.
  • Features:
    • Input Form: Detailed inputs including address and description.
    • Top 3 Probabilities: Shows the most likely crime categories.
    • AI Explanation: Calls the Groq API to explain why the model made a specific prediction based on the description.

πŸ€– AI Crime Safety Assistant

  • Implementation: A chat interface embedded in the app.
  • Logic: Maintains session state for chat history. Sends user queries + system prompt to Groq (Llama 3) to generate helpful safety advice and model explanations.

4. How to Run

  1. Prerequisites: Python 3.9+ installed.
  2. Installation:
    pip install -r requirements.txt
    
  3. Execution:
    streamlit run app.py
    

5. Future Improvements

  • Real-time Data: Connect to a live police API.
  • User Accounts: Save preferences and history.
  • Mobile App: Wrap the dashboard for mobile deployment.

Generated by Antigravity