---
title: PhishGuard Pro
emoji: 🛡️
colorFrom: red
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
license: mit
---

# 🛡️ PhishGuard Pro: Advanced Hybrid AI Scam & Phishing Detector

## Project Overview

**PhishGuard Pro** is an enterprise-grade AI tool designed to detect, analyze, and explain phishing emails, SMS scams (smishing), and financial fraud attempts.

It uses a **Hybrid AI Architecture** that combines fast sequence classification (BERT) with generative explainability (RAG + LLM), so it not only flags malicious content but also provides actionable cybersecurity advice in English.

This project was developed as a portfolio piece demonstrating AI engineering skills in fintech and cybersecurity.
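
The hybrid architecture described in this overview can be sketched as a two-stage function: a cheap classifier runs on every message, and the slower retrieval and LLM stage runs only on flagged ones. This is a minimal sketch under stated assumptions, not the code in `app.py`: the `analyze` and `build_prompt` names, the `"phishing"` label string, the `0.5` threshold, and the gating rule itself are all hypothetical, and the three stages are passed in as plain callables so the example runs without downloading any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Verdict:
    label: str
    score: float
    explanation: Optional[str]  # None when the LLM stage was skipped


def build_prompt(message: str, notes: Sequence[str]) -> str:
    """Build a constrained prompt from the message and retrieved notes."""
    bullet_notes = "\n".join(f"- {n}" for n in notes) or "- (no notes retrieved)"
    return (
        "You are a security analyst. Use ONLY the reference notes below.\n"
        "If the notes do not cover the message, say so instead of guessing.\n\n"
        f"Reference notes:\n{bullet_notes}\n\n"
        f"Message:\n{message}\n\n"
        "Answer with: 1) why it is suspicious, 2) recommended actions."
    )


def analyze(
    message: str,
    classify: Callable[[str], Tuple[str, float]],   # stage 1: fast classifier
    retrieve: Callable[[str], Sequence[str]],       # stage 2a: knowledge lookup
    generate: Callable[[str], str],                 # stage 2b: LLM call
    threshold: float = 0.5,                         # assumed cut-off
) -> Verdict:
    label, score = classify(message)
    # Assumption: only messages flagged as phishing reach the slower stage.
    if label != "phishing" or score < threshold:
        return Verdict(label, score, None)
    notes = retrieve(message)
    return Verdict(label, score, generate(build_prompt(message, notes)))
```

In a real deployment the `classify`, `retrieve`, and `generate` arguments would wrap the classifier, vector store, and LLM; here they can be replaced with simple stand-in functions for testing.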

---

## 🛠️ What I Built vs. What is Ready-Made

To maintain transparency and highlight the specific engineering effort, here is a breakdown of my custom implementation versus the open-source tools used:

### What I Built (My Core AI Engineering Contribution)

1. **Hybrid AI Pipeline Architecture**: Architected a dual-stage inference pipeline that fuses fast BERT-based sequence classification with LLM-powered contextual reasoning, balancing real-time latency against deeper analysis.
2. **Specialized Financial RAG Engineering**: Curated and embedded a vector knowledge base focused on sophisticated attack vectors (e.g., *Pig Butchering*, *CEO Fraud / BEC*, advanced *Smishing*), enabling the AI to recognize and explain complex social engineering tactics.
3. **Automated IoC Forensics Extraction**: Engineered a deterministic Threat Intelligence layer that uses regular expressions to parse raw inputs and isolate Indicators of Compromise (malicious domains, burner emails, spoofed numbers) for immediate forensic visibility.
4. **Guardrailed Prompt Design**: Implemented strict, constraint-based prompts that mitigate LLM hallucination and force the generation of standardized, actionable Incident Response Plans.
5. **Enterprise-Grade Analytical Dashboard**: Developed a dynamic, responsive security terminal using Gradio and **Plotly** that combines classification metrics, threat probabilities (Interactive Risk Gauge), and LLM reasoning in a single analyst dashboard.
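
The regex-based IoC extraction in item 3 could look roughly like the sketch below. It is an illustration under stated assumptions rather than the patterns the app actually uses: `extract_iocs` and the three patterns are hypothetical, deliberately simple, and will miss obfuscated indicators (for example `hxxp://` or `[.]` defanging).

```python
import re
from typing import Dict, List
from urllib.parse import urlsplit

# Illustrative patterns only (assumptions, not the app's real rules).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def _unique(items: List[str]) -> List[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_iocs(text: str) -> Dict[str, List[str]]:
    """Return URLs, their hostnames, email addresses, and phone-like strings."""
    # Trim sentence punctuation that the URL pattern may have swallowed.
    urls = [u.rstrip(".,;:!?)") for u in URL_RE.findall(text)]
    hosts = [urlsplit(u).hostname for u in urls]
    return {
        "urls": _unique(urls),
        "domains": _unique([h for h in hosts if h]),
        "emails": _unique(EMAIL_RE.findall(text)),
        "phones": _unique([p.strip() for p in PHONE_RE.findall(text)]),
    }
```

Because the extraction is deterministic, the same message always yields the same indicators, which makes this layer easy to unit-test independently of the models.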

### 📦 Ready-Made Open-Source Models (The Giants I Stand On)

I integrated freely available open-source models to get strong accuracy at zero deployment cost:

- **Phishing Classifier**: `Auguzcht/securisense-phishing-detection` (Fine-tuned BERT-base).
- **Vector Embeddings**: `sentence-transformers/all-MiniLM-L6-v2` (Fast deployment embeddings).
- **Reasoning Engine (LLM)**: `HuggingFaceH4/zephyr-7b-beta` (Highly capable instruction-tuned 7B model).
- **Orchestration**: `LangChain` (Vector DB bridging) and `FAISS` (In-memory similarity search).
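
To show what the embedding model and FAISS contribute, here is a dependency-free sketch of similarity search: each knowledge-base entry is stored with a vector, and the entries closest to the query vector are returned. The three-dimensional vectors and the `cosine` and `top_k` helpers are toy assumptions for illustration. In the real stack the vectors would come from `all-MiniLM-L6-v2` (384 dimensions) and the search would be delegated to FAISS rather than a Python sort.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k entries most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy knowledge base: hand-made vectors stand in for real embeddings.
KNOWLEDGE_BASE = [
    ("Pig butchering: long-con investment scam", [0.9, 0.1, 0.0]),
    ("CEO fraud / BEC: urgent payment request from an executive", [0.1, 0.9, 0.0]),
    ("Smishing: SMS with a fake delivery or bank link", [0.0, 0.2, 0.9]),
]
```

The retrieved texts are what gets passed to the LLM as context, so retrieval quality directly limits how well the explanation matches the actual scam type.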

---

## How to Run Locally

1. Clone the repository and install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
2. Run the Gradio app:
   ```bash
   python app.py
   ```

*Note: Because this application uses large language models, it may require significant memory (RAM/VRAM) to run well locally. On Hugging Face Spaces, it runs within the available hardware limits.*

## 🔮 Future Roadmap (Enterprise Scaling)

While the current architecture serves as an effective **Minimum Viable Product (MVP)**, moving it to a production-grade enterprise deployment would involve the following architectural upgrades:

1. **Model Fine-Tuning (Data-Centric Optimization)**
   * **What:** Fine-tuning the base sequence classification model (e.g., using larger BERT variants) on datasets rich in contextual smishing (SMS phishing) and WhatsApp fraud.
   * **Why:** Scammers rely heavily on social engineering formatted specifically for mobile platforms. Fine-tuning on such data should improve precision against previously unseen telecom fraud.
2. **Live Threat Intelligence Integration (Dynamic Vector DB)**
   * **What:** Migrating the static in-memory RAG store to a live, distributed Vector Database (such as `Pinecone` or `Milvus`) connected to automated OSINT (Open-Source Intelligence) threat feeds.
   * **Why:** Scam narratives evolve daily. A dynamic Vector DB lets the AI's contextual knowledge base update continuously without application rebuilds or downtime.
3. **Active URL Sandboxing & API Verification**
   * **What:** Automatically routing the extracted Indicators of Compromise (IoCs) through professional threat aggregation APIs like `VirusTotal` or `Google Safe Browsing`.
   * **Why:** The current system focuses on behavioral linguistic analysis. Combining those AI heuristics with deterministic IP/URL reputation checks adds a second, independent layer of defense.
4. **Autonomous AI Agents (Tool-Calling Integration)**
   * **What:** Upgrading the passive RAG pipeline into an active **Autonomous Agent** framework (via LangChain Agents) equipped with tools such as a `SandboxBrowserTool` and `DomainLookupTool`.
   * **Why:** Instead of merely analyzing the text of a message, an Agent can *investigate* it. If an email contains a link, the Agent opens the link in a sandbox, observes the webpage (e.g., detecting a cloned PayPal login screen), checks the domain registration date, and synthesizes a forensic report. **This kind of active investigation goes well beyond static text analysis.**
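
Roadmap item 3 combines the model's probability with a deterministic reputation check. One simple way to fuse the two is sketched below, assuming a hypothetical `lookup` callable that wraps whichever reputation service is chosen. No real `VirusTotal` or `Google Safe Browsing` call is made here, and the `Reputation` enum, `fused_risk` name, and `0.95` floor are illustrative assumptions.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Tuple


class Reputation(Enum):
    CLEAN = "clean"
    UNKNOWN = "unknown"
    MALICIOUS = "malicious"


def fused_risk(
    model_prob: float,
    urls: Iterable[str],
    lookup: Callable[[str], Reputation],   # hypothetical reputation client
    floor_if_malicious: float = 0.95,      # assumed minimum risk on a hit
) -> Tuple[float, Dict[str, Reputation]]:
    """Raise the risk score when any extracted URL is known to be malicious."""
    if not 0.0 <= model_prob <= 1.0:
        raise ValueError("model_prob must be between 0 and 1")
    verdicts = {url: lookup(url) for url in urls}
    if any(v is Reputation.MALICIOUS for v in verdicts.values()):
        return max(model_prob, floor_if_malicious), verdicts
    # A clean or unknown verdict never lowers the score: a freshly
    # registered phishing domain may simply not be listed yet.
    return model_prob, verdicts
```

The asymmetry is deliberate in this sketch: a blocklist hit is strong evidence of a threat, while the absence of a hit is only weak evidence of safety.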

---

## Legal Disclaimer

This tool is for educational and advisory purposes only. Complex fraud schemes evolve rapidly. Always rely on your bank's authorized channels or other official channels for final verification.