NetworkSecurity / README.md
Inder-26
Fix data ingestion path, update README images, and enable reload
2d7183c
---
title: NetworkSecurity
emoji: 😻
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# πŸ›‘οΈ Network Security System: Phishing URL Detection
![UI Homepage 1](images/home_page_1.png)
![UI Homepage 2](images/home_page_2.png)
![UI Homepage 3](images/home_page_3.png)
![UI Homepage 4](images/home_page_4.png)
![UI Homepage 5](images/home_page_5.png)
## πŸ“‹ Table of Contents
- [About The Project](#-about-the-project)
- [Architecture](#-architecture)
- [Features](#-features)
- [Tech Stack](#-tech-stack)
- [Dataset](#-dataset)
- [Project Structure](#-project-structure)
- [Pipeline Workflow](#-pipeline-workflow)
- [Screenshots](#-screenshots)
- [Installation](#-installation)
- [Usage](#-usage)
- [Model Performance](#-model-performance)
- [Experiment Tracking](#-experiment-tracking)
- [Future Enhancements](#-future-enhancements)
- [Contributing](#-contributing)
- [License](#-license)
- [Contact](#-contact)
## πŸš€ Live Demo
- **Live Application**: [inderjeet-networksecurity.hf.space](https://inderjeet-networksecurity.hf.space/)
- **Experiment Tracking**: [DagsHub Experiments](https://dagshub.com/Inder-26/NetworkSecurity/experiments#/)
## 🎯 About The Project
In the digital age, cybersecurity threats such as phishing attacks are becoming increasingly sophisticated. This project implements a robust **Network Security Machine Learning Pipeline** designed to detect phishing URLs with high accuracy.
It leverages a modular MLOps architecture, ensuring scalability, maintainability, and reproducibility. The system automates the entire flow from data ingestion to model deployment, utilizing advanced techniques like drift detection and automated model evaluation.
## πŸ—οΈ Architecture
The system follows a strict modular pipeline architecture, orchestrated by a central training pipeline.
![Architecture Diagram](images/architecture_diagram.png)
## ✨ Features
- **πŸš€ End-to-End Pipeline**: Fully automated workflow from data ingestion to model deployment.
- **πŸ›‘οΈ Data Validation**: Comprehensive schema checks and data drift detection using KS tests.
- **πŸ”„ Robust Preprocessing**: Automated handling of missing values (KNN Imputer) and feature scaling (Robust Scaler).
- **πŸ€– Multi-Model Training**: Experiments with RandomForest, DecisionTree, GradientBoosting, and AdaBoost using GridSearchCV.
- **πŸ“Š Experiment Tracking**: Integrated with **MLflow** and **DagsHub** for tracking parameters, metrics, and models.
- **⚑ Fast API**: High-performance REST API built with **FastAPI** for real-time predictions.
- **🐳 Containerized**: Docker support for consistent deployment across environments.
- **☁️ Cloud Ready**: Designed to be deployed on platforms like AWS or Hugging Face Spaces.
## πŸ› οΈ Tech Stack
- **Languages**: Python 3.8+
- **Frameworks**: FastAPI, Uvicorn
- **ML Libraries**: Scikit-learn, Pandas, NumPy
- **MLOps**: MLflow, DagsHub
- **Database**: MongoDB
- **Containerization**: Docker
- **Frontend**: HTML, CSS (Custom Design System), JavaScript
## πŸ“Š Dataset
The project utilizes a dataset containing various URL features to distinguish between legitimate and phishing URLs.
- **Source**: [Phishing Dataset for Machine Learning](https://archive.ics.uci.edu/ml/datasets/Phishing+Websites) (or similar Phishing URL dataset)
- **Features**: IP Address, URL Length, TinyURL, forwarding, etc.
- **Target**: `Result` (LEGITIMATE / PHISHING)
## πŸ“ Project Structure
```
NetworkSecurity/
β”œβ”€β”€ images/ # Project diagrams and screenshots
β”œβ”€β”€ networksecurity/ # Main package
β”‚ β”œβ”€β”€ components/ # Pipeline components (Ingestion, Validation, Transformation, Training)
β”‚ β”œβ”€β”€ pipeline/ # Training and Prediction pipelines
β”‚ β”œβ”€β”€ entity/ # Artifact and Config entities
β”‚ β”œβ”€β”€ constants/ # Project constants
β”‚ β”œβ”€β”€ utils/ # Utility functions
β”‚ └── exception/ # Custom exception handling
β”œβ”€β”€ data_schema/ # Schema definitions
β”œβ”€β”€ Dockerfile # Docker configuration
β”œβ”€β”€ app.py # FastAPI application entry point
β”œβ”€β”€ requirements.txt # Project dependencies
└── README.md # Project documentation
```
## βš™οΈ Pipeline Workflow
### 1. Data Ingestion πŸ“₯
Fetches data from MongoDB, handles fallback to local CSV, and performs train-test split.
![Data Ingestion](images/data_ingestion_diagram.png)
### 2. Data Validation βœ…
Validates data against schema and checks for data drift.
![Data Validation](images/data_validation_diagram.png)
### 3. Data Transformation πŸ”„
Imputes missing values and scales features for optimal model performance.
![Data Transformation](images/data_transformation_diagram.png)
### 4. Model Training πŸ€–
Trains and tunes multiple models, selecting the best one based on F1-score/Accuracy.
![Model Training](images/model_training_diagram.png)
## πŸ“Έ Screenshots
### Prediction Results & Threat Assessment
![Prediction Results](images/prediction_results.png)
### Experiment Tracking (DagsHub/MLflow)
![Experiment Tracking](images/dagshub_experiments.png)
## πŸ’» Installation
### Prerequisites
- Python 3.8+
- MongoDB Account
- DagsHub Account (for experiment tracking)
### Step-by-Step
1. **Clone the Repository**
```bash
git clone https://github.com/Inder-26/NetworkSecurity.git
cd NetworkSecurity
```
2. **Create Virtual Environment**
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```
3. **Install Dependencies**
```bash
pip install -r requirements.txt
```
4. **Set Environment Variables**
Create a `.env` file with your credentials:
```env
MONGO_DB_URL=your_mongodb_url_here
MLFLOW_TRACKING_URI=https://dagshub.com/your_username/project.mlflow
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_password
```
## πŸš€ Usage
### Run the Web Application
```bash
python app.py
```
Visit `http://localhost:8000` to access the UI.
### Train a New Model
To trigger the training pipeline:
```bash
http://localhost:8000/train
```
Or use the "Train New Model" button in the UI.
## πŸ“ˆ Model Performance
The system evaluates models using accuracy and F1 score.
- **Best Model**: [Automatically selected, typically RandomForest or GradientBoosting]
- **Recall**: Optimized to minimize false negatives (missing a phishing URL is dangerous).
### Model Evaluation Metrics
Below are the performance visualizations for the best trained model:
#### Confusion Matrix
![Confusion Matrix](images/confusion_matrix.png)
#### ROC Curve
![ROC Curve](images/roc_curve.png)
#### Precision-Recall Curve
![Precision-Recall Curve](images/precision_recall_curve.png)
## πŸ§ͺ Experiment Tracking
All runs are logged to DagsHub. You can view parameters, metrics, and models in the MLflow UI.
## πŸš€ Future Enhancements
- [ ] Implement Deep Learning models (LSTM/CNN) for URL text analysis.
- [ ] Add real-time browser extension.
- [ ] Deploy serverless architecture.
- [ ] Add more comprehensive unit and integration tests.
## 🀝 Contributing
Contributions are welcome! Please fork the repository and create a pull request.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## πŸ“„ License
Distributed under the MIT License. See `LICENSE` for more information.
## πŸ“ž Contact
Inder - [GitHub Profile](https://github.com/Inder-26)