---
title: NetworkSecurity
emoji: 😻
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# 🛡️ Network Security System: Phishing URL Detection

![UI Homepage 1](images/home_page_1.png)
![UI Homepage 2](images/home_page_2.png)
![UI Homepage 3](images/home_page_3.png)
![UI Homepage 4](images/home_page_4.png)
![UI Homepage 5](images/home_page_5.png)

## 📋 Table of Contents

- [About The Project](#-about-the-project)
- [Architecture](#-architecture)
- [Features](#-features)
- [Tech Stack](#-tech-stack)
- [Dataset](#-dataset)
- [Project Structure](#-project-structure)
- [Pipeline Workflow](#-pipeline-workflow)
- [Screenshots](#-screenshots)
- [Installation](#-installation)
- [Usage](#-usage)
- [Model Performance](#-model-performance)
- [Experiment Tracking](#-experiment-tracking)
- [Future Enhancements](#-future-enhancements)
- [Contributing](#-contributing)
- [License](#-license)
- [Contact](#-contact)

## 🚀 Live Demo

- **Live Application**: [inderjeet-networksecurity.hf.space](https://inderjeet-networksecurity.hf.space/)
- **Experiment Tracking**: [DagsHub Experiments](https://dagshub.com/Inder-26/NetworkSecurity/experiments#/)

## 🎯 About The Project

Phishing attacks are becoming increasingly sophisticated. This project implements a **Network Security Machine Learning Pipeline** that classifies URLs as legitimate or phishing.

It follows a modular MLOps architecture aimed at scalability, maintainability, and reproducibility. The system automates the flow from data ingestion to model deployment, including data drift detection and automated model evaluation.

## 🏗️ Architecture

The system follows a strict modular pipeline architecture, orchestrated by a central training pipeline.

![Architecture Diagram](images/architecture_diagram.png)

## ✨ Features

- **🚀 End-to-End Pipeline**: Fully automated workflow from data ingestion to model deployment.
- **🛡️ Data Validation**: Comprehensive schema checks and data drift detection using KS tests.
- **🔄 Robust Preprocessing**: Automated handling of missing values (KNN Imputer) and feature scaling (Robust Scaler).
- **🤖 Multi-Model Training**: Experiments with RandomForest, DecisionTree, GradientBoosting, and AdaBoost using GridSearchCV.
- **📊 Experiment Tracking**: Integrated with **MLflow** and **DagsHub** for tracking parameters, metrics, and models.
- **⚡ Fast API**: High-performance REST API built with **FastAPI** for real-time predictions.
- **🐳 Containerized**: Docker support for consistent deployment across environments.
- **☁️ Cloud Ready**: Designed to be deployed on platforms like AWS or Hugging Face Spaces.

## 🛠️ Tech Stack

- **Languages**: Python 3.8+
- **Frameworks**: FastAPI, Uvicorn
- **ML Libraries**: Scikit-learn, Pandas, NumPy
- **MLOps**: MLflow, DagsHub
- **Database**: MongoDB
- **Containerization**: Docker
- **Frontend**: HTML, CSS (Custom Design System), JavaScript

## 📊 Dataset

The project utilizes a dataset containing various URL features to distinguish between legitimate and phishing URLs.

- **Source**: [Phishing Dataset for Machine Learning](https://archive.ics.uci.edu/ml/datasets/Phishing+Websites) (or similar Phishing URL dataset)
- **Features**: IP Address, URL Length, TinyURL, forwarding, etc.
- **Target**: `Result` (LEGITIMATE / PHISHING)

## 📁 Project Structure

```
NetworkSecurity/
├── images/                  # Project diagrams and screenshots
├── networksecurity/         # Main package
│   ├── components/          # Pipeline components (Ingestion, Validation, Transformation, Training)
│   ├── pipeline/            # Training and Prediction pipelines
│   ├── entity/              # Artifact and Config entities
│   ├── constants/           # Project constants
│   ├── utils/               # Utility functions
│   └── exception/           # Custom exception handling
├── data_schema/             # Schema definitions
├── Dockerfile               # Docker configuration
├── app.py                   # FastAPI application entry point
├── requirements.txt         # Project dependencies
└── README.md                # Project documentation
```

## ⚙️ Pipeline Workflow

### 1. Data Ingestion 📥

Fetches data from MongoDB, handles fallback to local CSV, and performs train-test split.
![Data Ingestion](images/data_ingestion_diagram.png)
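
The ingestion step described above (database first, local CSV as fallback, then a train-test split) can be sketched in plain Python. This is only an illustration under assumptions: `load_records`, `train_test_split`, the `fetch_from_db` callable, and the 80/20 split ratio are hypothetical names and values, not the project's actual API.

```python
import csv
import random


def load_records(fetch_from_db, csv_path):
    """Return rows from the database, or from a local CSV if the fetch fails or is empty.

    `fetch_from_db` is a hypothetical zero-argument callable that yields dict rows.
    """
    try:
        records = list(fetch_from_db())
        if records:
            return records
    except Exception as exc:  # broad on purpose: any database error triggers the fallback
        print(f"Database fetch failed ({exc}); falling back to {csv_path}")
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def train_test_split(records, test_ratio=0.2, seed=42):
    """Shuffle reproducibly and split into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]
```

Passing the database access in as a callable keeps the fallback logic testable without a live MongoDB connection.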

### 2. Data Validation ✅

Validates data against schema and checks for data drift.
![Data Validation](images/data_validation_diagram.png)
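
To make the drift check concrete, here is a minimal pure-Python sketch of a two-sample Kolmogorov-Smirnov (KS) comparison between a reference column and a new column. The function names, the `alpha=0.05` threshold, and the large-sample critical-value approximation are assumptions for illustration; the project may well rely on a library routine such as `scipy.stats.ks_2samp` instead.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    for x in set(ref) | set(cur):
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def has_drift(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value.

    The critical value c(alpha) * sqrt((n + m) / (n * m)) is an asymptotic
    approximation and is only an assumption about how a threshold could be set.
    """
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical
```

In a pipeline, a check like this would typically run once per feature column, comparing the training split against the test split or against newly ingested data.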

### 3. Data Transformation 🔄

Imputes missing values and scales features for optimal model performance.
![Data Transformation](images/data_transformation_diagram.png)
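
As a rough illustration of the scaling half of this step, the sketch below applies robust scaling (centre on the median, divide by the interquartile range) to a single complete column using only the standard library. The function name `robust_scale` is hypothetical; the project itself names scikit-learn's KNN Imputer and Robust Scaler in its feature list, and imputation is not covered here.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # constant column: avoid dividing by zero
    return [(value - median) / iqr for value in values]
```

For example, `robust_scale([1, 2, 3, 4, 100])` returns `[-1.0, -0.5, 0.0, 0.5, 48.5]`: the outlier stays large, but it does not compress the other values the way mean and standard deviation scaling would.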

### 4. Model Training 🤖

Trains and tunes multiple models, selecting the best one based on F1-score/Accuracy.
![Model Training](images/model_training_diagram.png)

## 📸 Screenshots

### Prediction Results & Threat Assessment

![Prediction Results](images/prediction_results.png)

### Experiment Tracking (DagsHub/MLflow)

![Experiment Tracking](images/dagshub_experiments.png)

## 💻 Installation

### Prerequisites

- Python 3.8+
- MongoDB Account
- DagsHub Account (for experiment tracking)

### Step-by-Step

1. **Clone the Repository**

   ```bash
   git clone https://github.com/Inder-26/NetworkSecurity.git
   cd NetworkSecurity
   ```

2. **Create Virtual Environment**

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Set Environment Variables**
   Create a `.env` file with your credentials:
   ```env
   MONGO_DB_URL=your_mongodb_url_here
   MLFLOW_TRACKING_URI=https://dagshub.com/your_username/project.mlflow
   MLFLOW_TRACKING_USERNAME=your_username
   MLFLOW_TRACKING_PASSWORD=your_password
   ```
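
Before starting the app, it can help to confirm that the four variables above are actually visible to the process. The sketch below is a hypothetical helper, not part of the repository; it assumes the variables have already been exported or loaded into the environment (for example by a loader such as `python-dotenv`), since a `.env` file is not read automatically by Python.

```python
import os

# Names taken from the .env example above.
REQUIRED_SETTINGS = (
    "MONGO_DB_URL",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling `missing_settings()` at startup and stopping when the list is non-empty gives a clearer error than a failed MongoDB or MLflow connection later on.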

## 🚀 Usage

### Run the Web Application

```bash
python app.py
```

Visit `http://localhost:8000` to access the UI.

### Train a New Model

To trigger the training pipeline, request the `/train` endpoint while the application is running:

```bash
curl http://localhost:8000/train
```

Or use the "Train New Model" button in the UI.

## 📈 Model Performance

The system evaluates models using accuracy and F1 score.

- **Best Model**: Selected automatically; typically RandomForest or GradientBoosting.
- **Recall**: Optimized to minimize false negatives (missing a phishing URL is dangerous).
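
For reference, the metrics named above can be computed from true and predicted labels as in the following sketch. The function name and the convention that `1` marks a phishing URL are assumptions made for illustration; the dataset's actual label encoding may differ.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier.

    `positive` is the label treated as phishing (assumed to be 1 here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # share of phishing URLs caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A false negative (`fn`) is a phishing URL classified as legitimate, which is why recall is the metric to watch when missed detections are the costly error.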

### Model Evaluation Metrics

Below are the performance visualizations for the best trained model:

#### Confusion Matrix

![Confusion Matrix](images/confusion_matrix.png)

#### ROC Curve

![ROC Curve](images/roc_curve.png)

#### Precision-Recall Curve

![Precision-Recall Curve](images/precision_recall_curve.png)

## 🧪 Experiment Tracking

All runs are logged to DagsHub. You can view parameters, metrics, and models in the MLflow UI.

## 🚀 Future Enhancements

- [ ] Implement Deep Learning models (LSTM/CNN) for URL text analysis.
- [ ] Add real-time browser extension.
- [ ] Deploy serverless architecture.
- [ ] Add more comprehensive unit and integration tests.

## 🀝 Contributing

Contributions are welcome! Please fork the repository and create a pull request.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📄 License

Distributed under the MIT License. See `LICENSE` for more information.

## 📞 Contact

Inder - [GitHub Profile](https://github.com/Inder-26)