# Firewall Log Classifier
A machine learning system for automated classification of firewall log entries into four action categories: allow, deny, drop, and reset-both. Built as part of CSAI 801: Artificial Intelligence and Machine Learning.
**Live Application:** https://huggingface.co/spaces/yomnafarag95/Log_Classifier
---
## Overview
Enterprise firewalls generate thousands of log entries per hour, making manual review impractical. This project trains a tuned Random Forest classifier on real network traffic data to automate that review process, achieving 99.56% test accuracy across four action classes.
---
## Model Performance
| Model | Test Accuracy | Macro F1 |
|-------------------------|--------------|----------|
| Random Forest (baseline)| 98.32% | 0.981 |
| Logistic Regression | 99.75% | 0.997 |
| KNN | 99.23% | 0.990 |
| Random Forest (tuned) | **99.56%** | **0.803**|
Tuned hyperparameters: `n_estimators=200`, `max_depth=20`, `min_samples_split=2`
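
The tuned configuration listed above could be expressed in scikit-learn roughly as follows. This is a minimal sketch, not the contents of `retrain.py`: the helper name `build_tuned_forest` and the `random_state` and `n_jobs` values are assumptions added for illustration.

```python
from sklearn.ensemble import RandomForestClassifier


def build_tuned_forest(random_state: int = 42) -> RandomForestClassifier:
    """Return an unfitted Random Forest with the tuned hyperparameters."""
    return RandomForestClassifier(
        n_estimators=200,           # from the tuned settings above
        max_depth=20,               # from the tuned settings above
        min_samples_split=2,        # from the tuned settings above
        random_state=random_state,  # assumption: fixed seed for repeatable runs
        n_jobs=-1,                  # assumption: use all available CPU cores
    )


clf = build_tuned_forest()
print(clf.get_params()["n_estimators"], clf.get_params()["max_depth"])
```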
---
## Dataset
- **Source:** UCI Machine Learning Repository, Internet Firewall Data
- **URL:** https://archive.ics.uci.edu/dataset/542/internet+firewall+data
- **Raw records:** 65,532
- **After deduplication:** 57,170
- **Class distribution:** allow (37,439) · drop (11,635) · deny (8,042) · reset-both (54)
**Input features (11):**
| # | Feature |
|---|----------------------|
| 1 | Source Port |
| 2 | Destination Port |
| 3 | NAT Source Port |
| 4 | NAT Destination Port |
| 5 | Bytes |
| 6 | Bytes Sent |
| 7 | Bytes Received |
| 8 | Packets |
| 9 | Elapsed Time (sec) |
|10 | pkts_sent |
|11 | pkts_received |
---
## Preprocessing Pipeline
1. Duplicate removal (65,532 → 57,170 records)
2. Stratified 70/30 train/test split
3. SMOTE oversampling on training set to balance minority classes
4. StandardScaler normalization
---
## Try the Application
Paste any of the following lines into the application input and click Classify.
Each line contains 11 comma-separated values matching the feature order above.
**Allow**
```
51465,443,39975,443,3961,1595,2366,21,16,12,9
```
**Deny**
```
34086,25174,0,0,62,62,0,1,0,1,0
```
**Drop**
```
51125,445,0,0,66,66,0,1,0,1,0
```
**Reset-Both**
```
64461,31652,0,0,62,62,0,1,0,1,0
```
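
For scripted use, a line in the format above can be validated and mapped to the 11 feature names before classification. The following is a minimal sketch using only the standard library; `parse_log_line` and `FEATURE_NAMES` are hypothetical names, and the assumption that every value is a non-negative integer is based on the sample lines rather than on the application code.

```python
FEATURE_NAMES = [
    "Source Port", "Destination Port", "NAT Source Port", "NAT Destination Port",
    "Bytes", "Bytes Sent", "Bytes Received", "Packets",
    "Elapsed Time (sec)", "pkts_sent", "pkts_received",
]


def parse_log_line(line: str) -> dict[str, int]:
    """Parse one comma-separated line into a feature-name -> value mapping."""
    parts = [p.strip() for p in line.strip().split(",")]
    if len(parts) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(parts)}")
    try:
        values = [int(p) for p in parts]
    except ValueError as exc:
        raise ValueError(f"non-integer value in line: {line!r}") from exc
    if any(v < 0 for v in values):
        raise ValueError("feature values must be non-negative")
    return dict(zip(FEATURE_NAMES, values))


row = parse_log_line("51465,443,39975,443,3961,1595,2366,21,16,12,9")
print(row["Destination Port"], row["Bytes"])  # prints: 443 3961
```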
---
## Run Locally
```bash
git clone https://github.com/yomnafarag95/Log_Classifier.git
cd Log_Classifier
pip install -r requirements.txt
streamlit run app.py
```
---
## Retrain the Model
```bash
pip install scikit-learn imbalanced-learn pandas joblib
python retrain.py
```
Outputs: `model.joblib`, `scaler.joblib`, `label_encoder.joblib`
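
The three output files could be loaded and used for a single prediction along these lines. This is a hedged sketch, not the logic in `app.py`: the helper names `load_artifacts` and `predict_action` are hypothetical, and it assumes the scaler is a fitted `StandardScaler`, the model expects scaled input in the feature order listed earlier, and the label encoder maps predicted integers back to action names.

```python
from pathlib import Path
from typing import Sequence


def load_artifacts(directory: str = "."):
    """Load the three files written by retrain.py from the given directory."""
    import joblib  # imported here so predict_action can be used without joblib

    base = Path(directory)
    return (
        joblib.load(base / "model.joblib"),
        joblib.load(base / "scaler.joblib"),
        joblib.load(base / "label_encoder.joblib"),
    )


def predict_action(values: Sequence[float], model, scaler, label_encoder) -> str:
    """Scale one 11-value feature row, classify it, and decode the label."""
    scaled = scaler.transform([list(values)])   # shape (1, 11)
    encoded = model.predict(scaled)             # encoded class, assumed integer
    return str(label_encoder.inverse_transform(encoded)[0])


# Usage, assuming the artifacts are in the current directory:
#   model, scaler, label_encoder = load_artifacts()
#   sample = [51465, 443, 39975, 443, 3961, 1595, 2366, 21, 16, 12, 9]
#   print(predict_action(sample, model, scaler, label_encoder))
```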
---
## Repository Structure
```
Log_Classifier/
├── app.py                 Streamlit web application
├── retrain.py             Model retraining script
├── model.joblib           Trained Random Forest model
├── scaler.joblib          Fitted StandardScaler
├── label_encoder.joblib   Label encoder for action classes
├── requirements.txt       Python dependencies
└── The_Report.pdf         Full project report
```
---
## Authors
Yasmeen Algendy, Yomna Algendy, Zahraa Mohamed
Supervisor: Dr. Marwa Elsayed
CSAI 801: Artificial Intelligence and Machine Learning