Taranpreet Singh
---
title: AI NIDS Student Project
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.39.0
app_file: app.py
pinned: false
---
# 🛡️ AI-Based Network Intrusion Detection System (Student Project)
Project Status: Phase 1 – Pre-Production / Offline Prototype
This project demonstrates how to use **Machine Learning (Random Forest)** and **Generative AI (Groq)** to detect and explain network attacks (specifically DDoS).
## 🚀 How to Use
1. **Enter API Key:** Paste your Groq API key in the sidebar (optional, for AI explanations).
2. **Train Model:** Click the "Train Model Now" button. The system loads the `Friday-WorkingHours...` dataset automatically.
3. **Simulate:** Click "🎲 Capture Random Packet" to pick a real network packet from the test set.
4. **Analyze:** See if the model flags it as **BENIGN** or **DDoS**, and ask Groq to explain why.
## 📂 Files
- `app.py`: The main Python application code.
- `requirements.txt`: List of libraries used.
- `Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv`: The dataset (CIC-IDS2017 subset).
## 🔧 PHASE 0 – Foundation Hardening (completed)
This repository includes an incremental, production-aligned hardening of the original student project.
- Deterministic reproducibility (global seed, logging).
- Explicit data validation and feature checks.
- Class-imbalance handling via `class_weight='balanced'`.
- Stratified 5-fold cross-validation with per-fold metrics.
- Accuracy replaced as the evaluation metric by precision, recall, F1, PR-AUC, ROC-AUC, and confusion matrices.
- Artifacts saved to `models/` and `metrics/` (see below).
These changes are intentionally small and reversible; see `training_utils.py` for the training implementation.
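
To make two of the items above concrete, here is a dependency-free sketch (not the code in `training_utils.py`). It shows how scikit-learn's `class_weight='balanced'` heuristic assigns weights, `n_samples / (n_classes * class_count)`, and why precision, recall, and F1 say more than accuracy on imbalanced data. The label counts and the helper names `balanced_class_weights` and `binary_metrics` are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as in scikit-learn's 'balanced' mode: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def binary_metrics(y_true, y_pred, positive="DDoS"):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative (made-up) labels: 90 benign flows and 10 DDoS flows.
y_true = ["BENIGN"] * 90 + ["DDoS"] * 10
weights = balanced_class_weights(y_true)

# A useless model that always predicts BENIGN still reaches 90% accuracy,
# while its recall on the DDoS class is zero.
always_benign = ["BENIGN"] * 100
metrics = binary_metrics(y_true, always_benign)
print(weights)
print(metrics)
```

The rarer class receives the larger weight, so misclassifying a DDoS flow costs more during training than misclassifying a benign one.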
## 📦 Artifacts (generated after training)
- `models/rf_model.joblib`: serialized RandomForest model (best fold).
- `metrics/training_metrics.json`: timestamped CV metrics including PR-curve, seed, feature list.
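
As a rough illustration of what a timestamped metrics file might contain, the sketch below writes a JSON record using only the standard library. The key names (`timestamp`, `seed`, `features`, `folds`), the feature names, the seed value, and the metric values are placeholders, not the actual schema or results produced by `training_utils.py`. It writes to a temporary directory rather than `metrics/`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_metrics(out_dir, seed, features, fold_metrics):
    """Write one timestamped JSON record and return its path (hypothetical schema)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "seed": seed,
        "features": list(features),
        "folds": fold_metrics,
    }
    path = out_dir / "training_metrics.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


# Placeholder values for illustration only, not real results.
example_folds = [{"fold": 1, "precision": 0.0, "recall": 0.0, "f1": 0.0}]

with tempfile.TemporaryDirectory() as tmp:
    metrics_path = save_metrics(
        Path(tmp) / "metrics",
        seed=42,
        features=["feature_a", "feature_b"],
        fold_metrics=example_folds,
    )
    # Read the file back to confirm it round-trips as valid JSON.
    loaded = json.loads(metrics_path.read_text(encoding="utf-8"))

print(sorted(loaded))
```

Storing the seed and feature list next to the metrics lets a later run be compared against the exact configuration that produced them.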
## ⚠️ Dataset & Publishing
- ⚠️ Dataset Note: The full CIC-IDS2017 CSV (~96 MB) is intentionally excluded from GitHub.
This repository focuses on model architecture and training logic. A small sample or synthetic dataset (`sample_data/sample_small.csv`) is included for demos.
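
One way such a sample file could be produced is a seeded, per-class random sample of the full CSV. The sketch below uses only the standard library. The function name `write_stratified_sample`, the label column name, and the tiny in-memory demo data are assumptions for illustration, and this is not necessarily how `sample_data/sample_small.csv` was created.

```python
import csv
import io
import random
from collections import defaultdict


def write_stratified_sample(src, dst, label_column, per_class, seed=42):
    """Copy up to `per_class` randomly chosen rows of each label from src to dst."""
    reader = csv.DictReader(src)
    by_label = defaultdict(list)
    for row in reader:  # note: holds all rows in memory
        by_label[row[label_column]].append(row)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    written = 0
    for label in sorted(by_label):
        rows = by_label[label]
        for row in rng.sample(rows, min(per_class, len(rows))):
            writer.writerow(row)
            written += 1
    return written


# Tiny made-up input standing in for the full dataset.
full = io.StringIO(
    "feature_a,Label\n"
    + "".join(f"{i},BENIGN\n" for i in range(8))
    + "".join(f"{i},DDoS\n" for i in range(3))
)
sample = io.StringIO()
n_written = write_stratified_sample(full, sample, "Label", per_class=2)
print(n_written)
```

When working with real files instead of `io.StringIO`, open both with `newline=""` as the `csv` module documentation recommends. Sampling per class keeps both BENIGN and DDoS rows in the demo file even when one class is rare.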
## ▶️ Run locally
1. Create a virtual environment and install dependencies:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
2. Run the Streamlit app:
```powershell
streamlit run app.py
```
## Contact / Next steps
A possible next step is to generate a small sample CSV (e.g., 1k rows) so that the repo can be published to GitHub safely without the full dataset.
## 🎓 About
Created for a university cybersecurity project to demonstrate the integration of traditional ML and LLMs in security operations.