Arjuna12 committed on
Commit 1b7db22 · 0 Parent(s):

Clean repo without binary files

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +132 -0
  3. requirements.txt +12 -0
.gitattributes ADDED
*.joblib filter=lfs diff=lfs merge=lfs -text
README.md ADDED
# ITD_Model — Insider Threat Detection

A supervised ensemble for insider-threat detection on Active Directory logs.

Core components
- `train.py`: trains the ensemble and saves `Models/improved_threat_detector.joblib`
- `predict.py`: runs inference on `Data/test_synthetic.csv` and writes outputs to `Outputs/`
- `Src/train_model.py`: core training logic (SMOTE, model fitting, threshold search)

Quick start
```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python train.py    # trains and saves the model
python predict.py  # runs inference on Data/test_synthetic.csv
```

## Complete pipeline & model details

### Goal
Detect insider threats by classifying users as normal or anomalous using historical Active Directory activity features.

### Data
- Raw input (in `Data/Raw/`): per-user CSV logs (file, email, device, logon, decoy access, user metadata).
- Features (in `Data/Features/all_features.csv`): 20 pre-engineered numeric features per user. The label is derived from decoy access (the `accessed_decoy` column) and encoded as 1 = anomaly (threat), 0 = normal.
- Synthetic test data: `Data/test_synthetic.csv` (500 users, used for generalization testing).
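
The label derivation can be sketched as below. This is a minimal illustration, not the project's exact code: the column names other than `accessed_decoy` are hypothetical, and the exact semantics of `accessed_decoy` (boolean flag vs. access count) are an assumption here — any nonzero value is treated as decoy access.

```python
import pandas as pd

# Hypothetical per-user feature rows; only `accessed_decoy` is named in the README.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "logon_count": [42, 7, 19],
    "accessed_decoy": [0, 3, 0],
})

# 1 = anomaly (threat), 0 = normal
df["label"] = (df["accessed_decoy"] > 0).astype(int)

# Feature matrix excludes the identifier and the label source.
X = df.drop(columns=["user_id", "accessed_decoy", "label"])
y = df["label"]
```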

### Preprocessing
- Missing values: numeric features filled with the median or a domain-appropriate value.
- Scaling: `StandardScaler` fit on the training features; the fitted scaler is saved with the model artifact.
- Feature selection: the pipeline uses a fixed set of 20 numeric features saved alongside the model.
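
A minimal sketch of the median fill and scaling steps, assuming a plain pandas/scikit-learn setup (column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({
    "logon_count": [10.0, np.nan, 30.0, 20.0],
    "email_count": [5.0, 7.0, np.nan, 9.0],
})

# Fill missing values with the training-set median, column by column.
medians = X_train.median()
X_train = X_train.fillna(medians)

# Fit the scaler on training data only; reuse it (and the medians)
# when transforming validation, test, and inference data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
```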

### Class balancing
- Method: SMOTE (from `imbalanced-learn`), applied only to the training split.
- Sampling strategy: 0.67 (yields roughly 60% normal / 40% anomaly in the training set).
- Reason: the original dataset is heavily imbalanced; SMOTE reduces bias toward the majority class while avoiding extreme oversampling.

### Models (ensemble)
- Base learners:
  - `RandomForestClassifier` (scikit-learn): `n_estimators=300`, `max_depth=12`, `min_samples_leaf=4`, `class_weight='balanced_subsample'`, `n_jobs=-1`
  - `XGBClassifier` (xgboost): `n_estimators=250`, `max_depth=6`, `learning_rate=0.03`, `gamma=1.0`, `colsample_bytree=0.7`
  - `LGBMClassifier` (lightgbm): `n_estimators=250`, `max_depth=6`, `learning_rate=0.03`, `reg_alpha=1.0`, `reg_lambda=1.0`
- Ensemble: `VotingClassifier` (soft voting) with weights RF=1.0, XGB=1.3, LGB=1.2.
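
The ensemble wiring looks roughly like this. To keep the sketch self-contained it uses scikit-learn estimators only — two `GradientBoostingClassifier` instances stand in for `XGBClassifier` and `LGBMClassifier`; in the actual pipeline those slots hold the xgboost and lightgbm models with the hyperparameters above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Toy imbalanced data in place of the real 20-feature training set.
X, y = make_classification(n_samples=300, n_features=20, weights=[0.8, 0.2],
                           random_state=42)

rf = RandomForestClassifier(n_estimators=300, max_depth=12, min_samples_leaf=4,
                            class_weight="balanced_subsample", n_jobs=-1,
                            random_state=42)
# Stand-ins for XGBClassifier / LGBMClassifier (same role: gradient boosting).
xgb_like = GradientBoostingClassifier(n_estimators=250, max_depth=6,
                                      learning_rate=0.03, random_state=42)
lgb_like = GradientBoostingClassifier(n_estimators=250, max_depth=6,
                                      learning_rate=0.03, random_state=43)

# Soft voting: averages predict_proba outputs using the stated weights.
ensemble = VotingClassifier(
    estimators=[("rf", rf), ("xgb", xgb_like), ("lgb", lgb_like)],
    voting="soft", weights=[1.0, 1.3, 1.2],
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X)[:, 1]
```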

### Threshold optimization
- The default binary threshold (0.5) is replaced by an optimal threshold found on validation data.
- Method: compute prediction probabilities on the validation set, sweep candidate thresholds, and choose the one maximizing F1 (ROC/PR analysis is used as a sanity check). The chosen threshold is saved in the model artifact (≈ 0.9909).
- Effect: a conservative decision boundary that reduces false positives in deployment (trade-off: lower recall).
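
The threshold sweep can be done directly from the precision-recall curve. A minimal sketch, with synthetic scores standing in for the ensemble's validation `predict_proba` output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and probability scores.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
proba_val = np.clip(y_val * 0.6 + rng.normal(0.2, 0.2, size=200), 0, 1)

# precision/recall have one more entry than thresholds, so drop the
# final PR point (which has no associated threshold) before argmax.
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = int(np.argmax(f1[:-1]))
optimal_threshold = float(thresholds[best])

# Apply the chosen threshold instead of the default 0.5.
y_pred = (proba_val >= optimal_threshold).astype(int)
```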

### Training workflow (`train.py`)
1. Load `Data/Features/all_features.csv` and the label column.
2. Split into train/validation (75/25, stratified by label).
3. Fit `StandardScaler` on the training features.
4. Apply SMOTE to the training set only.
5. Train the RF, XGB, and LGB models with the hyperparameters above.
6. Build the weighted `VotingClassifier` using soft probabilities.
7. Find the optimal decision threshold on the validation set (maximize F1).
8. Evaluate, then save training metrics to `Outputs/improved_summary.txt` and predictions to `Outputs/improved_results.csv`.
9. Save the artifact `Models/improved_threat_detector.joblib`, containing:
   - the ensemble model object
   - the individual model objects
   - the fitted `StandardScaler`
   - the `feature_columns` list
   - the `optimal_threshold`
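
Step 9 amounts to dumping one dictionary with joblib. A sketch under simplifying assumptions — a single `RandomForestClassifier` stands in for the full ensemble, and the dictionary keys are illustrative (only `feature_columns` and `optimal_threshold` are named in the artifact description above):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
feature_columns = [f"f{i}" for i in range(20)]  # hypothetical names

scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler.transform(X), y)

# Bundle everything inference needs into one artifact.
artifact = {
    "ensemble": model,              # stand-in for the VotingClassifier
    "models": {"rf": model},
    "scaler": scaler,
    "feature_columns": feature_columns,
    "optimal_threshold": 0.9909,
}
joblib.dump(artifact, "improved_threat_detector.joblib")
```

Saving the scaler, feature list, and threshold alongside the model keeps inference self-consistent: `predict.py` never has to re-derive any of them.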

### Inference workflow (`predict.py`)
1. Load the artifact `Models/improved_threat_detector.joblib`.
2. Load features from `Data/test_synthetic.csv` (or another target file) and transform them with the saved `StandardScaler`.
3. Produce probability scores via the ensemble (`predict_proba`).
4. Apply the saved `optimal_threshold` to convert probabilities to binary predictions.
5. Compute evaluation metrics (accuracy, precision, recall, F1, ROC AUC) if ground truth is available.
6. Write detailed predictions to `Outputs/predictions_new_data.csv` and a human-readable summary to `Outputs/summary_new_data.txt`.
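
Steps 2-4 reduce to a few lines once the artifact is loaded. A self-contained sketch (the in-memory `artifact` dict below stands in for `joblib.load` on the saved file, and a small random forest stands in for the ensemble):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in for the saved artifact; in practice this dict comes from
# joblib.load("Models/improved_threat_detector.joblib").
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scaler = StandardScaler().fit(X)
artifact = {
    "ensemble": RandomForestClassifier(n_estimators=50, random_state=0).fit(
        scaler.transform(X), y),
    "scaler": scaler,
    "optimal_threshold": 0.9909,
}

# Steps 2-4: scale new data, score it, apply the saved threshold.
X_new = X[:10]                                  # pretend these are new users
X_scaled = artifact["scaler"].transform(X_new)
scores = artifact["ensemble"].predict_proba(X_scaled)[:, 1]
preds = (scores >= artifact["optimal_threshold"]).astype(int)
```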

### Evaluation (reported)
- Training set (optimized threshold, evaluated on the training distribution): Accuracy = 100%, Precision = 100%, Recall = 100%, F1 = 1.0. Note: this reflects the training distribution after SMOTE and is not an estimate of generalization.
- Generalization test (synthetic set, 500 users): Accuracy = 62.20%, Precision = 31.20%, Recall = 27.46%, F1 = 0.2921, ROC AUC = 0.5063. Confusion matrix: TN = 272, TP = 39, FP = 86, FN = 103.

### Outputs (what to expect)
- `Models/improved_threat_detector.joblib` — model artifact with the components listed above.
- `Outputs/improved_results.csv` — per-row predictions and probabilities for the training/test split.
- `Outputs/improved_summary.txt` — training evaluation metrics and top-risk users.
- `Outputs/predictions_new_data.csv` — predictions for synthetic/new data (columns: `user_id`, `anomaly_score`, `predicted_anomaly`, `actual_anomaly`).
- `Outputs/summary_new_data.txt` — evaluation summary for new data.

### Configuration
- `Src/config.py` contains the path variables used by the scripts. Example:

```python
from pathlib import Path

PROJECT_DIR = Path(__file__).parent.parent
DATA_DIR = PROJECT_DIR / 'Data'
FEATURES_FILE = DATA_DIR / 'Features' / 'all_features.csv'
MODEL_FILE = PROJECT_DIR / 'Models' / 'improved_threat_detector.joblib'
TEST_DATA_FILE = DATA_DIR / 'test_synthetic.csv'
```

### Dependencies
- Python 3.8+
- Primary packages:
  - scikit-learn
  - xgboost
  - lightgbm
  - imbalanced-learn
  - pandas, numpy
  - joblib

Create a virtual environment and install dependencies:

```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

To regenerate `requirements.txt` from your environment:

```powershell
pip freeze > requirements.txt
```

### Reproducibility & notes
- Random seeds: training sets a fixed `random_state` for reproducible splits and model training where applicable.
- Do not apply SMOTE to the validation/test sets — only to the training fold.
- For production deployment, consider periodic retraining and threshold recalibration to accommodate distribution drift.

### Suggested next steps
- Pin exact package versions in `requirements.txt` (e.g. via `pip freeze`) to lock the training environment.
- Add `CONTRIBUTING.md` and a short checklist for retraining in production.
- Add optional explainability (SHAP) to `predict.py` to produce per-user feature attributions.
requirements.txt ADDED

scikit-learn>=1.1
xgboost>=1.6
lightgbm>=3.3
imbalanced-learn>=0.10
pandas>=1.3
numpy>=1.21
joblib>=1.1
protobuf>=3.20.0

# Optional (for explainability)
shap>=0.42