Arjuna12 committed on
Commit 1b7db22 · 0 Parent(s):

Clean repo without binary files

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +132 -0
  3. requirements.txt +12 -0
.gitattributes ADDED
*.joblib filter=lfs diff=lfs merge=lfs -text
README.md ADDED
# ITD_Model — Insider Threat Detection

A supervised ensemble for insider-threat detection on Active Directory logs.

Core components
- `train.py`: trains the ensemble and saves `Models/improved_threat_detector.joblib`
- `predict.py`: runs inference on `Data/test_synthetic.csv` and writes outputs to `Outputs/`
- `Src/train_model.py`: core training logic (SMOTE, model fitting, threshold search)

Quick start
```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python train.py    # trains and saves the model
python predict.py  # runs inference on Data/test_synthetic.csv
```

## Complete pipeline & model details

### Goal
Detect insider threats by classifying users as normal or anomalous using historical Active Directory activity features.

### Data
- Raw input (in `Data/Raw/`): per-user CSV logs (file, email, device, logon, decoy access, user metadata).
- Features (in `Data/Features/all_features.csv`): 20 pre-engineered numeric features per user. The label is derived from decoy access (the `accessed_decoy` column) and encoded as 1 = anomaly (threat), 0 = normal.
- Synthetic test data: `Data/test_synthetic.csv` (500 users, used for generalization testing).
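
The label derivation can be sketched as below. This is a minimal illustration, not the project's exact code: the column names other than `accessed_decoy` are hypothetical, and the exact semantics of `accessed_decoy` (boolean flag vs. access count) are an assumption here — any nonzero value is treated as decoy access.

```python
import pandas as pd

# Hypothetical per-user feature rows; only `accessed_decoy` is named in the README.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "logon_count": [42, 7, 19],
    "accessed_decoy": [0, 3, 0],
})

# 1 = anomaly (threat), 0 = normal
df["label"] = (df["accessed_decoy"] > 0).astype(int)

# Feature matrix excludes the identifier and the label source.
X = df.drop(columns=["user_id", "accessed_decoy", "label"])
y = df["label"]
```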

### Preprocessing
- Missing values: numeric features filled with the median or a domain-appropriate value.
- Scaling: `StandardScaler` fit on the training features; the fitted scaler is saved with the model artifact.
- Feature selection: the pipeline uses a fixed set of 20 numeric features saved alongside the model.
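
A minimal sketch of the median fill and scaling steps, assuming a plain pandas/scikit-learn setup (column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({
    "logon_count": [10.0, np.nan, 30.0, 20.0],
    "email_count": [5.0, 7.0, np.nan, 9.0],
})

# Fill missing values with the training-set median, column by column.
medians = X_train.median()
X_train = X_train.fillna(medians)

# Fit the scaler on training data only; reuse it (and the medians)
# when transforming validation, test, and inference data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
```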

### Class balancing
- Method: SMOTE (from `imbalanced-learn`), applied only to the training split.
- Sampling strategy: 0.67 (yields roughly 60% normal / 40% anomaly in the training set).
- Reason: the original dataset is heavily imbalanced; SMOTE reduces bias toward the majority class while avoiding extreme oversampling.

### Models (ensemble)
- Base learners:
  - `RandomForestClassifier` (scikit-learn): `n_estimators=300`, `max_depth=12`, `min_samples_leaf=4`, `class_weight='balanced_subsample'`, `n_jobs=-1`
  - `XGBClassifier` (xgboost): `n_estimators=250`, `max_depth=6`, `learning_rate=0.03`, `gamma=1.0`, `colsample_bytree=0.7`
  - `LGBMClassifier` (lightgbm): `n_estimators=250`, `max_depth=6`, `learning_rate=0.03`, `reg_alpha=1.0`, `reg_lambda=1.0`
- Ensemble: `VotingClassifier` (soft voting) with weights RF=1.0, XGB=1.3, LGB=1.2.
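
The ensemble wiring looks roughly like this. To keep the sketch self-contained it uses scikit-learn estimators only — two `GradientBoostingClassifier` instances stand in for `XGBClassifier` and `LGBMClassifier`; in the actual pipeline those slots hold the xgboost and lightgbm models with the hyperparameters above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Toy imbalanced data in place of the real 20-feature training set.
X, y = make_classification(n_samples=300, n_features=20, weights=[0.8, 0.2],
                           random_state=42)

rf = RandomForestClassifier(n_estimators=300, max_depth=12, min_samples_leaf=4,
                            class_weight="balanced_subsample", n_jobs=-1,
                            random_state=42)
# Stand-ins for XGBClassifier / LGBMClassifier (same role: gradient boosting).
xgb_like = GradientBoostingClassifier(n_estimators=250, max_depth=6,
                                      learning_rate=0.03, random_state=42)
lgb_like = GradientBoostingClassifier(n_estimators=250, max_depth=6,
                                      learning_rate=0.03, random_state=43)

# Soft voting: averages predict_proba outputs using the stated weights.
ensemble = VotingClassifier(
    estimators=[("rf", rf), ("xgb", xgb_like), ("lgb", lgb_like)],
    voting="soft", weights=[1.0, 1.3, 1.2],
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X)[:, 1]
```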

### Threshold optimization
- The default binary threshold (0.5) is replaced by an optimal threshold found on validation data.
- Method: compute prediction probabilities on the validation set, sweep candidate thresholds, and choose the one maximizing F1 (ROC/PR analysis is used as a sanity check). The chosen threshold is saved in the model artifact (≈ 0.9909).
- Effect: a conservative decision boundary that reduces false positives in deployment (trade-off: lower recall).
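
The threshold sweep can be done directly from the precision-recall curve. A minimal sketch, with synthetic scores standing in for the ensemble's validation `predict_proba` output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels and probability scores.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
proba_val = np.clip(y_val * 0.6 + rng.normal(0.2, 0.2, size=200), 0, 1)

# precision/recall have one more entry than thresholds, so drop the
# final PR point (which has no associated threshold) before argmax.
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = int(np.argmax(f1[:-1]))
optimal_threshold = float(thresholds[best])

# Apply the chosen threshold instead of the default 0.5.
y_pred = (proba_val >= optimal_threshold).astype(int)
```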

### Training workflow (`train.py`)
1. Load `Data/Features/all_features.csv` and the label column.
2. Split into train/validation (75/25, stratified by label).
3. Fit `StandardScaler` on the training features.
4. Apply SMOTE to the training set only.
5. Train the RF, XGB, and LGB models with the hyperparameters above.
6. Build the weighted `VotingClassifier` using soft probabilities.
7. Find the optimal decision threshold on the validation set (maximize F1).
8. Evaluate, then save training metrics to `Outputs/improved_summary.txt` and predictions to `Outputs/improved_results.csv`.
9. Save the artifact `Models/improved_threat_detector.joblib`, containing:
   - the ensemble model object
   - the individual model objects
   - the fitted `StandardScaler`
   - the `feature_columns` list
   - the `optimal_threshold`
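
Step 9 amounts to dumping one dictionary with joblib. A sketch under simplifying assumptions — a single `RandomForestClassifier` stands in for the full ensemble, and the dictionary keys are illustrative (only `feature_columns` and `optimal_threshold` are named in the artifact description above):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
feature_columns = [f"f{i}" for i in range(20)]  # hypothetical names

scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler.transform(X), y)

# Bundle everything inference needs into one artifact.
artifact = {
    "ensemble": model,              # stand-in for the VotingClassifier
    "models": {"rf": model},
    "scaler": scaler,
    "feature_columns": feature_columns,
    "optimal_threshold": 0.9909,
}
joblib.dump(artifact, "improved_threat_detector.joblib")
```

Saving the scaler, feature list, and threshold alongside the model keeps inference self-consistent: `predict.py` never has to re-derive any of them.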

### Inference workflow (`predict.py`)
1. Load the artifact `Models/improved_threat_detector.joblib`.
2. Load features from `Data/test_synthetic.csv` (or another target file) and transform them with the saved `StandardScaler`.
3. Produce probability scores via the ensemble (`predict_proba`).
4. Apply the saved `optimal_threshold` to convert probabilities to binary predictions.
5. Compute evaluation metrics (accuracy, precision, recall, F1, ROC AUC) if ground truth is available.
6. Write detailed predictions to `Outputs/predictions_new_data.csv` and a human-readable summary to `Outputs/summary_new_data.txt`.
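
Steps 2-4 reduce to a few lines once the artifact is loaded. A self-contained sketch (the in-memory `artifact` dict below stands in for `joblib.load` on the saved file, and a small random forest stands in for the ensemble):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in for the saved artifact; in practice this dict comes from
# joblib.load("Models/improved_threat_detector.joblib").
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scaler = StandardScaler().fit(X)
artifact = {
    "ensemble": RandomForestClassifier(n_estimators=50, random_state=0).fit(
        scaler.transform(X), y),
    "scaler": scaler,
    "optimal_threshold": 0.9909,
}

# Steps 2-4: scale new data, score it, apply the saved threshold.
X_new = X[:10]                                  # pretend these are new users
X_scaled = artifact["scaler"].transform(X_new)
scores = artifact["ensemble"].predict_proba(X_scaled)[:, 1]
preds = (scores >= artifact["optimal_threshold"]).astype(int)
```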

### Evaluation (reported)
- Training set (optimized threshold, evaluated on the training distribution): Accuracy = 100%, Precision = 100%, Recall = 100%, F1 = 1.0. Note: this reflects the training distribution after SMOTE and is not an estimate of generalization.
- Generalization test (synthetic set, 500 users): Accuracy = 62.20%, Precision = 31.20%, Recall = 27.46%, F1 = 0.2921, ROC AUC = 0.5063. Confusion matrix: TN = 272, TP = 39, FP = 86, FN = 103.

### Outputs (what to expect)
- `Models/improved_threat_detector.joblib` — model artifact with the components listed above.
- `Outputs/improved_results.csv` — per-row predictions and probabilities for the training/test split.
- `Outputs/improved_summary.txt` — training evaluation metrics and top-risk users.
- `Outputs/predictions_new_data.csv` — predictions for synthetic/new data (columns: `user_id`, `anomaly_score`, `predicted_anomaly`, `actual_anomaly`).
- `Outputs/summary_new_data.txt` — evaluation summary for new data.

### Configuration
- `Src/config.py` contains the path variables used by the scripts. Example:

```python
from pathlib import Path

PROJECT_DIR = Path(__file__).parent.parent
DATA_DIR = PROJECT_DIR / 'Data'
FEATURES_FILE = DATA_DIR / 'Features' / 'all_features.csv'
MODEL_FILE = PROJECT_DIR / 'Models' / 'improved_threat_detector.joblib'
TEST_DATA_FILE = DATA_DIR / 'test_synthetic.csv'
```

### Dependencies
- Python 3.8+
- Primary packages:
  - scikit-learn
  - xgboost
  - lightgbm
  - imbalanced-learn
  - pandas, numpy
  - joblib

Create a virtual environment and install dependencies:

```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
```

To regenerate `requirements.txt` from your environment:

```powershell
pip freeze > requirements.txt
```

### Reproducibility & notes
- Random seeds: training sets a fixed `random_state` for reproducible splits and model training where applicable.
- Do not apply SMOTE to the validation/test sets — only to the training fold.
- For production deployment, consider periodic retraining and threshold recalibration to accommodate distribution drift.

### Suggested next steps
- Pin exact package versions in `requirements.txt` (e.g. via `pip freeze`) to lock the training environment.
- Add `CONTRIBUTING.md` and a short checklist for retraining in production.
- Add optional explainability (SHAP) to `predict.py` to produce per-user feature attributions.
requirements.txt ADDED

scikit-learn>=1.1
xgboost>=1.6
lightgbm>=3.3
imbalanced-learn>=0.10
pandas>=1.3
numpy>=1.21
joblib>=1.1
protobuf>=3.20.0

# Optional (for explainability)
shap>=0.42