mboukabous's picture
first commit
4c91838
# Anomaly (Outlier) Detection Models
This directory hosts scripts defining **anomaly detection** estimators (e.g., Isolation Forest, One-Class SVM, etc.) for use with `train_anomaly_detection.py`. Each file specifies a scikit-learn–compatible outlier detector and, if applicable, a parameter grid.
**Key Points**:
- **Estimator**: Must allow `.fit(X)` and `.predict(X)` or similar. Typically returns +1 / −1 for inliers / outliers (we unify to 0 / 1).
- **Parameter Grid**: You can define hyperparameters (like `n_estimators`, `contamination`) for potential searching.
- **Default Approach**: We do not rely on labeled anomalies (unsupervised). The script will produce a predictions CSV with 0 = normal, 1 = outlier.
**Note**: The main script `train_anomaly_detection.py` handles data loading, label encoding, dropping/selecting columns, the `.fit(X)`, `.predict(X)` steps, saving the outlier predictions, and (optionally) a 2D plot with outliers in red.
## Available Anomaly Detection Models
- [Isolation Forest](isolation_forest.py)
- [One-Class SVM](one_class_svm.py)
- [Local Outlier Factor (LOF)](local_outlier_factor.py)
### Usage
For example, to detect outliers with an Isolation Forest:
```bash
python scripts/train_anomaly_detection.py \
--model_module isolation_forest \
--data_path data/breast_cancer/data.csv \
--drop_columns "id,diagnosis" \
--visualize
```
This:
1. Loads `isolation_forest.py`, sets up `IsolationForest(...)`.
2. Fits the model to the data, saves it, then `predict(...)`.
3. Saves a `predictions.csv` with `OutlierPrediction`.
4. If `--visualize`, does a 2D PCA scatter, coloring outliers red.