# Anomaly (Outlier) Detection Models

This directory contains scripts that define **anomaly detection** estimators (e.g., Isolation Forest, One-Class SVM) for use with `train_anomaly_detection.py`. Each file specifies a scikit-learn–compatible outlier detector and, where applicable, a parameter grid.

**Key Points**:
- **Estimator**: Must support `.fit(X)` and `.predict(X)` or an equivalent method. scikit-learn detectors typically return +1 for inliers and −1 for outliers; we convert these to 0 (normal) and 1 (outlier).
- **Parameter Grid**: You can define a grid of hyperparameters (such as `n_estimators` or `contamination`) for an optional hyperparameter search.
- **Default Approach**: The approach is unsupervised, so no labeled anomalies are required. The script writes a predictions CSV in which 0 = normal and 1 = outlier.
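
As a rough illustration of the points above, here is a minimal sketch of what a model file and the label conversion might look like. The names `estimator`, `param_grid`, and `to_outlier_flags` are hypothetical and chosen for this example; the actual scripts in this directory may use different names, and the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical module-level names; the real model files may differ.
estimator = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
param_grid = {
    "n_estimators": [100, 200],
    "contamination": [0.05, 0.1],
}


def to_outlier_flags(raw_predictions):
    """Map scikit-learn's +1 (inlier) / -1 (outlier) to 0 (normal) / 1 (outlier)."""
    return (np.asarray(raw_predictions) == -1).astype(int)


# Synthetic demo data: a tight cluster plus a few distant points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(95, 2)),
    rng.uniform(8, 10, size=(5, 2)),
])

estimator.fit(X)
flags = to_outlier_flags(estimator.predict(X))
print(flags.sum(), "of", len(flags), "rows flagged as outliers")
```

Keeping the conversion in one small helper means every detector, whatever its native output, ends up with the same 0 / 1 convention in the predictions file.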

**Note**: The main script `train_anomaly_detection.py` handles data loading, label encoding, dropping or selecting columns, the `.fit(X)` and `.predict(X)` steps, saving the outlier predictions, and (optionally) drawing a 2D plot with outliers in red.

## Available Anomaly Detection Models

- [Isolation Forest](isolation_forest.py)  
- [One-Class SVM](one_class_svm.py)  
- [Local Outlier Factor (LOF)](local_outlier_factor.py)  
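
One detail worth keeping in mind for LOF: in scikit-learn, `LocalOutlierFactor` only exposes `.predict(X)` when constructed with `novelty=True`. In its default mode, labels for the training data come from `.fit_predict(X)`. How `local_outlier_factor.py` handles this is not shown here, so the sketch below only illustrates the default-mode behavior on synthetic data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic demo data: a tight cluster plus a few distant points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(95, 2)),
    rng.uniform(8, 10, size=(5, 2)),
])

# Default mode (novelty=False): labels for the training data via fit_predict.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
raw = lof.fit_predict(X)  # +1 = inlier, -1 = outlier

# Same 0 / 1 convention as the other detectors.
flags = (raw == -1).astype(int)
print(flags.sum(), "of", len(flags), "rows flagged as outliers")
```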

### Usage

For example, to detect outliers with an Isolation Forest:

```bash
python scripts/train_anomaly_detection.py \
  --model_module isolation_forest \
  --data_path data/breast_cancer/data.csv \
  --drop_columns "id,diagnosis" \
  --visualize
```

This command:
1. Loads `isolation_forest.py` and sets up `IsolationForest(...)`.
2. Fits the model to the data, saves it, then calls `predict(...)`.
3. Saves a `predictions.csv` with an `OutlierPrediction` column.
4. If `--visualize` is set, draws a 2D PCA scatter plot with outliers colored red.
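
The steps above can be approximated in a few lines. This is a simplified sketch, not the code of `train_anomaly_detection.py`: it uses scikit-learn's bundled breast cancer dataset as a stand-in for `data/breast_cancer/data.csv`, writes to a temporary directory, skips saving the model, and assumes the CSV holds a single `OutlierPrediction` column. The blue color for inliers is also an assumption.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Stand-in for the CSV after dropping the "id" and "diagnosis" columns.
X = load_breast_cancer(as_frame=True).data

# Steps 1-2: set up the detector, fit it, and predict on the same data.
model = IsolationForest(random_state=42)
model.fit(X)
flags = (model.predict(X) == -1).astype(int)  # 0 = normal, 1 = outlier

# Step 3: write the predictions file (output location is an assumption).
out_dir = Path(tempfile.mkdtemp())
predictions = pd.DataFrame({"OutlierPrediction": flags})
predictions.to_csv(out_dir / "predictions.csv", index=False)

# Step 4: project to 2D with PCA and pick a color per point.
coords = PCA(n_components=2).fit_transform(X)
colors = np.where(flags == 1, "red", "blue")
```

To draw the plot, `coords[:, 0]`, `coords[:, 1]`, and `colors` can be passed to `matplotlib.pyplot.scatter` via its `c` argument.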