# Dimensionality Reduction Models

This directory contains Python scripts that define **dimensionality reduction** models (e.g., PCA, t-SNE, UMAP). Each model file sets up a scikit-learn–compatible estimator, or one that follows a similar interface, so models can be swapped easily when running `train_dimred_model.py`.

**Key Points**:
- **Estimator**: Typically supports `.fit_transform(X)` for dimension reduction.
- **Default Settings**: For example, PCA might default to `n_components=2`; t-SNE might set `n_components=2` and `perplexity=30`; UMAP might set `n_neighbors=15` and `n_components=2`.
- **No Supervised Tuning**: Hyperparameters are usually chosen for interpretability or from domain knowledge rather than tuned against labels. A manual search or a specialized metric can be used if needed.
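To illustrate what "follows a similar interface" can mean, here is a minimal sketch of an estimator that exposes `.fit_transform(X)` without depending on scikit-learn. The class name `SimplePCA` and its attributes are hypothetical and are not taken from the model files in this directory; it implements plain PCA via a singular value decomposition, using NumPy only.

```python
import numpy as np


class SimplePCA:
    """Hypothetical stand-in estimator (not one of this directory's model files)."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Center each feature so the principal axes pass through the data mean.
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal axes, ordered by decreasing singular value.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Project the centered data onto the retained principal axes.
        return (X - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)


# Synthetic data, used only to show the call pattern.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
embedding = SimplePCA().fit_transform(X)
print(embedding.shape)  # (50, 2)
```

Any object with this shape of interface (a constructor holding defaults, plus `fit_transform`) should be usable wherever the training script only calls `.fit_transform(X)`.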

**Note**: The `train_dimred_model.py` script handles dropping columns, label encoding, performing `.fit_transform(X)`, and optionally saving a 2D/3D scatter plot if `--visualize` is used.
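The preprocessing mentioned in the note can be sketched roughly as follows. This is an assumption about the general approach, not the script's actual code: the function name `prepare_features`, the `diagnosis` column, and the sample values are made up for illustration, and the real script may well use pandas and scikit-learn's `LabelEncoder` instead of the standard library.

```python
def prepare_features(rows, drop_columns=()):
    """Drop unwanted columns and label-encode non-numeric ones.

    `rows` is a list of dicts, one per record (hypothetical input format).
    Returns (column_names, matrix) with every value converted to float.
    """
    dropped = set(drop_columns)
    columns = [c for c in rows[0] if c not in dropped]
    matrix = [[0.0] * len(columns) for _ in rows]
    for j, col in enumerate(columns):
        values = [row[col] for row in rows]
        try:
            # Numeric column: parse the values directly.
            encoded = [float(v) for v in values]
        except (TypeError, ValueError):
            # Non-numeric column: map each distinct label to an integer code,
            # assigning codes in sorted label order.
            codes = {label: i for i, label in enumerate(sorted(set(values)))}
            encoded = [float(codes[v]) for v in values]
        for i, v in enumerate(encoded):
            matrix[i][j] = v
    return columns, matrix


# Made-up records, for illustration only.
rows = [
    {"id": "1", "radius_mean": "17.99", "diagnosis": "M"},
    {"id": "2", "radius_mean": "13.54", "diagnosis": "B"},
    {"id": "3", "radius_mean": "20.57", "diagnosis": "M"},
]
columns, X = prepare_features(rows, drop_columns=["id"])
print(columns)  # ['radius_mean', 'diagnosis']
print(X)        # [[17.99, 1.0], [13.54, 0.0], [20.57, 1.0]]
```

The resulting numeric matrix is what would then be passed to the estimator's `.fit_transform(X)`.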

## Available Dimensionality Reduction Models

- [PCA](pca.py)  
- [t-SNE](tsne.py)  
- [UMAP](umap.py)  

### Usage

To reduce data dimensions:

```bash
python scripts/train_dimred_model.py \
  --model_module pca \
  --data_path data/breast_cancer/data.csv \
  --select_columns "radius_mean, texture_mean, area_mean, smoothness_mean" \
  --visualize
```

This command:
1. Loads `pca.py`, which defines a `PCA(n_components=2)` estimator by default.
2. Applies `.fit_transform(...)` to produce a 2D embedding.
3. Saves the model (`dimred_model.pkl`) and the transformed data (`X_transformed.csv`).
4. Scatter-plots the result if `--visualize` is set and `n_components=2`.
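Step 3 above can be sketched with the standard library alone. The file names `dimred_model.pkl` and `X_transformed.csv` come from the description above; the function name `save_outputs`, the output directory argument, and the `component_N` column headers are assumptions for illustration and may differ from what the script actually writes.

```python
import csv
import pickle
from pathlib import Path


def save_outputs(model, X_transformed, out_dir):
    """Persist a fitted model and its embedding (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Serialize the fitted estimator so it can be reloaded later.
    model_path = out_dir / "dimred_model.pkl"
    with model_path.open("wb") as f:
        pickle.dump(model, f)

    # Write one row per sample and one column per embedding dimension.
    data_path = out_dir / "X_transformed.csv"
    n_components = len(X_transformed[0])
    with data_path.open("w", newline="") as f:
        writer = csv.writer(f)
        # Assumed header naming; the real script may label columns differently.
        writer.writerow([f"component_{i + 1}" for i in range(n_components)])
        writer.writerows(X_transformed)

    return model_path, data_path
```

A saved model can be restored with `pickle.load`, provided the class that defines it is importable in the loading environment.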