Update A Short Guide for Feature Engineering and Feature Selection.md

A Short Guide for Feature Engineering and Feature Selection.md (CHANGED)
@@ -25,7 +25,7 @@ Narrowly speaking, in data mining context, machine learning (ML) is the process

A typical ML workflow/pipeline looks like this:

-
+

[img source](https://www.springer.com/us/book/9781484232064)
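To make the modeling end of that workflow concrete, here is a minimal sketch using sklearn's `Pipeline`; the column names and the choice of final estimator are illustrative assumptions, not taken from the guide.

```python
# Minimal sketch of the preprocessing -> model stage of a typical ML pipeline.
# Column names and the final estimator are illustrative assumptions only.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]        # hypothetical numeric features
categorical_cols = ["city"]             # hypothetical categorical feature

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])
# Usage: model.fit(X_train, y_train); model.predict(X_test)
```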
@@ -330,7 +330,7 @@ All these methods attempt to group some of the labels and reduce cardinality. Gr

A comparison of three methods when facing outliers:

-
+

[img source](https://stackoverflow.com/questions/51841506/data-standardization-vs-normalization-vs-robust-scaler)
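To reproduce the gist of that comparison yourself, a small sketch running sklearn's three scalers on made-up data containing one outlier:

```python
# StandardScaler vs. MinMaxScaler vs. RobustScaler on data with one outlier.
# The toy array is made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))

# StandardScaler (mean/std) and MinMaxScaler (min/max) are both dragged by the
# outlier, squeezing the inliers together; RobustScaler centers and scales with
# the median and IQR, so the inliers keep a usable spread.
```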
@@ -471,7 +471,7 @@ In the situations above, transformation of the original variable can help give t

**Box-Cox transformation** in sklearn [13] is another popular member of the power transform family of functions. It has a prerequisite that the numeric values to be transformed must be positive (similar to what the log transform expects); if some values are negative, shifting them by a constant first helps. Mathematically, the Box-Cox transform can be denoted as follows.

-
+

**Quantile transformation** in sklearn [14] transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers, which makes it a robust preprocessing scheme. However, this transform is non-linear: it may distort linear correlations between variables measured at the same scale, while rendering variables measured at different scales more directly comparable.
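A minimal sketch of both transforms via sklearn's `PowerTransformer` and `QuantileTransformer`; the lognormal sample is made up for illustration.

```python
# Box-Cox: x -> (x**lmbda - 1) / lmbda for lmbda != 0, ln(x) for lmbda == 0;
# sklearn estimates lmbda by maximum likelihood. Input must be strictly positive.
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # skewed, positive

boxcox = PowerTransformer(method="box-cox")
X_bc = boxcox.fit_transform(X)
print("fitted lambda:", boxcox.lambdas_)

# Quantile transform: maps empirical quantiles onto a normal (or uniform)
# output distribution; robust to outliers but non-linear.
quantile = QuantileTransformer(output_distribution="normal",
                               n_quantiles=1000, random_state=0)
X_qt = quantile.fit_transform(X)
```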
@@ -481,7 +481,7 @@ We can use **Q-Q plot** to check if the variable is normally distributed (a 45 d

Below is an example showing the effect of sklearn's Box-Cox/Yeo-Johnson/Quantile transforms when mapping data from various distributions to a normal distribution.

-
+

[img source](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_map_data_to_normal.html#sphx-glr-auto-examples-preprocessing-plot-map-data-to-normal-py)
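To make the Q-Q check from this section concrete, a sketch with scipy's `probplot` before and after a Yeo-Johnson transform (the sample data is made up):

```python
# Q-Q plots before/after a Yeo-Johnson transform: points lying on the
# 45-degree reference line indicate an approximately normal variable.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
x = rng.exponential(scale=2.0, size=1000)               # heavily right-skewed
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(
    x.reshape(-1, 1)).ravel()

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(x, dist="norm", plot=axes[0])
axes[0].set_title("original")
stats.probplot(x_yj, dist="norm", plot=axes[1])
axes[1].set_title("after Yeo-Johnson")
plt.tight_layout()
plt.show()
```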
@@ -521,7 +521,7 @@ Still take call log for example, we can have crossed features like: number of ca

**Note**: An open-source python framework named **Featuretools** that helps automatically generate such features can be found [here](https://github.com/Featuretools/featuretools).

-
+

Personally I haven't used it in practice; you may try it and see whether it is of use in industry.
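For the curious, a rough sketch of what using it looks like: a tiny hypothetical call-log EntitySet run through deep feature synthesis (`ft.dfs`), assuming the featuretools >= 1.0 API; the data and chosen primitives are made up.

```python
# Deep feature synthesis on a tiny hypothetical call log (featuretools >= 1.0).
import featuretools as ft
import pandas as pd

calls = pd.DataFrame({
    "call_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 20],
    "duration": [120, 35, 400, 15],
    "call_time": pd.to_datetime(["2019-01-01", "2019-01-02",
                                 "2019-01-01", "2019-01-03"]),
})
customers = pd.DataFrame({"customer_id": [10, 20]})

es = ft.EntitySet(id="call_logs")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="calls", dataframe=calls,
                      index="call_id", time_index="call_time")
es = es.add_relationship("customers", "customer_id", "calls", "customer_id")

# Automatically generates aggregation features such as MEAN(calls.duration).
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["mean", "count", "max"])
print(feature_matrix.columns.tolist())
```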
@@ -596,7 +596,7 @@ As a result, filter methods are suited for a first step quick screen and removal

WOE encoding (see section 3.3.2) and IV often go hand in hand in scorecard development. Both concepts derive from logistic regression and are more or less standard practice in the credit card industry. IV is a popular and widely used measure, as there are very convenient rules of thumb for variable selection associated with IV, shown below:

-
+

However, all these filtering methods fail to consider the interaction between features and may reduce predictive power. Personally I only use variance and correlation to filter out some absolutely unnecessary features.
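For reference, a small pandas sketch of how WOE and IV are typically computed for one binned feature against a binary target; the toy data is made up, and note that the WOE sign convention (good over bad vs. bad over good) varies between texts.

```python
# WOE and IV for one categorical/binned feature vs. a binary target (1 = bad).
# WOE_i = ln(%good_i / %bad_i); IV = sum_i (%good_i - %bad_i) * WOE_i.
# The toy data is made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "bad":   [0,   0,   0,   1,   0,   1,   1,   0,   1,   1],
})

grp = df.groupby("grade")["bad"].agg(bad="sum", total="count")
grp["good"] = grp["total"] - grp["bad"]
pct_good = grp["good"] / grp["good"].sum()
pct_bad = grp["bad"] / grp["bad"].sum()

# A small epsilon avoids division by zero / log(0) for pure bins.
eps = 1e-6
grp["woe"] = np.log((pct_good + eps) / (pct_bad + eps))
iv = float(((pct_good - pct_bad) * grp["woe"]).sum())
print(grp[["good", "bad", "woe"]], "\nIV =", round(iv, 3))
```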