aitek230telu committed
Commit 0d370a5 · verified · 1 Parent(s): a9c5f4b

Update A Short Guide for Feature Engineering and Feature Selection.md
A Short Guide for Feature Engineering and Feature Selection.md CHANGED
@@ -25,7 +25,7 @@ Narrowly speaking, in data mining context, machine learning (ML) is the process
 A typical ML workflow/pipeline looks like this:
 
 
-![workflow](/images/workflow2.png)
+![workflow](https://huggingface.co/recommender-system/feature-engineering-guide/resolve/main/images/workflow2.png)
 
 [img source](https://www.springer.com/us/book/9781484232064)
 
@@ -330,7 +330,7 @@ All these methods attempt to group some of the labels and reduce cardinality. Gr
 A comparison of three methods when facing outliers:
 
 
-![scaling](/images/scaling.png)
+![scaling](https://huggingface.co/recommender-system/feature-engineering-guide/resolve/main/images/scaling.png)
 
 [img source](https://stackoverflow.com/questions/51841506/data-standardization-vs-normalization-vs-robust-scaler)
 
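The three scalers the linked figure compares behave very differently around an outlier; a minimal sketch of that comparison (the data values here are illustrative, not from the guide):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature with a single extreme outlier (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(x)  # (x - mean) / std; the outlier inflates both
minmaxed = MinMaxScaler().fit_transform(x)        # maps to [0, 1]; inliers get squashed near 0
robust = RobustScaler().fit_transform(x)          # (x - median) / IQR; resistant to the outlier
```

Note how a single outlier defines the whole range for MinMaxScaler, squeezing the four inliers into a narrow band, while RobustScaler's median/IQR statistics are barely affected.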
@@ -471,7 +471,7 @@ In the situations above, transformation of the original variable can help give t
 
 **Box-Cox transformation** in sklearn [13] is another popular function belonging to the power transform family. It has a prerequisite that the numeric values to be transformed must be positive (similar to what log transform expects). In case they are negative, shifting by a constant value helps. Mathematically, the Box-Cox transform function can be denoted as follows.
 
-![](images/box-cox.png)
+![](https://huggingface.co/recommender-system/feature-engineering-guide/resolve/main/images/box-cox.png)
 
 **Quantile transformation** in sklearn [14] transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme. However, this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
 
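The Box-Cox transform the image denotes is available through sklearn's `PowerTransformer`; a short sketch on synthetic log-normal data (positive by construction, so no shift is needed; the shift line shows the general case):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # strictly positive, right-skewed

# Box-Cox requires strictly positive input; if zeros/negatives are possible, shift first:
shifted = skewed - skewed.min() + 1e-3

pt = PowerTransformer(method='box-cox')   # standardize=True by default (zero mean, unit variance)
transformed = pt.fit_transform(skewed)    # for log-normal data the fitted lambda is near 0
```

A fitted lambda near 0 corresponds to a plain log transform, which is exactly what log-normal data calls for.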
@@ -481,7 +481,7 @@ We can use **Q-Q plot** to check if the variable is normally distributed (a 45 d
 
 Below is an example showing the effect of sklearn's Box-Cox/Yeo-Johnson/Quantile transforms, mapping data from various distributions to a normal distribution.
 
-![](images/sphx_glr_plot_map_data_to_normal_001.png)
+![](https://huggingface.co/recommender-system/feature-engineering-guide/resolve/main/images/sphx_glr_plot_map_data_to_normal_001.png)
 
 [img source](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_map_data_to_normal.html#sphx-glr-auto-examples-preprocessing-plot-map-data-to-normal-py)
 
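The quantile transform's behavior described in this hunk can be sketched as follows (synthetic exponential data; `n_quantiles` and the seeds are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
heavy = rng.exponential(scale=2.0, size=(1000, 1))   # clearly non-normal input

qt = QuantileTransformer(n_quantiles=1000, output_distribution='normal',
                         random_state=0)
gaussianized = qt.fit_transform(heavy)

# The mapping is rank-based: monotonic (order is preserved) but non-linear,
# which is why it can distort linear correlations between features.
order = np.argsort(heavy.ravel())
```

Because only ranks survive, the marginal distribution becomes approximately N(0, 1) regardless of the input shape.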
@@ -521,7 +521,7 @@ Still take call log for example, we can have crossed features like: number of ca
 
 **Note**: An open-source Python framework named **Featuretools** that helps automatically generate such features can be found [here](https://github.com/Featuretools/featuretools).
 
-![featuretools](images/featuretools.png)
+![featuretools](https://huggingface.co/recommender-system/feature-engineering-guide/resolve/main/images/featuretools.png)
 
 Personally I haven't used it in practice; you may try it and see whether it is suitable for industrial use.
 
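For readers who prefer not to adopt Featuretools, the same kind of per-entity aggregate features can be hand-rolled with pandas; a sketch over a made-up call-log table (all column names here are hypothetical):

```python
import pandas as pd

# Hypothetical call-log table; columns invented for illustration
calls = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2],
    'duration': [30, 120, 45, 300, 60],
    'is_night': [0, 1, 0, 1, 1],
})

# Per-user aggregate features, the same kind of output Featuretools'
# Deep Feature Synthesis generates automatically across related tables
features = calls.groupby('user_id').agg(
    n_calls=('duration', 'count'),
    mean_duration=('duration', 'mean'),
    night_ratio=('is_night', 'mean'),
)
```

Each row of `features` is one user; joining it back onto a user-level table turns raw logs into model-ready columns.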
@@ -596,7 +596,7 @@ As a result, filter methods are suited for a first step quick screen and removal
 
 WOE encoding (see section 3.3.2) and IV often go hand in hand in scorecard development. Both concepts derive from logistic regression and are standard practice in the credit card industry. IV is a popular and widely used measure, as there are convenient rules of thumb for variable selection associated with IV, as below:
 
-![](images/IV.png)
+![](https://huggingface.co/recommender-system/feature-engineering-guide/resolve/main/images/IV.png)
 
 However, all these filtering methods fail to consider interactions between features and may reduce predictive power. Personally I only use variance and correlation to filter out clearly unnecessary features.
 
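The IV measure itself is short to compute; a sketch assuming the standard WOE/IV definitions (a small epsilon is added to guard empty cells, and the example data is synthetic):

```python
import numpy as np
import pandas as pd

def information_value(feature, target):
    """IV of a discrete (or pre-binned) feature against a binary target.

    Per category i:  WOE_i = ln(%events_i / %non_events_i)
                     IV    = sum_i (%events_i - %non_events_i) * WOE_i
    """
    eps = 1e-6  # avoids log(0) / division by zero for empty cells
    g = (pd.DataFrame({'x': feature, 'y': target})
           .groupby('x')['y']
           .agg(events='sum', total='count'))
    g['non_events'] = g['total'] - g['events']
    pct_e = (g['events'] + eps) / (g['events'].sum() + eps)
    pct_n = (g['non_events'] + eps) / (g['non_events'].sum() + eps)
    woe = np.log(pct_e / pct_n)
    return float(((pct_e - pct_n) * woe).sum())

# A separating feature scores high; pure noise scores near 0
strong = information_value(['a'] * 50 + ['b'] * 50,
                           [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40)
noise = information_value(['a'] * 50 + ['b'] * 50,
                          ([1] * 25 + [0] * 25) * 2)
```

Common rules of thumb then bucket the score (for instance IV below 0.02 is treated as not predictive, while values above 0.5 are usually considered suspiciously strong).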