AashishAIHub commited on
Commit
ba2eb6f
·
1 Parent(s): c2b5d4f

feat: Added advanced tabular feature engineering sections (NLP, Time-Series, Target Leakage, AutoFE)

Files changed (1)
  1. feature-engineering/index.html +214 -1
feature-engineering/index.html CHANGED
@@ -38,7 +38,10 @@
38
  <li><a href="#feature-transformation" class="nav__link">🔄 Feature Transformation</a></li>
39
  <li><a href="#feature-creation" class="nav__link">🛠️ Feature Creation</a></li>
40
  <li><a href="#dimensionality-reduction" class="nav__link">📉 Dimensionality Reduction</a></li>
41
- </ul>
41
+ <li><a href="#text-data" class="nav__link">📝 Text Data (NLP)</a></li>
42
+ <li><a href="#time-series" class="nav__link">⏳ Time-Series</a></li>
43
+ <li><a href="#target-leakage" class="nav__link">⚠️ Target Leakage</a></li>
44
+ <li><a href="#automated-fe" class="nav__link">🤖 Automated FE</a></li>
45
  </nav>
46
  </aside>
47

@@ -909,6 +912,216 @@ print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features.")</code></pre>
912
  <li>⚠️ Losing interpretability (PCs are linear combinations)</li>
913
  </ul>
914
  </section>
915
+
916
+ <!-- ================== 12. TEXT DATA (NLP BASICS) ==================== -->
917
+ <section id="text-data" class="topic-section">
918
+ <h2>Text Data (NLP Basics)</h2>
919
+ <p>Real-world tabular data often contains unstructured text (e.g., reviews, titles). Algorithms require numbers,
920
+ so we must vectorize this text into numerical representations.</p>
921
+
922
+ <div class="info-card">
923
+ <strong>Real Example:</strong> Converting thousands of Amazon product reviews into numeric features allows a
924
+ classification model to predict positive vs. negative sentiment.
925
+ </div>
926
+
927
+ <h3>Mathematical Foundations</h3>
928
+ <div class="info-card">
929
+ <strong>Bag of Words (BoW):</strong> Represents text by counting the frequency of each word, ignoring grammar
930
+ and order.<br><br>
931
+ <strong>TF-IDF (Term Frequency - Inverse Document Frequency):</strong><br>
932
+ Penalizes frequent, uninformative words (like "the", "and") while boosting rare, meaningful words.<br><br>
933
+ <div
934
+ style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
935
+ $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
936
+ </div>
937
+ • $\text{TF}$: (count of term $t$ in document $d$) / (total terms in $d$)<br>
938
+ • $\text{IDF}$: $\log \left( \frac{\text{Total Documents } N}{\text{Documents containing term } t} \right)$
939
+ </div>
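The formula above can be checked with a few lines of plain Python. This is a minimal sketch of the textbook definition; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its exact values will differ:

```python
import math

docs = [
    "machine learning is amazing".split(),
    "deep learning is the future of learning".split(),
]

def tf(term, doc):
    # (count of term in document) / (total terms in document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "learning" appears in both documents -> IDF = log(2/2) = 0,
# so it carries no discriminative weight:
print(tfidf("learning", docs[1], docs))

# "deep" appears in only one document, so it scores higher:
print(round(tfidf("deep", docs[1], docs), 3))
```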
940
+
941
+ <div class="code-block" style="margin-top: 20px;">
942
+ <div class="code-header">
943
+ <span>text_features.py - Scikit-Learn Vectorizers</span>
944
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
945
+ </div>
946
+ <pre><code>from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
947
+ import pandas as pd
948
+
949
+ # Sample text column
950
+ corpus = [
951
+ "Machine learning is amazing",
952
+ "Deep learning is the future of learning",
953
+ "Data science and artificial intelligence"
954
+ ]
955
+
956
+ # 1. Bag of Words (CountVectorizer)
957
+ # Creates a column for every unique word in the corpus
958
+ vectorizer = CountVectorizer(stop_words='english')
959
+ X_bow = vectorizer.fit_transform(corpus)
960
+
961
+ # 2. TF-IDF (TfidfVectorizer)
962
+ # Converts words to continuous weights between 0 and 1
963
+ tfidf = TfidfVectorizer(stop_words='english', max_features=100)
964
+ X_tfidf = tfidf.fit_transform(corpus)
965
+
966
+ # Quick way to view features as a DataFrame
967
+ tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
968
+ print(tfidf_df.head())</code></pre>
969
+ </div>
970
+
971
+ <h3>Meta-Features</h3>
972
+ <p>Before throwing text into a vectorizer, you can extract powerful <strong>meta-features</strong> using pure
973
+ Python or Pandas:</p>
974
+ <ul>
975
+ <li><strong>Word count:</strong> <code>df['text'].apply(lambda x: len(str(x).split()))</code></li>
976
+ <li><strong>Character count:</strong> <code>df['text'].apply(lambda x: len(str(x)))</code></li>
977
+ <li><strong>Count of punctuation/capitals:</strong> often strongly correlated with spam or fake reviews.
978
+ </li>
979
+ </ul>
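Put together, the meta-features above take only a few lines of Pandas. A sketch using a hypothetical df['text'] review column:

```python
import string
import pandas as pd

# Hypothetical review column used for illustration
df = pd.DataFrame({"text": [
    "GREAT product!!! BUY NOW!!!",
    "Solid build quality, arrived on time.",
]})

# Word and character counts
df["word_count"] = df["text"].apply(lambda x: len(str(x).split()))
df["char_count"] = df["text"].apply(lambda x: len(str(x)))

# Punctuation and capital-letter counts (spam-style signals)
df["punct_count"] = df["text"].apply(
    lambda x: sum(c in string.punctuation for c in str(x)))
df["caps_count"] = df["text"].apply(
    lambda x: sum(c.isupper() for c in str(x)))

print(df[["word_count", "char_count", "punct_count", "caps_count"]])
```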
980
+ </section>
981
+
982
+ <!-- ================= 13. TIME-SERIES ENGINEERING ==================== -->
983
+ <section id="time-series" class="topic-section">
984
+ <h2>Time-Series Feature Engineering</h2>
985
+ <p>Time-series data assumes that past values influence future values. We cannot simply shuffle rows; order
986
+ matters. We must engineer features that capture chronological patterns.</p>
987
+
988
+ <h3>Mathematical Foundations</h3>
989
+ <div class="info-card">
990
+ <strong>Lag Features:</strong> Shifting the series back by a fixed number of steps $k$. "What was yesterday's
991
+ sales?"<br>
992
+ $X_{\text{lag},k} = Y_{t-k}$, e.g. $X_{\text{lag},1} = Y_{t-1}$<br><br>
993
+ <strong>Rolling Windows:</strong> Computing statistics over a moving window of past data. Smooths out
994
+ short-term fluctuations to reveal trends.<br>
995
+ • Simple Moving Average (SMA) for window $w$:
996
+ <div
997
+ style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
998
+ $$ SMA_t = \frac{1}{w} \sum_{i=1}^{w} Y_{t-i} $$
999
+ </div>
1000
+ <strong>Expanding Windows:</strong> Computes statistics from the very beginning of the dataset up to the
1001
+ current point $t$ (e.g., cumulative sum or cumulative max).
1002
+ </div>
1003
+
1004
+ <div class="code-block" style="margin-top: 20px;">
1005
+ <div class="code-header">
1006
+ <span>time_series.py - Lags and Rolling Windows</span>
1007
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
1008
+ </div>
1009
+ <pre><code>import pandas as pd
1010
+
1011
+ # Assuming 'df' is sorted chronologically and indexed by Date
1012
+ # 1. Lag Features (Looking back in time)
1013
+ # What was the value 1 day ago? 7 days ago?
1014
+ df['sales_lag_1'] = df['sales'].shift(1)
1015
+ df['sales_lag_7'] = df['sales'].shift(7)
1016
+
1017
+ # 2. Rolling Window Features
1018
+ # The average and standard deviation over the last 7 days
1019
+ df['sales_rolling_mean_7d'] = df['sales'].rolling(window=7).mean()
1020
+ df['sales_rolling_std_7d'] = df['sales'].rolling(window=7).std()
1021
+
1022
+ # 3. Expanding Window Features
1023
+ # Year-to-date maximum sales
1024
+ df['sales_expanding_max'] = df['sales'].expanding().max()
1025
+
1026
+ # Drop NaNs generated by shifting/rolling
1027
+ df.dropna(inplace=True)</code></pre>
1028
+ </div>
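Because rows cannot be shuffled, model validation also has to respect time. A minimal sketch using scikit-learn's TimeSeriesSplit (with a synthetic feature array for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for engineered features

# Forward-chaining splits: each fold trains on the past, tests on the future
tscv = TimeSeriesSplit(n_splits=4)

for train_idx, test_idx in tscv.split(X):
    # Training indices always precede test indices chronologically
    print(f"train: {train_idx.min()}-{train_idx.max()}, "
          f"test: {test_idx.min()}-{test_idx.max()}")
```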
1029
+ </section>
1030
+
1031
+ <!-- ===================== 14. TARGET LEAKAGE ========================= -->
1032
+ <section id="target-leakage" class="topic-section">
1033
+ <h2>Target Leakage (Data Leakage)</h2>
1034
+ <p>Data leakage occurs when information that would not be available at prediction time is used to train the
1035
+ model. It inflates training/validation scores, but the model fails once deployed on truly unseen data.</p>
1036
+
1037
+ <div class="callout callout--mistake">⚠️ The most common cause of leakage is performing feature engineering
1038
+ (Scaling, Imputing, TF-IDF) on the ENTIRE dataset <strong>before</strong> calling train_test_split.</div>
1039
+
1040
+ <div class="info-card" style="margin-top: 20px; border-left-color: #ff3366;">
1041
+ <h3 style="margin-top: 0; color: #ff3366;">🧠 Under the Hood: The Contamination Problem</h3>
1042
+ <p>Imagine using <code>StandardScaler</code> on your entire dataset. The scaler calculates the global $\mu$
1043
+ (mean) and $\sigma$ (standard deviation) to scale the data.</p>
1044
+ <p>If you split the data <em>after</em> scaling, your training data has been transformed using statistics that
1045
+ include the test rows. The test set is supposed to be completely unseen, but you have just "leaked" its summary
1046
+ statistics into the training process.</p>
1047
+ </div>
1048
+
1049
+ <div class="code-block" style="margin-top: 20px;">
1050
+ <div class="code-header">
1051
+ <span>leakage.py - The Golden Rule of Fit vs Transform</span>
1052
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
1053
+ </div>
1054
+ <pre><code>from sklearn.model_selection import train_test_split
1055
+ from sklearn.preprocessing import StandardScaler
1056
+
1057
+ # ❌ BAD PRACTICE (Creates Leakage)
1058
+ scaler_bad = StandardScaler()
1059
+ X_scaled_bad = scaler_bad.fit_transform(X) # Entire dataset sees the scaler
1060
+ X_train_bad, X_test_bad = train_test_split(X_scaled_bad)
1061
+
1062
+ # ✅ GOOD PRACTICE (No Leakage)
1063
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
1064
+
1065
+ scaler = StandardScaler()
1066
+ # Fit ONLY on the training data to learn parameters (mean, std)
1067
+ X_train_scaled = scaler.fit_transform(X_train)
1068
+
1069
+ # Transform test data using the parameters learned from the training data
1070
+ X_test_scaled = scaler.transform(X_test)</code></pre>
1071
+ </div>
1072
+ <div class="callout callout--tip">✅ The most reliable way to prevent leakage in practice is to package
1073
+ all your feature engineering steps inside a <strong>Scikit-Learn Pipeline</strong>, so every transformer is
+ fitted only on training data.</div>
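As a sketch of that idea (using a synthetic dataset for illustration): inside a Pipeline, the scaler is refitted on each cross-validation training fold, so the held-out fold never influences the learned mean and standard deviation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Scaler + model travel together; fitting only ever sees training data
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```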
1074
+ </section>
1075
+
1076
+ <!-- ================ 15. AUTOMATED FEATURE ENGINEERING =============== -->
1077
+ <section id="automated-fe" class="topic-section">
1078
+ <h2>Automated Feature Engineering</h2>
1079
+ <p>In complex, multi-table relational databases, manually creating features is tedious and error-prone.
1080
+ Automated Feature Engineering uses algorithms to synthesize hundreds of candidate features directly from the
1081
+ relational schema.</p>
1082
+
1083
+ <h3>Deep Feature Synthesis (DFS)</h3>
1084
+ <p>DFS stacks mathematical primitives (like computing sums, counts, averages, and time-since-last-event) across
1085
+ entity relationships (e.g., Customers $\xrightarrow{\text{1 to M}}$ Orders $\xrightarrow{\text{1 to M}}$
1086
+ Order_Items).</p>
1087
+
1088
+ <div class="info-card">
1089
+ <strong>Real Example:</strong> From a raw database of e-commerce transactions, DFS can automatically generate
1090
+ complex features like: <em>"The average value of a customer's orders over the last 30 days"</em> or <em>"The
1091
+ standard deviation of time between a user's logins."</em>
1092
+ </div>
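Under the hood, a depth-1 aggregation primitive is essentially a groupby. A hedged pandas equivalent of the featuretools feature MEAN(loans.loan_amount), with hypothetical toy data:

```python
import pandas as pd

# Toy relational data: one client row, many loan rows
clients_df = pd.DataFrame({"client_id": [1, 2]})
loans_df = pd.DataFrame({
    "loan_id": [10, 11, 12],
    "client_id": [1, 1, 2],
    "loan_amount": [1000.0, 3000.0, 500.0],
})

# The DFS aggregation "MEAN(loans.loan_amount)" is a groupby plus a merge
agg = (loans_df.groupby("client_id", as_index=False)["loan_amount"]
       .mean()
       .rename(columns={"loan_amount": "mean_loan_amount"}))
feature_matrix = clients_df.merge(agg, on="client_id", how="left")

print(feature_matrix)
# DFS automates this pattern across every primitive/column/relationship combo
```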
1093
+
1094
+ <div class="code-block" style="margin-top: 20px;">
1095
+ <div class="code-header">
1096
+ <span>autofe.py - Featuretools Library</span>
1097
+ <button class="copy-btn" onclick="copyCode(this)">Copy</button>
1098
+ </div>
1099
+ <pre><code>import featuretools as ft
1100
+
1101
+ # Assume we have two Pandas DataFrames: clients_df and loans_df
1102
+ # Step 1: Create an EntitySet (a representation of your database)
1103
+ es = ft.EntitySet(id="banking")
1104
+
1105
+ # Step 2: Add dataframes to the EntitySet with primary keys
1106
+ es = es.add_dataframe(dataframe_name="clients", dataframe=clients_df, index="client_id")
1107
+ es = es.add_dataframe(dataframe_name="loans", dataframe=loans_df, index="loan_id")
1108
+
1109
+ # Step 3: Define relational joins (Foreign Keys)
1110
+ es = es.add_relationship("clients", "client_id", "loans", "client_id")
1111
+
1112
+ # Step 4: Run Deep Feature Synthesis!
1113
+ # Automatically generates agg features for clients based on their loans history
1114
+ feature_matrix, feature_defs = ft.dfs(
1115
+ entityset=es,
1116
+ target_dataframe_name="clients",
1117
+ agg_primitives=["mean", "sum", "mode", "std"],
1118
+ trans_primitives=["month", "hour"],
1119
+ max_depth=2 # Stacks primitives up to 2 layers deep
1120
+ )
1121
+
1122
+ print(f"Automatically generated {len(feature_defs)} features!")</code></pre>
1123
+ </div>
1124
+ </section>
1125
  </main>
1126
  </div>
1127