Commit ba2eb6f
Parent(s): c2b5d4f
feat: Added advanced tabular feature engineering sections (NLP, Time-Series, Target Leakage, AutoFE)

feature-engineering/index.html CHANGED (+214 -1)
@@ -38,7 +38,10 @@
                     <li><a href="#feature-transformation" class="nav__link">🔄 Feature Transformation</a></li>
                     <li><a href="#feature-creation" class="nav__link">🛠️ Feature Creation</a></li>
                     <li><a href="#dimensionality-reduction" class="nav__link">📉 Dimensionality Reduction</a></li>
-
+                    <li><a href="#text-data" class="nav__link">📝 Text Data (NLP)</a></li>
+                    <li><a href="#time-series" class="nav__link">⏳ Time-Series</a></li>
+                    <li><a href="#target-leakage" class="nav__link">⚠️ Target Leakage</a></li>
+                    <li><a href="#automated-fe" class="nav__link">🤖 Automated FE</a></li>
                 </nav>
             </aside>

@@ -909,6 +912,216 @@ print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features.")</code></pre>
                     <li>⚠️ Losing interpretability (PCs are linear combinations)</li>
                 </ul>
             </section>
+
+            <!-- ================== 12. TEXT DATA (NLP BASICS) ==================== -->
+            <section id="text-data" class="topic-section">
+                <h2>Text Data (NLP Basics)</h2>
+                <p>Real-world tabular data often contains unstructured text (e.g., reviews, titles). Algorithms require numbers,
+                    so we must vectorize this text into numerical representations.</p>
+
+                <div class="info-card">
+                    <strong>Real Example:</strong> Converting thousands of Amazon product reviews into numeric features allows a
+                    classification model to predict positive vs. negative sentiment.
+                </div>
+
+                <h3>Mathematical Foundations</h3>
+                <div class="info-card">
+                    <strong>Bag of Words (BoW):</strong> Represents text by counting the frequency of each word, ignoring grammar
+                    and order.<br><br>
+                    <strong>TF-IDF (Term Frequency - Inverse Document Frequency):</strong><br>
+                    Penalizes frequent, uninformative words (like "the", "and") while boosting rare, meaningful words.<br><br>
+                    <div
+                        style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+                        $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
+                    </div>
+                    • $\text{TF}$: (count of term $t$ in document $d$) / (total terms in $d$)<br>
+                    • $\text{IDF}$: $\log \left( \frac{\text{Total Documents } N}{\text{Documents containing term } t} \right)$
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>text_features.py - Scikit-Learn Vectorizers</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
+import pandas as pd
+
+# Sample text column
+corpus = [
+    "Machine learning is amazing",
+    "Deep learning is the future of learning",
+    "Data science and artificial intelligence"
+]
+
+# 1. Bag of Words (CountVectorizer)
+# Creates a column for every unique word in the corpus
+vectorizer = CountVectorizer(stop_words='english')
+X_bow = vectorizer.fit_transform(corpus)
+
+# 2. TF-IDF (TfidfVectorizer)
+# Converts word counts to continuous weights (L2-normalized per document)
+tfidf = TfidfVectorizer(stop_words='english', max_features=100)
+X_tfidf = tfidf.fit_transform(corpus)
+
+# Quick way to view features as a DataFrame
+tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
+print(tfidf_df.head())</code></pre>
+                </div>
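To connect the code above back to the formula, here is a minimal pure-Python TF-IDF computation (a sketch for intuition only; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization, so its numbers will differ from this raw form):

```python
import math

corpus = [
    "machine learning is amazing",
    "deep learning is the future of learning",
    "data science and artificial intelligence",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    # Term frequency: count of term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / number of docs containing term)
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "learning" appears twice in the 7-word second document,
# and in 2 of the 3 documents overall
print(tf_idf("learning", docs[1]))
```

Note how "the" (appearing in only one document here) gets a higher IDF than "learning", which appears in two.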
+
+                <h3>Meta-Features</h3>
+                <p>Before throwing text into a vectorizer, you can extract powerful <strong>meta-features</strong> using pure
+                    Python or Pandas:</p>
+                <ul>
+                    <li><strong>Word count:</strong> <code>df['text'].apply(lambda x: len(str(x).split()))</code></li>
+                    <li><strong>Character count:</strong> <code>df['text'].apply(lambda x: len(str(x)))</code></li>
+                    <li><strong>Punctuation/capital counts:</strong> often strongly correlated with spam or fake reviews.</li>
+                </ul>
+            </section>
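The meta-feature snippets above can be run end-to-end on a toy DataFrame (the column name <code>text</code> and the sample strings are illustrative):

```python
import string

import pandas as pd

df = pd.DataFrame({"text": ["Great product!!!", "ok", "TERRIBLE. DO NOT BUY!!!"]})

# Word and character counts, exactly as in the snippets above
df["word_count"] = df["text"].apply(lambda x: len(str(x).split()))
df["char_count"] = df["text"].apply(lambda x: len(str(x)))

# Punctuation and capital-letter counts (often spam / fake-review signals)
df["punct_count"] = df["text"].apply(lambda x: sum(c in string.punctuation for c in str(x)))
df["caps_count"] = df["text"].apply(lambda x: sum(c.isupper() for c in str(x)))

print(df)
```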
+
+            <!-- ================= 13. TIME-SERIES ENGINEERING ==================== -->
+            <section id="time-series" class="topic-section">
+                <h2>Time-Series Feature Engineering</h2>
+                <p>Time-series data assumes that past values influence future values. We cannot simply shuffle rows; order
+                    matters. We must engineer features that capture chronological patterns.</p>
+
+                <h3>Mathematical Foundations</h3>
+                <div class="info-card">
+                    <strong>Lag Features:</strong> Shifting the target variable back by $t$ steps. "What were yesterday's
+                    sales?"<br>
+                    $X_{lag\_1} = Y_{t-1}$<br><br>
+                    <strong>Rolling Windows:</strong> Computing statistics over a moving window of past data. Smooths out
+                    short-term fluctuations to reveal trends.<br>
+                    • Simple Moving Average (SMA) for window $w$:
+                    <div
+                        style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+                        $$ SMA_t = \frac{1}{w} \sum_{i=1}^{w} Y_{t-i} $$
+                    </div>
+                    <strong>Expanding Windows:</strong> Computes statistics from the very beginning of the dataset up to the
+                    current point $t$ (e.g., cumulative sum or cumulative max).
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>time_series.py - Lags and Rolling Windows</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>import pandas as pd
+
+# Assuming 'df' is sorted chronologically and indexed by Date
+# 1. Lag Features (Looking back in time)
+# What was the value 1 day ago? 7 days ago?
+df['sales_lag_1'] = df['sales'].shift(1)
+df['sales_lag_7'] = df['sales'].shift(7)
+
+# 2. Rolling Window Features
+# Average and standard deviation over the PREVIOUS 7 days.
+# shift(1) excludes today's value, matching the SMA formula above
+# and avoiding leaking the current target into its own feature.
+df['sales_rolling_mean_7d'] = df['sales'].shift(1).rolling(window=7).mean()
+df['sales_rolling_std_7d'] = df['sales'].shift(1).rolling(window=7).std()
+
+# 3. Expanding Window Features
+# Maximum sales seen so far (up to, but not including, today)
+df['sales_expanding_max'] = df['sales'].shift(1).expanding().max()
+
+# Drop NaNs generated by shifting/rolling
+df.dropna(inplace=True)</code></pre>
+                </div>
+            </section>
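Applied to a small synthetic series (the dates and sales numbers below are made up for illustration), the lag and rolling ideas look like this:

```python
import pandas as pd

# Synthetic daily sales, sorted chronologically and indexed by date
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"sales": [10, 12, 11, 13, 15, 14, 16, 18, 17, 20]}, index=idx)

# Lag feature: yesterday's value
df["sales_lag_1"] = df["sales"].shift(1)

# Rolling mean of the PREVIOUS 3 days: shift(1) first so the current
# day's value is excluded, matching the SMA formula in the text
df["sales_sma_3"] = df["sales"].shift(1).rolling(window=3).mean()

print(df.tail(3))
```

The first rows contain NaNs (there is no "yesterday" for the first observation), which is why the article's snippet ends with `dropna`.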
+
+            <!-- ===================== 14. TARGET LEAKAGE ========================= -->
+            <section id="target-leakage" class="topic-section">
+                <h2>Target Leakage (Data Leakage)</h2>
+                <p>Data Leakage occurs when information from outside the training dataset is used to create the model. This
+                    produces deceptively strong training/validation scores but a model that fails in the real world.</p>
+
+                <div class="callout callout--mistake">⚠️ The most common cause of leakage is performing feature engineering
+                    (Scaling, Imputing, TF-IDF) on the ENTIRE dataset <strong>before</strong> calling train_test_split.</div>
+
+                <div class="info-card" style="margin-top: 20px; border-left-color: #ff3366;">
+                    <h3 style="margin-top: 0; color: #ff3366;">🧠 Under the Hood: The Contamination Problem</h3>
+                    <p>Imagine using <code>StandardScaler</code> on your entire dataset. The scaler calculates the global $\mu$
+                        (mean) and $\sigma$ (standard deviation) to scale the data.</p>
+                    <p>If you split the data <em>after</em> scaling, your Training Data has been transformed using statistics
+                        computed partly from the Test Data. The Test Data is supposed to be completely unseen, but you just
+                        "leaked" its summary statistics into the training process.</p>
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>leakage.py - The Golden Rule of Fit vs Transform</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
+
+# ❌ BAD PRACTICE (Creates Leakage)
+scaler_bad = StandardScaler()
+X_scaled_bad = scaler_bad.fit_transform(X)  # Entire dataset sees the scaler
+X_train_bad, X_test_bad = train_test_split(X_scaled_bad)
+
+# ✅ GOOD PRACTICE (No Leakage)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+scaler = StandardScaler()
+# Fit ONLY on the training data to learn parameters (mean, std)
+X_train_scaled = scaler.fit_transform(X_train)
+
+# Transform test data using the parameters learned from the training data
+X_test_scaled = scaler.transform(X_test)</code></pre>
+                </div>
+                <div class="callout callout--tip">✅ The easiest way to prevent leakage in production is to package all your
+                    feature engineering steps inside a <strong>Scikit-Learn Pipeline</strong>.</div>
+            </section>
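The Pipeline tip above can be sketched as follows (the synthetic dataset and the LogisticRegression model are illustrative choices, not part of the original example). Inside cross-validation, the scaler is re-fit on each training fold only, so the held-out fold never influences the scaling parameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data for demonstration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# All preprocessing lives inside the Pipeline, so fit/transform
# boundaries are handled automatically per CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```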
+
+            <!-- ================ 15. AUTOMATED FEATURE ENGINEERING =============== -->
+            <section id="automated-fe" class="topic-section">
+                <h2>Automated Feature Engineering</h2>
+                <p>In complex, multi-table relational databases, manually creating features is incredibly tedious. Automated
+                    Feature Engineering relies on algorithms to automatically synthesize hundreds of new features from relational
+                    datasets.</p>
+
+                <h3>Deep Feature Synthesis (DFS)</h3>
+                <p>DFS stacks mathematical primitives (like computing sums, counts, averages, and time-since-last-event) across
+                    entity relationships (e.g., Customers $\xrightarrow{\text{1 to M}}$ Orders $\xrightarrow{\text{1 to M}}$
+                    Order_Items).</p>
+
+                <div class="info-card">
+                    <strong>Real Example:</strong> From a raw database of e-commerce transactions, DFS can automatically generate
+                    complex features like: <em>"The average value of a customer's orders over the last 30 days"</em> or <em>"The
+                        standard deviation of time between a user's logins."</em>
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>autofe.py - Featuretools Library</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>import featuretools as ft
+
+# Assume we have two Pandas DataFrames: clients and loans
+# Step 1: Create an EntitySet (a representation of your database)
+es = ft.EntitySet(id="banking")
+
+# Step 2: Add dataframes to the EntitySet with primary keys
+es = es.add_dataframe(dataframe_name="clients", dataframe=clients_df, index="client_id")
+es = es.add_dataframe(dataframe_name="loans", dataframe=loans_df, index="loan_id")
+
+# Step 3: Define relational joins (Foreign Keys)
+es = es.add_relationship("clients", "client_id", "loans", "client_id")
+
+# Step 4: Run Deep Feature Synthesis!
+# Automatically generates agg features for clients based on their loan history
+feature_matrix, feature_defs = ft.dfs(
+    entityset=es,
+    target_dataframe_name="clients",
+    agg_primitives=["mean", "sum", "mode", "std"],
+    trans_primitives=["month", "hour"],
+    max_depth=2  # Stacks primitives up to 2 layers deep
+)
+
+print(f"Automatically generated {len(feature_defs)} features!")</code></pre>
+                </div>
+            </section>
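What DFS automates at depth 1 can be reproduced by hand with a pandas groupby-aggregate; this sketch (table contents and the "banking" column names are invented for illustration) builds the same kind of MEAN/SUM features Featuretools would generate for the clients table:

```python
import pandas as pd

# Toy parent/child tables: each client has many loans
clients_df = pd.DataFrame({"client_id": [1, 2]})
loans_df = pd.DataFrame({
    "loan_id": [10, 11, 12],
    "client_id": [1, 1, 2],
    "amount": [1000.0, 3000.0, 500.0],
})

# Depth-1 aggregation primitives: mean and sum of each client's loan amounts
agg = loans_df.groupby("client_id")["amount"].agg(["mean", "sum"]).reset_index()
agg = agg.rename(columns={"mean": "MEAN(loans.amount)", "sum": "SUM(loans.amount)"})

# Join the aggregates back onto the target (parent) table
feature_matrix = clients_df.merge(agg, on="client_id", how="left")
print(feature_matrix)
```

DFS earns its keep beyond this: at `max_depth=2` it stacks such primitives across multiple relationships, which quickly becomes unwieldy to write manually.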
         </main>
     </div>
 