Commit ba2eb6f
Parent(s): c2b5d4f
feat: Added advanced tabular feature engineering sections (NLP, Time-Series, Target Leakage, AutoFE)

feature-engineering/index.html CHANGED (+214 -1)
@@ -38,7 +38,10 @@
                     <li><a href="#feature-transformation" class="nav__link">🔄 Feature Transformation</a></li>
                     <li><a href="#feature-creation" class="nav__link">🛠️ Feature Creation</a></li>
                     <li><a href="#dimensionality-reduction" class="nav__link">📉 Dimensionality Reduction</a></li>
-
+                    <li><a href="#text-data" class="nav__link">📝 Text Data (NLP)</a></li>
+                    <li><a href="#time-series" class="nav__link">⏳ Time-Series</a></li>
+                    <li><a href="#target-leakage" class="nav__link">⚠️ Target Leakage</a></li>
+                    <li><a href="#automated-fe" class="nav__link">🤖 Automated FE</a></li>
                 </nav>
             </aside>

@@ -909,6 +912,216 @@ print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features.")</code></pre>
                     <li>⚠️ Losing interpretability (PCs are linear combinations)</li>
                 </ul>
             </section>
+
+            <!-- ================== 12. TEXT DATA (NLP BASICS) ==================== -->
+            <section id="text-data" class="topic-section">
+                <h2>Text Data (NLP Basics)</h2>
+                <p>Real-world tabular data often contains unstructured text (e.g., reviews, titles). Algorithms require numbers,
+                    so we must vectorize this text into numerical representations.</p>
+
+                <div class="info-card">
+                    <strong>Real Example:</strong> Converting thousands of Amazon product reviews into numeric features allows a
+                    classification model to predict positive vs. negative sentiment.
+                </div>
+
+                <h3>Mathematical Foundations</h3>
+                <div class="info-card">
+                    <strong>Bag of Words (BoW):</strong> Represents text by counting the frequency of each word, ignoring grammar
+                    and order.<br><br>
+                    <strong>TF-IDF (Term Frequency - Inverse Document Frequency):</strong><br>
+                    Penalizes frequent, uninformative words (like "the", "and") while boosting rare, meaningful words.<br><br>
+                    <div
+                        style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+                        $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
+                    </div>
+                    • $\text{TF}$: (count of term $t$ in document $d$) / (total terms in $d$)<br>
+                    • $\text{IDF}$: $\log \left( \frac{\text{Total Documents } N}{\text{Documents containing term } t} \right)$
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>text_features.py - Scikit-Learn Vectorizers</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
+import pandas as pd
+
+# Sample text column
+corpus = [
+    "Machine learning is amazing",
+    "Deep learning is the future of learning",
+    "Data science and artificial intelligence"
+]
+
+# 1. Bag of Words (CountVectorizer)
+# Creates a column for every unique word in the corpus
+vectorizer = CountVectorizer(stop_words='english')
+X_bow = vectorizer.fit_transform(corpus)
+
+# 2. TF-IDF (TfidfVectorizer)
+# Converts word counts to continuous weights (L2-normalized per document)
+tfidf = TfidfVectorizer(stop_words='english', max_features=100)
+X_tfidf = tfidf.fit_transform(corpus)
+
+# Quick way to view features as a DataFrame
+tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
+print(tfidf_df.head())</code></pre>
+                </div>
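To connect the code above back to the formula, here is a minimal pure-Python TF-IDF computation (a sketch for intuition only; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization, so its numbers will differ from this raw form):

```python
import math

corpus = [
    "machine learning is amazing",
    "deep learning is the future of learning",
    "data science and artificial intelligence",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    # Term frequency: count of term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / number of docs containing term)
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "learning" appears twice in the 7-word second document,
# and in 2 of the 3 documents overall
print(tf_idf("learning", docs[1]))
```

Note how "the" (appearing in only one document here) gets a higher IDF than "learning", which appears in two.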
+
+                <h3>Meta-Features</h3>
+                <p>Before throwing text into a vectorizer, you can extract powerful <strong>meta-features</strong> using pure
+                    Python or Pandas:</p>
+                <ul>
+                    <li><strong>Word count:</strong> <code>df['text'].apply(lambda x: len(str(x).split()))</code></li>
+                    <li><strong>Character count:</strong> <code>df['text'].apply(lambda x: len(str(x)))</code></li>
+                    <li><strong>Punctuation/capital counts:</strong> often strongly correlated with spam or fake reviews.</li>
+                </ul>
+            </section>
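The meta-feature snippets above can be run end-to-end on a toy DataFrame (the column name <code>text</code> and the sample strings are illustrative):

```python
import string

import pandas as pd

df = pd.DataFrame({"text": ["Great product!!!", "ok", "TERRIBLE. DO NOT BUY!!!"]})

# Word and character counts, exactly as in the snippets above
df["word_count"] = df["text"].apply(lambda x: len(str(x).split()))
df["char_count"] = df["text"].apply(lambda x: len(str(x)))

# Punctuation and capital-letter counts (often spam / fake-review signals)
df["punct_count"] = df["text"].apply(lambda x: sum(c in string.punctuation for c in str(x)))
df["caps_count"] = df["text"].apply(lambda x: sum(c.isupper() for c in str(x)))

print(df)
```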
+
+            <!-- ================= 13. TIME-SERIES ENGINEERING ==================== -->
+            <section id="time-series" class="topic-section">
+                <h2>Time-Series Feature Engineering</h2>
+                <p>Time-series data assumes that past values influence future values. We cannot simply shuffle rows; order
+                    matters. We must engineer features that capture chronological patterns.</p>
+
+                <h3>Mathematical Foundations</h3>
+                <div class="info-card">
+                    <strong>Lag Features:</strong> Shifting the target variable back by $t$ steps. "What were yesterday's
+                    sales?"<br>
+                    $X_{lag\_1} = Y_{t-1}$<br><br>
+                    <strong>Rolling Windows:</strong> Computing statistics over a moving window of past data. Smooths out
+                    short-term fluctuations to reveal trends.<br>
+                    • Simple Moving Average (SMA) for window $w$:
+                    <div
+                        style="background: rgba(0,0,0,0.2); padding: 15px; border-radius: 8px; text-align: center; margin: 15px 0; font-size: 1.1em; color: #e4e6eb;">
+                        $$ SMA_t = \frac{1}{w} \sum_{i=1}^{w} Y_{t-i} $$
+                    </div>
+                    <strong>Expanding Windows:</strong> Computes statistics from the very beginning of the dataset up to the
+                    current point $t$ (e.g., cumulative sum or cumulative max).
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>time_series.py - Lags and Rolling Windows</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>import pandas as pd
+
+# Assuming 'df' is sorted chronologically and indexed by Date
+# 1. Lag Features (Looking back in time)
+# What was the value 1 day ago? 7 days ago?
+df['sales_lag_1'] = df['sales'].shift(1)
+df['sales_lag_7'] = df['sales'].shift(7)
+
+# 2. Rolling Window Features
+# Average and standard deviation over the PREVIOUS 7 days.
+# shift(1) excludes today's value, matching the SMA formula above
+# and avoiding leaking the current target into its own feature.
+df['sales_rolling_mean_7d'] = df['sales'].shift(1).rolling(window=7).mean()
+df['sales_rolling_std_7d'] = df['sales'].shift(1).rolling(window=7).std()
+
+# 3. Expanding Window Features
+# Maximum sales seen so far (up to, but not including, today)
+df['sales_expanding_max'] = df['sales'].shift(1).expanding().max()
+
+# Drop NaNs generated by shifting/rolling
+df.dropna(inplace=True)</code></pre>
+                </div>
+            </section>
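Applied to a small synthetic series (the dates and sales numbers below are made up for illustration), the lag and rolling ideas look like this:

```python
import pandas as pd

# Synthetic daily sales, sorted chronologically and indexed by date
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"sales": [10, 12, 11, 13, 15, 14, 16, 18, 17, 20]}, index=idx)

# Lag feature: yesterday's value
df["sales_lag_1"] = df["sales"].shift(1)

# Rolling mean of the PREVIOUS 3 days: shift(1) first so the current
# day's value is excluded, matching the SMA formula in the text
df["sales_sma_3"] = df["sales"].shift(1).rolling(window=3).mean()

print(df.tail(3))
```

The first rows contain NaNs (there is no "yesterday" for the first observation), which is why the article's snippet ends with `dropna`.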
+
+            <!-- ===================== 14. TARGET LEAKAGE ========================= -->
+            <section id="target-leakage" class="topic-section">
+                <h2>Target Leakage (Data Leakage)</h2>
+                <p>Data Leakage occurs when information from outside the training dataset is used to create the model. This
+                    produces deceptively strong training/validation scores but a model that fails in the real world.</p>
+
+                <div class="callout callout--mistake">⚠️ The most common cause of leakage is performing feature engineering
+                    (Scaling, Imputing, TF-IDF) on the ENTIRE dataset <strong>before</strong> calling train_test_split.</div>
+
+                <div class="info-card" style="margin-top: 20px; border-left-color: #ff3366;">
+                    <h3 style="margin-top: 0; color: #ff3366;">🧠 Under the Hood: The Contamination Problem</h3>
+                    <p>Imagine using <code>StandardScaler</code> on your entire dataset. The scaler calculates the global $\mu$
+                        (mean) and $\sigma$ (standard deviation) to scale the data.</p>
+                    <p>If you split the data <em>after</em> scaling, your Training Data has been transformed using statistics
+                        computed partly from the Test Data. The Test Data is supposed to be completely unseen, but you just
+                        "leaked" its summary statistics into the training process.</p>
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>leakage.py - The Golden Rule of Fit vs Transform</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
+
+# ❌ BAD PRACTICE (Creates Leakage)
+scaler_bad = StandardScaler()
+X_scaled_bad = scaler_bad.fit_transform(X)  # Entire dataset sees the scaler
+X_train_bad, X_test_bad = train_test_split(X_scaled_bad)
+
+# ✅ GOOD PRACTICE (No Leakage)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+scaler = StandardScaler()
+# Fit ONLY on the training data to learn parameters (mean, std)
+X_train_scaled = scaler.fit_transform(X_train)
+
+# Transform test data using the parameters learned from the training data
+X_test_scaled = scaler.transform(X_test)</code></pre>
+                </div>
+                <div class="callout callout--tip">✅ The easiest way to prevent leakage in production is to package all your
+                    feature engineering steps inside a <strong>Scikit-Learn Pipeline</strong>.</div>
+            </section>
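The Pipeline tip above can be sketched as follows (the synthetic dataset and the LogisticRegression model are illustrative choices, not part of the original example). Inside cross-validation, the scaler is re-fit on each training fold only, so the held-out fold never influences the scaling parameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data for demonstration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# All preprocessing lives inside the Pipeline, so fit/transform
# boundaries are handled automatically per CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```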
+
+            <!-- ================ 15. AUTOMATED FEATURE ENGINEERING =============== -->
+            <section id="automated-fe" class="topic-section">
+                <h2>Automated Feature Engineering</h2>
+                <p>In complex, multi-table relational databases, manually creating features is incredibly tedious. Automated
+                    Feature Engineering relies on algorithms to automatically synthesize hundreds of new features from relational
+                    datasets.</p>
+
+                <h3>Deep Feature Synthesis (DFS)</h3>
+                <p>DFS stacks mathematical primitives (like computing sums, counts, averages, and time-since-last-event) across
+                    entity relationships (e.g., Customers $\xrightarrow{\text{1 to M}}$ Orders $\xrightarrow{\text{1 to M}}$
+                    Order_Items).</p>
+
+                <div class="info-card">
+                    <strong>Real Example:</strong> From a raw database of e-commerce transactions, DFS can automatically generate
+                    complex features like: <em>"The average value of a customer's orders over the last 30 days"</em> or <em>"The
+                        standard deviation of time between a user's logins."</em>
+                </div>
+
+                <div class="code-block" style="margin-top: 20px;">
+                    <div class="code-header">
+                        <span>autofe.py - Featuretools Library</span>
+                        <button class="copy-btn" onclick="copyCode(this)">Copy</button>
+                    </div>
+                    <pre><code>import featuretools as ft
+
+# Assume we have two Pandas DataFrames: clients and loans
+# Step 1: Create an EntitySet (a representation of your database)
+es = ft.EntitySet(id="banking")
+
+# Step 2: Add dataframes to the EntitySet with primary keys
+es = es.add_dataframe(dataframe_name="clients", dataframe=clients_df, index="client_id")
+es = es.add_dataframe(dataframe_name="loans", dataframe=loans_df, index="loan_id")
+
+# Step 3: Define relational joins (Foreign Keys)
+es = es.add_relationship("clients", "client_id", "loans", "client_id")
+
+# Step 4: Run Deep Feature Synthesis!
+# Automatically generates agg features for clients based on their loan history
+feature_matrix, feature_defs = ft.dfs(
+    entityset=es,
+    target_dataframe_name="clients",
+    agg_primitives=["mean", "sum", "mode", "std"],
+    trans_primitives=["month", "hour"],
+    max_depth=2  # Stacks primitives up to 2 layers deep
+)
+
+print(f"Automatically generated {len(feature_defs)} features!")</code></pre>
+                </div>
+            </section>
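What DFS automates at depth 1 can be reproduced by hand with a pandas groupby-aggregate; this sketch (table contents and the "banking" column names are invented for illustration) builds the same kind of MEAN/SUM features Featuretools would generate for the clients table:

```python
import pandas as pd

# Toy parent/child tables: each client has many loans
clients_df = pd.DataFrame({"client_id": [1, 2]})
loans_df = pd.DataFrame({
    "loan_id": [10, 11, 12],
    "client_id": [1, 1, 2],
    "amount": [1000.0, 3000.0, 500.0],
})

# Depth-1 aggregation primitives: mean and sum of each client's loan amounts
agg = loans_df.groupby("client_id")["amount"].agg(["mean", "sum"]).reset_index()
agg = agg.rename(columns={"mean": "MEAN(loans.amount)", "sum": "SUM(loans.amount)"})

# Join the aggregates back onto the target (parent) table
feature_matrix = clients_df.merge(agg, on="client_id", how="left")
print(feature_matrix)
```

DFS earns its keep beyond this: at `max_depth=2` it stacks such primitives across multiple relationships, which quickly becomes unwieldy to write manually.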
         </main>
     </div>
 