Spaces: Harika22/Natural_Language_Processing

Harika22 committed on Feb 2, 2025

Commit 183160b (verified) · 1 Parent(s): e1792cc

Update pages/6_Feature_Engineering.py

pages/6_Feature_Engineering.py +3 -35

@@ -421,13 +421,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='step-box'>
         <ul>
             <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
             <li><strong>Convert each document into a vector:</strong> A d-dimensional representation.</li>
             <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -436,12 +434,10 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='step-box'>
         <ul>
             <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how important a word is across all documents.</li>
             <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -450,11 +446,9 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='step-box'>
         - <strong>N:</strong> Total number of documents in the corpus.<br>
         - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
         - TF-IDF helps in understanding word significance while reducing the impact of commonly used words.
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -478,13 +472,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         <ul>
             <li>TF measures how often a word appears in a document.</li>
             <li>Formula: <span class='highlight'>TF(wᵢ, dᵢ) = (Occurrences of wᵢ in dᵢ) / (Total words in dᵢ)</span></li>
             <li>TF values change based on the document.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -504,13 +496,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         <ul>
             <li>TF values always range from <strong>0 to 1</strong>.</li>
             <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
             <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -519,7 +509,6 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         <ul>
             <li>IDF measures how important a word is across the entire corpus.</li>
             <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
@@ -527,7 +516,6 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
             <li>n = Number of documents containing wᵢ.</li>
             <li>IDF values range from <strong>0 to ∞</strong>.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -536,13 +524,11 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         <ul>
             <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
             <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
             <li>TF-IDF helps reduce the impact of frequent words while keeping rare words important.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -558,9 +544,7 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         - The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -569,53 +553,45 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         <h3 style='color: #6A0572;'>📈 Case 1: High TF-IDF Values</h3>
         <ul>
             <li>If the word appears <strong>frequently</strong> in a document → <span class='highlight'>High TF-IDF</span></li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
-    <div class='box'>
         <h3 style='color: #6A0572;'>📉 Case 2: Low TF-IDF Values</h3>
         <ul>
             <li>If the word appears <strong>rarely</strong> in a document → <span class='highlight'>Low TF-IDF</span></li>
             <li>TF is always in the range: <strong>[0 - 1]</strong></li>
             <li>IDF is in the range: <strong>[0 - ∞)</strong></li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
-    <div class='box'>
         <h3 style='color: #6A0572;'>📊 Understanding TF (Term Frequency)</h3>
         <ul>
             <li>TF gives <strong>more importance</strong> to words that occur <strong>frequently</strong> in a document.</li>
             <li>As the word frequency <span class='highlight'>increases</span> → TF <span class='highlight'>increases</span>.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
-    <div class='box'>
         <h3 style='color: #6A0572;'>📉 Understanding IDF (Inverse Document Frequency)</h3>
         <ul>
             <li>IDF Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
             <li><strong>N:</strong> Total number of documents</li>
             <li><strong>n:</strong> Number of documents containing the word</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -637,14 +613,12 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
         <h3 style='color: #6A0572;'>📌 TF-IDF Calculation</h3>
         <ul>
             <li><strong>TF</strong> focuses on words <strong>frequent</strong> in a document.</li>
             <li><strong>IDF</strong> focuses on words <strong>rare</strong> in the corpus.</li>
             <li><span class='highlight'>TF-IDF is high</span> for words that appear <strong>often in a document</strong> but <strong>rarely in the corpus</strong>.</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
@@ -654,42 +628,36 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.markdown(
     """
-    <div class='box'>
-        <h3 style='color: #6A0572;'>📊 Minimum and Maximum Values of N/n</h3>
         <ul>
             <li>When <strong>n is maximum</strong> → <span class='highlight'>N/n = 1</span></li>
             <li>At <strong>training time</strong>: <span class='highlight'>1 ≤ n ≤ N</span></li>
             <li>At <strong>test time</strong>: <span class='highlight'>0 ≤ n ≤ N</span> (due to Out-of-Vocabulary words)</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
-    <div class='box'>
-        <h3 style='color: #6A0572;'>⚖️ IDF Dominance Over TF</h3>
         <ul>
             <li>If <strong>n decreases</strong> → <span class='highlight'>N/n increases (max)</span></li>
             <li>TF scale is very <span class='highlight'>small</span>, but IDF scale is very <span class='highlight'>high</span></li>
             <li>IDF can <span class='highlight'>dominate</span> TF, favoring rare words over frequent ones</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
-    <div class='box'>
-        <h3 style='color: #6A0572;'>🛠️ How Log Solves IDF Dominance?</h3>
         <ul>
             <li>Applying <span class='highlight'>log</span> reduces the dominance of IDF</li>
             <li>Logarithm <span class='highlight'>rounds off</span> values to a balanced scale</li>
             <li>It prevents bias towards rare words and maintains proportionality</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
     )
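
Note on the page text in this diff: the formulas it states, TF(wᵢ, dᵢ) = occurrences / total words, IDF(wᵢ, C) = log(N/n), and TF-IDF = TF * IDF, can be checked with a small sketch. This is not code from `pages/6_Feature_Engineering.py`. The function names, the toy corpus, and the choice to return 0.0 when n = 0 are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_frequency(word, document_tokens):
    """TF(w, d) = occurrences of w in d / total words in d."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[word] / len(document_tokens)


def inverse_document_frequency(word, corpus_tokens):
    """IDF(w, C) = log(N / n), where n is the number of documents containing w."""
    total_docs = len(corpus_tokens)
    docs_with_word = sum(1 for doc in corpus_tokens if word in doc)
    if docs_with_word == 0:
        # n = 0 (out-of-vocabulary word): log(N/0) is undefined.
        # Returning 0.0 here is an assumption of this sketch, not the page's rule.
        return 0.0
    return math.log(total_docs / docs_with_word)


def tf_idf(word, document_tokens, corpus_tokens):
    """TF-IDF = TF * IDF, as stated on the page."""
    return term_frequency(word, document_tokens) * inverse_document_frequency(
        word, corpus_tokens
    )


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for word in ("the", "cat", "mat"):
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

In this corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its TF-IDF is 0 even though it is the most frequent word in the first document. That is the "reducing the impact of commonly used words" behaviour the page describes.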

     st.markdown(
     """
         <ul>
             <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
             <li><strong>Convert each document into a vector:</strong> A d-dimensional representation.</li>
             <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <ul>
             <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how important a word is across all documents.</li>
             <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         - <strong>N:</strong> Total number of documents in the corpus.<br>
         - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
         - TF-IDF helps in understanding word significance while reducing the impact of commonly used words.
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <ul>
             <li>TF measures how often a word appears in a document.</li>
             <li>Formula: <span class='highlight'>TF(wᵢ, dᵢ) = (Occurrences of wᵢ in dᵢ) / (Total words in dᵢ)</span></li>
             <li>TF values change based on the document.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <ul>
             <li>TF values always range from <strong>0 to 1</strong>.</li>
             <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
             <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <ul>
             <li>IDF measures how important a word is across the entire corpus.</li>
             <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
             <li>n = Number of documents containing wᵢ.</li>
             <li>IDF values range from <strong>0 to ∞</strong>.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <ul>
             <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
             <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
             <li>TF-IDF helps reduce the impact of frequent words while keeping rare words important.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         - The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <h3 style='color: #6A0572;'>📈 Case 1: High TF-IDF Values</h3>
         <ul>
             <li>If the word appears <strong>frequently</strong> in a document → <span class='highlight'>High TF-IDF</span></li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <h3 style='color: #6A0572;'>📉 Case 2: Low TF-IDF Values</h3>
         <ul>
             <li>If the word appears <strong>rarely</strong> in a document → <span class='highlight'>Low TF-IDF</span></li>
             <li>TF is always in the range: <strong>[0 - 1]</strong></li>
             <li>IDF is in the range: <strong>[0 - ∞)</strong></li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <h3 style='color: #6A0572;'>📊 Understanding TF (Term Frequency)</h3>
         <ul>
             <li>TF gives <strong>more importance</strong> to words that occur <strong>frequently</strong> in a document.</li>
             <li>As the word frequency <span class='highlight'>increases</span> → TF <span class='highlight'>increases</span>.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <h3 style='color: #6A0572;'>📉 Understanding IDF (Inverse Document Frequency)</h3>
         <ul>
             <li>IDF Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
             <li><strong>N:</strong> Total number of documents</li>
             <li><strong>n:</strong> Number of documents containing the word</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
         <h3 style='color: #6A0572;'>📌 TF-IDF Calculation</h3>
         <ul>
             <li><strong>TF</strong> focuses on words <strong>frequent</strong> in a document.</li>
             <li><strong>IDF</strong> focuses on words <strong>rare</strong> in the corpus.</li>
             <li><span class='highlight'>TF-IDF is high</span> for words that appear <strong>often in a document</strong> but <strong>rarely in the corpus</strong>.</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
+        <h3 style='color: #6A0572;'> Minimum and Maximum Values of N/n</h3>
         <ul>
             <li>When <strong>n is maximum</strong> → <span class='highlight'>N/n = 1</span></li>
             <li>At <strong>training time</strong>: <span class='highlight'>1 ≤ n ≤ N</span></li>
             <li>At <strong>test time</strong>: <span class='highlight'>0 ≤ n ≤ N</span> (due to Out-of-Vocabulary words)</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
+        <h3 style='color: #6A0572;'> IDF Dominance Over TF</h3>
         <ul>
             <li>If <strong>n decreases</strong> → <span class='highlight'>N/n increases (max)</span></li>
             <li>TF scale is very <span class='highlight'>small</span>, but IDF scale is very <span class='highlight'>high</span></li>
             <li>IDF can <span class='highlight'>dominate</span> TF, favoring rare words over frequent ones</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
+        <h3 style='color: #6A0572;'>How Log Solves IDF Dominance?</h3>
         <ul>
             <li>Applying <span class='highlight'>log</span> reduces the dominance of IDF</li>
             <li>Logarithm <span class='highlight'>rounds off</span> values to a balanced scale</li>
             <li>It prevents bias towards rare words and maintains proportionality</li>
         </ul>
     """,
     unsafe_allow_html=True,
     )
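
Note on the last three blocks: the page says the logarithm "rounds off" values, but the effect is better described as compression. The raw ratio N/n grows linearly as n shrinks, while log(N/n) grows far more slowly, which keeps IDF closer to TF's [0, 1] scale. The sketch below illustrates this with a hypothetical corpus size. It also shows one common smoothed variant, log((1 + N) / (1 + n)) + 1, which scikit-learn documents for `smooth_idf=True` and which stays defined in the test-time case n = 0. Neither function is taken from the committed file.

```python
import math


def plain_idf(total_docs, docs_with_word):
    """IDF as stated on the page: log(N / n). Requires n >= 1."""
    return math.log(total_docs / docs_with_word)


def smoothed_idf(total_docs, docs_with_word):
    """Smoothed variant: log((1 + N) / (1 + n)) + 1. Defined even when n = 0."""
    return math.log((1 + total_docs) / (1 + docs_with_word)) + 1


total_docs = 1_000_000  # N, a hypothetical corpus size

# Training time: 1 <= n <= N. Compare the raw ratio with its logarithm.
for docs_with_word in (1, 10, 1_000, 1_000_000):
    raw_ratio = total_docs / docs_with_word
    print(
        f"n={docs_with_word:>9}  N/n={raw_ratio:>11.1f}  "
        f"log(N/n)={plain_idf(total_docs, docs_with_word):6.2f}"
    )

# Test time: n may be 0 for an out-of-vocabulary word; only the smoothed form is defined.
print("smoothed IDF at n=0:", round(smoothed_idf(total_docs, 0), 2))
```

With N = 1,000,000 the raw ratio spans six orders of magnitude across these values of n, while the natural log stays between 0 and roughly 13.8, so a rare word is still weighted more heavily without overwhelming TF.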