Spaces: Harika22/Natural_Language_Processing

Harika22 committed on Feb 1, 2025

Commit d8b57db (verified) · 1 Parent(s): 2a35973

Update pages/5_Pre-procesing_of_text.py

Files changed (1)

pages/5_Pre-procesing_of_text.py +107 -0

@@ -127,3 +127,110 @@ st.markdown(
     """,
     unsafe_allow_html=True
 )
+st.markdown("<h1 class='header-title'>🌿 Stemming 🔎</h1>", unsafe_allow_html=True)
+st.markdown(
+    """
+    <div class='info-box'>
+        <p>📝 <span class='highlight'>Stemming</span> is the process of reducing an <b>inflected word</b> to its root form, known as the <span class='highlight'>stem</span>.</p>
+        <ul>
+            <li>🔹 <span class='highlight'>Inflected word ➝ Root word (Stem)</span></li>
+            <li>⚡ The <b>stem may not always be a valid English word</b>.</li>
+            <li>🚀 <span class='highlight'>Performance is faster</span> compared to lemmatization.</li>
+            <li>⚡ It works purely by <b>suffix removal</b>; no dictionary lookup is involved.</li>
+            <li>🔹 It is a common choice for <b>retrieval systems</b> such as search, where matching word variants matters more than readable output.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True
+)
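The retrieval use case can be pictured with a small sketch. Everything here is an assumption made for illustration: `crude_stem` is a made-up suffix stripper (not Porter, Snowball, or Lancaster), and `build_index` is a hypothetical helper that maps each stem to the documents containing it.

```python
# Toy suffix stripper. The suffix list is invented for this example and is
# NOT one of the real stemming algorithms described in this page.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each stem to the ids of the documents that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(crude_stem(token), set()).add(doc_id)
    return index


docs = {1: "she walks home", 2: "they walked away", 3: "walking is healthy"}
index = build_index(docs)

# The query is stemmed the same way, so one stem matches every variant.
matches = sorted(index.get(crude_stem("walk"), set()))
print(matches)

# A stem does not have to be a real word.
print(crude_stem("makes"))
```

Because the query and the documents pass through the same stemmer, it does not matter that a stem such as `mak` is not a dictionary word; it only has to be consistent.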
+st.markdown("<h2 class='sub-header'>📌 Types of Stemming</h2>", unsafe_allow_html=True)
+st.markdown("""
+- There are **three** major types of stemming techniques:
+    - 🔹 **Porter Stemmer** 🏛️ (Rule-based, works in 5 stages)
+    - 🔹 **Snowball Stemmer** ❄️ (Rule-based, language-adaptable)
+    - 🔹 **Lancaster Stemmer** 🔁 (Iterative, aggressive removal)
+""")
+st.markdown("<h2 class='sub-header'>🏛️ Porter Stemmer</h2>", unsafe_allow_html=True)
+st.markdown(
+    """
+    <div class='info-box'>
+        <ul>
+            <li>🔹 A <b>rule-based algorithm</b> for stemming.</li>
+            <li>🔹 It matches each word against a fixed set of suffix rules.</li>
+            <li>🔹 The rules are applied in five sequential stages, each removing or rewriting a suffix, so the inflection is stripped by the end of the fifth stage.</li>
+            <li>🔹 Works <b>only for the English language</b>.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True
+)
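To make the staged, rule-based idea concrete, here is a minimal sketch of only the first sub-step (step 1a, plural suffixes) of Porter's original algorithm. The function name `porter_step_1a` is chosen for this example, and the remaining steps are deliberately left out, so this is not a complete stemmer.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of Porter's algorithm; the longest matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (left unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing)
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

Note that `ponies` becomes `poni`, which is a stem rather than an English word.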
+st.markdown("<h2 class='sub-header'>❄️ Snowball Stemmer</h2>", unsafe_allow_html=True)
+st.markdown(
+    """
+    <div class='info-box'>
+        <ul>
+            <li>🔹 An <b>advanced version of the Porter Stemmer</b>.</li>
+            <li>🔹 Can be applied to <b>multiple languages</b>.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True
+)
+st.markdown("<h2 class='sub-header'>🔁 Lancaster Stemmer</h2>", unsafe_allow_html=True)
+st.markdown(
+    """
+    <div class='info-box'>
+        <ul>
+            <li>🔹 An <b>iterative algorithm</b> for stemming.</li>
+            <li>🔹 Removes suffixes in <b>multiple iterations</b>.</li>
+            <li>⚠️ <b>More aggressive removal</b>, which can produce stems that are <b>not valid English words</b>.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True
+)
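The effect of repeated stripping can be sketched as below. This is a toy loop, not the real Lancaster rule table: the `SUFFIXES` list, the `min_len` guard, and the name `iterative_stem` are all assumptions made for this example.

```python
# Hypothetical suffix list; the real Lancaster stemmer uses a much larger,
# more carefully ordered rule table.
SUFFIXES = ("ness", "ful", "ing", "ly", "ed", "s", "e")


def iterative_stem(word: str, min_len: int = 3) -> str:
    """Keep stripping suffixes until no rule applies any more."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Only strip when at least min_len characters would remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                changed = True
                break  # restart from the first rule on the shortened word
    return word


for w in ("hopefulness", "singing"):
    print(w, "->", iterative_stem(w))
```

Each pass shortens the word, so the loop always terminates. With these toy rules, `hopefulness` is cut down step by step (`hopeful`, `hope`, `hop`), which shows how iteration makes the result more aggressive than a single pass.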
+st.markdown("<h1 class='header-title'>📖 Lemmatization 🔎</h1>", unsafe_allow_html=True)
+st.markdown(
+    """
+    <div class='info-box'>
+        <p>📝 <span class='highlight'>Lemmatization</span> is the process of reducing an <b>inflected word</b> to its dictionary form, known as the <span class='highlight'>lemma</span>.</p>
+        <ul>
+            <li>🔹 <span class='highlight'>Inflected word ➝ Root word (Lemma)</span></li>
+            <li>✅ The <b>lemma is always an actual English word</b>.</li>
+            <li>🐢 <span class='highlight'>Performance is slower</span> than stemming.</li>
+            <li>🔍 <b>Both removal &amp; dictionary-based checking</b> are performed.</li>
+            <li>📝 <b>Used when the output must stay valid, meaningful words</b>, so the grammar of the text is preserved.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True
+)
+st.markdown("<h2 class='sub-header'>📚 WordNet Lemmatizer</h2>", unsafe_allow_html=True)
+st.markdown(
+    """
+    <div class='info-box'>
+        <ul>
+            <li>🔹 Takes an <b>inflected word</b> as input.</li>
+            <li>🗄️ Looks it up in a <b>large lexical database (WordNet)</b> containing more than 150,000 English words.</li>
+            <li>🔄 <b>Applies suffix rules</b> (plus a list of irregular forms) &amp; checks each candidate:
+                <ul>
+                    <li>✔️ If a candidate is an <b>actual English word</b> in WordNet, it is returned as the lemma.</li>
+                    <li>❌ If <b>no candidate is found</b>, the input word is returned unchanged.</li>
+                </ul>
+            </li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True
+)
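One way to picture dictionary-checked suffix removal is the sketch below. It is loosely modelled on how WordNet-based lemmatizers look words up, but `VOCAB`, `EXCEPTIONS`, and `RULES` are tiny hypothetical stand-ins invented for this example, not the real WordNet data.

```python
# Tiny stand-in for the WordNet vocabulary (the real database is far larger).
VOCAB = {"study", "box", "bus", "mouse", "go"}

# Irregular forms are resolved by direct lookup.
EXCEPTIONS = {"mice": "mouse", "went": "go"}

# (suffix, replacement) rules; a small subset chosen for illustration.
RULES = (("ies", "y"), ("es", ""), ("s", ""))


def lemmatize(word: str) -> str:
    """Return a dictionary form of word, or word itself if none is found."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # The word itself is a candidate, along with every rule-rewritten form.
    candidates = [word]
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidates.append(word[: -len(suffix)] + replacement)
    # Keep only candidates that exist in the vocabulary.
    found = [c for c in candidates if c in VOCAB]
    return min(found, key=len) if found else word


for w in ("studies", "boxes", "mice", "bus", "blorfs"):
    print(w, "->", lemmatize(w))
```

Unlike a stemmer, this never returns an invented fragment: `bus` stays `bus` because `bu` is not in the vocabulary, and an unknown word such as `blorfs` comes back unchanged.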