import streamlit as st
st.header(":blue[✨ Pre-processing of Text 🗺️]")
st.subheader("🚀 Transforming Raw Text")
st.markdown("Convert unstructured text into a clean and structured format.")
st.info("📌 **We preprocess text in three key ways:**\n\n✅ Cleaning - Problem-specific\n\n✅ Simple Pre-processing\n\n✅ Advanced Pre-processing")
st.markdown("### ✨ **Essential Preprocessing Techniques:**")
st.markdown("✅ **Convert Text Case** → Convert all words to **uppercase** or **lowercase** to maintain consistency and reduce dimensions.")
st.markdown("✅ **Handle URLs and Tags** → Based on the problem statement, either remove or preserve them.")
st.markdown("✅ **Mentions, Digits, Emails** → Generally removed unless required by the analysis.")
st.markdown("✅ **Preserve Emojis** → Emojis carry sentiment and play a crucial role in NLP tasks.")
st.markdown("✅ **Grammar Preservation** → If grammar is needed, avoid removing punctuation.")
st.success("🚀 Well-structured and clean text significantly boosts ML model performance!")
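# A minimal, standalone sketch of the cleaning rules above, using only the standard-library
# `re` module. The sample sentence and exact regexes here are illustrative, not from the
# original page.

```python
# Sketch: lowercase a sentence, then strip URLs, @mentions/#hashtags, and digits with re.sub.
import re

text = "Check https://example.com NOW!! Thanks @alice, see you on 12/31/2024 :)"
text = text.lower()                        # case conversion
text = re.sub(r"https?://\S+", " ", text)  # remove URLs
text = re.sub(r"\B[@#]\S+", " ", text)     # remove @mentions / #hashtags
text = re.sub(r"\d", " ", text)            # remove digits
text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
print(text)  # -> check now!! thanks see you on / / :)
```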
st.markdown("", unsafe_allow_html=True)
st.subheader("📊 NLP Data Preprocessing")
st.markdown("Transforming raw text into structured data for better ML performance.")
st.success("📌 **Benefits of Preprocessing:**\n\n✅ Reduces dimensionality\n\n✅ Improves ML performance\n\n✅ Converts raw text into problem-specific structured data")
st.markdown("### ✨ **Essential Preprocessing Steps:**")
st.markdown("✅ **Converting Text Case** → Reduces dimensionality; the choice of case depends on the problem statement.")
st.markdown("✅ **Removing URLs, Tags, and Mentions** → Retain them only if required by the problem statement.")
st.markdown("✅ **Handling Emojis** → Preserve or convert emoji data based on context.")
st.markdown("✅ **Expanding Contractions & Acronyms** → Convert abbreviations into standard text.")
st.markdown("✅ **Stop Words Removal** → Optional; useful for text simplification.")
st.markdown("✅ **Stemming & Lemmatization** → Perform only if grammar is **not** crucial for analysis.")
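# A hypothetical sketch of two of the steps above: expanding contractions and removing stop
# words. The tiny hand-made map and stop-word set here are illustrative only; a real pipeline
# would use the `contractions` package and `nltk.corpus.stopwords` instead.

```python
# Sketch: expand contractions via a small lookup map, then drop stop words ("not" is kept,
# since it flips sentiment).
CONTRACTIONS = {"don't": "do not", "it's": "it is"}  # tiny illustrative map
STOP_WORDS = {"do", "it", "is", "the", "a"}          # tiny illustrative list

def simplify(sentence):
    words = sentence.lower().split()
    # replace each contraction with its expansion, re-splitting the expanded text
    words = [w for token in words for w in CONTRACTIONS.get(token, token).split()]
    return " ".join(w for w in words if w not in STOP_WORDS)

print(simplify("It's the movie I don't like"))  # -> movie i not like
```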
st.markdown("### 📖 Root Words & Inflected Words")
st.markdown(
"""
📖 In English, words are often made up of three components:
- 🔹 Prefix + Word + Suffix

✅ Words without a suffix are called **Root Words**.

✅ If a suffix is added to a root word, the resulting word is an **Inflected Word**:
- 🛠️ Root Word + Suffix = Inflected Word

🔬 The process of removing the suffix from an inflected word to recover the root word is known as:
- ✂️ Stemming
- 🧠 Lemmatization
""",
unsafe_allow_html=True
)
st.markdown("### ✂️ Stemming")
st.markdown(
"""
📖 Stemming is the process of reducing an **inflected word** to its root form, known as the **stem**.
- 🔹 Inflected word → Root word (Stem)
- ⚡ The **stem may not always be a valid English word**.
- 🚀 Performance is faster compared to lemmatization.
- ⚡ It performs **suffix removal only**, with no dictionary check.
- 🔹 Stemming is the usual choice for **retrieval systems** (e.g. search engines).
""",
unsafe_allow_html=True
)
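# A quick sketch (assuming NLTK is installed) showing the point above: the stem a
# PorterStemmer returns is often not a valid English word.

```python
# Sketch: Porter stemming produces stems like "studi" and "fli" that are not dictionary words.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["running", "studies", "flies"]:
    print(word, "->", ps.stem(word))  # run, studi, fli
```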
st.markdown("### 🔄 Types of Stemmers")
st.markdown("""
- There are **three** major types of stemming techniques:
    - 🔹 **Porter Stemmer** 🏛️ (Rule-based, works in 5 stages)
    - 🔹 **Snowball Stemmer** ❄️ (Rule-based, language-adaptable)
    - 🔹 **Lancaster Stemmer** 🔄 (Iterative, aggressive removal)
""")
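# A small comparison sketch (assuming NLTK is installed) of the three stemmers on the same
# word, illustrating how much more aggressive the Lancaster stemmer is.

```python
# Sketch: Porter and Snowball leave "maximum" untouched; Lancaster strips it to "maxim".
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

word = "maximum"
print(PorterStemmer().stem(word))                      # maximum
print(SnowballStemmer(language="english").stem(word))  # maximum
print(LancasterStemmer().stem(word))                   # maxim
```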
st.markdown("### 🏛️ Porter Stemmer")
st.markdown(
"""
- 🔹 A rule-based algorithm for stemming.
- 🔹 Each word is matched against a set of suffix rules.
- 🔹 Suffixes are removed in up to 5 successive stages, until the inflection is gone.
- 🔹 Works only for the English language.
""",
unsafe_allow_html=True
)
st.markdown("### ❄️ Snowball Stemmer")
st.markdown(
"""
- 🔹 An advanced version of the Porter Stemmer.
- 🔹 Can be applied to multiple languages.
""",
unsafe_allow_html=True
)
st.markdown("### 🔄 Lancaster Stemmer")
st.markdown(
"""
- 🔹 An iterative algorithm for stemming.
- 🔹 Removes suffixes over multiple iterations.
- ⚠️ More aggressive removal, which might result in non-English words.
""",
unsafe_allow_html=True
)
st.markdown("### 🧠 Lemmatization")
st.markdown(
"""
📖 Lemmatization is the process of reducing an inflected word to its root form, known as the **lemma**.
- 🔹 Inflected word → Root word (Lemma)
- ✅ The lemma is always an actual English word.
- 🐢 Performance is slower than stemming.
- 🔁 Both suffix removal & dictionary-based checking are performed.
- 📖 Used when we need to preserve grammar in text.
""",
unsafe_allow_html=True
)
st.markdown("### 🔁 How Lemmatization Works")
st.markdown(
"""
- 🔹 Takes an inflected word as input.
- 🗂️ Looks it up in a large dictionary (WordNet) of English words.
- 🔁 Iteratively removes suffixes & checks:
    - ✔️ If the result is an actual English word, it continues removing more suffixes.
    - ❌ If it is not an English word, the last valid root word is returned as the lemma.
""",
unsafe_allow_html=True
)
st.code('''
import re
import emoji
import contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

def pre_process(data, col, case="lower", tags=True, url=True, mail=True,
                mentions=True, digits=True, dates=True, emojis=True,
                contraction=True, stop_words=True, inflection="stem",
                stemmer="porter", punc=True):
    stp = stopwords.words("english")
    stp.remove("not")  # keep "not": it flips sentiment
    stemmers = {"porter": PorterStemmer(),
                "lancaster": LancasterStemmer(),
                "snowball": SnowballStemmer(language="english")}
    wl = WordNetLemmatizer()
    ## emojis: convert to their text names so the sentiment they carry survives
    if emojis:
        data[col] = data[col].apply(lambda x: emoji.demojize(x, delimiters=("", "")))
    ## case conversion
    if case == "lower":
        data[col] = data[col].str.lower()
    elif case == "upper":
        data[col] = data[col].str.upper()
    ## HTML tags
    if tags:
        data[col] = data[col].apply(lambda x: re.sub(r"<.*?>", " ", x))
    ## URLs
    if url:
        data[col] = data[col].apply(lambda x: re.sub(r"https?://\S+", " ", x))
    ## e-mail addresses
    if mail:
        data[col] = data[col].apply(lambda x: re.sub(r"\S+@\S+", " ", x))
    ## @mentions and #hashtags
    if mentions:
        data[col] = data[col].apply(lambda x: re.sub(r"\B[@#]\S+", " ", x))
    ## digits (the original checked `mentions` here by mistake)
    if digits:
        data[col] = data[col].apply(lambda x: re.sub(r"\d", " ", x))
    ## dates such as 12/31/2024 or 2024/12/31
    if dates:
        data[col] = data[col].apply(lambda x: re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", " ", x))
        data[col] = data[col].apply(lambda x: re.sub(r"\b\d{4}/\d{1,2}/\d{1,2}\b", " ", x))
    ## contractions: don't -> do not
    if contraction:
        data[col] = data[col].apply(lambda x: contractions.fix(x))
    ## stop words
    if stop_words:
        data[col] = data[col].apply(
            lambda x: " ".join(w for w in word_tokenize(x) if w not in stp))
    ## stemming / lemmatization (the original built these objects but never applied them)
    if inflection == "stem":
        stem = stemmers.get(stemmer, stemmers["porter"]).stem
        data[col] = data[col].apply(lambda x: " ".join(stem(w) for w in word_tokenize(x)))
    elif inflection == "lemma":
        data[col] = data[col].apply(lambda x: " ".join(wl.lemmatize(w) for w in word_tokenize(x)))
    ## punctuation
    if punc:
        data[col] = data[col].apply(lambda x: re.sub(r"[^\w\s]", " ", x))
    return data
''')
st.markdown('''
- The function above returns the pre-processed text data.
- The resulting clean, processed data is ready for feature engineering.
''')
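# A hypothetical end-to-end sketch of the pipeline shown above, condensed into a single
# function using only the standard library (a real run would use pandas, emoji,
# contractions, and nltk as in the code block). The sample sentence is illustrative.

```python
# Sketch: chain the cleaning steps (tags, URLs, e-mails, mentions, digits, punctuation)
# in the same order as the pre_process function displayed on this page.
import re

def clean(text):
    text = text.lower()                       # case conversion
    text = re.sub(r"<.*?>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text) # URLs
    text = re.sub(r"\S+@\S+", " ", text)      # e-mail addresses
    text = re.sub(r"\B[@#]\S+", " ", text)    # mentions / hashtags
    text = re.sub(r"\d", " ", text)           # digits
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean("<b>Great!</b> Email me at a@b.com or visit https://x.io #nlp 2024"))
# -> great email me at or visit
```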