import streamlit as st

st.header(":blue[✨ Pre-processing of Text 🗺️]")

st.markdown("### 🔍 Transforming Raw Text")
st.markdown("Convert unstructured text into a clean and structured format.")

st.info(
    "📌 **We preprocess text in three key ways:**\n\n"
    "✅ Cleaning – problem-specific\n\n"
    "✅ Simple pre-processing\n\n"
    "✅ Advanced pre-processing"
)
", unsafe_allow_html=True) st.markdown("### ✨ **Essential Preprocessing Techniques:**") st.markdown("βœ… **Convert Text Case** – Convert all words to **uppercase** or **lowercase** to maintain consistency and reduce dimensions.") st.markdown("βœ… **Handle URLs and Tags** – Based on problem statement, either remove or preserve them.") st.markdown("βœ… **Mentions, Digits, Emails** – Generally removed unless required by the analysis.") st.markdown("βœ… **Preserve Emojis** – Emojis carry sentiment and play a crucial role in NLP tasks.") st.markdown("βœ… **Grammar Preservation** – If grammar is needed, avoid removing punctuation.") st.success("πŸš€ Well-structured and clean text significantly boosts ML model performance!") st.markdown("
", unsafe_allow_html=True) st.markdown("

πŸ” NLP Data Preprocessing

", unsafe_allow_html=True) st.markdown("

Transforming raw text into structured data for better ML performance

", unsafe_allow_html=True) st.success("πŸ“Œ **Benefits of Preprocessing:**\n\nβœ… Reduces dimensionality\n\nβœ… Improves ML performance\n\nβœ… Converts raw text into problem-specific structured data") st.markdown("### ✨ **Essential Preprocessing Steps:**") st.markdown( """
""", unsafe_allow_html=True, ) st.markdown("βœ… **Converting Text Case** – Reduces dimensionality; case conversion depends on problem statement.") st.markdown("βœ… **Removing URLs, Tags, and Mentions** – Retain only if required by the problem statement.") st.markdown("βœ… **Handling Emojis** – Preserve or convert emoji data based on context.") st.markdown("βœ… **Expanding Contractions & Acronyms** – Convert abbreviations into standard text.") st.markdown("βœ… **Stop Words Removal** – Optional, useful for text simplification.") st.markdown("βœ… **Stemming & Lemmatization** – Perform only if grammar is **not** crucial for analysis.") st.markdown("
", unsafe_allow_html=True) st.markdown("

πŸ” Stemming & Lemmatization πŸ’¬

", unsafe_allow_html=True) st.markdown( """

πŸ“ In English, words are often made up of three components:

βœ… Words without a suffix are called Root Words.

βœ… If a suffix is added to a root word, the resulting word is an Inflected Word:

πŸ’¬ The process of removing the suffix from inflected words to get the root word is known as:

""", unsafe_allow_html=True ) st.markdown("
st.markdown("### 🌿 Stemming 🔎")
st.markdown("📝 **Stemming** is the process of reducing an **inflected word** to its root form, known as the **stem**. The stem need not be a valid dictionary word.")

st.markdown("#### 📌 Types of Stemming")
st.markdown("""
- There are **three** major types of stemming techniques:
  - 🔹 **Porter Stemmer** 🏛️ (rule-based, works in 5 stages)
  - 🔹 **Snowball Stemmer** ❄️ (rule-based, language-adaptable)
  - 🔹 **Lancaster Stemmer** 🔥 (iterative, aggressive suffix removal)
""")
st.markdown("#### 🏛️ Porter Stemmer")
st.markdown("A rule-based stemmer that applies suffix-stripping rules in five sequential stages.")

st.markdown("#### ❄️ Snowball Stemmer")
st.markdown("An improved, language-adaptable successor to the Porter stemmer (also known as Porter2), with support for many languages.")

st.markdown("#### 🔥 Lancaster Stemmer")
st.markdown("An iterative stemmer that keeps applying rules until none match; it is the most aggressive of the three and can over-stem words.")
st.markdown("### 📖 Lemmatization 🔎")
st.markdown("📝 **Lemmatization** is the process of reducing an inflected word to its root form, known as the **lemma**. Unlike a stem, a lemma is always a valid dictionary word.")
st.markdown("#### 📚 WordNet Lemmatizer")
st.markdown("NLTK's `WordNetLemmatizer` looks words up in the WordNet lexical database; supplying the correct part of speech improves its results.")
""", unsafe_allow_html=True ) st.code(''' from nltk.corpus import stopwords from nltk.stem import PorterStemmer,LancasterStemmer,SnowballStemmer,WordNetLemmatizer from nltk.tokenize import sent_tokenize,word_tokenize def pre_process(data,col,case="lower",tags=True,url=True,mail=True,mentions=True,digits=True,dates=True,emojis=True,contraction=True,stopwordss=True,inflection="stem",stemmer="porter",punc=True): stp = stopwords.words("english") stp.remove("not") ps = PorterStemmer() ls = LancasterStemmer() sb = SnowballStemmer(language="english") wl = WordNetLemmatizer() ## emoji if emojis==True: data[col] = data[col].apply(lambda x:emoji.demojize(x,delimiters=('',''))) else: pass ## case if case == "lower": data[col]=data[col].str.lower() elif case == "upper": data[col]=data[col].str.upper() else: pass ## tags if tags==True: data[col] = data[col].apply(lambda x:re.sub("<.*?>"," ",x)) else: pass ## urls if url ==True: data[col] = data[col].apply(lambda x:re.sub("https://\S+"," ",x)) else: pass ## mails if mail ==True: data[col] = data[col].apply(lambda x:re.sub("\S+@\S+"," ",x)) else: pass ## mentions if mentions ==True: data[col] = data[col].apply(lambda x:re.sub("\B[@#]\S+"," ",x)) else: pass ## digits if mentions ==True: data[col] = data[col].apply(lambda x:re.sub("\d"," ",x)) else: pass ## dates if dates==True: data[col] = data[col].apply(lambda x:re.sub(r"^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}$"," ",x)) data[col] = data[col].apply(lambda x:re.sub(r"^[0-9]{4}\/[0-9]{1,2}\/[0-9]{1,2}$"," ",x)) else: pass ## contractions if contraction==True: data[col]= data[col].apply(lambda x:contractions.fix(x)) else: pass ## punctuations if punc == True: data[col]=data[col].apply(lambda x:re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'," ",x)) else: pass return data ''') st.markdown(''' - It'll give the pre-processed text data - We'll get the clean processed data on which we can perform feature engineering ''')