import streamlit as st
st.header(":blue[✨ Pre-processing of Text 🗺️]")
st.subheader("🚀 Transforming Raw Text")
st.markdown("Convert unstructured text into a clean and structured format.")
st.info("📌 **We preprocess text in three key ways:**\n\n✅ Cleaning - Problem-specific\n\n✅ Simple Pre-processing\n\n✅ Advanced Pre-processing")
st.markdown("### ✨ **Essential Preprocessing Techniques:**")
st.markdown("✅ **Convert Text Case** → Convert all words to **uppercase** or **lowercase** to maintain consistency and reduce dimensions.")
st.markdown("✅ **Handle URLs and Tags** → Based on the problem statement, either remove or preserve them.")
st.markdown("✅ **Mentions, Digits, Emails** → Generally removed unless required by the analysis.")
st.markdown("✅ **Preserve Emojis** → Emojis carry sentiment and play a crucial role in NLP tasks.")
st.markdown("✅ **Grammar Preservation** → If grammar is needed, avoid removing punctuation.")
st.success("🚀 Well-structured and clean text significantly boosts ML model performance!")
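# A minimal, standalone sketch of the cleaning rules above, using only the standard-library
# `re` module. The sample sentence and exact regexes here are illustrative, not from the
# original page.

```python
# Sketch: lowercase a sentence, then strip URLs, @mentions/#hashtags, and digits with re.sub.
import re

text = "Check https://example.com NOW!! Thanks @alice, see you on 12/31/2024 :)"
text = text.lower()                        # case conversion
text = re.sub(r"https?://\S+", " ", text)  # remove URLs
text = re.sub(r"\B[@#]\S+", " ", text)     # remove @mentions / #hashtags
text = re.sub(r"\d", " ", text)            # remove digits
text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
print(text)  # -> check now!! thanks see you on / / :)
```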
st.markdown("", unsafe_allow_html=True)
st.subheader("📊 NLP Data Preprocessing")
st.markdown("Transforming raw text into structured data for better ML performance.")
st.success("📌 **Benefits of Preprocessing:**\n\n✅ Reduces dimensionality\n\n✅ Improves ML performance\n\n✅ Converts raw text into problem-specific structured data")
st.markdown("### ✨ **Essential Preprocessing Steps:**")
st.markdown("✅ **Converting Text Case** → Reduces dimensionality; the choice of case depends on the problem statement.")
st.markdown("✅ **Removing URLs, Tags, and Mentions** → Retain them only if required by the problem statement.")
st.markdown("✅ **Handling Emojis** → Preserve or convert emoji data based on context.")
st.markdown("✅ **Expanding Contractions & Acronyms** → Convert abbreviations into standard text.")
st.markdown("✅ **Stop Words Removal** → Optional; useful for text simplification.")
st.markdown("✅ **Stemming & Lemmatization** → Perform only if grammar is **not** crucial for analysis.")
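# A hypothetical sketch of two of the steps above: expanding contractions and removing stop
# words. The tiny hand-made map and stop-word set here are illustrative only; a real pipeline
# would use the `contractions` package and `nltk.corpus.stopwords` instead.

```python
# Sketch: expand contractions via a small lookup map, then drop stop words ("not" is kept,
# since it flips sentiment).
CONTRACTIONS = {"don't": "do not", "it's": "it is"}  # tiny illustrative map
STOP_WORDS = {"do", "it", "is", "the", "a"}          # tiny illustrative list

def simplify(sentence):
    words = sentence.lower().split()
    # replace each contraction with its expansion, re-splitting the expanded text
    words = [w for token in words for w in CONTRACTIONS.get(token, token).split()]
    return " ".join(w for w in words if w not in STOP_WORDS)

print(simplify("It's the movie I don't like"))  # -> movie i not like
```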
st.markdown("### 📖 Root Words & Inflected Words")
st.markdown(
"""
📖 In English, words are often made up of three components:
- 🔹 Prefix + Word + Suffix

✅ Words without a suffix are called **Root Words**.

✅ If a suffix is added to a root word, the resulting word is an **Inflected Word**:
- 🛠️ Root Word + Suffix = Inflected Word

🔬 The process of removing the suffix from an inflected word to recover the root word is known as:
- ✂️ Stemming
- 🧠 Lemmatization
""",
unsafe_allow_html=True
)
st.markdown("### ✂️ Stemming")
st.markdown(
"""
📖 Stemming is the process of reducing an **inflected word** to its root form, known as the **stem**.
- 🔹 Inflected word → Root word (Stem)
- ⚡ The **stem may not always be a valid English word**.
- 🚀 Performance is faster compared to lemmatization.
- ⚡ It performs **suffix removal only**, with no dictionary check.
- 🔹 Stemming is the usual choice for **retrieval systems** (e.g. search engines).
""",
unsafe_allow_html=True
)
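# A quick sketch (assuming NLTK is installed) showing the point above: the stem a
# PorterStemmer returns is often not a valid English word.

```python
# Sketch: Porter stemming produces stems like "studi" and "fli" that are not dictionary words.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["running", "studies", "flies"]:
    print(word, "->", ps.stem(word))  # run, studi, fli
```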
st.markdown("### 🔄 Types of Stemmers")
st.markdown("""
- There are **three** major types of stemming techniques:
    - 🔹 **Porter Stemmer** 🏛️ (Rule-based, works in 5 stages)
    - 🔹 **Snowball Stemmer** ❄️ (Rule-based, language-adaptable)
    - 🔹 **Lancaster Stemmer** 🔄 (Iterative, aggressive removal)
""")
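# A small comparison sketch (assuming NLTK is installed) of the three stemmers on the same
# word, illustrating how much more aggressive the Lancaster stemmer is.

```python
# Sketch: Porter and Snowball leave "maximum" untouched; Lancaster strips it to "maxim".
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

word = "maximum"
print(PorterStemmer().stem(word))                      # maximum
print(SnowballStemmer(language="english").stem(word))  # maximum
print(LancasterStemmer().stem(word))                   # maxim
```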
st.markdown("### 🏛️ Porter Stemmer")
st.markdown(
"""
- 🔹 A rule-based algorithm for stemming.
- 🔹 Each word is matched against a set of suffix rules.
- 🔹 Suffixes are removed in up to 5 successive stages, until the inflection is gone.
- 🔹 Works only for the English language.
""",
unsafe_allow_html=True
)
st.markdown("### ❄️ Snowball Stemmer")
st.markdown(
"""
- 🔹 An advanced version of the Porter Stemmer.
- 🔹 Can be applied to multiple languages.
""",
unsafe_allow_html=True
)
st.markdown("### 🔄 Lancaster Stemmer")
st.markdown(
"""
- 🔹 An iterative algorithm for stemming.
- 🔹 Removes suffixes over multiple iterations.
- ⚠️ More aggressive removal, which might result in non-English words.
""",
unsafe_allow_html=True
)
st.markdown("### 🧠 Lemmatization")
st.markdown(
"""
📖 Lemmatization is the process of reducing an inflected word to its root form, known as the **lemma**.
- 🔹 Inflected word → Root word (Lemma)
- ✅ The lemma is always an actual English word.
- 🐢 Performance is slower than stemming.
- 🔁 Both suffix removal & dictionary-based checking are performed.
- 📖 Used when we need to preserve grammar in text.
""",
unsafe_allow_html=True
)
st.markdown("### 🔁 How Lemmatization Works")
st.markdown(
"""
- 🔹 Takes an inflected word as input.
- 🗂️ Looks it up in a large dictionary (WordNet) of English words.
- 🔁 Iteratively removes suffixes & checks:
    - ✔️ If the result is an actual English word, it continues removing more suffixes.
    - ❌ If it is not an English word, the last valid root word is returned as the lemma.
""",
unsafe_allow_html=True
)
st.code('''
import re
import emoji
import contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

def pre_process(data, col, case="lower", tags=True, url=True, mail=True,
                mentions=True, digits=True, dates=True, emojis=True,
                contraction=True, stop_words=True, inflection="stem",
                stemmer="porter", punc=True):
    stp = stopwords.words("english")
    stp.remove("not")  # keep "not": it flips sentiment
    stemmers = {"porter": PorterStemmer(),
                "lancaster": LancasterStemmer(),
                "snowball": SnowballStemmer(language="english")}
    wl = WordNetLemmatizer()
    ## emojis: convert to their text names so the sentiment they carry survives
    if emojis:
        data[col] = data[col].apply(lambda x: emoji.demojize(x, delimiters=("", "")))
    ## case conversion
    if case == "lower":
        data[col] = data[col].str.lower()
    elif case == "upper":
        data[col] = data[col].str.upper()
    ## HTML tags
    if tags:
        data[col] = data[col].apply(lambda x: re.sub(r"<.*?>", " ", x))
    ## URLs
    if url:
        data[col] = data[col].apply(lambda x: re.sub(r"https?://\S+", " ", x))
    ## e-mail addresses
    if mail:
        data[col] = data[col].apply(lambda x: re.sub(r"\S+@\S+", " ", x))
    ## @mentions and #hashtags
    if mentions:
        data[col] = data[col].apply(lambda x: re.sub(r"\B[@#]\S+", " ", x))
    ## digits (the original checked `mentions` here by mistake)
    if digits:
        data[col] = data[col].apply(lambda x: re.sub(r"\d", " ", x))
    ## dates such as 12/31/2024 or 2024/12/31
    if dates:
        data[col] = data[col].apply(lambda x: re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", " ", x))
        data[col] = data[col].apply(lambda x: re.sub(r"\b\d{4}/\d{1,2}/\d{1,2}\b", " ", x))
    ## contractions: don't -> do not
    if contraction:
        data[col] = data[col].apply(lambda x: contractions.fix(x))
    ## stop words
    if stop_words:
        data[col] = data[col].apply(
            lambda x: " ".join(w for w in word_tokenize(x) if w not in stp))
    ## stemming / lemmatization (the original built these objects but never applied them)
    if inflection == "stem":
        stem = stemmers.get(stemmer, stemmers["porter"]).stem
        data[col] = data[col].apply(lambda x: " ".join(stem(w) for w in word_tokenize(x)))
    elif inflection == "lemma":
        data[col] = data[col].apply(lambda x: " ".join(wl.lemmatize(w) for w in word_tokenize(x)))
    ## punctuation
    if punc:
        data[col] = data[col].apply(lambda x: re.sub(r"[^\w\s]", " ", x))
    return data
''')
st.markdown('''
- The function above returns the pre-processed text data.
- The resulting clean, processed data is ready for feature engineering.
''')
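# A hypothetical end-to-end sketch of the pipeline shown above, condensed into a single
# function using only the standard library (a real run would use pandas, emoji,
# contractions, and nltk as in the code block). The sample sentence is illustrative.

```python
# Sketch: chain the cleaning steps (tags, URLs, e-mails, mentions, digits, punctuation)
# in the same order as the pre_process function displayed on this page.
import re

def clean(text):
    text = text.lower()                       # case conversion
    text = re.sub(r"<.*?>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text) # URLs
    text = re.sub(r"\S+@\S+", " ", text)      # e-mail addresses
    text = re.sub(r"\B[@#]\S+", " ", text)    # mentions / hashtags
    text = re.sub(r"\d", " ", text)           # digits
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean("<b>Great!</b> Email me at a@b.com or visit https://x.io #nlp 2024"))
# -> great email me at or visit
```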