# Natural_Language_Processing/pages/5_Pre-processing_of_text.py
import streamlit as st
st.markdown(
"""
<style>
body {
background-color: #f9f9f9; /* Light gray background */
font-family: 'Arial', sans-serif;
}
@keyframes fadeIn {
0% { opacity: 0; transform: translateY(-20px); }
100% { opacity: 1; transform: translateY(0); }
}
.title {
text-align: center;
color: #2c3e50; /* Deep gray-blue */
font-size: 3rem;
font-weight: bold;
animation: fadeIn 1s ease-in-out;
}
.caption {
text-align: center;
font-style: italic;
font-size: 1.2rem;
color: #7f8c8d; /* Soft gray */
animation: fadeIn 1.5s ease-in-out;
}
.section {
font-size: 1.1rem;
text-align: justify;
line-height: 1.8;
color: #34495e; /* Muted gray */
background: #ffffff; /* White card-style background */
padding: 20px;
border-radius: 10px;
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
animation: fadeIn 2s ease-in-out;
margin: 10px 0;
}
.image-container {
text-align: center;
margin: 20px 0;
animation: fadeIn 2.5s ease-in-out;
}
.image-container img {
border-radius: 15px;
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
transition: transform 0.3s ease-in-out;
}
.image-container img:hover {
transform: scale(1.05); /* Subtle zoom effect */
}
</style>
""",
unsafe_allow_html=True,
)
st.header(":blue[✨ Pre-processing of Text 🗺️]")
st.markdown("<div class='section'>", unsafe_allow_html=True)
st.markdown("<h2 class='title'>🔍 Transforming Raw Text</h2>", unsafe_allow_html=True)
st.markdown("<p class='subtitle'>Convert unstructured text into a clean and structured format</p>", unsafe_allow_html=True)
st.info("📌 **We preprocess text in three key ways:**\n\n✅ Cleaning - Problem-specific\n\n✅ Simple Pre-processing\n\n✅ Advanced Pre-processing")
st.markdown("</div>", unsafe_allow_html=True)
st.markdown("### ✨ **Essential Preprocessing Techniques:**")
st.markdown("✅ **Convert Text Case** – Convert all text to a single case (**uppercase** or **lowercase**) so that variants of the same word map to one token, which reduces dimensionality.")
st.markdown("✅ **Handle URLs and Tags** – Remove or preserve them, depending on the problem statement.")
st.markdown("✅ **Mentions, Digits, Emails** – Generally removed unless the analysis requires them.")
st.markdown("✅ **Preserve Emojis** – Emojis often carry sentiment, so they matter in tasks such as sentiment analysis.")
st.markdown("✅ **Grammar Preservation** – If grammar matters for the task, avoid removing punctuation.")
st.success("🚀 Clean, well-structured text generally improves ML model performance!")
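# A minimal sketch of the cleaning steps listed above, using only the standard
# library. The helper name `basic_clean` and its regular expressions are
# illustrative assumptions, not part of this page's original code; which steps
# to keep depends on the problem statement.

```python
import re


def basic_clean(text):
    """Lowercase the text and drop URLs, e-mails, mentions/hashtags, and digits."""
    text = text.lower()                          # consistent case
    text = re.sub(r"https?://\S+", " ", text)    # URLs (http and https)
    text = re.sub(r"\S+@\S+", " ", text)         # e-mail addresses
    text = re.sub(r"\B[@#]\w+", " ", text)       # @mentions and #hashtags
    text = re.sub(r"\d+", " ", text)             # digits
    return " ".join(text.split())                # collapse repeated whitespace
```

# Punctuation is deliberately left alone here, in line with the note on
# grammar preservation.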
st.markdown("<div class='section'>", unsafe_allow_html=True)
st.markdown("<h2 class='title'>🔍 NLP Data Preprocessing</h2>", unsafe_allow_html=True)
st.markdown("<p class='subtitle'>Transforming raw text into structured data for better ML performance</p>", unsafe_allow_html=True)
st.success("📌 **Benefits of Preprocessing:**\n\n✅ Reduces dimensionality\n\n✅ Improves ML performance\n\n✅ Converts raw text into problem-specific structured data")
st.markdown("### ✨ **Essential Preprocessing Steps:**")
st.markdown(
"""
<div class='image-container'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66bde9bf3c885d04498227a0/HtdtNm-UJdfN057BeKSgV.png" width="400">
</div>
""",
unsafe_allow_html=True,
)
st.markdown("✅ **Converting Text Case** – Reduces dimensionality; whether to convert case depends on the problem statement.")
st.markdown("✅ **Removing URLs, Tags, and Mentions** – Retain them only if the problem statement requires it.")
st.markdown("✅ **Handling Emojis** – Preserve emojis or convert them to text, depending on context.")
st.markdown("✅ **Expanding Contractions & Acronyms** – Convert shortened forms into standard text.")
st.markdown("✅ **Stop Words Removal** – Optional; useful for simplifying the text.")
st.markdown("✅ **Stemming & Lemmatization** – Perform only if grammar is **not** crucial for the analysis.")
st.markdown("</div>", unsafe_allow_html=True)
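# A small sketch of two of the steps above: expanding contractions and removing
# stop words. The tables and the helper name `expand_and_filter` are hypothetical
# and deliberately tiny; real projects would use fuller resources. Keeping "not"
# is an assumption that preserves negation.

```python
# Tiny illustrative tables (assumptions, not a complete resource).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"the", "is", "a", "an", "this", "it", "do", "not"}


def expand_and_filter(text, keep=("not",)):
    """Expand known contractions, then drop stop words except those in `keep`."""
    words = []
    for token in text.lower().split():
        # Replace a contraction by its expansion; other tokens pass through.
        words.extend(CONTRACTIONS.get(token, token).split())
    stop = STOP_WORDS - set(keep)
    return " ".join(w for w in words if w not in stop)
```

# Expanding contractions first matters: "don't" becomes "do not", so the
# negation "not" is still present after the stop words are removed.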
st.markdown("<h1 class='header-title'>🔍 Stemming & Lemmatization 💬</h1>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<p>📝 In English, words are often made up of three components:</p>
<ul>
<li>🔹 <span class='highlight'>Prefix</span> + <span class='highlight'>Word</span> + <span class='highlight'>Suffix</span></li>
</ul>
<p>✅ Words without a suffix are called <span class='highlight'>Root Words</span>.</p>
<p>✅ If a suffix is added to a root word, the resulting word is an <span class='highlight'>Inflected Word</span>:</p>
<ul>
<li>🛠️ <span class='highlight'>Root Word</span> + <span class='highlight'>Suffix</span> = Inflected Word</li>
</ul>
<p>💬 The process of removing the suffix from inflected words to get the root word is known as:</p>
<ul>
<li>✂️ <span class='highlight'>Stemming</span></li>
<li>🧠 <span class='highlight'>Lemmatization</span></li>
</ul>
</ul>
</div>
""",
unsafe_allow_html=True
)
st.markdown("<h1 class='header-title'>🌿 Stemming 🔎</h1>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<p>📝 <span class='highlight'>Stemming</span> is the process of reducing an <strong>inflected word</strong> to its root form, known as the <span class='highlight'>stem</span>.</p>
<ul>
<li>🔹 <span class='highlight'>Inflected word ➝ Root word (Stem)</span></li>
<li>⚡ The <strong>stem may not always be a valid English word</strong>.</li>
<li>🚀 <span class='highlight'>Performance is faster</span> compared to lemmatization.</li>
<li>⚡ It performs <strong>suffix removal</strong> only, with no dictionary check.</li>
<li>🔹 Stemming is typically used when building a <strong>retrieval system</strong>.</li>
</ul>
</div>
""",
unsafe_allow_html=True
)
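# To make the points above concrete, here is a deliberately naive suffix
# stripper. It is a hypothetical helper for illustration only, not the Porter,
# Snowball, or Lancaster algorithm; the suffix list and the minimum stem length
# are assumptions.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied
```

# naive_stem("playing") gives "play", while naive_stem("studies") gives "studi",
# which shows that a stem is not always a valid English word.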
st.markdown("<h2 class='sub-header'>📌 Types of Stemming</h2>", unsafe_allow_html=True)
st.markdown("""
- There are **three** major types of stemming techniques:
- 🔹 **Porter Stemmer** 🏛️ (Rule-based, works in 5 stages)
- 🔹 **Snowball Stemmer** ❄️ (Rule-based, language-adaptable)
- 🔹 **Lancaster Stemmer** 🔍 (Iterative, aggressive removal)
""")
st.markdown("<h2 class='sub-header'>🏛️ Porter Stemmer</h2>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<ul>
<li>🔹 A rule-based algorithm for stemming.</li>
<li>🔹 Each word is matched against a set of suffix-removal rules.</li>
<li>🔹 The rules are applied in five sequential stages, stripping suffixes until the inflection is removed.</li>
<li>🔹 Works only for the English language.</li>
</ul>
</div>
""",
unsafe_allow_html=True
)
st.markdown("<h2 class='sub-header'>❄️ Snowball Stemmer</h2>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<ul>
<li>🔹 An advanced version of the Porter Stemmer.</li>
<li>🔹 Can be applied to multiple languages.</li>
</ul>
</div>
""",
unsafe_allow_html=True
)
st.markdown("<h2 class='sub-header'>🔍 Lancaster Stemmer</h2>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<ul>
<li>🔹 An iterative algorithm for stemming.</li>
<li>🔹 Removes suffixes in multiple iterations.</li>
<li>⚠️ More aggressive removal, which might result in non-English words.</li>
</ul>
</div>
""",
unsafe_allow_html=True
)
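# The effect of iterative removal can be sketched with a toy loop that keeps
# stripping suffixes until no rule applies. This is not the Lancaster rule set;
# `iterative_stem`, its suffix list, and `min_len` are assumptions chosen only
# to show why repeated stripping tends to be aggressive.

```python
def iterative_stem(word, suffixes=("ness", "ing", "ly", "ed", "s"), min_len=3):
    """Repeatedly strip a matching suffix until none applies."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                changed = True
                break  # restart from the first rule on the shortened word
    return word
```

# iterative_stem("willingly") gives "will", but iterative_stem("hopelessness")
# gives "hopele": each extra pass removes more, and the result may no longer be
# an English word.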
st.markdown("<h1 class='header-title'>📖 Lemmatization 🔎</h1>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<p>📝 <span class='highlight'>Lemmatization</span> is the process of reducing an inflected word to its root form, known as the <span class='highlight'>lemma</span>.</p>
<ul>
<li>🔹 <span class='highlight'>Inflected word ➝ Root word (Lemma)</span></li>
<li>✅ The lemma is always an actual English word.</li>
<li>🐢 <span class='highlight'>Performance is slower</span> than stemming.</li>
<li>🔍 Both suffix removal and dictionary-based checking are performed.</li>
<li>📝 Used when we need to preserve grammar in text.</li>
</ul>
</div>
""",
unsafe_allow_html=True
)
st.markdown("<h2 class='sub-header'>📚 WordNet Lemmatizer</h2>", unsafe_allow_html=True)
st.markdown(
"""
<div class='info-box'>
<ul>
<li>🔹 Takes an inflected word as input.</li>
<li>🗄️ Searches WordNet, a large lexical database of English words.</li>
<li>🔄 Applies suffix-removal rules and checks each candidate against the dictionary:</li>
<ul>
<li>✔️ If a candidate is an actual English word, it is returned as the lemma.</li>
<li>❌ If no candidate is found in the dictionary, the input word is returned unchanged.</li>
</ul>
</ul>
</div>
""",
unsafe_allow_html=True
)
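# A simplified sketch of dictionary-checked lemmatization. The vocabulary,
# exception list, and rules below are tiny stand-ins (assumptions for
# illustration); WordNet is far larger, and this is not NLTK's implementation.

```python
# Hypothetical miniature "dictionary" and irregular-form table.
TOY_VOCAB = {"run", "study", "good", "mouse", "be", "cat"}
TOY_EXCEPTIONS = {"mice": "mouse", "better": "good", "ran": "run", "was": "be"}
# (suffix, replacement) pairs tried in order.
TOY_RULES = [("ies", "y"), ("ing", ""), ("es", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a lemma only if it is found in the toy dictionary."""
    word = word.lower()
    if word in TOY_EXCEPTIONS:          # irregular forms first
        return TOY_EXCEPTIONS[word]
    if word in TOY_VOCAB:               # already a dictionary word
        return word
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in TOY_VOCAB:  # accept only real dictionary entries
                return candidate
    return word                         # nothing found: leave the word as is
```

# Unlike a stemmer, toy_lemmatize("studies") gives "study" rather than a
# truncated form, and an unknown word such as "blorps" is returned unchanged.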
st.code(r'''
import re
import string

import contractions  # third-party package: pip install contractions
import emoji  # third-party package: pip install emoji
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Assumes the NLTK stop-word, tokenizer, and WordNet resources are already downloaded.
def pre_process(data, col, case="lower", tags=True, url=True, mail=True, mentions=True,
                digits=True, dates=True, emojis=True, contraction=True, stopwordss=True,
                inflection="stem", stemmer="porter", punc=True):
    # Keep "not" so that negation survives stop-word removal.
    stp = set(stopwords.words("english")) - {"not"}
    stemmers = {"porter": PorterStemmer(), "lancaster": LancasterStemmer(),
                "snowball": SnowballStemmer(language="english")}
    wl = WordNetLemmatizer()

    def replace(pattern, flags=0):
        # Replace every match of the pattern in the column with a space.
        data[col] = data[col].apply(lambda x: re.sub(pattern, " ", x, flags=flags))

    ## emoji: convert each emoji to its text name, padded with spaces
    if emojis:
        data[col] = data[col].apply(lambda x: emoji.demojize(x, delimiters=(" ", " ")))
    ## case
    if case == "lower":
        data[col] = data[col].str.lower()
    elif case == "upper":
        data[col] = data[col].str.upper()
    ## tags
    if tags:
        replace(r"<.*?>")
    ## urls (http and https)
    if url:
        replace(r"https?://\S+", flags=re.IGNORECASE)
    ## mails
    if mail:
        replace(r"\S+@\S+")
    ## mentions and hashtags
    if mentions:
        replace(r"\B[@#]\S+")
    ## dates such as 25/12/2023 or 2023/12/25; must run before digit removal
    if dates:
        replace(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}/\d{1,2}/\d{1,2}\b")
    ## digits (controlled by the digits flag, not the mentions flag)
    if digits:
        replace(r"\d")
    ## contractions
    if contraction:
        data[col] = data[col].apply(contractions.fix)
    ## punctuations
    if punc:
        replace("[" + re.escape(string.punctuation) + "]")
    ## stop words
    if stopwordss:
        data[col] = data[col].apply(
            lambda x: " ".join(w for w in word_tokenize(x) if w.lower() not in stp))
    ## inflection: stem with the chosen stemmer, or lemmatize
    if inflection == "stem":
        stem = stemmers[stemmer].stem
        data[col] = data[col].apply(lambda x: " ".join(stem(w) for w in word_tokenize(x)))
    elif inflection == "lemma":  # "lemma" is an assumed name for this option
        data[col] = data[col].apply(
            lambda x: " ".join(wl.lemmatize(w) for w in word_tokenize(x)))
    return data
''')
st.markdown('''
- The function returns the DataFrame with the selected text column pre-processed.
- The cleaned text is then ready for feature engineering.
''')