import streamlit as st import pandas as pd st.markdown(""" """, unsafe_allow_html=True) st.header(":red[📊 Simple EDA 💬]") # Introduction to Simple EDA st.markdown("

", unsafe_allow_html=True) st.markdown("

🔍 Understanding Simple EDA

", unsafe_allow_html=True) st.markdown("

Evaluating raw text data quality before processing

", unsafe_allow_html=True) st.info("📌 **Simple EDA is a crucial step in the NLP lifecycle:**\n\n✅ Ensures raw data quality\n\n✅ Not dependent on problem statement\n\n✅ Helps in better data exploration") st.markdown("

", unsafe_allow_html=True) st.subheader(":violet[📃 Major Simple EDA Steps]") st.markdown("✅ **Check Text Case** – Identify if text is in **lowercase, uppercase, or mixed case**.") st.markdown("✅ **Detect HTML & URL Tags** – Analyze if text contains unwanted elements.") st.markdown("✅ **Identify URLs** – Ensure URLs are either preserved or removed based on problem statement.") st.markdown("✅ **Detect Mentions & Hashtags** – Find occurrences of `@mentions` or `#hashtags`.") st.markdown("✅ **Identify Numeric Data** – Detect if text includes **digits or numerical data**.") st.markdown("✅ **Analyze Punctuation Usage** – Check whether punctuation marks affect text clarity.") st.markdown("✅ **Detect Emojis** – Ensure **emoji-based sentiments** are not lost.") st.markdown("✅ **Analyze Date/Time Formats** – Identify the presence of date/time-related text.") st.success("🚀 Performing **Simple EDA** ensures structured and high-quality text data, leading to better NLP model performance!") st.code(''' import pandas as pd import numpy as np import re import emoji def simple_eda(data,column): lower_upper = data[column].apply(lambda x:True if (x.lower()) or (x.upper()) else False).sum() tags = data[column].apply(lambda x:True if re.search("<.*?>",x) else False).sum() urls = data[column].apply(lambda x:True if re.search("https://\S+",x) else False).sum() mails = data[column].apply(lambda x:True if re.search("\S+@\S+",x) else False).sum() mentions = data[column].apply(lambda x:True if re.search("\B[@#]\S+",x) else False).sum() emojis = data[column].apply(lambda x:True if emoji.emoji_count(x) else False).sum() digit = data[column].apply(lambda x:True if re.search("\d",x) else False).sum() punc = data[column].apply(lambda x:True if re.search('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]',x) else False).sum() dates = data[column].apply(lambda x:True if re.search(r"^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}$",x) else False).sum() if lower_upper >0: print("text have combination") if tags > 0: print("text have tags") if urls >0: print("text have urls") if mails > 0: print("text have mails") if mentions >0: print("text have mentions") if emojis > 0: print("text have emojis") if digit >0: print("text have digit") if punc > 0: print("text have punctuations") if dates >0: print("text have dates") ''') st.markdown(''' - By the following code we will check the exploration of the data - Basically it gives the quality of collected text data - After the simple eda we will perform pre-processing on text based on problem statement after knowing quality of the data ''')