import streamlit as st
import pandas as pd
st.header(":red[Simple EDA]")
# Introduction to Simple EDA
st.subheader("Understanding Simple EDA")
st.markdown("Evaluating raw text data quality before processing")
st.info("**Simple EDA is a crucial step in the NLP lifecycle:**\n\n✅ Checks the quality of the raw data\n\n✅ Independent of the problem statement\n\n✅ Supports better data exploration")
st.subheader(":violet[Major Simple EDA Steps]")
st.markdown("✅ **Check Text Case**: Identify whether the text is in **lowercase, uppercase, or mixed case**.")
st.markdown("✅ **Detect HTML Tags**: Check whether the text contains unwanted markup.")
st.markdown("✅ **Identify URLs**: Find URLs so they can be preserved or removed, depending on the problem statement.")
st.markdown("✅ **Detect Mentions & Hashtags**: Find occurrences of `@mentions` or `#hashtags`.")
st.markdown("✅ **Identify Numeric Data**: Detect whether the text includes **digits or other numeric data**.")
st.markdown("✅ **Analyze Punctuation Usage**: Check whether punctuation marks affect text clarity.")
st.markdown("✅ **Detect Emojis**: Find emojis so that **emoji-based sentiment** is not lost.")
st.markdown("✅ **Analyze Date/Time Formats**: Identify date- or time-related text.")
st.success("Performing **Simple EDA** shows what kind of cleaning the raw text needs, which leads to better-structured data and better NLP model performance!")
st.code(r'''
import re
import string

import emoji
import pandas as pd


def simple_eda(data, column):
    """Report which kinds of noise are present in a text column of a DataFrame."""
    text = data[column].astype(str)

    def count(pattern):
        # Number of rows in which the regex pattern occurs at least once.
        return text.apply(lambda x: bool(re.search(pattern, x))).sum()

    # Rows that mix lowercase and uppercase letters.
    mixed_case = text.apply(lambda x: x != x.lower() and x != x.upper()).sum()
    tags = count(r"<.*?>")                          # HTML tags
    urls = count(r"https?://\S+")                   # http and https URLs
    mails = count(r"\S+@\S+")                       # email addresses
    mentions = count(r"\B[@#]\w+")                  # @mentions and #hashtags
    emojis = text.apply(lambda x: emoji.emoji_count(x) > 0).sum()
    digits = count(r"\d")
    punc = count("[" + re.escape(string.punctuation) + "]")
    dates = count(r"\b\d{1,2}/\d{1,2}/\d{4}\b")     # e.g. 25/12/2023

    if mixed_case > 0:
        print("text has mixed case")
    if tags > 0:
        print("text has HTML tags")
    if urls > 0:
        print("text has URLs")
    if mails > 0:
        print("text has email addresses")
    if mentions > 0:
        print("text has mentions or hashtags")
    if emojis > 0:
        print("text has emojis")
    if digits > 0:
        print("text has digits")
    if punc > 0:
        print("text has punctuation")
    if dates > 0:
        print("text has dates")
''')
st.markdown('''
- The code above explores the raw text data.
- It reports the quality of the collected text by listing which kinds of noise are present.
- Once the quality of the data is known, pre-process the text according to the problem statement.
''')
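
# The checks listed above can also return counts instead of printing messages,
# which makes the result easier to display or test. The block below is a
# minimal sketch under stated assumptions, not the page's own implementation:
# `PATTERNS` and `text_quality_report` are hypothetical names, the regexes are
# illustrative, and the emoji check is left out so that the sketch needs only
# the standard library.

```python
import re
import string

# Hypothetical pattern table: the names and regexes are illustrative assumptions.
PATTERNS = {
    "html_tags": re.compile(r"<[^>]+>"),
    "urls": re.compile(r"https?://\S+"),
    "emails": re.compile(r"\S+@\S+\.\S+"),
    "mentions_hashtags": re.compile(r"\B[@#]\w+"),
    "digits": re.compile(r"\d"),
    "punctuation": re.compile("[" + re.escape(string.punctuation) + "]"),
    "dates": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def text_quality_report(texts):
    """Return {check_name: number of texts in which that check fires}."""
    report = {name: 0 for name in PATTERNS}
    report["mixed_case"] = 0
    for text in texts:
        # Mixed case: the text is neither entirely lowercase nor entirely uppercase.
        if text != text.lower() and text != text.upper():
            report["mixed_case"] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                report[name] += 1
    return report


sample = [
    "Check <b>this</b> out: https://example.com",
    "mail me at user@example.com on 25/12/2023",
    "loving it #nlp @friend",
    "plain lowercase text",
]
report = text_quality_report(sample)
print(report)
```

# Returning a dictionary keeps the checks separate from how they are shown, so
# the same report could be passed to a Streamlit widget or to a unit test.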