Spaces:

UmaKumpatla
/

NLP

Sleeping

App Files Files Community

NLP / pages /3.Simple EDA.py

UmaKumpatla

Update pages/3.Simple EDA.py

a662b64 verified about 1 year ago

raw

history blame contribute delete

5.23 kB

	import streamlit as st
	import pandas as pd
	import numpy as np
	#import re
	#import emoji

	st.markdown("""
	<style>
	/* Set a soft background color */
	body {
	background-color: #eef2f7;
	}
	/* Style for main title */
	h1 {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 700;
	text-align: center;
	margin-bottom: 25px;
	}
	/* Style for headers */
	h2 {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 600;
	margin-top: 30px;
	}
	/* Style for subheaders */
	h3 {
	color: red;
	font-family: 'Roboto', sans-serif;
	font-weight: 500;
	margin-top: 20px;
	}
	.custom-subheader {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 600;
	margin-bottom: 15px;
	}
	/* Paragraph styling */
	p {
	font-family: 'Georgia', serif;
	line-height: 1.8;
	color: black;
	margin-bottom: 20px;
	}
	/* List styling with checkmark bullets */
	.icon-bullet {
	list-style-type: none;
	padding-left: 20px;
	}
	.icon-bullet li {
	font-family: 'Georgia', serif;
	font-size: 1.1em;
	margin-bottom: 10px;
	color: black;
	}
	.icon-bullet li::before {
	content: "◆";
	padding-right: 10px;
	color: black;
	}
	/* Sidebar styling */
	.sidebar .sidebar-content {
	background-color: #ffffff;
	border-radius: 10px;
	padding: 15px;
	}
	.sidebar h2 {
	color: #495057;
	}
	/* Custom button style */
	.streamlit-button {
	background-color: #00FFFF;
	color: #000000;
	font-weight: bold;
	}
	</style>
	""", unsafe_allow_html=True)

	st.header(":red[📊 Simple EDA 💬]")

	# Introduction to Simple EDA
	st.markdown("<div class='section'>", unsafe_allow_html=True)
	st.markdown("<h2 class='title'>🔍 Understanding Simple EDA</h2>", unsafe_allow_html=True)
	st.markdown("<p class='subtitle'>Evaluating raw text data quality before processing</p>", unsafe_allow_html=True)

	st.info("📌 Simple EDA is a crucial step in the NLP lifecycle:\n\n❁ Ensures raw data quality\n\n❁ Not dependent on problem statement\n\n❁ Helps in better data exploration")

	st.markdown("</div>", unsafe_allow_html=True)

	st.subheader(":violet[📃 Major Simple EDA Steps]")

	st.markdown("❁ Check Text Case – Identify if text is in lowercase, uppercase, or mixed case.")
	st.markdown("❁ Detect HTML & URL Tags – Analyze if text contains unwanted elements.")
	st.markdown("❁ Identify URLs – Ensure URLs are either preserved or removed based on problem statement.")
	st.markdown("❁ Detect Mentions & Hashtags – Find occurrences of `@mentions` or `#hashtags`.")
	st.markdown("❁ Identify Numeric Data – Detect if text includes digits or numerical data.")
	st.markdown("❁ Analyze Punctuation Usage – Check whether punctuation marks affect text clarity.")
	st.markdown("❁ Detect Emojis – Ensure emoji-based sentiments are not lost.")
	st.markdown("❁ Analyze Date/Time Formats – Identify the presence of date/time-related text.")

	st.success("🚀 Performing Simple EDA ensures structured and high-quality text data, leading to better NLP model performance!")

	st.code('''
	import pandas as pd
	import numpy as np
	import re
	import emoji
	def simple_eda(data, column):
	lower_upper = data[column].apply(lambda x: True if (x.lower()) or (x.upper()) else False).sum()
	tags = data[column].apply(lambda x: True if re.search("<.*?>", x) else False).sum()
	urls = data[column].apply(lambda x: True if re.search("https://\\S+", x) else False).sum()
	mails = data[column].apply(lambda x: True if re.search("\\S+@\\S+", x) else False).sum()
	mentions = data[column].apply(lambda x: True if re.search("\\B[@#]\\S+", x) else False).sum()
	emojis = data[column].apply(lambda x: True if emoji.emoji_count(x) else False).sum()
	digit = data[column].apply(lambda x: True if re.search("\\d", x) else False).sum()
	punc = data[column].apply(lambda x: True if re.search('[!"#$%&\'()*+,-./:;<=>?@[\\\\]^_`{\|}~]', x) else False).sum()
	dates = data[column].apply(lambda x: True if re.search(r"^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x) else False).sum()
	if lower_upper > 0:
	print("Text has a combination of cases.")
	if tags > 0:
	print("Text contains HTML tags.")
	if urls > 0:
	print("Text contains URLs.")
	if mails > 0:
	print("Text contains email addresses.")
	if mentions > 0:
	print("Text contains mentions or hashtags.")
	if emojis > 0:
	print("Text contains emojis.")
	if digit > 0:
	print("Text contains digits.")
	if punc > 0:
	print("Text contains punctuation marks.")
	if dates > 0:
	print("Text contains dates.")
	''')

	st.markdown('''
	- By following this code, we will check the exploration of the data.
	- It essentially gives the quality of the collected text data.
	- After the simple EDA, we will perform pre-processing on text based on the problem statement after knowing the quality of the data.
	''')