Spaces:

Harika22
/

Natural_Language_Processing

Sleeping

App Files Files Community

Natural_Language_Processing / pages /4_Simple_EDA.py

Harika22

Update pages/4_Simple_EDA.py

896d8db verified 11 months ago

raw

history blame contribute delete

5.47 kB

	import streamlit as st
	import pandas as pd

	st.markdown("""
	<style>
	/* Set a soft background color */
	body {
	background-color: #eef2f7;
	}
	/* Style for main title */
	h1 {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 700;
	text-align: center;
	margin-bottom: 25px;
	}
	/* Style for headers */
	h2 {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 600;
	margin-top: 30px;
	}

	/* Style for subheaders */
	h3 {
	color: red;
	font-family: 'Roboto', sans-serif;
	font-weight: 500;
	margin-top: 20px;
	}
	.custom-subheader {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 600;
	margin-bottom: 15px;
	}
	/* Paragraph styling */
	p {
	font-family: 'Georgia', serif;
	line-height: 1.8;
	color: black;
	margin-bottom: 20px;
	}
	/* List styling with checkmark bullets */
	.icon-bullet {
	list-style-type: none;
	padding-left: 20px;
	}
	.icon-bullet li {
	font-family: 'Georgia', serif;
	font-size: 1.1em;
	margin-bottom: 10px;
	color: black;
	}
	.icon-bullet li::before {
	content: "◆";
	padding-right: 10px;
	color: black;
	}
	/* Sidebar styling */
	.sidebar .sidebar-content {
	background-color: #ffffff;
	border-radius: 10px;
	padding: 15px;
	}
	.sidebar h2 {
	color: #495057;
	}
	/* Custom button style */
	.streamlit-button {
	background-color: #00FFFF;
	color: #000000;
	font-weight: bold;
	}
	</style>
	""", unsafe_allow_html=True)

	st.header(":red[📊 Simple EDA 💬]")

	# Introduction to Simple EDA
	st.markdown("<div class='section'>", unsafe_allow_html=True)
	st.markdown("<h2 class='title'>🔍 Understanding Simple EDA</h2>", unsafe_allow_html=True)
	st.markdown("<p class='subtitle'>Evaluating raw text data quality before processing</p>", unsafe_allow_html=True)

	st.info("📌 Simple EDA is a crucial step in the NLP lifecycle:\n\n✅ Ensures raw data quality\n\n✅ Not dependent on problem statement\n\n✅ Helps in better data exploration")

	st.markdown("</div>", unsafe_allow_html=True)

	st.subheader(":violet[📃 Major Simple EDA Steps]")

	st.markdown("✅ Check Text Case – Identify if text is in lowercase, uppercase, or mixed case.")
	st.markdown("✅ Detect HTML & URL Tags – Analyze if text contains unwanted elements.")
	st.markdown("✅ Identify URLs – Ensure URLs are either preserved or removed based on problem statement.")
	st.markdown("✅ Detect Mentions & Hashtags – Find occurrences of `@mentions` or `#hashtags`.")
	st.markdown("✅ Identify Numeric Data – Detect if text includes digits or numerical data.")
	st.markdown("✅ Analyze Punctuation Usage – Check whether punctuation marks affect text clarity.")
	st.markdown("✅ Detect Emojis – Ensure emoji-based sentiments are not lost.")
	st.markdown("✅ Analyze Date/Time Formats – Identify the presence of date/time-related text.")

	st.success("🚀 Performing Simple EDA ensures structured and high-quality text data, leading to better NLP model performance!")


	st.code('''
	import pandas as pd
	import numpy as np
	import re
	import emoji

	def simple_eda(data,column):
	lower_upper = data[column].apply(lambda x:True if (x.lower()) or (x.upper()) else False).sum()
	tags = data[column].apply(lambda x:True if re.search("<.*?>",x) else False).sum()
	urls = data[column].apply(lambda x:True if re.search("https://\S+",x) else False).sum()
	mails = data[column].apply(lambda x:True if re.search("\S+@\S+",x) else False).sum()
	mentions = data[column].apply(lambda x:True if re.search("\B[@#]\S+",x) else False).sum()
	emojis = data[column].apply(lambda x:True if emoji.emoji_count(x) else False).sum()
	digit = data[column].apply(lambda x:True if re.search("\d",x) else False).sum()
	punc = data[column].apply(lambda x:True if re.search('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{\|}~]',x) else False).sum()
	dates = data[column].apply(lambda x:True if re.search(r"^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}$",x) else False).sum()


	if lower_upper >0:
	print("text have combination")
	if tags > 0:
	print("text have tags")
	if urls >0:
	print("text have urls")
	if mails > 0:
	print("text have mails")
	if mentions >0:
	print("text have mentions")
	if emojis > 0:
	print("text have emojis")
	if digit >0:
	print("text have digit")
	if punc > 0:
	print("text have punctuations")
	if dates >0:
	print("text have dates")

	''')

	st.markdown('''
	- By the following code we will check the exploration of the data
	- Basically it gives the quality of collected text data
	- After the simple eda we will perform pre-processing on text based on problem statement after knowing quality of the data
	''')