Natural_Language_Processing / pages /4.Simple EDA.py

satya11

Update pages/4.Simple EDA.py

6946762 verified 10 months ago


	import streamlit as st
	import pandas as pd

	st.markdown("""
	<style>
	/* Main background and font settings */
	body {
	background-color: black;
	font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
	}

	/* Main title styling */
	h1 {
	color: #2c3e50;
	font-family: 'Arial', sans-serif;
	font-weight: 700;
	text-align: center;
	margin-bottom: 25px;
	border-bottom: 2px solid #3498db;
	padding-bottom: 10px;
	}

	/* Header styling */
	h2 {
	color: #2c3e50;
	font-family: 'Arial', sans-serif;
	font-weight: 600;
	margin-top: 30px;
	border-left: 4px solid #3498db;
	padding-left: 10px;
	}

	/* Subheader styling */
	h3 {
	color: #2c3e50;
	font-family: 'Arial', sans-serif;
	font-weight: 500;
	margin-top: 20px;
	}

	/* Custom text styling */
	.custom-text {
	font-family: 'Georgia', serif;
	line-height: 1.8;
	color: #34495e;
	margin-bottom: 20px;
	}

	/* List styling */
	.custom-list {
	list-style-type: none;
	padding-left: 20px;
	}

	.custom-list li {
	font-family: 'Georgia', serif;
	font-size: 1.1em;
	margin-bottom: 10px;
	color: #34495e;
	position: relative;
	padding-left: 25px;
	}

	.custom-list li::before {
	content: "•";
	color: #3498db;
	font-weight: bold;
	position: absolute;
	left: 0;
	}

	/* Sidebar styling */
	.sidebar .sidebar-content {
	background-color: #ffffff;
	border-radius: 8px;
	padding: 15px;
	box-shadow: 0 2px 5px rgba(0,0,0,0.1);
	}

	/* Info box styling */
	.stInfo {
	background-color: #e8f4fc;
	border-left: 4px solid #3498db;
	padding: 15px;
	border-radius: 0 4px 4px 0;
	}

	/* Success box styling */
	.stSuccess {
	background-color: #e8f8f0;
	border-left: 4px solid #2ecc71;
	padding: 15px;
	border-radius: 0 4px 4px 0;
	}

	/* Code block styling */
	.stCodeBlock {
	background-color: white;
	border-radius: 4px;
	padding: 15px;
	border-left: 4px solid #95a5a6;
	}
	</style>
	""", unsafe_allow_html=True)

	st.title("Text Data Quality Analysis")

	# Introduction section
	st.markdown("""
	<div class='custom-text'>
	<h2>Understanding Text Data Quality Analysis</h2>
	<p>Evaluating raw text data quality before processing is a critical first step in any text analysis project.</p>
	</div>
	""", unsafe_allow_html=True)

	st.markdown("""
	<div class='stInfo'>
	<strong>Text data quality analysis matters because it:</strong><br><br>
	• Verifies raw data quality before processing<br>
	• Helps identify potential issues early in the pipeline<br>
	• Provides insights for better data exploration<br>
	• Is independent of the specific problem statement
	</div>
	""", unsafe_allow_html=True)

	# Main analysis steps
	st.markdown("""
	<div class='custom-text'>
	<h2>Key Text Data Quality Checks</h2>
	</div>
	""", unsafe_allow_html=True)

	st.markdown("""
	<ul class='custom-list'>
	<li><strong>Check Text Case</strong> – Identify if text is in lowercase, uppercase, or mixed case</li>
	<li><strong>Detect HTML Tags</strong> – Analyze if text contains unwanted HTML elements</li>
	<li><strong>Identify URLs</strong> – Check for web addresses that may need processing</li>
	<li><strong>Detect Mentions &amp; Hashtags</strong> – Find occurrences of @mentions or #hashtags</li>
	<li><strong>Identify Numeric Data</strong> – Detect if text includes digits or numerical data</li>
	<li><strong>Analyze Punctuation Usage</strong> – Check whether punctuation marks affect text clarity</li>
	<li><strong>Analyze Date/Time Formats</strong> – Identify the presence of date/time-related text</li>
	</ul>
	""", unsafe_allow_html=True)

	st.markdown("""
	<div class='stSuccess'>
	A thorough text data quality analysis shows which cleaning steps the data actually needs, which gives cleaner, more consistent text and typically better downstream analysis and model performance.
	</div>
	""", unsafe_allow_html=True)

	# Code example
	st.markdown("""
	<div class='custom-text'>
	<h2>Implementation Example</h2>
	<p>Here's a Python function to perform basic text data quality checks:</p>
	</div>
	""", unsafe_allow_html=True)

	st.code(r'''
	import re
	import string

	import pandas as pd

	def text_quality_analysis(data, column):
	    """Count the rows of data[column] that contain each kind of element."""
	    text = data[column]
	    results = {}

	    # Case: rows with at least one lowercase / uppercase ASCII letter
	    results['has_lowercase'] = text.str.contains(r'[a-z]', regex=True, na=False).sum()
	    results['has_uppercase'] = text.str.contains(r'[A-Z]', regex=True, na=False).sum()

	    # HTML tags: "<" followed by at least one non-">" character, then ">"
	    results['has_html_tags'] = text.str.contains(r'<[^>]+>', regex=True, na=False).sum()

	    # URLs (http or https only)
	    results['has_urls'] = text.str.contains(r'https?://\S+', regex=True, na=False).sum()

	    # Email addresses (loose check: non-space text on both sides of "@")
	    results['has_emails'] = text.str.contains(r'\S+@\S+', regex=True, na=False).sum()

	    # Mentions and hashtags; the lookbehind keeps "user@example.com" from counting as a mention
	    results['has_mentions'] = text.str.contains(r'(?<!\w)@\w+', regex=True, na=False).sum()
	    results['has_hashtags'] = text.str.contains(r'(?<!\w)#\w+', regex=True, na=False).sum()

	    # Digits
	    results['has_digits'] = text.str.contains(r'\d', regex=True, na=False).sum()

	    # Punctuation: any ASCII punctuation character
	    punctuation = f'[{re.escape(string.punctuation)}]'
	    results['has_punctuation'] = text.str.contains(punctuation, regex=True, na=False).sum()

	    # Dates (simple check for d/m/y-style numeric dates)
	    results['has_dates'] = text.str.contains(r'\d{1,2}/\d{1,2}/\d{2,4}', regex=True, na=False).sum()

	    return pd.DataFrame.from_dict(results, orient='index', columns=['Count'])
	''', language='python')

	st.markdown("""
	<div class='custom-text'>
	<p>This function counts how many rows contain each of several common elements that may need special handling during preprocessing. The checks are simple regular-expression heuristics, so treat the counts as a first pass rather than an exact measure.</p>
	<p>The results can help guide your data cleaning strategy based on the specific characteristics of your text data.</p>
	</div>
	""", unsafe_allow_html=True)
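
As a rough illustration of how the function shown on this page might be used, the sketch below restates the same checks as a pattern table and runs them on a small made-up DataFrame. The `PATTERNS` name, the `sample` data, and the slightly tightened HTML, mention, and hashtag patterns are assumptions made for this example; they are not part of the original page.

```python
import re
import string

import pandas as pd

# Hypothetical pattern table; the keys mirror the result names used on the page.
PATTERNS = {
    "has_lowercase": r"[a-z]",
    "has_uppercase": r"[A-Z]",
    "has_html_tags": r"<[^>]+>",
    "has_urls": r"https?://\S+",
    "has_emails": r"\S+@\S+",
    "has_mentions": r"(?<!\w)@\w+",   # lookbehind skips the "@" inside emails
    "has_hashtags": r"(?<!\w)#\w+",
    "has_digits": r"\d",
    "has_punctuation": f"[{re.escape(string.punctuation)}]",
    "has_dates": r"\d{1,2}/\d{1,2}/\d{2,4}",
}


def text_quality_analysis(data, column):
    """Count the rows of data[column] that match each pattern."""
    text = data[column]
    counts = {
        name: int(text.str.contains(pattern, regex=True, na=False).sum())
        for name, pattern in PATTERNS.items()
    }
    return pd.DataFrame.from_dict(counts, orient="index", columns=["Count"])


# Made-up sample rows, including a missing value.
sample = pd.DataFrame({"text": [
    "Check out https://example.com NOW!",
    "<p>hello world</p>",
    "Contact me at user@example.com",
    "@alice loved the #nlp talk on 12/05/2023",
    None,
]})

report = text_quality_analysis(sample, "text")
print(report)
```

Each count is a number of rows, not a number of occurrences, so a row with three URLs still adds one to `has_urls`. Passing `na=False` makes missing values count as "no match" instead of propagating as missing.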