satya11's picture
Update pages/4.Simple EDA.py
6946762 verified
import streamlit as st
import pandas as pd
st.markdown("""
<style>
/* Main background and font settings */
body {
background-color: black;
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
}
/* Main title styling */
h1 {
color: #2c3e50;
font-family: 'Arial', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
border-bottom: 2px solid #3498db;
padding-bottom: 10px;
}
/* Header styling */
h2 {
color: #2c3e50;
font-family: 'Arial', sans-serif;
font-weight: 600;
margin-top: 30px;
border-left: 4px solid #3498db;
padding-left: 10px;
}
/* Subheader styling */
h3 {
color: #2c3e50;
font-family: 'Arial', sans-serif;
font-weight: 500;
margin-top: 20px;
}
/* Custom text styling */
.custom-text {
font-family: 'Georgia', serif;
line-height: 1.8;
color: #34495e;
margin-bottom: 20px;
}
/* List styling */
.custom-list {
list-style-type: none;
padding-left: 20px;
}
.custom-list li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: #34495e;
position: relative;
padding-left: 25px;
}
.custom-list li::before {
content: "β€’";
color: #3498db;
font-weight: bold;
position: absolute;
left: 0;
}
/* Sidebar styling */
.sidebar .sidebar-content {
background-color: #ffffff;
border-radius: 8px;
padding: 15px;
box-shadow: 0 2px 5px rgba(0,0,0,0.1);
}
/* Info box styling */
.stInfo {
background-color: #e8f4fc;
border-left: 4px solid #3498db;
padding: 15px;
border-radius: 0 4px 4px 0;
}
/* Success box styling */
.stSuccess {
background-color: #e8f8f0;
border-left: 4px solid #2ecc71;
padding: 15px;
border-radius: 0 4px 4px 0;
}
/* Code block styling */
.stCodeBlock {
background-color: white;
border-radius: 4px;
padding: 15px;
border-left: 4px solid #95a5a6;
}
</style>
""", unsafe_allow_html=True)
st.title("Text Data Quality Analysis")
# Introduction section
st.markdown("""
<div class='custom-text'>
<h2>Understanding Text Data Quality Analysis</h2>
<p>Evaluating raw text data quality before processing is a critical first step in any text analysis project.</p>
</div>
""", unsafe_allow_html=True)
st.markdown("""
<div class='stInfo'>
<strong>Text Data Quality Analysis is crucial because:</strong><br><br>
β€’ Ensures raw data quality before processing<br>
β€’ Helps identify potential issues early in the pipeline<br>
β€’ Provides insights for better data exploration<br>
β€’ Is independent of the specific problem statement
</div>
""", unsafe_allow_html=True)
# Main analysis steps
st.markdown("""
<div class='custom-text'>
<h2>Key Text Data Quality Checks</h2>
</div>
""", unsafe_allow_html=True)
st.markdown("""
<ul class='custom-list'>
<li><strong>Check Text Case</strong> – Identify if text is in lowercase, uppercase, or mixed case</li>
<li><strong>Detect HTML Tags</strong> – Analyze if text contains unwanted HTML elements</li>
<li><strong>Identify URLs</strong> – Check for web addresses that may need processing</li>
<li><strong>Detect Mentions & Hashtags</strong> – Find occurrences of @mentions or #hashtags</li>
<li><strong>Identify Numeric Data</strong> – Detect if text includes digits or numerical data</li>
<li><strong>Analyze Punctuation Usage</strong> – Check whether punctuation marks affect text clarity</li>
<li><strong>Analyze Date/Time Formats</strong> – Identify the presence of date/time-related text</li>
</ul>
""", unsafe_allow_html=True)
st.markdown("""
<div class='stSuccess'>
Performing thorough text data quality analysis ensures structured and high-quality text data, leading to better analysis and model performance.
</div>
""", unsafe_allow_html=True)
# Code example
st.markdown("""
<div class='custom-text'>
<h2>Implementation Example</h2>
<p>Here's a Python function to perform basic text data quality checks:</p>
</div>
""", unsafe_allow_html=True)
st.code('''
import pandas as pd
import re
def text_quality_analysis(data, column):
# Initialize results dictionary
results = {}
# Check for case variations
results['has_lowercase'] = data[column].str.contains('[a-z]').sum()
results['has_uppercase'] = data[column].str.contains('[A-Z]').sum()
# Check for HTML tags
results['has_html_tags'] = data[column].str.contains('<.*?>', regex=True).sum()
# Check for URLs
results['has_urls'] = data[column].str.contains('https?://\\S+', regex=True).sum()
# Check for email addresses
results['has_emails'] = data[column].str.contains('\\S+@\\S+', regex=True).sum()
# Check for mentions and hashtags
results['has_mentions'] = data[column].str.contains('@\\w+', regex=True).sum()
results['has_hashtags'] = data[column].str.contains('#\\w+', regex=True).sum()
# Check for digits
results['has_digits'] = data[column].str.contains('\\d', regex=True).sum()
# Check for punctuation
results['has_punctuation'] = data[column].str.contains('[!"#$%&\\\'()*+,-./:;<=>?@[\\\\]^_`{|}~]', regex=True).sum()
# Check for date formats (simple check)
results['has_dates'] = data[column].str.contains('\\d{1,2}/\\d{1,2}/\\d{2,4}', regex=True).sum()
return pd.DataFrame.from_dict(results, orient='index', columns=['Count'])
''', language='python')
st.markdown("""
<div class='custom-text'>
<p>This function provides a comprehensive analysis of text data quality by checking for various common elements that might need special handling during preprocessing.</p>
<p>The results can help guide your data cleaning strategy based on the specific characteristics of your text data.</p>
</div>
""", unsafe_allow_html=True)