import streamlit as st
import pandas as pd
st.markdown("""
<style>
/* Main background and font settings */
body {
background-color: black;
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
}
/* Main title styling */
h1 {
color: #2c3e50;
font-family: 'Arial', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
border-bottom: 2px solid #3498db;
padding-bottom: 10px;
}
/* Header styling */
h2 {
color: #2c3e50;
font-family: 'Arial', sans-serif;
font-weight: 600;
margin-top: 30px;
border-left: 4px solid #3498db;
padding-left: 10px;
}
/* Subheader styling */
h3 {
color: #2c3e50;
font-family: 'Arial', sans-serif;
font-weight: 500;
margin-top: 20px;
}
/* Custom text styling */
.custom-text {
font-family: 'Georgia', serif;
line-height: 1.8;
color: #34495e;
margin-bottom: 20px;
}
/* List styling */
.custom-list {
list-style-type: none;
padding-left: 20px;
}
.custom-list li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: #34495e;
position: relative;
padding-left: 25px;
}
.custom-list li::before {
content: "•";
color: #3498db;
font-weight: bold;
position: absolute;
left: 0;
}
/* Sidebar styling */
.sidebar .sidebar-content {
background-color: #ffffff;
border-radius: 8px;
padding: 15px;
box-shadow: 0 2px 5px rgba(0,0,0,0.1);
}
/* Info box styling */
.stInfo {
background-color: #e8f4fc;
border-left: 4px solid #3498db;
padding: 15px;
border-radius: 0 4px 4px 0;
}
/* Success box styling */
.stSuccess {
background-color: #e8f8f0;
border-left: 4px solid #2ecc71;
padding: 15px;
border-radius: 0 4px 4px 0;
}
/* Code block styling */
.stCodeBlock {
background-color: white;
border-radius: 4px;
padding: 15px;
border-left: 4px solid #95a5a6;
}
</style>
""", unsafe_allow_html=True)
st.title("Text Data Quality Analysis")
# Introduction section
st.markdown("""
<div class='custom-text'>
<h2>Understanding Text Data Quality Analysis</h2>
<p>Evaluating raw text data quality before processing is a critical first step in any text analysis project.</p>
</div>
""", unsafe_allow_html=True)
st.markdown("""
<div class='stInfo'>
<strong>Text data quality analysis matters because it:</strong><br><br>
• Verifies raw data quality before processing<br>
• Helps identify potential issues early in the pipeline<br>
• Provides insights for better data exploration<br>
• Is independent of the specific problem statement
</div>
""", unsafe_allow_html=True)
# Main analysis steps
st.markdown("""
<div class='custom-text'>
<h2>Key Text Data Quality Checks</h2>
</div>
""", unsafe_allow_html=True)
st.markdown("""
<ul class='custom-list'>
<li><strong>Check Text Case</strong>: Identify if text is in lowercase, uppercase, or mixed case</li>
<li><strong>Detect HTML Tags</strong>: Analyze if text contains unwanted HTML elements</li>
<li><strong>Identify URLs</strong>: Check for web addresses that may need processing</li>
<li><strong>Detect Mentions &amp; Hashtags</strong>: Find occurrences of @mentions or #hashtags</li>
<li><strong>Identify Numeric Data</strong>: Detect if text includes digits or numerical data</li>
<li><strong>Analyze Punctuation Usage</strong>: Check whether punctuation marks affect text clarity</li>
<li><strong>Analyze Date/Time Formats</strong>: Identify the presence of date/time-related text</li>
</ul>
""", unsafe_allow_html=True)
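# Hedged sketch (not part of the original app): the "Check Text Case" item
# above distinguishes lowercase, uppercase, and mixed case. One way to label
# a single string is shown below. The name classify_case and its return
# labels are hypothetical; characters without case (digits, punctuation,
# caseless scripts) are ignored.
def classify_case(text):
    """Label a string as 'lower', 'upper', 'mixed', or 'no letters'."""
    cased = [ch for ch in text if ch.islower() or ch.isupper()]
    if not cased:
        return "no letters"
    if all(ch.islower() for ch in cased):
        return "lower"
    if all(ch.isupper() for ch in cased):
        return "upper"
    return "mixed"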
st.markdown("""
<div class='stSuccess'>
A thorough text data quality analysis helps you produce structured, high-quality text data, which leads to better analysis and model performance.
</div>
""", unsafe_allow_html=True)
# Code example
st.markdown("""
<div class='custom-text'>
<h2>Implementation Example</h2>
<p>Here's a Python function to perform basic text data quality checks:</p>
</div>
""", unsafe_allow_html=True)
st.code(r'''
import re
import string

import pandas as pd


def text_quality_analysis(data, column):
    """Count the rows of data[column] that contain each kind of element."""
    # Missing values become <NA>; na=False below counts them as "no match"
    text = data[column].astype("string")

    def count(pattern):
        # Number of rows with at least one match for the pattern
        return int(text.str.contains(pattern, regex=True, na=False).sum())

    results = {}
    # Check for case variations
    results['has_lowercase'] = count(r'[a-z]')
    results['has_uppercase'] = count(r'[A-Z]')
    # Check for HTML tags
    results['has_html_tags'] = count(r'<.*?>')
    # Check for URLs
    results['has_urls'] = count(r'https?://\S+')
    # Check for email addresses (loose pattern)
    results['has_emails'] = count(r'\S+@\S+')
    # Check for mentions and hashtags (the mention pattern also matches emails)
    results['has_mentions'] = count(r'@\w+')
    results['has_hashtags'] = count(r'#\w+')
    # Check for digits
    results['has_digits'] = count(r'\d')
    # Check for punctuation (any ASCII punctuation character)
    results['has_punctuation'] = count(f'[{re.escape(string.punctuation)}]')
    # Check for date formats (simple check for slash-separated numeric dates)
    results['has_dates'] = count(r'\d{1,2}/\d{1,2}/\d{2,4}')
    return pd.DataFrame.from_dict(results, orient='index', columns=['Count'])
''', language='python')
st.markdown("""
<div class='custom-text'>
<p>For each check, this function counts the rows that contain a common element (HTML tags, URLs, mentions, digits, and so on) that might need special handling during preprocessing.</p>
<p>The results can help guide your data cleaning strategy based on the specific characteristics of your text data.</p>
</div>
""", unsafe_allow_html=True)
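# Hedged sketch (not part of the original app): the function shown above is
# only displayed as text, so this illustrates how the same kind of row counts
# could be computed on a tiny made-up DataFrame. The names DEMO_PATTERNS,
# count_rows_matching, demo_frame, and demo_counts are hypothetical, and the
# sample rows are illustrative data, not real results.
import pandas as pd

DEMO_PATTERNS = {
    "has_html_tags": r"<.*?>",
    "has_urls": r"https?://\S+",
    "has_hashtags": r"#\w+",
    "has_digits": r"\d",
}


def count_rows_matching(frame, column, patterns):
    """Return, per pattern name, how many rows of frame[column] match it."""
    # Missing values become <NA>; na=False counts them as "no match"
    text = frame[column].astype("string")
    return {
        name: int(text.str.contains(pattern, regex=True, na=False).sum())
        for name, pattern in patterns.items()
    }


demo_frame = pd.DataFrame(
    {"text": ["<b>Sale</b> ends 12/05/2024", "Read https://example.com #nlp", None]}
)
demo_counts = count_rows_matching(demo_frame, "text", DEMO_PATTERNS)
# To show the result in the app, demo_counts could be passed to st.write.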