import streamlit as st
import pandas as pd

st.markdown("""
    <style>
    /* Main background and font settings */
    body {
        background-color: black;
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
    }
    
    /* Main title styling */
    h1 {
        color: #2c3e50;
        font-family: 'Arial', sans-serif;
        font-weight: 700;
        text-align: center;
        margin-bottom: 25px;
        border-bottom: 2px solid #3498db;
        padding-bottom: 10px;
    }
    
    /* Header styling */
    h2 {
        color: #2c3e50;
        font-family: 'Arial', sans-serif;
        font-weight: 600;
        margin-top: 30px;
        border-left: 4px solid #3498db;
        padding-left: 10px;
    }
    
    /* Subheader styling */
    h3 {
        color: #2c3e50;
        font-family: 'Arial', sans-serif;
        font-weight: 500;
        margin-top: 20px;
    }
    
    /* Custom text styling */
    .custom-text {
        font-family: 'Georgia', serif;
        line-height: 1.8;
        color: #34495e;
        margin-bottom: 20px;
    }
    
    /* List styling */
    .custom-list {
        list-style-type: none;
        padding-left: 20px;
    }
    
    .custom-list li {
        font-family: 'Georgia', serif;
        font-size: 1.1em;
        margin-bottom: 10px;
        color: #34495e;
        position: relative;
        padding-left: 25px;
    }
    
    .custom-list li::before {
        content: "•";
        color: #3498db;
        font-weight: bold;
        position: absolute;
        left: 0;
    }
    
    /* Sidebar styling */
    .sidebar .sidebar-content {
        background-color: #ffffff;
        border-radius: 8px;
        padding: 15px;
        box-shadow: 0 2px 5px rgba(0,0,0,0.1);
    }
    
    /* Info box styling */
    .stInfo {
        background-color: #e8f4fc;
        border-left: 4px solid #3498db;
        padding: 15px;
        border-radius: 0 4px 4px 0;
    }
    
    /* Success box styling */
    .stSuccess {
        background-color: #e8f8f0;
        border-left: 4px solid #2ecc71;
        padding: 15px;
        border-radius: 0 4px 4px 0;
    }
    
    /* Code block styling */
    .stCodeBlock {
        background-color: white;
        border-radius: 4px;
        padding: 15px;
        border-left: 4px solid #95a5a6;
    }
    </style>
    """, unsafe_allow_html=True)

st.title("Text Data Quality Analysis")

# Introduction section
st.markdown("""
    <div class='custom-text'>
    <h2>Understanding Text Data Quality Analysis</h2>
    <p>Evaluating raw text data quality before processing is a critical first step in any text analysis project.</p>
    </div>
    """, unsafe_allow_html=True)

st.markdown("""
    <div class='stInfo'>
    <strong>Why text data quality analysis matters:</strong><br><br>
    • It verifies raw data quality before processing<br>
    • It helps identify potential issues early in the pipeline<br>
    • It provides insights for better data exploration<br>
    • It applies regardless of the specific problem statement
    </div>
    """, unsafe_allow_html=True)

# Main analysis steps
st.markdown("""
    <div class='custom-text'>
    <h2>Key Text Data Quality Checks</h2>
    </div>
    """, unsafe_allow_html=True)

st.markdown("""
    <ul class='custom-list'>
    <li><strong>Check Text Case</strong> – Identify if text is in lowercase, uppercase, or mixed case</li>
    <li><strong>Detect HTML Tags</strong> – Analyze if text contains unwanted HTML elements</li>
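    <li><strong>Identify URLs</strong> – Check for web addresses that may need processing</li>
    <li><strong>Detect Email Addresses</strong> – Check for email addresses embedded in the text</li>
    <li><strong>Detect Mentions &amp; Hashtags</strong> – Find occurrences of @mentions or #hashtags</li>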
    <li><strong>Identify Numeric Data</strong> – Detect if text includes digits or numerical data</li>
    <li><strong>Analyze Punctuation Usage</strong> – Check whether punctuation marks affect text clarity</li>
    <li><strong>Analyze Date/Time Formats</strong> – Identify the presence of date/time-related text</li>
    </ul>
    """, unsafe_allow_html=True)

st.markdown("""
    <div class='stSuccess'>
    Thorough text data quality analysis helps produce cleaner, more consistent text data, which in turn supports better analysis and model performance.
    </div>
    """, unsafe_allow_html=True)

# Code example
st.markdown("""
    <div class='custom-text'>
    <h2>Implementation Example</h2>
    <p>Here's a Python function to perform basic text data quality checks:</p>
    </div>
    """, unsafe_allow_html=True)

st.code(r'''
import re
import string

import pandas as pd


def text_quality_analysis(data, column):
    """Count the rows of data[column] that contain each kind of text element."""
    # Treat missing values as empty strings so every check yields a plain boolean
    text = data[column].fillna('').astype(str)

    def count_rows(pattern):
        # Number of rows with at least one match for the regex pattern
        return int(text.str.contains(pattern, regex=True).sum())

    results = {}

    # Check for case variations
    results['has_lowercase'] = count_rows(r'[a-z]')
    results['has_uppercase'] = count_rows(r'[A-Z]')

    # Check for HTML tags (opening or closing)
    results['has_html_tags'] = count_rows(r'</?[a-zA-Z][^>]*>')

    # Check for URLs
    results['has_urls'] = count_rows(r'https?://\S+')

    # Check for email addresses (simple pattern, not full address validation)
    results['has_emails'] = count_rows(r'[\w.+-]+@[\w-]+\.[\w.-]+')

    # Check for mentions and hashtags; the lookbehind skips the "@" inside emails
    results['has_mentions'] = count_rows(r'(?<!\w)@\w+')
    results['has_hashtags'] = count_rows(r'(?<!\w)#\w+')

    # Check for digits
    results['has_digits'] = count_rows(r'\d')

    # Check for punctuation (any character in string.punctuation)
    results['has_punctuation'] = count_rows(f'[{re.escape(string.punctuation)}]')

    # Check for date formats (simple check for patterns such as 12/31/2024)
    results['has_dates'] = count_rows(r'\d{1,2}/\d{1,2}/\d{2,4}')

    return pd.DataFrame.from_dict(results, orient='index', columns=['Count'])
''', language='python')
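
# Usage sketch rendered in the app; everything inside the string is illustrative.
st.markdown(r"""
The checks above can also be written as a small table of patterns. What follows is a minimal sketch under stated assumptions, not a definitive implementation: the `CHECKS` table, the `count_matching_rows` helper, and the sample rows are hypothetical names and made-up data chosen for illustration.

```python
import pandas as pd

# Hypothetical pattern table; the names and regexes are illustrative assumptions.
CHECKS = {
    "has_html_tags": r"<[^>]+>",
    "has_urls": r"https?://\S+",
    "has_hashtags": r"#\w+",
    "has_digits": r"\d",
}


def count_matching_rows(data, column, checks=CHECKS):
    # For each check, count the rows that contain at least one match.
    # Missing values are treated as empty strings, so they never match.
    text = data[column].fillna("").astype(str)
    counts = {
        name: int(text.str.contains(pattern, regex=True).sum())
        for name, pattern in checks.items()
    }
    return pd.Series(counts, name="Count")


# Made-up sample data, including one missing value.
sample = pd.DataFrame({"text": [
    "Visit https://example.com today!",
    "<p>Hello</p> #greeting",
    "Order 66 shipped",
    None,
]})

report = count_matching_rows(sample, "text")
print(report)
```

Keeping the patterns in one dictionary makes it easy to add or drop a check without touching the counting logic. A high count for a given check suggests that the matching cleaning step is worth adding to the preprocessing pipeline.
""")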

st.markdown("""
    <div class='custom-text'>
    <p>This function counts how many rows contain each of several common elements (case variations, HTML tags, URLs, email addresses, mentions, hashtags, digits, punctuation, and dates) that might need special handling during preprocessing.</p>
    <p>The results can help guide your data cleaning strategy based on the specific characteristics of your text data.</p>
    </div>
    """, unsafe_allow_html=True)