| import streamlit as st | |
| import pandas as pd | |
| st.markdown(""" | |
| <style> | |
| /* Main background and font settings */ | |
| body { | |
| background-color: black; | |
| font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; | |
| } | |
| /* Main title styling */ | |
| h1 { | |
| color: #2c3e50; | |
| font-family: 'Arial', sans-serif; | |
| font-weight: 700; | |
| text-align: center; | |
| margin-bottom: 25px; | |
| border-bottom: 2px solid #3498db; | |
| padding-bottom: 10px; | |
| } | |
| /* Header styling */ | |
| h2 { | |
| color: #2c3e50; | |
| font-family: 'Arial', sans-serif; | |
| font-weight: 600; | |
| margin-top: 30px; | |
| border-left: 4px solid #3498db; | |
| padding-left: 10px; | |
| } | |
| /* Subheader styling */ | |
| h3 { | |
| color: #2c3e50; | |
| font-family: 'Arial', sans-serif; | |
| font-weight: 500; | |
| margin-top: 20px; | |
| } | |
| /* Custom text styling */ | |
| .custom-text { | |
| font-family: 'Georgia', serif; | |
| line-height: 1.8; | |
| color: #34495e; | |
| margin-bottom: 20px; | |
| } | |
| /* List styling */ | |
| .custom-list { | |
| list-style-type: none; | |
| padding-left: 20px; | |
| } | |
| .custom-list li { | |
| font-family: 'Georgia', serif; | |
| font-size: 1.1em; | |
| margin-bottom: 10px; | |
| color: #34495e; | |
| position: relative; | |
| padding-left: 25px; | |
| } | |
| .custom-list li::before { | |
| content: "•"; | |
| color: #3498db; | |
| font-weight: bold; | |
| position: absolute; | |
| left: 0; | |
| } | |
| /* Sidebar styling */ | |
| .sidebar .sidebar-content { | |
| background-color: #ffffff; | |
| border-radius: 8px; | |
| padding: 15px; | |
| box-shadow: 0 2px 5px rgba(0,0,0,0.1); | |
| } | |
| /* Info box styling */ | |
| .stInfo { | |
| background-color: #e8f4fc; | |
| border-left: 4px solid #3498db; | |
| padding: 15px; | |
| border-radius: 0 4px 4px 0; | |
| } | |
| /* Success box styling */ | |
| .stSuccess { | |
| background-color: #e8f8f0; | |
| border-left: 4px solid #2ecc71; | |
| padding: 15px; | |
| border-radius: 0 4px 4px 0; | |
| } | |
| /* Code block styling */ | |
| .stCodeBlock { | |
| background-color: white; | |
| border-radius: 4px; | |
| padding: 15px; | |
| border-left: 4px solid #95a5a6; | |
| } | |
| </style> | |
| """, unsafe_allow_html=True) | |
| st.title("Text Data Quality Analysis") | |
| # Introduction section | |
| st.markdown(""" | |
| <div class='custom-text'> | |
| <h2>Understanding Text Data Quality Analysis</h2> | |
| <p>Evaluating raw text data quality before processing is a critical first step in any text analysis project.</p> | |
| </div> | |
| """, unsafe_allow_html=True) | |
| st.markdown(""" | |
| <div class='stInfo'> | |
| <strong>Text data quality analysis is crucial because it:</strong><br><br> | |
| • Verifies raw data quality before processing<br> | |
| • Helps identify potential issues early in the pipeline<br> | |
| • Provides insights for better data exploration<br> | |
| • Is independent of the specific problem statement | |
| </div> | |
| """, unsafe_allow_html=True) | |
| # Main analysis steps | |
| st.markdown(""" | |
| <div class='custom-text'> | |
| <h2>Key Text Data Quality Checks</h2> | |
| </div> | |
| """, unsafe_allow_html=True) | |
| st.markdown(""" | |
| <ul class='custom-list'> | |
| <li><strong>Check Text Case</strong> – Identify if text is in lowercase, uppercase, or mixed case</li> | |
| <li><strong>Detect HTML Tags</strong> – Analyze if text contains unwanted HTML elements</li> | |
| <li><strong>Identify URLs</strong> – Check for web addresses that may need processing</li> | |
| <li><strong>Detect Mentions & Hashtags</strong> – Find occurrences of @mentions or #hashtags</li> | |
| <li><strong>Identify Numeric Data</strong> – Detect if text includes digits or numerical data</li> | |
| <li><strong>Analyze Punctuation Usage</strong> – Check whether punctuation marks affect text clarity</li> | |
| <li><strong>Analyze Date/Time Formats</strong> – Identify the presence of date/time-related text</li> | |
| </ul> | |
| """, unsafe_allow_html=True) | |
| st.markdown(""" | |
| <div class='stSuccess'> | |
| A thorough text data quality analysis shows what cleaning the raw text needs, which leads to more structured, higher-quality data and better analysis and model performance. | |
| </div> | |
| """, unsafe_allow_html=True) | |
| # Code example | |
| st.markdown(""" | |
| <div class='custom-text'> | |
| <h2>Implementation Example</h2> | |
| <p>Here's a Python function to perform basic text data quality checks:</p> | |
| </div> | |
| """, unsafe_allow_html=True) | |
| st.code(r''' | |
| import pandas as pd | |
| import re | |
| import string | |
| def text_quality_analysis(data, column): | |
|     """Count the rows of data[column] that contain each kind of element.""" | |
|     text = data[column] | |
|     def count(pattern): | |
|         # na=False treats missing values as non-matches | |
|         return int(text.str.contains(pattern, regex=True, na=False).sum()) | |
|     # Initialize results dictionary | |
|     results = {} | |
|     # Check for case variations | |
|     results['has_lowercase'] = count(r'[a-z]') | |
|     results['has_uppercase'] = count(r'[A-Z]') | |
|     # Check for HTML tags (an opening or closing tag that starts with a letter) | |
|     results['has_html_tags'] = count(r'</?[A-Za-z][^>]*>') | |
|     # Check for URLs | |
|     results['has_urls'] = count(r'https?://\S+') | |
|     # Check for email addresses | |
|     results['has_emails'] = count(r'[\w.+-]+@[\w-]+\.[\w.-]+') | |
|     # Check for mentions and hashtags; the lookbehind skips the "@" inside emails | |
|     results['has_mentions'] = count(r'(?<!\w)@\w+') | |
|     results['has_hashtags'] = count(r'(?<!\w)#\w+') | |
|     # Check for digits | |
|     results['has_digits'] = count(r'\d') | |
|     # Check for punctuation (any ASCII punctuation character) | |
|     results['has_punctuation'] = count(f'[{re.escape(string.punctuation)}]') | |
|     # Check for date formats (simple check for patterns like 12/05/2023) | |
|     results['has_dates'] = count(r'\d{1,2}/\d{1,2}/\d{2,4}') | |
|     return pd.DataFrame.from_dict(results, orient='index', columns=['Count']) | |
| ''', language='python') | |
| st.markdown(""" | |
| <div class='custom-text'> | |
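| <p>This function counts how many rows contain each common element (case variations, HTML tags, URLs, email addresses, mentions, hashtags, digits, punctuation, and dates) that might need special handling during preprocessing. The patterns are simple heuristics, so treat the counts as a first pass rather than an exact audit.</p> | |
| <p>The results can help guide your data cleaning strategy based on the specific characteristics of your text data.</p> | |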
| </div> | |
| """, unsafe_allow_html=True) |
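The checks described in the app can be tried outside Streamlit on a small DataFrame. The following is a minimal sketch, not the app's own code: the `sample` data, the `review` column name, the `patterns` dictionary, and the `count_matches` helper are all hypothetical names introduced for illustration, and the regular expressions are simple heuristics.

```python
import re
import string

import pandas as pd

# Hypothetical sample data, made up for illustration only.
sample = pd.DataFrame({
    "review": [
        "Loved it! See https://example.com for details.",
        "<p>GREAT PRODUCT</p>",
        "thanks @support, order placed on 12/05/2023 #happy",
        None,  # missing values are common in raw text columns
    ]
})

# Heuristic patterns; the key names follow the style used in the app's example.
patterns = {
    "has_html_tags": r"</?[A-Za-z][^>]*>",
    "has_urls": r"https?://\S+",
    "has_mentions": r"(?<!\w)@\w+",
    "has_hashtags": r"(?<!\w)#\w+",
    "has_digits": r"\d",
    "has_punctuation": f"[{re.escape(string.punctuation)}]",
    "has_dates": r"\d{1,2}/\d{1,2}/\d{2,4}",
}


def count_matches(data, column, patterns):
    """Count the rows of data[column] that match each regex in patterns."""
    counts = {
        # na=False makes missing values count as non-matches.
        name: int(data[column].str.contains(pattern, regex=True, na=False).sum())
        for name, pattern in patterns.items()
    }
    return pd.DataFrame.from_dict(counts, orient="index", columns=["Count"])


report = count_matches(sample, "review", patterns)
print(report)
```

Each count is the number of rows with at least one match, not the number of matches, so a row containing several URLs still contributes one to `has_urls`. A high count for a given check suggests which cleaning step (stripping tags, normalizing URLs, and so on) is worth adding to the preprocessing pipeline.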