import streamlit as st
import pandas as pd

st.markdown("""
    <style>
    /* Main background and font settings */
    body {
        background-color: black;
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
    }
    
    /* Main title styling */
    h1 {
        color: #2c3e50;
        font-family: 'Arial', sans-serif;
        font-weight: 700;
        text-align: center;
        margin-bottom: 25px;
        border-bottom: 2px solid #3498db;
        padding-bottom: 10px;
    }
    
    /* Header styling */
    h2 {
        color: #2c3e50;
        font-family: 'Arial', sans-serif;
        font-weight: 600;
        margin-top: 30px;
        border-left: 4px solid #3498db;
        padding-left: 10px;
    }
    
    /* Subheader styling */
    h3 {
        color: #2c3e50;
        font-family: 'Arial', sans-serif;
        font-weight: 500;
        margin-top: 20px;
    }
    
    /* Custom text styling */
    .custom-text {
        font-family: 'Georgia', serif;
        line-height: 1.8;
        color: #34495e;
        margin-bottom: 20px;
    }
    
    /* List styling */
    .custom-list {
        list-style-type: none;
        padding-left: 20px;
    }
    
    .custom-list li {
        font-family: 'Georgia', serif;
        font-size: 1.1em;
        margin-bottom: 10px;
        color: #34495e;
        position: relative;
        padding-left: 25px;
    }
    
    .custom-list li::before {
        content: "•";
        color: #3498db;
        font-weight: bold;
        position: absolute;
        left: 0;
    }
    
    /* Sidebar styling */
    .sidebar .sidebar-content {
        background-color: #ffffff;
        border-radius: 8px;
        padding: 15px;
        box-shadow: 0 2px 5px rgba(0,0,0,0.1);
    }
    
    /* Info box styling */
    .stInfo {
        background-color: #e8f4fc;
        border-left: 4px solid #3498db;
        padding: 15px;
        border-radius: 0 4px 4px 0;
    }
    
    /* Success box styling */
    .stSuccess {
        background-color: #e8f8f0;
        border-left: 4px solid #2ecc71;
        padding: 15px;
        border-radius: 0 4px 4px 0;
    }
    
    /* Code block styling */
    .stCodeBlock {
        background-color: white;
        border-radius: 4px;
        padding: 15px;
        border-left: 4px solid #95a5a6;
    }
    </style>
    """, unsafe_allow_html=True)

st.title("Text Data Quality Analysis")

# Introduction section
st.markdown("""
    <div class='custom-text'>
    <h2>Understanding Text Data Quality Analysis</h2>
    <p>Evaluating raw text data quality before processing is a critical first step in any text analysis project.</p>
    </div>
    """, unsafe_allow_html=True)

st.markdown("""
    <div class='stInfo'>
    <strong>Text data quality analysis matters because it:</strong><br><br>
    • Ensures raw data quality before processing<br>
    • Helps identify potential issues early in the pipeline<br>
    • Provides insights for better data exploration<br>
    • Is independent of the specific problem statement
    </div>
    """, unsafe_allow_html=True)

# Main analysis steps
st.markdown("""
    <div class='custom-text'>
    <h2>Key Text Data Quality Checks</h2>
    </div>
    """, unsafe_allow_html=True)

st.markdown("""
    <ul class='custom-list'>
    <li><strong>Check Text Case</strong> – Identify if text is in lowercase, uppercase, or mixed case</li>
    <li><strong>Detect HTML Tags</strong> – Analyze if text contains unwanted HTML elements</li>
    <li><strong>Identify URLs</strong> – Check for web addresses that may need processing</li>
    <li><strong>Identify Email Addresses</strong> – Check for email addresses that may need masking or removal</li>
    <li><strong>Detect Mentions &amp; Hashtags</strong> – Find occurrences of @mentions or #hashtags</li>
    <li><strong>Identify Numeric Data</strong> – Detect if text includes digits or numerical data</li>
    <li><strong>Analyze Punctuation Usage</strong> – Check how much punctuation the text contains and whether it carries meaning or is noise</li>
    <li><strong>Analyze Date/Time Formats</strong> – Identify the presence of date/time-related text</li>
    </ul>
    """, unsafe_allow_html=True)
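
st.markdown("""
As a minimal sketch of the first check in the list, the snippet below labels each string as lowercase, uppercase, mixed case, or having no cased letters. The helper name `classify_case` and the sample strings are hypothetical, chosen only for illustration.

```python
from collections import Counter

def classify_case(text):
    # Hypothetical helper: label one string by the case of its letters
    letters = [ch for ch in str(text) if ch.islower() or ch.isupper()]
    if not letters:
        return 'no_cased_letters'
    if all(ch.islower() for ch in letters):
        return 'lowercase'
    if all(ch.isupper() for ch in letters):
        return 'uppercase'
    return 'mixed'

samples = ['hello world', 'BREAKING NEWS', 'Mixed Case Text', '12345']
case_counts = Counter(classify_case(s) for s in samples)
print(case_counts)
```

Counting labels per row, rather than only checking whether any lowercase or uppercase letter is present, shows how much of the data would actually change if the text were lowercased.
""")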

st.markdown("""
    <div class='stSuccess'>
    Running these checks before preprocessing shows which cleaning steps the data actually needs, which supports more reliable analysis and model performance.
    </div>
    """, unsafe_allow_html=True)

# Code example
st.markdown("""
    <div class='custom-text'>
    <h2>Implementation Example</h2>
    <p>Here's a Python function to perform basic text data quality checks:</p>
    </div>
    """, unsafe_allow_html=True)

st.code(r'''
import re
import string

import pandas as pd

def text_quality_analysis(data, column):
    """Count the rows of data[column] that match each quality check."""
    # Treat missing values as empty strings so every check yields a clean boolean
    text = data[column].fillna('').astype(str)

    def count(pattern):
        # Number of rows containing at least one match for the regex
        return int(text.str.contains(pattern, regex=True).sum())

    results = {
        # Case variations
        'has_lowercase': count(r'[a-z]'),
        'has_uppercase': count(r'[A-Z]'),
        # HTML tags such as <br> or </div>
        'has_html_tags': count(r'</?[A-Za-z][^>]*>'),
        # URLs
        'has_urls': count(r'https?://\S+'),
        # Email addresses (loose check)
        'has_emails': count(r'\S+@\S+'),
        # Mentions and hashtags; the lookbehind keeps emails from counting as mentions
        'has_mentions': count(r'(?<!\w)@\w+'),
        'has_hashtags': count(r'#\w+'),
        # Digits
        'has_digits': count(r'\d'),
        # Punctuation (any ASCII punctuation character)
        'has_punctuation': count('[' + re.escape(string.punctuation) + ']'),
        # Dates in a numeric slash format such as 12/31/2024 (simple check)
        'has_dates': count(r'\d{1,2}/\d{1,2}/\d{2,4}'),
    }

    return pd.DataFrame.from_dict(results, orient='index', columns=['Count'])
''', language='python')

st.markdown("""
    <div class='custom-text'>
    <p>This function counts how many rows contain each of several common elements that may need special handling during preprocessing.</p>
    <p>The results can help guide your data cleaning strategy based on the specific characteristics of your text data.</p>
    </div>
    """, unsafe_allow_html=True)
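
st.markdown("""
To illustrate how the counts can guide a cleaning strategy, here is a usage sketch on a small made-up DataFrame. To stay self-contained it uses a condensed, hypothetical stand-in named `quick_checks` with only three of the checks, and the `review` column and its rows are assumptions for illustration rather than real data.

```python
import pandas as pd

def quick_checks(data, column):
    # Condensed, hypothetical stand-in for text_quality_analysis
    text = data[column].fillna('').astype(str)
    patterns = {
        'has_urls': 'https?://',
        'has_html_tags': '</?[A-Za-z][^>]*>',
        'has_digits': '[0-9]',
    }
    counts = {name: int(text.str.contains(pat, regex=True).sum())
              for name, pat in patterns.items()}
    return pd.DataFrame.from_dict(counts, orient='index', columns=['Count'])

# Made-up sample rows, including one missing value
reviews = pd.DataFrame({'review': [
    'Great phone, see https://example.com for specs',
    '<p>Battery lasts 2 days</p>',
    None,
    'would buy again',
]})
report = quick_checks(reviews, 'review')

# Share of rows affected by each issue, to help decide which cleaning steps matter
report['Share'] = report['Count'] / len(reviews)
print(report)
```

Expressing each count as a share of all rows makes it easier to judge whether an issue is widespread enough to justify its own cleaning step.
""")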