import streamlit as st


def main():
    st.title("Step 3: Simple Exploratory Data Analysis (EDA) :bar_chart:")

    # Introduction Section
    st.markdown(
        """
        Exploratory Data Analysis (EDA) is a crucial step in understanding your text data
        before diving into modeling. It involves inspecting, visualizing, and summarizing
        the dataset to uncover patterns, trends, and insights.

        Think of it as getting to know your data: **What does it contain? What issues
        might need fixing? How is the text distributed?**
        """
    )
    st.divider()
    # Section 1: Inspect the Data
    st.subheader("1. Inspect the Data :card_file_box:")
    st.write(
        """
        Start by taking a close look at the data to understand its structure and content.
        This step ensures you know what you are working with and can identify any anomalies.
        """
    )
    st.markdown(
        """
        **Key Actions:**

        - Display the first few rows of the dataset.
        - Check for missing values, duplicates, and null entries.
        - Examine the shape (number of rows and columns).

        **Example Code (using pandas):**

        ```python
        import pandas as pd

        df = pd.read_csv("data.csv")
        print(df.shape)               # (rows, columns)
        print(df.head())              # View first rows
        df.info()                     # Check structure and null values (prints directly)
        print(df.duplicated().sum())  # Count duplicates
        ```

        **Outcome:**
        You gain a high-level understanding of the data's structure and identify initial
        issues like missing values or duplicates.
        """
    )
    st.divider()
    # Section 2: Analyze Text Distribution
    st.subheader("2. Analyze Text Distribution :arrows_counterclockwise:")
    st.write(
        """
        Text data often varies in length and quality. Analyzing the distribution of text
        length helps you understand how your data is spread and whether it needs preprocessing.
        """
    )
    st.markdown(
        """
        **Key Actions:**

        - Calculate the number of words or characters per text entry.
        - Plot a histogram to visualize the distribution.

        **Example Code (using matplotlib):**

        ```python
        import matplotlib.pyplot as plt

        df['text_length'] = df['text_column'].str.len()             # Characters per entry (NaN-safe)
        df['word_count'] = df['text_column'].str.split().str.len()  # Words per entry

        plt.hist(df['text_length'], bins=50, color='skyblue')
        plt.title("Distribution of Text Length")
        plt.xlabel("Text Length")
        plt.ylabel("Frequency")
        plt.show()
        ```

        **Outcome:**
        You get insights into the range of text lengths, which can guide tokenization and
        preprocessing later.
        """
    )
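    # Hedged addition: the outcome above talks about the range of text lengths; a
    # `describe()` call gives that range numerically. Toy data, since `df` is a placeholder.
    st.markdown(
        """
        **Optional: summary statistics.** Alongside the histogram, `describe()` gives a
        quick numeric view of the length distribution (a sketch with toy data):

        ```python
        import pandas as pd

        lengths = pd.Series([12, 48, 48, 300, 7])  # toy character counts
        stats = lengths.describe()                 # count, mean, std, min, quartiles, max
        print(int(stats['min']), int(stats['max']))  # 7 300
        ```
        """
    )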
    st.divider()
    # Section 3: Check for Class Imbalance
    st.subheader("3. Check for Class Imbalance :small_red_triangle:")
    st.write(
        """
        In tasks like classification, it's important to ensure your target labels are
        balanced. If one class dominates, the model might become biased.
        """
    )
    st.markdown(
        """
        **Key Actions:**

        - Count the occurrences of each class in the target column.
        - Visualize the class distribution using bar plots.

        **Example Code:**

        ```python
        import matplotlib.pyplot as plt

        class_counts = df['target_column'].value_counts()
        class_counts.plot(kind='bar', color='skyblue')
        plt.title("Class Distribution")
        plt.xlabel("Classes")
        plt.ylabel("Count")
        plt.show()
        ```

        **Outcome:**
        You identify whether class imbalance exists, helping you plan techniques like
        resampling or weighted loss functions.
        """
    )
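    # Hedged addition: the outcome above mentions weighted loss functions without showing
    # one. This sketch uses simple inverse-frequency weights on toy labels.
    st.markdown(
        """
        **Optional: inverse-frequency class weights.** If you do find an imbalance, one
        simple remedy is weighting each class by its inverse frequency (a sketch with toy
        data; frameworks like scikit-learn also offer built-in class weighting):

        ```python
        import pandas as pd

        labels = pd.Series(['pos'] * 8 + ['neg'] * 2)   # toy, imbalanced labels
        counts = labels.value_counts()
        weights = len(labels) / (len(counts) * counts)  # rarer class gets a larger weight
        print(weights['neg'], weights['pos'])           # 2.5 0.625
        ```
        """
    )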
    st.divider()
    # Section 4: Visualize Word Frequencies
    st.subheader("4. Visualize Word Frequencies :mag:")
    st.write(
        """
        Understanding the most common words in your dataset provides insights into its
        content and themes. Visualization tools like word clouds make this process easier.
        """
    )
    st.markdown(
        """
        **Key Actions:**

        - Tokenize the text to split it into words.
        - Calculate word frequencies.
        - Generate visualizations like word clouds or bar charts.

        **Example Code (using WordCloud):**

        ```python
        import matplotlib.pyplot as plt
        from wordcloud import WordCloud

        text_data = ' '.join(df['text_column'].dropna().astype(str))  # Combine all text, skipping NaN
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)

        plt.figure(figsize=(10, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()
        ```

        **Outcome:**
        You identify common words, helping you decide whether to remove stopwords or focus
        on key terms.
        """
    )
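    # Hedged addition: the outcome above mentions removing stopwords; this sketch shows a
    # minimal filter-and-count using a tiny hand-picked stopword set.
    st.markdown(
        """
        **Optional: counting words after stopword removal.** A minimal sketch with a tiny
        hand-picked stopword set (real projects typically use a library list, e.g. NLTK's):

        ```python
        from collections import Counter

        stopwords = {'the', 'a', 'is', 'and', 'of', 'to'}  # tiny illustrative list
        text = "the cat and the dog and the cat"
        tokens = [w for w in text.lower().split() if w not in stopwords]
        print(Counter(tokens).most_common(2))  # [('cat', 2), ('dog', 1)]
        ```
        """
    )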
    st.divider()
    # Summary Section
    st.subheader("Summary :star2:")
    st.markdown(
        """
        Performing EDA helps you **understand your data** before modeling. Key steps include:

        1. **Inspect the Data**: Check structure, null values, and duplicates.
        2. **Analyze Text Distribution**: Visualize text length to detect patterns.
        3. **Check for Class Imbalance**: Ensure target labels are balanced.
        4. **Visualize Word Frequencies**: Explore word-level patterns with word clouds or bar charts.

        **Friendly Tip :bulb::**
        EDA is an iterative process. Explore your data thoroughly, as the insights you gain
        will guide your preprocessing and modeling steps.
        """
    )
    st.divider()


if __name__ == "__main__":
    main()