# NLPHub: stages/simple_eda.py
import streamlit as st
def main():
st.title("Step 3: Simple Exploratory Data Analysis (EDA) :bar_chart:")
# Introduction Section
st.markdown(
"""
Exploratory Data Analysis (EDA) is a crucial step in understanding your text data before diving into modeling. It involves inspecting, visualizing, and summarizing the dataset to uncover patterns, trends, and insights.
Think of it as getting to know your data: **What does it contain? What issues might need fixing? How is the text distributed?**
"""
)
st.divider()
# Section 1: Inspect the Data
st.subheader("1. Inspect the Data :card_file_box:")
st.write(
"""
Start by taking a close look at the data to understand its structure and content. This step ensures you know what you are working with and can identify any anomalies.
"""
)
st.markdown(
"""
**Key Actions:**
- Display the first few rows of the dataset.
- Check for missing values, duplicates, and null entries.
- Examine the shape (number of rows and columns).
**Example Code (using pandas):**
```python
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head()) # View first rows
print(df.info()) # Check structure and null values
print(df.duplicated().sum()) # Count duplicates
```
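A per-column breakdown of missing values complements `df.info()`. The snippet below is a self-contained sketch, using a tiny inline frame in place of the `data.csv` loaded above:

```python
import pandas as pd

# Tiny illustrative frame standing in for the loaded dataset
df = pd.DataFrame({"text": ["hello", None, "world", "world"]})

missing_per_column = df.isnull().sum()  # missing values in each column
duplicate_rows = df.duplicated().sum()  # number of fully duplicated rows
print(missing_per_column)
print(duplicate_rows)
```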
**Outcome:**
You gain a high-level understanding of the data's structure and identify initial issues like missing values or duplicates.
"""
)
st.divider()
# Section 2: Analyze Text Distribution
st.subheader("2. Analyze Text Distribution :arrows_counterclockwise:")
st.write(
"""
Text data often varies in length and quality. Analyzing the distribution of text length helps you understand how your data is spread and whether it needs preprocessing.
"""
)
st.markdown(
"""
**Key Actions:**
- Calculate the number of words or characters per text entry.
- Plot a histogram to visualize the distribution.
**Example Code (using matplotlib):**
```python
import matplotlib.pyplot as plt
df['text_length'] = df['text_column'].astype(str).apply(len)  # Character count per entry (cast guards against non-string values)
plt.hist(df['text_length'], bins=50, color='skyblue')
plt.title("Distribution of Text Length")
plt.xlabel("Text Length")
plt.ylabel("Frequency")
plt.show()
```
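Alongside the histogram, `describe()` summarizes the length distribution numerically, and a word count can complement the character count. A self-contained sketch, with a small inline example standing in for `df`:

```python
import pandas as pd

# Small inline example standing in for the real text_column
df = pd.DataFrame({"text_column": ["a short text", "a much longer example text here"]})
df['char_length'] = df['text_column'].str.len()             # characters per entry
df['word_count'] = df['text_column'].str.split().str.len()  # words per entry
print(df['char_length'].describe())  # count, mean, min/max, quartiles
```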
**Outcome:**
You get insights into the range of text lengths, which can guide tokenization and preprocessing later.
"""
)
st.divider()
# Section 3: Check for Class Imbalance
st.subheader("3. Check for Class Imbalance :small_red_triangle:")
st.write(
"""
In tasks like classification, it's important to check whether your target labels are balanced. If one class heavily dominates, the model may become biased toward it.
"""
)
st.markdown(
"""
**Key Actions:**
- Count the occurrences of each class in the target column.
- Visualize the class distribution using bar plots.
**Example Code:**
```python
class_counts = df['target_column'].value_counts()
class_counts.plot(kind='bar', color=['skyblue', 'orange', 'green'])
plt.title("Class Distribution")
plt.xlabel("Classes")
plt.ylabel("Count")
plt.show()
```
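A simple numeric check, the imbalance ratio (majority class count divided by minority class count), complements the bar plot. A sketch with a hypothetical label column in place of `target_column`:

```python
import pandas as pd

labels = pd.Series(["pos", "pos", "pos", "neg"])  # hypothetical target_column
class_counts = labels.value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")  # large ratios suggest resampling or class weights
```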
**Outcome:**
You identify whether class imbalance exists, helping you plan techniques like resampling or weighted loss functions.
"""
)
st.divider()
# Section 4: Visualize Word Frequencies
st.subheader("4. Visualize Word Frequencies :mag:")
st.write(
"""
Understanding the most common words in your dataset provides insights into its content and themes. Visualization tools like word clouds make this process easier.
"""
)
st.markdown(
"""
**Key Actions:**
- Tokenize the text to split it into words.
- Calculate word frequencies.
- Generate visualizations like word clouds or bar charts.
**Example Code (using WordCloud):**
```python
from wordcloud import WordCloud
text_data = ' '.join(df['text_column'].astype(str))  # Combine all text (cast guards against non-string entries)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```
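If you prefer a bar chart of top words to a word cloud, `collections.Counter` from the standard library is enough for a first pass. A sketch with a hypothetical corpus in place of `df['text_column']`:

```python
from collections import Counter

texts = ["the cat sat", "the dog ran", "the cat ran"]  # stands in for df['text_column']
tokens = " ".join(texts).lower().split()  # naive whitespace tokenization
top_words = Counter(tokens).most_common(5)
words, counts = zip(*top_words)
print(top_words)
# Then plot with: plt.bar(words, counts)
```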
**Outcome:**
You identify common words, helping you decide whether to remove stopwords or focus on key terms.
"""
)
st.divider()
# Summary Section
st.subheader("Summary :star2:")
st.markdown(
"""
Performing EDA helps you **understand your data** before modeling. Key steps include:
1. **Inspect the Data**: Check structure, null values, and duplicates.
2. **Analyze Text Distribution**: Visualize text length to detect patterns.
3. **Check for Class Imbalance**: Ensure target labels are balanced.
4. **Visualize Word Frequencies**: Explore word-level patterns with word clouds or bar charts.
**Friendly Tip :bulb::**
EDA is an iterative process. Explore your data thoroughly, as the insights you gain will guide your preprocessing and modeling steps.
"""
)
st.divider()
if __name__ == "__main__":
    main()