# NLPHub: stages/simple_eda.py
import streamlit as st
def main():
st.title("Step 3: Simple Exploratory Data Analysis (EDA) :bar_chart:")
# Introduction Section
st.markdown(
"""
Exploratory Data Analysis (EDA) is a crucial step in understanding your text data before diving into modeling. It involves inspecting, visualizing, and summarizing the dataset to uncover patterns, trends, and insights.
Think of it as getting to know your data: **What does it contain? What issues might need fixing? How is the text distributed?**
"""
)
st.divider()
# Section 1: Inspect the Data
st.subheader("1. Inspect the Data :card_file_box:")
st.write(
"""
Start by taking a close look at the data to understand its structure and content. This step ensures you know what you are working with and can identify any anomalies.
"""
)
st.markdown(
"""
**Key Actions:**
- Display the first few rows of the dataset.
- Check for missing values, duplicates, and null entries.
- Examine the shape (number of rows and columns).
**Example Code (using pandas):**
```python
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head()) # View first rows
print(df.info()) # Check structure and null values
print(df.duplicated().sum()) # Count duplicates
```
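A per-column breakdown of missing values complements `df.info()`. The snippet below is a self-contained sketch, using a tiny inline frame in place of the `data.csv` loaded above:

```python
import pandas as pd

# Tiny illustrative frame standing in for the loaded dataset
df = pd.DataFrame({"text": ["hello", None, "world", "world"]})

missing_per_column = df.isnull().sum()  # missing values in each column
duplicate_rows = df.duplicated().sum()  # number of fully duplicated rows
print(missing_per_column)
print(duplicate_rows)
```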
**Outcome:**
You gain a high-level understanding of the data's structure and identify initial issues like missing values or duplicates.
"""
)
st.divider()
# Section 2: Analyze Text Distribution
st.subheader("2. Analyze Text Distribution :arrows_counterclockwise:")
st.write(
"""
Text data often varies in length and quality. Analyzing the distribution of text length helps you understand how your data is spread and whether it needs preprocessing.
"""
)
st.markdown(
"""
**Key Actions:**
- Calculate the number of words or characters per text entry.
- Plot a histogram to visualize the distribution.
**Example Code (using matplotlib):**
```python
import matplotlib.pyplot as plt
df['text_length'] = df['text_column'].astype(str).apply(len)  # Character count per entry (cast guards against non-string values)
plt.hist(df['text_length'], bins=50, color='skyblue')
plt.title("Distribution of Text Length")
plt.xlabel("Text Length")
plt.ylabel("Frequency")
plt.show()
```
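Alongside the histogram, `describe()` summarizes the length distribution numerically, and a word count can complement the character count. A self-contained sketch, with a small inline example standing in for `df`:

```python
import pandas as pd

# Small inline example standing in for the real text_column
df = pd.DataFrame({"text_column": ["a short text", "a much longer example text here"]})
df['char_length'] = df['text_column'].str.len()             # characters per entry
df['word_count'] = df['text_column'].str.split().str.len()  # words per entry
print(df['char_length'].describe())  # count, mean, min/max, quartiles
```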
**Outcome:**
You get insights into the range of text lengths, which can guide tokenization and preprocessing later.
"""
)
st.divider()
# Section 3: Check for Class Imbalance
st.subheader("3. Check for Class Imbalance :small_red_triangle:")
st.write(
"""
In tasks like classification, it's important to check whether your target labels are balanced. If one class heavily dominates, the model may become biased toward it.
"""
)
st.markdown(
"""
**Key Actions:**
- Count the occurrences of each class in the target column.
- Visualize the class distribution using bar plots.
**Example Code:**
```python
class_counts = df['target_column'].value_counts()
class_counts.plot(kind='bar', color=['skyblue', 'orange', 'green'])
plt.title("Class Distribution")
plt.xlabel("Classes")
plt.ylabel("Count")
plt.show()
```
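A simple numeric check, the imbalance ratio (majority class count divided by minority class count), complements the bar plot. A sketch with a hypothetical label column in place of `target_column`:

```python
import pandas as pd

labels = pd.Series(["pos", "pos", "pos", "neg"])  # hypothetical target_column
class_counts = labels.value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")  # large ratios suggest resampling or class weights
```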
**Outcome:**
You identify whether class imbalance exists, helping you plan techniques like resampling or weighted loss functions.
"""
)
st.divider()
# Section 4: Visualize Word Frequencies
st.subheader("4. Visualize Word Frequencies :mag:")
st.write(
"""
Understanding the most common words in your dataset provides insights into its content and themes. Visualization tools like word clouds make this process easier.
"""
)
st.markdown(
"""
**Key Actions:**
- Tokenize the text to split it into words.
- Calculate word frequencies.
- Generate visualizations like word clouds or bar charts.
**Example Code (using WordCloud):**
```python
from wordcloud import WordCloud
text_data = ' '.join(df['text_column'].astype(str))  # Combine all text (cast guards against non-string entries)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```
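If you prefer a bar chart of top words to a word cloud, `collections.Counter` from the standard library is enough for a first pass. A sketch with a hypothetical corpus in place of `df['text_column']`:

```python
from collections import Counter

texts = ["the cat sat", "the dog ran", "the cat ran"]  # stands in for df['text_column']
tokens = " ".join(texts).lower().split()  # naive whitespace tokenization
top_words = Counter(tokens).most_common(5)
words, counts = zip(*top_words)
print(top_words)
# Then plot with: plt.bar(words, counts)
```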
**Outcome:**
You identify common words, helping you decide whether to remove stopwords or focus on key terms.
"""
)
st.divider()
# Summary Section
st.subheader("Summary :star2:")
st.markdown(
"""
Performing EDA helps you **understand your data** before modeling. Key steps include:
1. **Inspect the Data**: Check structure, null values, and duplicates.
2. **Analyze Text Distribution**: Visualize text length to detect patterns.
3. **Check for Class Imbalance**: Ensure target labels are balanced.
4. **Visualize Word Frequencies**: Explore word-level patterns with word clouds or bar charts.
**Friendly Tip :bulb::**
EDA is an iterative process. Explore your data thoroughly, as the insights you gain will guide your preprocessing and modeling steps.
"""
)
st.divider()
if __name__ == "__main__":
    main()