Update app.py
app.py
CHANGED
|
@@ -180,34 +180,46 @@ elif st.session_state.selected_page == "NLP Lifecycle":
|
|
| 180 |
print("XML Data:")
|
| 181 |
print(df_xml.head()) # Display the first few rows
|
| 182 |
```
|
| 183 |
-
|
| 184 |
-
#### Evaluating Data Balance: Balanced vs. Imbalanced Datasets
|
| 185 |
-
After collecting the data, it is essential to check whether the dataset is **balanced or imbalanced**. A balanced dataset has an even distribution of classes or categories, while an imbalanced dataset has one or more classes underrepresented. Addressing this imbalance is crucial for accurate analysis and model performance.
|
| 186 |
""")
|
| 187 |
|
| 188 |
|
| 189 |
elif lifecycle_option == "Simple EDA":
|
| 190 |
st.write("""
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
-
|
| 202 |
-
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
""")
|
| 210 |
|
| 211 |
elif lifecycle_option == "Data Preprocessing":
|
| 212 |
st.write("""
|
| 213 |
#### 🧹 4. Text Preprocessing
|
| 180 |
print("XML Data:")
|
| 181 |
print(df_xml.head()) # Display the first few rows
|
| 182 |
```
|
| 183 |
+
With the code patterns above, data from CSV, JSON, Excel, and XML files can be loaded into a structured, tabular format that is ready for analysis.
|
| 184 |
""")
|
| 185 |
|
| 186 |
|
| 187 |
elif lifecycle_option == "Simple EDA":
|
| 188 |
st.write("""
|
| 189 |
+
#### 📊 3. Simple EDA
|
| 190 |
+
|
| 191 |
+
#### Checking Data Balance
|
| 192 |
+
Before proceeding with analysis, it's important to evaluate whether the dataset is **balanced or imbalanced**. This involves examining the distribution of classes or categories in the data. By calculating the count or percentage of instances in each class, we can determine if the data is evenly distributed or if certain classes are underrepresented. Addressing imbalanced datasets is crucial to ensure reliable analysis and modeling.
|
| 193 |
+
|
| 194 |
+
**Example**: In a classification dataset:
|
| 195 |
+
- Class Distribution:
|
| 196 |
+
- Class A: 700 instances
|
| 197 |
+
- Class B: 300 instances
|
| 198 |
+
- This 70:30 split is imbalanced. If it hurts performance on the minority class, techniques such as oversampling, undersampling, or synthetic data generation can help correct it.
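The class counts and percentages described above can be computed in a few lines. This is a minimal sketch, assuming pandas is available; the `labels` series is hypothetical data built to match the 700/300 example, not data from the app.

```python
import pandas as pd

# Hypothetical labels matching the 700/300 example above.
labels = pd.Series(["A"] * 700 + ["B"] * 300, name="label")

counts = labels.value_counts()                # instances per class
shares = labels.value_counts(normalize=True)  # fraction of the dataset per class

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

What ratio counts as "too imbalanced" depends on the task, so treat the number as a prompt to look closer rather than a fixed threshold.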
|
| 199 |
+
|
| 200 |
+
#### Simple Exploratory Data Analysis (EDA)
|
| 201 |
+
Simple EDA provides a high-level understanding of the dataset and its characteristics. It focuses on summarizing key features, identifying potential issues, and visualizing distributions to inform further analysis.
|
| 202 |
|
| 203 |
+
- **Basic Data Inspection**: Examine data types, view the first few rows, and understand the overall structure.
|
| 204 |
+
- **Summary Statistics**: Calculate key metrics like mean, median, and standard deviation to summarize numerical variables.
|
| 205 |
+
- **Basic Visualizations**: Use histograms, boxplots, and scatterplots to explore data distributions and relationships.
|
| 206 |
+
- **Missing Values Check**: Count missing values per column and row to decide what needs cleaning.
|
| 207 |
+
- **Outlier Detection**: Spot extreme values using visualizations and statistical methods.
|
| 208 |
+
|
| 209 |
+
**Example**: In a sales dataset:
|
| 210 |
+
- Data Inspection:
|
| 211 |
+
- Dataset shape: (1000, 5)
|
| 212 |
+
- Sample columns: [Sales, Marketing Spend, Date, etc.]
|
| 213 |
+
- Summary Statistics:
|
| 214 |
+
- Mean Sales: 1000
|
| 215 |
+
- Median Sales: 950
|
| 216 |
+
- Visualizations:
|
| 217 |
+
- Histogram for sales distribution
|
| 218 |
+
- Boxplot to detect outliers
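The inspection, summary, missing-value, and outlier steps listed above could look roughly like this. It is a sketch under stated assumptions: pandas is available, the small `df` is made-up illustration data (not the 1000-row dataset from the example), and the 1.5 × IQR rule is just one common choice for flagging outliers.

```python
import pandas as pd

# Hypothetical sales data; column names and values are made up for illustration.
df = pd.DataFrame({
    "Sales": [900, 950, 1000, 1050, 980, 5000, None],
    "Marketing Spend": [200, 210, 250, 260, 240, 300, 230],
})

# Basic inspection: shape and data types.
print(df.shape)
print(df.dtypes)

# Summary statistics (count, mean, std, quartiles) for numerical columns.
print(df.describe())

# Missing values per column.
missing = df.isna().sum()
print(missing)

# Outlier detection with the 1.5 * IQR rule on the Sales column.
q1 = df["Sales"].quantile(0.25)
q3 = df["Sales"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Sales"] < q1 - 1.5 * iqr) | (df["Sales"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Plots are left out to keep the sketch dependency-light; with matplotlib installed, `df["Sales"].plot(kind="hist")` and `df.boxplot(column="Sales")` are one common way to draw the histogram and boxplot.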
|
| 219 |
""")
|
| 220 |
|
| 221 |
+
|
| 222 |
+
|
| 223 |
elif lifecycle_option == "Data Preprocessing":
|
| 224 |
st.write("""
|
| 225 |
#### 🧹 4. Text Preprocessing
|