Mpavan45 committed on
Commit
fc5491b
·
verified ·
1 Parent(s): 95614fa

Update app.py

Files changed (1)
  1. app.py +32 -20
app.py CHANGED
@@ -180,34 +180,46 @@ elif st.session_state.selected_page == "NLP Lifecycle":
180
  print("XML Data:")
181
  print(df_xml.head()) # Display the first few rows
182
  ```
183
-
184
- #### Evaluating Data Balance: Balanced vs. Imbalanced Datasets
185
- After collecting the data, it is essential to check whether the dataset is **balanced or imbalanced**. A balanced dataset has an even distribution of classes or categories, while an imbalanced dataset has one or more classes underrepresented. Addressing this imbalance is crucial for accurate analysis and model performance.
186
  """)
187
 
188
 
189
  elif lifecycle_option == "Simple EDA":
190
  st.write("""
191
- #### 📊 3. Simple EDA
192
- Simple Exploratory Data Analysis (Simple EDA) provides a quick overview of the dataset. It focuses on understanding the basic structure, spotting missing values, checking data types, and visualizing distributions.
193
- - **Basic Data Inspection**: Viewing data types, first few rows, and general structure.
194
- - **Summary Statistics**: Quick summary of key metrics like mean, median, and standard deviation.
195
- - **Basic Visualizations**: Simple charts like histograms and boxplots to explore variable distributions.
196
- - **Missing Values Check**: Identifying columns with missing values.
197
- - **Outlier Detection**: Visual identification of outliers.
198
 
199
- **Example**: In a sales dataset:
200
- - Basic Data Inspection:
201
- - Shape of the dataset: (1000, 5)
202
- - First few rows: [Sales, Marketing Spend, Date, etc.]
203
- - Summary Statistics:
204
- - Mean Sales: 1000
205
- - Median Sales: 950
206
- - Visualizations:
207
- - Histogram for sales distribution
208
- - Boxplot for outlier detection
209
  """)
210
 
211
  elif lifecycle_option == "Data Preprocessing":
212
  st.write("""
213
  #### 🧹 4. Text Preprocessing
 
180
  print("XML Data:")
181
  print(df_xml.head()) # Display the first few rows
182
  ```
183
+ Using the above code structure, we can efficiently extract data from various file formats such as CSV, JSON, Excel, and XML, and load it into a structured format suitable for analysis.
184
  """)
185
 
186
 
187
  elif lifecycle_option == "Simple EDA":
188
  st.write("""
189
+ #### 📊 3. Simple EDA
190
+
191
+ #### Checking Data Balance
192
+ Before proceeding with analysis, it's important to evaluate whether the dataset is **balanced or imbalanced** by examining the distribution of classes or categories. Counting the instances in each class, or computing each class's percentage of the total, shows whether the data is evenly distributed or whether certain classes are underrepresented. Imbalance matters because a model trained on skewed data can reach high accuracy by favoring the majority class while performing poorly on the minority class.
193
+
194
+ **Example**: In a classification dataset:
195
+ - Class Distribution:
196
+ - Class A: 700 instances
197
+ - Class B: 300 instances
198
+ - The classes are split 70:30. Depending on the task, this may call for oversampling the minority class, undersampling the majority class, or generating synthetic minority samples.
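The class-distribution check described in this hunk can be sketched in a few lines. This is a minimal illustration, not code from `app.py`: the `class_distribution` helper and the `labels` list are hypothetical, and the data simply mirrors the 700/300 example. It uses only the standard library; with pandas, `Series.value_counts(normalize=True)` gives the same proportions.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, percent)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, 100 * n / total) for cls, n in counts.items()}


# Hypothetical labels matching the 700/300 example in the text
labels = ["A"] * 700 + ["B"] * 300
dist = class_distribution(labels)

for cls, (n, pct) in sorted(dist.items()):
    print(f"Class {cls}: {n} instances ({pct:.1f}%)")

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
sizes = [n for n, _ in dist.values()]
imbalance_ratio = max(sizes) / min(sizes)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1.0 is a signal to consider resampling; where to draw that line depends on the task and is not specified in the source.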
199
+
200
+ #### Simple Exploratory Data Analysis (Simple EDA)
201
+ Simple EDA provides a high-level understanding of the dataset and its characteristics. It focuses on summarizing key features, identifying potential issues, and visualizing distributions to inform further analysis.
202
 
203
+ - **Basic Data Inspection**: Examine data types, view the first few rows, and understand the overall structure.
204
+ - **Summary Statistics**: Calculate key metrics like mean, median, and standard deviation to summarize numerical variables.
205
+ - **Basic Visualizations**: Use histograms, boxplots, and scatterplots to explore data distributions and relationships.
206
+ - **Missing Values Check**: Identify any missing data in columns and rows for potential cleaning.
207
+ - **Outlier Detection**: Spot extreme values using visualizations and statistical methods.
208
+
209
+ **Example**: In a sales dataset:
210
+ - Data Inspection:
211
+ - Dataset shape: (1000, 5)
212
+ - Sample columns: [Sales, Marketing Spend, Date, etc.]
213
+ - Summary Statistics:
214
+ - Mean Sales: 1000
215
+ - Median Sales: 950
216
+ - Visualizations:
217
+ - Histogram for sales distribution
218
+ - Boxplot to detect outliers
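The inspection steps listed in this hunk (summary statistics, missing-value check, outlier detection) can be sketched as follows. This is an illustrative example under stated assumptions, not part of the commit: the `sales` and `marketing_spend` values are made up, `summarize` and `iqr_outliers` are hypothetical helpers, and the 1.5 × IQR rule is one common choice for the "statistical methods" the text mentions.

```python
import statistics

# Hypothetical data: one sales value is deliberately extreme,
# and None marks a missing marketing-spend entry.
sales = [900, 950, 1000, 980, 940, 1020, 960, 2500]
marketing_spend = [200, None, 220, 210, None, 230, 205, 215]


def summarize(values):
    """Count missing entries and compute basic statistics on the rest."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


sales_summary = summarize(sales)
spend_summary = summarize(marketing_spend)
print("Sales summary:", sales_summary)
print("Missing marketing spend values:", spend_summary["missing"])
print("Sales outliers:", iqr_outliers(sales))
```

In this sample the mean sits well above the median because of the single extreme value, which is the same kind of gap the sales example shows between mean and median sales.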
219
  """)
220
 
221
+
222
+
223
  elif lifecycle_option == "Data Preprocessing":
224
  st.write("""
225
  #### 🧹 4. Text Preprocessing