Commit 435558b (verified) by Mpavan45 · Parent(s): db52ce6

Update app.py

Files changed (1): app.py (+50 -4)
app.py CHANGED
@@ -126,8 +126,8 @@ elif st.session_state.selected_page == "NLP Lifecycle":
 **Example of a problem statement**: The goal could be to classify customer reviews as either positive or negative, or to find the main topics in product reviews.
 """)
 
-elif lifecycle_option == "Data Collection":
-    st.write("""
+elif lifecycle_option == "Data Collection":
+    st.write("""
 #### 2. Data Collection
 Data collection is the second step in the NLP lifecycle. It involves gathering data from various sources based on the problem statement, so it can be analyzed and processed.
 - **Sources for data collection**:
@@ -138,8 +138,54 @@ elif st.session_state.selected_page == "NLP Lifecycle":
 - ✋ Manually, when needed.
 - In most cases, data is collected from websites, APIs, or through web scraping. However, manual collection may be necessary in rare cases.
 
-**Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
-""")
+**Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
+
+#### Data Extraction from Files
+After collecting the data, it is often stored in various file formats like JSON, CSV, Excel, or XML. Using the **Pandas** library in Python, we can extract and convert this data into a **DataFrame**, which is a structured format ideal for analysis.
+- **Steps for Data Extraction**:
+1. Identify the file format (e.g., `.json`, `.csv`, `.xlsx`, `.xml`).
+2. Use Pandas functions like `pd.read_csv()`, `pd.read_json()`, `pd.read_excel()`, or `pd.read_xml()` to load the data.
+3. Handle additional parameters based on the file structure (e.g., `delimiter` for CSV, `sheet_name` for Excel).
+4. Verify the data using methods like `df.head()` and `df.info()`.
+
+**Example Code for Data Extraction**:
+```python
+import pandas as pd
+
+# 1. Extracting Data from a CSV File
+print("Extracting data from CSV file...")
+csv_file = 'example_data.csv'  # Replace with your CSV file path
+df_csv = pd.read_csv(csv_file)
+print("CSV Data:")
+print(df_csv.head())  # Display the first few rows
+
+# 2. Extracting Data from a JSON File
+print("\\nExtracting data from JSON file...")
+json_file = 'example_data.json'  # Replace with your JSON file path
+df_json = pd.read_json(json_file)
+print("JSON Data:")
+print(df_json.head())  # Display the first few rows
+
+# 3. Extracting Data from an Excel File
+print("\\nExtracting data from Excel file...")
+excel_file = 'example_data.xlsx'  # Replace with your Excel file path
+df_excel = pd.read_excel(excel_file, sheet_name='Sheet1')  # Specify the sheet name if necessary
+print("Excel Data:")
+print(df_excel.head())  # Display the first few rows
+
+# 4. Extracting Data from an XML File
+print("\\nExtracting data from XML file...")
+xml_file = 'example_data.xml'  # Replace with your XML file path
+df_xml = pd.read_xml(xml_file)
+print("XML Data:")
+print(df_xml.head())  # Display the first few rows
+```
+
+#### Evaluating Data Balance: Balanced vs. Imbalanced Datasets
+After collecting the data, it is essential to check whether the dataset is **balanced or imbalanced**. A balanced dataset has an even distribution of classes or categories, while an imbalanced dataset has one or more classes underrepresented. Addressing this imbalance is crucial for accurate analysis and model performance.
+""")
+
+
 elif lifecycle_option == "Simple EDA":
     st.write("""
 #### 📊 3. Simple EDA
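Steps 3 and 4 of the added "Data Extraction from Files" section (handling a non-default delimiter, then verifying the load) can be sketched without any files on disk; the semicolon-delimited sample content below is invented for illustration and stands in for a real file path:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited CSV content (stands in for a real file).
csv_text = "review;label\ngreat product;pos\nbroke after a week;neg\n"

# Step 3: pass the delimiter explicitly when the file is not comma-separated.
df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Step 4: verify the load before moving on.
print(df.head())          # first rows as a sanity check
df.info()                 # column dtypes and non-null counts
print(df.isnull().sum())  # missing values per column
```

With the wrong delimiter, the whole header would load as a single column, which is exactly what the `df.info()` check catches early.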
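The balanced-vs-imbalanced check described at the end of the added section can be sketched with `value_counts` on a made-up label column; the 0.8 threshold below is an arbitrary illustration, not a standard:

```python
import pandas as pd

# Hypothetical sentiment labels for collected reviews.
labels = pd.Series(["pos", "neg", "pos", "neg", "pos", "pos", "pos", "neg"])

counts = labels.value_counts()       # samples per class
ratio = counts.min() / counts.max()  # 1.0 means perfectly balanced

print(counts)
print(f"Imbalance ratio (min/max): {ratio:.2f}")

# Flag datasets whose minority class is badly underrepresented.
if ratio < 0.8:  # threshold chosen for illustration only
    print("Dataset looks imbalanced; consider resampling or class weights.")
```

Here 5 positive vs. 3 negative labels give a ratio of 0.6, so the sketch flags the dataset as imbalanced.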