Update app.py
app.py CHANGED
@@ -126,64 +126,64 @@ elif st.session_state.selected_page == "NLP Lifecycle":
**Example of a problem statement**: The goal could be to classify customer reviews as either positive or negative, or to find the main topics in product reviews.
""")
elif lifecycle_option == "Data Collection":
    st.write("""
#### 2. Data Collection
Data collection is the second step in the NLP lifecycle. It involves gathering data from various sources, guided by the problem statement, so that it can be analyzed and processed.
- **Sources for data collection**:
  - The data should be collected based on a clear understanding of the problem statement.
  - From datasets available on platforms like Kaggle.
  - Through APIs.
  - Via web scraping, using tools like Selenium or BeautifulSoup to gather data from websites.
  - Manually, when needed.
- In most cases, data is collected from websites, APIs, or through web scraping. However, manual collection may be necessary in rare cases.

**Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
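The scraping idea can be sketched with Python's standard-library HTML parser. This is a minimal illustration only: the page markup and the `review-text` class below are hypothetical, real sites such as Amazon use their own markup, and their terms of service should be checked before scraping.

```python
from html.parser import HTMLParser

# Minimal sketch: collect the text of <span class="review-text"> elements.
# The "review-text" class is a made-up example, not a real site's markup.
class ReviewParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next data chunk belongs to a review.
        if tag == "span" and ("class", "review-text") in attrs:
            self.in_review = True

    def handle_data(self, data):
        if self.in_review:
            self.reviews.append(data.strip())
            self.in_review = False

html = '<div><span class="review-text">Great product!</span><span class="review-text">Too slow.</span></div>'
parser = ReviewParser()
parser.feed(html)
print(parser.reviews)  # ['Great product!', 'Too slow.']
```

In practice the HTML would come from an HTTP response, and a dedicated library like BeautifulSoup makes the traversal far less manual.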

#### Data Extraction from Files
After collecting the data, it is often stored in file formats such as JSON, CSV, Excel, or XML. Using the **Pandas** library in Python, we can load this data into a **DataFrame**, a structured format that is ideal for analysis.
- **Steps for Data Extraction**:
1. Identify the file format (e.g., `.json`, `.csv`, `.xlsx`, `.xml`).
2. Use Pandas functions like `pd.read_csv()`, `pd.read_json()`, `pd.read_excel()`, or `pd.read_xml()` to load the data.
3. Handle additional parameters based on the file structure (e.g., `delimiter` for CSV, `sheet_name` for Excel).
4. Inspect the loaded data using methods like `df.head()` and `df.info()` before cleaning it.

**Example Code for Data Extraction**:
```python
import pandas as pd

# 1. Extracting Data from a CSV File
print("Extracting data from CSV file...")
csv_file = 'example_data.csv'  # Replace with your CSV file path
df_csv = pd.read_csv(csv_file)
print("CSV Data:")
print(df_csv.head())  # Display the first few rows

# 2. Extracting Data from a JSON File
print("\\nExtracting data from JSON file...")
json_file = 'example_data.json'  # Replace with your JSON file path
df_json = pd.read_json(json_file)
print("JSON Data:")
print(df_json.head())  # Display the first few rows

# 3. Extracting Data from an Excel File
print("\\nExtracting data from Excel file...")
excel_file = 'example_data.xlsx'  # Replace with your Excel file path
df_excel = pd.read_excel(excel_file, sheet_name='Sheet1')  # Specify the sheet name if necessary
print("Excel Data:")
print(df_excel.head())  # Display the first few rows

# 4. Extracting Data from an XML File
print("\\nExtracting data from XML file...")
xml_file = 'example_data.xml'  # Replace with your XML file path
df_xml = pd.read_xml(xml_file)
print("XML Data:")
print(df_xml.head())  # Display the first few rows
```

#### Evaluating Data Balance: Balanced vs. Imbalanced Datasets
After collecting the data, it is essential to check whether the dataset is **balanced or imbalanced**. A balanced dataset has an even distribution of classes or categories, while an imbalanced dataset has one or more classes underrepresented. Addressing this imbalance is crucial for accurate analysis and model performance.
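As a quick sketch of such a check (the `sentiment` column and the 80/20 split below are made-up illustrative data, not a real dataset), class proportions can be computed with pandas:

```python
import pandas as pd

# Hypothetical labelled reviews: 80 positive, 20 negative.
df = pd.DataFrame({"sentiment": ["positive"] * 80 + ["negative"] * 20})

# Relative class frequencies reveal imbalance at a glance.
proportions = df["sentiment"].value_counts(normalize=True)
print(proportions)  # positive 0.8, negative 0.2 -> imbalanced
```

A strongly skewed distribution like this usually calls for techniques such as resampling or class weighting before modeling.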
""")

elif lifecycle_option == "Simple EDA":