Commit 435558b (verified) by Mpavan45 · Parent(s): db52ce6

Update app.py

Files changed (1): app.py (+50 -4)
app.py CHANGED
@@ -126,8 +126,8 @@ elif st.session_state.selected_page == "NLP Lifecycle":
 **Example of a problem statement**: The goal could be to classify customer reviews as either positive or negative, or to find the main topics in product reviews.
 """)
 
-elif lifecycle_option == "Data Collection":
-    st.write("""
+elif lifecycle_option == "Data Collection":
+    st.write("""
 #### 2. Data Collection
 Data collection is the second step in the NLP lifecycle. It involves gathering data from various sources based on the problem statement, so it can be analyzed and processed.
 - **Sources for data collection**:
@@ -138,8 +138,54 @@ elif st.session_state.selected_page == "NLP Lifecycle":
 - ✋ Manually, when needed.
 - In most cases, data is collected from websites, APIs, or through web scraping. However, manual collection may be necessary in rare cases.
 
-**Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
-""")
+**Example**: Scraping customer reviews from Amazon to analyze sentiment and feedback about a product.
+
+#### Data Extraction from Files
+After collecting the data, it is often stored in various file formats like JSON, CSV, Excel, or XML. Using the **Pandas** library in Python, we can extract and convert this data into a **DataFrame**, which is a structured format ideal for analysis.
+- **Steps for Data Extraction**:
+1. Identify the file format (e.g., `.json`, `.csv`, `.xlsx`, `.xml`).
+2. Use Pandas functions like `pd.read_csv()`, `pd.read_json()`, `pd.read_excel()`, or `pd.read_xml()` to load the data.
+3. Handle additional parameters based on the file structure (e.g., `delimiter` for CSV, `sheet_name` for Excel).
+4. Verify the data using methods like `df.head()` and `df.info()`.
+
+**Example Code for Data Extraction**:
+```python
+import pandas as pd
+
+# 1. Extracting Data from a CSV File
+print("Extracting data from CSV file...")
+csv_file = 'example_data.csv'  # Replace with your CSV file path
+df_csv = pd.read_csv(csv_file)
+print("CSV Data:")
+print(df_csv.head())  # Display the first few rows
+
+# 2. Extracting Data from a JSON File
+print("\\nExtracting data from JSON file...")
+json_file = 'example_data.json'  # Replace with your JSON file path
+df_json = pd.read_json(json_file)
+print("JSON Data:")
+print(df_json.head())  # Display the first few rows
+
+# 3. Extracting Data from an Excel File
+print("\\nExtracting data from Excel file...")
+excel_file = 'example_data.xlsx'  # Replace with your Excel file path
+df_excel = pd.read_excel(excel_file, sheet_name='Sheet1')  # Specify the sheet name if necessary
+print("Excel Data:")
+print(df_excel.head())  # Display the first few rows
+
+# 4. Extracting Data from an XML File
+print("\\nExtracting data from XML file...")
+xml_file = 'example_data.xml'  # Replace with your XML file path
+df_xml = pd.read_xml(xml_file)
+print("XML Data:")
+print(df_xml.head())  # Display the first few rows
+```
+
+#### Evaluating Data Balance: Balanced vs. Imbalanced Datasets
+After collecting the data, it is essential to check whether the dataset is **balanced or imbalanced**. A balanced dataset has an even distribution of classes or categories, while an imbalanced dataset has one or more classes underrepresented. Addressing this imbalance is crucial for accurate analysis and model performance.
+""")
+
+
 elif lifecycle_option == "Simple EDA":
     st.write("""
 #### 📊 3. Simple EDA
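Steps 3 and 4 of the added "Data Extraction from Files" section (handling a non-default delimiter, then verifying the load) can be sketched without any files on disk; the semicolon-delimited sample content below is invented for illustration and stands in for a real file path:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited CSV content (stands in for a real file).
csv_text = "review;label\ngreat product;pos\nbroke after a week;neg\n"

# Step 3: pass the delimiter explicitly when the file is not comma-separated.
df = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Step 4: verify the load before moving on.
print(df.head())          # first rows as a sanity check
df.info()                 # column dtypes and non-null counts
print(df.isnull().sum())  # missing values per column
```

With the wrong delimiter, the whole header would load as a single column, which is exactly what the `df.info()` check catches early.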
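The balanced-vs-imbalanced check described at the end of the added section can be sketched with `value_counts` on a made-up label column; the 0.8 threshold below is an arbitrary illustration, not a standard:

```python
import pandas as pd

# Hypothetical sentiment labels for collected reviews.
labels = pd.Series(["pos", "neg", "pos", "neg", "pos", "pos", "pos", "neg"])

counts = labels.value_counts()       # samples per class
ratio = counts.min() / counts.max()  # 1.0 means perfectly balanced

print(counts)
print(f"Imbalance ratio (min/max): {ratio:.2f}")

# Flag datasets whose minority class is badly underrepresented.
if ratio < 0.8:  # threshold chosen for illustration only
    print("Dataset looks imbalanced; consider resampling or class weights.")
```

Here 5 positive vs. 3 negative labels give a ratio of 0.6, so the sketch flags the dataset as imbalanced.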