{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "4ba6aba8" }, "source": [ "# 🤖 **Data Collection, Creation, Storage, and Processing**\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jpASMyIQMaAq" }, "source": [ "## **1.** 📦 Install required packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "f48c8f8c", "outputId": "7e50dd72-2e2a-47b1-e59b-ed7c5511d9cd" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n", "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n", "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n", "Requirement already satisfied: textblob in /usr/local/lib/python3.12/dist-packages (0.19.0)\n", "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n", "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.3)\n", "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.3.3)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n", "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.62.1)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.5.0)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n", "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n", "Requirement already satisfied: nltk>=3.9 in /usr/local/lib/python3.12/dist-packages (from textblob) (3.9.1)\n", "Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (8.3.1)\n", "Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (1.5.3)\n", "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (2025.11.3)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (4.67.3)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ "!pip install beautifulsoup4 pandas matplotlib seaborn numpy textblob" ] }, { "cell_type": "markdown", "metadata": { "id": "lquNYCbfL9IM" }, "source": [ "## **2.** ⛏ Load the Superstore dataset from Kaggle\n", "\n", "Dataset source: [Superstore Dataset – Kaggle](https://www.kaggle.com/datasets/vivek468/superstore-dataset-final?resource=download)" ] }, { "cell_type": "markdown", "metadata": { "id": "0IWuNpxxYDJF" }, "source": [ "### *a. Initial setup*\n", "Define the base url of the dataset source as well as how and what you will load" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "91d52125" }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd\n", "import time\n", "\n", "# Dataset source\n", "base_url = \"https://www.kaggle.com/datasets/vivek468/superstore-dataset-final?resource=download\"\n", "headers = {\"User-Agent\": \"Mozilla/5.0\"}\n", "\n", "df_raw = pd.read_csv(\"Sample - Superstore.csv\", encoding=\"latin-1\")\n", "df_raw[\"Order Date\"] = pd.to_datetime(df_raw[\"Order Date\"], format=\"%m/%d/%Y\")\n", "df_raw[\"Ship Date\"] = pd.to_datetime(df_raw[\"Ship Date\"], format=\"%m/%d/%Y\")\n", "\n", "sub_categories, avg_prices, avg_profits = [], [], []" ] }, { "cell_type": "markdown", "metadata": { "id": "oCdTsin2Yfp3" }, "source": [ "### *b. Fill sub_categories, avg_prices, and avg_profits from the dataset*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xqO5Y3dnYhxt" }, "outputs": [], "source": [ "# Aggregate over all Sub-Categories\n", "for sub_cat, group in df_raw.groupby(\"Sub-Category\"):\n", " sub_categories.append(sub_cat)\n", " avg_prices.append(round(group[\"Sales\"].sum() / group[\"Quantity\"].sum(), 2))\n", " avg_profits.append(round(group[\"Profit\"].mean(), 2))\n", "\n", " time.sleep(0) # kept for structural parity" ] }, { "cell_type": "markdown", "metadata": { "id": "T0TOeRC4Yrnn" }, "source": [ "### *c. ✋🏻🛑⛔️ Create a dataframe df_products that contains the now complete \"sub_category\", \"avg_price\", and \"avg_profit\" objects*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "l5FkkNhUYTHh" }, "outputs": [], "source": [ "# 🗂️ Create DataFrame\n", "df_products = pd.DataFrame({\n", " \"sub_category\": sub_categories,\n", " \"avg_price\": avg_prices,\n", " \"avg_profit\": avg_profits\n", "})" ] }, { "cell_type": "markdown", "metadata": { "id": "duI5dv3CZYvF" }, "source": [ "### *d. Save dataframe either as a CSV or Excel file*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lC1U_YHtZifh" }, "outputs": [], "source": [ "# 💾 Save to CSV\n", "df_products.to_csv(\"superstore_data.csv\", index=False)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qMjRKMBQZlJi" }, "source": [ "### *e. ✋🏻🛑⛔️ View first few lines*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "O_wIvTxYZqCK", "outputId": "5c622472-47de-4352-daa9-c7da226d0c30" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sub_category avg_price avg_profit\n", "0 Accessories 57.42 55.81\n", "1 Appliances 62.69 38.47\n", "2 Art 9.09 8.15\n", "3 Binders 33.07 20.72\n", "4 Bookcases 124.61 -19.17" ], "text/html": [ "\n", "
| \n", " | sub_category | \n", "avg_price | \n", "avg_profit | \n", "
|---|---|---|---|
| 0 | \n", "Accessories | \n", "57.42 | \n", "55.81 | \n", "
| 1 | \n", "Appliances | \n", "62.69 | \n", "38.47 | \n", "
| 2 | \n", "Art | \n", "9.09 | \n", "8.15 | \n", "
| 3 | \n", "Binders | \n", "33.07 | \n", "20.72 | \n", "
| 4 | \n", "Bookcases | \n", "124.61 | \n", "-19.17 | \n", "