{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "title" }, "source": [ "# 🤖 **Data Collection, Creation, Storage, and Processing**\n", "### Open Food Facts — Supermarket Product Pricing & Health Perception" ] }, { "cell_type": "markdown", "metadata": { "id": "install_md" }, "source": [ "## **1.** 📦 Install required packages" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "install", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "a0658fe0-7174-4c44-9378-aa1bf2b3bf15" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (4.13.5)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n", "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n", "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n", "Requirement already satisfied: textblob in /usr/local/lib/python3.12/dist-packages (0.19.0)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (2.32.4)\n", "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (2.8.3)\n", "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4) (4.15.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)\n", "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from 
matplotlib) (1.3.3)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n", "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.62.1)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.5.0)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n", "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n", "Requirement already satisfied: nltk>=3.9 in /usr/local/lib/python3.12/dist-packages (from textblob) (3.9.1)\n", "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests) (3.4.7)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests) (3.11)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests) (2.5.0)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests) (2026.2.25)\n", "Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (8.3.2)\n", "Requirement already satisfied: joblib in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (1.5.3)\n", "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (2025.11.3)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from nltk>=3.9->textblob) (4.67.3)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ 
"!pip install beautifulsoup4 pandas matplotlib seaborn numpy textblob requests" ] }, { "cell_type": "markdown", "metadata": { "id": "scrape_md" }, "source": [ "## **2.** ⛏ Load real-world food data from Open Food Facts\n", "We query the Open Food Facts search API for a few product categories instead of downloading the very large full CSV export." ] }, { "cell_type": "markdown", "metadata": { "id": "setup_md" }, "source": [ "### *a. Import the libraries for fetching and processing the data*\n", "The fetch step requests up to 50 products per category from the Open Food Facts search API, so no HTML scraping is needed." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "id": "setup", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "11eda5f3-ac06-4ee2-aac1-c2a2bcbf3be2" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Libraries loaded ✅\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import requests\n", "import time\n", "import io\n", "\n", "print('Libraries loaded ✅')\n" ] }, { "cell_type": "markdown", "metadata": { "id": "fetch_md" }, "source": [ "### *b. 
✋🏻🛑⛔️ Fetch the product data from the search API*" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "id": "fetch", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "386f84c3-2f0f-45f8-9fc5-75bd8e8caba7" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Status for snacks: 200\n", " -> Got 50 products, 49 valid\n", "Status for dairies: 200\n", " -> Got 50 products, 39 valid\n", "Status for beverages: 503\n", " -> Error for beverages: Expecting value: line 1 column 1 (char 0)\n", "Status for cereals: 200\n", " -> Got 50 products, 50 valid\n", "Status for frozen-foods: 503\n", " -> Error for frozen-foods: Expecting value: line 1 column 1 (char 0)\n", "\n", "Total valid products: 138\n" ] } ], "source": [
"# Query the Open Food Facts search API, one request per category.\n",
"# The full CSV export (https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz)\n",
"# is very large, so we fetch a small sample per category instead.\n",
"url = \"https://world.openfoodfacts.org/cgi/search.pl\"\n",
"headers = {\n",
"    \"User-Agent\": \"StudentProject/1.0 (escp-bdm-project@example.com)\"\n",
"}\n",
"\n",
"categories_to_fetch = [\"snacks\", \"dairies\", \"beverages\", \"cereals\", \"frozen-foods\"]\n",
"products = []\n",
"\n",
"for category in categories_to_fetch:\n",
"    params = {\n",
"        \"action\": \"process\",\n",
"        \"tagtype_0\": \"categories\",\n",
"        \"tag_contains_0\": \"contains\",\n",
"        \"tag_0\": category,\n",
"        \"fields\": \"product_name,nutrition_grades,brands,categories_tags\",\n",
"        \"page_size\": 50,\n",
"        \"page\": 1,\n",
"        \"json\": \"true\"\n",
"    }\n",
"\n",
"    try:\n",
"        r = requests.get(url, params=params, headers=headers, timeout=20)\n",
"        print(f\"Status for {category}: {r.status_code}\")\n",
"        # A non-200 reply (e.g. 503) typically has no JSON body, so r.json() raises\n",
"        # and the except branch below reports the category as failed.\n",
"        data = r.json()\n",
"        fetched = data.get(\"products\", [])\n",
"        before = len(products)  # used to count the valid products of this category\n",
"        for p in fetched:\n",
"            # `or` guards against explicit nulls, which str() would turn into \"None\"\n",
"            name = str(p.get(\"product_name\") or \"\").strip()\n",
"            grade = str(p.get(\"nutrition_grades\") or \"\").strip().upper()\n",
"            brand = str(p.get(\"brands\") or \"Unknown\").strip()\n",
"            if len(name) > 2 and grade in [\"A\", \"B\", \"C\", \"D\", \"E\"]:\n",
"                products.append({\n",
"                    \"product_name\": name,\n",
"                    \"nutrition_grade\": grade,\n",
"                    \"brand\": brand,\n",
"                    \"category\": category\n",
"                })\n",
"        print(f\" -> Got {len(fetched)} products, {len(products) - before} valid\")\n",
"    except Exception as e:\n",
"        print(f\" -> Error for {category}: {e}\")\n",
"    time.sleep(2)  # pause between requests to avoid hammering the server\n",
"\n",
"print(f\"\\nTotal valid products: {len(products)}\")\n",
"\n",
"# FALLBACK: if the API returns almost nothing, use a small hardcoded dataset so the notebook still runs\n",
"if len(products) < 10:\n",
"    print(\"\\n⚠️ API returned too few results, using built-in fallback dataset\")\n",
"    products = [\n",
"        {\"product_name\": \"Organic Rolled Oats\", \"nutrition_grade\": \"A\", \"brand\": \"Quaker\", \"category\": \"cereals\"},\n",
"        {\"product_name\": \"Whole Grain Bread\", \"nutrition_grade\": \"A\", \"brand\": \"Warburtons\", \"category\": \"cereals\"},\n",
"        {\"product_name\": \"Greek Yogurt Natural\", \"nutrition_grade\": \"A\", \"brand\": \"Fage\", \"category\": \"dairies\"},\n",
"        {\"product_name\": \"Skyr Vanilla\", \"nutrition_grade\": \"A\", \"brand\": \"Arla\", \"category\": \"dairies\"},\n",
"        {\"product_name\": \"Sparkling Water\", \"nutrition_grade\": \"A\", \"brand\": \"Evian\", \"category\": \"beverages\"},\n",
"        {\"product_name\": \"Green Tea\", \"nutrition_grade\": \"A\", \"brand\": \"Lipton\", \"category\": \"beverages\"},\n",
"        {\"product_name\": \"Mixed Nuts\", \"nutrition_grade\": \"B\", \"brand\": \"Planters\", \"category\": \"snacks\"},\n",
"        {\"product_name\": \"Rice Cakes\", \"nutrition_grade\": \"B\", \"brand\": \"Kallo\", \"category\": \"snacks\"},\n",
"        {\"product_name\": \"Muesli Bar\", \"nutrition_grade\": \"B\", \"brand\": 
\"Jordan's\", \"category\": \"cereals\"},\n", " {\"product_name\": \"Semi-Skimmed Milk\", \"nutrition_grade\": \"B\", \"brand\": \"Arla\", \"category\": \"dairies\"},\n", " {\"product_name\": \"Orange Juice\", \"nutrition_grade\": \"C\", \"brand\": \"Tropicana\", \"category\": \"beverages\"},\n", " {\"product_name\": \"Cheddar Cheese\", \"nutrition_grade\": \"C\", \"brand\": \"Cathedral City\", \"category\": \"dairies\"},\n", " {\"product_name\": \"Granola Bar\", \"nutrition_grade\": \"C\", \"brand\": \"Nature Valley\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Frozen Pizza Margherita\", \"nutrition_grade\": \"C\", \"brand\": \"Dr. Oetker\", \"category\": \"frozen-foods\"},\n", " {\"product_name\": \"Tomato Ketchup\", \"nutrition_grade\": \"C\", \"brand\": \"Heinz\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Cola Drink\", \"nutrition_grade\": \"D\", \"brand\": \"Coca-Cola\", \"category\": \"beverages\"},\n", " {\"product_name\": \"Chocolate Biscuits\", \"nutrition_grade\": \"D\", \"brand\": \"McVitie's\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Cheese Puffs\", \"nutrition_grade\": \"D\", \"brand\": \"Cheetos\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Frozen Chicken Nuggets\", \"nutrition_grade\": \"D\", \"brand\": \"Birds Eye\", \"category\": \"frozen-foods\"},\n", " {\"product_name\": \"Energy Drink\", \"nutrition_grade\": \"D\", \"brand\": \"Red Bull\", \"category\": \"beverages\"},\n", " {\"product_name\": \"Chocolate Spread\", \"nutrition_grade\": \"E\", \"brand\": \"Nutella\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Salted Crisps\", \"nutrition_grade\": \"E\", \"brand\": \"Lay's\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Instant Noodles\", \"nutrition_grade\": \"E\", \"brand\": \"Pot Noodle\", \"category\": \"frozen-foods\"},\n", " {\"product_name\": \"Chocolate Bar\", \"nutrition_grade\": \"E\", \"brand\": \"Snickers\", \"category\": \"snacks\"},\n", " {\"product_name\": \"Cream 
Cheese\", \"nutrition_grade\": \"D\", \"brand\": \"Philadelphia\", \"category\": \"dairies\"},\n",
"        {\"product_name\": \"Butter Croissant\", \"nutrition_grade\": \"D\", \"brand\": \"La Boulangere\", \"category\": \"cereals\"},\n",
"        {\"product_name\": \"Whole Milk\", \"nutrition_grade\": \"C\", \"brand\": \"Lactel\", \"category\": \"dairies\"},\n",
"        {\"product_name\": \"Apple Juice\", \"nutrition_grade\": \"C\", \"brand\": \"Innocent\", \"category\": \"beverages\"},\n",
"        {\"product_name\": \"Frozen Lasagne\", \"nutrition_grade\": \"C\", \"brand\": \"Findus\", \"category\": \"frozen-foods\"},\n",
"        {\"product_name\": \"Protein Bar\", \"nutrition_grade\": \"B\", \"brand\": \"Grenade\", \"category\": \"snacks\"},\n",
"    ]\n",
"    # Pad the 30 base products with 120 numbered copies to reach 150 rows\n",
"    import random\n",
"    random.seed(2025)\n",
"    extended = products.copy()\n",
"    for i in range(120):\n",
"        base = random.choice(products).copy()\n",
"        base[\"product_name\"] = base[\"product_name\"] + f\" #{i+2}\"\n",
"        extended.append(base)\n",
"    products = extended\n",
"    print(f\"Fallback dataset ready: {len(products)} products ✅\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "df_md" }, "source": [ "### *c. ✋🏻🛑⛔️ Create a dataframe df_products from the fetched products*" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "id": "df", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "e7173367-f053-46fc-d447-0c8656578350" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Products after cleaning: 128\n" ] } ], "source": [
"# Build the DataFrame and drop rows with duplicate product names\n",
"df_products = pd.DataFrame(products).drop_duplicates(subset=\"product_name\").reset_index(drop=True)\n",
"\n",
"# Cap the dataset at 200 products to keep it manageable\n",
"df_products = df_products.head(200)\n",
"\n",
"print(f\"Products after cleaning: {len(df_products)}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "save_md" }, "source": [ "### *d. 
Save the real-world dataframe as a CSV file*" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "id": "save" }, "outputs": [], "source": [ "df_products.to_csv(\"food_products_data.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "view_md" }, "source": [ "### *e. ✋🏻🛑⛔️ View first few lines*" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "id": "view", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "03d95264-892f-42d4-c6fc-0d472c111995" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " product_name nutrition_grade \\\n", "0 Perly A \n", "1 Prince Goût Chocolat au blé complet E \n", "2 Edelbitter-Schokolade D \n", "3 Tartines craquantes au sarrasin imp A \n", "4 Sésame C \n", "\n", " brand category \n", "0 Jaouda snacks \n", "1 LU snacks \n", "2 Lindt snacks \n", "3 Ekibio,Le pain des Fleurs snacks \n", "4 Gerblé snacks " ], "text/html": [ "\n", "
|   | product_name | nutrition_grade | brand | category |
|---|---|---|---|---|
| 0 | Perly | A | Jaouda | snacks |
| 1 | Prince Goût Chocolat au blé complet | E | LU | snacks |
| 2 | Edelbitter-Schokolade | D | Lindt | snacks |
| 3 | Tartines craquantes au sarrasin imp | A | Ekibio,Le pain des Fleurs | snacks |
| 4 | Sésame | C | Gerblé | snacks |