{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kFEwutXUvYCs", "outputId": "e883733e-f919-401a-e6d3-4edf7f0d012e" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mounted at /content/drive\n" ] } ], "source": [ "# Mount Google Drive, since the data is rather heavy\n", "from google.colab import drive\n", "drive.mount('/content/drive', force_remount=True)" ] }, { "cell_type": "markdown", "source": [ "# Subsetting and resizing our images" ], "metadata": { "id": "uBoCyPh5B5P6" } }, { "cell_type": "markdown", "metadata": { "id": "OUxBi_iSvYCu" }, "source": [ "Neither our computer nor the free version of Google Colab can handle the full dataset. To make it workable, we need to reduce both the number of images and their size." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wtG9723wvYCv" }, "outputs": [], "source": [ "!unzip /content/drive/MyDrive/Projet_Artefact_Memes/data.zip -d data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tZtMuK37vYCw" }, "outputs": [], "source": [ "# Importing packages\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import cv2\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "source": [ "We are going to drastically reduce the number of images: out of 9000 usable images, we keep only 5000. Loading the content of *train.jsonl* alone is therefore more than enough to select our images.\n", "\n", "However, we also load our *dev.jsonl* file and concatenate the two for later use."
], "metadata": { "id": "rHop57d3C7fJ" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "I3QOZrAWvYCw", "outputId": "85764aed-b306-4ed0-b8a9-5bda9929e1ba" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(8500, 4)" ] }, "metadata": {}, "execution_count": 3 } ], "source": [ "# We read our training data\n", "filepath = './drive/MyDrive/Projet_Artefact_Memes'\n", "\n", "with open(f'{filepath}/train.jsonl') as f:\n", "    df = pd.read_json(f, lines=True)\n", "df.shape" ] }, { "cell_type": "code", "source": [ "# We read our dev data\n", "with open(f'{filepath}/dev.jsonl') as f:\n", "    df_dev = pd.read_json(f, lines=True)\n", "df_dev.shape" ], "metadata": { "id": "_-_5bzGGUbJe" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# We concatenate our datasets\n", "df_all = pd.concat([df, df_dev])\n", "\n", "# We save our dataframe as a csv file for later use\n", "df_all.to_csv(f'{filepath}/data_all.csv', index=False)" ], "metadata": { "id": "UDBCk8j1Uqtx" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We want a balanced dataset, so we keep a random sample of 2500 hateful memes and 2500 non-hateful memes from the training set."
], "metadata": { "id": "v5iZxRfODLrs" } }, { "cell_type": "code", "source": [ "df_subset_hate = df[df['label']==1]\n", "df_subset_hate = df_subset_hate.sample(2500, random_state=24)\n", "\n", "df_subset_no_hate = df[df['label']==0]\n", "df_subset_no_hate = df_subset_no_hate.sample(2500, random_state=24)" ], "metadata": { "id": "cWH0Zw__dG4h" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# We concatenate the results to have a single dataframe\n", "df = pd.concat([df_subset_hate, df_subset_no_hate])\n", "df.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LW7y8Ky3dG7D", "outputId": "3d60e251-e08d-415f-a59c-746de04350dd" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(5000, 4)" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tPAPf0O5vYCw", "outputId": "7c308aed-c06a-4a16-e42d-034edea29752" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1    2500\n", "0    2500\n", "Name: label, dtype: int64" ] }, "metadata": {}, "execution_count": 7 } ], "source": [ "# We check that our classes are balanced\n", "df['label'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p_irx94gvYCx" }, "outputs": [], "source": [ "# We split the path to keep only the image file name\n", "df['img'] = df['img'].apply(lambda x: x.split('/')[1])\n", "\n", "# We save our dataframe as a csv file for later use\n", "df.to_csv(f'{filepath}/data_5K_balanced.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5c6rwhv1vYCx" }, "outputs": [], "source": [ "# We keep only the images in our subset and sort them by file name\n", "# (os.listdir returns entries in arbitrary order, so sorting makes the order reproducible)\n", "# Note: the rows of data_5K_balanced.csv are not sorted by image name,\n", "# so sort that dataframe by 'img' before pairing its labels with these images\n", "images_ordered = []\n", "for item in os.listdir('./data/data/img'):\n", "    if item in df['img'].values:\n", "        images_ordered.append(item)\n", "\n", "images_ordered = sorted(images_ordered)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "20yYghIcvYCx", "outputId": "b8f0eea4-10f0-4240-a427-1b39b391b4ed" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "5000\n" ] } ], "source": [ "# We check that we have indeed 5000 images\n", "print(len(images_ordered))" ] }, { "cell_type": "markdown", "source": [ "Now we can resize our images to a reduced shape (120 x 120)." ], "metadata": { "id": "vKJjg-BOE-ur" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ndez6r98vYCy", "outputId": "47eaf76d-9886-47bc-fe95-f6c8f19d57b5" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 5000/5000 [01:11<00:00, 69.94it/s]\n" ] } ], "source": [ "images = []\n", "for i in tqdm(range(5000)):\n", "    if images_ordered[i].endswith('.png'):\n", "\n", "        # Read the image\n", "        img = cv2.imread(f'./data/data/img/{images_ordered[i]}')\n", "\n", "        # Convert from BGR (the order OpenCV loads images in) to RGB\n", "        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)\n", "\n", "        # Resize the image\n", "        resized_img = cv2.resize(img, (120, 120), interpolation = cv2.INTER_AREA)\n", "\n", "        images.append(resized_img)" ] }, { "cell_type": "markdown", "source": [ "We then convert our list of images to a NumPy array." ], "metadata": { "id": "76bsZNbrHRFR" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7vl-XZKWvYCy", "outputId": "e76ae26c-f0bf-4691-952c-18ca52731924" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(5000, 120, 120, 3)" ] }, "metadata": {}, "execution_count": 13 } ], "source": [ "X = np.array(images)\n", "X.shape" ] }, { "cell_type": "markdown", "source": [ "Then, we scale the pixel values to the [0, 1] range."
], "metadata": { "id": "nGvpcCEfHVnh" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Bb3CVTV-vYC0" }, "outputs": [], "source": [ "X = X / 255" ] }, { "cell_type": "markdown", "source": [ "Finally, we save our array of scaled images in a .npz file." ], "metadata": { "id": "R-cXQclYHaiA" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DHithg4566m-" }, "outputs": [], "source": [ "np.savez_compressed('./drive/MyDrive/Projet_Artefact_Memes/data_scaled_120_balanced.npz', a=X)" ] }, { "cell_type": "code", "source": [], "metadata": { "id": "Q2yaXobsRLrj" }, "execution_count": null, "outputs": [] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "artefact", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.4" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 0 }