{ "cells": [ { "cell_type": "markdown", "id": "781ff07a-10bb-49a5-9daf-2b87b86774e0", "metadata": {}, "source": [ "## Basic description of data\n", "\n", "Dataset of Facebook posts by Thai fashion and cosmetics retail sellers. The sellers' posts are of different types (videos, photos, and links). Engagement metrics consist of the number of comments, shares, and reactions (likes, loves, wows, etc.)." ] }, { "cell_type": "markdown", "id": "8aff33d6-5510-428c-a10b-db8c84cb6fd0", "metadata": {}, "source": [ "## Clustering is unsupervised learning...\n", "this means that we are given no labels for the points, so we do not know in advance what each group is. The goal of clustering is to measure how similar (or dissimilar) the points are to one another based on their features and to find any patterns in the dataset." ] }, { "cell_type": "markdown", "id": "73f9dca4-0aae-4a09-8403-9199bb885996", "metadata": {}, "source": [ "## Do you expect the model to work well? If not, why?\n", "I believe that the clustering model will work well with this dataset because it includes several different variables whose values span a large range." 
] }, { "cell_type": "code", "execution_count": 82, "id": "b29f6fc7-5e98-4bbb-8458-71c6d5947424", "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "import numpy as np\n", "from sklearn import datasets\n", "from sklearn.cluster import KMeans\n", "from sklearn import metrics\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import itertools\n", "import itertools as it\n", "import string\n", "from sklearn.model_selection import train_test_split\n", "import os\n", "import json\n", "from huggingface_hub import HfApi\n", "import skops.io as sio\n", "from skops import card\n", "\n", "sns.set_theme(palette='colorblind')\n", "\n", "# set global random seed so that the notes are the same each time the site builds\n", "np.random.seed(113)" ] }, { "cell_type": "code", "execution_count": 83, "id": "57b88ef8-aef7-4395-bc2c-d140ddb125f1", "metadata": {}, "outputs": [], "source": [ "from ucimlrepo import fetch_ucirepo\n", "import pandas as pd\n", "\n", "# fetch the dataset from the UCI ML repository (id 488)\n", "Thai_datase = fetch_ucirepo(id=488)\n", "\n", "# thai_X is split into train/test below; thai_df keeps all rows\n", "thai_X = pd.DataFrame(Thai_datase.data.features)\n", "\n", "thai_df = pd.DataFrame(Thai_datase.data.features)" ] }, { "cell_type": "code", "execution_count": 84, "id": "56ec5b49-eed7-4031-b3ea-f12ec43db395", "metadata": {}, "outputs": [], "source": [ "# drop the non-numeric columns so that KMeans only sees numeric features\n", "thai_X = thai_X.drop(columns=['status_type', 'status_published'])" ] }, { "cell_type": "code", "execution_count": 85, "id": "ab242b36-818b-4cee-859b-abdf1d06da50", "metadata": {}, "outputs": [], "source": [ "# hold out 20% of the rows as a test set\n", "thai_X_train, thai_X_test = train_test_split(thai_X, train_size=0.8)" ] }, { "cell_type": "code", "execution_count": 86, "id": "a9f09724-a0b0-48de-9c73-1f84e8eed0f0", "metadata": {}, "outputs": [], "source": [ "thai_df = thai_df.drop(columns=['status_type', 'status_published'])" ] }, { "cell_type": "code", "execution_count": null, "id": "2f2525ec-a98b-428e-9c53-68520bdb6c0e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 87, 
"id": "7fc50f0b-6870-4917-9ecc-0cb9a943e22a", "metadata": {}, "outputs": [], "source": [ "km = KMeans(n_clusters=5)" ] }, { "cell_type": "code", "execution_count": 88, "id": "433f8230-b5f9-4c76-8bc7-02f68e514e77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
KMeans(n_clusters=5)
\n", "<table>\n", "<tr><th>Parameter</th><th>Value</th></tr>\n",
"<tr><td>n_clusters</td><td>5</td></tr>\n",
"<tr><td>init</td><td>'k-means++'</td></tr>\n",
"<tr><td>n_init</td><td>'auto'</td></tr>\n",
"<tr><td>max_iter</td><td>300</td></tr>\n",
"<tr><td>tol</td><td>0.0001</td></tr>\n",
"<tr><td>verbose</td><td>0</td></tr>\n",
"<tr><td>random_state</td><td>None</td></tr>\n",
"<tr><td>copy_x</td><td>True</td></tr>\n",
"<tr><td>algorithm</td><td>'lloyd'</td></tr>\n", "</table>\n", "