{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "v7EvaRVtQakR" }, "source": [ "## Description:\n", "- This dataset, sourced from PyTorch's official tutorial, contains popular names from 18 languages: Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Irish, Italian, Japanese, Korean, Polish, Portuguese, Russian, Scottish, Spanish, and Vietnamese. The names for each language are stored in a separate text file, which makes them easy to load and label.\n", "\n", "## App link\n", "- https://huggingface.co/spaces/kanneboinakumar/Name_Classification\n", "\n", "## Data link\n", "- https://www.kaggle.com/datasets/shubhampatel231/name-classification" ] }, { "cell_type": "markdown", "metadata": { "id": "sfW4vLsbyJDJ" }, "source": [ "# 1. Packages" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "LKVYNknXyJDM" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import os\n", "import glob\n", "import unicodedata\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.model_selection import train_test_split\n", "from joblib import load, dump\n", "import torch\n", "import torch.nn as nn\n", "from torch.nn.utils.rnn import pad_sequence\n", "from torch.utils.data import TensorDataset, DataLoader" ] }, { "cell_type": "markdown", "metadata": { "id": "DBgB-_BHyJDO" }, "source": [ "# 2. Load text files and create a DataFrame" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GTlEm1u2yUG5", "outputId": "ecb38c05-f7de-43e8-e7fe-089c2e27226e" }, "outputs": [], "source": [ "# from google.colab import drive\n", "# drive.mount('/content/drive')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "XJjsYuw8yJDP", "outputId": "fc4e4937-435d-473e-dfbf-9658247e5e8d" }, "outputs":
[ { "data": { "text/html": [ "
|   | Name | Country |
|---|---|---|
| 0 | Khoury | Arabic |
| 1 | Nahas | Arabic |
| 2 | Daher | Arabic |
| 3 | Gerges | Arabic |
| 4 | Nazari | Arabic |