+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "intro",
+      "metadata": {},
+      "source": [
+        "# Lemmatizing Latin with Flair\n",
+        "\n",
+        "This notebook uses the model `mschonhardt/latin-lemmatizer`.\n",
+        "\n",
+        "**Important:** this is a **Flair** lemmatizer checkpoint (pickled `.pt`), not a 🤗 Transformers `text2text-generation` model. The intended usage is via `flair.models.Lemmatizer` and token labels of type `predicted`.\n",
+        "\n",
+        "The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-lemmatizer) and [Zenodo](https://doi.org/10.5281/zenodo.18632650).\n",
+        "\n",
+        "![DOI: 10.5281/zenodo.18632650](https://zenodo.org/badge/DOI/10.5281/zenodo.18632650.svg)\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "install",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# If needed (run once):\n",
+        "# !pip install -U flair huggingface_hub pandas tqdm\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "setup_md",
+      "metadata": {},
+      "source": [
+        "## 1) Setup\n",
+        "Imports and two small workarounds:\n",
+        "\n",
+        "- **PyTorch ≥ 2.6** changed the default of `weights_only` in `torch.load` from `False` to `True`, which can break loading pickled Flair models unless we force `weights_only=False` (handled in section 2).\n",
+        "- `pack_padded_sequence` expects `lengths` as a CPU tensor, so on GPU the lengths are moved to the CPU before packing (patched in the cell below).\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "id": "setup",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import torch\n",
+        "import torch.nn.utils.rnn as rnn\n",
+        "\n",
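+        "# Workaround for GPU runs: pack_padded_sequence expects `lengths` on the CPU.\n",
+        "# The flag check keeps the patch from being applied twice when the cell is re-run.\n",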
+        "if not getattr(rnn.pack_padded_sequence, \"_cpu_lengths_patched\", False):\n",
+        "    _orig_pack = rnn.pack_padded_sequence\n",
+        "\n",
+        "    def pack_padded_sequence_cpu_lengths(input, lengths, *args, **kwargs):\n",
+        "        if isinstance(input, rnn.PackedSequence):\n",
+        "            return input\n",
+        "        # PyTorch requires CPU lengths if it's a tensor\n",
+        "        if torch.is_tensor(lengths):\n",
+        "            lengths = lengths.detach().cpu()\n",
+        "        return _orig_pack(input, lengths, *args, **kwargs)\n",
+        "\n",
+        "    pack_padded_sequence_cpu_lengths._cpu_lengths_patched = True\n",
+        "    rnn.pack_padded_sequence = pack_padded_sequence_cpu_lengths\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "load_md",
+      "metadata": {},
+      "source": [
+        "## 2) Load the lemmatizer\n",
+        "We download `best-model.pt` and load it with Flair.\n",
+        "\n",
+        "Key point: during `Lemmatizer.load(...)` we temporarily patch `torch.load` to pass `weights_only=False`, so the pickled model object can be reconstructed. With `weights_only=True`, the default since PyTorch 2.6, `torch.load` refuses to unpickle arbitrary Python objects. Only do this for checkpoints you trust, because unpickling can execute arbitrary code.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "id": "9727c8c2",
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Load model from Hugging Face Hub...\n",
+            "Model loaded.\n"
+          ]
+        }
+      ],
+      "source": [
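+        "from unittest import mock\n",
+        "import torch\n",
+        "from huggingface_hub import hf_hub_download\n",
+        "from flair.models import Lemmatizer\n",
+        "from flair.data import Sentence\n",
+        "from flair.tokenization import SpaceTokenizer\n",
+        "\n",
+        "_orig_torch_load = torch.load\n",
+        "def _load_full_pickle(*args, **kwargs):\n",
+        "    kwargs[\"weights_only\"] = False  # full unpickling; trusted checkpoints only\n",
+        "    return _orig_torch_load(*args, **kwargs)\n",
+        "\n",
+        "print(\"Load model from Hugging Face Hub...\")\n",
+        "model_file = hf_hub_download(\"mschonhardt/latin-lemmatizer\", filename=\"best-model.pt\")\n",
+        "# Patch torch.load only for the duration of the load; it is restored afterwards.\n",
+        "with mock.patch.object(torch, \"load\", _load_full_pickle):\n",
+        "    lemmatizer = Lemmatizer.load(model_file)\n",
+        "print(\"Model loaded.\")\n"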
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "single_md",
+      "metadata": {},
+      "source": [
+        "## 3) Lemmatize a single text\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "id": "load_model",
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Et -> et\n",
+            "videtur -> video\n",
+            ", -> ,\n",
+            "quod -> quod\n",
+            "sic -> sic\n",
+            ", -> ,\n",
+            "quia -> quia\n",
+            "res -> res\n",
+            "empta -> empta\n",
+            "de -> de\n",
+            "pecunia -> pecunia\n",
+            "pupilli -> pupillus\n",
+            "efficitur -> efficio\n",
+            "\n",
+            "Note that no model is perfect, as can be seen in the wrong lemmatization of 'empta'.\n"
+          ]
+        }
+      ],
+      "source": [
+        "sent = Sentence(\n",
+        "    \"Et videtur , quod sic , quia res empta de pecunia pupilli efficitur\",\n",
+        "    use_tokenizer=SpaceTokenizer(),\n",
+        ")\n",
+        "\n",
+        "lemmatizer.predict(sent)\n",
+        "\n",
+        "for tok in sent:\n",
+        "    print(tok.text, \"->\", tok.get_label(\"predicted\").value)\n",
+        "\n",
+        "print(\"\\nNote that no model is perfect, as can be seen in the wrong lemmatization of 'empta'.\")\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "batch_md",
+      "metadata": {},
+      "source": [
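+        "## 4) Lemmatize multiple texts (chunking)\n",
+        "\n",
+        "Texts are processed in chunks of `chunk_size`, so only one chunk of `Sentence` objects is held in memory at a time. Within each chunk, Flair predicts in mini-batches of `batch_size` sentences, and `embedding_storage_mode=\"none\"` discards the embeddings after prediction.\n"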
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "id": "batch",
+      "metadata": {},
+      "outputs": [
+        {
+          "data": {
+            "application/vnd.jupyter.widget-view+json": {
+              "model_id": "d16d81bedd304da4aa9ed212f9e83909",
+              "version_major": 2,
+              "version_minor": 0
+            },
+            "text/plain": [
+              "Lemmatizing:   0%|          | 0/1 [00:00<?, ?it/s]"
+            ]
+          },
+          "metadata": {},
+          "output_type": "display_data"
+        },
+        {
+          "data": {
+            "text/html": [
+              "<div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>text</th>\n",
+              "      <th>lemmatized_text</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>Et videtur , quod sic , quia res empta de pecunia pupilli efficitur</td>\n",
+              "      <td>et video , quod sic , quia res empta de pecunia pupillus efficio</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>In nomine sanctae et individuae trinitatis .</td>\n",
+              "      <td>in nomen sanctus et individuus trinitas .</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>Quod infames uocentur qui ex consanguineis nascuntur .</td>\n",
+              "      <td>quod infamis voco qui ex consanguineus nascor .</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>Si quis clericus furtum fecerit , deponatur .</td>\n",
+              "      <td>si quis clericus furtum facio , depono .</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>"
+            ],
+            "text/plain": [
+              "                                                                  text  \\\n",
+              "0  Et videtur , quod sic , quia res empta de pecunia pupilli efficitur   \n",
+              "1                         In nomine sanctae et individuae trinitatis .   \n",
+              "2               Quod infames uocentur qui ex consanguineis nascuntur .   \n",
+              "3                        Si quis clericus furtum fecerit , deponatur .   \n",
+              "\n",
+              "                                                    lemmatized_text  \n",
+              "0  et video , quod sic , quia res empta de pecunia pupillus efficio  \n",
+              "1                         in nomen sanctus et individuus trinitas .  \n",
+              "2                   quod infamis voco qui ex consanguineus nascor .  \n",
+              "3                          si quis clericus furtum facio , depono .  "
+            ]
+          },
+          "execution_count": 5,
+          "metadata": {},
+          "output_type": "execute_result"
+        }
+      ],
+      "source": [
+        "import pandas as pd\n",
+        "from tqdm.auto import tqdm\n",
+        "from flair.data import Sentence\n",
+        "from flair.tokenization import SpaceTokenizer\n",
+        "\n",
+        "def lemmatize_texts(texts, chunk_size=256, batch_size=32):\n",
+        "    out = []\n",
+        "    for i in tqdm(range(0, len(texts), chunk_size), desc=\"Lemmatizing\"):\n",
+        "        chunk = texts[i:i + chunk_size]\n",
+        "\n",
+        "        sentences = [\n",
+        "            Sentence(t, use_tokenizer=SpaceTokenizer())\n",
+        "            for t in chunk\n",
+        "        ]\n",
+        "\n",
+        "        lemmatizer.predict(\n",
+        "            sentences,\n",
+        "            mini_batch_size=batch_size,\n",
+        "            embedding_storage_mode=\"none\",\n",
+        "        )\n",
+        "\n",
+        "        out.extend([\n",
+        "            \" \".join(tok.get_label(\"predicted\").value for tok in s)\n",
+        "            for s in sentences\n",
+        "        ])\n",
+        "\n",
+        "    return out\n",
+        "\n",
+        "texts = [\n",
+        "    \"Et videtur , quod sic , quia res empta de pecunia pupilli efficitur\",\n",
+        "    \"In nomine sanctae et individuae trinitatis .\",\n",
+        "    \"Quod infames uocentur qui ex consanguineis nascuntur .\",\n",
+        "    \"Si quis clericus furtum fecerit , deponatur .\"\n",
+        "]\n",
+        "\n",
+        "lemmatized_texts = lemmatize_texts(texts, chunk_size=256, batch_size=16)\n",
+        "df = pd.DataFrame({\"text\": texts, \"lemmatized_text\": lemmatized_texts})\n",
+        "pd.set_option(\"display.max_colwidth\", 300)\n",
+        "df"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "export_md",
+      "metadata": {},
+      "source": [
+        "## 5) (Optional) Export\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "export",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# df.to_csv(\"latin_lemmatization_demo.csv\", index=False)\n",
+        "# print(\"Saved latin_lemmatization_demo.csv\")\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "venv-jupyter",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python",
+      "version": "3.12.3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}