mschonhardt committed on
Commit 14b4d17 · verified · 1 Parent(s): ccadaa3

Delete latin_abbreviation_expansion.ipynb

Files changed (1)
  1. latin_abbreviation_expansion.ipynb +0 -238
latin_abbreviation_expansion.ipynb DELETED
@@ -1,238 +0,0 @@
- {
-  "cells": [
-   {
-    "cell_type": "markdown",
-    "id": "1f175efa",
-    "metadata": {},
-    "source": [
-     "# Latin Interpunctuator\n",
-     "\n",
-     "This notebook demonstrates how to use the mT5 model `mschonhardt/mt5-latin-punctuator-large`.\n",
-     "It applies interpunctuation and text formatting standards to Latin text.\n",
-     "\n",
-     "## Quick check"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": 1,
-    "id": "1cd29ad2",
-    "metadata": {},
-    "outputs": [
-     {
-      "name": "stderr",
-      "output_type": "stream",
-      "text": [
-       "Device set to use cuda:0\n",
-       "Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
-      ]
-     },
-     {
-      "name": "stdout",
-      "output_type": "stream",
-      "text": [
-       "Source: Vt ep̅i conꝓuinciales peregrina iu¬\n",
-       "Expanded: Vt episcopi comprouinciales peregrina iu¬\n"
-      ]
-     }
-    ],
-    "source": [
-     "from transformers import pipeline\n",
-     "\n",
-     "# Load the expander\n",
-     "expander = pipeline(\"text2text-generation\", model=\"mschonhardt/abbreviationes-v2\")\n",
-     "\n",
-     "# Example: \"Vt ep̅i conꝓuinciales peregrina iu¬\" abbreviated\n",
-     "text = \"Vt ep̅i conꝓuinciales peregrina iu¬\"\n",
-     "result = expander(text, max_length=512)\n",
-     "\n",
-     "print(f\"Source: {text}\")\n",
-     "print(f\"Expanded: {result[0]['generated_text']}\")"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": 2,
-    "id": "b87f3e45",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "## Setup Environment"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "044ae4ef",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "# Import necessary libraries\n",
-     "import torch\n",
-     "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
-     "\n",
-     "# Use a GPU (cuda) if available for faster inference\n",
-     "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
-     "\n",
-     "print(f\"Torch version: {torch.__version__}\")\n",
-     "print(f\"Device: {device}\")\n",
-     "\n",
-     "print(\"Environment ready.\")"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "4de2def2",
-    "metadata": {},
-    "source": [
-     "## Load the Model from Hugging Face"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "aa5810a8",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "# Load the model and tokenizer from Hugging Face\n",
-     "model_name = \"mschonhardt/abbreviationes-v2\"\n",
-     "print(f\"Loading model: {model_name} ...\")\n",
-     "tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)\n",
-     "model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)\n",
-     "print(\"Model loaded successfully!\")"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "2dd05d72",
-    "metadata": {},
-    "source": [
-     "### Prediction Logic\n",
-     "The model was trained with the prefix \"punctuate: \". Adjust `num_beams` if you run into hallucinations or repetitions."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "e858df99",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "def punctuate(text: str) -> str:\n",
-     "    # Best practice: add the prefix 'punctuate: ' and lowercase the input, as in the training script\n",
-     "    input_text = \"punctuate: \" + text.lower()\n",
-     "\n",
-     "    inputs = tokenizer(\n",
-     "        input_text,\n",
-     "        return_tensors=\"pt\",\n",
-     "        truncation=True,\n",
-     "        max_length=1024,\n",
-     "    ).to(device)\n",
-     "\n",
-     "    with torch.no_grad():\n",
-     "        output_ids = model.generate(\n",
-     "            **inputs,\n",
-     "            max_length=1024,\n",
-     "            # Adjust num_beams if hallucinations or repetitions occur; 4 is a good starting point\n",
-     "            num_beams=4,\n",
-     "            early_stopping=True,\n",
-     "        )\n",
-     "    return tokenizer.decode(output_ids[0], skip_special_tokens=True)\n"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "52fd09e1",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "text = \"\"\"\n",
-     "Si quis Patrem et Filium et Spiritum Sanctum non confitetur tres personas unius substantiae et virtutis ac potestatis, \n",
-     "sicut catholica et apostolica ecclesia docet, sed unam tantum ac solitariam dicit esse personam, \n",
-     "ita ut ipse sit Pater qui Filius, ipse etiam sit Paraclitus Spiritus, sicut Sabellius et Priscillianus dixerunt, anathema sit.\"\"\""
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "e582b0e4",
-    "metadata": {},
-    "source": [
-     "The model was trained on lower-case input to prevent overfitting on capital letters and to force learning of linguistic patterns."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "6573900a",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "text_without_punctuation = text.replace(\".\",\"\").replace(\",\",\"\").replace(\";\",\"\").replace(\":\",\"\").replace(\"?\",\"\").replace(\"!\",\"\").replace(\"-\",\"\").replace(\"(\",\"\").replace(\")\",\"\").replace(\"[\",\"\").replace(\"]\",\"\").replace(\"{\",\"\").replace(\"}\",\"\").replace(\"\\\"\",\"\")\n",
-     "text_without_punctuation = text_without_punctuation.lower()\n",
-     "import textwrap\n",
-     "print(textwrap.fill(text_without_punctuation, width=80))"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "843757c0",
-    "metadata": {},
-    "source": [
-     "### Run Inference"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "86c7521d",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "# The model predicts punctuation for the input text, as well as appropriate capitalization\n",
-     "# Note: the model reflects the conventions of the material it has seen, which may differ from your expectations.\n",
-     "text_with_punctuation = punctuate(text_without_punctuation)\n"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "02ef7e70",
-    "metadata": {},
-    "source": [
-     "Because the model applies conventions learned from its training data, its decisions may differ from your own conventions and expectations. It is not designed to produce a 'perfect' text, but to give structure to unstructured text, enabling downstream tasks."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "1c908d35",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "print(textwrap.fill(text_with_punctuation, width=80))"
-    ]
-   }
-  ],
-  "metadata": {
-   "kernelspec": {
-    "display_name": "venv-jupyter",
-    "language": "python",
-    "name": "python3"
-   },
-   "language_info": {
-    "codemirror_mode": {
-     "name": "ipython",
-     "version": 3
-    },
-    "file_extension": ".py",
-    "mimetype": "text/x-python",
-    "name": "python",
-    "nbconvert_exporter": "python",
-    "pygments_lexer": "ipython3",
-    "version": "3.12.3"
-   }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 5
- }
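
For context, the deleted notebook stripped punctuation with a long chain of `str.replace` calls before re-punctuating the text. The same preprocessing step can be written more compactly with `str.translate`. This is an illustrative sketch only, not code from the notebook; the `strip_punctuation` helper name and the `PUNCT` constant are mine:

```python
import textwrap

# The punctuation set the notebook removed via chained .replace(...) calls.
PUNCT = '.,;:?!-()[]{}"'

def strip_punctuation(text: str) -> str:
    # Delete every character listed in PUNCT, then lowercase,
    # mirroring the notebook's preprocessing before punctuate().
    return text.translate(str.maketrans("", "", PUNCT)).lower()

sample = "Si quis Patrem, et Filium; et Spiritum Sanctum!"
print(textwrap.fill(strip_punctuation(sample), width=80))
# si quis patrem et filium et spiritum sanctum
```

`str.maketrans("", "", PUNCT)` builds a single deletion table, so the whole chain of replacements runs in one pass over the string.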