mschonhardt committed on
Commit 14b4d17 · verified · 1 Parent(s): ccadaa3

Delete latin_abbreviation_expansion.ipynb

Files changed (1)
  1. latin_abbreviation_expansion.ipynb +0 -238
latin_abbreviation_expansion.ipynb DELETED
@@ -1,238 +0,0 @@
- {
-  "cells": [
-   {
-    "cell_type": "markdown",
-    "id": "1f175efa",
-    "metadata": {},
-    "source": [
-     "# Latin Interpunctuator\n",
-     "\n",
-     "This notebook demonstrates how to use the mT5 model `mschonhardt/mt5-latin-punctuator-large`.\n",
-     "It applies interpunctuation and text formatting standards to Latin text.\n",
-     "\n",
-     "## Quick check"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": 1,
-    "id": "1cd29ad2",
-    "metadata": {},
-    "outputs": [
-     {
-      "name": "stderr",
-      "output_type": "stream",
-      "text": [
-       "Device set to use cuda:0\n",
-       "Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
-      ]
-     },
-     {
-      "name": "stdout",
-      "output_type": "stream",
-      "text": [
-       "Source: Vt ep̅i conꝓuinciales peregrina iu¬\n",
-       "Expanded: Vt episcopi comprouinciales peregrina iu¬\n"
-      ]
-     }
-    ],
-    "source": [
-     "from transformers import pipeline\n",
-     "\n",
-     "# Load the expander\n",
-     "expander = pipeline(\"text2text-generation\", model=\"mschonhardt/abbreviationes-v2\")\n",
-     "\n",
-     "# Example: \"Vt ep̅i conꝓuinciales peregrina iu¬\" abbreviated\n",
-     "text = \"Vt ep̅i conꝓuinciales peregrina iu¬\"\n",
-     "result = expander(text, max_length=512)\n",
-     "\n",
-     "print(f\"Source: {text}\")\n",
-     "print(f\"Expanded: {result[0]['generated_text']}\")"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": 2,
-    "id": "b87f3e45",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "## Setup Environment"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "044ae4ef",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "# Import necessary libraries\n",
-     "import torch\n",
-     "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
-     "\n",
-     "# Use a GPU (cuda) if available for faster inference\n",
-     "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
-     "\n",
-     "print(f\"Torch version: {torch.__version__}\")\n",
-     "print(f\"Device: {device}\")\n",
-     "\n",
-     "print(\"Environment ready.\")"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "4de2def2",
-    "metadata": {},
-    "source": [
-     "## Load the Model from Hugging Face"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "aa5810a8",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "# Load the model and tokenizer from Hugging Face\n",
-     "model_name = \"mschonhardt/abbreviationes-v2\"\n",
-     "print(f\"Loading model: {model_name} ...\")\n",
-     "tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)\n",
-     "model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)\n",
-     "print(\"Model loaded successfully!\")"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "2dd05d72",
-    "metadata": {},
-    "source": [
-     "### Prediction Logic\n",
-     "The model was trained with the prefix \"punctuate: \". Adjust `num_beams` if you run into hallucinations or repetitions."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "e858df99",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "def punctuate(text: str) -> str:\n",
-     "    # Best practice: add the prefix 'punctuate: ' and lowercase the input, as in the training script\n",
-     "    input_text = \"punctuate: \" + text.lower()\n",
-     "\n",
-     "    inputs = tokenizer(\n",
-     "        input_text,\n",
-     "        return_tensors=\"pt\",\n",
-     "        truncation=True,\n",
-     "        max_length=1024,\n",
-     "    ).to(device)\n",
-     "\n",
-     "    with torch.no_grad():\n",
-     "        output_ids = model.generate(\n",
-     "            **inputs,\n",
-     "            max_length=1024,\n",
-     "            # Adjust num_beams if hallucinations or repetitions occur; 4 is a good starting point\n",
-     "            num_beams=4,\n",
-     "            early_stopping=True,\n",
-     "        )\n",
-     "    return tokenizer.decode(output_ids[0], skip_special_tokens=True)\n"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "52fd09e1",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "text = \"\"\"\n",
-     "Si quis Patrem et Filium et Spiritum Sanctum non confitetur tres personas unius substantiae et virtutis ac potestatis, \n",
-     "sicut catholica et apostolica ecclesia docet, sed unam tantum ac solitariam dicit esse personam, \n",
-     "ita ut ipse sit Pater qui Filius, ipse etiam sit Paraclitus Spiritus, sicut Sabellius et Priscillianus dixerunt, anathema sit.\"\"\""
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "e582b0e4",
-    "metadata": {},
-    "source": [
-     "The model was trained on lower-case input to prevent overfitting on capital letters and to force learning of linguistic patterns."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "6573900a",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "text_without_punctuation = text.replace(\".\",\"\").replace(\",\",\"\").replace(\";\",\"\").replace(\":\",\"\").replace(\"?\",\"\").replace(\"!\",\"\").replace(\"-\",\"\").replace(\"(\",\"\").replace(\")\",\"\").replace(\"[\",\"\").replace(\"]\",\"\").replace(\"{\",\"\").replace(\"}\",\"\").replace(\"\\\"\",\"\")\n",
-     "text_without_punctuation = text_without_punctuation.lower()\n",
-     "import textwrap\n",
-     "print(textwrap.fill(text_without_punctuation, width=80))"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "843757c0",
-    "metadata": {},
-    "source": [
-     "### Run Inference"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "86c7521d",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "# The model predicts punctuation for the input text, as well as appropriate capitalization\n",
-     "# Note: the model reflects the conventions of the material it has seen, which may differ from your expectations.\n",
-     "text_with_punctuation = punctuate(text_without_punctuation)\n"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "id": "02ef7e70",
-    "metadata": {},
-    "source": [
-     "Because the model applies conventions learned from its training data, its decisions may differ from your own conventions and expectations. It is not designed to produce a 'perfect' text, but to give structure to unstructured text, enabling downstream tasks."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "1c908d35",
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "print(textwrap.fill(text_with_punctuation, width=80))"
-    ]
-   }
-  ],
-  "metadata": {
-   "kernelspec": {
-    "display_name": "venv-jupyter",
-    "language": "python",
-    "name": "python3"
-   },
-   "language_info": {
-    "codemirror_mode": {
-     "name": "ipython",
-     "version": 3
-    },
-    "file_extension": ".py",
-    "mimetype": "text/x-python",
-    "name": "python",
-    "nbconvert_exporter": "python",
-    "pygments_lexer": "ipython3",
-    "version": "3.12.3"
-   }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 5
- }
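
For context, the deleted notebook stripped punctuation with a long chain of `str.replace` calls before re-punctuating the text. The same preprocessing step can be written more compactly with `str.translate`. This is an illustrative sketch only, not code from the notebook; the `strip_punctuation` helper name and the `PUNCT` constant are mine:

```python
import textwrap

# The punctuation set the notebook removed via chained .replace(...) calls.
PUNCT = '.,;:?!-()[]{}"'

def strip_punctuation(text: str) -> str:
    # Delete every character listed in PUNCT, then lowercase,
    # mirroring the notebook's preprocessing before punctuate().
    return text.translate(str.maketrans("", "", PUNCT)).lower()

sample = "Si quis Patrem, et Filium; et Spiritum Sanctum!"
print(textwrap.fill(strip_punctuation(sample), width=80))
# si quis patrem et filium et spiritum sanctum
```

`str.maketrans("", "", PUNCT)` builds a single deletion table, so the whole chain of replacements runs in one pass over the string.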