mschonhardt committed on
Commit 9bc9cb0 · verified · 1 Parent(s): e56677f

Upload latin_lemma.ipynb

Files changed (1)
  1. latin_lemma.ipynb +328 -0
latin_lemma.ipynb ADDED
@@ -0,0 +1,328 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "id": "intro",
+    "metadata": {},
+    "source": [
+     "# Lemmatizing Latin with Flair\n",
+     "\n",
+     "This notebook lemmatizes Latin text with the model `mschonhardt/latin-lemmatizer`.\n",
+     "\n",
+     "**Important:** this is a **Flair** lemmatizer checkpoint (pickled `.pt`), not a 🤗 Transformers `text2text-generation` model. It is meant to be used via `flair.models.Lemmatizer`; predictions are read from token labels of type `predicted`.\n",
+     "\n",
+     "The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-lemmatizer) and [Zenodo](https://doi.org/10.5281/zenodo.18632650).\n",
+     "\n",
+     "![](https://zenodo.org/badge/DOI/10.5281/zenodo.18632650.svg)\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "install",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# If needed (run once):\n",
+     "# !pip install -U flair huggingface_hub pandas tqdm\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "setup_md",
+    "metadata": {},
+    "source": [
+     "## 1) Setup\n",
+     "Imports, plus notes on two known issues:\n",
+     "\n",
+     "- **PyTorch ≥ 2.6** changed the `torch.load` default to `weights_only=True`, which can break loading pickled Flair models unless `weights_only=False` is forced. The code cell below does not apply this workaround; see section 2.\n",
+     "- Some GPU setups need `pack_padded_sequence` to receive `lengths` on the CPU, so the code cell below patches it to move `lengths` there.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 1,
+    "id": "setup",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import torch\n",
+     "import torch.nn.utils.rnn as rnn\n",
+     "\n",
+     "# Patch needed to run on GPU\n",
+     "if not getattr(rnn.pack_padded_sequence, \"_cpu_lengths_patched\", False):\n",
+     "    _orig_pack = rnn.pack_padded_sequence\n",
+     "\n",
+     "    def pack_padded_sequence_cpu_lengths(input, lengths, *args, **kwargs):\n",
+     "        if isinstance(input, rnn.PackedSequence):\n",
+     "            return input\n",
+     "        # PyTorch requires CPU lengths if it's a tensor\n",
+     "        if torch.is_tensor(lengths):\n",
+     "            lengths = lengths.detach().cpu()\n",
+     "        return _orig_pack(input, lengths, *args, **kwargs)\n",
+     "\n",
+     "    pack_padded_sequence_cpu_lengths._cpu_lengths_patched = True\n",
+     "    rnn.pack_padded_sequence = pack_padded_sequence_cpu_lengths\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "load_md",
+    "metadata": {},
+    "source": [
+     "## 2) Load the lemmatizer\n",
+     "We download `best-model.pt` and load it with Flair.\n",
+     "\n",
+     "Key point: the checkpoint is a pickled model object, so it must be loaded with `weights_only=False`. The cell below calls `Lemmatizer.load(...)` directly and does not patch `torch.load` itself. If loading fails, or the model is not reconstructed correctly (a typical symptom is predictions such as `O O O ...`), temporarily patch `torch.load` to pass `weights_only=False` while `Lemmatizer.load(...)` runs.\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 2,
+    "id": "9727c8c2",
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Load model from Hugging Face Hub...\n",
+       "Model loaded.\n"
+      ]
+     }
+    ],
+    "source": [
+     "from huggingface_hub import hf_hub_download\n",
+     "from flair.models import Lemmatizer\n",
+     "from flair.data import Sentence\n",
+     "from flair.tokenization import SpaceTokenizer\n",
+     "\n",
+     "print(\"Load model from Hugging Face Hub...\")\n",
+     "model_file = hf_hub_download(\"mschonhardt/latin-lemmatizer\", filename=\"best-model.pt\")\n",
+     "lemmatizer = Lemmatizer.load(model_file)\n",
+     "print(\"Model loaded.\")\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "single_md",
+    "metadata": {},
+    "source": [
+     "## 3) Lemmatize a single text\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 7,
+    "id": "load_model",
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Et -> et\n",
+       "videtur -> video\n",
+       ", -> ,\n",
+       "quod -> quod\n",
+       "sic -> sic\n",
+       ", -> ,\n",
+       "quia -> quia\n",
+       "res -> res\n",
+       "empta -> empta\n",
+       "de -> de\n",
+       "pecunia -> pecunia\n",
+       "pupilli -> pupillus\n",
+       "efficitur -> efficio\n",
+       "\n",
+       "Note that no model is perfect, as can be seen in the wrong lemmatization of 'empta'.\n"
+      ]
+     }
+    ],
+    "source": [
+     "sent = Sentence(\n",
+     "    \"Et videtur , quod sic , quia res empta de pecunia pupilli efficitur\",\n",
+     "    use_tokenizer=SpaceTokenizer(),\n",
+     ")\n",
+     "\n",
+     "lemmatizer.predict(sent)\n",
+     "\n",
+     "for tok in sent:\n",
+     "    print(tok.text, \"->\", tok.get_label(\"predicted\").value)\n",
+     "\n",
+     "print(\"\\nNote that no model is perfect, as can be seen in the wrong lemmatization of 'empta'.\")\n"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "batch_md",
+    "metadata": {},
+    "source": [
+     "## 4) Lemmatize multiple texts (chunking)\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 5,
+    "id": "batch",
+    "metadata": {},
+    "outputs": [
+     {
+      "data": {
+       "application/vnd.jupyter.widget-view+json": {
+        "model_id": "d16d81bedd304da4aa9ed212f9e83909",
+        "version_major": 2,
+        "version_minor": 0
+       },
+       "text/plain": [
+        "Lemmatizing: 0%| | 0/1 [00:00<?, ?it/s]"
+       ]
+      },
+      "metadata": {},
+      "output_type": "display_data"
+     },
+     {
+      "data": {
+       "text/html": [
+        "<div>\n",
+        "<style scoped>\n",
+        "    .dataframe tbody tr th:only-of-type {\n",
+        "        vertical-align: middle;\n",
+        "    }\n",
+        "\n",
+        "    .dataframe tbody tr th {\n",
+        "        vertical-align: top;\n",
+        "    }\n",
+        "\n",
+        "    .dataframe thead th {\n",
+        "        text-align: right;\n",
+        "    }\n",
+        "</style>\n",
+        "<table border=\"1\" class=\"dataframe\">\n",
+        "  <thead>\n",
+        "    <tr style=\"text-align: right;\">\n",
+        "      <th></th>\n",
+        "      <th>text</th>\n",
+        "      <th>lemmatized_text</th>\n",
+        "    </tr>\n",
+        "  </thead>\n",
+        "  <tbody>\n",
+        "    <tr>\n",
+        "      <th>0</th>\n",
+        "      <td>Et videtur , quod sic , quia res empta de pecunia pupilli efficitur</td>\n",
+        "      <td>et video , quod sic , quia res empta de pecunia pupillus efficio</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>1</th>\n",
+        "      <td>In nomine sanctae et individuae trinitatis .</td>\n",
+        "      <td>in nomen sanctus et individuus trinitas .</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>2</th>\n",
+        "      <td>Quod infames uocentur qui ex consanguineis nascuntur .</td>\n",
+        "      <td>quod infamis voco qui ex consanguineus nascor .</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>3</th>\n",
+        "      <td>Si quis clericus furtum fecerit , deponatur .</td>\n",
+        "      <td>si quis clericus furtum facio , depono .</td>\n",
+        "    </tr>\n",
+        "  </tbody>\n",
+        "</table>\n",
+        "</div>"
+       ],
+       "text/plain": [
+        " text \\\n",
+        "0 Et videtur , quod sic , quia res empta de pecunia pupilli efficitur \n",
+        "1 In nomine sanctae et individuae trinitatis . \n",
+        "2 Quod infames uocentur qui ex consanguineis nascuntur . \n",
+        "3 Si quis clericus furtum fecerit , deponatur . \n",
+        "\n",
+        " lemmatized_text \n",
+        "0 et video , quod sic , quia res empta de pecunia pupillus efficio \n",
+        "1 in nomen sanctus et individuus trinitas . \n",
+        "2 quod infamis voco qui ex consanguineus nascor . \n",
+        "3 si quis clericus furtum facio , depono . "
+       ]
+      },
+      "execution_count": 5,
+      "metadata": {},
+      "output_type": "execute_result"
+     }
+    ],
+    "source": [
+     "import pandas as pd\n",
+     "from tqdm.auto import tqdm\n",
+     "from flair.data import Sentence\n",
+     "from flair.tokenization import SpaceTokenizer\n",
+     "\n",
+     "def lemmatize_texts(texts, chunk_size=256, batch_size=32):\n",
+     "    out = []\n",
+     "    for i in tqdm(range(0, len(texts), chunk_size), desc=\"Lemmatizing\"):\n",
+     "        chunk = texts[i:i + chunk_size]\n",
+     "\n",
+     "        sentences = [\n",
+     "            Sentence(t, use_tokenizer=SpaceTokenizer())\n",
+     "            for t in chunk\n",
+     "        ]\n",
+     "\n",
+     "        lemmatizer.predict(\n",
+     "            sentences,\n",
+     "            mini_batch_size=batch_size,\n",
+     "            embedding_storage_mode=\"none\",\n",
+     "        )\n",
+     "\n",
+     "        out.extend([\n",
+     "            \" \".join(tok.get_label(\"predicted\").value for tok in s)\n",
+     "            for s in sentences\n",
+     "        ])\n",
+     "\n",
+     "    return out\n",
+     "\n",
+     "texts = [\n",
+     "    \"Et videtur , quod sic , quia res empta de pecunia pupilli efficitur\",\n",
+     "    \"In nomine sanctae et individuae trinitatis .\",\n",
+     "    \"Quod infames uocentur qui ex consanguineis nascuntur .\",\n",
+     "    \"Si quis clericus furtum fecerit , deponatur .\"\n",
+     "]\n",
+     "\n",
+     "lemmatized_texts = lemmatize_texts(texts, chunk_size=256, batch_size=16)\n",
+     "df = pd.DataFrame({\"text\": texts, \"lemmatized_text\": lemmatized_texts})\n",
+     "pd.set_option(\"display.max_colwidth\", 300)\n",
+     "df"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "id": "export_md",
+    "metadata": {},
+    "source": [
+     "## 5) (Optional) Export\n"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "export",
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# df.to_csv(\"latin_lemmatization_demo.csv\", index=False)\n",
+     "# print(\"Saved latin_lemmatization_demo.csv\")\n"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "venv-jupyter",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "name": "python",
+    "version": "3.12.3"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
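
The notebook's markdown mentions forcing `weights_only=False` while the checkpoint loads, but none of its code cells defines such a patch. Below is a minimal sketch of a temporary, self-restoring patch. The names `force_kwarg` and `load_lemmatizer` are hypothetical helpers, not part of Flair or PyTorch, and whether a given Flair/PyTorch combination needs the patch at all is an assumption to check in your environment. Note that `weights_only=False` lets `torch.load` unpickle arbitrary objects, so it should only be used with checkpoints you trust.

```python
import contextlib
import functools


@contextlib.contextmanager
def force_kwarg(module, func_name, **forced):
    """Temporarily wrap `module.func_name` so the given keyword arguments are always passed.

    Hypothetical helper for illustration; the original attribute is restored on exit,
    even if the wrapped call raises.
    """
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        kwargs.update(forced)  # override whatever the caller passed
        return original(*args, **kwargs)

    setattr(module, func_name, wrapper)
    try:
        yield
    finally:
        setattr(module, func_name, original)  # always undo the patch


def load_lemmatizer(model_file):
    """Load the Flair checkpoint with torch.load forced to weights_only=False."""
    # Imported lazily so that force_kwarg itself has no third-party dependencies.
    import torch
    from flair.models import Lemmatizer

    with force_kwarg(torch, "load", weights_only=False):
        return Lemmatizer.load(model_file)
```

In the notebook this would replace the direct call in section 2, for example `lemmatizer = load_lemmatizer(model_file)`. Scoping the patch to a `with` block keeps the stricter `torch.load` default in effect for everything else in the session.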