mschonhardt committed on
Commit ed78d71 · verified · 1 Parent(s): 14b4d17

Upload latin_abbreviation_expansion.ipynb

Files changed (1)
  1. latin_abbreviation_expansion.ipynb +199 -0
latin_abbreviation_expansion.ipynb ADDED
@@ -0,0 +1,199 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "1f175efa",
+ "metadata": {},
+ "source": [
+ "# Latin Abbreviation Expansion\n",
+ "\n",
+ "This notebook demonstrates how to use the ByT5 model `mschonhardt/abbreviationes-v2`.\n",
+ "It expands medieval abbreviations based on a fixed set of special characters.\n",
+ "\n",
+ "## Quick check\n",
+ "You can use `pipeline` to quickly convert input text."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "1cd29ad2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Device set to use cuda:0\n",
+ "Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Source: aut ferrum lapsū de manubrio\n",
+ "Expanded: aut ferrum lapsum de manubrio\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import pipeline\n",
+ "\n",
+ "# Load the expander\n",
+ "expander = pipeline(\"text2text-generation\", model=\"mschonhardt/abbreviationes-v2\")\n",
+ "\n",
+ "# Example: an abbreviated manuscript line\n",
+ "text = \"aut ferrum lapsū de manubrio\"\n",
+ "result = expander(text, max_length=512)\n",
+ "\n",
+ "print(f\"Source: {text}\")\n",
+ "print(f\"Expanded: {result[0]['generated_text']}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b87f3e45",
+ "metadata": {},
+ "source": [
+ "The model can also be used in a more fine-grained way by loading the tokenizer and model directly.\n",
+ "\n",
+ "## Setup Environment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "044ae4ef",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Torch version: 2.10.0+cu128\n",
+ "Device: cuda\n",
+ "Environment ready.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Import necessary libraries\n",
+ "import torch\n",
+ "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
+ "\n",
+ "# Use the GPU (cuda) if available for faster inference\n",
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "\n",
+ "print(f\"Torch version: {torch.__version__}\")\n",
+ "print(f\"Device: {device}\")\n",
+ "\n",
+ "print(\"Environment ready.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4de2def2",
+ "metadata": {},
+ "source": [
+ "## Load the Model from Hugging Face"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "aa5810a8",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading model: mschonhardt/abbreviationes-v2 ...\n",
+ "Model loaded successfully!\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load the model and tokenizer from Hugging Face\n",
+ "model_name = \"mschonhardt/abbreviationes-v2\"\n",
+ "print(f\"Loading model: {model_name} ...\")\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)\n",
+ "model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)\n",
+ "print(\"Model loaded successfully!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2dd05d72",
+ "metadata": {},
+ "source": [
+ "### Prediction Logic\n",
+ "The model was trained on single abbreviated text lines from manuscripts, so quality may degrade on longer passages.\n",
+ "\n",
+ "### Run Inference"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "e858df99",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Input: aut ferrum lapsū de manubrio\n",
+ "Expanded: aut ferrum lapsum de manubrio\n",
+ "Input: ei᷒ et surgens ꝑcusserit eum et\n",
+ "Expanded: eius et surgens percusserit eum et\n",
+ "Input: tur ab ultore sanguinis ꝓximi sui\n",
+ "Expanded: tur ab ultore sanguinis proximi sui\n",
+ "Input: et illū qui armis c̅tra iniquitatē\n",
+ "Expanded: et illum qui armis contra iniquitatem\n"
+ ]
+ }
+ ],
+ "source": [
+ "# The abbreviated Medieval Latin text lines\n",
+ "lines = [\n",
+ "    \"aut ferrum lapsū de manubrio\",\n",
+ "    \"ei᷒ et surgens ꝑcusserit eum et\",\n",
+ "    \"tur ab ultore sanguinis ꝓximi sui\",\n",
+ "    \"et illū qui armis c̅tra iniquitatē\",\n",
+ "]\n",
+ "\n",
+ "for input_text in lines:\n",
+ "\n",
+ "    # 1. Tokenize input\n",
+ "    inputs = tokenizer(input_text, return_tensors=\"pt\").to(device)\n",
+ "\n",
+ "    # 2. Generate output tokens\n",
+ "    output_tokens = model.generate(**inputs, max_length=128)\n",
+ "\n",
+ "    # 3. Decode back to text\n",
+ "    expanded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)\n",
+ "\n",
+ "    print(f\"Input: {input_text}\")\n",
+ "    print(f\"Expanded: {expanded_text}\")\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "venv-jupyter",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
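
The notebook warns that quality may degrade on longer passages, since the model was trained on single manuscript lines. One way to handle a longer passage is to split it into lines and expand each separately. A minimal sketch, assuming a helper named `expand_passage` (not part of the notebook) that takes any line-expansion callable; with the notebook's pipeline you would pass something like `lambda s: expander(s, max_length=512)[0]["generated_text"]`:

```python
def expand_passage(passage: str, expand_fn) -> str:
    """Expand a multi-line passage line by line.

    expand_fn: any callable mapping one abbreviated line to its
    expansion, e.g. a wrapper around the `expander` pipeline or the
    tokenizer/model.generate loop from the notebook.
    """
    # Split into non-empty lines, matching the line-wise training data
    lines = [ln.strip() for ln in passage.splitlines() if ln.strip()]
    # Expand each line independently and rejoin
    return "\n".join(expand_fn(ln) for ln in lines)

# Stand-in expander for illustration (replace with the real pipeline call)
demo = expand_passage("aut ferrum lapsū de manubrio\n\n", lambda s: s.replace("ū", "um"))
print(demo)  # aut ferrum lapsum de manubrio
```

This keeps each model input close to the length distribution the model saw during training, at the cost of losing cross-line context.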