AntonioCGF committed on
Commit dd4ba92 · verified · 1 Parent(s): 9676fce

Upload 5 files

.gitattributes CHANGED
@@ -1,35 +1,4 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
1
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
2
+ *.pt filter=lfs diff=lfs merge=lfs -text
3
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
4
+ *.bin filter=lfs diff=lfs merge=lfs -text
 
Proyecto_Hugging_Face.ipynb ADDED
@@ -0,0 +1,1133 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "bMYkkVla0zjn"
7
+ },
8
+ "source": [
9
+ "# Project: Fine-Tuning and Deploying a Transformer Model\n",
10
+ "\n",
11
+ "**General Instructions:**\n",
12
+ "In this project you must select a business or research problem that involves natural language processing (NLP). Examples include: classifying e-commerce reviews, spam detection, sentiment analysis, or summarizing financial news.\n",
13
+ "\n",
14
+ "**Expected deliverables:**\n",
15
+ "1. **Dataset:** Select and load a dataset (your own or from Hugging Face) different from those covered in class.\n",
16
+ " - Keep in mind the complexity of the dataset and of the tokenization.\n",
17
+ " - I also recommend using a subset to lighten the subsequent training. The goal is not to maximize results, only to demonstrate what you have learned.\n",
18
+ "2. **Training:** The fine-tuning process for a model:\n",
19
+ " - Choose a model.\n",
20
+ " - Fine-tune a Transformer model on the data.\n",
21
+ " - Report evaluation metrics on the test set.\n",
22
+ "3. **Deployment (Model and Space):** The final model must be uploaded to the Hugging Face Hub, and a working \"Space\" (a Gradio demo) must be created where the model can be tested by entering text live*.\n",
23
+ "4. **Model Card:** The model repository on Hugging Face must contain a `README.md` explaining what the model does, its limitations, and the metrics obtained.\n",
24
+ "\n",
25
+ "\\* If you have problems with fine-tuning, the deployed model may be an existing model.\n",
26
+ "\n",
27
+ "> **Note on organization:**\n",
28
+ ">\n",
29
+ ">This notebook is designed to be used as a template. **In principle, the entire project life cycle (loading, training, evaluation, and pushing to the Hub) can be carried out within this same notebook.** However, feel free to split it into several separate notebooks (e.g., one for training and another for deployment) if you find that more organized."
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {},
35
+ "source": [
36
+ "The project code, along with a demo, can be found at https://huggingface.co/spaces/antcaesar/resuemenes_hugginface_TECP"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "code",
41
+ "execution_count": null,
42
+ "metadata": {
43
+ "id": "SWa-5d910tPC"
44
+ },
45
+ "outputs": [],
46
+ "source": [
47
+ "import math\n",
48
+ "import numpy as np\n",
49
+ "import pandas as pd\n",
50
+ "import torch\n",
51
+ "from datasets import Dataset\n",
52
+ "from torch.utils.data import DataLoader\n",
53
+ "from sklearn.model_selection import train_test_split"
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": null,
59
+ "metadata": {},
60
+ "outputs": [
61
+ {
62
+ "data": {
63
+ "text/html": [
64
+ "<div>\n",
65
+ "<style scoped>\n",
66
+ " .dataframe tbody tr th:only-of-type {\n",
67
+ " vertical-align: middle;\n",
68
+ " }\n",
69
+ "\n",
70
+ " .dataframe tbody tr th {\n",
71
+ " vertical-align: top;\n",
72
+ " }\n",
73
+ "\n",
74
+ " .dataframe thead th {\n",
75
+ " text-align: right;\n",
76
+ " }\n",
77
+ "</style>\n",
78
+ "<table border=\"1\" class=\"dataframe\">\n",
79
+ " <thead>\n",
80
+ " <tr style=\"text-align: right;\">\n",
81
+ " <th></th>\n",
82
+ " <th>prompt</th>\n",
83
+ " <th>solution0</th>\n",
84
+ " <th>solution1</th>\n",
85
+ " <th>label</th>\n",
86
+ " <th>language</th>\n",
87
+ " <th>eng_translated0</th>\n",
88
+ " <th>eng_translated1</th>\n",
89
+ " <th>approx_cultural_score</th>\n",
90
+ " <th>llm_used</th>\n",
91
+ " <th>example_id</th>\n",
92
+ " <th>supplement</th>\n",
93
+ " </tr>\n",
94
+ " </thead>\n",
95
+ " <tbody>\n",
96
+ " <tr>\n",
97
+ " <th>0</th>\n",
98
+ " <td>Para ver la iglesia del pantano de Sau complet...</td>\n",
99
+ " <td>tienes que esperar un período sin niebla.</td>\n",
100
+ " <td>tienes que esperar un período de sequía.</td>\n",
101
+ " <td>1</td>\n",
102
+ " <td>spa_latn_spai</td>\n",
103
+ " <td>To see the church at the Sau swamp in its enti...</td>\n",
104
+ " <td>To see the church at the Sau swamp in its enti...</td>\n",
105
+ " <td>1</td>\n",
106
+ " <td>0</td>\n",
107
+ " <td>group0042_ex000035_spa_latn_spai_0_v1</td>\n",
108
+ " <td>{\"topic\": \"place\", \"cultural_type\": \"cultural ...</td>\n",
109
+ " </tr>\n",
110
+ " <tr>\n",
111
+ " <th>1</th>\n",
112
+ " <td>En la coca de pimiento y tomate</td>\n",
113
+ " <td>se le añaden piñones y atún.</td>\n",
114
+ " <td>se le añaden piñones y butifarra.</td>\n",
115
+ " <td>0</td>\n",
116
+ " <td>spa_latn_spai</td>\n",
117
+ " <td>In the pepper and tomato coca pastry, pine nut...</td>\n",
118
+ " <td>In the pepper and tomato coca pastry, pine nut...</td>\n",
119
+ " <td>1</td>\n",
120
+ " <td>0</td>\n",
121
+ " <td>group0042_ex000070_spa_latn_spai_0_v1</td>\n",
122
+ " <td>{\"topic\": \"food\", \"cultural_type\": \"cultural C...</td>\n",
123
+ " </tr>\n",
124
+ " <tr>\n",
125
+ " <th>2</th>\n",
126
+ " <td>¿Cómo se sirven los calçots?</td>\n",
127
+ " <td>En un restaurante te pondrán una teja con unos...</td>\n",
128
+ " <td>En un restaurante te pondrán una teja con unos...</td>\n",
129
+ " <td>1</td>\n",
130
+ " <td>spa_latn_spai</td>\n",
131
+ " <td>How are calçots served? In a restaurant, you w...</td>\n",
132
+ " <td>How are calçots served? In a restaurant, you w...</td>\n",
133
+ " <td>1</td>\n",
134
+ " <td>0</td>\n",
135
+ " <td>group0042_ex000021_spa_latn_spai_0_v1</td>\n",
136
+ " <td>{\"topic\": \"food\", \"cultural_type\": \"cultural C...</td>\n",
137
+ " </tr>\n",
138
+ " <tr>\n",
139
+ " <th>3</th>\n",
140
+ " <td>Estás haciendo un viaje desde Madrid a tu pueb...</td>\n",
141
+ " <td>Utilizas el dibujo profundo, ya que evacua mej...</td>\n",
142
+ " <td>Utilizas el dibujo liso, ya que evacua mejor e...</td>\n",
143
+ " <td>0</td>\n",
144
+ " <td>spa_latn_spai</td>\n",
145
+ " <td>You are taking a trip from Madrid to your town...</td>\n",
146
+ " <td>You are taking a trip from Madrid to your town...</td>\n",
147
+ " <td>1</td>\n",
148
+ " <td>0</td>\n",
149
+ " <td>group0126_ex000024_spa_latn_spai_1_v1</td>\n",
150
+ " <td>{\"uncorrected_eng_translated0\": \"You are takin...</td>\n",
151
+ " </tr>\n",
152
+ " <tr>\n",
153
+ " <th>4</th>\n",
154
+ " <td>Has abierto un chorizo curado y te sobra la mi...</td>\n",
155
+ " <td>Envuélvelo en papel y guárdalo en la nevera en...</td>\n",
156
+ " <td>Envuélvelo en film y guárdalo en la nevera en ...</td>\n",
157
+ " <td>1</td>\n",
158
+ " <td>spa_latn_spai</td>\n",
159
+ " <td>You have opened a cured chorizo and have half ...</td>\n",
160
+ " <td>You have opened a cured chorizo and have half ...</td>\n",
161
+ " <td>1</td>\n",
162
+ " <td>0</td>\n",
163
+ " <td>group0126_ex000010_spa_latn_spai_1_v1</td>\n",
164
+ " <td>{\"uncorrected_eng_translated0\": \"You have open...</td>\n",
165
+ " </tr>\n",
166
+ " <tr>\n",
167
+ " <th>...</th>\n",
168
+ " <td>...</td>\n",
169
+ " <td>...</td>\n",
170
+ " <td>...</td>\n",
171
+ " <td>...</td>\n",
172
+ " <td>...</td>\n",
173
+ " <td>...</td>\n",
174
+ " <td>...</td>\n",
175
+ " <td>...</td>\n",
176
+ " <td>...</td>\n",
177
+ " <td>...</td>\n",
178
+ " <td>...</td>\n",
179
+ " </tr>\n",
180
+ " <tr>\n",
181
+ " <th>95</th>\n",
182
+ " <td>Voy a a cortar jamón serrano para un aperitivo...</td>\n",
183
+ " <td>Usaré cuchillo de sierra corto, con cortes cor...</td>\n",
184
+ " <td>Usaré un cuchillo jamonero bien afilado, con c...</td>\n",
185
+ " <td>1</td>\n",
186
+ " <td>spa_latn_spai</td>\n",
187
+ " <td>I am going to slice serrano ham for an appetiz...</td>\n",
188
+ " <td>I am going to slice serrano ham for an appetiz...</td>\n",
189
+ " <td>1</td>\n",
190
+ " <td>0</td>\n",
191
+ " <td>group0126_ex000039_spa_latn_spai_1_v1</td>\n",
192
+ " <td>{\"uncorrected_eng_translated0\": \"I am going to...</td>\n",
193
+ " </tr>\n",
194
+ " <tr>\n",
195
+ " <th>96</th>\n",
196
+ " <td>¿Qué les pasa a las figuras de cartón y madera...</td>\n",
197
+ " <td>Se endurecen con el fuego.</td>\n",
198
+ " <td>Se queman con el fuego.</td>\n",
199
+ " <td>1</td>\n",
200
+ " <td>spa_latn_spai</td>\n",
201
+ " <td>What happens to the cardboard and wood figures...</td>\n",
202
+ " <td>What happens to the cardboard and wood figures...</td>\n",
203
+ " <td>1</td>\n",
204
+ " <td>0</td>\n",
205
+ " <td>group0134_ex000019_spa_latn_spai_2_v1</td>\n",
206
+ " <td>{\"uncorrected_eng_translated0\": \"What happens ...</td>\n",
207
+ " </tr>\n",
208
+ " <tr>\n",
209
+ " <th>97</th>\n",
210
+ " <td>Para hacer una figura decorativa, mezclamos el...</td>\n",
211
+ " <td>Moldeamos la figura y esperamos unas horas par...</td>\n",
212
+ " <td>Moldeamos la figura y esperamos unas horas par...</td>\n",
213
+ " <td>0</td>\n",
214
+ " <td>spa_latn_spai</td>\n",
215
+ " <td>To make a decorative figure, we mix gypsum pla...</td>\n",
216
+ " <td>To make a decorative figure, we mix gypsum pla...</td>\n",
217
+ " <td>1</td>\n",
218
+ " <td>0</td>\n",
219
+ " <td>group0134_ex000063_spa_latn_spai_2_v1</td>\n",
220
+ " <td>{\"uncorrected_eng_translated0\": \"To make a dec...</td>\n",
221
+ " </tr>\n",
222
+ " <tr>\n",
223
+ " <th>98</th>\n",
224
+ " <td>Cómo hacer ratafía en casa.</td>\n",
225
+ " <td>La ratafía es un licor de hierbas con base de ...</td>\n",
226
+ " <td>La ratafía es un licor de hierbas con base de ...</td>\n",
227
+ " <td>0</td>\n",
228
+ " <td>spa_latn_spai</td>\n",
229
+ " <td>How to make ratafia at home. Ratafia is a herb...</td>\n",
230
+ " <td>How to make ratafia at home. Ratafia is a herb...</td>\n",
231
+ " <td>1</td>\n",
232
+ " <td>0</td>\n",
233
+ " <td>group0042_ex000037_spa_latn_spai_0_v1</td>\n",
234
+ " <td>{\"topic\": \"food\", \"cultural_type\": \"cultural C...</td>\n",
235
+ " </tr>\n",
236
+ " <tr>\n",
237
+ " <th>99</th>\n",
238
+ " <td>Haces gazpacho andaluz en verano para la comid...</td>\n",
239
+ " <td>Deja el gazpacho en nevera antes de servir.</td>\n",
240
+ " <td>Deja el gazpacho fuera de nevera antes de servir.</td>\n",
241
+ " <td>0</td>\n",
242
+ " <td>spa_latn_spai</td>\n",
243
+ " <td>You are making Andalusian gazpacho in the summ...</td>\n",
244
+ " <td>You are making Andalusian gazpacho in the summ...</td>\n",
245
+ " <td>1</td>\n",
246
+ " <td>0</td>\n",
247
+ " <td>group0126_ex000037_spa_latn_spai_1_v1</td>\n",
248
+ " <td>{\"uncorrected_eng_translated0\": \"You make gazp...</td>\n",
249
+ " </tr>\n",
250
+ " </tbody>\n",
251
+ "</table>\n",
252
+ "<p>100 rows × 11 columns</p>\n",
253
+ "</div>"
254
+ ],
255
+ "text/plain": [
256
+ " prompt \\\n",
257
+ "0 Para ver la iglesia del pantano de Sau complet... \n",
258
+ "1 En la coca de pimiento y tomate \n",
259
+ "2 ¿Cómo se sirven los calçots? \n",
260
+ "3 Estás haciendo un viaje desde Madrid a tu pueb... \n",
261
+ "4 Has abierto un chorizo curado y te sobra la mi... \n",
262
+ ".. ... \n",
263
+ "95 Voy a a cortar jamón serrano para un aperitivo... \n",
264
+ "96 ¿Qué les pasa a las figuras de cartón y madera... \n",
265
+ "97 Para hacer una figura decorativa, mezclamos el... \n",
266
+ "98 Cómo hacer ratafía en casa. \n",
267
+ "99 Haces gazpacho andaluz en verano para la comid... \n",
268
+ "\n",
269
+ " solution0 \\\n",
270
+ "0 tienes que esperar un período sin niebla. \n",
271
+ "1 se le añaden piñones y atún. \n",
272
+ "2 En un restaurante te pondrán una teja con unos... \n",
273
+ "3 Utilizas el dibujo profundo, ya que evacua mej... \n",
274
+ "4 Envuélvelo en papel y guárdalo en la nevera en... \n",
275
+ ".. ... \n",
276
+ "95 Usaré cuchillo de sierra corto, con cortes cor... \n",
277
+ "96 Se endurecen con el fuego. \n",
278
+ "97 Moldeamos la figura y esperamos unas horas par... \n",
279
+ "98 La ratafía es un licor de hierbas con base de ... \n",
280
+ "99 Deja el gazpacho en nevera antes de servir. \n",
281
+ "\n",
282
+ " solution1 label language \\\n",
283
+ "0 tienes que esperar un período de sequía. 1 spa_latn_spai \n",
284
+ "1 se le añaden piñones y butifarra. 0 spa_latn_spai \n",
285
+ "2 En un restaurante te pondrán una teja con unos... 1 spa_latn_spai \n",
286
+ "3 Utilizas el dibujo liso, ya que evacua mejor e... 0 spa_latn_spai \n",
287
+ "4 Envuélvelo en film y guárdalo en la nevera en ... 1 spa_latn_spai \n",
288
+ ".. ... ... ... \n",
289
+ "95 Usaré un cuchillo jamonero bien afilado, con c... 1 spa_latn_spai \n",
290
+ "96 Se queman con el fuego. 1 spa_latn_spai \n",
291
+ "97 Moldeamos la figura y esperamos unas horas par... 0 spa_latn_spai \n",
292
+ "98 La ratafía es un licor de hierbas con base de ... 0 spa_latn_spai \n",
293
+ "99 Deja el gazpacho fuera de nevera antes de servir. 0 spa_latn_spai \n",
294
+ "\n",
295
+ " eng_translated0 \\\n",
296
+ "0 To see the church at the Sau swamp in its enti... \n",
297
+ "1 In the pepper and tomato coca pastry, pine nut... \n",
298
+ "2 How are calçots served? In a restaurant, you w... \n",
299
+ "3 You are taking a trip from Madrid to your town... \n",
300
+ "4 You have opened a cured chorizo and have half ... \n",
301
+ ".. ... \n",
302
+ "95 I am going to slice serrano ham for an appetiz... \n",
303
+ "96 What happens to the cardboard and wood figures... \n",
304
+ "97 To make a decorative figure, we mix gypsum pla... \n",
305
+ "98 How to make ratafia at home. Ratafia is a herb... \n",
306
+ "99 You are making Andalusian gazpacho in the summ... \n",
307
+ "\n",
308
+ " eng_translated1 approx_cultural_score \\\n",
309
+ "0 To see the church at the Sau swamp in its enti... 1 \n",
310
+ "1 In the pepper and tomato coca pastry, pine nut... 1 \n",
311
+ "2 How are calçots served? In a restaurant, you w... 1 \n",
312
+ "3 You are taking a trip from Madrid to your town... 1 \n",
313
+ "4 You have opened a cured chorizo and have half ... 1 \n",
314
+ ".. ... ... \n",
315
+ "95 I am going to slice serrano ham for an appetiz... 1 \n",
316
+ "96 What happens to the cardboard and wood figures... 1 \n",
317
+ "97 To make a decorative figure, we mix gypsum pla... 1 \n",
318
+ "98 How to make ratafia at home. Ratafia is a herb... 1 \n",
319
+ "99 You are making Andalusian gazpacho in the summ... 1 \n",
320
+ "\n",
321
+ " llm_used example_id \\\n",
322
+ "0 0 group0042_ex000035_spa_latn_spai_0_v1 \n",
323
+ "1 0 group0042_ex000070_spa_latn_spai_0_v1 \n",
324
+ "2 0 group0042_ex000021_spa_latn_spai_0_v1 \n",
325
+ "3 0 group0126_ex000024_spa_latn_spai_1_v1 \n",
326
+ "4 0 group0126_ex000010_spa_latn_spai_1_v1 \n",
327
+ ".. ... ... \n",
328
+ "95 0 group0126_ex000039_spa_latn_spai_1_v1 \n",
329
+ "96 0 group0134_ex000019_spa_latn_spai_2_v1 \n",
330
+ "97 0 group0134_ex000063_spa_latn_spai_2_v1 \n",
331
+ "98 0 group0042_ex000037_spa_latn_spai_0_v1 \n",
332
+ "99 0 group0126_ex000037_spa_latn_spai_1_v1 \n",
333
+ "\n",
334
+ " supplement \n",
335
+ "0 {\"topic\": \"place\", \"cultural_type\": \"cultural ... \n",
336
+ "1 {\"topic\": \"food\", \"cultural_type\": \"cultural C... \n",
337
+ "2 {\"topic\": \"food\", \"cultural_type\": \"cultural C... \n",
338
+ "3 {\"uncorrected_eng_translated0\": \"You are takin... \n",
339
+ "4 {\"uncorrected_eng_translated0\": \"You have open... \n",
340
+ ".. ... \n",
341
+ "95 {\"uncorrected_eng_translated0\": \"I am going to... \n",
342
+ "96 {\"uncorrected_eng_translated0\": \"What happens ... \n",
343
+ "97 {\"uncorrected_eng_translated0\": \"To make a dec... \n",
344
+ "98 {\"topic\": \"food\", \"cultural_type\": \"cultural C... \n",
345
+ "99 {\"uncorrected_eng_translated0\": \"You make gazp... \n",
346
+ "\n",
347
+ "[100 rows x 11 columns]"
348
+ ]
349
+ },
350
+ "execution_count": 8,
351
+ "metadata": {},
352
+ "output_type": "execute_result"
353
+ }
354
+ ],
355
+ "source": [
356
+ "df"
357
+ ]
358
+ },
359
+ {
360
+ "cell_type": "code",
361
+ "execution_count": 2,
362
+ "metadata": {},
363
+ "outputs": [],
364
+ "source": [
365
+ "splits = {'train': 'data/train-00000-of-00001.parquet', 'validation': 'data/validation-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}\n",
366
+ "df = pd.read_parquet(\"hf://datasets/somosnlp/NoticIA-it/\" + splits[\"train\"])\n",
367
+ "df = df[[\"texto\", \"respuesta\"]].dropna().reset_index(drop=True)"
368
+ ]
369
+ },
370
+ {
371
+ "cell_type": "code",
372
+ "execution_count": null,
373
+ "metadata": {},
374
+ "outputs": [
375
+ {
376
+ "data": {
377
+ "text/html": [
378
+ "<div>\n",
379
+ "<style scoped>\n",
380
+ " .dataframe tbody tr th:only-of-type {\n",
381
+ " vertical-align: middle;\n",
382
+ " }\n",
383
+ "\n",
384
+ " .dataframe tbody tr th {\n",
385
+ " vertical-align: top;\n",
386
+ " }\n",
387
+ "\n",
388
+ " .dataframe thead th {\n",
389
+ " text-align: right;\n",
390
+ " }\n",
391
+ "</style>\n",
392
+ "<table border=\"1\" class=\"dataframe\">\n",
393
+ " <thead>\n",
394
+ " <tr style=\"text-align: right;\">\n",
395
+ " <th></th>\n",
396
+ " <th>id</th>\n",
397
+ " <th>titular</th>\n",
398
+ " <th>respuesta</th>\n",
399
+ " <th>pregunta</th>\n",
400
+ " <th>texto</th>\n",
401
+ " <th>idioma</th>\n",
402
+ " <th>periodo</th>\n",
403
+ " <th>tarea</th>\n",
404
+ " <th>registro</th>\n",
405
+ " <th>dominio</th>\n",
406
+ " <th>país_origen</th>\n",
407
+ " </tr>\n",
408
+ " </thead>\n",
409
+ " <tbody>\n",
410
+ " <tr>\n",
411
+ " <th>0</th>\n",
412
+ " <td>0</td>\n",
413
+ " <td>JORGE REY: EL TIEMPO | La impactante predicció...</td>\n",
414
+ " <td>El inicio de un periodo frío intenso.</td>\n",
415
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
416
+ " <td>27·11·23 | 08:34 | Actualizado a las 14:47\\nJO...</td>\n",
417
+ " <td>es_es</td>\n",
418
+ " <td>actual</td>\n",
419
+ " <td>resumen</td>\n",
420
+ " <td>medio</td>\n",
421
+ " <td>prensa_ciencia_y_tecnologia</td>\n",
422
+ " <td>españa</td>\n",
423
+ " </tr>\n",
424
+ " <tr>\n",
425
+ " <th>1</th>\n",
426
+ " <td>1</td>\n",
427
+ " <td>El cambio en las matrículas que se espera para...</td>\n",
428
+ " <td>Se dará el salto a la letra M.</td>\n",
429
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
430
+ " <td>Si eres de los que sigues el avance de las mat...</td>\n",
431
+ " <td>es_es</td>\n",
432
+ " <td>actual</td>\n",
433
+ " <td>resumen</td>\n",
434
+ " <td>medio</td>\n",
435
+ " <td>prensa_ciencia_y_tecnologia</td>\n",
436
+ " <td>españa</td>\n",
437
+ " </tr>\n",
438
+ " <tr>\n",
439
+ " <th>2</th>\n",
440
+ " <td>2</td>\n",
441
+ " <td>Si no avisas a la DGT de este cambio en tu coc...</td>\n",
442
+ " <td>500 euros por pintar un coche de otro color y ...</td>\n",
443
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
444
+ " <td>Con Pilar Cisneros y Fernando de Haro\\nCon Pac...</td>\n",
445
+ " <td>es_es</td>\n",
446
+ " <td>actual</td>\n",
447
+ " <td>resumen</td>\n",
448
+ " <td>medio</td>\n",
449
+ " <td>prensa_otros</td>\n",
450
+ " <td>españa</td>\n",
451
+ " </tr>\n",
452
+ " <tr>\n",
453
+ " <th>3</th>\n",
454
+ " <td>3</td>\n",
455
+ " <td>Estos serán los lenguajes de programación con ...</td>\n",
456
+ " <td>Python y JavaScript.</td>\n",
457
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
458
+ " <td>Si con el año nuevo te has propuesto aumentar ...</td>\n",
459
+ " <td>es_es</td>\n",
460
+ " <td>actual</td>\n",
461
+ " <td>resumen</td>\n",
462
+ " <td>medio</td>\n",
463
+ " <td>prensa_ciencia_y_tecnologia</td>\n",
464
+ " <td>españa</td>\n",
465
+ " </tr>\n",
466
+ " <tr>\n",
467
+ " <th>4</th>\n",
468
+ " <td>4</td>\n",
469
+ " <td>Cambio de estrategia en Microsoft: Windows 12 ...</td>\n",
470
+ " <td>Solo un 28.6% de los usuarios actuales de Wind...</td>\n",
471
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
472
+ " <td>Desde hace ya varios meses, las especulaciones...</td>\n",
473
+ " <td>es_es</td>\n",
474
+ " <td>actual</td>\n",
475
+ " <td>resumen</td>\n",
476
+ " <td>medio</td>\n",
477
+ " <td>prensa_ciencia_y_tecnologia</td>\n",
478
+ " <td>españa</td>\n",
479
+ " </tr>\n",
480
+ " <tr>\n",
481
+ " <th>...</th>\n",
482
+ " <td>...</td>\n",
483
+ " <td>...</td>\n",
484
+ " <td>...</td>\n",
485
+ " <td>...</td>\n",
486
+ " <td>...</td>\n",
487
+ " <td>...</td>\n",
488
+ " <td>...</td>\n",
489
+ " <td>...</td>\n",
490
+ " <td>...</td>\n",
491
+ " <td>...</td>\n",
492
+ " <td>...</td>\n",
493
+ " </tr>\n",
494
+ " <tr>\n",
495
+ " <th>695</th>\n",
496
+ " <td>695</td>\n",
497
+ " <td>Primicia: Mediaset ya tiene pareja de presenta...</td>\n",
498
+ " <td>Diego Losada y Mónica Sanz.</td>\n",
499
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
500
+ " <td>Mediaset ya tiene encajadas las piezas del puz...</td>\n",
501
+ " <td>es_es</td>\n",
502
+ " <td>actual</td>\n",
503
+ " <td>resumen</td>\n",
504
+ " <td>medio</td>\n",
505
+ " <td>prensa_celebridades</td>\n",
506
+ " <td>españa</td>\n",
507
+ " </tr>\n",
508
+ " <tr>\n",
509
+ " <th>696</th>\n",
510
+ " <td>696</td>\n",
511
+ " <td>Margot Robbie anuncia que se retira de la actu...</td>\n",
512
+ " <td>No se retira, pero no quiere hacer otra pelícu...</td>\n",
513
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
514
+ " <td>Todo lo que buscas en un solo click\\nLa actriz...</td>\n",
515
+ " <td>es_bo</td>\n",
516
+ " <td>actual</td>\n",
517
+ " <td>resumen</td>\n",
518
+ " <td>coloquial</td>\n",
519
+ " <td>prensa_celebridades</td>\n",
520
+ " <td>bolivia</td>\n",
521
+ " </tr>\n",
522
+ " <tr>\n",
523
+ " <th>697</th>\n",
524
+ " <td>697</td>\n",
525
+ " <td>¿Por qué el videojuego de Indiana Jones es en ...</td>\n",
526
+ " <td>Para que la acción parezca propia y sea mucho ...</td>\n",
527
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
528
+ " <td>Xbox clarificó en el Developer_Direct de la se...</td>\n",
529
+ " <td>es_es</td>\n",
530
+ " <td>actual</td>\n",
531
+ " <td>resumen</td>\n",
532
+ " <td>medio</td>\n",
533
+ " <td>prensa_ocio_y_cultura</td>\n",
534
+ " <td>españa</td>\n",
535
+ " </tr>\n",
536
+ " <tr>\n",
537
+ " <th>698</th>\n",
538
+ " <td>698</td>\n",
539
+ " <td>La insólita situación vivida frente a un semáf...</td>\n",
540
+ " <td>Un conductor de 44 años se quedó dormido frent...</td>\n",
541
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
542
+ " <td>Se pueden imaginar que en el teléfono de la Po...</td>\n",
543
+ " <td>es_es</td>\n",
544
+ " <td>actual</td>\n",
545
+ " <td>resumen</td>\n",
546
+ " <td>medio</td>\n",
547
+ " <td>prensa_otros</td>\n",
548
+ " <td>españa</td>\n",
549
+ " </tr>\n",
550
+ " <tr>\n",
551
+ " <th>699</th>\n",
552
+ " <td>699</td>\n",
553
+ " <td>Uno de los mejores Assassin’s Creed podría ten...</td>\n",
554
+ " <td>Black Flag.</td>\n",
555
+ " <td>Ahora eres una Inteligencia Artificial experta...</td>\n",
556
+ " <td>Parece que la nueva versión del título de Ubis...</td>\n",
557
+ " <td>es_mx</td>\n",
558
+ " <td>actual</td>\n",
559
+ " <td>resumen</td>\n",
560
+ " <td>medio</td>\n",
561
+ " <td>prensa_ocio_y_cultura</td>\n",
562
+ " <td>mexico</td>\n",
563
+ " </tr>\n",
564
+ " </tbody>\n",
565
+ "</table>\n",
566
+ "<p>700 rows × 11 columns</p>\n",
567
+ "</div>"
568
+ ],
569
+ "text/plain": [
570
+ " id titular \\\n",
571
+ "0 0 JORGE REY: EL TIEMPO | La impactante predicció... \n",
572
+ "1 1 El cambio en las matrículas que se espera para... \n",
573
+ "2 2 Si no avisas a la DGT de este cambio en tu coc... \n",
574
+ "3 3 Estos serán los lenguajes de programación con ... \n",
575
+ "4 4 Cambio de estrategia en Microsoft: Windows 12 ... \n",
576
+ ".. ... ... \n",
577
+ "695 695 Primicia: Mediaset ya tiene pareja de presenta... \n",
578
+ "696 696 Margot Robbie anuncia que se retira de la actu... \n",
579
+ "697 697 ¿Por qué el videojuego de Indiana Jones es en ... \n",
580
+ "698 698 La insólita situación vivida frente a un semáf... \n",
581
+ "699 699 Uno de los mejores Assassin’s Creed podría ten... \n",
582
+ "\n",
583
+ " respuesta \\\n",
584
+ "0 El inicio de un periodo frío intenso. \n",
585
+ "1 Se dará el salto a la letra M. \n",
586
+ "2 500 euros por pintar un coche de otro color y ... \n",
587
+ "3 Python y JavaScript. \n",
588
+ "4 Solo un 28.6% de los usuarios actuales de Wind... \n",
589
+ ".. ... \n",
590
+ "695 Diego Losada y Mónica Sanz. \n",
591
+ "696 No se retira, pero no quiere hacer otra pelícu... \n",
592
+ "697 Para que la acción parezca propia y sea mucho ... \n",
593
+ "698 Un conductor de 44 años se quedó dormido frent... \n",
594
+ "699 Black Flag. \n",
595
+ "\n",
596
+ " pregunta \\\n",
597
+ "0 Ahora eres una Inteligencia Artificial experta... \n",
598
+ "1 Ahora eres una Inteligencia Artificial experta... \n",
599
+ "2 Ahora eres una Inteligencia Artificial experta... \n",
600
+ "3 Ahora eres una Inteligencia Artificial experta... \n",
601
+ "4 Ahora eres una Inteligencia Artificial experta... \n",
602
+ ".. ... \n",
603
+ "695 Ahora eres una Inteligencia Artificial experta... \n",
604
+ "696 Ahora eres una Inteligencia Artificial experta... \n",
605
+ "697 Ahora eres una Inteligencia Artificial experta... \n",
606
+ "698 Ahora eres una Inteligencia Artificial experta... \n",
607
+ "699 Ahora eres una Inteligencia Artificial experta... \n",
608
+ "\n",
609
+ " texto idioma periodo \\\n",
610
+ "0 27·11·23 | 08:34 | Actualizado a las 14:47\\nJO... es_es actual \n",
611
+ "1 Si eres de los que sigues el avance de las mat... es_es actual \n",
612
+ "2 Con Pilar Cisneros y Fernando de Haro\\nCon Pac... es_es actual \n",
613
+ "3 Si con el año nuevo te has propuesto aumentar ... es_es actual \n",
614
+ "4 Desde hace ya varios meses, las especulaciones... es_es actual \n",
615
+ ".. ... ... ... \n",
616
+ "695 Mediaset ya tiene encajadas las piezas del puz... es_es actual \n",
617
+ "696 Todo lo que buscas en un solo click\\nLa actriz... es_bo actual \n",
618
+ "697 Xbox clarificó en el Developer_Direct de la se... es_es actual \n",
619
+ "698 Se pueden imaginar que en el teléfono de la Po... es_es actual \n",
620
+ "699 Parece que la nueva versión del título de Ubis... es_mx actual \n",
621
+ "\n",
622
+ " tarea registro dominio país_origen \n",
623
+ "0 resumen medio prensa_ciencia_y_tecnologia españa \n",
624
+ "1 resumen medio prensa_ciencia_y_tecnologia españa \n",
625
+ "2 resumen medio prensa_otros españa \n",
626
+ "3 resumen medio prensa_ciencia_y_tecnologia españa \n",
627
+ "4 resumen medio prensa_ciencia_y_tecnologia españa \n",
628
+ ".. ... ... ... ... \n",
629
+ "695 resumen medio prensa_celebridades españa \n",
630
+ "696 resumen coloquial prensa_celebridades bolivia \n",
631
+ "697 resumen medio prensa_ocio_y_cultura españa \n",
632
+ "698 resumen medio prensa_otros españa \n",
633
+ "699 resumen medio prensa_ocio_y_cultura mexico \n",
634
+ "\n",
635
+ "[700 rows x 11 columns]"
636
+ ]
637
+ },
638
+ "execution_count": 3,
639
+ "metadata": {},
640
+ "output_type": "execute_result"
641
+ }
642
+ ],
643
+ "source": [
644
+ "df.head()"
645
+ ]
646
+ },
647
+ {
648
+ "cell_type": "markdown",
649
+ "metadata": {},
650
+ "source": [
651
+ "### Training cell:\n",
652
+ "\n",
653
+ "This cell runs the complete fine-tuning process and saves the model. Specifically, it:\n",
654
+ "\n",
655
+ "- Loads the base `tokenizer` and `model` from Hugging Face.\n",
656
+ "- Creates a data subset (`sample_size`) and splits it into `train`, `val`, and `test`.\n",
657
+ "- Defines `preprocess_function` to tokenize the inputs (`texto`) and the targets (`respuesta`).\n",
658
+ "- Builds `DataLoader`s and a `DataCollatorForSeq2Seq` to batch examples appropriately.\n",
659
+ "- Runs a short training loop (bounded by `max_train_steps`) with `AdamW`.\n",
660
+ "- Evaluates the model on the test set to obtain `test_loss` and `test_perplexity`.\n",
661
+ "- Saves the model and tokenizer to `mt5-resumenes-es-final` and runs an example inference.\n",
662
+ "\n",
663
+ "Run this cell after checking `df.head()` and installing the required dependencies. Training takes longer on CPU; it will be faster on GPU."
664
+ ]
665
+ },
666
+ {
667
+ "cell_type": "code",
668
+ "execution_count": 3,
669
+ "metadata": {},
670
+ "outputs": [
671
+ {
672
+ "name": "stderr",
673
+ "output_type": "stream",
674
+ "text": [
675
+ "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n"
676
+ ]
677
+ },
678
+ {
679
+ "data": {
680
+ "application/vnd.jupyter.widget-view+json": {
681
+ "model_id": "3e1f88c34c734cb7bf409cfad217608b",
682
+ "version_major": 2,
683
+ "version_minor": 0
684
+ },
685
+ "text/plain": [
686
+ "Loading weights: 0%| | 0/192 [00:00<?, ?it/s]"
687
+ ]
688
+ },
689
+ "metadata": {},
690
+ "output_type": "display_data"
691
+ },
692
+ {
693
+ "name": "stderr",
694
+ "output_type": "stream",
695
+ "text": [
696
+ "[transformers] The tied weights mapping and config for this model specifies to tie shared.weight to lm_head.weight, but both are present in the checkpoints with different values, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning.\n"
697
+ ]
698
+ },
699
+ {
700
+ "data": {
701
+ "application/vnd.jupyter.widget-view+json": {
702
+ "model_id": "d5a59ad4dd8b4701aef6078010db74f4",
703
+ "version_major": 2,
704
+ "version_minor": 0
705
+ },
706
+ "text/plain": [
707
+ "Map: 0%| | 0/204 [00:00<?, ? examples/s]"
708
+ ]
709
+ },
710
+ "metadata": {},
711
+ "output_type": "display_data"
712
+ },
713
+ {
714
+ "data": {
715
+ "application/vnd.jupyter.widget-view+json": {
716
+ "model_id": "8f12dce6cbf746fa82da1b7eafc923ef",
717
+ "version_major": 2,
718
+ "version_minor": 0
719
+ },
720
+ "text/plain": [
721
+ "Map: 0%| | 0/26 [00:00<?, ? examples/s]"
722
+ ]
723
+ },
724
+ "metadata": {},
725
+ "output_type": "display_data"
726
+ },
727
+ {
728
+ "data": {
729
+ "application/vnd.jupyter.widget-view+json": {
730
+ "model_id": "1514a44d18c84be38773f0f45391acd1",
731
+ "version_major": 2,
732
+ "version_minor": 0
733
+ },
734
+ "text/plain": [
735
+ "Map: 0%| | 0/26 [00:00<?, ? examples/s]"
736
+ ]
737
+ },
738
+ "metadata": {},
739
+ "output_type": "display_data"
740
+ },
741
+ {
742
+ "name": "stdout",
743
+ "output_type": "stream",
744
+ "text": [
745
+ "Train loss: 5.0288\n",
746
+ "Test loss: 4.0315\n",
747
+ "Test perplexity: 56.3473\n"
748
+ ]
749
+ },
750
+ {
751
+ "data": {
752
+ "application/vnd.jupyter.widget-view+json": {
753
+ "model_id": "a6f5da6256154aa592ff09a1295a330d",
754
+ "version_major": 2,
755
+ "version_minor": 0
756
+ },
757
+ "text/plain": [
758
+ "Writing model shards: 0%| | 0/1 [00:00<?, ?it/s]"
759
+ ]
760
+ },
761
+ "metadata": {},
762
+ "output_type": "display_data"
763
+ },
764
+ {
765
+ "name": "stdout",
766
+ "output_type": "stream",
767
+ "text": [
768
+ "Texto de entrada: Este jueves 16 de noviembre Sevilla se convierte en capital mundial de la música con la celebración en el Centro de Conferencias y Exposiciones (FIBES) de los Grammy Latinos, una entrega que se emitirá internacionalmente por primera vez en la historia, como ha informado RTVE, quien los coproducirá y emitirá junto con Univisión.\n",
769
+ "La ceremonia comienza a las 22:30 y se podrá ver en directo en La 1 y RTVE Play. Estará presentada por Paz Vega, Sebastián Yatra, Danna Paola y Roselyn Sánchez. Carlos del Amor y Elena S. Sánchez personalizarán la señal para España.\n",
770
+ "Antes, a las 21:30 y tras el Telediario llegará Noche de estrellas, un especial con la alfombra roja presentado por Carlos Baute, Clarissa Molina, Chiqui Delgado, Raul de Molina, y Borja Voces. Por supuesto, en El HuffPost te contaremos todo lo que dé de sí la noche.\n",
771
+ "En la ceremonia se ha confirmado la participación de artistas como Rosalía, Shakira, Pablo Alborán, Edgar Barrera, Camilo, Manuel Carrasco, Iza, Juanes y Ozuna, María Becerra, Bizarrap, Feid, Kany García, Carin León, Christian Nodal, Rauw Alejandro y Alejandro Sanz.\n",
772
+ "No faltará a la cita Laura Pausini, Persona del Año 2023 de la Academia Latina de la Grabación. Además\n",
773
+ "Resumen generado: españa se convierte en capital mundial de la música\n"
774
+ ]
775
+ }
776
+ ],
777
+ "source": [
778
+ "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq\n",
+ "\n",
+ "tokenizer = AutoTokenizer.from_pretrained(\"josmunpen/mt5-small-spanish-summarization\")\n",
+ "model = AutoModelForSeq2SeqLM.from_pretrained(\"josmunpen/mt5-small-spanish-summarization\")\n",
+ "\n",
+ "sample_size = min(256, len(df))\n",
+ "df_sample = df.sample(n=sample_size, random_state=42).reset_index(drop=True)\n",
+ "train_df, temp_df = train_test_split(df_sample, test_size=0.2, random_state=42)\n",
+ "val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)\n",
+ "\n",
+ "train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))\n",
+ "val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))\n",
+ "test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))\n",
+ "\n",
+ "max_input_length = 256\n",
+ "max_target_length = 64\n",
+ "\n",
+ "def preprocess_function(batch):\n",
+ "    inputs = tokenizer(batch[\"texto\"], max_length=max_input_length, truncation=True)\n",
+ "    targets = tokenizer(text_target=batch[\"respuesta\"], max_length=max_target_length, truncation=True)\n",
+ "    inputs[\"labels\"] = targets[\"input_ids\"]\n",
+ "    return inputs\n",
+ "\n",
+ "train_tokenized = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)\n",
+ "val_tokenized = val_dataset.map(preprocess_function, batched=True, remove_columns=val_dataset.column_names)\n",
+ "test_tokenized = test_dataset.map(preprocess_function, batched=True, remove_columns=test_dataset.column_names)\n",
+ "\n",
+ "data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)\n",
+ "\n",
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "model.to(device)\n",
+ "optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)\n",
+ "\n",
+ "train_loader = DataLoader(train_tokenized, batch_size=2, shuffle=True, collate_fn=data_collator)\n",
+ "eval_loader = DataLoader(test_tokenized, batch_size=2, shuffle=False, collate_fn=data_collator)\n",
+ "\n",
+ "model.train()\n",
+ "train_losses = []\n",
+ "max_train_steps = 20\n",
+ "for step, batch in enumerate(train_loader, start=1):\n",
+ "    batch = {key: value.to(device) for key, value in batch.items()}\n",
+ "    outputs = model(**batch)\n",
+ "    loss = outputs.loss\n",
+ "    loss.backward()\n",
+ "    optimizer.step()\n",
+ "    optimizer.zero_grad()\n",
+ "    train_losses.append(loss.item())\n",
+ "    if step >= max_train_steps:\n",
+ "        break\n",
+ "\n",
+ "train_loss = float(np.mean(train_losses)) if train_losses else float(\"nan\")\n",
+ "\n",
+ "model.eval()\n",
+ "eval_losses = []\n",
+ "with torch.no_grad():\n",
+ "    for batch in eval_loader:\n",
+ "        batch = {key: value.to(device) for key, value in batch.items()}\n",
+ "        outputs = model(**batch)\n",
+ "        eval_losses.append(outputs.loss.item())\n",
+ "\n",
+ "test_loss = float(np.mean(eval_losses)) if eval_losses else float(\"nan\")\n",
+ "test_perplexity = math.exp(test_loss) if np.isfinite(test_loss) and test_loss < 20 else float(\"inf\")\n",
+ "\n",
+ "print(\"Train loss:\", round(train_loss, 4) if np.isfinite(train_loss) else train_loss)\n",
+ "print(\"Test loss:\", round(test_loss, 4))\n",
+ "print(\"Test perplexity:\", round(test_perplexity, 4) if np.isfinite(test_perplexity) else test_perplexity)\n",
+ "\n",
+ "model.save_pretrained(\"mt5-resumenes-es-final\")\n",
+ "tokenizer.save_pretrained(\"mt5-resumenes-es-final\")\n",
+ "\n",
+ "sample_text = test_df.iloc[0][\"texto\"]\n",
+ "inputs = tokenizer(sample_text, return_tensors=\"pt\", truncation=True, max_length=max_input_length).to(device)\n",
+ "generated_ids = model.generate(**inputs, max_length=max_target_length, num_beams=4)\n",
+ "print(\"Texto de entrada:\", sample_text[:1200])\n",
+ "print(\"Resumen generado:\", tokenizer.decode(generated_ids[0], skip_special_tokens=True))"
853
+ ]
854
+ },
855
+ {
856
+ "cell_type": "markdown",
857
+ "metadata": {},
858
+ "source": [
859
+ "## Evaluation metrics on the test set\n",
+ "\n",
+ "This section computes summarization metrics on the test set to measure the quality of the fine-tuned model."
862
+ ]
863
+ },
864
+ {
865
+ "cell_type": "code",
866
+ "execution_count": 7,
867
+ "metadata": {},
868
+ "outputs": [
869
+ {
870
+ "data": {
871
+ "text/html": [
872
+ "<div>\n",
873
+ "<style scoped>\n",
874
+ " .dataframe tbody tr th:only-of-type {\n",
875
+ " vertical-align: middle;\n",
876
+ " }\n",
877
+ "\n",
878
+ " .dataframe tbody tr th {\n",
879
+ " vertical-align: top;\n",
880
+ " }\n",
881
+ "\n",
882
+ " .dataframe thead th {\n",
883
+ " text-align: right;\n",
884
+ " }\n",
885
+ "</style>\n",
886
+ "<table border=\"1\" class=\"dataframe\">\n",
887
+ " <thead>\n",
888
+ " <tr style=\"text-align: right;\">\n",
889
+ " <th></th>\n",
890
+ " <th>metric</th>\n",
891
+ " <th>valor</th>\n",
892
+ " </tr>\n",
893
+ " </thead>\n",
894
+ " <tbody>\n",
895
+ " <tr>\n",
896
+ " <th>0</th>\n",
897
+ " <td>ROUGE-1 aprox.</td>\n",
898
+ " <td>0.6236</td>\n",
899
+ " </tr>\n",
900
+ " <tr>\n",
901
+ " <th>1</th>\n",
902
+ " <td>ROUGE-2 aprox.</td>\n",
903
+ " <td>0.5829</td>\n",
904
+ " </tr>\n",
905
+ " <tr>\n",
906
+ " <th>2</th>\n",
907
+ " <td>ROUGE-L aprox.</td>\n",
908
+ " <td>0.6236</td>\n",
909
+ " </tr>\n",
910
+ " <tr>\n",
911
+ " <th>3</th>\n",
912
+ " <td>Test loss</td>\n",
913
+ " <td>4.0315</td>\n",
914
+ " </tr>\n",
915
+ " <tr>\n",
916
+ " <th>4</th>\n",
917
+ " <td>Test perplexity</td>\n",
918
+ " <td>56.3473</td>\n",
919
+ " </tr>\n",
920
+ " </tbody>\n",
921
+ "</table>\n",
922
+ "</div>"
923
+ ],
924
+ "text/plain": [
925
+ " metric valor\n",
926
+ "0 ROUGE-1 aprox. 0.6236\n",
927
+ "1 ROUGE-2 aprox. 0.5829\n",
928
+ "2 ROUGE-L aprox. 0.6236\n",
929
+ "3 Test loss 4.0315\n",
930
+ "4 Test perplexity 56.3473"
931
+ ]
932
+ },
933
+ "execution_count": 7,
934
+ "metadata": {},
935
+ "output_type": "execute_result"
936
+ }
937
+ ],
938
+ "source": [
939
+ "from collections import Counter\n",
+ "\n",
+ "test_eval_loader = DataLoader(test_tokenized, batch_size=2, shuffle=False, collate_fn=data_collator)\n",
+ "predictions = []\n",
+ "references = []\n",
+ "\n",
+ "model.eval()\n",
+ "with torch.no_grad():\n",
+ "    for batch in test_eval_loader:\n",
+ "        labels = batch[\"labels\"].clone()\n",
+ "        model_inputs = {key: value.to(device) for key, value in batch.items() if key != \"labels\"}\n",
+ "        generated_ids = model.generate(**model_inputs, max_new_tokens=32, num_beams=4)\n",
+ "        batch_predictions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)\n",
+ "        labels[labels == -100] = tokenizer.pad_token_id\n",
+ "        batch_references = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
+ "        predictions.extend(batch_predictions)\n",
+ "        references.extend(batch_references)\n",
+ "\n",
+ "def tokenize_summary(text):\n",
+ "    return [token for token in text.lower().split() if token]\n",
+ "\n",
+ "def rouge_n_score(prediction_tokens, reference_tokens, n):\n",
+ "    prediction_ngrams = Counter(tuple(prediction_tokens[index:index + n]) for index in range(max(len(prediction_tokens) - n + 1, 0)))\n",
+ "    reference_ngrams = Counter(tuple(reference_tokens[index:index + n]) for index in range(max(len(reference_tokens) - n + 1, 0)))\n",
+ "    overlap = sum(min(count, reference_ngrams[ngram]) for ngram, count in prediction_ngrams.items())\n",
+ "    prediction_total = sum(prediction_ngrams.values())\n",
+ "    reference_total = sum(reference_ngrams.values())\n",
+ "    precision = overlap / prediction_total if prediction_total else 0.0\n",
+ "    recall = overlap / reference_total if reference_total else 0.0\n",
+ "    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n",
+ "\n",
+ "def lcs_length(left_tokens, right_tokens):\n",
+ "    previous_row = [0] * (len(right_tokens) + 1)\n",
+ "    for left_token in left_tokens:\n",
+ "        current_row = [0]\n",
+ "        for index, right_token in enumerate(right_tokens, start=1):\n",
+ "            if left_token == right_token:\n",
+ "                current_row.append(previous_row[index - 1] + 1)\n",
+ "            else:\n",
+ "                current_row.append(max(previous_row[index], current_row[-1]))\n",
+ "        previous_row = current_row\n",
+ "    return previous_row[-1]\n",
+ "\n",
+ "def rouge_l_score(prediction_tokens, reference_tokens):\n",
+ "    lcs = lcs_length(prediction_tokens, reference_tokens)\n",
+ "    precision = lcs / len(prediction_tokens) if prediction_tokens else 0.0\n",
+ "    recall = lcs / len(reference_tokens) if reference_tokens else 0.0\n",
+ "    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n",
+ "\n",
+ "rouge_scores = {\"rouge1\": [], \"rouge2\": [], \"rougeL\": []}\n",
+ "\n",
+ "for prediction, reference in zip(predictions, references):\n",
+ "    prediction_tokens = tokenize_summary(prediction)\n",
+ "    reference_tokens = tokenize_summary(reference)\n",
+ "    rouge_scores[\"rouge1\"].append(rouge_n_score(prediction_tokens, reference_tokens, 1))\n",
+ "    rouge_scores[\"rouge2\"].append(rouge_n_score(prediction_tokens, reference_tokens, 2))\n",
+ "    rouge_scores[\"rougeL\"].append(rouge_l_score(prediction_tokens, reference_tokens))\n",
+ "\n",
+ "metrics_df = pd.DataFrame(\n",
+ "    [\n",
+ "        {\"metric\": \"ROUGE-1 aprox.\", \"valor\": float(np.mean(rouge_scores[\"rouge1\"]))},\n",
+ "        {\"metric\": \"ROUGE-2 aprox.\", \"valor\": float(np.mean(rouge_scores[\"rouge2\"]))},\n",
+ "        {\"metric\": \"ROUGE-L aprox.\", \"valor\": float(np.mean(rouge_scores[\"rougeL\"]))},\n",
+ "        {\"metric\": \"Test loss\", \"valor\": test_loss},\n",
+ "        {\"metric\": \"Test perplexity\", \"valor\": test_perplexity},\n",
+ "    ]\n",
+ ")\n",
+ "\n",
+ "metrics_df[\"valor\"] = metrics_df[\"valor\"].apply(lambda value: round(value, 4) if isinstance(value, (float, np.floating)) and np.isfinite(value) else value)\n",
+ "metrics_df"
1009
+ ]
1010
+ },
1011
+ {
1012
+ "cell_type": "markdown",
1013
+ "metadata": {},
1014
+ "source": [
1015
+ "## Gradio demo\n",
+ "\n",
+ "The following interface lets you type a text, press a button, and get the summary generated by the fine-tuned model."
1018
+ ]
1019
+ },
1020
+ {
1021
+ "cell_type": "code",
1022
+ "execution_count": 3,
1023
+ "metadata": {},
1024
+ "outputs": [
1025
+ {
1026
+ "data": {
1027
+ "application/vnd.jupyter.widget-view+json": {
1028
+ "model_id": "5c33f68d8c56475caaa96815c7841b17",
1029
+ "version_major": 2,
1030
+ "version_minor": 0
1031
+ },
1032
+ "text/plain": [
1033
+ "Loading weights: 0%| | 0/190 [00:00<?, ?it/s]"
1034
+ ]
1035
+ },
1036
+ "metadata": {},
1037
+ "output_type": "display_data"
1038
+ },
1039
+ {
1040
+ "name": "stderr",
1041
+ "output_type": "stream",
1042
+ "text": [
1043
+ "[transformers] The tied weights mapping and config for this model specifies to tie shared.weight to lm_head.weight, but both are present in the checkpoints with different values, so we will NOT tie them. You should update the config with `tie_word_embeddings=False` to silence this warning.\n"
1044
+ ]
1045
+ },
1046
+ {
1047
+ "name": "stdout",
1048
+ "output_type": "stream",
1049
+ "text": [
1050
+ "* Running on local URL: http://127.0.0.1:7860\n",
1051
+ "* To create a public link, set `share=True` in `launch()`.\n"
1052
+ ]
1053
+ },
1054
+ {
1055
+ "data": {
1056
+ "text/html": [
1057
+ "<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
1058
+ ],
1059
+ "text/plain": [
1060
+ "<IPython.core.display.HTML object>"
1061
+ ]
1062
+ },
1063
+ "metadata": {},
1064
+ "output_type": "display_data"
1065
+ },
1066
+ {
1067
+ "data": {
1068
+ "text/plain": []
1069
+ },
1070
+ "execution_count": 3,
1071
+ "metadata": {},
1072
+ "output_type": "execute_result"
1073
+ }
1074
+ ],
1075
+ "source": [
1076
+ "import gradio as gr\n",
+ "import torch\n",
+ "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n",
+ "\n",
+ "model_path = \"mt5-resumenes-es-final\"\n",
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_path)\n",
+ "model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)\n",
+ "max_input_length = 256\n",
+ "\n",
+ "def generate_summary(text):\n",
+ "    if not text or not text.strip():\n",
+ "        return \"Introduce un texto para generar el resumen.\"\n",
+ "\n",
+ "    model.eval()\n",
+ "    inputs = tokenizer(text, return_tensors=\"pt\", truncation=True, max_length=max_input_length).to(device)\n",
+ "    with torch.no_grad():\n",
+ "        summary_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)\n",
+ "    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)\n",
+ "\n",
+ "demo = gr.Blocks(title=\"Resumen de texto en español\")\n",
+ "with demo:\n",
+ "    gr.Markdown(\"# Resumen de textos en español\\nEscribe un texto largo y pulsa el botón para generar un resumen.\")\n",
+ "    with gr.Row():\n",
+ "        input_text = gr.Textbox(label=\"Texto de entrada\", lines=12, placeholder=\"Pega aquí el texto que quieras resumir...\")\n",
+ "        output_text = gr.Textbox(label=\"Resumen generado\", lines=6)\n",
+ "    generate_button = gr.Button(\"Generar resumen\")\n",
+ "    generate_button.click(fn=generate_summary, inputs=input_text, outputs=output_text)\n",
+ "\n",
+ "demo.launch()"
1106
+ ]
1107
+ }
1108
+ ],
1109
+ "metadata": {
1110
+ "colab": {
1111
+ "provenance": []
1112
+ },
1113
+ "kernelspec": {
1114
+ "display_name": "TECL",
1115
+ "language": "python",
1116
+ "name": "python3"
1117
+ },
1118
+ "language_info": {
1119
+ "codemirror_mode": {
1120
+ "name": "ipython",
1121
+ "version": 3
1122
+ },
1123
+ "file_extension": ".py",
1124
+ "mimetype": "text/x-python",
1125
+ "name": "python",
1126
+ "nbconvert_exporter": "python",
1127
+ "pygments_lexer": "ipython3",
1128
+ "version": "3.12.13"
1129
+ }
1130
+ },
1131
+ "nbformat": 4,
1132
+ "nbformat_minor": 0
1133
+ }
Proyecto_Hugging_Face.py ADDED
@@ -0,0 +1,258 @@
+ from __future__ import annotations
+
+ import argparse
+ import math
+ from collections import Counter
+ from pathlib import Path
+
+ import numpy as np
+ import pandas as pd
+ import torch
+ import gradio as gr
+ from datasets import Dataset
+ from sklearn.model_selection import train_test_split
+ from torch.utils.data import DataLoader
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq
+
+
+ DATASET_SPLITS = {
+     "train": "data/train-00000-of-00001.parquet",
+     "validation": "data/validation-00000-of-00001.parquet",
+     "test": "data/test-00000-of-00001.parquet",
+ }
+ DATASET_URL = "hf://datasets/somosnlp/NoticIA-it/"
+ BASE_MODEL_NAME = "josmunpen/mt5-small-spanish-summarization"
+ DEFAULT_OUTPUT_DIR = "mt5-resumenes-es-final"
+ SAMPLE_SIZE = 256
+ MAX_INPUT_LENGTH = 256
+ MAX_TARGET_LENGTH = 64
+ TRAIN_BATCH_SIZE = 2
+ EVAL_BATCH_SIZE = 2
+ MAX_TRAIN_STEPS = 20
+ LEARNING_RATE = 2e-5
+
+
+ def load_dataframe() -> pd.DataFrame:
+     df = pd.read_parquet(DATASET_URL + DATASET_SPLITS["train"])
+     return df[["texto", "respuesta"]].dropna().reset_index(drop=True)
+
+
+ def prepare_splits(df: pd.DataFrame):
+     sample_size = min(SAMPLE_SIZE, len(df))
+     df_sample = df.sample(n=sample_size, random_state=42).reset_index(drop=True)
+     train_df, temp_df = train_test_split(df_sample, test_size=0.2, random_state=42)
+     val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
+     return train_df.reset_index(drop=True), val_df.reset_index(drop=True), test_df.reset_index(drop=True)
+
+
+ def tokenize_datasets(tokenizer, train_df: pd.DataFrame, val_df: pd.DataFrame, test_df: pd.DataFrame):
+     train_dataset = Dataset.from_pandas(train_df)
+     val_dataset = Dataset.from_pandas(val_df)
+     test_dataset = Dataset.from_pandas(test_df)
+
+     def preprocess_function(batch):
+         inputs = tokenizer(batch["texto"], max_length=MAX_INPUT_LENGTH, truncation=True)
+         targets = tokenizer(text_target=batch["respuesta"], max_length=MAX_TARGET_LENGTH, truncation=True)
+         inputs["labels"] = targets["input_ids"]
+         return inputs
+
+     train_tokenized = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
+     val_tokenized = val_dataset.map(preprocess_function, batched=True, remove_columns=val_dataset.column_names)
+     test_tokenized = test_dataset.map(preprocess_function, batched=True, remove_columns=test_dataset.column_names)
+     return train_tokenized, val_tokenized, test_tokenized
+
+
+ def train_model(model, tokenizer, train_tokenized, test_tokenized):
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     model.to(device)
+     optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
+     data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
+
+     train_loader = DataLoader(train_tokenized, batch_size=TRAIN_BATCH_SIZE, shuffle=True, collate_fn=data_collator)
+     eval_loader = DataLoader(test_tokenized, batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=data_collator)
+
+     model.train()
+     train_losses = []
+     for step, batch in enumerate(train_loader, start=1):
+         batch = {key: value.to(device) for key, value in batch.items()}
+         outputs = model(**batch)
+         loss = outputs.loss
+         loss.backward()
+         optimizer.step()
+         optimizer.zero_grad()
+         train_losses.append(loss.item())
+         if step >= MAX_TRAIN_STEPS:
+             break
+
+     train_loss = float(np.mean(train_losses)) if train_losses else float("nan")
+
+     model.eval()
+     eval_losses = []
+     with torch.no_grad():
+         for batch in eval_loader:
+             batch = {key: value.to(device) for key, value in batch.items()}
+             outputs = model(**batch)
+             eval_losses.append(outputs.loss.item())
+
+     test_loss = float(np.mean(eval_losses)) if eval_losses else float("nan")
+     test_perplexity = math.exp(test_loss) if np.isfinite(test_loss) and test_loss < 20 else float("inf")
+
+     return device, train_loss, test_loss, test_perplexity, data_collator
+
+
+ def compute_metrics(model, tokenizer, test_tokenized, data_collator, device):
+     test_eval_loader = DataLoader(test_tokenized, batch_size=EVAL_BATCH_SIZE, shuffle=False, collate_fn=data_collator)
+     predictions = []
+     references = []
+
+     model.eval()
+     with torch.no_grad():
+         for batch in test_eval_loader:
+             labels = batch["labels"].clone()
+             model_inputs = {key: value.to(device) for key, value in batch.items() if key != "labels"}
+             generated_ids = model.generate(**model_inputs, max_new_tokens=32, num_beams=4)
+             batch_predictions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+             labels[labels == -100] = tokenizer.pad_token_id
+             batch_references = tokenizer.batch_decode(labels, skip_special_tokens=True)
+             predictions.extend(batch_predictions)
+             references.extend(batch_references)
+
+     def tokenize_summary(text):
+         return [token for token in text.lower().split() if token]
+
+     def rouge_n_score(prediction_tokens, reference_tokens, n):
+         prediction_ngrams = Counter(
+             tuple(prediction_tokens[index : index + n])
+             for index in range(max(len(prediction_tokens) - n + 1, 0))
+         )
+         reference_ngrams = Counter(
+             tuple(reference_tokens[index : index + n])
+             for index in range(max(len(reference_tokens) - n + 1, 0))
+         )
+         overlap = sum(min(count, reference_ngrams[ngram]) for ngram, count in prediction_ngrams.items())
+         prediction_total = sum(prediction_ngrams.values())
+         reference_total = sum(reference_ngrams.values())
+         precision = overlap / prediction_total if prediction_total else 0.0
+         recall = overlap / reference_total if reference_total else 0.0
+         return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
+
+     def lcs_length(left_tokens, right_tokens):
+         previous_row = [0] * (len(right_tokens) + 1)
+         for left_token in left_tokens:
+             current_row = [0]
+             for index, right_token in enumerate(right_tokens, start=1):
+                 if left_token == right_token:
+                     current_row.append(previous_row[index - 1] + 1)
+                 else:
+                     current_row.append(max(previous_row[index], current_row[-1]))
+             previous_row = current_row
+         return previous_row[-1]
+
+     def rouge_l_score(prediction_tokens, reference_tokens):
+         lcs = lcs_length(prediction_tokens, reference_tokens)
+         precision = lcs / len(prediction_tokens) if prediction_tokens else 0.0
+         recall = lcs / len(reference_tokens) if reference_tokens else 0.0
+         return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
+
+     rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
+     for prediction, reference in zip(predictions, references):
+         prediction_tokens = tokenize_summary(prediction)
+         reference_tokens = tokenize_summary(reference)
+         rouge_scores["rouge1"].append(rouge_n_score(prediction_tokens, reference_tokens, 1))
+         rouge_scores["rouge2"].append(rouge_n_score(prediction_tokens, reference_tokens, 2))
+         rouge_scores["rougeL"].append(rouge_l_score(prediction_tokens, reference_tokens))
+
+     metrics_df = pd.DataFrame(
+         [
+             {"metric": "ROUGE-1 aprox.", "valor": float(np.mean(rouge_scores["rouge1"]))},
+             {"metric": "ROUGE-2 aprox.", "valor": float(np.mean(rouge_scores["rouge2"]))},
+             {"metric": "ROUGE-L aprox.", "valor": float(np.mean(rouge_scores["rougeL"]))},
+         ]
+     )
+     return metrics_df
+
+
+ def save_model(model, tokenizer, output_dir: Path):
+     output_dir.mkdir(parents=True, exist_ok=True)
+     model.save_pretrained(output_dir)
+     tokenizer.save_pretrained(output_dir)
+
+
+ def generate_sample_summary(model, tokenizer, test_df: pd.DataFrame, device):
+     sample_text = test_df.iloc[0]["texto"]
+     inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, max_length=MAX_INPUT_LENGTH).to(device)
+     generated_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
+     return sample_text, tokenizer.decode(generated_ids[0], skip_special_tokens=True)
+
+
+ def build_gradio_demo(model, tokenizer, device):
+     def generate_summary(text):
+         if not text or not text.strip():
+             return "Introduce un texto para generar el resumen."
+
+         model.eval()
+         inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=MAX_INPUT_LENGTH).to(device)
+         with torch.no_grad():
+             summary_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
+         return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
+
+     with gr.Blocks(title="Resumen de texto en espanol") as demo:
+         gr.Markdown("# Resumen de textos en espanol\nEscribe un texto largo y pulsa el boton para generar un resumen.")
+         with gr.Row():
+             input_text = gr.Textbox(label="Texto de entrada", lines=12, placeholder="Pega aqui el texto que quieras resumir...")
+             output_text = gr.Textbox(label="Resumen generado", lines=6)
+         generate_button = gr.Button("Generar resumen")
+         generate_button.click(fn=generate_summary, inputs=input_text, outputs=output_text)
+     return demo
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Fine-tuning y demo de resumen en espanol")
+     parser.add_argument("--retrain", action="store_true", help="Reentrenar el modelo aunque ya exista una version guardada")
+     parser.add_argument("--no-demo", action="store_true", help="No lanzar la interfaz de Gradio al final")
+     parser.add_argument("--share", action="store_true", help="Crear un enlace publico de Gradio")
+     parser.add_argument("--server-port", type=int, default=7860, help="Puerto para la demo de Gradio")
+     args = parser.parse_args()
+
+     base_dir = Path(__file__).resolve().parent
+     output_dir = base_dir / DEFAULT_OUTPUT_DIR
+
+     df = load_dataframe()
+     train_df, val_df, test_df = prepare_splits(df)
+
+     if output_dir.exists() and not args.retrain:
+         tokenizer = AutoTokenizer.from_pretrained(output_dir)
+         model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
+         train_tokenized, val_tokenized, test_tokenized = tokenize_datasets(tokenizer, train_df, val_df, test_df)
+         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         model.to(device)
+         data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
+         train_loss = float("nan")
+         test_loss = float("nan")
+         test_perplexity = float("nan")
+     else:
+         tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
+         model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_NAME)
+         train_tokenized, val_tokenized, test_tokenized = tokenize_datasets(tokenizer, train_df, val_df, test_df)
+         device, train_loss, test_loss, test_perplexity, data_collator = train_model(model, tokenizer, train_tokenized, test_tokenized)
+         save_model(model, tokenizer, output_dir)
+
+     metrics_df = compute_metrics(model, tokenizer, test_tokenized, data_collator, device)
+     metrics_df["valor"] = metrics_df["valor"].apply(lambda value: round(value, 4) if isinstance(value, (float, np.floating)) and np.isfinite(value) else value)
+
+     print("Train loss:", round(train_loss, 4) if np.isfinite(train_loss) else train_loss)
+     print("Test loss:", round(test_loss, 4) if np.isfinite(test_loss) else test_loss)
+     print("Test perplexity:", round(test_perplexity, 4) if np.isfinite(test_perplexity) else test_perplexity)
+     print(metrics_df)
+
+     sample_text, sample_summary = generate_sample_summary(model, tokenizer, test_df, device)
+     print("Texto de entrada:", sample_text[:1200])
+     print("Resumen generado:", sample_summary)
+
+     if not args.no_demo:
+         demo = build_gradio_demo(model, tokenizer, device)
+         demo.launch(share=args.share, server_port=args.server_port)
+
+
+ if __name__ == "__main__":
+     main()
README.md CHANGED
@@ -1,17 +1,34 @@
- ---
- title: Summarization Spanish Text
- emoji: 💬
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 6.5.1
- app_file: app.py
- pinned: false
- hf_oauth: true
- hf_oauth_scopes:
- - inference-api
- license: apache-2.0
- short_description: AI agent to summarize Spanish texts
- ---
-
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
+ title: Resúmenes huggingface TECP
+ emoji: 👀
+ colorFrom: yellow
+ colorTo: green
+ sdk: gradio
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+
+ Model: This model is based on `josmunpen/mt5-small-spanish-summarization` and was fine-tuned on a subset of the `somosnlp/NoticIA-it` dataset to generate summaries in Spanish.
+ The model takes a long input text and produces a short Spanish summary that captures the main idea of the content.
+
+ Uses: The model is intended for educational demos and prototypes of automatic summarization of Spanish text, especially news and long articles.
+
+ Dataset: Fine-tuning used a subset of 256 examples from the training set. The dataset was split into training, validation, and test sets to evaluate the model on unseen data.
+
+ Test metrics: Results obtained after fine-tuning, evaluated on the test set:
+
+ - ROUGE-1 (approx.): 0.6236
+ - ROUGE-2 (approx.): 0.5829
+ - ROUGE-L (approx.): 0.6236
+ - Test loss: 4.0315
+ - Test perplexity: 56.3473
+
+ Limitations:
+
+ - Training used a small subset, so performance is not representative of a final, optimized version.
+ - ROUGE is computed with an approximate implementation based on token overlap, not with the official ROUGE library.
+ - The model may generate summaries that are too generic or that lose detail on long texts.
+ - Behavior depends heavily on the quality and length of the input text.
+ - No exhaustive validation process or hyperparameter search was carried out.
+
+
+
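The README notes that ROUGE is computed with an approximate token-overlap implementation rather than the official library. The `compute_metrics` function is not shown in this part of the diff, so the following is only a minimal sketch of one plausible way to do it: an n-gram overlap F1 over lowercased, whitespace-split tokens. The names `ngrams` and `rouge_n_approx` are hypothetical, and the tokenization is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_approx(reference: str, candidate: str, n: int = 1) -> float:
    """Approximate ROUGE-N as the F1 of clipped n-gram overlap.

    Assumption: tokens are lowercased, whitespace-separated words. No stemming
    or punctuation handling is applied, unlike the official ROUGE tooling.
    """
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not ref_counts or not cand_counts:
        return 0.0
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "el gato duerme en casa"
    candidate = "el gato come"
    print("ROUGE-1 approx.:", rouge_n_approx(reference, candidate, n=1))
    print("ROUGE-2 approx.:", rouge_n_approx(reference, candidate, n=2))
```

Scores from a sketch like this are not directly comparable to those from the official ROUGE implementation, which is why the README labels its figures as approximate.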
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ numpy
+ pandas
+ torch
+ datasets
+ scikit-learn
+ transformers
+ gradio
+ fsspec
+ pyarrow
+ sentencepiece