marmarg2 committed
Commit dc6999b · 1 Parent(s): 7f14c89

Upload 4 files

Files changed (4)
  1. BERT-mULT-t-MMG.ipynb +510 -0
  2. BETo-t-MMG.ipynb +482 -0
  3. Roberta-t-MMG.ipynb +486 -0
  4. Roberta-t-MMGb.ipynb +493 -0
BERT-mULT-t-MMG.ipynb ADDED
@@ -0,0 +1,510 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "976841dc",
+ "metadata": {},
+ "source": [
+ "## Preparing a dataset\n",
+ "\n",
+ "We download the dataset and prepare it for training. In this example we use toxic-teenage-relationships, a set of sentences describing whether a behavior is toxic or healthy. Each example has a text field and a label field, which is 1 if the behavior is toxic and 0 if it is not. The dataset holds 267 training examples and 66 test examples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "caf72aa3",
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'label': 1,\n",
+ " 'text': 'Mi amiga no puede subir videos a tik tok porque su pareja no le deja'}"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from datasets import load_dataset\n",
+ "data_files = {\"train\": \"train.csv\", \"test\": \"test.csv\"}\n",
+ "dataset = load_dataset(\"toxic-teenage-relationships\", data_files=data_files, sep=\";\")\n",
+ "dataset['train'][100]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08aacc14",
+ "metadata": {},
+ "source": [
+ "Once the dataset is loaded, we create a tokenizer to process the text, including a padding and truncation strategy. To process the dataset in a single pass, the dataset.map method is used to preprocess the whole dataset.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "4a854ead",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "from transformers import AutoTokenizer\n",
+ "# the model to use is BERT multilingual cased\n",
+ "tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')\n",
+ "\n",
+ "\n",
+ "def tokenize_function(examples):\n",
+ " return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
+ "\n",
+ "\n",
+ "tokenized_datasets = dataset.map(tokenize_function, batched=True)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "eb5477cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_dataset = tokenized_datasets[\"train\"]\n",
+ "eval_dataset = tokenized_datasets[\"test\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38a6c521",
+ "metadata": {},
+ "source": [
+ "## Fine-tuning with Trainer\n",
+ "\n",
+ "The Trainer class from Transformers is used to train transformer models. The Trainer API supports a range of training options and features such as logging, gradient accumulation, and mixed precision."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "843f218d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoModelForSequenceClassification\n",
+ "\n",
+ "# There are two categories, so we set 2 labels (0 healthy, 1 toxic)\n",
+ "model = AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=2)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "27be3c25",
+ "metadata": {},
+ "source": [
+ "## Training hyperparameters\n",
+ "\n",
+ "Now we create a TrainingArguments object, which contains all the hyperparameters that can be tuned. \n",
+ "We start with the default training hyperparameters, but we will have to adjust them to find the optimal configuration.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "7f84ef1e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# To avoid overfitting, I add the early-stopping callback, which stops training\n",
+ "# once the validation loss has not improved for two epochs\n",
+ "from transformers import EarlyStoppingCallback\n",
+ "early_stop = EarlyStoppingCallback(early_stopping_patience=2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "f53c992d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import TrainingArguments\n",
+ "from transformers import DataCollatorWithPadding\n",
+ "from transformers import AdamW\n",
+ "# To track the evaluation metrics during fine-tuning, we have the Trainer pick the best\n",
+ "# model at the end: load_best_model_at_end uses eval_loss for the comparison.\n",
+ "# For the loss value to count as the best metric, greater_is_better must be set to False.\n",
+ "# We set the number of epochs to 10 and the batch size to 8.\n",
+ "\n",
+ "training_args = TrainingArguments(output_dir=\"BERT-mULT-t-MMG\",\n",
+ " num_train_epochs=10,\n",
+ " per_device_train_batch_size=8,\n",
+ " per_device_eval_batch_size=8,\n",
+ " load_best_model_at_end=True,\n",
+ " greater_is_better=False,\n",
+ " evaluation_strategy=\"epoch\",\n",
+ " save_strategy=\"epoch\",\n",
+ " )\n",
+ "# The optimizer has to be passed in to the Trainer, so I create it here and set the learning rate\n",
+ "optimizer = AdamW(model.parameters(), lr=5e-5)\n",
+ "\n",
+ "# I add the data collator, which in this case will be part of the Trainer.\n",
+ "# This is the one suited to text-classification tasks: together with the tokenizer it groups\n",
+ "# and preprocesses the inputs so that every example in a batch has the same length,\n",
+ "# handling batching and building the attention masks.\n",
+ "# Having used .map it is not strictly necessary, but this way the additional text\n",
+ "# features are combined before being handed to the data collator.\n",
+ "data_collator = DataCollatorWithPadding(tokenizer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d604727",
+ "metadata": {},
+ "source": [
+ "## Metrics\n",
+ "\n",
+ "The Trainer does not evaluate performance automatically; it has to be given a function that computes and reports the metrics. The Datasets library provides an accuracy metric that can be loaded with load_metric. \n",
+ "scikit-learn has to be installed first"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "0ed3ddf4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: scikit-learn in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (1.3.0)\n",
+ "Requirement already satisfied: numpy>=1.17.3 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.24.3)\n",
+ "Requirement already satisfied: scipy>=1.5.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.10.1)\n",
+ "Requirement already satisfied: joblib>=1.1.1 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.3.1)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (3.2.0)\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "pip install scikit-learn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "326103f5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/tmp/ipykernel_3278833/2607597888.py:4: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate\n",
+ " metric = load_metric(\"accuracy\")\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "from datasets import load_metric\n",
+ "\n",
+ "metric = load_metric(\"accuracy\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "087d4b3e",
+ "metadata": {},
+ "source": [
+ "We define the compute_metrics function to calculate the accuracy of the predictions. Before passing them to compute, the logits have to be converted into class predictions (Transformers models return logits)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "d7b8341d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def compute_metrics(eval_pred):\n",
+ " logits, labels = eval_pred\n",
+ " predictions = np.argmax(logits, axis=-1)\n",
+ " return metric.compute(predictions=predictions, references=labels)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53db268c",
+ "metadata": {},
+ "source": [
+ "## Trainer\n",
+ "\n",
+ "Now it is time to create the Trainer object with the model, the training arguments, the train and test datasets, and the evaluation function:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "d566aded",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import Trainer\n",
+ "trainer = Trainer(\n",
+ " model=model,\n",
+ " args=training_args,\n",
+ " train_dataset=train_dataset,\n",
+ " eval_dataset=eval_dataset,\n",
+ " data_collator=data_collator,\n",
+ " optimizers=(optimizer, None),\n",
+ " compute_metrics=compute_metrics,\n",
+ " callbacks=[early_stop],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a31780ca",
+ "metadata": {},
+ "source": [
+ "Fine-tuning is then run with train"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "3e01c5fb",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='136' max='340' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [136/340 03:05 < 04:42, 0.72 it/s, Epoch 4/10]\n",
+ " </div>\n",
+ " <table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: left;\">\n",
+ " <th>Epoch</th>\n",
+ " <th>Training Loss</th>\n",
+ " <th>Validation Loss</th>\n",
+ " <th>Accuracy</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <td>1</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.720169</td>\n",
+ " <td>0.545455</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>2</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.585052</td>\n",
+ " <td>0.651515</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>3</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.745457</td>\n",
+ " <td>0.742424</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>4</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.674149</td>\n",
+ " <td>0.696970</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table><p>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=136, training_loss=0.5572038538315717, metrics={'train_runtime': 186.6491, 'train_samples_per_second': 14.358, 'train_steps_per_second': 1.822, 'total_flos': 282055051345920.0, 'train_loss': 0.5572038538315717, 'epoch': 4.0})"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "417d3cd2",
+ "metadata": {},
+ "source": [
+ "I print the loss and accuracy for both the train set and the test set"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "d1144002",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='43' max='34' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [34/34 00:12]\n",
+ " </div>\n",
+ " "
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Train set results\n",
+ "eval_loss. 0.44578275084495544\n",
+ "eval_accuracy. 0.8246268656716418\n",
+ "eval_runtime. 9.9218\n",
+ "eval_samples_per_second. 27.011\n",
+ "eval_steps_per_second. 3.427\n",
+ "epoch. 4.0\n",
+ "Test set results\n",
+ "eval_loss. 0.5850518345832825\n",
+ "eval_accuracy. 0.6515151515151515\n",
+ "eval_runtime. 2.4444\n",
+ "eval_samples_per_second. 27.0\n",
+ "eval_steps_per_second. 3.682\n",
+ "epoch. 4.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "# I create a function to print the results in a more readable way\n",
+ "def print_results(title, results):\n",
+ " print(title)\n",
+ " for key, value in results.items():\n",
+ " print(f\"{key}. {value}\")\n",
+ " \n",
+ "train_result = trainer.evaluate(train_dataset)\n",
+ "print_results(\"Train set results\", train_result)\n",
+ "eval_result = trainer.evaluate(eval_dataset)\n",
+ "print_results(\"Test set results\", eval_result)\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e61a040",
+ "metadata": {},
+ "source": [
+ "# Saving the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4af06209",
+ "metadata": {},
+ "source": [
+ "To save it, we use the save_model method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "b93638cb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer.save_model()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "973c4e03",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer.create_model_card()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9671b67c",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
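
The FutureWarning recorded above notes that datasets.load_metric is deprecated in favor of the 🤗 Evaluate library. A minimal sketch of the equivalent metric setup, assuming evaluate is installed (pip install evaluate); it mirrors the compute_metrics cell of the notebook:

    import numpy as np
    import evaluate

    # evaluate.load replaces the deprecated datasets.load_metric
    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        # Transformers models return logits; argmax turns them into class predictions
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)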
BETo-t-MMG.ipynb ADDED
@@ -0,0 +1,482 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "976841dc",
+ "metadata": {},
+ "source": [
+ "## Preparing a dataset\n",
+ "\n",
+ "We download the dataset and prepare it for training. In this example we use toxic-teenage-relationships, a set of sentences describing whether a behavior is toxic or healthy. Each example has a text field and a label field, which is 1 if the behavior is toxic and 0 if it is not. The dataset holds 267 training examples and 66 test examples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "caf72aa3",
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'label': 1, 'text': 'Me mira mal por mi forma de vestir'}"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from datasets import load_dataset\n",
+ "data_files = {\"train\": \"train.csv\", \"test\": \"test.csv\"}\n",
+ "dataset = load_dataset(\"toxic-teenage-relationships\", data_files=data_files, sep=\";\")\n",
+ "dataset['train'][102]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08aacc14",
+ "metadata": {},
+ "source": [
+ "Once the dataset is loaded, we create a tokenizer to process the text, including a padding and truncation strategy. To process the dataset in a single pass, the dataset.map method is used to preprocess the whole dataset.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "4a854ead",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "from transformers import AutoTokenizer\n",
+ "# the model to use is BETo\n",
+ "tokenizer = AutoTokenizer.from_pretrained(\"dccuchile/bert-base-spanish-wwm-cased\")\n",
+ "\n",
+ "\n",
+ "def tokenize_function(examples):\n",
+ " return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
+ "\n",
+ "\n",
+ "tokenized_datasets = dataset.map(tokenize_function, batched=True)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "eb5477cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_dataset = tokenized_datasets[\"train\"]\n",
+ "eval_dataset = tokenized_datasets[\"test\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38a6c521",
+ "metadata": {},
+ "source": [
+ "## Fine-tuning with Trainer\n",
+ "\n",
+ "The Trainer class from Transformers is used to train transformer models. The Trainer API supports a range of training options and features such as logging, gradient accumulation, and mixed precision."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "843f218d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['classifier.bias', 'bert.pooler.dense.weight', 'classifier.weight', 'bert.pooler.dense.bias']\n",
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoModelForSequenceClassification\n",
+ "\n",
+ "# There are two categories, so we set 2 labels (0 healthy, 1 toxic)\n",
+ "model = AutoModelForSequenceClassification.from_pretrained(\"dccuchile/bert-base-spanish-wwm-cased\", num_labels=2)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "27be3c25",
+ "metadata": {},
+ "source": [
+ "## Training hyperparameters\n",
+ "\n",
+ "Now we create a TrainingArguments object, which contains all the hyperparameters that can be tuned. \n",
+ "We start with the default training hyperparameters, but we will have to adjust them to find the optimal configuration.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "7f84ef1e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# To avoid overfitting, I add the early-stopping callback, which stops training\n",
+ "# once the validation loss has not improved for two epochs\n",
+ "from transformers import EarlyStoppingCallback\n",
+ "early_stop = EarlyStoppingCallback(early_stopping_patience=2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "f53c992d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import TrainingArguments\n",
+ "from transformers import DataCollatorWithPadding\n",
+ "from transformers import AdamW\n",
+ "# To track the evaluation metrics during fine-tuning, we have the Trainer pick the best\n",
+ "# model at the end: load_best_model_at_end uses eval_loss for the comparison.\n",
+ "# For the loss value to count as the best metric, greater_is_better must be set to False.\n",
+ "# We set the number of epochs to 10 and the batch size to 8.\n",
+ "\n",
+ "training_args = TrainingArguments(output_dir=\"BETo-t-MMG\",\n",
+ " num_train_epochs=10,\n",
+ " per_device_train_batch_size=8,\n",
+ " per_device_eval_batch_size=8,\n",
+ " load_best_model_at_end=True,\n",
+ " greater_is_better=False,\n",
+ " evaluation_strategy=\"epoch\",\n",
+ " save_strategy=\"epoch\")\n",
+ "# optimizer\n",
+ "optimizer = AdamW(model.parameters(), lr=5e-5)\n",
+ "# I add the data collator, which in this case will be part of the Trainer.\n",
+ "# This is the one suited to text-classification tasks: together with the tokenizer it groups\n",
+ "# and preprocesses the inputs so that every example in a batch has the same length,\n",
+ "# handling batching and building the attention masks.\n",
+ "# Having used .map it is not strictly necessary, but this way the additional text\n",
+ "# features are combined before being handed to the data collator.\n",
+ "data_collator = DataCollatorWithPadding(tokenizer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d604727",
+ "metadata": {},
+ "source": [
+ "## Metrics\n",
+ "\n",
+ "The Trainer does not evaluate performance automatically; it has to be given a function that computes and reports the metrics. The Datasets library provides an accuracy metric that can be loaded with load_metric. \n",
+ "scikit-learn has to be installed first"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "0ed3ddf4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: scikit-learn in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (1.3.0)\n",
+ "Requirement already satisfied: numpy>=1.17.3 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.24.3)\n",
+ "Requirement already satisfied: scipy>=1.5.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.10.1)\n",
+ "Requirement already satisfied: joblib>=1.1.1 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.3.1)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (3.2.0)\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "pip install scikit-learn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "326103f5",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/tmp/ipykernel_3270586/2607597888.py:4: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate\n",
+ " metric = load_metric(\"accuracy\")\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "from datasets import load_metric\n",
+ "\n",
+ "metric = load_metric(\"accuracy\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "087d4b3e",
+ "metadata": {},
+ "source": [
+ "We define the compute_metrics function to calculate the accuracy of the predictions. Before passing them to compute, the logits have to be converted into class predictions (Transformers models return logits)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "d7b8341d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def compute_metrics(eval_pred):\n",
+ " logits, labels = eval_pred\n",
+ " predictions = np.argmax(logits, axis=-1)\n",
+ " return metric.compute(predictions=predictions, references=labels)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53db268c",
+ "metadata": {},
+ "source": [
+ "## Trainer\n",
+ "\n",
+ "Now it is time to create the Trainer object with the model, the training arguments, the train and test datasets, and the evaluation function:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "d566aded",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import Trainer\n",
+ "trainer = Trainer(\n",
+ " model=model,\n",
+ " args=training_args,\n",
+ " train_dataset=train_dataset,\n",
+ " eval_dataset=eval_dataset,\n",
+ " data_collator=data_collator,\n",
+ " optimizers=(optimizer, None),\n",
+ " compute_metrics=compute_metrics,\n",
+ " callbacks=[early_stop],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a31780ca",
+ "metadata": {},
+ "source": [
+ "Fine-tuning is then run with train"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "3e01c5fb",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='102' max='340' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [102/340 01:27 < 03:27, 1.15 it/s, Epoch 3/10]\n",
+ " </div>\n",
+ " <table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: left;\">\n",
+ " <th>Epoch</th>\n",
+ " <th>Training Loss</th>\n",
+ " <th>Validation Loss</th>\n",
+ " <th>Accuracy</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <td>1</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.459866</td>\n",
+ " <td>0.803030</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>2</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.649665</td>\n",
+ " <td>0.848485</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>3</td>\n",
+ " <td>No log</td>\n",
+ " <td>1.026334</td>\n",
+ " <td>0.787879</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table><p>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=102, training_loss=0.33487387264476104, metrics={'train_runtime': 88.4219, 'train_samples_per_second': 30.309, 'train_steps_per_second': 3.845, 'total_flos': 211541288509440.0, 'train_loss': 0.33487387264476104, 'epoch': 3.0})"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "417d3cd2",
+ "metadata": {},
+ "source": [
+ "I print the loss and accuracy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "d1144002",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Train set results\n",
+ "eval_loss. 0.19221480190753937\n",
+ "eval_accuracy. 0.9440298507462687\n",
+ "eval_runtime. 9.8909\n",
+ "eval_samples_per_second. 27.095\n",
+ "eval_steps_per_second. 3.437\n",
+ "epoch. 3.0\n",
+ "Test set results\n",
+ "eval_loss. 0.4598655700683594\n",
+ "eval_accuracy. 0.803030303030303\n",
+ "eval_runtime. 2.4345\n",
+ "eval_samples_per_second. 27.11\n",
+ "eval_steps_per_second. 3.697\n",
+ "epoch. 3.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "# I create a function to print the results in a more readable way\n",
+ "def print_results(title, results):\n",
+ " print(title)\n",
+ " for key, value in results.items():\n",
+ " print(f\"{key}. {value}\")\n",
+ " \n",
+ "train_result = trainer.evaluate(train_dataset)\n",
+ "print_results(\"Train set results\", train_result)\n",
+ "eval_result = trainer.evaluate(eval_dataset)\n",
+ "print_results(\"Test set results\", eval_result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e61a040",
+ "metadata": {},
+ "source": [
+ "# Saving the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4af06209",
+ "metadata": {},
+ "source": [
+ "To save it, we use the save_model method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "b93638cb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer.save_model()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "973c4e03",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer.create_model_card()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9671b67c",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
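
The AdamW FutureWarning recorded in this notebook recommends the PyTorch implementation. A minimal sketch of the swap, assuming the model, training_args, tokenized datasets, data_collator, compute_metrics, and early_stop defined in the cells above; nothing else in the Trainer setup changes:

    from torch.optim import AdamW  # PyTorch AdamW, replacing the deprecated transformers.AdamW
    from transformers import Trainer

    optimizer = AdamW(model.parameters(), lr=5e-5)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        optimizers=(optimizer, None),  # (optimizer, lr_scheduler); None keeps the default schedule
        compute_metrics=compute_metrics,
        callbacks=[early_stop],
    )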
Roberta-t-MMG.ipynb ADDED
@@ -0,0 +1,486 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "976841dc",
+ "metadata": {},
+ "source": [
+ "## Preparing a dataset\n",
+ "\n",
+ "We download the dataset and prepare it for training. In this example we use toxic-teenage-relationships, a set of sentences describing whether a behavior is toxic or healthy. Each example has a text field and a label field, which is 1 if the behavior is toxic and 0 if it is not. The dataset holds 267 training examples and 66 test examples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "caf72aa3",
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'label': 1, 'text': 'Me mira mal por mi forma de vestir'}"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from datasets import load_dataset\n",
+ "data_files = {\"train\": \"train.csv\", \"test\": \"test.csv\"}\n",
+ "dataset = load_dataset(\"toxic-teenage-relationships\", data_files=data_files, sep=\";\")\n",
+ "dataset['train'][102]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08aacc14",
+ "metadata": {},
+ "source": [
+ "Once the dataset is loaded, we create a tokenizer to process the text, including a padding and truncation strategy. To process the dataset in a single pass, the dataset.map method is used to preprocess the whole dataset.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "4a854ead",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# RoBERTa has its own Tokenizer class\n",
+ "#from transformers import AutoTokenizer\n",
+ "from transformers import RobertaTokenizer\n",
+ "# the model to use is RoBERTa\n",
+ "tokenizer = RobertaTokenizer.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\")\n",
+ "\n",
+ "\n",
+ "def tokenize_function(examples):\n",
+ " return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
+ "\n",
+ "\n",
+ "tokenized_datasets = dataset.map(tokenize_function, batched=True)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "eb5477cc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_dataset = tokenized_datasets[\"train\"]\n",
+ "eval_dataset = tokenized_datasets[\"test\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38a6c521",
+ "metadata": {},
+ "source": [
+ "## Fine-tuning with Trainer\n",
+ "\n",
+ "The Trainer class from Transformers is used to train transformer models. The Trainer API supports a range of training options and features such as logging, gradient accumulation, and mixed precision."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "843f218d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-bne and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']\n",
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+ ]
+ }
+ ],
+ "source": [
+ "#from transformers import AutoModelForSequenceClassification\n",
+ "# it also has its own class for the classification head\n",
+ "# There are two categories, so we set 2 labels (0 healthy, 1 toxic)\n",
+ "#model = AutoModelForSequenceClassification.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\", num_labels=2)\n",
+ "from transformers import RobertaForSequenceClassification\n",
+ "model = RobertaForSequenceClassification.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\", num_labels=2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "27be3c25",
+ "metadata": {},
+ "source": [
+ "## Training hyperparameters\n",
+ "\n",
+ "Now we create a TrainingArguments object, which contains all the hyperparameters that can be tuned. \n",
+ "We start with the default training hyperparameters, but we will have to adjust them to find the optimal configuration.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "7f84ef1e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# To avoid overfitting, I add the early-stopping callback, which stops training\n",
+ "# once the validation loss has not improved for two epochs\n",
+ "from transformers import EarlyStoppingCallback\n",
+ "early_stop = EarlyStoppingCallback(early_stopping_patience=2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "f53c992d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import TrainingArguments\n",
+ "from transformers import DataCollatorWithPadding, AdamW\n",
+ "# To track the evaluation metrics during fine-tuning, we have the Trainer pick the best\n",
+ "# model at the end: load_best_model_at_end uses eval_loss for the comparison.\n",
+ "# For the loss value to count as the best metric, greater_is_better must be set to False.\n",
+ "# We set the number of epochs to 10 and the batch size to 8.\n",
+ "\n",
+ "training_args = TrainingArguments(output_dir=\"RoBERTa-t-MMG\",\n",
+ " num_train_epochs=10,\n",
+ " per_device_train_batch_size=8,\n",
+ " per_device_eval_batch_size=8,\n",
+ " load_best_model_at_end=True,\n",
+ " greater_is_better=False,\n",
+ " evaluation_strategy=\"epoch\",\n",
+ " save_strategy=\"epoch\")\n",
+ "# optimizer\n",
+ "optimizer = AdamW(model.parameters(), lr=5e-5)\n",
+ "# I add the data collator, which in this case will be part of the Trainer.\n",
+ "# This is the one suited to text-classification tasks: together with the tokenizer it groups\n",
+ "# and preprocesses the inputs so that every example in a batch has the same length,\n",
+ "# handling batching and building the attention masks.\n",
+ "# Having used .map it is not strictly necessary, but this way the additional text\n",
+ "# features are combined before being handed to the data collator.\n",
+ "data_collator = DataCollatorWithPadding(tokenizer)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d604727",
+ "metadata": {},
+ "source": [
+ "## Metrics\n",
+ "\n",
+ "The Trainer does not evaluate performance automatically; it has to be given a function that computes and reports the metrics. The Datasets library provides an accuracy metric that can be loaded with load_metric. \n",
+ "scikit-learn has to be installed first"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "0ed3ddf4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Requirement already satisfied: scikit-learn in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (1.3.0)\n",
+ "Requirement already satisfied: numpy>=1.17.3 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.24.3)\n",
+ "Requirement already satisfied: scipy>=1.5.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.10.1)\n",
+ "Requirement already satisfied: joblib>=1.1.1 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.3.1)\n",
+ "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (3.2.0)\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "pip install scikit-learn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "326103f5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from datasets import load_metric\n",
+ "\n",
+ "metric = load_metric(\"accuracy\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "087d4b3e",
+ "metadata": {},
+ "source": [
+ "We define the compute_metrics function to calculate the accuracy of the predictions. Before passing them to compute, the logits have to be converted into class predictions (Transformers models return logits)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "d7b8341d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def compute_metrics(eval_pred):\n",
+ " logits, labels = eval_pred\n",
+ " predictions = np.argmax(logits, axis=-1)\n",
+ " return metric.compute(predictions=predictions, references=labels)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53db268c",
+ "metadata": {},
+ "source": [
+ "## Trainer\n",
+ "\n",
+ "Now it is time to create the Trainer object with the model, the training arguments, the train and test datasets, and the evaluation function:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "d566aded",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import Trainer\n",
+ "trainer = Trainer(\n",
+ " model=model,\n",
+ " args=training_args,\n",
+ " train_dataset=train_dataset,\n",
+ " eval_dataset=eval_dataset,\n",
+ " optimizers=(optimizer, None),\n",
+ " data_collator=data_collator,\n",
+ " compute_metrics=compute_metrics,\n",
+ " callbacks=[early_stop],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a31780ca",
+ "metadata": {},
+ "source": [
+ "Fine-tuning is then run with train"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "3e01c5fb",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='102' max='340' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [102/340 01:24 < 03:22, 1.18 it/s, Epoch 3/10]\n",
+ " </div>\n",
+ " <table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: left;\">\n",
+ " <th>Epoch</th>\n",
+ " <th>Training Loss</th>\n",
+ " <th>Validation Loss</th>\n",
+ " <th>Accuracy</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <td>1</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.419906</td>\n",
+ " <td>0.818182</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>2</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.541695</td>\n",
+ " <td>0.818182</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <td>3</td>\n",
+ " <td>No log</td>\n",
+ " <td>0.485065</td>\n",
+ " <td>0.878788</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table><p>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "TrainOutput(global_step=102, training_loss=0.3791993459065755, metrics={'train_runtime': 86.1528, 'train_samples_per_second': 31.108, 'train_steps_per_second': 3.946, 'total_flos': 211541288509440.0, 'train_loss': 0.3791993459065755, 'epoch': 3.0})"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "417d3cd2",
+ "metadata": {},
+ "source": [
+ "I print the loss and accuracy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "d1144002",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='43' max='34' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [34/34 00:11]\n",
+ " </div>\n",
+ " "
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Train set results\n",
+ "eval_loss. 0.2099413126707077\n",
+ "eval_accuracy. 0.9402985074626866\n",
+ "eval_runtime. 9.7123\n",
+ "eval_samples_per_second. 27.594\n",
+ "eval_steps_per_second. 3.501\n",
+ "epoch. 3.0\n",
+ "Test set results\n",
+ "eval_loss. 0.41990572214126587\n",
+ "eval_accuracy. 0.8181818181818182\n",
+ "eval_runtime. 2.3764\n",
+ "eval_samples_per_second. 27.774\n",
+ "eval_steps_per_second. 3.787\n",
+ "epoch. 3.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "# I create a function to print the results in a more readable way\n",
+ "def print_results(title, results):\n",
+ " print(title)\n",
+ " for key, value in results.items():\n",
+ " print(f\"{key}. {value}\")\n",
+ " \n",
+ "train_result = trainer.evaluate(train_dataset)\n",
+ "print_results(\"Train set results\", train_result)\n",
+ "eval_result = trainer.evaluate(eval_dataset)\n",
+ "print_results(\"Test set results\", eval_result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e61a040",
+ "metadata": {},
+ "source": [
+ "# Saving the model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4af06209",
+ "metadata": {},
+ "source": [
+ "To save it, we use the save_model method"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "b93638cb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer.save_model()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "973c4e03",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "trainer.create_model_card()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9671b67c",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
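
trainer.save_model() writes the fine-tuned weights to the output_dir. A minimal inference sketch, assuming the "RoBERTa-t-MMG" directory produced above; since the Trainer was not given the tokenizer, it is loaded from the original checkpoint, and the label names shown are the defaults (LABEL_0 = healthy, LABEL_1 = toxic) because no id2label mapping was configured:

    from transformers import pipeline

    # load the fine-tuned checkpoint saved by trainer.save_model()
    classifier = pipeline("text-classification",
                          model="RoBERTa-t-MMG",
                          tokenizer="PlanTL-GOB-ES/roberta-base-bne")

    print(classifier("Me mira mal por mi forma de vestir"))
    # e.g. [{'label': 'LABEL_1', 'score': ...}]  -- illustrative output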
Roberta-t-MMGb.ipynb ADDED
@@ -0,0 +1,493 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "976841dc",
6
+ "metadata": {},
7
+ "source": [
8
+ "## Preparación de un dataset\n",
9
+ "\n",
10
+ "Descargamos el dataset y lo preparamos para el entrenamiento. En el caso de ejemplo, usaremos toxic-teenage-relationships, que son frases que describen si un comporamiento es tóxico o sano. Tienen una campo de texto y un campo de etiqueta, que vale 1 si es tóxico y 0 si no lo es. Acumula 267 ejemplos de entrenamiento y 66 para testear."
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "code",
15
+ "execution_count": 1,
16
+ "id": "caf72aa3",
17
+ "metadata": {
18
+ "scrolled": false
19
+ },
20
+ "outputs": [
21
+ {
22
+ "data": {
23
+ "text/plain": [
24
+ "{'label': 1, 'text': 'Me mira mal por mi forma de vestir'}"
25
+ ]
26
+ },
27
+ "execution_count": 1,
28
+ "metadata": {},
29
+ "output_type": "execute_result"
30
+ }
31
+ ],
32
+ "source": [
33
+ "from datasets import load_dataset\n",
34
+ "data_files = {\"train\": \"train.csv\", \"test\": \"test.csv\"}\n",
35
+ "dataset = load_dataset(\"toxic-teenage-relationships\", data_files=data_files, sep=\";\")\n",
36
+ "dataset['train'][102]"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "markdown",
41
+ "id": "08aacc14",
42
+ "metadata": {},
43
+ "source": [
44
+ "Una vez cargado el dataset, se crea un tokenizador para procesar el texto e incluir una estrategia para el padding y el truncamiento. Par poder procesar el dataset en un solo paso, se utiliza el método dataset.map para preprocesar todo el dataset.\n",
45
+ "\n"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "code",
50
+ "execution_count": 2,
51
+ "id": "4a854ead",
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": [
55
+ "\n",
56
+ "from transformers import AutoTokenizer\n",
57
+ "#el modelo a utilizar es RoBERTa\n",
58
+ "tokenizer = AutoTokenizer.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\")\n",
59
+ "\n",
60
+ "\n",
61
+ "def tokenize_function(examples):\n",
62
+ " return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
63
+ "\n",
64
+ "\n",
65
+ "tokenized_datasets = dataset.map(tokenize_function, batched=True)\n"
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "code",
70
+ "execution_count": 3,
71
+ "id": "eb5477cc",
72
+ "metadata": {},
73
+ "outputs": [],
74
+ "source": [
75
+ "train_dataset = tokenized_datasets[\"train\"]\n",
76
+ "eval_dataset = tokenized_datasets[\"test\"]"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "id": "38a6c521",
82
+ "metadata": {},
83
+ "source": [
84
+ "## Fine-tuning usando Trainer\n",
85
+ "\n",
86
+ "La clase trainer de Transformers permite entrenar modelos de transformers. La API del Trainer soporta varias opciones de entrenamiento y características como logging, gradient accumulation y mixed preccision"
87
+ ]
88
+ },
89
+ {
90
+ "cell_type": "code",
91
+ "execution_count": 4,
92
+ "id": "843f218d",
93
+ "metadata": {},
94
+ "outputs": [
95
+ {
96
+ "name": "stderr",
97
+ "output_type": "stream",
98
+ "text": [
99
+ "Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-bne and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']\n",
100
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
101
+ ]
102
+ }
103
+ ],
104
+ "source": [
105
+ "from transformers import AutoModelForSequenceClassification\n",
106
+ "\n",
107
+ "#Hay dos categorías, así que ponemos 2 etiquetas (0 sano 1 tóxico)\n",
108
+ "model = AutoModelForSequenceClassification.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\", num_labels=2)\n"
109
+ ]
110
+ },
  {
   "cell_type": "markdown",
   "id": "27be3c25",
   "metadata": {},
   "source": [
    "## Training hyperparameters\n",
    "\n",
    "Now we create a TrainingArguments class that contains all the hyperparameters that can be tuned. \n",
    "We start with the default training hyperparameters, but we will have to adjust them to find the optimal configuration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7f84ef1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# To avoid overfitting, I add the EarlyStoppingCallback class so that training stops\n",
    "# as soon as the validation loss has increased for two consecutive epochs\n",
    "from transformers import EarlyStoppingCallback\n",
    "early_stop = EarlyStoppingCallback(early_stopping_patience=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f53c992d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from transformers import TrainingArguments\n",
    "from transformers import DataCollatorWithPadding, AdamW\n",
    "# to track evaluation metrics during fine-tuning, we make the Trainer pick the best\n",
    "# model at the end: load_best_model_at_end selects by eval_loss, and since a lower\n",
    "# loss is better, greater_is_better is set to False.\n",
    "# we set the number of epochs to 10 and the batch size to 8\n",
    "\n",
    "training_args = TrainingArguments(output_dir=\"RoBERTa-t-MMGb\",\n",
    "                                  num_train_epochs=10,\n",
    "                                  per_device_train_batch_size=8,\n",
    "                                  per_device_eval_batch_size=8,\n",
    "                                  load_best_model_at_end=True,\n",
    "                                  greater_is_better=False,\n",
    "                                  evaluation_strategy=\"epoch\",\n",
    "                                  save_strategy=\"epoch\")\n",
    "# optimizer\n",
    "optimizer = AdamW(model.parameters(), lr=5e-5)\n",
    "\n",
    "# I add the data collator, which in this case will be part of the Trainer.\n",
    "# DataCollatorWithPadding is the one intended for text-classification tasks: together\n",
    "# with the tokenizer it groups the input examples into batches and pads them so that\n",
    "# every example in a batch has the same length, building the attention masks as well.\n",
    "# Since the .map function was already used, this is not strictly necessary, but it\n",
    "# lets the additional text features be combined before being handed to the collator.\n",
    "data_collator = DataCollatorWithPadding(tokenizer)"
   ]
  },
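  {
   "cell_type": "markdown",
   "id": "a7b8c9d0",
   "metadata": {},
   "source": [
    "To make the collator's role concrete, here is a small hedged sketch: it batches four tokenized examples by hand. Filtering the keys is an assumption made here so that the raw string column \"text\" is not passed in, since DataCollatorWithPadding only handles tokenizer outputs and labels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8c9d0e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Take four tokenized examples and keep only the tensor-friendly fields\n",
    "features = [\n",
    "    {k: tokenized_datasets[\"train\"][i][k] for k in [\"input_ids\", \"attention_mask\", \"label\"]}\n",
    "    for i in range(4)\n",
    "]\n",
    "batch = data_collator(features)\n",
    "# Every tensor in the batch shares the same sequence length\n",
    "print({k: tuple(v.shape) for k, v in batch.items()})"
   ]
  },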
  {
   "cell_type": "markdown",
   "id": "6d604727",
   "metadata": {},
   "source": [
    "## Metrics\n",
    "\n",
    "The Trainer does not evaluate performance automatically; you have to pass it a function that computes and reports the metrics. Datasets provides an accuracy function that can be loaded with load_metric. \n",
    "First, scikit-learn has to be installed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0ed3ddf4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: scikit-learn in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (1.3.0)\n",
      "Requirement already satisfied: numpy>=1.17.3 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.24.3)\n",
      "Requirement already satisfied: scipy>=1.5.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.10.1)\n",
      "Requirement already satisfied: joblib>=1.1.1 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (1.3.1)\n",
      "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/mmartinez/anaconda3/envs/TFM/lib/python3.8/site-packages (from scikit-learn) (3.2.0)\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "pip install scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "326103f5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_3329828/2607597888.py:4: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate\n",
      "  metric = load_metric(\"accuracy\")\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from datasets import load_metric\n",
    "\n",
    "metric = load_metric(\"accuracy\")"
   ]
  },
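  {
   "cell_type": "markdown",
   "id": "c9d0e1f2",
   "metadata": {},
   "source": [
    "As the FutureWarning above notes, load_metric is deprecated. A hedged drop-in alternative uses the 🤗 Evaluate library; it assumes the evaluate package is installed (pip install evaluate) and returns an object with the same compute interface.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0e1f2a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Non-deprecated equivalent of load_metric(\"accuracy\")\n",
    "import evaluate\n",
    "metric = evaluate.load(\"accuracy\")"
   ]
  },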
  {
   "cell_type": "markdown",
   "id": "087d4b3e",
   "metadata": {},
   "source": [
    "We define the compute_metrics function to calculate the accuracy of the predictions made. Before passing the predictions to compute, the logits have to be converted into predictions (Transformers models return logits)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d7b8341d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_metrics(eval_pred):\n",
    "    logits, labels = eval_pred\n",
    "    predictions = np.argmax(logits, axis=-1)\n",
    "    return metric.compute(predictions=predictions, references=labels)"
   ]
  },
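  {
   "cell_type": "markdown",
   "id": "e1f2a3b4",
   "metadata": {},
   "source": [
    "A tiny worked example with made-up logits, just for illustration: argmax over the class axis turns each row of logits into a predicted label before the accuracy is computed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2a3b4c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "toy_logits = np.array([[2.0, -1.0], [0.1, 0.3]])  # two examples, two classes\n",
    "toy_labels = np.array([0, 1])\n",
    "# argmax gives predictions [0, 1], so the accuracy here is 1.0\n",
    "compute_metrics((toy_logits, toy_labels))"
   ]
  },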
  {
   "cell_type": "markdown",
   "id": "53db268c",
   "metadata": {},
   "source": [
    "## Trainer\n",
    "\n",
    "Now it is time to create the Trainer object with the model, the training arguments, the training and test datasets, and the evaluation function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d566aded",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import Trainer\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=train_dataset,\n",
    "    eval_dataset=eval_dataset,\n",
    "    optimizers=(optimizer, None),\n",
    "    data_collator=data_collator,\n",
    "    compute_metrics=compute_metrics,\n",
    "    callbacks=[early_stop],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a31780ca",
   "metadata": {},
   "source": [
    "And fine-tuning is applied with train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "3e01c5fb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='102' max='340' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [102/340 01:24 < 03:22, 1.18 it/s, Epoch 3/10]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>Epoch</th>\n",
       "      <th>Training Loss</th>\n",
       "      <th>Validation Loss</th>\n",
       "      <th>Accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.388526</td>\n",
       "      <td>0.803030</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.600745</td>\n",
       "      <td>0.818182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.712544</td>\n",
       "      <td>0.848485</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "TrainOutput(global_step=102, training_loss=0.3626815197514553, metrics={'train_runtime': 85.6313, 'train_samples_per_second': 31.297, 'train_steps_per_second': 3.971, 'total_flos': 211541288509440.0, 'train_loss': 0.3626815197514553, 'epoch': 3.0})"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "417d3cd2",
   "metadata": {},
   "source": [
    "Training stopped after epoch 3: the validation loss rose for two consecutive epochs, so the early-stopping callback fired and, with load_best_model_at_end, the epoch-1 checkpoint (the lowest validation loss) was restored. I print the loss and the accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d1144002",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='43' max='34' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [34/34 00:11]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resultados del conjunto de train\n",
      "eval_loss. 0.1909465789794922\n",
      "eval_accuracy. 0.9365671641791045\n",
      "eval_runtime. 9.8021\n",
      "eval_samples_per_second. 27.341\n",
      "eval_steps_per_second. 3.469\n",
      "epoch. 3.0\n",
      "Resultados del conjunto de test\n",
      "eval_loss. 0.38852614164352417\n",
      "eval_accuracy. 0.803030303030303\n",
      "eval_runtime. 2.4096\n",
      "eval_samples_per_second. 27.391\n",
      "eval_steps_per_second. 3.735\n",
      "epoch. 3.0\n"
     ]
    }
   ],
   "source": [
    "# I create a function to print the results in a more readable way\n",
    "def print_results(title, results):\n",
    "    print(title)\n",
    "    for key, value in results.items():\n",
    "        print(f\"{key}. {value}\")\n",
    "\n",
    "train_result = trainer.evaluate(train_dataset)\n",
    "print_results(\"Resultados del conjunto de train\", train_result)\n",
    "eval_result = trainer.evaluate(eval_dataset)\n",
    "print_results(\"Resultados del conjunto de test\", eval_result)"
   ]
  },
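  {
   "cell_type": "markdown",
   "id": "a3b4c5d6",
   "metadata": {},
   "source": [
    "A hedged usage sketch: classify a new sentence with the fine-tuned model through the pipeline API. The example sentence is invented for illustration; since id2label was not customized, the pipeline reports LABEL_0 (healthy) or LABEL_1 (toxic).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4c5d6e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import pipeline\n",
    "\n",
    "# Build a text-classification pipeline around the in-memory model and tokenizer\n",
    "clf = pipeline(\"text-classification\", model=model, tokenizer=tokenizer)\n",
    "# Hypothetical input, invented for illustration\n",
    "clf(\"No le deja salir con sus amigas\")"
   ]
  },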
  {
   "cell_type": "markdown",
   "id": "9e61a040",
   "metadata": {},
   "source": [
    "# Saving the model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4af06209",
   "metadata": {},
   "source": [
    "To save it, we use the save_model method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b93638cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.save_model()"
   ]
  },
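  {
   "cell_type": "markdown",
   "id": "c5d6e7f8",
   "metadata": {},
   "source": [
    "A hedged sketch of reloading the saved model: save_model writes the weights to the output_dir set above (RoBERTa-t-MMGb). Since the tokenizer was not passed to the Trainer, it is assumed here that it must be saved explicitly so that both can be loaded back from the same directory.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6e7f8a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Persist the tokenizer next to the model weights, then reload both\n",
    "tokenizer.save_pretrained(\"RoBERTa-t-MMGb\")\n",
    "reloaded_model = AutoModelForSequenceClassification.from_pretrained(\"RoBERTa-t-MMGb\")\n",
    "reloaded_tokenizer = AutoTokenizer.from_pretrained(\"RoBERTa-t-MMGb\")"
   ]
  },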
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "973c4e03",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.create_model_card()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9671b67c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}