{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "bMYkkVla0zjn" }, "source": [ "# Project: Fine-Tuning and Deploying a Transformer Model\n", "\n", "**General Instructions:**\n", "In this project you will choose a business or research problem that involves natural language processing (NLP). Examples include classifying e-commerce reviews, spam detection, sentiment analysis, or summarizing financial news.\n", "\n", "**Expected deliverables:**\n", "1. **Dataset:** Select and load a dataset (your own or from Hugging Face) different from the ones covered in class.\n", "   - Take into account the complexity of the dataset and of its tokenization.\n", "   - Using a subset is also recommended to lighten the subsequent training. The goal is not to maximize results, only to demonstrate what you have learned.\n", "2. **Training:** Fine-tuning process for a model:\n", "   - Choice of a model.\n", "   - Fine-tuning of a Transformer model on the data.\n", "   - Report of evaluation metrics on the test set.\n", "3. **Deployment (Model and Space):** The final model must be uploaded to the Hugging Face Hub, and a working \"Space\" (Gradio demo) must be created where the model can be tried by entering text live*.\n", "4. **Model Card:** The model repository on Hugging Face must contain a `README.md` explaining what the model does, its limitations, and the metrics obtained.\n", "\n", "\\* If you run into problems with fine-tuning, the deployed model may be an existing model.\n", "\n", "> **Note on organization:**\n", ">\n", "> This notebook is designed to be used as a template. **In principle, the whole project life cycle (loading, training, evaluation, and push to the Hub) can be carried out within this single notebook.** However, feel free to split it into several notebooks (e.g. one for training and another for deployment) if you find that better organized."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The project code and a demo are available at https://huggingface.co/spaces/antcaesar/resuemenes_hugginface_TECP" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SWa-5d910tPC" }, "outputs": [], "source": [ "import math\n", "import numpy as np\n", "import pandas as pd\n", "import torch\n", "from datasets import Dataset\n", "from torch.utils.data import DataLoader\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " prompt \\\n", "0 Para ver la iglesia del pantano de Sau complet... \n", "1 En la coca de pimiento y tomate \n", "2 ¿Cómo se sirven los calçots? \n", "3 Estás haciendo un viaje desde Madrid a tu pueb... \n", "4 Has abierto un chorizo curado y te sobra la mi... \n", ".. ... \n", "95 Voy a a cortar jamón serrano para un aperitivo... \n", "96 ¿Qué les pasa a las figuras de cartón y madera... \n", "97 Para hacer una figura decorativa, mezclamos el... \n", "98 Cómo hacer ratafía en casa. \n", "99 Haces gazpacho andaluz en verano para la comid... \n", "\n", " solution0 \\\n", "0 tienes que esperar un período sin niebla. \n", "1 se le añaden piñones y atún. \n", "2 En un restaurante te pondrán una teja con unos... \n", "3 Utilizas el dibujo profundo, ya que evacua mej... \n", "4 Envuélvelo en papel y guárdalo en la nevera en... \n", ".. ... \n", "95 Usaré cuchillo de sierra corto, con cortes cor... \n", "96 Se endurecen con el fuego. \n", "97 Moldeamos la figura y esperamos unas horas par... \n", "98 La ratafía es un licor de hierbas con base de ... \n", "99 Deja el gazpacho en nevera antes de servir. \n", "\n", " solution1 label language \\\n", "0 tienes que esperar un período de sequía. 1 spa_latn_spai \n", "1 se le añaden piñones y butifarra. 0 spa_latn_spai \n", "2 En un restaurante te pondrán una teja con unos... 1 spa_latn_spai \n", "3 Utilizas el dibujo liso, ya que evacua mejor e... 0 spa_latn_spai \n", "4 Envuélvelo en film y guárdalo en la nevera en ... 1 spa_latn_spai \n", ".. ... ... ... \n", "95 Usaré un cuchillo jamonero bien afilado, con c... 1 spa_latn_spai \n", "96 Se queman con el fuego. 1 spa_latn_spai \n", "97 Moldeamos la figura y esperamos unas horas par... 0 spa_latn_spai \n", "98 La ratafía es un licor de hierbas con base de ... 0 spa_latn_spai \n", "99 Deja el gazpacho fuera de nevera antes de servir. 0 spa_latn_spai \n", "\n", " eng_translated0 \\\n", "0 To see the church at the Sau swamp in its enti... 
\n", "1 In the pepper and tomato coca pastry, pine nut... \n", "2 How are calçots served? In a restaurant, you w... \n", "3 You are taking a trip from Madrid to your town... \n", "4 You have opened a cured chorizo and have half ... \n", ".. ... \n", "95 I am going to slice serrano ham for an appetiz... \n", "96 What happens to the cardboard and wood figures... \n", "97 To make a decorative figure, we mix gypsum pla... \n", "98 How to make ratafia at home. Ratafia is a herb... \n", "99 You are making Andalusian gazpacho in the summ... \n", "\n", " eng_translated1 approx_cultural_score \\\n", "0 To see the church at the Sau swamp in its enti... 1 \n", "1 In the pepper and tomato coca pastry, pine nut... 1 \n", "2 How are calçots served? In a restaurant, you w... 1 \n", "3 You are taking a trip from Madrid to your town... 1 \n", "4 You have opened a cured chorizo and have half ... 1 \n", ".. ... ... \n", "95 I am going to slice serrano ham for an appetiz... 1 \n", "96 What happens to the cardboard and wood figures... 1 \n", "97 To make a decorative figure, we mix gypsum pla... 1 \n", "98 How to make ratafia at home. Ratafia is a herb... 1 \n", "99 You are making Andalusian gazpacho in the summ... 1 \n", "\n", " llm_used example_id \\\n", "0 0 group0042_ex000035_spa_latn_spai_0_v1 \n", "1 0 group0042_ex000070_spa_latn_spai_0_v1 \n", "2 0 group0042_ex000021_spa_latn_spai_0_v1 \n", "3 0 group0126_ex000024_spa_latn_spai_1_v1 \n", "4 0 group0126_ex000010_spa_latn_spai_1_v1 \n", ".. ... ... \n", "95 0 group0126_ex000039_spa_latn_spai_1_v1 \n", "96 0 group0134_ex000019_spa_latn_spai_2_v1 \n", "97 0 group0134_ex000063_spa_latn_spai_2_v1 \n", "98 0 group0042_ex000037_spa_latn_spai_0_v1 \n", "99 0 group0126_ex000037_spa_latn_spai_1_v1 \n", "\n", " supplement \n", "0 {\"topic\": \"place\", \"cultural_type\": \"cultural ... \n", "1 {\"topic\": \"food\", \"cultural_type\": \"cultural C... \n", "2 {\"topic\": \"food\", \"cultural_type\": \"cultural C... 
\n", "3 {\"uncorrected_eng_translated0\": \"You are takin... \n", "4 {\"uncorrected_eng_translated0\": \"You have open... \n", ".. ... \n", "95 {\"uncorrected_eng_translated0\": \"I am going to... \n", "96 {\"uncorrected_eng_translated0\": \"What happens ... \n", "97 {\"uncorrected_eng_translated0\": \"To make a dec... \n", "98 {\"topic\": \"food\", \"cultural_type\": \"cultural C... \n", "99 {\"uncorrected_eng_translated0\": \"You make gazp... \n", "\n", "[100 rows x 11 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "splits = {'train': 'data/train-00000-of-00001.parquet', 'validation': 'data/validation-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}\n", "df = pd.read_parquet(\"hf://datasets/somosnlp/NoticIA-it/\" + splits[\"train\"])\n", "df = df[[\"texto\", \"respuesta\"]].dropna().reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id titular \\\n", "0 0 JORGE REY: EL TIEMPO | La impactante predicció... \n", "1 1 El cambio en las matrículas que se espera para... \n", "2 2 Si no avisas a la DGT de este cambio en tu coc... \n", "3 3 Estos serán los lenguajes de programación con ... \n", "4 4 Cambio de estrategia en Microsoft: Windows 12 ... \n", ".. ... ... \n", "695 695 Primicia: Mediaset ya tiene pareja de presenta... \n", "696 696 Margot Robbie anuncia que se retira de la actu... \n", "697 697 ¿Por qué el videojuego de Indiana Jones es en ... \n", "698 698 La insólita situación vivida frente a un semáf... \n", "699 699 Uno de los mejores Assassin’s Creed podría ten... \n", "\n", " respuesta \\\n", "0 El inicio de un periodo frío intenso. \n", "1 Se dará el salto a la letra M. \n", "2 500 euros por pintar un coche de otro color y ... \n", "3 Python y JavaScript. \n", "4 Solo un 28.6% de los usuarios actuales de Wind... \n", ".. ... \n", "695 Diego Losada y Mónica Sanz. \n", "696 No se retira, pero no quiere hacer otra pelícu... \n", "697 Para que la acción parezca propia y sea mucho ... \n", "698 Un conductor de 44 años se quedó dormido frent... \n", "699 Black Flag. \n", "\n", " pregunta \\\n", "0 Ahora eres una Inteligencia Artificial experta... \n", "1 Ahora eres una Inteligencia Artificial experta... \n", "2 Ahora eres una Inteligencia Artificial experta... \n", "3 Ahora eres una Inteligencia Artificial experta... \n", "4 Ahora eres una Inteligencia Artificial experta... \n", ".. ... \n", "695 Ahora eres una Inteligencia Artificial experta... \n", "696 Ahora eres una Inteligencia Artificial experta... \n", "697 Ahora eres una Inteligencia Artificial experta... \n", "698 Ahora eres una Inteligencia Artificial experta... \n", "699 Ahora eres una Inteligencia Artificial experta... \n", "\n", " texto idioma periodo \\\n", "0 27·11·23 | 08:34 | Actualizado a las 14:47\\nJO... es_es actual \n", "1 Si eres de los que sigues el avance de las mat... 
es_es actual \n", "2 Con Pilar Cisneros y Fernando de Haro\\nCon Pac... es_es actual \n", "3 Si con el año nuevo te has propuesto aumentar ... es_es actual \n", "4 Desde hace ya varios meses, las especulaciones... es_es actual \n", ".. ... ... ... \n", "695 Mediaset ya tiene encajadas las piezas del puz... es_es actual \n", "696 Todo lo que buscas en un solo click\\nLa actriz... es_bo actual \n", "697 Xbox clarificó en el Developer_Direct de la se... es_es actual \n", "698 Se pueden imaginar que en el teléfono de la Po... es_es actual \n", "699 Parece que la nueva versión del título de Ubis... es_mx actual \n", "\n", " tarea registro dominio país_origen \n", "0 resumen medio prensa_ciencia_y_tecnologia españa \n", "1 resumen medio prensa_ciencia_y_tecnologia españa \n", "2 resumen medio prensa_otros españa \n", "3 resumen medio prensa_ciencia_y_tecnologia españa \n", "4 resumen medio prensa_ciencia_y_tecnologia españa \n", ".. ... ... ... ... \n", "695 resumen medio prensa_celebridades españa \n", "696 resumen coloquial prensa_celebridades bolivia \n", "697 resumen medio prensa_ocio_y_cultura españa \n", "698 resumen medio prensa_otros españa \n", "699 resumen medio prensa_ocio_y_cultura mexico \n", "\n", "[700 rows x 11 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training cell:\n", "\n", "This cell runs the complete fine-tuning process and saves the model. Specifically, it:\n", "\n", "- Loads the base `tokenizer` and `model` from Hugging Face.\n", "- Creates a data subset (`sample_size`) and splits it into `train`, `val`, and `test`.\n", "- Defines `preprocess_function` to tokenize the inputs (`texto`) and the targets (`respuesta`).\n", "- Builds `DataLoader`s and a `DataCollatorForSeq2Seq` to batch examples appropriately.\n", "- Runs a short training loop (bounded by `max_train_steps`) with `AdamW`.\n", "- Evaluates the model on the test set to obtain `test_loss` and `test_perplexity`.\n", "- Saves the model and tokenizer to `mt5-resumenes-es-final` and runs an example inference.\n", "\n", "Run this cell after checking `df.head()` and installing the required dependencies. Training takes longer on CPU; on GPU it will be faster." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning: You are sending unauthenticated requests to the HF Hub. 
Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3e1f88c34c734cb7bf409cfad217608b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading weights: 0%| | 0/192 [00:00= max_train_steps:\n", " break\n", "\n", "train_loss = float(np.mean(train_losses)) if train_losses else float(\"nan\")\n", "\n", "# Evaluate on the held-out loader without tracking gradients.\n", "model.eval()\n", "eval_losses = []\n", "with torch.no_grad():\n", "    for batch in eval_loader:\n", "        batch = {key: value.to(device) for key, value in batch.items()}\n", "        outputs = model(**batch)\n", "        eval_losses.append(outputs.loss.item())\n", "\n", "test_loss = float(np.mean(eval_losses)) if eval_losses else float(\"nan\")\n", "# Perplexity is exp(mean batch loss); losses of 20 or more are reported as inf.\n", "test_perplexity = math.exp(test_loss) if np.isfinite(test_loss) and test_loss < 20 else float(\"inf\")\n", "\n", "print(\"Train loss:\", round(train_loss, 4) if np.isfinite(train_loss) else train_loss)\n", "print(\"Test loss:\", round(test_loss, 4))\n", "print(\"Test perplexity:\", round(test_perplexity, 4) if np.isfinite(test_perplexity) else test_perplexity)\n", "\n", "# Save the fine-tuned model and tokenizer to a local directory.\n", "model.save_pretrained(\"mt5-resumenes-es-final\")\n", "tokenizer.save_pretrained(\"mt5-resumenes-es-final\")\n", "\n", "# Example inference on the first test article.\n", "sample_text = test_df.iloc[0][\"texto\"]\n", "inputs = tokenizer(sample_text, return_tensors=\"pt\", truncation=True, max_length=max_input_length).to(device)\n", "generated_ids = model.generate(**inputs, max_length=max_target_length, num_beams=4)\n", "print(\"Texto de entrada:\", sample_text[:1200])\n", "print(\"Resumen generado:\", tokenizer.decode(generated_ids[0], skip_special_tokens=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation metrics on the test set\n", "\n", "This section computes summarization metrics on the test set to measure the quality of the fine-tuned model. The ROUGE-1, ROUGE-2, and ROUGE-L values are approximate F1 scores computed on lowercased, whitespace-split tokens, and are reported alongside the test loss and perplexity." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " metric valor\n", "0 ROUGE-1 aprox. 0.6236\n", "1 ROUGE-2 aprox. 0.5829\n", "2 ROUGE-L aprox. 0.6236\n", "3 Test loss 4.0315\n", "4 Test perplexity 56.3473" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "test_eval_loader = DataLoader(test_tokenized, batch_size=2, shuffle=False, collate_fn=data_collator)\n", "predictions = []\n", "references = []\n", "\n", "# Generate a summary for every test example and decode the reference labels.\n", "model.eval()\n", "with torch.no_grad():\n", "    for batch in test_eval_loader:\n", "        labels = batch[\"labels\"].clone()\n", "        model_inputs = {key: value.to(device) for key, value in batch.items() if key != \"labels\"}\n", "        generated_ids = model.generate(**model_inputs, max_new_tokens=32, num_beams=4)\n", "        batch_predictions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)\n", "        # -100 marks ignored label positions; map them back to the pad token before decoding.\n", "        labels[labels == -100] = tokenizer.pad_token_id\n", "        batch_references = tokenizer.batch_decode(labels, skip_special_tokens=True)\n", "        predictions.extend(batch_predictions)\n", "        references.extend(batch_references)\n", "\n", "def tokenize_summary(text):\n", "    # Approximate ROUGE tokenization: lowercase and split on whitespace.\n", "    return [token for token in text.lower().split() if token]\n", "\n", "def rouge_n_score(prediction_tokens, reference_tokens, n):\n", "    # F1 over the clipped n-gram overlap between prediction and reference.\n", "    prediction_ngrams = Counter(tuple(prediction_tokens[index:index + n]) for index in range(max(len(prediction_tokens) - n + 1, 0)))\n", "    reference_ngrams = Counter(tuple(reference_tokens[index:index + n]) for index in range(max(len(reference_tokens) - n + 1, 0)))\n", "    overlap = sum(min(count, reference_ngrams[ngram]) for ngram, count in prediction_ngrams.items())\n", "    prediction_total = sum(prediction_ngrams.values())\n", "    reference_total = sum(reference_ngrams.values())\n", "    precision = overlap / prediction_total if prediction_total else 0.0\n", "    recall = overlap / reference_total if reference_total else 0.0\n", "    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n", "\n", "def lcs_length(left_tokens, right_tokens):\n", "    # Longest common subsequence length, computed row by row.\n", "    previous_row = [0] * (len(right_tokens) + 1)\n", "    for left_token in left_tokens:\n", "        current_row = [0]\n", "        for index, right_token in enumerate(right_tokens, start=1):\n", "            if left_token == right_token:\n", "                current_row.append(previous_row[index - 1] + 1)\n", "            else:\n", "                current_row.append(max(previous_row[index], current_row[-1]))\n", "        previous_row = current_row\n", "    return previous_row[-1]\n", "\n", "def rouge_l_score(prediction_tokens, reference_tokens):\n", "    # F1 based on the longest common subsequence.\n", "    lcs = lcs_length(prediction_tokens, reference_tokens)\n", "    precision = lcs / len(prediction_tokens) if prediction_tokens else 0.0\n", "    recall = lcs / len(reference_tokens) if reference_tokens else 0.0\n", "    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n", "\n", "rouge_scores = {\"rouge1\": [], \"rouge2\": [], \"rougeL\": []}\n", "\n", "for prediction, reference in zip(predictions, references):\n", "    prediction_tokens = tokenize_summary(prediction)\n", "    reference_tokens = tokenize_summary(reference)\n", "    rouge_scores[\"rouge1\"].append(rouge_n_score(prediction_tokens, reference_tokens, 1))\n", "    rouge_scores[\"rouge2\"].append(rouge_n_score(prediction_tokens, reference_tokens, 2))\n", "    rouge_scores[\"rougeL\"].append(rouge_l_score(prediction_tokens, reference_tokens))\n", "\n", "metrics_df = pd.DataFrame(\n", "    [\n", "        {\"metric\": \"ROUGE-1 aprox.\", \"valor\": float(np.mean(rouge_scores[\"rouge1\"]))},\n", "        {\"metric\": \"ROUGE-2 aprox.\", \"valor\": float(np.mean(rouge_scores[\"rouge2\"]))},\n", "        {\"metric\": \"ROUGE-L aprox.\", \"valor\": float(np.mean(rouge_scores[\"rougeL\"]))},\n", "        {\"metric\": \"Test loss\", \"valor\": test_loss},\n", "        {\"metric\": \"Test perplexity\", \"valor\": test_perplexity},\n", "    ]\n", ")\n", "\n", "metrics_df[\"valor\"] = metrics_df[\"valor\"].apply(lambda value: round(value, 4) if isinstance(value, (float, np.floating)) and np.isfinite(value) else value)\n", "metrics_df"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradio demo\n", "\n", "The following interface lets you type a text, press a button, and get the summary generated by the fine-tuned model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5c33f68d8c56475caaa96815c7841b17", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading weights: 0%| | 0/190 [00:00" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import gradio as gr\n", "import torch\n", "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n", "\n", "# Load the fine-tuned model saved by the training cell.\n", "model_path = \"mt5-resumenes-es-final\"\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "tokenizer = AutoTokenizer.from_pretrained(model_path)\n", "model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)\n", "max_input_length = 256\n", "\n", "def generate_summary(text):\n", "    # Return a prompt message instead of calling the model on empty input.\n", "    if not text or not text.strip():\n", "        return \"Introduce un texto para generar el resumen.\"\n", "\n", "    model.eval()\n", "    inputs = tokenizer(text, return_tensors=\"pt\", truncation=True, max_length=max_input_length).to(device)\n", "    with torch.no_grad():\n", "        summary_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)\n", "    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)\n", "\n", "demo = gr.Blocks(title=\"Resumen de texto en español\")\n", "with demo:\n", "    gr.Markdown(\"# Resumen de textos en español\\nEscribe un texto largo y pulsa el botón para generar un resumen.\")\n", "    with gr.Row():\n", "        input_text = gr.Textbox(label=\"Texto de entrada\", lines=12, placeholder=\"Pega aquí el texto que quieras resumir...\")\n", "        output_text = gr.Textbox(label=\"Resumen generado\", lines=6)\n", "    generate_button = gr.Button(\"Generar resumen\")\n", "    generate_button.click(fn=generate_summary, inputs=input_text, outputs=output_text)\n", "\n", "demo.launch()" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "TECL", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 0 }