attempting to save model

Files changed (2)

.ipynb_checkpoints/summarize-linydub-checkpoint.ipynb +540 -0
summarize-linydub.ipynb +95 -1

.ipynb_checkpoints/summarize-linydub-checkpoint.ipynb ADDED

	@@ -0,0 +1,540 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "fe34671e-5117-4b12-a2b5-dc07fbb49021",
+   "metadata": {},
+   "source": [
+    "## Testing out the Hugging Face Inference API. The goal is to get a working model and Inference API endpoint on the Hugging Face Hub."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92575bb9-bcdf-4357-aa08-e8814b02eafb",
+   "metadata": {},
+   "source": [
+    "For starters, just trying out the Inference API using the starter script here: https://api-inference.huggingface.co/docs/python/html/detailed_parameters.html#summarization-task\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "f44dba0e-72bd-49be-93a8-68895dbf994d",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[{'summary_text': 'CNN.com is celebrating its 10th anniversary this year. We are celebrating by asking our team members to share their thoughts and ideas. We want to hear from you, our readers, about what you think and what you want to share with the world. Share your ideas and help our team to become a little bit better today.'}]"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import json\n",
+    "\n",
+    "import requests\n",
+    "\n",
+    "API_TOKEN = \"hugging_face_access_token\"\n",
+    "\n",
+    "headers = {\"Authorization\": f\"Bearer {API_TOKEN}\"}\n",
+    "API_URL = \"https://api-inference.huggingface.co/models/facebook/bart-large-cnn\"\n",
+    "\n",
+    "# frank question = why POST? Maybe Willian can help answer that\n",
+    "def query(payload):\n",
+    "    data = json.dumps(payload)\n",
+    "    response = requests.request(\"POST\", API_URL, headers=headers, data=data)\n",
+    "    return json.loads(response.content.decode(\"utf-8\"))\n",
+    "\n",
+    "data = query(\n",
+    "    {\n",
+    "        \"inputs\": \"Picture this it’s the early morning, I’m sitting down with a hot cup of French roast coffee. I’m putting my AirPods in everything’s quiet. My mind is fresh and there are no distractions. I have a feeling of anticipation. It’s a mix of focus, excitement, and a tiny bit of apprehension. I’m about to hear your voice because some about to listen, to ponder replies. I have the space to connect with my team members to listen to them and learn something about you. This usually forces me to learn something about myself _____. I press play, and the learning begins. Thank you for being engaged here and sharing your ideas and helping our team to become just a little bit better today.\",\n",
+    "        \"parameters\": {\"do_sample\": False},\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e786be3-d888-4faa-9527-11237fea3882",
+   "metadata": {},
+   "source": [
+    "Ok. Crappy summary, but you can tell it works from the last sentence..."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d3455b8-ec89-49d2-be27-a2e87218d902",
+   "metadata": {},
+   "source": [
+    "### Next, fine-tune linydub and push the model to the Hugging Face Hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4cdd2db4-95ea-4212-9f99-8d65f62fb349",
+   "metadata": {},
+   "source": [
+    "We need the model..."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "ca901ecf-7369-47f9-bbdd-ddc358cc8ff1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"linydub/bart-large-samsum\")\n",
+    "\n",
+    "model = AutoModelForSeq2SeqLM.from_pretrained(\"linydub/bart-large-samsum\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f88d654b-f9de-421d-aab0-bd8522e01e69",
+   "metadata": {},
+   "source": [
+    "We need training data...\n",
+    "\n",
+    "Let's see what the linydub training data looks like so we can replicate the format for fine-tuning: https://github.com/linydub/azureml-greenai-txtsum/tree/main/examples/assets/data/hf-samsum/train\n",
+    "\n",
+    "I had no idea how to open an arrow file, so I got the samsum data from here instead: https://paperswithcode.com/dataset/samsum-corpus\n",
+    "\n",
+    "And then this blog post helped me figure out the training part: https://medium.com/rocket-mortgage-technology-blog/conversational-summarization-with-natural-language-processing-c073a6bcaa3a\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "cea91bb7-b1eb-4363-841a-88502a368669",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "val_path = 'val.json'\n",
+    "test_path = 'test.json'\n",
+    "train_path = 'train.json'\n",
+    "\n",
+    "def load_json(path):\n",
+    "    \"\"\"Read one dataset split from a JSON file.\"\"\"\n",
+    "    # The with block closes the file on exit, so no explicit\n",
+    "    # close() call is needed.\n",
+    "    with open(path) as in_file:\n",
+    "        return json.load(in_file)\n",
+    "\n",
+    "# Load the validation, test, and training splits.\n",
+    "val = load_json(val_path)\n",
+    "test = load_json(test_path)\n",
+    "train = load_json(train_path)\n",
+    "    \n",
+    "data = train + test + val\n",
+    "assert len(data) == len(train) + len(test) + len(val)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "a4fc8ad6-1c3c-4839-9db7-ea811d8d1cc3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "df = pd.DataFrame(data)\n",
+    "df['dialogue'] = df['dialogue'].str.replace('\\r', '')\n",
+    "df['dialogue'] = df['dialogue'].str.replace('\\n', '')\n",
+    "df['summary'] = df['summary'].str.replace('\\r', '')\n",
+    "df['summary'] = df['summary'].str.replace('\\n', '')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "4db4c0de-1a68-4977-b10b-694923f692e1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "validator = df.head(1)\n",
+    "df = df.iloc[1:,]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "83afe4d7-b410-432d-9b1a-a2983f3858cc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import datasets\n",
+    "\n",
+    "data_as_dataset = datasets.Dataset.from_pandas(df, preserve_index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "5539d549-2306-42cc-9975-aa2fb4ecab43",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dd = data_as_dataset.train_test_split(test_size=0.1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "e013ce63-8252-41f7-8ebb-41e926a889d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prefix = \"summarize: \"\n",
+    "\n",
+    "def preprocess_function(examples):\n",
+    "    inputs = [prefix + doc for doc in examples[\"dialogue\"]]\n",
+    "    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)\n",
+    "\n",
+    "    with tokenizer.as_target_tokenizer():\n",
+    "        labels = tokenizer(examples[\"summary\"], max_length=128, truncation=True)\n",
+    "\n",
+    "    model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
+    "    return model_inputs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "64f15b64-34b2-4849-9d7f-8fe712656613",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "a2ba058964994fd2b55f7c569c000f13",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?ba/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "d4a48e263d50461ba596bd5033a1e920",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?ba/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "tokenized_dd = dd.map(preprocess_function, batched=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "9b5ded79-f0ed-4c0a-8a96-ae694f232aa6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "1f051a1f-9292-4d91-a2c3-89f1d03a0935",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from transformers import DataCollatorForSeq2Seq\n",
+    "\n",
+    "data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "c04e187b-6e44-479f-911e-278733884c30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit\n",
+    "\n",
+    "def print_gpu_utilization():\n",
+    "    nvmlInit()\n",
+    "    handle = nvmlDeviceGetHandleByIndex(0)\n",
+    "    info = nvmlDeviceGetMemoryInfo(handle)\n",
+    "    print(f\"GPU memory occupied: {info.used//1024**2} MB.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "bdbaf892-ce5d-4fba-8133-db147de51d53",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "GPU memory occupied: 0 MB.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print_gpu_utilization()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "ed721ebc-11c9-431d-a303-dcd1d39a6a6a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Using amp half precision backend\n",
+      "The following columns in the training set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, dialogue, summary. If id, dialogue, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.\n",
+      "/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
+      "  FutureWarning,\n",
+      "***** Running training *****\n",
+      "  Num examples = 12\n",
+      "  Num Epochs = 3\n",
+      "  Instantaneous batch size per device = 8\n",
+      "  Total train batch size (w. parallel, distributed & accumulation) = 8\n",
+      "  Gradient Accumulation steps = 1\n",
+      "  Total optimization steps = 6\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "    <div>\n",
+       "      \n",
+       "      <progress value='6' max='6' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+       "      [6/6 00:02, Epoch 3/3]\n",
+       "    </div>\n",
+       "    <table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       " <tr style=\"text-align: left;\">\n",
+       "      <th>Epoch</th>\n",
+       "      <th>Training Loss</th>\n",
+       "      <th>Validation Loss</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <td>1</td>\n",
+       "      <td>No log</td>\n",
+       "      <td>1.015452</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>2</td>\n",
+       "      <td>No log</td>\n",
+       "      <td>1.071723</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <td>3</td>\n",
+       "      <td>No log</td>\n",
+       "      <td>1.088332</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table><p>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, dialogue, summary. If id, dialogue, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.\n",
+      "***** Running Evaluation *****\n",
+      "  Num examples = 2\n",
+      "  Batch size = 8\n",
+      "The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, dialogue, summary. If id, dialogue, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.\n",
+      "***** Running Evaluation *****\n",
+      "  Num examples = 2\n",
+      "  Batch size = 8\n",
+      "The following columns in the evaluation set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, dialogue, summary. If id, dialogue, summary are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.\n",
+      "***** Running Evaluation *****\n",
+      "  Num examples = 2\n",
+      "  Batch size = 8\n",
+      "\n",
+      "\n",
+      "Training completed. Do not forget to share your model on huggingface.co/models =)\n",
+      "\n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "TrainOutput(global_step=6, training_loss=1.0018877983093262, metrics={'train_runtime': 4.8179, 'train_samples_per_second': 7.472, 'train_steps_per_second': 1.245, 'total_flos': 22551432265728.0, 'train_loss': 1.0018877983093262, 'epoch': 3.0})"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "training_args = Seq2SeqTrainingArguments(\n",
+    "    output_dir=\"./results\",\n",
+    "    evaluation_strategy=\"epoch\",\n",
+    "    learning_rate=2e-5,\n",
+    "    per_device_train_batch_size=8,\n",
+    "    per_device_eval_batch_size=8,\n",
+    "    weight_decay=0.01,\n",
+    "    save_total_limit=3,\n",
+    "    num_train_epochs=3,\n",
+    "    fp16=True,\n",
+    ")\n",
+    "\n",
+    "trainer = Seq2SeqTrainer(\n",
+    "    model=model,\n",
+    "    args=training_args,\n",
+    "    train_dataset=tokenized_dd[\"train\"],\n",
+    "    eval_dataset=tokenized_dd[\"test\"],\n",
+    "    tokenizer=tokenizer,\n",
+    "    data_collator=data_collator,\n",
+    ")\n",
+    "\n",
+    "trainer.train()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "id": "eece8dd8-29ab-45cf-9a3e-7cded880f385",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "784ecc7b7797495f95651cd2e61a7c3d",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "VBox(children=(HTML(value='<center>\\n<img src=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from huggingface_hub import notebook_login\n",
+    "\n",
+    "notebook_login()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "e4a7de93-5acc-4173-b6fe-98ae15e2373a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     ]
+    },
+    {
+     "ename": "OSError",
+     "evalue": "Tried to clone a repository in a non-empty folder that isn't a git repository. If you really want to do this, do it manually:\ngit init && git remote add origin && git pull origin main\n or clone repo to a new folder and move your existing files there afterwards.",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mOSError\u001b[0m                                   Traceback (most recent call last)",
+      "\u001b[0;32m/tmp/ipykernel_1257/1405518398.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtrainer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpush_to_hub\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+      "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/transformers/trainer.py\u001b[0m in \u001b[0;36mpush_to_hub\u001b[0;34m(self, commit_message, blocking, **kwargs)\u001b[0m\n\u001b[1;32m   2827\u001b[0m         \u001b[0;31m# it might fail.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2828\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mhasattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"repo\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2829\u001b[0;31m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minit_git_repo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2830\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2831\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshould_save\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/transformers/trainer.py\u001b[0m in \u001b[0;36minit_git_repo\u001b[0;34m(self, at_init)\u001b[0m\n\u001b[1;32m   2709\u001b[0m                 \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moutput_dir\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2710\u001b[0m                 \u001b[0mclone_from\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mrepo_name\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2711\u001b[0;31m                 \u001b[0muse_auth_token\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0muse_auth_token\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2712\u001b[0m             )\n\u001b[1;32m   2713\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0mEnvironmentError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/huggingface_hub/repository.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, local_dir, clone_from, repo_type, use_auth_token, git_user, git_email, revision, private, skip_lfs_files)\u001b[0m\n\u001b[1;32m    419\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    420\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mclone_from\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 421\u001b[0;31m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclone_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrepo_url\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mclone_from\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    422\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    423\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mis_git_repo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlocal_dir\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/huggingface_hub/repository.py\u001b[0m in \u001b[0;36mclone_from\u001b[0;34m(self, repo_url, use_auth_token)\u001b[0m\n\u001b[1;32m    620\u001b[0m                 \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0min_repository\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    621\u001b[0m                     raise EnvironmentError(\n\u001b[0;32m--> 622\u001b[0;31m                         \u001b[0;34m\"Tried to clone a repository in a non-empty folder that isn't a git repository. If you really \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    623\u001b[0m                         \u001b[0;34m\"want to do this, do it manually:\\n\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    624\u001b[0m                         \u001b[0;34m\"git init && git remote add origin && git pull origin main\\n\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mOSError\u001b[0m: Tried to clone a repository in a non-empty folder that isn't a git repository. If you really want to do this, do it manually:\ngit init && git remote add origin && git pull origin main\n or clone repo to a new folder and move your existing files there afterwards."
+     ]
+    }
+   ],
+   "source": [
+    "trainer.push_to_hub()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53e539de-637d-4127-99c5-4f3dab3ab286",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "environment": {
+   "kernel": "python3",
+   "name": "pytorch-gpu.1-10.m87",
+   "type": "gcloud",
+   "uri": "gcr.io/deeplearning-platform-release/pytorch-gpu.1-10:m87"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
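
Note on the `OSError` raised by `trainer.push_to_hub()` in the notebook above: the traceback shows `huggingface_hub` refusing to clone into `output_dir` (`./results`) because that folder is non-empty and is not a git repository. One possible workaround, sketched here under assumptions, is to check the folder before building the training arguments and fall back to a fresh, empty one. The helper names `is_safe_clone_target` and `pick_output_dir` are hypothetical and are not part of `transformers` or `huggingface_hub`; the check only mirrors the condition described in the error message.

```python
import os
import tempfile


def is_safe_clone_target(path):
    """Return True if `path` is missing, empty, or already has a .git folder.

    Hypothetical pre-check that mirrors the condition in the error message;
    it is not the library's own implementation.
    """
    if not os.path.isdir(path):
        # A missing path is fine; an existing regular file is not.
        return not os.path.exists(path)
    if os.path.isdir(os.path.join(path, ".git")):
        return True
    return len(os.listdir(path)) == 0


def pick_output_dir(preferred):
    """Return `preferred` if it is safe to clone into, else a new empty sibling."""
    if is_safe_clone_target(preferred):
        return preferred
    parent = os.path.dirname(os.path.abspath(preferred))
    prefix = os.path.basename(os.path.normpath(preferred)) + "-"
    # mkdtemp creates a new, empty directory with a unique name.
    return tempfile.mkdtemp(prefix=prefix, dir=parent)


# Assumed usage in the notebook (not executed here):
#   training_args = Seq2SeqTrainingArguments(
#       output_dir=pick_output_dir("./results"), ...)
```

With a helper like this, a stale `./results` folder from an earlier run would be left untouched and training output would go to a new sibling such as `results-abc123`. The error message itself also suggests the manual alternative of cloning the repo into a new folder and moving the existing files there.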

summarize-linydub.ipynb CHANGED

@@ -503,10 +503,104 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
    "id": "53e539de-637d-4127-99c5-4f3dab3ab286",
    "metadata": {},
    "outputs": [],
    "source": []
   }
  ],

   },
   {
    "cell_type": "code",
+   "execution_count": 19,
+   "id": "fa530ecc-2343-4d8c-9255-0eb2c807f24d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "original text preprocessed: \n",
+      " Amanda: I baked  cookies. Do you want some?Jerry: Sure!Amanda: I'll bring you tomorrow :-)\n"
+     ]
+    }
+   ],
+   "source": [
+    "preprocess_text = validator[\"dialogue\"].values[0].strip().replace(\"\\n\",\"\")\n",
+    "prepared_Text = \"summarize: \" + preprocess_text\n",
+    "print(\"original text preprocessed: \\n\", preprocess_text)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
    "id": "53e539de-637d-4127-99c5-4f3dab3ab286",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "tokenized_text = tokenizer.encode(prepared_Text, return_tensors=\"pt\").to(device='cuda')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "c0be5361-7b03-4ca3-a9b6-3ec2b8f264bb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "summary_ids = model.generate(tokenized_text,\n",
+    "                                    num_beams=4,\n",
+    "                                    no_repeat_ngram_size=2,\n",
+    "                                    min_length=30,\n",
+    "                                    max_length=100,\n",
+    "                                    early_stopping=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "1dbf6a8b-89dd-4f9b-89b9-bad7c3ac41ff",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "\n",
+      "Summarized text: \n",
+      " Amanda baked cookies and will bring them to Jerry tomorrow. Jerry is happy to have some cookies. Amanda is going to bring Jerry cookies tomorrow as well.\n"
+     ]
+    }
+   ],
+   "source": [
+    "output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)\n",
+    "print(\"\\n\\nSummarized text: \\n\", output)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "7535952b-7d24-4716-85fa-3e6ab142baa5",
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "PermissionError",
+     "evalue": "[Errno 13] Permission denied: '/first_success_save_model'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mPermissionError\u001b[0m                           Traceback (most recent call last)",
+      "\u001b[0;32m/tmp/ipykernel_1257/4217802359.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mpt_save_directory\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"/first_success_save_model\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mtokenizer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave_pretrained\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpt_save_directory\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave_pretrained\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpt_save_directory\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;32m/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py\u001b[0m in \u001b[0;36msave_pretrained\u001b[0;34m(self, save_directory, legacy_format, filename_prefix, push_to_hub, **kwargs)\u001b[0m\n\u001b[1;32m   2043\u001b[0m             \u001b[0mrepo\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_create_or_get_repo\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msave_directory\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2044\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2045\u001b[0;31m         \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmakedirs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msave_directory\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexist_ok\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2046\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2047\u001b[0m         special_tokens_map_file = os.path.join(\n",
+      "\u001b[0;32m/opt/conda/lib/python3.7/os.py\u001b[0m in \u001b[0;36mmakedirs\u001b[0;34m(name, mode, exist_ok)\u001b[0m\n\u001b[1;32m    221\u001b[0m             \u001b[0;32mreturn\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    222\u001b[0m     \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 223\u001b[0;31m         \u001b[0mmkdir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    224\u001b[0m     \u001b[0;32mexcept\u001b[0m \u001b[0mOSError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    225\u001b[0m         \u001b[0;31m# Cannot rely on checking for EEXIST, since the operating system\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mPermissionError\u001b[0m: [Errno 13] Permission denied: '/first_success_save_model'"
+     ]
+    }
+   ],
+   "source": [
+    "pt_save_directory = \"/first_success_save_model\"\n",
+    "tokenizer.save_pretrained(pt_save_directory)\n",
+    "model.save_pretrained(pt_save_directory)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "12f3280a-60c9-42d0-a2e8-2e581a46e68f",
+   "metadata": {},
+   "outputs": [],
    "source": []
   }
  ],
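
Note on the `PermissionError` in the last executed cell: the leading `/` makes `/first_success_save_model` an absolute path at the filesystem root, which the notebook user typically cannot write to. A minimal sketch of one fix, assuming the intent was a folder next to the notebook, resolves the name against the working directory instead. `local_save_dir` is a hypothetical helper, not a `transformers` function.

```python
from pathlib import Path


def local_save_dir(name, base=None):
    """Return a directory for `name` under `base` (default: current directory).

    Leading slashes are stripped so "/first_success_save_model" is treated
    as a relative folder name rather than a path at the filesystem root.
    The directory is created if it does not exist yet.
    """
    base_dir = Path(base) if base is not None else Path.cwd()
    target = base_dir / name.lstrip("/")
    target.mkdir(parents=True, exist_ok=True)
    return target


# Assumed usage in the notebook (not executed here):
#   pt_save_directory = str(local_save_dir("/first_success_save_model"))
#   tokenizer.save_pretrained(pt_save_directory)
#   model.save_pretrained(pt_save_directory)
```

The simpler change would be to write `"./first_success_save_model"` directly; the helper only makes the relative-path intent explicit and creates the folder up front.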