{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# OptimAbstract\n",
    "The goal is to build a meta-model on top of T5 that adapts the choice of model to the complexity of the text to summarize.\n",
    "\n",
    "The learning phase has several steps:\n",
    "1. Find relevant features that represent the complexity of a text at low computational cost\n",
    "2. Apply the candidate models and select the best one according to a fixed criterion (BERTScore)\n",
    "3. Fit a classifier to predict the best model from the features\n",
    "\n",
    "At inference time, simply run the classifier on the features of the text and use the model it predicts."
   ],
   "id": "bec99a2ab93de91b"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import time\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from bert_score import score\n",
    "from datasets import load_dataset\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "from model import MetaModel, T5Model, extract_features, get_best_model, save_object"
   ],
   "id": "5d14705fffbcfb64",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Data loading\n",
    "\n",
    "As a first experiment, let us work on a very small amount of data."
   ],
   "id": "3bffb33f36f005c2"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "dataset = load_dataset(\"cnn_dailymail\", \"3.0.0\", split=\"train\")",
   "id": "4c35f3d88583bb80",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# Sort by article length (in characters) and sample evenly across the range to get a wide diversity of complexity\n",
    "train_dataset = dataset.map(lambda x: {\"text_length\": len(x[\"article\"])})\n",
    "train_dataset = train_dataset.sort(\"text_length\")\n",
    "num_samples = 500\n",
    "indices = np.linspace(0, len(train_dataset) - 1, num_samples, dtype=int)\n",
    "selected_samples = train_dataset.select(indices)\n",
    "print([ex[\"text_length\"] for ex in selected_samples])"
   ],
   "id": "622dbc95cf5e0cc5",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "selected_samples",
   "id": "f038b84f4f9aee51",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "model_names = [\"google-t5/t5-small\", \"google-t5/t5-base\", \"google-t5/t5-large\"]",
   "id": "dbecc71e2eb0df4b",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Exploring features and classifier",
   "id": "b3fe4e5fd705255"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "models = {name: T5Model(name) for name in model_names}\n",
    "train_texts = selected_samples[\"article\"]\n",
    "train_summaries = selected_samples[\"highlights\"]"
   ],
   "id": "187ade3986ce0021",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "features_name = list(extract_features(train_texts[0]).keys())",
   "id": "2337a1e10f568364",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "X = np.array([list(extract_features(text).values()) for text in train_texts])",
   "id": "b6d795089320dd18",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "y = get_best_model(models, train_texts, train_summaries, tolerance=0)",
   "id": "aeaa0061b8d274a5",
   "outputs": [],
   "execution_count": null
  },
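  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# Build the frame from the numeric features, then add the label column.\n",
    "# Concatenating labels and features into one array would cast the features to a common dtype.\n",
    "df = pd.DataFrame(X, columns=features_name)\n",
    "df.insert(0, \"best_model_name\", np.asarray(y).ravel())"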
   ],
   "id": "a01e40f0d6915fdb",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "n_cols = 2\n",
    "n_rows = (len(features_name) + n_cols - 1) // n_cols  # enough rows to hold every feature\n",
    "plt.figure(figsize=(10, 30))\n",
    "for i, feature in enumerate(features_name):\n",
    "    plt.subplot(n_rows, n_cols, i + 1)\n",
    "    sns.boxplot(x=\"best_model_name\", y=feature, data=df)\n",
    "    plt.xticks(rotation=45)\n",
    "    plt.yticks(rotation=0)\n",
    "    plt.locator_params(axis=\"y\", nbins=6)\n",
    "    plt.title(feature)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ],
   "id": "3c3ed90a12128ce6",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "The features do not seem to be relevant; I need to work on them further.\n",
    "\n",
    "## MetaModel"
   ],
   "id": "c9f46b201ff00ec7"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\", message=\".*Some weights of RobertaModel.*\")\n",
    "meta_model = MetaModel(model_names, base_classifier=RandomForestClassifier(), tolerance=0.01)\n",
    "meta_model.fit(selected_samples[\"article\"], selected_samples[\"highlights\"])\n",
    "save_object(meta_model, \"first_model.pkl\")"
   ],
   "id": "6d68f234e372396d",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": "test_dataset = dataset.shuffle(seed=42).select(range(100))",
   "id": "59f0d58080ac7b44",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "meta_model_scores = []\n",
    "meta_model_times = []\n",
    "model_scores = {name: [] for name in model_names}\n",
    "model_times = {name: [] for name in model_names}\n",
    "\n",
    "for example in test_dataset:\n",
    "    article, reference = example[\"article\"], example[\"highlights\"]\n",
    "\n",
    "    # Meta-model: summary, elapsed time and BERTScore F1 against the reference\n",
    "    predicted_summary, meta_time = meta_model.summarize(article)\n",
    "    _, _, F1 = score([predicted_summary], [reference], lang=\"en\", verbose=False)\n",
    "    meta_model_scores.append(F1.item())\n",
    "    meta_model_times.append(meta_time)\n",
    "\n",
    "    # Baselines: every candidate model on the same article\n",
    "    for model_name in model_names:\n",
    "        summary, elapsed_time = meta_model.models[model_name].summarize(article)\n",
    "        _, _, F1 = score([summary], [reference], lang=\"en\", verbose=False)\n",
    "        model_scores[model_name].append(F1.item())\n",
    "        model_times[model_name].append(elapsed_time)"
   ],
   "id": "6fd91b97e4b6e588",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-19T08:05:59.621340Z",
     "start_time": "2025-02-19T08:05:59.616976Z"
    }
   },
   "cell_type": "code",
   "source": [
    "print(\"\\n===== Model Evaluation =====\")\n",
    "for model_name in model_names:\n",
    "    avg_score = np.mean(model_scores[model_name])\n",
    "    avg_time = np.mean(model_times[model_name])\n",
    "    print(f\"{model_name}: BERTScore={avg_score:.4f}, Time={avg_time:.4f}s\")\n",
    "\n",
    "print(\n",
    "    f\" MetaModel : BERTScore={np.mean(meta_model_scores):.4f}, \"\n",
    "    f\"Time={np.mean(meta_model_times):.4f}s\"\n",
    ")"
   ],
   "id": "ffa22ef5f39d30bf",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "===== Model Evaluation =====\n",
      "google-t5/t5-small: BERTScore=0.8639, Time=2.0048s\n",
      "google-t5/t5-base: BERTScore=0.8720, Time=5.2173s\n",
      "google-t5/t5-large: BERTScore=0.8664, Time=15.8678s\n",
      " MetaModel : BERTScore=0.8681, Time=3.2380s\n"
     ]
    }
   ],
   "execution_count": 9
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "17/02/25: The results are better with the tolerance at 1%. Next steps for the rerun:\n",
    "- include the feature computation in the meta-model time cost\n",
    "- analyze the features and the classifier performance in more depth\n",
    "- change the MetaModel structure, because the saved object is too large to be committed (4 GB)"
   ],
   "id": "3785e798f9dfaa6d"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "",
   "id": "d5f89bd659f54ed2"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}