{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example code for Named Entity Recognition with MITIE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cell 1: Import Libraries\n",
    "Import the MITIE training classes. This requires the `mitie` package to be importable in the active environment.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mitie import ner_training_instance, ner_trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cell 2: Create Training Examples\n",
    "Define training examples by tokenizing sentences and annotating entities. Each entity is marked with a range of token indices (end index exclusive) and a tag such as \"PERSON\" or \"ORGANIZATION\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First training example\n",
    "sample = ner_training_instance([\"Mam\", \"na\", \"imię\", \"Alicja\", \"Kowalska\", \"i\", \"pracuję\", \"w\", \"Cleantext\", \".\"])\n",
    "sample.add_entity(range(3,5), \"PERSON\")  # Alicja Kowalska as a person\n",
    "sample.add_entity(range(8,9), \"ORGANIZATION\")  # Cleantext as an organization\n",
    "\n",
    "# Second training example\n",
    "sample_2 = ner_training_instance([\"Wczoraj\", \"spotkałem\", \"się\", \"z\", \"Robertem\", \"Nowakiem\", \"z\", \"Global\", \"Tech\", \".\"])\n",
    "sample_2.add_entity(range(4,6), \"PERSON\")  # Robert Nowak as a person\n",
    "sample_2.add_entity(range(7,9), \"ORGANIZATION\")  # Global Tech as an organization\n",
    "\n",
    "samples = [sample, sample_2]"
   ]
  },
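  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the annotations above, the sketch below prints the tokens that each `range` selects, using plain Python only. The names `check_tokens` and `check_spans` are illustrative and not part of MITIE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative check (not part of the MITIE API): show which tokens each span covers\n",
    "check_tokens = [\"Mam\", \"na\", \"imię\", \"Alicja\", \"Kowalska\", \"i\", \"pracuję\", \"w\", \"Cleantext\", \".\"]\n",
    "check_spans = [(range(3, 5), \"PERSON\"), (range(8, 9), \"ORGANIZATION\")]\n",
    "\n",
    "for span, label in check_spans:\n",
    "    # Join the tokens at the indices covered by the span\n",
    "    print(label + \": \" + \" \".join(check_tokens[i] for i in span))"
   ]
  },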
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cell 3: Initialize the Trainer\n",
    "Create the trainer from the word feature extractor file, add the training examples, and set the number of threads used during training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer = ner_trainer(\"./model/total_word_feature_extractor.dat\")\n",
    "\n",
    "for sample in samples:\n",
    "    trainer.add(sample)\n",
    "\n",
    "trainer.num_threads = 8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cell 4: Train the Model\n",
    "Train the named entity recognizer and save the trained model to disk for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training to recognize 2 labels: 'PERSON', 'ORGANIZATION'\n",
      "Part I: train segmenter\n",
      "words in dictionary: 200000\n",
      "num features: 271\n",
      "now do training\n",
      "C:           20\n",
      "epsilon:     0.01\n",
      "num threads: 8\n",
      "cache size:  5\n",
      "max iterations: 2000\n",
      "loss per missed segment:  3\n",
      "C: 20   loss: 3 \t0.5\n",
      "C: 35   loss: 3 \t0.5\n",
      "C: 20   loss: 4.5 \t0.5\n",
      "C: 5   loss: 3 \t0.5\n",
      "C: 20   loss: 1.5 \t0.5\n",
      "C: 21   loss: 3 \t0.5\n",
      "C: 20   loss: 3.1 \t0.5\n",
      "C: 19   loss: 3 \t0.5\n",
      "C: 20   loss: 3 \t0.5\n",
      "best C: 20\n",
      "best loss: 3\n",
      "num feats in chunker model: 4095\n",
      "train: precision, recall, f1-score: 1 1 1 \n",
      "Part I: elapsed time: 0 seconds.\n",
      "\n",
      "Part II: train segment classifier\n",
      "now do training\n",
      "num training samples: 4\n",
      "C: 200   f-score: 0.5\n",
      "C: 400   f-score: 0.5\n",
      "C: 300   f-score: 0.5\n",
      "C: 100   f-score: 0.5\n",
      "C: 0.01   f-score: 0.5\n",
      "C: 50.005   f-score: 0.5\n",
      "C: 25.0075   f-score: 0.5\n",
      "C: 12.5088   f-score: 0.5\n",
      "C: 6.25938   f-score: 0.5\n",
      "C: 3.13469   f-score: 0.5\n",
      "C: 1.57234   f-score: 0.5\n",
      "C: 0.791172   f-score: 0.5\n",
      "C: 0.400586   f-score: 0.5\n",
      "best C: 0.791172\n",
      "test on train: \n",
      "2 0 \n",
      "0 2 \n",
      "\n",
      "overall accuracy: 1\n",
      "Part II: elapsed time: 1 seconds.\n",
      "df.number_of_classes(): 2\n"
     ]
    }
   ],
   "source": [
    "ner = trainer.train()\n",
    "ner.save_to_disk(\"./output/ner_model.dat\")"
   ]
  },
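  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A model saved this way can be loaded in a later session without retraining. The cell below is a minimal sketch, assuming the file path from the previous cell is unchanged and that `named_entity_extractor` is importable from the same `mitie` package. The name `loaded_ner` is illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mitie import named_entity_extractor\n",
    "\n",
    "# Load the model written by save_to_disk above (path assumed unchanged)\n",
    "loaded_ner = named_entity_extractor(\"./output/ner_model.dat\")\n",
    "print(\"Tags:\", loaded_ner.get_possible_ner_tags())"
   ]
  },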
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cell 5: Test the Trained Model\n",
    "Run the trained model on a tokenized sample sentence and print the detected entities. Each result is a tuple of token range, tag, and score.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tags: ['PERSON', 'ORGANIZATION']\n",
      "\n",
      "Discovered entities: [(range(3, 5), 'PERSON', 0.18746685653645725), (range(6, 8), 'ORGANIZATION', 0.22672175876072712)]\n",
      "\n",
      "Amount of entities: 2\n",
      "    PERSON: Anną Kowalską\n",
      "    ORGANIZATION: Politechnice Warszawskiej\n"
     ]
    }
   ],
   "source": [
    "# Show possible tags\n",
    "print(\"Tags:\", ner.get_possible_ner_tags())\n",
    "\n",
    "# Test the model\n",
    "tokens = [\"Spotkałem\", \"się\", \"z\", \"Anną\", \"Kowalską\", \"w\", \"Politechnice\", \"Warszawskiej\", \".\"]\n",
    "entities = ner.extract_entities(tokens)\n",
    "\n",
    "# Print the results\n",
    "print(\"\\nDiscovered entities:\", entities)\n",
    "print(\"\\nAmount of entities:\", len(entities))\n",
    "for e in entities:\n",
    "    token_range = e[0]  # avoid shadowing the built-in range()\n",
    "    tag = e[1]\n",
    "    entity_text = \" \".join(tokens[i] for i in token_range)\n",
    "    print(\"    \" + tag + \": \" + entity_text)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mitie-polish-oNJ6WEaN-py3.10",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}