{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example code for Named Entity Recognition with MITIE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell 1: Import Libraries\n", "Import the MITIE classes used for training: `ner_training_instance` holds one annotated sentence and `ner_trainer` fits the model. No path setup is needed when the `mitie` package is installed in the active environment.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from mitie import ner_training_instance, ner_trainer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell 2: Create Training Examples\n", "Each training example is a tokenized sentence with annotated entity spans. A span is a `range` of token indices (start inclusive, end exclusive) paired with a tag, here `PERSON` or `ORGANIZATION`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# First training example\n", "sample = ner_training_instance([\"Mam\", \"na\", \"imię\", \"Alicja\", \"Kowalska\", \"i\", \"pracuję\", \"w\", \"Cleantext\", \".\"])\n", "sample.add_entity(range(3, 5), \"PERSON\")  # tokens 3-4: Alicja Kowalska\n", "sample.add_entity(range(8, 9), \"ORGANIZATION\")  # token 8: Cleantext\n", "\n", "# Second training example\n", "sample_2 = ner_training_instance([\"Wczoraj\", \"spotkałem\", \"się\", \"z\", \"Robertem\", \"Nowakiem\", \"z\", \"Global\", \"Tech\", \".\"])\n", "sample_2.add_entity(range(4, 6), \"PERSON\")  # tokens 4-5: Robertem Nowakiem (Robert Nowak)\n", "sample_2.add_entity(range(7, 9), \"ORGANIZATION\")  # tokens 7-8: Global Tech\n", "\n", "samples = [sample, sample_2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell 3: Initialize the Trainer\n", "Create the trainer from the pretrained `total_word_feature_extractor.dat` file, add the training examples, and set the number of threads used during training." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "trainer = ner_trainer(\"./model/total_word_feature_extractor.dat\")\n", "\n", "for sample in samples:\n", " trainer.add(sample)\n", "\n", "trainer.num_threads = 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell 4: Train the Model\n", "Train the named entity recognizer and save the trained model to disk for future use." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training to recognize 2 labels: 'PERSON', 'ORGANIZATION'\n", "Part I: train segmenter\n", "words in dictionary: 200000\n", "num features: 271\n", "now do training\n", "C: 20\n", "epsilon: 0.01\n", "num threads: 8\n", "cache size: 5\n", "max iterations: 2000\n", "loss per missed segment: 3\n", "C: 20 loss: 3 \t0.5\n", "C: 35 loss: 3 \t0.5\n", "C: 20 loss: 4.5 \t0.5\n", "C: 5 loss: 3 \t0.5\n", "C: 20 loss: 1.5 \t0.5\n", "C: 21 loss: 3 \t0.5\n", "C: 20 loss: 3.1 \t0.5\n", "C: 19 loss: 3 \t0.5\n", "C: 20 loss: 3 \t0.5\n", "best C: 20\n", "best loss: 3\n", "num feats in chunker model: 4095\n", "train: precision, recall, f1-score: 1 1 1 \n", "Part I: elapsed time: 0 seconds.\n", "\n", "Part II: train segment classifier\n", "now do training\n", "num training samples: 4\n", "C: 200 f-score: 0.5\n", "C: 400 f-score: 0.5\n", "C: 300 f-score: 0.5\n", "C: 100 f-score: 0.5\n", "C: 0.01 f-score: 0.5\n", "C: 50.005 f-score: 0.5\n", "C: 25.0075 f-score: 0.5\n", "C: 12.5088 f-score: 0.5\n", "C: 6.25938 f-score: 0.5\n", "C: 3.13469 f-score: 0.5\n", "C: 1.57234 f-score: 0.5\n", "C: 0.791172 f-score: 0.5\n", "C: 0.400586 f-score: 0.5\n", "best C: 0.791172\n", "test on train: \n", "2 0 \n", "0 2 \n", "\n", "overall accuracy: 1\n", "Part II: elapsed time: 1 seconds.\n", "df.number_of_classes(): 2\n" ] } ], "source": [ "ner = trainer.train()\n", "ner.save_to_disk(\"./output/ner_model.dat\")" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Cell 5: Test the Trained Model\n", "Test the trained model with a sample sentence and display the detected entities along with their tags.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tags: ['PERSON', 'ORGANIZATION']\n", "\n", "Discovered entities: [(range(3, 5), 'PERSON', 0.18746685653645725), (range(6, 8), 'ORGANIZATION', 0.22672175876072712)]\n", "\n", "Amount of entities: 2\n", " PERSON: Anną Kowalską\n", " ORGANIZATION: Politechnice Warszawskiej\n" ] } ], "source": [ "# Show possible tags\n", "print(\"Tags:\", ner.get_possible_ner_tags())\n", "\n", "# Test the model\n", "tokens = [\"Spotkałem\", \"się\", \"z\", \"Anną\", \"Kowalską\", \"w\", \"Politechnice\", \"Warszawskiej\", \".\"]\n", "entities = ner.extract_entities(tokens)\n", "\n", "# Print the results\n", "print(\"\\nDiscovered entities:\", entities)\n", "print(\"\\nAmount of entities:\", len(entities))\n", "for e in entities:\n", " range = e[0]\n", " tag = e[1]\n", " entity_text = \" \".join(tokens[i] for i in range)\n", " print(\" \" + tag + \": \" + entity_text)\n" ] } ], "metadata": { "kernelspec": { "display_name": "mitie-polish-oNJ6WEaN-py3.10", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }