{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Stanza-CoreNLP-Interface.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "2-4lzQTC9yxG", "colab_type": "text" }, "source": [ "# Stanza: A Tutorial on the Python CoreNLP Interface\n", "\n", "![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)\n", "![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)\n", "\n", "While the Stanza library implements accurate neural network modules for basic functionalities such as part-of-speech tagging and dependency parsing, the [Stanford CoreNLP Java library](https://stanfordnlp.github.io/CoreNLP/) has been developed over many years and offers complementary features such as coreference resolution and relation extraction. To unlock these features, the Stanza library also offers an officially maintained Python interface to the CoreNLP Java library. This interface allows you to get NLP annotations from CoreNLP by writing native Python code.\n", "\n", "\n", "This tutorial walks you through the installation, setup and basic usage of this Python CoreNLP interface. If you want to learn how to use the neural network components in Stanza, please refer to other tutorials." ] }, { "cell_type": "markdown", "metadata": { "id": "YpKwWeVkASGt", "colab_type": "text" }, "source": [ "## 1. Installation\n", "\n", "Before you start the installation, please make sure that you have Python 3 and Java installed on your computer. Since Colab already has them installed, we'll skip this step in this notebook."
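, "\n", "\n",
"If you are working on your own machine rather than in Colab, a quick check along the following lines can confirm both prerequisites before you continue. This is only a minimal sketch that uses the Python standard library: `check_prerequisites` is a hypothetical helper name introduced for illustration and is not part of Stanza, and finding `java` on your `PATH` does not guarantee that the installed Java version is recent enough for CoreNLP.\n",
"\n",
"```python\n",
"import shutil\n",
"import sys\n",
"\n",
"\n",
"def check_prerequisites():\n",
"    # Hypothetical helper (not part of Stanza): report whether each prerequisite was found\n",
"    return {\n",
"        'python3': sys.version_info.major == 3,\n",
"        # shutil.which returns None when no java executable is found on PATH\n",
"        'java': shutil.which('java') is not None,\n",
"    }\n",
"\n",
"\n",
"print(check_prerequisites())\n",
"```\n",
"\n",
"If `java` is reported as missing, install a Java runtime before moving on to the CoreNLP setup."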
] }, { "cell_type": "markdown", "metadata": { "id": "k1Az2ECuAfG8", "colab_type": "text" }, "source": [ "### Installing Stanza\n", "\n", "Installing and importing Stanza is as simple as running the following commands:" ] }, { "cell_type": "code", "metadata": { "id": "xiFwYAgW4Mss", "colab_type": "code", "colab": {} }, "source": [ "# Install stanza; note that the prefix \"!\" is not needed if you are running in a terminal\n", "!pip install stanza\n", "\n", "# Import stanza\n", "import stanza" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "2zFvaA8_A32_", "colab_type": "text" }, "source": [ "### Setting up Stanford CoreNLP\n", "\n", "In order for the interface to work, the Stanford CoreNLP library has to be installed and the `CORENLP_HOME` environment variable has to point to the installation location.\n", "\n", "Here we show how to download and install the CoreNLP library on your machine with Stanza's installation command:" ] }, { "cell_type": "code", "metadata": { "id": "MgK6-LPV-OdA", "colab_type": "code", "colab": {} }, "source": [ "# Download the Stanford CoreNLP package with Stanza's installation command\n", "# This may take several minutes, depending on the network speed\n", "corenlp_dir = './corenlp'\n", "stanza.install_corenlp(dir=corenlp_dir)\n", "\n", "# Set the CORENLP_HOME environment variable to point to the installation location\n", "import os\n", "os.environ[\"CORENLP_HOME\"] = corenlp_dir" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Jdq8MT-NAhKj", "colab_type": "text" }, "source": [ "That's all for the installation! 🎉 We can now double-check that the installation succeeded by listing the files in the CoreNLP directory. 
You should be able to see a number of `.jar` files by running the following command:" ] }, { "cell_type": "code", "metadata": { "id": "K5eIOaJp_tuo", "colab_type": "code", "colab": {} }, "source": [ "# Examine the CoreNLP installation folder to make sure the installation is successful\n", "!ls $CORENLP_HOME" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "S0xb9BHt__gx", "colab_type": "text" }, "source": [ "**Note 1**:\n", "If you want to use the interface in a terminal (instead of a Colab notebook), you can set the `CORENLP_HOME` environment variable with:\n", "\n", "```bash\n", "export CORENLP_HOME=path_to_corenlp_dir\n", "```\n", "\n", "Here we instead set this variable with the Python `os` module, simply because the `export` command is not well supported in Colab notebooks.\n", "\n", "\n", "**Note 2**:\n", "The `stanza.install_corenlp()` function is only available in Stanza v1.1.1 and later. If you are using an earlier version of Stanza, please check out our [manual installation page](https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation) for how to install CoreNLP on your computer.\n", "\n", "**Note 3**:\n", "Besides the installation function, we also provide a `stanza.download_corenlp_models()` function to help you download additional CoreNLP models for languages that are not shipped with the default installation. Check out our [automated installation page](https://stanfordnlp.github.io/stanza/client_setup.html#automated-installation) for more information on how to use it." ] }, { "cell_type": "markdown", "metadata": { "id": "xJsuO6D8D05q", "colab_type": "text" }, "source": [ "## 2. 
Annotating Text with the CoreNLP Interface" ] }, { "cell_type": "markdown", "metadata": { "id": "dZNHxXHkH1K2", "colab_type": "text" }, "source": [ "### Constructing CoreNLPClient\n", "\n", "At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python that passes text to the background server process and accepts the returned annotation results.\n", "\n", "We wrap this functionality in a `CoreNLPClient` class. Therefore, we need to start by importing this class from Stanza." ] }, { "cell_type": "code", "metadata": { "id": "LS4OKnqJ8wui", "colab_type": "code", "colab": {} }, "source": [ "# Import the client class\n", "from stanza.server import CoreNLPClient" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WP4Dz6PIJHeL", "colab_type": "text" }, "source": [ "After the import is done, we can construct a `CoreNLPClient` instance. The constructor takes a Python list of annotator names as an argument. Here let's explore some basic annotators, including tokenization, sentence splitting, part-of-speech tagging, lemmatization and named entity recognition (NER).\n", "\n", "Additionally, the client constructor accepts a `memory` argument, which specifies how much memory will be allocated to the background Java process. An `endpoint` argument specifies the address, including the port number, that the client and the server use to communicate. The default port is 9000. However, since this port is already occupied by a system process in Colab, we'll manually set it to 9001 in the following example.\n", "\n", "Also, here we manually set `be_quiet=True` to avoid an IO issue in Colab notebooks. 
You should be able to use `be_quiet=False` on your own computer, which will print detailed logging information from CoreNLP during usage.\n", "\n", "For more options in constructing the client, please refer to the [CoreNLP Client Options List](https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options)." ] }, { "cell_type": "code", "metadata": { "id": "mbOBugvd9JaM", "colab_type": "code", "colab": {} }, "source": [ "# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001\n", "client = CoreNLPClient(\n", "    annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'],\n", "    memory='4G',\n", "    endpoint='http://localhost:9001',\n", "    be_quiet=True)\n", "print(client)\n", "\n", "# Start the background server and wait for some time\n", "# Note that this step is optional in practice: by default the server is started when the first annotation is requested\n", "client.start()\n", "import time\n", "time.sleep(10)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "kgTiVjNydmIW", "colab_type": "text" }, "source": [ "After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running." ] }, { "cell_type": "code", "metadata": { "id": "spZrJ-oFdkdF", "colab_type": "code", "colab": {} }, "source": [ "# Print background processes and look for java\n", "# You should be able to see a StanfordCoreNLPServer java process running in the background\n", "!ps -o pid,cmd | grep java" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "KxJeJ0D2LoOs", "colab_type": "text" }, "source": [ "### Annotating Text\n", "\n", "Annotating a piece of text is as simple as passing the text to the `annotate` function of the client object. 
After the annotation is complete, a `Document` object will be returned with all annotations.\n", "\n", "Note that although annotation is generally very fast, the first annotation might take a while to complete in the notebook. Please be patient." ] }, { "cell_type": "code", "metadata": { "id": "s194RnNg5z95", "colab_type": "code", "colab": {} }, "source": [ "# Annotate some text\n", "text = \"Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity.\"\n", "document = client.annotate(text)\n", "print(type(document))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "semmA3e0TcM1", "colab_type": "text" }, "source": [ "## 3. Accessing Annotations\n", "\n", "Annotations can be accessed from the returned `Document` object.\n", "\n", "A `Document` contains a list of `Sentence`s, each of which contains a list of `Token`s. Here let's first explore the annotations stored in each token." ] }, { "cell_type": "code", "metadata": { "id": "lIO4B5d6Rk4I", "colab_type": "code", "colab": {} }, "source": [ "# Iterate over all tokens in all sentences, and print out the word, lemma, pos and ner tags\n", "print(\"{:12s}\\t{:12s}\\t{:6s}\\t{}\".format(\"Word\", \"Lemma\", \"POS\", \"NER\"))\n", "\n", "for i, sent in enumerate(document.sentence):\n", "    print(\"[Sentence {}]\".format(i+1))\n", "    for t in sent.token:\n", "        print(\"{:12s}\\t{:12s}\\t{:6s}\\t{}\".format(t.word, t.lemma, t.pos, t.ner))\n", "    print(\"\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "msrJfvu8VV9m", "colab_type": "text" }, "source": [ "Alternatively, you can browse the NER results by iterating over the entity mentions in each sentence. 
For example:" ] }, { "cell_type": "code", "metadata": { "id": "ezEjc9LeV2Xs", "colab_type": "code", "colab": {} }, "source": [ "# Iterate over all detected entity mentions\n", "print(\"{:30s}\\t{}\".format(\"Mention\", \"Type\"))\n", "\n", "for sent in document.sentence:\n", "    for m in sent.mentions:\n", "        print(\"{:30s}\\t{}\".format(m.entityMentionText, m.entityType))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ueGzBZ3hWzkN", "colab_type": "text" }, "source": [ "To print all annotations of a sentence, token or mention, you can simply print the corresponding object." ] }, { "cell_type": "code", "metadata": { "id": "4_S8o2BHXIed", "colab_type": "code", "colab": {} }, "source": [ "# Print annotations of a token\n", "print(document.sentence[0].token[0])\n", "\n", "# Print annotations of a mention\n", "print(document.sentence[0].mentions[0])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Qp66wjZ10xia", "colab_type": "text" }, "source": [ "**Note**: The Stanza CoreNLP client interface simply ports the CoreNLP annotation results to native Python objects. For a comprehensive list of available annotators and how their annotation results can be accessed, you will need to visit the [Stanford CoreNLP website](https://stanfordnlp.github.io/CoreNLP/)." ] }, { "cell_type": "markdown", "metadata": { "id": "IPqzMK90X0w3", "colab_type": "text" }, "source": [ "## 4. Shutting Down the CoreNLP Server\n", "\n", "To shut down the background CoreNLP server process, simply call the `stop()` function of the client. Note that once a server is shut down, you'll have to restart it with the `start()` function before requesting any further annotation."
] }, { "cell_type": "code", "metadata": { "id": "xrJq8lZ3Nw7b", "colab_type": "code", "colab": {} }, "source": [ "# Shut down the background CoreNLP server\n", "client.stop()\n", "\n", "# Wait briefly, then list the java processes again\n", "time.sleep(10)\n", "!ps -o pid,cmd | grep java" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "23Vwa_ifYfF7", "colab_type": "text" }, "source": [ "### More Information\n", "\n", "For more information on how to use the `CoreNLPClient`, please go to the [CoreNLPClient documentation page](https://stanfordnlp.github.io/stanza/corenlp_client.html)." ] }, { "cell_type": "markdown", "metadata": { "id": "YUrVT6kA_Bzx", "colab_type": "text" }, "source": [ "## 5. Simplifying Client Usage with the Python `with` Statement\n", "\n", "In the above demo, we explicitly called the `client.start()` and `client.stop()` functions to start and stop a client-server connection. However, doing this in practice is usually suboptimal, since you may forget to call the `stop()` function at the end, leaving an unused server process that occupies your machine's memory.\n", "\n", "To solve this, a simple solution is to use the client interface with the [Python `with` statement](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement). The `with` statement provides an elegant way to automatically start and stop the server process in your Python program, so you do not need to manage it yourself. The following code snippet demonstrates how to establish a client, annotate an example text and then stop the server with a simple `with` statement. Note that we **always recommend** using the `with` statement when working with the Stanza CoreNLP client interface."
] }, { "cell_type": "code", "metadata": { "id": "H0ct2-R4AvJh", "colab_type": "code", "colab": {} }, "source": [ "print(\"Starting a server with the Python \\\"with\\\" statement...\")\n", "with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'],\n", "                   memory='4G', endpoint='http://localhost:9001', be_quiet=True) as client:\n", "    text = \"Albert Einstein was a German-born theoretical physicist.\"\n", "    document = client.annotate(text)\n", "\n", "    print(\"{:30s}\\t{}\".format(\"Mention\", \"Type\"))\n", "    for sent in document.sentence:\n", "        for m in sent.mentions:\n", "            print(\"{:30s}\\t{}\".format(m.entityMentionText, m.entityType))\n", "\n", "print(\"\\nThe server should be stopped upon exit from the \\\"with\\\" statement.\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "W435Lwc4YqKb", "colab_type": "text" }, "source": [ "## 6. Other Resources\n", "\n", "- [Stanza Homepage](https://stanfordnlp.github.io/stanza/)\n", "- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)\n", "- [GitHub Repo](https://github.com/stanfordnlp/stanza)\n", "- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)\n" ] } ] }