File size: 18,481 Bytes
9cbeb98
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Stanza-CoreNLP-Interface.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2-4lzQTC9yxG",
        "colab_type": "text"
      },
      "source": [
        "# Stanza: A Tutorial on the Python CoreNLP Interface\n",
        "\n",
        "![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)\n",
        "![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)\n",
        "\n",
        "While the Stanza library implements accurate neural network modules for basic functionalities such as part-of-speech tagging and dependency parsing, the [Stanford CoreNLP Java library](https://stanfordnlp.github.io/CoreNLP/) has been developed for years and offers more complementary features such as coreference resolution and relation extraction. To unlock these features, the Stanza library also offers an officially maintained Python interface to the CoreNLP Java library. This interface allows you to get NLP anntotations from CoreNLP by writing native Python code.\n",
        "\n",
        "\n",
        "This tutorial walks you through the installation, setup and basic usage of this Python CoreNLP interface. If you want to learn how to use the neural network components in Stanza, please refer to other tutorials."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YpKwWeVkASGt",
        "colab_type": "text"
      },
      "source": [
        "## 1. Installation\n",
        "\n",
        "Before the installation starts, please make sure that you have Python 3 and Java installed on your computer. Since Colab already has them installed, we'll skip this procedure in this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "k1Az2ECuAfG8",
        "colab_type": "text"
      },
      "source": [
        "### Installing Stanza\n",
        "\n",
        "Installing and importing Stanza are as simple as running the following commands:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xiFwYAgW4Mss",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Install stanza; note that the prefix \"!\" is not needed if you are running in a terminal\n",
        "!pip install stanza\n",
        "\n",
        "# Import stanza\n",
        "import stanza"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2zFvaA8_A32_",
        "colab_type": "text"
      },
      "source": [
        "### Setting up Stanford CoreNLP\n",
        "\n",
        "In order for the interface to work, the Stanford CoreNLP library has to be installed and a `CORENLP_HOME` environment variable has to be pointed to the installation location.\n",
        "\n",
        "Here we are going to show you how to download and install the CoreNLP library on your machine, with Stanza's installation command:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MgK6-LPV-OdA",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Download the Stanford CoreNLP package with Stanza's installation command\n",
        "# This'll take several minutes, depending on the network speed\n",
        "corenlp_dir = './corenlp'\n",
        "stanza.install_corenlp(dir=corenlp_dir)\n",
        "\n",
        "# Set the CORENLP_HOME environment variable to point to the installation location\n",
        "import os\n",
        "os.environ[\"CORENLP_HOME\"] = corenlp_dir"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jdq8MT-NAhKj",
        "colab_type": "text"
      },
      "source": [
        "That's all for the installation! 🎉  We can now double check if the installation is successful by listing files in the CoreNLP directory. You should be able to see a number of `.jar` files by running the following command:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "K5eIOaJp_tuo",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Examine the CoreNLP installation folder to make sure the installation is successful\n",
        "!ls $CORENLP_HOME"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S0xb9BHt__gx",
        "colab_type": "text"
      },
      "source": [
        "**Note 1**:\n",
        "If you are want to use the interface in a terminal (instead of a Colab notebook), you can properly set the `CORENLP_HOME` environment variable with:\n",
        "\n",
        "```bash\n",
        "export CORENLP_HOME=path_to_corenlp_dir\n",
        "```\n",
        "\n",
        "Here we instead set this variable with the Python `os` library, simply because `export` command is not well-supported in Colab notebook.\n",
        "\n",
        "\n",
        "**Note 2**:\n",
        "The `stanza.install_corenlp()` function is only available since Stanza v1.1.1. If you are using an earlier version of Stanza, please check out our [manual installation page](https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation) for how to install CoreNLP on your computer.\n",
        "\n",
        "**Note 3**:\n",
        "Besides the installation function, we also provide a `stanza.download_corenlp_models()` function to help you download additional CoreNLP models for different languages that are not shipped with the default installation. Check out our [automatic installation website page](https://stanfordnlp.github.io/stanza/client_setup.html#automated-installation) for more information on how to use it."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xJsuO6D8D05q",
        "colab_type": "text"
      },
      "source": [
        "## 2. Annotating Text with CoreNLP Interface"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dZNHxXHkH1K2",
        "colab_type": "text"
      },
      "source": [
        "### Constructing CoreNLPClient\n",
        "\n",
        "At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python which can pass the text to the background server process, and accept the returned annotation results.\n",
        "\n",
        "We wrap these functionalities in a `CoreNLPClient` class. Therefore, we need to start by importing this class from Stanza."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LS4OKnqJ8wui",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Import client module\n",
        "from stanza.server import CoreNLPClient"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WP4Dz6PIJHeL",
        "colab_type": "text"
      },
      "source": [
        "After the import is done, we can construct a `CoreNLPClient` instance. The constructor method takes a Python list of annotator names as argument. Here let's explore some basic annotators including tokenization, sentence split, part-of-speech tagging, lemmatization and named entity recognition (NER). \n",
        "\n",
        "Additionally, the client constructor accepts a `memory` argument, which specifies how much memory will be allocated to the background Java process. An `endpoint` option can be used to specify a port number used by the communication between the server and the client. The default port is 9000. However, since this port is pre-occupied by a system process in Colab, we'll manually set it to 9001 in the following example.\n",
        "\n",
        "Also, here we manually set `be_quiet=True` to avoid an IO issue in colab notebook. You should be able to use `be_quiet=False` on your own computer, which will print detailed logging information from CoreNLP during usage.\n",
        "\n",
        "For more options in constructing the clients, please refer to the [CoreNLP Client Options List](https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mbOBugvd9JaM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001\n",
        "client = CoreNLPClient(\n",
        "    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], \n",
        "    memory='4G', \n",
        "    endpoint='http://localhost:9001',\n",
        "    be_quiet=True)\n",
        "print(client)\n",
        "\n",
        "# Start the background server and wait for some time\n",
        "# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed\n",
        "client.start()\n",
        "import time; time.sleep(10)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kgTiVjNydmIW",
        "colab_type": "text"
      },
      "source": [
        "After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "spZrJ-oFdkdF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Print background processes and look for java\n",
        "# You should be able to see a StanfordCoreNLPServer java process running in the background\n",
        "!ps -o pid,cmd | grep java"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KxJeJ0D2LoOs",
        "colab_type": "text"
      },
      "source": [
        "### Annotating Text\n",
        "\n",
        "Annotating a piece of text is as simple as passing the text into an `annotate` function of the client object. After the annotation is complete, a `Document`  object will be returned with all annotations.\n",
        "\n",
        "Note that although in general annotations are very fast, the first annotation might take a while to complete in the notebook. Please stay patient."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "s194RnNg5z95",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Annotate some text\n",
        "text = \"Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity.\"\n",
        "document = client.annotate(text)\n",
        "print(type(document))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "semmA3e0TcM1",
        "colab_type": "text"
      },
      "source": [
        "## 3. Accessing Annotations\n",
        "\n",
        "Annotations can be accessed from the returned `Document` object.\n",
        "\n",
        "A `Document` contains a list of `Sentence`s, which contain a list of `Token`s. Here let's first explore the annotations stored in all tokens."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lIO4B5d6Rk4I",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Iterate over all tokens in all sentences, and print out the word, lemma, pos and ner tags\n",
        "print(\"{:12s}\\t{:12s}\\t{:6s}\\t{}\".format(\"Word\", \"Lemma\", \"POS\", \"NER\"))\n",
        "\n",
        "for i, sent in enumerate(document.sentence):\n",
        "    print(\"[Sentence {}]\".format(i+1))\n",
        "    for t in sent.token:\n",
        "        print(\"{:12s}\\t{:12s}\\t{:6s}\\t{}\".format(t.word, t.lemma, t.pos, t.ner))\n",
        "    print(\"\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "msrJfvu8VV9m",
        "colab_type": "text"
      },
      "source": [
        "Alternatively, you can also browse the NER results by iterating over entity mentions over the sentences. For example:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ezEjc9LeV2Xs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Iterate over all detected entity mentions\n",
        "print(\"{:30s}\\t{}\".format(\"Mention\", \"Type\"))\n",
        "\n",
        "for sent in document.sentence:\n",
        "    for m in sent.mentions:\n",
        "        print(\"{:30s}\\t{}\".format(m.entityMentionText, m.entityType))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ueGzBZ3hWzkN",
        "colab_type": "text"
      },
      "source": [
        "To print all annotations a sentence, token or mention has, you can simply print the corresponding obejct."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4_S8o2BHXIed",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Print annotations of a token\n",
        "print(document.sentence[0].token[0])\n",
        "\n",
        "# Print annotations of a mention\n",
        "print(document.sentence[0].mentions[0])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Qp66wjZ10xia",
        "colab_type": "text"
      },
      "source": [
        "**Note**: Since the Stanza CoreNLP client interface simply ports the CoreNLP annotation results to native Python objects, for a comprehensive lists of available annotators and how their annotation results can be accessed, you will need to visit the [Stanford CoreNLP website](https://stanfordnlp.github.io/CoreNLP/)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IPqzMK90X0w3",
        "colab_type": "text"
      },
      "source": [
        "## 4. Shutting Down the CoreNLP Server\n",
        "\n",
        "To shut down the background CoreNLP server process, simply call the `stop` function of the client. Note that once a server is shutdown, you'll have to restart the server with the `start()` function before any annotation is requested."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xrJq8lZ3Nw7b",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Shut down the background CoreNLP server\n",
        "client.stop()\n",
        "\n",
        "time.sleep(10)\n",
        "!ps -o pid,cmd | grep java"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "23Vwa_ifYfF7",
        "colab_type": "text"
      },
      "source": [
        "### More Information\n",
        "\n",
        "For more information on how to use the `CoreNLPClient`, please go to the [CoreNLPClient documentation page](https://stanfordnlp.github.io/stanza/corenlp_client.html)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YUrVT6kA_Bzx",
        "colab_type": "text"
      },
      "source": [
        "## 5. Simplifying Client Usage with the Python `with` statement\n",
        "\n",
        "In the above demo, we explicitly called the `client.start()` and `client.stop()` functions to start and stop a client-server connection. However, doing this in practice is usually suboptimal, since you may forget to call the `stop()` function at the end, resulting in an unused server process occupying your machine memory.\n",
        "\n",
        "To solve is, a simple solution is to use the client interface with the [Python `with` statement](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement). The `with` statement provides an elegant way to automatically start and stop the server process in your Python program, without you needing to worry about this. The following code snippet demonstrates how to establish a client, annotate an example text and then stop the server with a simple `with` statement. Note that we **always recommend** you to use the `with` statement when working with the Stanza CoreNLP client interface."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "H0ct2-R4AvJh",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "print(\"Starting a server with the Python \\\"with\\\" statement...\")\n",
        "with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], \n",
        "                   memory='4G', endpoint='http://localhost:9001', be_quiet=True) as client:\n",
        "    text = \"Albert Einstein was a German-born theoretical physicist.\"\n",
        "    document = client.annotate(text)\n",
        "\n",
        "    print(\"{:30s}\\t{}\".format(\"Mention\", \"Type\"))\n",
        "    for sent in document.sentence:\n",
        "        for m in sent.mentions:\n",
        "            print(\"{:30s}\\t{}\".format(m.entityMentionText, m.entityType))\n",
        "\n",
        "print(\"\\nThe server should be stopped upon exit from the \\\"with\\\" statement.\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W435Lwc4YqKb",
        "colab_type": "text"
      },
      "source": [
        "## 6. Other Resources\n",
        "\n",
        "- [Stanza Homepage](https://stanfordnlp.github.io/stanza/)\n",
        "- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)\n",
        "- [GitHub Repo](https://github.com/stanfordnlp/stanza)\n",
        "- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)\n"
      ]
    }
  ]
}