Spaces:
Running
Running
| !!! note | |
| To run this notebook in JupyterLab, load [`examples/ex0_0.ipynb`](https://github.com/DerwenAI/textgraphs/blob/main/examples/ex0_0.ipynb) | |
| # demo: TextGraphs + LLMs to construct a 'lemma graph' | |
| _TextGraphs_ library is intended for iterating through a sequence of paragraphs. | |
| ## environment | |
| ```python | |
| from IPython.display import display, HTML, Image, SVG | |
| import pathlib | |
| import typing | |
| from icecream import ic | |
| from pyinstrument import Profiler | |
| import matplotlib.pyplot as plt | |
| import pandas as pd | |
| import pyvis | |
| import spacy | |
| import textgraphs | |
| ``` | |
| ```python | |
| %load_ext watermark | |
| ``` | |
| ```python | |
| %watermark | |
| ``` | |
| Last updated: 2024-01-16T17:41:51.229985-08:00 | |
| Python implementation: CPython | |
| Python version : 3.10.11 | |
| IPython version : 8.20.0 | |
| Compiler : Clang 13.0.0 (clang-1300.0.29.30) | |
| OS : Darwin | |
| Release : 21.6.0 | |
| Machine : x86_64 | |
| Processor : i386 | |
| CPU cores : 8 | |
| Architecture: 64bit | |
| ```python | |
| %watermark --iversions | |
| ``` | |
| sys : 3.10.11 (v3.10.11:7d4cc5aa85, Apr 4 2023, 19:05:19) [Clang 13.0.0 (clang-1300.0.29.30)] | |
| spacy : 3.7.2 | |
| pandas : 2.1.4 | |
| matplotlib: 3.8.2 | |
| textgraphs: 0.5.0 | |
| pyvis : 0.3.2 | |
| ## parse a document | |
| provide the source text | |
| ```python | |
| SRC_TEXT: str = """ | |
| Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog. | |
| After the war, Werner fled to America to become famous. | |
| """ | |
| ``` | |
| set up the statistical stack profiling | |
| ```python | |
| profiler: Profiler = Profiler() | |
| profiler.start() | |
| ``` | |
| set up the `TextGraphs` pipeline | |
| ```python | |
| tg: textgraphs.TextGraphs = textgraphs.TextGraphs( | |
| factory = textgraphs.PipelineFactory( | |
| spacy_model = textgraphs.SPACY_MODEL, | |
| ner = None, | |
| kg = textgraphs.KGWikiMedia( | |
| spotlight_api = textgraphs.DBPEDIA_SPOTLIGHT_API, | |
| dbpedia_search_api = textgraphs.DBPEDIA_SEARCH_API, | |
| dbpedia_sparql_api = textgraphs.DBPEDIA_SPARQL_API, | |
| wikidata_api = textgraphs.WIKIDATA_API, | |
| min_alias = textgraphs.DBPEDIA_MIN_ALIAS, | |
| min_similarity = textgraphs.DBPEDIA_MIN_SIM, | |
| ), | |
| infer_rels = [ | |
| textgraphs.InferRel_OpenNRE( | |
| model = textgraphs.OPENNRE_MODEL, | |
| max_skip = textgraphs.MAX_SKIP, | |
| min_prob = textgraphs.OPENNRE_MIN_PROB, | |
| ), | |
| textgraphs.InferRel_Rebel( | |
| lang = "en_XX", | |
| mrebel_model = textgraphs.MREBEL_MODEL, | |
| ), | |
| ], | |
| ), | |
| ) | |
| pipe: textgraphs.Pipeline = tg.create_pipeline( | |
| SRC_TEXT.strip(), | |
| ) | |
| ``` | |
| ## visualize the parse results | |
| ```python | |
| spacy.displacy.render( | |
| pipe.ner_doc, | |
| style = "ent", | |
| jupyter = True, | |
| ) | |
| ``` | |
| <span class="tex2jax_ignore"><div class="entities" style="line-height: 2.5; direction: ltr"> | |
| <mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | |
| Werner Herzog | |
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span> | |
| </mark> | |
| is a remarkable filmmaker and an intellectual originally from | |
| <mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | |
| Germany | |
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">GPE</span> | |
| </mark> | |
| , the son of | |
| <mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | |
| Dietrich Herzog | |
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span> | |
| </mark> | |
| .<br>After the war, | |
| <mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | |
| Werner | |
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span> | |
| </mark> | |
| fled to | |
| <mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> | |
| America | |
| <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">GPE</span> | |
| </mark> | |
| to become famous.</div></span> | |
| ```python | |
| parse_svg: str = spacy.displacy.render( | |
| pipe.ner_doc, | |
| style = "dep", | |
| jupyter = False, | |
| ) | |
| display(SVG(parse_svg)) | |
| ``` | |
|  | |
| ## collect graph elements from the parse | |
| ```python | |
| tg.collect_graph_elements( | |
| pipe, | |
| debug = False, | |
| ) | |
| ``` | |
| ```python | |
| ic(len(tg.nodes.values())); | |
| ic(len(tg.edges.values())); | |
| ``` | |
| ic| len(tg.nodes.values()): 36 | |
| ic| len(tg.edges.values()): 42 | |
| ## perform entity linking | |
| ```python | |
| tg.perform_entity_linking( | |
| pipe, | |
| debug = False, | |
| ) | |
| ``` | |
| ## infer relations | |
| ```python | |
| inferred_edges: list = await tg.infer_relations_async( | |
| pipe, | |
| debug = False, | |
| ) | |
| inferred_edges | |
| ``` | |
| [Edge(src_node=0, dst_node=10, kind=<RelEnum.INF: 2>, rel='https://schema.org/nationality', prob=1.0, count=1), | |
| Edge(src_node=15, dst_node=0, kind=<RelEnum.INF: 2>, rel='https://schema.org/children', prob=1.0, count=1), | |
| Edge(src_node=27, dst_node=22, kind=<RelEnum.INF: 2>, rel='https://schema.org/event', prob=1.0, count=1)] | |
| ## construct a lemma graph | |
| ```python | |
| tg.construct_lemma_graph( | |
| debug = False, | |
| ) | |
| ``` | |
| ## extract ranked entities | |
| ```python | |
| tg.calc_phrase_ranks( | |
| pr_alpha = textgraphs.PAGERANK_ALPHA, | |
| debug = False, | |
| ) | |
| ``` | |
| show the resulting entities extracted from the document | |
| ```python | |
| df: pd.DataFrame = tg.get_phrases_as_df() | |
| df | |
| ``` | |
| <div> | |
| <style scoped> | |
| .dataframe tbody tr th:only-of-type { | |
| vertical-align: middle; | |
| } | |
| .dataframe tbody tr th { | |
| vertical-align: top; | |
| } | |
| .dataframe thead th { | |
| text-align: right; | |
| } | |
| </style> | |
| <table border="1" class="dataframe"> | |
| <thead> | |
| <tr style="text-align: right;"> | |
| <th></th> | |
| <th>node_id</th> | |
| <th>text</th> | |
| <th>pos</th> | |
| <th>label</th> | |
| <th>count</th> | |
| <th>weight</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <th>0</th> | |
| <td>0</td> | |
| <td>Werner Herzog</td> | |
| <td>PROPN</td> | |
| <td>dbr:Werner_Herzog</td> | |
| <td>1</td> | |
| <td>0.080547</td> | |
| </tr> | |
| <tr> | |
| <th>1</th> | |
| <td>10</td> | |
| <td>Germany</td> | |
| <td>PROPN</td> | |
| <td>dbr:Germany</td> | |
| <td>1</td> | |
| <td>0.080437</td> | |
| </tr> | |
| <tr> | |
| <th>2</th> | |
| <td>15</td> | |
| <td>Dietrich Herzog</td> | |
| <td>PROPN</td> | |
| <td>dbo:Person</td> | |
| <td>1</td> | |
| <td>0.079048</td> | |
| </tr> | |
| <tr> | |
| <th>3</th> | |
| <td>27</td> | |
| <td>America</td> | |
| <td>PROPN</td> | |
| <td>dbr:United_States</td> | |
| <td>1</td> | |
| <td>0.079048</td> | |
| </tr> | |
| <tr> | |
| <th>4</th> | |
| <td>24</td> | |
| <td>Werner</td> | |
| <td>PROPN</td> | |
| <td>dbo:Person</td> | |
| <td>1</td> | |
| <td>0.077633</td> | |
| </tr> | |
| <tr> | |
| <th>5</th> | |
| <td>4</td> | |
| <td>filmmaker</td> | |
| <td>NOUN</td> | |
| <td>owl:Thing</td> | |
| <td>1</td> | |
| <td>0.076309</td> | |
| </tr> | |
| <tr> | |
| <th>6</th> | |
| <td>22</td> | |
| <td>war</td> | |
| <td>NOUN</td> | |
| <td>owl:Thing</td> | |
| <td>1</td> | |
| <td>0.076309</td> | |
| </tr> | |
| <tr> | |
| <th>7</th> | |
| <td>32</td> | |
| <td>a remarkable filmmaker</td> | |
| <td>noun_chunk</td> | |
| <td>None</td> | |
| <td>1</td> | |
| <td>0.076077</td> | |
| </tr> | |
| <tr> | |
| <th>8</th> | |
| <td>7</td> | |
| <td>intellectual</td> | |
| <td>NOUN</td> | |
| <td>owl:Thing</td> | |
| <td>1</td> | |
| <td>0.074725</td> | |
| </tr> | |
| <tr> | |
| <th>9</th> | |
| <td>13</td> | |
| <td>son</td> | |
| <td>NOUN</td> | |
| <td>owl:Thing</td> | |
| <td>1</td> | |
| <td>0.074725</td> | |
| </tr> | |
| <tr> | |
| <th>10</th> | |
| <td>33</td> | |
| <td>an intellectual</td> | |
| <td>noun_chunk</td> | |
| <td>None</td> | |
| <td>1</td> | |
| <td>0.074606</td> | |
| </tr> | |
| <tr> | |
| <th>11</th> | |
| <td>34</td> | |
| <td>the son</td> | |
| <td>noun_chunk</td> | |
| <td>None</td> | |
| <td>1</td> | |
| <td>0.074606</td> | |
| </tr> | |
| <tr> | |
| <th>12</th> | |
| <td>35</td> | |
| <td>the war</td> | |
| <td>noun_chunk</td> | |
| <td>None</td> | |
| <td>1</td> | |
| <td>0.074606</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| ## visualize the lemma graph | |
| ```python | |
| render: textgraphs.RenderPyVis = tg.create_render() | |
| pv_graph: pyvis.network.Network = render.render_lemma_graph( | |
| debug = False, | |
| ) | |
| ``` | |
| initialize the layout parameters | |
| ```python | |
| pv_graph.force_atlas_2based( | |
| gravity = -38, | |
| central_gravity = 0.01, | |
| spring_length = 231, | |
| spring_strength = 0.7, | |
| damping = 0.8, | |
| overlap = 0, | |
| ) | |
| pv_graph.show_buttons(filter_ = [ "physics" ]) | |
| pv_graph.toggle_physics(True) | |
| ``` | |
| ```python | |
| pv_graph.prep_notebook() | |
| pv_graph.show("tmp.fig01.html") | |
| ``` | |
| tmp.fig01.html | |
|  | |
| ## generate a word cloud | |
| ```python | |
| wordcloud = render.generate_wordcloud() | |
| display(wordcloud.to_image()) | |
| ``` | |
|  | |
| ## cluster communities in the lemma graph | |
| In the tutorial | |
| <a href="https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a" target="_blank">"How to Convert Any Text Into a Graph of Concepts"</a>, | |
| Rahul Nayak uses the | |
| <a href="https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm"><em>girvan-newman</em></a> | |
| algorithm to split the graph into communities, then clusters on those communities. | |
| His approach works well for unsupervised clustering of key phrases which have been extracted from many documents. | |
| In contrast, Nayak was working with entities extracted from "chunks" of text, not with a text graph. | |
| ```python | |
| render.draw_communities(); | |
| ``` | |
|  | |
| ## graph of relations transform | |
| Show a transformed graph, based on _graph of relations_ (see: `lee2023ingram`) | |
| ```python | |
| graph: textgraphs.GraphOfRelations = textgraphs.GraphOfRelations( | |
| tg | |
| ) | |
| graph.seeds() | |
| graph.construct_gor() | |
| ``` | |
| ```python | |
| scores: typing.Dict[ tuple, float ] = graph.get_affinity_scores() | |
| pv_graph: pyvis.network.Network = graph.render_gor_pyvis(scores) | |
| pv_graph.force_atlas_2based( | |
| gravity = -38, | |
| central_gravity = 0.01, | |
| spring_length = 231, | |
| spring_strength = 0.7, | |
| damping = 0.8, | |
| overlap = 0, | |
| ) | |
| pv_graph.show_buttons(filter_ = [ "physics" ]) | |
| pv_graph.toggle_physics(True) | |
| pv_graph.prep_notebook() | |
| pv_graph.show("tmp.fig02.html") | |
| ``` | |
| tmp.fig02.html | |
|  | |
| *What does this transform provide?* | |
| By using a _graph of relations_ dual representation of our graph data, first and foremost we obtain a more compact representation of the relations in the graph, and means of making inferences (e.g., _link prediction_) where there is substantially more invariance in the training data. | |
| Also recognize that for a parse graph of a paragraph in the English language, the most interesting nodes will probably be either subjects (`nsubj`) or direct objects (`pobj`). Here in the _graph of relations_ we see illustrated how the important details from _entity linking_ tend to cluster near either `nsubj` or `pobj` entities, connected through punctuation. This is not as readily observed in the earlier visualization of the _lemma graph_. | |
| ## extract as RDF triples | |
| Extract the nodes and edges which have IRIs, to create an "abstraction layer" as a semantic graph at a higher level of detail above the _lemma graph_: | |
| ```python | |
| triples: str = tg.export_rdf() | |
| print(triples) | |
| ``` | |
| @base <https://github.com/DerwenAI/textgraphs/ns/> . | |
| @prefix dbo: <http://dbpedia.org/ontology/> . | |
| @prefix dbr: <http://dbpedia.org/resource/> . | |
| @prefix schema: <https://schema.org/> . | |
| @prefix skos: <http://www.w3.org/2004/02/skos/core#> . | |
| @prefix wd_ent: <http://www.wikidata.org/entity/> . | |
| dbr:Germany skos:definition "Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), constitutionally the Federal"@en ; | |
| skos:prefLabel "Germany"@en . | |
| dbr:United_States skos:definition "The United States of America (USA), commonly known as the United States (U.S. or US) or America"@en ; | |
| skos:prefLabel "United States"@en . | |
| dbr:Werner_Herzog skos:definition "Werner Herzog (German: [ˈvɛɐ̯nɐ ˈhɛɐ̯tsoːk]; born 5 September 1942) is a German film director"@en ; | |
| skos:prefLabel "Werner Herzog"@en . | |
| wd_ent:Q183 skos:definition "country in Central Europe"@en ; | |
| skos:prefLabel "Germany"@en . | |
| wd_ent:Q44131 skos:definition "German film director, producer, screenwriter, actor and opera director"@en ; | |
| skos:prefLabel "Werner Herzog"@en . | |
| <entity/america_PROPN> a dbo:Country ; | |
| skos:prefLabel "America"@en ; | |
| schema:event <entity/war_NOUN> . | |
| <entity/dietrich_PROPN_herzog_PROPN> a dbo:Person ; | |
| skos:prefLabel "Dietrich Herzog"@en ; | |
| schema:children <entity/werner_PROPN_herzog_PROPN> . | |
| <entity/filmmaker_NOUN> skos:prefLabel "filmmaker"@en . | |
| <entity/intellectual_NOUN> skos:prefLabel "intellectual"@en . | |
| <entity/son_NOUN> skos:prefLabel "son"@en . | |
| <entity/werner_PROPN> a dbo:Person ; | |
| skos:prefLabel "Werner"@en . | |
| <entity/germany_PROPN> a dbo:Country ; | |
| skos:prefLabel "Germany"@en . | |
| <entity/war_NOUN> skos:prefLabel "war"@en . | |
| <entity/werner_PROPN_herzog_PROPN> a dbo:Person ; | |
| skos:prefLabel "Werner Herzog"@en ; | |
| schema:nationality <entity/germany_PROPN> . | |
| dbo:Country skos:definition "Countries, cities, states"@en ; | |
| skos:prefLabel "country"@en . | |
| dbo:Person skos:definition "People, including fictional"@en ; | |
| skos:prefLabel "person"@en . | |
| ## statistical stack profile instrumentation | |
| ```python | |
| profiler.stop() | |
| ``` | |
| <pyinstrument.session.Session at 0x141446080> | |
| ```python | |
| profiler.print() | |
| ``` | |
| _ ._ __/__ _ _ _ _ _/_ Recorded: 17:41:51 Samples: 11163 | |
| /_//_/// /_\ / //_// / //_'/ // Duration: 57.137 CPU time: 72.235 | |
| / _/ v4.6.1 | |
| Program: /Users/paco/src/textgraphs/venv/lib/python3.10/site-packages/ipykernel_launcher.py -f /Users/paco/Library/Jupyter/runtime/kernel-8ffadb7d-3b45-4e0e-a94f-f098e5ad9fbe.json | |
| 57.136 _UnixSelectorEventLoop._run_once asyncio/base_events.py:1832 | |
| └─ 57.135 Handle._run asyncio/events.py:78 | |
| [12 frames hidden] asyncio, ipykernel, IPython | |
| 41.912 ZMQInteractiveShell.run_ast_nodes IPython/core/interactiveshell.py:3394 | |
| ├─ 20.701 <module> ../ipykernel_5151/1245857438.py:1 | |
| │ └─ 20.701 TextGraphs.perform_entity_linking textgraphs/doc.py:534 | |
| │ └─ 20.701 KGWikiMedia.perform_entity_linking textgraphs/kg.py:306 | |
| │ ├─ 10.790 KGWikiMedia._link_kg_search_entities textgraphs/kg.py:932 | |
| │ │ └─ 10.787 KGWikiMedia.dbpedia_search_entity textgraphs/kg.py:641 | |
| │ │ └─ 10.711 get requests/api.py:62 | |
| │ │ [37 frames hidden] requests, urllib3, http, socket, ssl,... | |
| │ ├─ 9.143 KGWikiMedia._link_spotlight_entities textgraphs/kg.py:851 | |
| │ │ └─ 9.140 KGWikiMedia.dbpedia_search_entity textgraphs/kg.py:641 | |
| │ │ └─ 9.095 get requests/api.py:62 | |
| │ │ [37 frames hidden] requests, urllib3, http, socket, ssl,... | |
| │ └─ 0.768 KGWikiMedia._secondary_entity_linking textgraphs/kg.py:1060 | |
| │ └─ 0.768 KGWikiMedia.wikidata_search textgraphs/kg.py:575 | |
| │ └─ 0.765 KGWikiMedia._wikidata_endpoint textgraphs/kg.py:444 | |
| │ └─ 0.765 get requests/api.py:62 | |
| │ [7 frames hidden] requests, urllib3 | |
| └─ 19.514 <module> ../ipykernel_5151/1708547378.py:1 | |
| ├─ 14.502 InferRel_Rebel.__init__ textgraphs/rel.py:121 | |
| │ └─ 14.338 pipeline transformers/pipelines/__init__.py:531 | |
| │ [39 frames hidden] transformers, torch, <built-in>, json | |
| ├─ 3.437 PipelineFactory.__init__ textgraphs/pipe.py:434 | |
| │ └─ 3.420 load spacy/__init__.py:27 | |
| │ [20 frames hidden] spacy, en_core_web_sm, catalogue, imp... | |
| ├─ 0.900 InferRel_OpenNRE.__init__ textgraphs/rel.py:33 | |
| │ └─ 0.888 get_model opennre/pretrain.py:126 | |
| └─ 0.672 TextGraphs.create_pipeline textgraphs/doc.py:103 | |
| └─ 0.672 PipelineFactory.create_pipeline textgraphs/pipe.py:508 | |
| └─ 0.672 Pipeline.__init__ textgraphs/pipe.py:216 | |
| └─ 0.672 English.__call__ spacy/language.py:1016 | |
| [11 frames hidden] spacy, spacy_dbpedia_spotlight, reque... | |
| 14.363 InferRel_Rebel.gen_triples_async textgraphs/pipe.py:188 | |
| ├─ 13.670 InferRel_Rebel.gen_triples textgraphs/rel.py:259 | |
| │ ├─ 12.439 InferRel_Rebel.tokenize_sent textgraphs/rel.py:145 | |
| │ │ └─ 12.436 TranslationPipeline.__call__ transformers/pipelines/text2text_generation.py:341 | |
| │ │ [42 frames hidden] transformers, torch, <built-in> | |
| │ └─ 1.231 KGWikiMedia.resolve_rel_iri textgraphs/kg.py:370 | |
| │ └─ 0.753 get_entity_dict_from_api qwikidata/linked_data_interface.py:21 | |
| │ [8 frames hidden] qwikidata, requests, urllib3 | |
| └─ 0.693 InferRel_OpenNRE.gen_triples textgraphs/rel.py:58 | |
| ## outro | |
| _\[ more parts are in progress, getting added to this demo \]_ | |