Spaces:

DerwenAI
/

textgraphs

Running

App Files Files Community

textgraphs / docs /nlp.md

Paco Nathan

A new start

91eaff6 almost 2 years ago

|

history blame contribute delete

1.5 kB

	The open source `spaCy` library in Python provides full-featured NLP capabilities.
	[#honnibal2020spacy](biblio.md#honnibal2020spacy)
	This serves as a core component of this project.
	Recent releases of `spaCy` have provided features to integrate with selected large models, and also support native features for entity linking.

	On the one hand, `spaCy` pipelines offer a broad range of integrations and "opinionated" selections for both utility and ease of use.
	The resulting pipelines are optimized for annotating streams of spans of tokens.
	On the other hand, the opinionated API calls and the abstractions use for pipeline construction and configuration present some important constraints:

	- Pipelines are not especially well-suited for propagating other forms of generated data, beyond token/span streams.
	- Tokenization used in `spaCy` does not align with the requirements for relation extraction projects of interest.
	- Entity linking capabilities rely on using an internally defined "knowledge base" which is not well-suited for integrating with heterogeneous resources.

	Consequently, while `spaCy` serves as a core component for NLP capabilities, this project presents a library of Python class definitions for KG construction which can be extended and configured to accommodate a broad range of LLM components.
	These "less opinionated" pipeline definitions, in the broader scope, are optimized for managing streams of KG candidate elements which have been produced by generative AI.