arxiv:2602.22045

DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

Published on Feb 25 · Submitted by Walter Hernandez Cruz on Feb 27

Abstract

The DLT-Corpus dataset, containing 2.98 billion tokens from diverse sources, enables analysis of technology emergence patterns and market-innovation correlations in the distributed ledger technology sector.

AI-generated summary

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus's utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.

Community



In summary, our contributions are:

  • DLT-Corpus: 2.98 billion tokens from 22.12 million documents (37,440 scientific publications, 49,023 patents, 22M social media posts) with rich metadata enabling cross-disciplinary research.

  • Innovation diffusion analysis: Evidence that Distributed Ledger Technologies (DLTs) follow traditional technology transfer patterns, with research preceding market expansion and creating a virtuous funding cycle.

  • Sentiment analysis dataset: 23,301 cryptocurrency news headlines and brief descriptions with crowdsourced annotations from active community members, addressing the need for domain-specific labeled data.

  • LedgerBERT: A domain-adapted language model achieving a 23% improvement over BERT-base on a DLT-specific NER task, developed through continued pre-training of SciBERT.

DLT-Corpus, sentiment analysis dataset, models, and code are publicly available to support reproducibility and future research.
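The headline corpus statistics above can be sanity-checked with a little arithmetic: the three source counts account for the quoted document total, and social media posts dominate it. This is a quick illustrative check using only figures quoted in the announcement; the per-source token split is not given there, so only document counts and the overall average are computed.

```python
# Sanity arithmetic on the DLT-Corpus statistics quoted above.
# All inputs are figures from the announcement; "22 million posts"
# is rounded in the text, so the counted total is approximate.

total_tokens = 2.98e9   # 2.98 billion tokens
total_docs = 22.12e6    # 22.12 million documents

publications = 37_440   # scientific publications
patents = 49_023        # USPTO patent filings
posts = 22_000_000      # social media posts

counted = publications + patents + posts
share_posts = posts / total_docs
avg_tokens_per_doc = total_tokens / total_docs

print(f"documents accounted for: {counted:,}")
print(f"social media share: {share_posts:.1%}")
print(f"avg tokens per document: {avg_tokens_per_doc:.0f}")
```

As expected for a corpus dominated by short social media posts, the average document is only about 135 tokens long, while the posts themselves make up roughly 99.5% of all documents.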


