---
datasets:
- ExponentialScience/DLT-Tweets
- ExponentialScience/DLT-Patents
- ExponentialScience/DLT-Scientific-Literature
language:
- en
base_model:
- allenai/scibert_scivocab_cased
---

# LedgerBERT

## Model Description

### Model Summary

LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.

LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.

- **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
- **Model type:** BERT-base encoder (bidirectional transformer)
- **Language:** English
- **License:** CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
- **Base model:** SciBERT (allenai/scibert_scivocab_cased)
- **Training corpus:** DLT-Corpus (2.98 billion tokens)

### Model Architecture

- **Architecture:** BERT-base
- **Parameters:** 110 million
- **Hidden size:** 768
- **Number of layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 30,522 (SciBERT vocabulary)
- **Max sequence length:** 512 tokens
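
These values can be read directly from the published configuration. A minimal sketch, assuming the checkpoint is hosted at `ExponentialScience/LedgerBERT` as in the usage examples below:

```python
from transformers import AutoConfig

# Inspect the architecture parameters listed above
config = AutoConfig.from_pretrained("ExponentialScience/LedgerBERT")

print(config.hidden_size)               # hidden size (768)
print(config.num_hidden_layers)         # transformer layers (12)
print(config.num_attention_heads)       # attention heads per layer (12)
print(config.vocab_size)                # embedding vocabulary size
print(config.max_position_embeddings)   # maximum sequence length (512)
```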

## Intended Uses

### Primary Use Cases

LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:

- **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle trees, hashing)
- **Text Classification**: Categorizing DLT-related documents, patents, or social media posts
- **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media
- **Information Extraction**: Extracting technical concepts and relationships from DLT literature
- **Document Retrieval**: Building search systems for DLT content
- **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics
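
Because the model is pre-trained with masked language modeling, a quick way to probe its domain knowledge is the fill-mask pipeline. A minimal sketch (the example sentence is purely illustrative):

```python
from transformers import pipeline

# Fill-mask probe: a domain-adapted model should prefer DLT-specific completions
fill_mask = pipeline("fill-mask", model="ExponentialScience/LedgerBERT")

for prediction in fill_mask("Ethereum moved from Proof of Work to Proof of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```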

### Out-of-Scope Uses

- **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions
- **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers
- **General-purpose NLP**: While LedgerBERT maintains general language understanding, it is optimized for DLT-specific tasks
- **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review

## Training Details

### Training Data

LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of:

- **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media**: 22.03M documents, 1,120M tokens (2013 to mid-2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets

**Total:** 22.12 million documents, 2.98 billion tokens

For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
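
The three component datasets are published on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal sketch, assuming each dataset loads with its default configuration:

```python
from datasets import load_dataset

# Load the three components of the DLT-Corpus from the Hugging Face Hub
literature = load_dataset("ExponentialScience/DLT-Scientific-Literature")
patents = load_dataset("ExponentialScience/DLT-Patents")
tweets = load_dataset("ExponentialScience/DLT-Tweets")

print(literature)  # inspect splits and column names
```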

### Training Procedure

**Continual Pre-training:**

Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.

**Training hyperparameters:**

- **Epochs:** 3
- **Learning rate:** 5×10⁻⁵ with linear decay schedule
- **MLM probability:** 0.15 (standard BERT masking)
- **Warmup ratio:** 0.10
- **Batch size:** 12 per device
- **Sequence length:** 512 tokens
- **Weight decay:** 0.01
- **Optimizer:** Stable AdamW
- **Precision:** bfloat16
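
A continual pre-training run with these hyperparameters can be sketched with the `transformers` Trainer. This is a minimal, illustrative setup: `dlt_corpus_dataset` is a placeholder for a tokenized version of the DLT-Corpus, and the optimizer shown is the library's default AdamW rather than the Stable AdamW variant listed above.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

# Start from SciBERT and continue masked language model pre-training
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# Randomly mask 15% of tokens, as in standard BERT pre-training
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./ledgerbert-mlm",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.10,
    per_device_train_batch_size=12,
    weight_decay=0.01,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dlt_corpus_dataset,  # placeholder: tokenized DLT-Corpus
    data_collator=data_collator,
)

trainer.train()
```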

## Limitations and Biases

### Known Limitations

- **Language coverage**: English only; does not support other languages
- **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology
- **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
- **Context length**: Limited to 512 tokens; longer documents require truncation or chunking (see the sketch after this list)
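
One common way to handle documents longer than 512 tokens is to let the tokenizer split them into overlapping windows. A minimal sketch, assuming a fast tokenizer; `long_text` and the stride value are illustrative:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

long_text = "..."  # illustrative placeholder for a document longer than 512 tokens

# Split the document into overlapping 512-token windows
chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,                        # tokens of overlap between consecutive windows
    return_overflowing_tokens=True,
    return_tensors="pt",
)

# One forward pass over all windows; per-window embeddings can then be pooled
outputs = model(input_ids=chunks["input_ids"], attention_mask=chunks["attention_mask"])
```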

### Potential Biases

The model may reflect biases present in the training data:

- **Geographic bias**: English-language sources may over-represent certain regions
- **Platform bias**: Social media data comes only from Twitter/X; other platforms are not represented
- **Temporal bias**: More recent DLT developments are more heavily represented
- **Market bias**: Training on data from periods of market volatility may influence sentiment understanding
- **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are discussed more than others

### Ethical Considerations

- **Market manipulation risk**: Could potentially be misused for analyzing or generating content for market manipulation
- **Investment decisions**: Should not be used as the sole basis for financial decisions without proper risk disclaimers
- **Misinformation**: May reproduce or fail to identify false claims present in the training data
- **Privacy**: While usernames were removed from the social media data, care should be taken not to re-identify individuals

## How to Use

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

# Example text
text = "Ethereum uses the Proof of Stake consensus mechanism for transaction validation."

# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# Get contextual token embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
```
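
For sentence-level representations (e.g., for the document-retrieval use case above), one common approach is to mean-pool the token embeddings while ignoring padding. A minimal sketch that continues from the code above:

```python
import torch

# Mean-pool token embeddings into a single vector per input, ignoring padding
with torch.no_grad():
    outputs = model(**inputs)

mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # sum of non-padding token vectors
sentence_embedding = summed / mask.sum(dim=1)            # (batch, hidden_size)
```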

### Fine-tuning for NER

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# Load for token classification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "ExponentialScience/LedgerBERT",
    num_labels=num_labels  # Set based on your NER label set
)

# Pads inputs and label sequences to a common length within each batch
data_collator = DataCollatorForTokenClassification(tokenizer)

# Fine-tune on your dataset: train_dataset and eval_dataset must already be
# tokenized, with word-level labels aligned to sub-word tokens (see the sketch below)
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    warmup_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator
)

trainer.train()
```
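
NER labels are typically provided per word, while the tokenizer splits words into sub-word pieces, so labels need to be aligned before training. A minimal sketch continuing from the code above (assuming a fast tokenizer; `words` and `word_labels` are illustrative placeholders):

```python
words = ["Ethereum", "uses", "Proof", "of", "Stake"]  # illustrative tokens for one sentence
word_labels = [1, 0, 2, 2, 2]                         # illustrative word-level label ids

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

# Give each sub-word piece a label: only the first piece of a word keeps the
# word's label, everything else gets -100 so it is ignored by the loss
labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:                  # special tokens ([CLS], [SEP])
        labels.append(-100)
    elif word_id != previous_word_id:    # first piece of a new word
        labels.append(word_labels[word_id])
    else:                                # continuation pieces of the same word
        labels.append(-100)
    previous_word_id = word_id

encoding["labels"] = labels
```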

### Sentiment Analysis with a Fine-tuned Model

A version of LedgerBERT fine-tuned for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")

text = "Bitcoin reaches new all-time high amid institutional adoption"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```
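
To turn the predicted class index into a human-readable label, the mapping stored in the model configuration can be used (label names depend on that checkpoint's config):

```python
# Map the predicted class index to its label name from the model config
predicted_label = model.config.id2label[predictions.item()]
print(predicted_label)
```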

## Citation

If you use LedgerBERT in your research, please cite:

```bibtex
@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}
```

## Related Resources

- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
- **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
- **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
- **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

## Model Card Contact

For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus