walterhernandez committed on
Commit
4ae6f60
·
verified ·
1 Parent(s): 4f96a1e

Added Model Card

Files changed (1)
  1. README.md +254 -0
README.md ADDED
@@ -0,0 +1,254 @@
+ ---
+ datasets:
+ - ExponentialScience/DLT-Sentiment-News
+ language:
+ - en
+ base_model:
+ - ExponentialScience/LedgerBERT
+ ---
+ # LedgerBERT-Market-Sentiment
+
+ ## Model Description
+
+ ### Model Summary
+
+ LedgerBERT-Market-Sentiment is a fine-tuned version of LedgerBERT (https://huggingface.co/ExponentialScience/LedgerBERT) specialized for sentiment analysis of cryptocurrency and DLT-related content. The model classifies text into three market direction sentiment categories: **bullish** (positive market outlook), **bearish** (negative market outlook), and **neutral** (balanced or unclear market direction).
+
+ This model is particularly effective for analyzing cryptocurrency news headlines, social media posts, and other DLT-related content where understanding market sentiment is important.
+
+ - **Model type:** BERT-base encoder for sequence classification
+ - **Language:** English
+ - **License:** Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0)
+ - **Base model:** LedgerBERT (ExponentialScience/LedgerBERT)
+ - **Fine-tuning dataset:** DLT-Sentiment-News (23,301 examples)
+ - **Task:** Multi-class sentiment classification (3 classes)
+
+ ### Model Architecture
+
+ - **Architecture:** BERT-base for sequence classification
+ - **Parameters:** 110 million
+ - **Hidden size:** 768
+ - **Number of layers:** 12
+ - **Attention heads:** 12
+ - **Vocabulary size:** 30,522 (SciBERT vocabulary)
+ - **Max sequence length:** 512 tokens
+ - **Output:** 3-class logits (bullish, bearish, neutral)
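As a rough check on the 110 million figure, the parameter count can be reproduced from the dimensions listed above. The sketch below assumes the standard BERT-base layout (feed-forward size 3,072, two token-type embeddings, a pooler layer, and a single linear classification head). Those values are assumptions and are not stated in this card.

```python
# Rough parameter count for a BERT-base sequence classifier.
# Values marked "assumption" are standard BERT-base defaults, not taken from the card.
VOCAB_SIZE = 30_522
HIDDEN = 768
LAYERS = 12
MAX_POSITIONS = 512
NUM_LABELS = 3
INTERMEDIATE = 3_072  # assumption: feed-forward width of BERT-base
TOKEN_TYPES = 2       # assumption: segment (token-type) embeddings


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(size: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def count_parameters() -> int:
    # Word, position, and token-type embeddings share the hidden size
    embeddings = (VOCAB_SIZE + MAX_POSITIONS + TOKEN_TYPES) * HIDDEN + layer_norm(HIDDEN)
    # Query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    # Two dense layers of the feed-forward block, followed by a LayerNorm
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm(HIDDEN)
    )
    encoder = LAYERS * (attention + feed_forward)
    pooler = linear(HIDDEN, HIDDEN)
    classifier = linear(HIDDEN, NUM_LABELS)
    return embeddings + encoder + pooler + classifier


total = count_parameters()
print(f"{total:,} parameters")
```

Under these assumptions the total lands just under 110 million, and the 3-class head contributes only a few thousand of those parameters.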
+
+ ## Intended Uses
+
+ ### Primary Use Cases
+
+ This model is designed for sentiment analysis tasks in the cryptocurrency and DLT domain:
+
+ - **Market sentiment analysis**: Analyzing sentiment in cryptocurrency news articles, headlines, and market commentary
+ - **Social media monitoring**: Understanding market direction sentiment in tweets, Reddit posts, and forum discussions
+ - **News aggregation**: Automatically categorizing cryptocurrency news by market sentiment
+ - **Research applications**: Studying sentiment trends and their relationship to market dynamics
+ - **Content filtering**: Organizing DLT content based on market outlook
+
+ ### Example Applications
+
+ ```python
+ # Illustrative inputs and the label expected for each
+ expected_labels = {
+     # Analyzing news headlines
+     "Bitcoin surges to new all-time high": "bullish",
+     "Ethereum faces regulatory scrutiny": "bearish",
+     "Stablecoin market remains stable": "neutral",
+     # Social media sentiment
+     "To the moon! 🚀": "bullish",
+     "Another crypto winter incoming": "bearish",
+     "Waiting for clear market direction": "neutral",
+ }
+ ```
+
+ ### Out-of-Scope Uses
+
+ - **Investment decisions**: This model should NOT be used as the sole basis for making investment or trading decisions
+ - **Financial advice**: Not suitable for providing personalized financial or investment recommendations
+ - **Real-time trading**: Should not be used for automated high-frequency trading systems
+ - **Market manipulation**: Must not be used to coordinate or facilitate market manipulation
+ - **General sentiment analysis**: Optimized for market direction sentiment; may not perform well on general emotional sentiment
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was fine-tuned on the **DLT-Sentiment-News dataset**, which contains:
+
+ - **Size:** 23,301 examples
+ - **Tokens:** 1.85 million tokens (average 79.51 tokens per example)
+ - **Temporal coverage:** January 2021 to May 2025
+ - **Source:** CryptoPanic platform cryptocurrency news headlines and descriptions
+ - **Labels:** Crowdsourced votes from active cryptocurrency community members
+ - **Classification method:** Percentile-based labeling (25th and 75th percentiles as boundaries)
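The percentile-based labeling step can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: it assumes each example carries a single numeric vote score (a hypothetical quantity, for example bullish votes minus bearish votes), and that scores below the 25th percentile map to bearish, scores above the 75th percentile map to bullish, and everything in between maps to neutral.

```python
from statistics import quantiles


def percentile_labels(scores: list[float]) -> list[str]:
    """Assign a label to each score using the 25th/75th percentiles as boundaries.

    Assumption: low scores mean bearish and high scores mean bullish.
    """
    # With n=4, quantiles() returns the three quartile cut points
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    labels = []
    for score in scores:
        if score < q1:
            labels.append("bearish")
        elif score > q3:
            labels.append("bullish")
        else:
            labels.append("neutral")
    return labels


# Hypothetical vote scores for nine headlines
example_scores = [-5, -2, -1, 0, 0, 1, 2, 3, 8]
example_labels = percentile_labels(example_scores)
print(list(zip(example_scores, example_labels)))
```

Because the boundaries are computed from the data itself, the same score can receive a different label if the surrounding distribution shifts.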
+
+ **Labels by sentiment dimension:**
+ - **Market Direction:** bullish, bearish, neutral
+
+ Because the annotations come from cryptocurrency users rather than general crowdworkers, the labels carry domain expertise that general-purpose annotation typically lacks.
+
+ **Note:** News articles are absent from the DLT-Corpus used to pre-train LedgerBERT, making this an out-of-domain generalization test of the model's language understanding.
+
+ For more details on the dataset used for fine-tuning, see: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
+
+ ### Training Procedure
+
+ **Fine-tuning hyperparameters:**
+ - **Epochs:** 3
+ - **Learning rate:** 2×10⁻⁵
+ - **Warmup steps:** 500
+ - **Batch size:** 8 per device (training and evaluation)
+ - **Train/test split:** 90% training, 10% testing
+ - **Optimizer:** AdamW with fused operations
+ - **Precision:** bfloat16
+ - **Max sequence length:** 512 tokens (tokenizer default)
+ - **Truncation:** Enabled
+ - **Padding:** Enabled
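For orientation, the settings above imply roughly the following number of optimizer steps. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept. None of these are stated in the card, and the exact split size depends on how the 90/10 split rounds.

```python
import math

# Values taken from the hyperparameter list above
DATASET_SIZE = 23_301
TRAIN_PERCENT = 90
BATCH_SIZE = 8
EPOCHS = 3
WARMUP_STEPS = 500

# Assumption: the training split is rounded down to a whole number of examples
train_examples = DATASET_SIZE * TRAIN_PERCENT // 100
# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(train_examples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_share = WARMUP_STEPS / total_steps

print(f"training examples: {train_examples}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
print(f"warmup share:      {warmup_share:.1%}")
```

With more devices or gradient accumulation the step count shrinks proportionally, so the 500 warmup steps would cover a larger share of training.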
+
+ ## Limitations and Biases
+
+ ### Known Limitations
+
+ - **Temporal lag**: Not suitable for real-time sentiment analysis; trained on historical data (2021-2025)
+ - **Context dependency**: Headlines and descriptions lack full article context, which may affect sentiment interpretation
+ - **Language coverage**: English only; does not support other languages
+ - **Sarcasm and irony**: May struggle with nuanced language common in cryptocurrency discourse (e.g., "HFSP" - Have Fun Staying Poor)
+ - **Evolving terminology**: Cryptocurrency memes and terminology evolve rapidly; may not capture newest slang
+ - **Market volatility**: Sentiment can change rapidly after news publication; static predictions may become outdated quickly
+
+ ### Potential Biases
+
+ The model may reflect biases present in the training data:
+
+ - **Platform bias**: Data from CryptoPanic users only; may not represent broader market sentiment
+ - **User bias**: Active crypto community members may have different perspectives than general investors
+ - **Temporal bias**: Training data spans 2021-2025, reflecting specific market conditions (bull markets, bear markets, crypto winters)
+ - **Source bias**: Certain news sources or cryptocurrencies may be over-represented in the training data
+ - **Geographic bias**: English-language news sources are over-represented
+ - **Market condition bias**: Dataset reflects specific market cycles that may not generalize to all conditions
+
+ ### Data Collection Biases
+
+ - **Vote manipulation**: Despite quality filters, coordinated voting on the source platform cannot be completely ruled out
+ - **Minimum vote threshold**: Filtering by median votes may exclude less popular but valid sentiment signals
+ - **Percentile-based labeling**: Classification boundaries (25th/75th percentiles) are somewhat arbitrary
+
+ ## How to Use
+
+ ### Basic Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "ExponentialScience/LedgerBERT-Market-Sentiment"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example texts
+ texts = [
+     "Bitcoin reaches new all-time high amid institutional adoption",
+     "SEC announces crackdown on cryptocurrency exchanges",
+     "Ethereum network upgrade proceeding as planned",
+ ]
+
+ # Classify sentiment
+ for text in texts:
+     inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+         predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+         predicted_class = predictions.argmax(dim=-1).item()
+
+     # Read the label name from the checkpoint's config instead of hard-coding the order.
+     # If the config defines no mapping, this yields generic names such as "LABEL_0".
+     sentiment = model.config.id2label[predicted_class]
+     confidence = predictions[0][predicted_class].item()
+
+     print(f"Text: {text}")
+     print(f"Sentiment: {sentiment} (confidence: {confidence:.3f})\n")
+ ```
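The softmax-and-argmax step in the snippet above can also be written without PyTorch, which may help when inspecting raw logits by hand. The label order below is a hypothetical mapping used only for illustration. In practice it should come from the checkpoint's `id2label` configuration.

```python
import math

# Hypothetical label order for illustration; read the real one from the model config
ID2LABEL = {0: "bearish", 1: "bullish", 2: "neutral"}


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits: list[float], id2label: dict[int, str] = ID2LABEL) -> tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits, not real model output
label, confidence = decode([0.2, 2.1, -0.7])
print(f"{label} (confidence: {confidence:.3f})")
```

The reported confidence is simply the largest softmax probability, so it reflects the model's relative preference among the three classes rather than a calibrated likelihood.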
+
+ ### Batch Processing
+
+ ```python
+ from transformers import pipeline
+
+ # Create sentiment analysis pipeline
+ classifier = pipeline(
+     "text-classification",
+     model="ExponentialScience/LedgerBERT-Market-Sentiment",
+     tokenizer="ExponentialScience/LedgerBERT-Market-Sentiment",
+ )
+
+ # Process multiple texts
+ texts = [
+     "DeFi protocol launches new staking mechanism",
+     "Major cryptocurrency exchange faces liquidity crisis",
+     "Blockchain adoption continues in enterprise sector",
+ ]
+
+ results = classifier(texts, truncation=True, max_length=512)
+
+ for text, result in zip(texts, results):
+     print(f"Text: {text}")
+     print(f"Sentiment: {result['label']} (score: {result['score']:.3f})\n")
+ ```
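Pipeline output is a list of dictionaries with `label` and `score` keys, so batch results can be summarized with the standard library. The sketch below uses hand-written stand-in results rather than real model output. The helper name `summarize`, the 0.6 confidence threshold, and the "uncertain" bucket are arbitrary choices for illustration, and the label strings depend on the checkpoint's configuration.

```python
from collections import Counter


def summarize(results: list[dict], min_score: float = 0.6) -> dict[str, int]:
    """Count predictions per label, grouping low-confidence ones as 'uncertain'."""
    counts: Counter = Counter()
    for result in results:
        label = result["label"] if result["score"] >= min_score else "uncertain"
        counts[label] += 1
    return dict(counts)


# Stand-in results shaped like text-classification pipeline output
example_results = [
    {"label": "bullish", "score": 0.91},
    {"label": "bearish", "score": 0.84},
    {"label": "neutral", "score": 0.55},
    {"label": "bullish", "score": 0.72},
]

summary = summarize(example_results)
print(summary)
```

Setting low-confidence predictions aside makes it easier to see how much of a batch the model is actually decisive about.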
+
+ ### Integration with News Feeds
+
+ ```python
+ import feedparser
+ from transformers import pipeline
+
+ # Initialize classifier
+ classifier = pipeline(
+     "text-classification",
+     model="ExponentialScience/LedgerBERT-Market-Sentiment",
+ )
+
+ # Example: Analyze cryptocurrency news feed
+ feed_url = "https://example-crypto-news.com/rss"
+ feed = feedparser.parse(feed_url)
+
+ for entry in feed.entries[:5]:  # Process first 5 entries
+     title = entry.title
+     result = classifier(title, truncation=True, max_length=512)[0]
+
+     print(f"Headline: {title}")
+     print(f"Market Sentiment: {result['label']} ({result['score']:.2%})")
+     print(f"Link: {entry.link}\n")
+ ```
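Feed titles can contain HTML tags, entities, or irregular whitespace. A small normalization step before classification keeps such residue out of the model input. The sketch below uses a hypothetical helper, `clean_title`, and whether this changes predictions for this model has not been measured.

```python
import html
import re

# Matches simple HTML tags such as <b> or </span>
_TAG = re.compile(r"<[^>]+>")


def clean_title(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace in a feed title."""
    no_tags = _TAG.sub("", raw)
    unescaped = html.unescape(no_tags)
    return " ".join(unescaped.split())


print(clean_title("  <b>Bitcoin</b> &amp; Ethereum   rally "))
```

In the loop above, this would be applied as `title = clean_title(entry.title)` before the title is passed to the classifier.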
+
+ ## Citation
+
+ If you use LedgerBERT-Market-Sentiment in your research, please cite:
+
+ ```bibtex
+ @article{hernandez2025dlt-corpus,
+   title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
+   author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
+   year={2025}
+ }
+ ```
+
+ ## Related Resources
+
+ - **Base Model (LedgerBERT)**: https://huggingface.co/ExponentialScience/LedgerBERT
+ - **Training Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
+ - **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
+
+ ### Additional Fine-tuned Models
+
+ LedgerBERT can also be fine-tuned for other sentiment dimensions available in the DLT-Sentiment-News dataset (https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News):
+ - **Content Characteristics** (liked, disliked, neutral)
+ - **Engagement Quality** (important, lol, neutral)
+
+ ## Model Card Contact
+
+ For questions or feedback about LedgerBERT-Market-Sentiment, please open an issue on the GitHub repository: https://github.com/dlt-science/DLT-Corpus
+
+ ---
+
+ **⚠️ Important Disclaimer:** This model is provided for research and educational purposes only. It should not be used as financial advice or as the sole basis for investment decisions. Cryptocurrency markets are highly volatile and unpredictable. Always conduct your own research and consult with qualified financial advisors before making investment decisions.