Improve model card attractiveness: richer tags and usage snippet
README.md (changed):
---
language: en
tags:
- finance
- quant-finance
- algo-trading
- trading
- crypto
- sentiment-analysis
- feature-engineering
- alpha-research
- machine-learning
- xgboost
- bert
- finbert
- embeddings
- preliminary-results
datasets:
- SahandNZ/cryptonews-articles-with-price-momentum-labels
base_model:
license: mit
---

# Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results

Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.

If you want to test whether text alpha adds signal over market/FnG-style numeric features, this repo gives you ready-to-use artifacts without rerunning expensive encoding.

Primary text encoder used in this release: `boltuix/bert-lite`.

## Data Source

This work uses the public Hugging Face dataset:

`SahandNZ/cryptonews-articles-with-price-momentum-labels`

Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.

## Three Benchmark Tracks

1. Numeric-only XGBoost baseline
2. Frozen CLS embedding + numeric XGBoost
3. Pure CLS linear baseline
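
Track 3 is a linear probe over the frozen embeddings. As a minimal sketch of what that baseline looks like (plain NumPy logistic regression, with synthetic clusters standing in for the real CLS arrays and labels):

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=300):
    """Logistic-regression probe trained with batch gradient descent."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        w -= lr * (X.T @ (p - y)) / len(y)      # log-loss gradient step
        b -= lr * float(np.mean(p - y))
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Synthetic stand-ins: two Gaussian clusters play the role of CLS embeddings.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1, 1, (200, 16)), rng.normal(1, 1, (200, 16))])
y = np.repeat([0, 1], 200)

w, b = train_linear_probe(X, y)
acc = float(np.mean(predict(X, w, b) == y))
```

With the repo's real artifacts you would pass the loaded `.npy` matrices and momentum labels instead of the synthetic data.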
|
## Why Download This

- Start modeling immediately with ready-made `.npy` embedding tensors.
- Reproduce a strong baseline quickly for crypto text+tabular fusion.
- Compare pure numeric alpha vs text-augmented alpha with a clean metric summary.
- Use as a drop-in feature pack for your own classifier/regressor experiments.

## What Is Inside

### 1) Results artifacts

Stored under `embeddings/`:

- `embeddings/embeddings_manifest.csv`

## Quick Start (Embedding Usage)

```python
import numpy as np
from xgboost import XGBClassifier

# 1) Load the precomputed text embeddings
#    (paths assume a local copy of this repo's `embeddings/` folder)
X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")

# 2) Load your numeric features, aligned to the same row order as the embeddings
# X_num_train, X_num_val = ...

# 3) Fuse text + numeric features
# X_train = np.concatenate([X_num_train, X_text_train], axis=1)
# X_val = np.concatenate([X_num_val, X_text_val], axis=1)

# 4) Train a downstream model
# clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_val)
```

Minimal pseudocode:

```text
load CLS embeddings
align with numeric feature rows
concatenate [numeric, CLS]
train XGBoost
compare vs numeric-only baseline
```
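
In NumPy terms, the align-and-fuse steps above look like this (all shapes are illustrative; the real CLS width depends on `boltuix/bert-lite` and is not stated here):

```python
import numpy as np

# Illustrative shapes: 1,000 articles, a hypothetical 128-dim CLS vector,
# and 12 numeric market/FnG features per row.
rng = np.random.default_rng(0)
X_text = rng.normal(size=(1000, 128))  # stands in for a loaded .npy embedding file
X_num = rng.normal(size=(1000, 12))    # your tabular features, same row order

# Alignment check before fusing: row i of both arrays must describe the same article.
assert X_text.shape[0] == X_num.shape[0], "row counts must match"

# Fuse: numeric block first, then the CLS block, as in the pseudocode above.
X_fused = np.concatenate([X_num, X_text], axis=1)
print(X_fused.shape)  # (1000, 140)
```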
|
## Main Test Metrics (from results summary)

- Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
- Delta test F1 (CLS+Numeric - Numeric-only): +0.0740
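
The delta row is just a difference of two F1 scores. A minimal macro-F1 helper, shown on synthetic labels rather than the benchmark data:

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

# Toy example: the same labels scored against two hypothetical models.
y_true = np.array([0, 0, 1, 1, 2, 2])
pred_fused = np.array([0, 0, 1, 1, 2, 0])    # e.g. "CLS + numeric"
pred_numeric = np.array([0, 1, 1, 0, 2, 0])  # e.g. "numeric only"
delta = macro_f1(y_true, pred_fused) - macro_f1(y_true, pred_numeric)
```

(The averaging convention behind the reported numbers is not stated in the summary; macro averaging is one common choice.)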
|
## Practical Notes

- Embeddings are frozen feature tensors (`.npy`), not fine-tuned checkpoints.
- Files are split by train/val/test for direct experimental use.
- Check `embeddings/embeddings_manifest.csv` to verify integrity.
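
The manifest check can be scripted. A sketch that assumes the manifest has `filename` and `sha256` columns (the column names are an assumption; inspect the CSV header first):

```python
import csv
import hashlib
from pathlib import Path

def sha256_of(path):
    """Stream a file through SHA-256 (constant memory, fine for large .npy files)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_against_manifest(manifest_path):
    """Return {filename: bool} for every row listed in the manifest."""
    results = {}
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            name = row["filename"]
            results[name] = Path(name).exists() and sha256_of(name) == row["sha256"].lower()
    return results
```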
|
## Notes