---
language: en
tags:
- finance
- quant-finance
- algo-trading
- trading
- crypto
- sentiment-analysis
- feature-engineering
- alpha-research
- machine-learning
- xgboost
- bert
- finbert
- embeddings
- preliminary-results
datasets:
- SahandNZ/cryptonews-articles-with-price-momentum-labels
base_model:
- boltuix/bert-lite
license: mit
---
# Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results
Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.
If you want to test whether text alpha adds signal over market and Fear & Greed (FnG) style numeric features, this repo gives you ready-to-use artifacts without rerunning expensive encoding.
Primary text encoder used in this release: `boltuix/bert-lite`.
## Data Source
This work uses the public Hugging Face dataset:
- https://huggingface.co/datasets/SahandNZ/cryptonews-articles-with-price-momentum-labels
Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.
## Three Benchmark Tracks
1. Numeric-only XGBoost baseline
2. Frozen CLS embedding + numeric XGBoost
3. Pure CLS linear baseline
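
Track 3 can be as simple as a linear classifier fit directly on the frozen CLS embedding matrix. Below is a minimal, hedged sketch: the random arrays stand in for the real `.npy` embedding files and momentum labels, and `LogisticRegression` is used as a representative linear model (the release does not specify the exact linear learner).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real CLS embedding tensors and momentum labels
X_train = rng.normal(size=(500, 128))   # (n_rows, embedding_dim)
y_train = rng.integers(0, 2, size=500)
X_test = rng.normal(size=(100, 128))
y_test = rng.integers(0, 2, size=100)

# Track 3: a linear classifier on frozen CLS embeddings, no numeric features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test F1:", f1_score(y_test, clf.predict(X_test)))
```

With the real artifacts, replace the synthetic arrays with `np.load(...)` calls on the train/test embedding files.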
## Why Download This
- Start modeling immediately with ready-made `.npy` embedding tensors.
- Reproduce a strong baseline quickly for crypto text+tabular fusion.
- Compare pure numeric alpha vs text-augmented alpha with a clean metric summary.
- Use as a drop-in feature pack for your own classifier/regressor experiments.
## What Is Inside
### 1) Results artifacts
Stored under `results/`:
- `results/metrics_xgb_cls_vs_numeric.json`
- `results/results_summary.csv`
### 2) Embedding artifacts
Stored under `embeddings/`:
- `embeddings/bertlite_full_fresh__train_cls_embeddings.npy`
- `embeddings/bertlite_full_fresh__val_cls_embeddings.npy`
- `embeddings/bertlite_full_fresh__test_cls_embeddings.npy`
- `embeddings/finbert_full_fresh__train_cls_embeddings.npy`
- `embeddings/finbert_full_fresh__val_cls_embeddings.npy`
- `embeddings/finbert_full_fresh__test_cls_embeddings.npy`
### 3) Integrity manifest
Stored under `embeddings/`:
- `embeddings/embeddings_manifest.csv`
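
One simple way to verify integrity is to hash each `.npy` file and compare against the manifest (the manifest's exact column names are not documented here, so inspect the CSV first). A self-contained sketch of the hashing step, demonstrated on a small synthetic array:

```python
import hashlib
from pathlib import Path

import numpy as np


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# Demo on a synthetic array saved to disk; with the real repo,
# point this at embeddings/*.npy and compare against the manifest.
demo = Path("demo_cls_embeddings.npy")
np.save(demo, np.zeros((4, 8), dtype=np.float32))
print(np.load(demo).shape, sha256_of(demo)[:12])
```

Checking the array shape after `np.load` is a cheap sanity check that each split's row count matches your numeric features.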
## Quick Start (Embedding Usage)
```python
import numpy as np
from xgboost import XGBClassifier

# 1) Load precomputed text embeddings (rows follow the dataset's split order)
X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")

# 2) Load your numeric features and labels, aligned to the same row order
# X_num_train, X_num_val = ...
# y_train, y_val = ...

# 3) Fuse text + numeric features
# X_train = np.concatenate([X_num_train, X_text_train], axis=1)
# X_val = np.concatenate([X_num_val, X_text_val], axis=1)

# 4) Train a downstream model and evaluate on the validation split
# clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_val)
```
Minimal pseudocode:
```text
load CLS embeddings
align with numeric feature rows
concatenate [numeric, CLS]
train XGBoost
compare vs numeric-only baseline
```
## Main Test Metrics (from results summary)
- Numeric-only XGB: accuracy 0.7098, F1 0.6638, precision 0.5406, recall 0.8597
- CLS+Numeric XGB: accuracy 0.7881, F1 0.7378, precision 0.6278, recall 0.8947
- Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
- Delta test F1 (CLS+Numeric - Numeric-only): +0.0740
## Practical Notes
- Embeddings are frozen feature tensors (`.npy`), not fine-tuned checkpoints.
- Files are split by train/val/test for direct experimental use.
- Check `embeddings/embeddings_manifest.csv` to verify integrity.
## Notes
- Preliminary release only; confidence intervals and repeated-seed aggregates are not included yet.
- Embedding files are derived feature tensors (`.npy`), not raw text records.
- D-1 lagging was used for market/FnG features to avoid look-ahead bias.
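
The D-1 lagging mentioned above can be sketched with a one-row `shift` on daily features; the column names below are hypothetical stand-ins for the actual market/FnG features.

```python
import pandas as pd

# Hypothetical daily market / Fear & Greed features indexed by date
df = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 105.0], "fng": [40, 55, 60, 30]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# D-1 lag: the features available when predicting day t are day t-1's values,
# which prevents same-day information from leaking into the prediction.
lagged = df.shift(1).add_suffix("_d1")
print(lagged)
```

The first row of the lagged frame is NaN by construction and should be dropped (or the first day excluded) before training.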
- Training and experiment scripts are intentionally not mirrored in this artifact-only repo.