Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results

Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.

If you want to test whether text alpha adds signal over market/Fear-and-Greed (FnG) style numeric features, this repo gives you ready-to-use artifacts without rerunning expensive encoding.

Primary text encoder used in this release: boltuix/bert-lite.

Data Source

This work is built on a public crypto-news dataset from the Hugging Face Hub, created by SahandNZ.

Please cite and credit the original dataset creator when reusing these artifacts.

Three Benchmark Tracks

  1. Numeric-only XGBoost baseline
  2. Frozen CLS embedding + numeric XGBoost
  3. Pure CLS linear baseline
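Track 3 fits a linear model on the frozen CLS embeddings alone. The repo does not ship training scripts, so the sketch below is one plausible implementation, assuming binary labels `y` and scikit-learn's logistic regression as the linear classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def cls_linear_baseline(X_train, y_train, X_val, y_val):
    """Track 3 sketch: linear classifier on frozen CLS embeddings only."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return f1_score(y_val, clf.predict(X_val))

# In practice X_train / X_val would come from the .npy files under
# embeddings/, e.g. bertlite_full_fresh__train_cls_embeddings.npy.
```

Tracks 1 and 2 follow the same pattern with XGBoost on the numeric and fused feature matrices, respectively.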

Why Download This

  • Start modeling immediately with ready-made .npy embedding tensors.
  • Reproduce a strong baseline quickly for crypto text+tabular fusion.
  • Compare pure numeric alpha vs text-augmented alpha with a clean metric summary.
  • Use as a drop-in feature pack for your own classifier/regressor experiments.

What Is Inside

1) Results artifacts

Stored under results/:

  • results/metrics_xgb_cls_vs_numeric.json
  • results/results_summary.csv

2) Embedding artifacts

Stored under embeddings/:

  • embeddings/bertlite_full_fresh__train_cls_embeddings.npy
  • embeddings/bertlite_full_fresh__val_cls_embeddings.npy
  • embeddings/bertlite_full_fresh__test_cls_embeddings.npy
  • embeddings/finbert_full_fresh__train_cls_embeddings.npy
  • embeddings/finbert_full_fresh__val_cls_embeddings.npy
  • embeddings/finbert_full_fresh__test_cls_embeddings.npy

3) Integrity manifest

Stored under embeddings/:

  • embeddings/embeddings_manifest.csv
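A typical use of the manifest is to recompute each file's hash after download and compare it against the recorded value. The column names below (`file`, `sha256`) are assumptions; check the CSV header for the actual schema:

```python
import csv
import hashlib
from pathlib import Path

def verify_manifest(manifest_path, file_col="file", hash_col="sha256"):
    """Return the names of listed files whose SHA-256 digest does not
    match the manifest. Column names are assumed, not confirmed."""
    mismatches = []
    manifest_path = Path(manifest_path)
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            path = manifest_path.parent / row[file_col]
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest != row[hash_col]:
                mismatches.append(row[file_col])
    return mismatches
```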

Quick Start (Embedding Usage)

import numpy as np
from xgboost import XGBClassifier

# 1) Load precomputed text embeddings
X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")
X_text_test = np.load("embeddings/bertlite_full_fresh__test_cls_embeddings.npy")

# 2) Load your numeric features aligned to the same row order
# (row counts must match: X_num_train.shape[0] == X_text_train.shape[0])
# X_num_train, X_num_val = ...

# 3) Fuse text + numeric features
# X_train = np.concatenate([X_num_train, X_text_train], axis=1)
# X_val = np.concatenate([X_num_val, X_text_val], axis=1)

# 4) Train a downstream model
# clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_val)

Minimal pseudocode:

load CLS embeddings
align with numeric feature rows
concatenate [numeric, CLS]
train XGBoost
compare vs numeric-only baseline

Main Test Metrics (from results summary)

  • Numeric-only XGB: accuracy 0.7098, F1 0.6638, precision 0.5406, recall 0.8597
  • CLS+Numeric XGB: accuracy 0.7881, F1 0.7378, precision 0.6278, recall 0.8947
  • Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
  • Delta test F1 (CLS+Numeric - Numeric-only): +0.0740
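The reported triples are internally consistent with the standard identity F1 = 2PR/(P + R), which is a quick way to sanity-check a summary table:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) from the summary above
rows = {
    "numeric_only": (0.5406, 0.8597, 0.6638),
    "cls_numeric":  (0.6278, 0.8947, 0.7378),
    "pure_cls":     (0.2866, 0.2854, 0.2860),
}
for name, (p, r, reported) in rows.items():
    assert abs(f1(p, r) - reported) < 1e-3, name

# Delta test F1 matches the reported +0.0740
assert abs(rows["cls_numeric"][2] - rows["numeric_only"][2] - 0.0740) < 1e-6
```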

Practical Notes

  • Embeddings are frozen feature tensors (.npy), not fine-tuned checkpoints.
  • Files are split by train/val/test for direct experimental use.
  • Check embeddings/embeddings_manifest.csv to verify integrity.

Notes

  • Preliminary release only; confidence intervals and repeated-seed aggregates are not included yet.
  • Embedding files are derived feature tensors (.npy), not raw text records.
  • D-1 lagging was used for market/FnG features to avoid look-ahead bias.
  • Training and experiment scripts are intentionally not mirrored in this artifact-only repo.
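The D-1 lagging mentioned above can be reproduced with a one-day shift of the numeric columns, so the features visible on day t are the values observed on day t-1. The column names below are illustrative, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical daily frame with market/FnG-style numeric columns
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "fng": [25, 30, 40, 55, 60],
    "ret": [0.01, -0.02, 0.03, 0.00, 0.01],
})

# D-1 lag: shift features forward one day so the model never sees
# same-day information (no look-ahead bias)
df[["fng_lag1", "ret_lag1"]] = df[["fng", "ret"]].shift(1)
df = df.dropna()  # the first day has no prior-day value
```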