| --- |
| language: en |
| tags: |
| - finance |
| - quant-finance |
| - algo-trading |
| - trading |
| - crypto |
| - sentiment-analysis |
| - feature-engineering |
| - alpha-research |
| - machine-learning |
| - xgboost |
| - bert |
| - finbert |
| - embeddings |
| - preliminary-results |
| datasets: |
| - SahandNZ/cryptonews-articles-with-price-momentum-labels |
| base_model: |
| - boltuix/bert-lite |
| license: mit |
| --- |
| |
| # Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results |
|
|
| Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation. |
|
|
| If you want to test whether news text adds predictive signal on top of numeric market and FnG (Fear and Greed index) features, this repo gives you ready-to-use artifacts so you do not have to rerun the expensive encoding step.
|
|
| Primary text encoder used in this release: `boltuix/bert-lite`. FinBERT CLS embeddings for the same splits are also included under `embeddings/`.
|
|
| ## Data Source |
|
|
| This work uses the public Hugging Face dataset: |
|
|
| - https://huggingface.co/datasets/SahandNZ/cryptonews-articles-with-price-momentum-labels |
|
|
| Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts. |
|
|
| ## Three Benchmark Tracks |
|
|
| 1. Numeric-only XGBoost baseline |
| 2. Frozen CLS embedding + numeric XGBoost |
| 3. Pure CLS linear baseline |
|
|
| ## Why Download This |
|
|
| - Start modeling immediately with ready-made `.npy` embedding tensors. |
| - Reproduce a strong baseline quickly for crypto text+tabular fusion. |
| - Compare pure numeric alpha vs text-augmented alpha with a clean metric summary. |
| - Use as a drop-in feature pack for your own classifier/regressor experiments. |
|
|
| ## What Is Inside |
|
|
| ### 1) Results artifacts |
| Stored under `results/`: |
|
|
| - `results/metrics_xgb_cls_vs_numeric.json` |
| - `results/results_summary.csv` |
|
|
| ### 2) Embedding artifacts |
| Stored under `embeddings/`: |
|
|
| - `embeddings/bertlite_full_fresh__train_cls_embeddings.npy` |
| - `embeddings/bertlite_full_fresh__val_cls_embeddings.npy` |
| - `embeddings/bertlite_full_fresh__test_cls_embeddings.npy` |
| - `embeddings/finbert_full_fresh__train_cls_embeddings.npy` |
| - `embeddings/finbert_full_fresh__val_cls_embeddings.npy` |
| - `embeddings/finbert_full_fresh__test_cls_embeddings.npy` |
|
|
| ### 3) Integrity manifest |
| Stored under `embeddings/`: |
|
|
| - `embeddings/embeddings_manifest.csv` |
|
|
| ## Quick Start (Embedding Usage) |
|
|
| ```python |
| import numpy as np |
| from xgboost import XGBClassifier |
| |
| # 1) Load precomputed text embeddings |
| X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy") |
| X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy") |
| |
| # 2) Load your numeric features and labels, aligned to the same row order |
| # X_num_train, X_num_val = ... |
| # y_train, y_val = ... |
| |
| # 3) Fuse text + numeric features |
| # X_train = np.concatenate([X_num_train, X_text_train], axis=1) |
| # X_val = np.concatenate([X_num_val, X_text_val], axis=1) |
| |
| # 4) Train a downstream model |
| # clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1) |
| # clf.fit(X_train, y_train) |
| # y_pred = clf.predict(X_val) |
| ``` |
|
|
| Minimal pseudocode: |
|
|
| ```text |
| load CLS embeddings |
| align with numeric feature rows |
| concatenate [numeric, CLS] |
| train XGBoost |
| compare vs numeric-only baseline |
| ``` |
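The alignment and concatenation steps above can be sketched as follows. This is a minimal illustration, not the training code behind the reported results: `fuse_features` is a hypothetical helper, and the random arrays are stand-ins for your own numeric features and the loaded `.npy` embeddings (the 8-column embedding width is a placeholder, not the real encoder size).

```python
import numpy as np


def fuse_features(X_num: np.ndarray, X_text: np.ndarray) -> np.ndarray:
    """Concatenate numeric and text features column-wise after a row-count check."""
    if X_num.ndim != 2 or X_text.ndim != 2:
        raise ValueError("both inputs must be 2-D (rows x features)")
    if X_num.shape[0] != X_text.shape[0]:
        raise ValueError(
            f"row mismatch: {X_num.shape[0]} numeric rows vs "
            f"{X_text.shape[0]} embedding rows"
        )
    # Numeric columns first, CLS columns after, as in the Quick Start.
    return np.concatenate([X_num, X_text], axis=1)


# Stand-in arrays; replace with your numeric features and np.load(...) output.
rng = np.random.default_rng(0)
X_num_demo = rng.normal(size=(5, 3))   # 5 rows, 3 numeric features (placeholder)
X_text_demo = rng.normal(size=(5, 8))  # 5 rows, 8 embedding columns (placeholder)
X_fused_demo = fuse_features(X_num_demo, X_text_demo)
print(X_fused_demo.shape)  # (5, 11)
```

A matching row count is necessary but not sufficient: it does not prove that row `i` of the numeric matrix and row `i` of the embedding matrix describe the same article, so keep the split ordering of the source dataset intact when building numeric features.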
|
|
| ## Main Test Metrics (from results summary) |
|
|
| - Numeric-only XGB: accuracy 0.7098, F1 0.6638, precision 0.5406, recall 0.8597 |
| - CLS+Numeric XGB: accuracy 0.7881, F1 0.7378, precision 0.6278, recall 0.8947 |
| - Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854 |
| - Delta test F1 (CLS+Numeric - Numeric-only): +0.0740 |
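As a quick sanity check, the F1 values above can be recomputed from the listed precision and recall. The sketch below only uses the numbers printed in this card (it does not read `results/results_summary.csv`), and because those inputs are rounded to four decimals the recomputed F1 may differ from the reported one in the last digit.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metric list above.
reported = {
    "numeric_only_xgb": (0.5406, 0.8597, 0.6638),
    "cls_plus_numeric_xgb": (0.6278, 0.8947, 0.7378),
    "pure_cls_linear": (0.2866, 0.2854, 0.2860),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_from_pr(p, r)
    print(f"{name}: reported={f1:.4f} recomputed={recomputed:.4f}")

# F1 gain of the fused model over the numeric-only baseline.
delta_f1 = reported["cls_plus_numeric_xgb"][2] - reported["numeric_only_xgb"][2]
print(f"delta F1: {delta_f1:+.4f}")
```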
|
|
| ## Practical Notes |
|
|
| - Embeddings are frozen feature tensors (`.npy`), not fine-tuned checkpoints. |
| - Files are split by train/val/test for direct experimental use. |
| - Use `embeddings/embeddings_manifest.csv` to verify the integrity of the embedding files after download.
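One way to run such an integrity check is sketched below. The manifest schema is not documented in this card, so the column names `file` and `sha256` are assumptions, as are the helper names `sha256_of_file` and `verify_manifest`. Open the CSV first and adjust the column names (or the hash algorithm) to whatever the manifest actually contains.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large .npy files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, root: Path,
                    file_col: str = "file", hash_col: str = "sha256") -> dict:
    """Return {path_from_manifest: True/False} for every manifest row.

    file_col and hash_col are ASSUMED column names, not the documented schema.
    A missing file counts as a failed check.
    """
    results = {}
    with open(manifest_path, newline="") as fh:
        for row in csv.DictReader(fh):
            target = root / row[file_col]
            expected = row[hash_col].strip().lower()
            results[row[file_col]] = (
                target.is_file() and sha256_of_file(target) == expected
            )
    return results
```

A call might look like `verify_manifest(Path("embeddings/embeddings_manifest.csv"), Path("."))`, assuming the manifest stores paths relative to the repo root. Any entry mapped to `False` should be downloaded again.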
|
|
| ## Notes |
|
|
| - Preliminary release only; confidence intervals and repeated-seed aggregates are not included yet. |
| - Embedding files are derived feature tensors (`.npy`), not raw text records. |
| - Market and FnG features were lagged by one day (D-1), so each row uses only the previous day's values; this avoids look-ahead bias.
| - Training and experiment scripts are intentionally not mirrored in this artifact-only repo. |
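The D-1 lagging mentioned in the notes can be illustrated with a small sketch. The original feature pipeline is not part of this repo, so `lag_features_d1`, the `fng` and `close` field names, and the demo values are all hypothetical; the point is only that an article dated D is paired with values recorded for D-1.

```python
from datetime import date, timedelta


def lag_features_d1(daily_features: dict, article_dates: list) -> list:
    """For each article date D, return the feature row recorded for D-1.

    Returns None where the previous day has no record, so gaps stay visible
    instead of being filled with same-day (look-ahead) values.
    """
    one_day = timedelta(days=1)
    return [daily_features.get(d - one_day) for d in article_dates]


# Hypothetical daily values keyed by calendar date (not real market data).
daily_demo = {
    date(2024, 1, 1): {"fng": 40, "close": 100.0},
    date(2024, 1, 2): {"fng": 55, "close": 104.0},
}
article_dates_demo = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 1)]
lagged_demo = lag_features_d1(daily_demo, article_dates_demo)

for d, feats in zip(article_dates_demo, lagged_demo):
    print(d.isoformat(), "->", feats)
```

In a pandas workflow, the same effect is usually obtained by calling `shift(1)` on a frame with one row per calendar day before joining it to the article rows.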
|
|