README.md · JamieYuu/slm-bert-emb at main

	---
	language: en
	tags:
	- finance
	- quant-finance
	- algo-trading
	- trading
	- crypto
	- sentiment-analysis
	- feature-engineering
	- alpha-research
	- machine-learning
	- xgboost
	- bert
	- finbert
	- embeddings
	- preliminary-results
	datasets:
	- SahandNZ/cryptonews-articles-with-price-momentum-labels
	base_model:
	- boltuix/bert-lite
	license: mit
	---

	# Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results

	Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.

	If you want to test whether news-text features add predictive signal on top of market and Fear & Greed (FnG) style numeric features, this repo provides ready-to-use artifacts so you can skip the costly text-encoding step.

	Primary text encoder used in this release: `boltuix/bert-lite`. FinBERT CLS embeddings are also included under `embeddings/`.

	## Data Source

	This work uses the public Hugging Face dataset:

	- https://huggingface.co/datasets/SahandNZ/cryptonews-articles-with-price-momentum-labels

	Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.

	## Three Benchmark Tracks

	1. Numeric-only XGBoost baseline
	2. Frozen CLS embedding + numeric XGBoost
	3. Pure CLS linear baseline

	## Why Download This

	- Start modeling immediately with ready-made `.npy` embedding tensors.
	- Quickly reproduce a baseline for crypto text+tabular fusion (preliminary metrics are listed under Main Test Metrics).
	- Compare pure numeric alpha vs text-augmented alpha with a clean metric summary.
	- Use as a drop-in feature pack for your own classifier/regressor experiments.

	## What Is Inside

	### 1) Results artifacts
	Stored under `results/`:

	- `results/metrics_xgb_cls_vs_numeric.json`
	- `results/results_summary.csv`

	### 2) Embedding artifacts
	Stored under `embeddings/`:

	- `embeddings/bertlite_full_fresh__train_cls_embeddings.npy`
	- `embeddings/bertlite_full_fresh__val_cls_embeddings.npy`
	- `embeddings/bertlite_full_fresh__test_cls_embeddings.npy`
	- `embeddings/finbert_full_fresh__train_cls_embeddings.npy`
	- `embeddings/finbert_full_fresh__val_cls_embeddings.npy`
	- `embeddings/finbert_full_fresh__test_cls_embeddings.npy`
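
Before modeling, it can help to confirm that the three splits of one encoder share an embedding width. The following is a minimal sketch, assuming the files sit in a local `embeddings/` directory under the names listed above and that each file holds a 2-D array of shape `(n_rows, dim)`. The helper name `split_shapes` is hypothetical and not part of this repo.

```python
from pathlib import Path

import numpy as np


def split_shapes(root, encoder="bertlite"):
    """Return {split: shape} for one encoder's train/val/test CLS embeddings."""
    shapes = {}
    for split in ("train", "val", "test"):
        path = Path(root) / f"{encoder}_full_fresh__{split}_cls_embeddings.npy"
        # mmap_mode="r" maps the file lazily, so a large array is not
        # pulled fully into memory just to read its shape.
        arr = np.load(path, mmap_mode="r")
        shapes[split] = arr.shape
    # Assumption: every split is 2-D (rows, embedding_dim) with one shared width.
    widths = {shape[-1] for shape in shapes.values()}
    if len(widths) != 1:
        raise ValueError(f"embedding width differs across splits: {shapes}")
    return shapes
```

Calling `split_shapes("embeddings", encoder="finbert")` would run the same check on the FinBERT files.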

	### 3) Integrity manifest
	Stored under `embeddings/`:

	- `embeddings/embeddings_manifest.csv`
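
The manifest's schema is not documented in this card, so the sketch below is only one way an integrity check could look. It assumes the CSV has one column with a relative file path and one with a SHA-256 hex digest. The column names `file` and `sha256`, and the helper names, are assumptions; open the CSV first and adjust them to its actual header.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path, root=".", path_col="file", hash_col="sha256"):
    """Return {relative_path: True/False} for every row in the manifest.

    path_col and hash_col are ASSUMED column names, not confirmed by the repo.
    root is the directory the manifest's paths are relative to.
    """
    results = {}
    with open(manifest_path, newline="") as fh:
        for row in csv.DictReader(fh):
            rel = row[path_col]
            results[rel] = sha256_of(Path(root) / rel) == row[hash_col].lower()
    return results
```

Any entry mapped to `False` would indicate a file whose bytes differ from what the manifest recorded, for example after a truncated download.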

	## Quick Start (Embedding Usage)

	```python
	import numpy as np
	from xgboost import XGBClassifier

	# 1) Load precomputed text embeddings
	X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
	X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")

	# 2) Load your own numeric features and labels, aligned to the same row order
	#    as the embeddings (they are not shipped in this repo)
	# X_num_train, X_num_val = ...
	# y_train, y_val = ...

	# 3) Fuse text + numeric features
	# X_train = np.concatenate([X_num_train, X_text_train], axis=1)
	# X_val = np.concatenate([X_num_val, X_text_val], axis=1)

	# 4) Train a downstream model
	# clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
	# clf.fit(X_train, y_train)
	# y_pred = clf.predict(X_val)
	```

	Minimal pseudocode:

	```text
	load CLS embeddings
	align with numeric feature rows
	concatenate [numeric, CLS]
	train XGBoost
	compare vs numeric-only baseline
	```
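
The "align, then concatenate" steps above can be sketched as a small helper that also records which columns belong to which block. This is an illustrative sketch, not the author's pipeline: `fuse_features` is a hypothetical name, and any arrays you pass in are your own.

```python
import numpy as np


def fuse_features(X_num, X_text):
    """Concatenate [numeric, CLS] column-wise and report where each block sits."""
    X_num = np.asarray(X_num, dtype=np.float32)
    X_text = np.asarray(X_text, dtype=np.float32)
    if X_num.ndim != 2 or X_text.ndim != 2:
        raise ValueError("both inputs must be 2-D (rows, features)")
    if X_num.shape[0] != X_text.shape[0]:
        # Equal row counts are necessary but not sufficient: the caller must
        # still guarantee that row i refers to the same article in both arrays.
        raise ValueError(
            f"row mismatch: {X_num.shape[0]} numeric vs {X_text.shape[0]} text"
        )
    fused = np.concatenate([X_num, X_text], axis=1)
    blocks = {
        "numeric": slice(0, X_num.shape[1]),
        "cls": slice(X_num.shape[1], fused.shape[1]),
    }
    return fused, blocks
```

Training the numeric-only baseline on `fused[:, blocks["numeric"]]` and the fused model on `fused` keeps both tracks on identical rows, which makes the comparison easier to trust.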

	## Main Test Metrics (from results summary)

	- Numeric-only XGB: accuracy 0.7098, F1 0.6638, precision 0.5406, recall 0.8597
	- CLS+Numeric XGB: accuracy 0.7881, F1 0.7378, precision 0.6278, recall 0.8947
	- Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
	- Delta test F1 (CLS+Numeric - Numeric-only): +0.0740
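
As a sanity check, the F1 values above can be recomputed from the listed precision and recall with `F1 = 2 * P * R / (P + R)`. The snippet below only re-derives numbers already shown in this section. Because the inputs are rounded to four decimals, the recomputed values may differ from the reported ones in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the list above.
reported = {
    "numeric_only_xgb": (0.5406, 0.8597, 0.6638),
    "cls_plus_numeric_xgb": (0.6278, 0.8947, 0.7378),
    "pure_cls_linear": (0.2866, 0.2854, 0.2860),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(p, r):.4f}, reported = {f1:.4f}")

# Difference between the two XGBoost tracks, using the reported F1 values.
delta = reported["cls_plus_numeric_xgb"][2] - reported["numeric_only_xgb"][2]
print(f"delta F1 = {delta:+.4f}")
```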

	## Practical Notes

	- Embeddings are frozen feature tensors (`.npy`), not fine-tuned checkpoints.
	- Files are split by train/val/test for direct experimental use.
	- Check `embeddings/embeddings_manifest.csv` to verify integrity.

	## Notes

	- Preliminary release only; confidence intervals and repeated-seed aggregates are not included yet.
	- Embedding files are derived feature tensors (`.npy`), not raw text records.
	- Market/FnG features were lagged by one day (D-1), so each row uses only values available before the article's date; this avoids look-ahead bias.
	- Training and experiment scripts are intentionally not mirrored in this artifact-only repo.
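
Since the training scripts are not included, the snippet below is only an illustration of what D-1 lagging means, not the code used for this release. It assumes one feature record per calendar day. The feature names `close` and `fng` and the helper `lag_one_day` are hypothetical.

```python
from datetime import date, timedelta


def lag_one_day(daily_features):
    """Map each date D to the features observed on D-1.

    daily_features: {date: {feature_name: value}}, one entry per calendar day.
    Dates whose previous day is missing are dropped rather than filled.
    """
    lagged = {}
    for day in daily_features:
        previous = day - timedelta(days=1)
        if previous in daily_features:
            lagged[day] = dict(daily_features[previous])
    return lagged


# Hypothetical daily values, for illustration only.
daily = {
    date(2024, 1, 1): {"close": 100.0, "fng": 40},
    date(2024, 1, 2): {"close": 105.0, "fng": 55},
    date(2024, 1, 3): {"close": 103.0, "fng": 50},
}
lagged = lag_one_day(daily)

# An article dated 2024-01-02 is paired with 2024-01-01 values only.
print(lagged[date(2024, 1, 2)])
```

The first date has no predecessor, so it is absent from `lagged`. With a gap-free, date-sorted pandas DataFrame, `DataFrame.shift(1)` is a common way to get the same one-row lag.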