JamieYuu committed on
Commit 2453fe8 · verified · 1 Parent(s): 698795d

Improve model card attractiveness: richer tags and usage snippet

Files changed (1)
  1. README.md +47 -71
README.md CHANGED
@@ -1,18 +1,20 @@
  ---
  language: en
- pipeline_tag: feature-extraction
  tags:
  - finance
  - xgboost
  - bert
  - finbert
  - embeddings
  - preliminary-results
- - cryptocurrency
- - news
- - sentiment-analysis
- - classification
- - tabular
  datasets:
  - SahandNZ/cryptonews-articles-with-price-momentum-labels
  base_model:
@@ -20,20 +22,15 @@ base_model:
  license: mit
  ---

- # Crypto News Momentum Embeddings (BERT-Lite) + Benchmark Results
-
- Lightweight, ready-to-use CLS embeddings for crypto news direction modeling, paired with reproducible benchmark metrics.
-
- If you want fast experiments without rerunning transformer inference, this repo gives you downloadable `.npy` features and baseline scores out of the box.

- This single Hugging Face repo contains both:

- - Preliminary evaluation results for numeric-only vs CLS+numeric models
- - Cached CLS embedding artifacts used in those experiments

- Primary text encoder/base model used for this release: `boltuix/bert-lite`.

- ## Source Dataset Declaration

  This work uses the public Hugging Face dataset:

@@ -41,19 +38,19 @@ This work uses the public Hugging Face dataset:

  Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.

- ## Why This Repo Is Useful
-
- - Fast start: skip expensive embedding generation and jump directly into model development.
- - Practical benchmark: direct comparison of numeric-only, CLS+numeric, and pure-CLS baselines.
- - Easy integration: embeddings are standard NumPy arrays suitable for scikit-learn/XGBoost/PyTorch pipelines.
- - Reproducibility: file checksums are provided in `embeddings/embeddings_manifest.csv`.
-
- ## Three Workstreams Covered

  1. Numeric-only XGBoost baseline
  2. Frozen CLS embedding + numeric XGBoost
  3. Pure CLS linear baseline

  ## What Is Inside

  ### 1) Results artifacts
@@ -77,57 +74,37 @@ Stored under `embeddings/`:

  - `embeddings/embeddings_manifest.csv`

- ## Quick Start: Download and Load Embeddings

  ```python
- from huggingface_hub import hf_hub_download
  import numpy as np

- REPO_ID = "JamieYuu/slm-bert-emb"
-
- train_path = hf_hub_download(
-     repo_id=REPO_ID,
-     filename="embeddings/bertlite_full_fresh__train_cls_embeddings.npy",
-     repo_type="model",
- )
- val_path = hf_hub_download(
-     repo_id=REPO_ID,
-     filename="embeddings/bertlite_full_fresh__val_cls_embeddings.npy",
-     repo_type="model",
- )
- test_path = hf_hub_download(
-     repo_id=REPO_ID,
-     filename="embeddings/bertlite_full_fresh__test_cls_embeddings.npy",
-     repo_type="model",
- )
-
- X_train_cls = np.load(train_path)
- X_val_cls = np.load(val_path)
- X_test_cls = np.load(test_path)
-
- print(X_train_cls.shape, X_val_cls.shape, X_test_cls.shape)
- ```
-
- ## Pseudocode: Use Embeddings in a Hybrid Classifier

- ```python
- # Inputs:
- #  - X_*_numeric: lagged market/FnG tabular features
- #  - X_*_cls: CLS embeddings from this repo
- #  - y_*: labels

- X_train = concat([X_train_numeric, X_train_cls], axis=1)
- X_val = concat([X_val_numeric, X_val_cls], axis=1)
- X_test = concat([X_test_numeric, X_test_cls], axis=1)

- model = XGBoostClassifier(tuned_hyperparameters)
- model.fit(X_train, y_train)

- val_pred = model.predict(X_val)
- test_pred = model.predict(X_test)

- report_metrics(y_val, val_pred)
- report_metrics(y_test, test_pred)
  ```

  ## Main Test Metrics (from results summary)
@@ -137,12 +114,11 @@ report_metrics(y_test, test_pred)
  - Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
  - Delta test F1 (CLS+Numeric - Numeric-only): +0.0740

- ## Suggested Citation
-
- If you use these artifacts, please cite:

- - This repository (`JamieYuu/slm-bert-emb`)
- - The source dataset (`SahandNZ/cryptonews-articles-with-price-momentum-labels`)

  ## Notes
 
  ---
  language: en
  tags:
  - finance
+ - quant-finance
+ - algo-trading
+ - trading
+ - crypto
+ - sentiment-analysis
+ - feature-engineering
+ - alpha-research
+ - machine-learning
  - xgboost
  - bert
  - finbert
  - embeddings
  - preliminary-results
  datasets:
  - SahandNZ/cryptonews-articles-with-price-momentum-labels
  base_model:

  license: mit
  ---

+ # Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results

+ Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.

+ If you want to test whether text alpha adds signal over market/FnG-style numeric features, this repo gives you ready-to-use artifacts without rerunning expensive encoding.

+ Primary text encoder used in this release: `boltuix/bert-lite`.

+ ## Data Source

  This work uses the public Hugging Face dataset:

  Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.

+ ## Three Benchmark Tracks

  1. Numeric-only XGBoost baseline
  2. Frozen CLS embedding + numeric XGBoost
  3. Pure CLS linear baseline

+ ## Why Download This
+
+ - Start modeling immediately with ready-made `.npy` embedding tensors.
+ - Reproduce a strong baseline quickly for crypto text+tabular fusion.
+ - Compare pure numeric alpha vs text-augmented alpha with a clean metric summary.
+ - Use as a drop-in feature pack for your own classifier/regressor experiments.
+
  ## What Is Inside

  ### 1) Results artifacts

  - `embeddings/embeddings_manifest.csv`

+ ## Quick Start (Embedding Usage)

  ```python
  import numpy as np
+ from xgboost import XGBClassifier

+ # 1) Load precomputed text embeddings
+ X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
+ X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")

+ # 2) Load your numeric features aligned to the same row order
+ # X_num_train, X_num_val = ...

+ # 3) Fuse text + numeric features
+ # X_train = np.concatenate([X_num_train, X_text_train], axis=1)
+ # X_val = np.concatenate([X_num_val, X_text_val], axis=1)

+ # 4) Train a downstream model
+ # clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
+ # clf.fit(X_train, y_train)
+ # y_pred = clf.predict(X_val)
+ ```

+ Minimal pseudocode:

+ ```text
+ load CLS embeddings
+ align with numeric feature rows
+ concatenate [numeric, CLS]
+ train XGBoost
+ compare vs numeric-only baseline
+ ```

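The fusion step in the pseudocode can be sketched in plain Python. This is a toy illustration only: the small lists stand in for the real `.npy` arrays, and a real pipeline would use `np.concatenate` on the loaded embeddings as in the quick-start snippet.

```python
# Toy stand-ins for the real feature arrays: 3 aligned rows of
# numeric features and 3 rows of (tiny) CLS embeddings.
X_num = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
X_cls = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

# Fusing only makes sense if row i of both arrays describes the
# same article/timestamp, so check the alignment first.
assert len(X_num) == len(X_cls)

# Row-wise concatenation: [numeric, CLS] per row.
X_fused = [num + cls for num, cls in zip(X_num, X_cls)]
```

Each fused row then carries `len(numeric) + len(CLS)` features, and a model trained on the fused rows can be scored against the numeric-only baseline.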

  ## Main Test Metrics (from results summary)

  - Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
  - Delta test F1 (CLS+Numeric - Numeric-only): +0.0740
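Metrics of this shape can be recomputed from raw predictions with a few lines of Python. A sketch only: `macro_f1` is a hypothetical helper, and macro averaging is an assumption here, since the results summary does not state the exact averaging scheme used.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the label set."""
    labels = sorted(set(y_true) | set(y_pred))
    precs, recs, f1s = [], [], []
    for c in labels:
        # Per-class confusion counts.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy usage on made-up binary labels (not the repo's data):
p, r, f = macro_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```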

+ ## Practical Notes
+
+ - Embeddings are frozen feature tensors (`.npy`), not fine-tuned checkpoints.
+ - Files are split by train/val/test for direct experimental use.
+ - Check `embeddings/embeddings_manifest.csv` to verify integrity.

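The manifest-based integrity check could look roughly like this. The `filename` and `sha256` column names are an assumption (check the actual header of `embeddings_manifest.csv`), and the demo below hashes a throwaway file rather than the real `.npy` artifacts.

```python
import csv
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    # Stream the file through SHA-256 so large .npy files fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path, root):
    # Return the filenames whose on-disk hash differs from the manifest.
    # Assumes 'filename' and 'sha256' columns in the manifest CSV.
    bad = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            if sha256_of(Path(root) / row["filename"]) != row["sha256"]:
                bad.append(row["filename"])
    return bad

# Demo with a throwaway file standing in for a real embedding artifact.
root = Path(tempfile.mkdtemp())
(root / "a.npy").write_bytes(b"fake-embedding-bytes")
with open(root / "manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "sha256"])
    writer.writerow(["a.npy", sha256_of(root / "a.npy")])

mismatches = verify_manifest(root / "manifest.csv", root)  # empty if all files match
```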
  ## Notes