JamieYuu committed on
Commit 2453fe8 · verified · 1 Parent(s): 698795d

Improve model card attractiveness: richer tags and usage snippet

Files changed (1)
  1. README.md +47 -71
README.md CHANGED
@@ -1,18 +1,20 @@
  ---
  language: en
- pipeline_tag: feature-extraction
  tags:
  - finance
  - xgboost
  - bert
  - finbert
  - embeddings
  - preliminary-results
- - cryptocurrency
- - news
- - sentiment-analysis
- - classification
- - tabular
  datasets:
  - SahandNZ/cryptonews-articles-with-price-momentum-labels
  base_model:
@@ -20,20 +22,15 @@ base_model:
  license: mit
  ---

- # Crypto News Momentum Embeddings (BERT-Lite) + Benchmark Results
-
- Lightweight, ready-to-use CLS embeddings for crypto news direction modeling, paired with reproducible benchmark metrics.
-
- If you want fast experiments without rerunning transformer inference, this repo gives you downloadable `.npy` features and baseline scores out of the box.

- This single Hugging Face repo contains both:

- - Preliminary evaluation results for numeric-only vs CLS+numeric models
- - Cached CLS embedding artifacts used in those experiments

- Primary text encoder/base model used for this release: `boltuix/bert-lite`.

- ## Source Dataset Declaration

  This work uses the public Hugging Face dataset:

@@ -41,19 +38,19 @@ This work uses the public Hugging Face dataset:

  Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.

- ## Why This Repo Is Useful
-
- - Fast start: skip expensive embedding generation and jump directly into model development.
- - Practical benchmark: direct comparison of numeric-only, CLS+numeric, and pure-CLS baselines.
- - Easy integration: embeddings are standard NumPy arrays suitable for scikit-learn/XGBoost/PyTorch pipelines.
- - Reproducibility: file checksums are provided in `embeddings/embeddings_manifest.csv`.
-
- ## Three Workstreams Covered

  1. Numeric-only XGBoost baseline
  2. Frozen CLS embedding + numeric XGBoost
  3. Pure CLS linear baseline

  ## What Is Inside

  ### 1) Results artifacts
@@ -77,57 +74,37 @@ Stored under `embeddings/`:

  - `embeddings/embeddings_manifest.csv`

- ## Quick Start: Download and Load Embeddings

  ```python
- from huggingface_hub import hf_hub_download
  import numpy as np

- REPO_ID = "JamieYuu/slm-bert-emb"
-
- train_path = hf_hub_download(
-     repo_id=REPO_ID,
-     filename="embeddings/bertlite_full_fresh__train_cls_embeddings.npy",
-     repo_type="model",
- )
- val_path = hf_hub_download(
-     repo_id=REPO_ID,
-     filename="embeddings/bertlite_full_fresh__val_cls_embeddings.npy",
-     repo_type="model",
- )
- test_path = hf_hub_download(
-     repo_id=REPO_ID,
-     filename="embeddings/bertlite_full_fresh__test_cls_embeddings.npy",
-     repo_type="model",
- )
-
- X_train_cls = np.load(train_path)
- X_val_cls = np.load(val_path)
- X_test_cls = np.load(test_path)
-
- print(X_train_cls.shape, X_val_cls.shape, X_test_cls.shape)
- ```
-
- ## Pseudocode: Use Embeddings in a Hybrid Classifier

- ```python
- # Inputs:
- #  - X_*_numeric: lagged market/FnG tabular features
- #  - X_*_cls: CLS embeddings from this repo
- #  - y_*: labels

- X_train = concat([X_train_numeric, X_train_cls], axis=1)
- X_val = concat([X_val_numeric, X_val_cls], axis=1)
- X_test = concat([X_test_numeric, X_test_cls], axis=1)

- model = XGBoostClassifier(tuned_hyperparameters)
- model.fit(X_train, y_train)

- val_pred = model.predict(X_val)
- test_pred = model.predict(X_test)

- report_metrics(y_val, val_pred)
- report_metrics(y_test, test_pred)
  ```

  ## Main Test Metrics (from results summary)
@@ -137,12 +114,11 @@ report_metrics(y_test, test_pred)
  - Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
  - Delta test F1 (CLS+Numeric - Numeric-only): +0.0740

- ## Suggested Citation
-
- If you use these artifacts, please cite:

- - This repository (`JamieYuu/slm-bert-emb`)
- - The source dataset (`SahandNZ/cryptonews-articles-with-price-momentum-labels`)

  ## Notes
 
  ---
  language: en
  tags:
  - finance
+ - quant-finance
+ - algo-trading
+ - trading
+ - crypto
+ - sentiment-analysis
+ - feature-engineering
+ - alpha-research
+ - machine-learning
  - xgboost
  - bert
  - finbert
  - embeddings
  - preliminary-results
  datasets:
  - SahandNZ/cryptonews-articles-with-price-momentum-labels
  base_model:

  license: mit
  ---

+ # Crypto News Alpha Features: BERT-Lite Embeddings + Benchmark Results

+ Precomputed crypto-news CLS embeddings and benchmark outputs for fast quant experimentation.

+ If you want to test whether text alpha adds signal over market/FnG-style numeric features, this repo gives you ready-to-use artifacts without rerunning expensive encoding.

+ Primary text encoder used in this release: `boltuix/bert-lite`.

+ ## Data Source

  This work uses the public Hugging Face dataset:

  Please cite and credit the original dataset creator (`SahandNZ`) when reusing these artifacts.

+ ## Three Benchmark Tracks

  1. Numeric-only XGBoost baseline
  2. Frozen CLS embedding + numeric XGBoost
  3. Pure CLS linear baseline

+ ## Why Download This
+
+ - Start modeling immediately with ready-made `.npy` embedding tensors.
+ - Reproduce a strong baseline quickly for crypto text+tabular fusion.
+ - Compare pure numeric alpha vs text-augmented alpha with a clean metric summary.
+ - Use as a drop-in feature pack for your own classifier/regressor experiments.
+
  ## What Is Inside

  ### 1) Results artifacts

  - `embeddings/embeddings_manifest.csv`

+ ## Quick Start (Embedding Usage)

  ```python
  import numpy as np
+ from xgboost import XGBClassifier

+ # 1) Load precomputed text embeddings
+ X_text_train = np.load("embeddings/bertlite_full_fresh__train_cls_embeddings.npy")
+ X_text_val = np.load("embeddings/bertlite_full_fresh__val_cls_embeddings.npy")

+ # 2) Load your numeric features aligned to the same row order
+ # X_num_train, X_num_val = ...

+ # 3) Fuse text + numeric features
+ # X_train = np.concatenate([X_num_train, X_text_train], axis=1)
+ # X_val = np.concatenate([X_num_val, X_text_val], axis=1)

+ # 4) Train a downstream model
+ # clf = XGBClassifier(n_estimators=250, max_depth=4, learning_rate=0.1)
+ # clf.fit(X_train, y_train)
+ # y_pred = clf.predict(X_val)
+ ```

+ Minimal pseudocode:

+ ```text
+ load CLS embeddings
+ align with numeric feature rows
+ concatenate [numeric, CLS]
+ train XGBoost
+ compare vs numeric-only baseline
+ ```

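The fusion step in the pseudocode can be sketched in plain Python. This is a toy illustration only: the small lists stand in for the real `.npy` arrays, and a real pipeline would use `np.concatenate` on the loaded embeddings as in the quick-start snippet.

```python
# Toy stand-ins for the real feature arrays: 3 aligned rows of
# numeric features and 3 rows of (tiny) CLS embeddings.
X_num = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
X_cls = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

# Fusing only makes sense if row i of both arrays describes the
# same article/timestamp, so check the alignment first.
assert len(X_num) == len(X_cls)

# Row-wise concatenation: [numeric, CLS] per row.
X_fused = [num + cls for num, cls in zip(X_num, X_cls)]
```

Each fused row then carries `len(numeric) + len(CLS)` features, and a model trained on the fused rows can be scored against the numeric-only baseline.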

  ## Main Test Metrics (from results summary)

  - Pure CLS linear: accuracy 0.5252, F1 0.2860, precision 0.2866, recall 0.2854
  - Delta test F1 (CLS+Numeric - Numeric-only): +0.0740
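Metrics of this shape can be recomputed from raw predictions with a few lines of Python. A sketch only: `macro_f1` is a hypothetical helper, and macro averaging is an assumption here, since the results summary does not state the exact averaging scheme used.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the label set."""
    labels = sorted(set(y_true) | set(y_pred))
    precs, recs, f1s = [], [], []
    for c in labels:
        # Per-class confusion counts.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Toy usage on made-up binary labels (not the repo's data):
p, r, f = macro_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```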

+ ## Practical Notes
+
+ - Embeddings are frozen feature tensors (`.npy`), not fine-tuned checkpoints.
+ - Files are split by train/val/test for direct experimental use.
+ - Check `embeddings/embeddings_manifest.csv` to verify integrity.

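The manifest-based integrity check could look roughly like this. The `filename` and `sha256` column names are an assumption (check the actual header of `embeddings_manifest.csv`), and the demo below hashes a throwaway file rather than the real `.npy` artifacts.

```python
import csv
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    # Stream the file through SHA-256 so large .npy files fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path, root):
    # Return the filenames whose on-disk hash differs from the manifest.
    # Assumes 'filename' and 'sha256' columns in the manifest CSV.
    bad = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            if sha256_of(Path(root) / row["filename"]) != row["sha256"]:
                bad.append(row["filename"])
    return bad

# Demo with a throwaway file standing in for a real embedding artifact.
root = Path(tempfile.mkdtemp())
(root / "a.npy").write_bytes(b"fake-embedding-bytes")
with open(root / "manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "sha256"])
    writer.writerow(["a.npy", sha256_of(root / "a.npy")])

mismatches = verify_manifest(root / "manifest.csv", root)  # empty if all files match
```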
  ## Notes