ruitao-edward-chen committed
Commit 97cad85 · 1 Parent(s): f69735f

Update model card to HF style with metrics and usage

Files changed (1): README.md (+19 −8)
README.md CHANGED
@@ -1,9 +1,23 @@
- ### Aparecium Baseline (Crypto‑focused) — Model Card
+ ---
+ language: en
+ license: mit
+ library_name: pytorch
+ tags:
+ - transformer-decoder
+ - seq2seq
+ - embeddings
+ - mpnet
+ - text-reconstruction
+ - crypto
+ pipeline_tag: text2text-generation
+ ---
+
+ ### Aparecium Baseline Model Card

  #### Summary
  - **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
  - **Focus**: Crypto domain, with equities as auxiliary domain.
- - **Current checkpoint**: `models/baseline` reflects Phase 3 (early stop triggered after Phase 3 due to out‑of‑sample drop). Phase 2 performed best; consider publishing the Phase 2 checkpoint if available.
+ - **Checkpoint**: Baseline model trained with a phased schedule and early stopping.
  - **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used.
  - **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.

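For context on the input contract above: the model consumes a token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled sentence vector. A minimal sketch of producing such a matrix, assuming a standard 768‑dim MPNet encoder such as `sentence-transformers/all-mpnet-base-v2` (the card does not pin an exact encoder checkpoint):

```python
# Illustrative sketch: build the (seq_len, 768) token-level MPNet matrix the
# input contract describes. The encoder checkpoint is an assumption; the card
# does not pin one.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"  # assumed 768-dim MPNet

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()

text = "BTC reclaimed the 200-day average on strong spot volume."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = encoder(**inputs)

# Token-level matrix, not a pooled vector: one 768-dim row per token.
token_matrix = out.last_hidden_state.squeeze(0)
print(token_matrix.shape)  # torch.Size([seq_len, 768])
```

The essential point is keeping `last_hidden_state` per token rather than mean‑pooling it into a single sentence embedding.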
@@ -64,8 +78,7 @@ Recommended inference defaults:
  - Checkpointing: every 2,000 steps and at phase end.

  Notes
- - In this run, Phase 1→Phase 2 showed clear out‑of‑sample improvements; Phase 3 degraded; early stop triggered.
- - Best observed checkpoint: Phase 2 (if retained). The directory currently contains Phase 3; consider re‑exporting Phase 2.
+ - Training used early stopping based on out‑of‑sample cosine.

  ---

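The note above summarizes the stopping rule: training halts once held‑out (out‑of‑sample) cosine similarity degrades. A sketch of that rule, with hardcoded placeholder scores standing in for real per‑phase evaluations (not the card's metrics):

```python
# Illustrative sketch of early stopping on out-of-sample cosine degradation.
import torch
import torch.nn.functional as F

def mean_oos_cosine(pred: torch.Tensor, ref: torch.Tensor) -> float:
    """Mean cosine similarity between matched rows of two (N, 768) batches,
    e.g. MPNet embeddings of reconstructions vs. the original posts."""
    return F.cosine_similarity(pred, ref, dim=-1).mean().item()

# Placeholder per-phase held-out scores (NOT reported results), as would be
# produced by mean_oos_cosine on a fixed evaluation set after each phase.
phase_scores = [0.61, 0.70, 0.66]

best_score, best_phase = float("-inf"), -1
for phase, score in enumerate(phase_scores):
    if score > best_score:
        best_score, best_phase = score, phase
    else:
        # Out-of-sample cosine degraded relative to the best phase: stop here.
        print(f"early stop after phase {phase}")
        break
print(f"best phase: {best_phase} (held-out cosine {best_score:.2f})")
```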
@@ -120,12 +133,10 @@ Interpretation
  ---

  ### Reproducibility (high‑level)
- - Prepare caches:
-   - crypto: `data/pipeline/aparecium_crypto_500k.db`
-   - equities: `data/pipeline/aparecium_equities_500k.db`
+ - Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
  - Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
  - Evaluation: 1,000 samples/domain with the decode settings shown above.
- - Best observed baseline: Phase 2 (early‑stop triggered after Phase 3). The directory currently contains Phase 3 unless a Phase 2 copy is retained.
+ - The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.

  ---
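The reproducibility bullets above ask users to build their own embedding caches, but the original schema is not published. One plausible layout, suggested by the `.db` paths in the previous revision, is SQLite with serialized numpy blobs; paths and table names here are hypothetical:

```python
# One plausible cache layout (the card does not publish the real schema):
# SQLite storing each post's token-level matrix as a numpy blob keyed by id.
import io
import sqlite3

import numpy as np

def open_cache(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (post_id TEXT PRIMARY KEY, matrix BLOB)"
    )
    return conn

def put(conn: sqlite3.Connection, post_id: str, matrix: np.ndarray) -> None:
    buf = io.BytesIO()
    np.save(buf, matrix.astype(np.float32))  # (seq_len, 768)
    conn.execute(
        "INSERT OR REPLACE INTO embeddings VALUES (?, ?)", (post_id, buf.getvalue())
    )
    conn.commit()

def get(conn: sqlite3.Connection, post_id: str) -> np.ndarray:
    (blob,) = conn.execute(
        "SELECT matrix FROM embeddings WHERE post_id = ?", (post_id,)
    ).fetchone()
    return np.load(io.BytesIO(blob))

conn = open_cache("my_crypto_cache.db")  # your own path; nothing is bundled
put(conn, "post-0001", np.zeros((12, 768), dtype=np.float32))
print(get(conn, "post-0001").shape)  # (12, 768)
```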
 
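Finally, the training bullet's hyperparameters (iterative 10% phases, 80:20 crypto:equities mix, LR=5e‑5, BS=16, checkpoints every 2,000 steps) could be collected into a single config object. A hedged sketch; the field names are illustrative, not the repository's actual config keys:

```python
# Illustrative config mirroring the training recipe summarized in the card.
from dataclasses import dataclass

@dataclass
class BaselineRecipe:
    phases: int = 10                        # iterative phases, ~10% of data each
    domain_mix: tuple = (0.8, 0.2)          # crypto : equities sampling ratio
    learning_rate: float = 5e-5
    batch_size: int = 16
    checkpoint_every_steps: int = 2_000     # plus a checkpoint at each phase end
    early_stop_metric: str = "oos_cosine"   # stop when it degrades

print(BaselineRecipe())
```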