ruitao-edward-chen committed · Commit 97cad85 · 1 Parent(s): f69735f
Update model card to HF style with metrics and usage

README.md CHANGED
@@ -1,9 +1,23 @@
-
+---
+language: en
+license: mit
+library_name: pytorch
+tags:
+- transformer-decoder
+- seq2seq
+- embeddings
+- mpnet
+- text-reconstruction
+- crypto
+pipeline_tag: text2text-generation
+---
+
+### Aparecium Baseline Model Card
 
 #### Summary
 - **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
 - **Focus**: Crypto domain, with equities as auxiliary domain.
-- **
+- **Checkpoint**: Baseline model trained with a phased schedule and early stopping.
 - **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used.
 - **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.
 
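The "Input contract" bullet above is the key integration point. A minimal sketch of producing that unpooled matrix with Hugging Face `transformers`, assuming the `sentence-transformers/all-mpnet-base-v2` encoder (the card specifies only token‑level MPNet at 768 dimensions, not a particular checkpoint):

```python
# Sketch only: the encoder checkpoint is an assumption; the card requires
# a token-level MPNet matrix of shape (seq_len, 768), not a pooled vector.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-mpnet-base-v2"  # assumed MPNet encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id).eval()

inputs = tokenizer("BTC looks ready for a breakout this week.", return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (1, seq_len, 768); squeeze to the (seq_len, 768) contract
    token_matrix = encoder(**inputs).last_hidden_state.squeeze(0)
print(token_matrix.shape)  # torch.Size([seq_len, 768])
```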
@@ -64,8 +78,7 @@ Recommended inference defaults:
 - Checkpointing: every 2,000 steps and at phase end.
 
 Notes
-
-- Best observed checkpoint: Phase 2 (if retained). The directory currently contains Phase 3; consider re‑exporting Phase 2.
+- Training used early stopping based on out‑of‑sample cosine.
 
 ---
 
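The early‑stopping note above is the one training‑control detail the card commits to. A self‑contained sketch of phase‑level early stopping on mean out‑of‑sample cosine, with toy tensors standing in for held‑out reconstruction/reference embeddings (the repo's actual loop, metrics, and thresholds are not published here):

```python
# Sketch only: illustrates "early stopping based on out-of-sample cosine"
# with synthetic data; all names and numbers are hypothetical.
import torch
import torch.nn.functional as F

def mean_cosine(pred: torch.Tensor, ref: torch.Tensor) -> float:
    """Mean row-wise cosine similarity between two (n, 768) batches."""
    return F.cosine_similarity(pred, ref, dim=1).mean().item()

torch.manual_seed(0)
ref = torch.randn(1000, 768)           # held-out reference embeddings (toy)
best = float("-inf")
for phase in range(1, 6):
    noise = 0.5 if phase < 4 else 2.0  # simulate degradation in late phases
    pred = ref + noise * torch.randn_like(ref)
    score = mean_cosine(pred, ref)
    if score <= best:
        print(f"phase {phase}: cosine fell to {score:.3f}; stop, keep prior checkpoint")
        break
    best = score
    print(f"phase {phase}: cosine {score:.3f}; checkpoint retained")
```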
@@ -120,12 +133,10 @@ Interpretation
 ---
 
 ### Reproducibility (high‑level)
-- Prepare caches:
-  - crypto: `data/pipeline/aparecium_crypto_500k.db`
-  - equities: `data/pipeline/aparecium_equities_500k.db`
+- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
 - Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
 - Evaluation: 1,000 samples/domain with the decode settings shown above.
-
+- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.
 
 ---
 
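The cache bullet above deliberately leaves storage up to the user. One possible shape for such a cache, assuming SQLite (as the removed `.db` paths hint) and the same assumed MPNet encoder; the file name and table layout are hypothetical:

```python
# Sketch only: a hypothetical SQLite cache of unpooled (seq_len, 768) matrices.
# The card prescribes no format; path, schema, and encoder are assumptions.
import io
import sqlite3
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-mpnet-base-v2"  # assumed MPNet encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id).eval()

conn = sqlite3.connect("my_crypto_cache.db")  # your own path, per the card
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)"
)

posts = ["ETH gas fees dropped sharply overnight.", "Altcoin volume is thinning out."]
for i, post in enumerate(posts):
    inputs = tokenizer(post, return_tensors="pt", truncation=True)
    with torch.no_grad():
        emb = encoder(**inputs).last_hidden_state.squeeze(0)  # (seq_len, 768)
    buf = io.BytesIO()
    torch.save(emb, buf)  # serialize the token-level matrix as a blob
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)", (i, post, buf.getvalue()))
conn.commit()
conn.close()
```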