ruitao-edward-chen committed · Commit 97cad85 · 1 Parent(s): f69735f
Update model card to HF style with metrics and usage

README.md CHANGED
@@ -1,9 +1,23 @@
-
+---
+language: en
+license: mit
+library_name: pytorch
+tags:
+- transformer-decoder
+- seq2seq
+- embeddings
+- mpnet
+- text-reconstruction
+- crypto
+pipeline_tag: text2text-generation
+---
+
+### Aparecium Baseline Model Card
 
 #### Summary
 - **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
 - **Focus**: Crypto domain, with equities as auxiliary domain.
-- **
+- **Checkpoint**: Baseline model trained with a phased schedule and early stopping.
 - **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used.
 - **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.
 
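The "Input contract" bullet above is the key integration point. A minimal sketch of producing that unpooled matrix with Hugging Face `transformers`, assuming the `sentence-transformers/all-mpnet-base-v2` encoder (the card specifies only token‑level MPNet at 768 dimensions, not a particular checkpoint):

```python
# Sketch only: the encoder checkpoint is an assumption; the card requires
# a token-level MPNet matrix of shape (seq_len, 768), not a pooled vector.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-mpnet-base-v2"  # assumed MPNet encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id).eval()

inputs = tokenizer("BTC looks ready for a breakout this week.", return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (1, seq_len, 768); squeeze to the (seq_len, 768) contract
    token_matrix = encoder(**inputs).last_hidden_state.squeeze(0)
print(token_matrix.shape)  # torch.Size([seq_len, 768])
```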
@@ -64,8 +78,7 @@ Recommended inference defaults:
 - Checkpointing: every 2,000 steps and at phase end.
 
 Notes
-
-- Best observed checkpoint: Phase 2 (if retained). The directory currently contains Phase 3; consider re‑exporting Phase 2.
+- Training used early stopping based on out‑of‑sample cosine.
 
 ---
 
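The early‑stopping note above is the one training‑control detail the card commits to. A self‑contained sketch of phase‑level early stopping on mean out‑of‑sample cosine, with toy tensors standing in for held‑out reconstruction/reference embeddings (the repo's actual loop, metrics, and thresholds are not published here):

```python
# Sketch only: illustrates "early stopping based on out-of-sample cosine"
# with synthetic data; all names and numbers are hypothetical.
import torch
import torch.nn.functional as F

def mean_cosine(pred: torch.Tensor, ref: torch.Tensor) -> float:
    """Mean row-wise cosine similarity between two (n, 768) batches."""
    return F.cosine_similarity(pred, ref, dim=1).mean().item()

torch.manual_seed(0)
ref = torch.randn(1000, 768)           # held-out reference embeddings (toy)
best = float("-inf")
for phase in range(1, 6):
    noise = 0.5 if phase < 4 else 2.0  # simulate degradation in late phases
    pred = ref + noise * torch.randn_like(ref)
    score = mean_cosine(pred, ref)
    if score <= best:
        print(f"phase {phase}: cosine fell to {score:.3f}; stop, keep prior checkpoint")
        break
    best = score
    print(f"phase {phase}: cosine {score:.3f}; checkpoint retained")
```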
@@ -120,12 +133,10 @@ Interpretation
 ---
 
 ### Reproducibility (high‑level)
-- Prepare caches:
-  - crypto: `data/pipeline/aparecium_crypto_500k.db`
-  - equities: `data/pipeline/aparecium_equities_500k.db`
+- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
 - Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
 - Evaluation: 1,000 samples/domain with the decode settings shown above.
-
+- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.
 
 ---
 
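The cache bullet above deliberately leaves storage up to the user. One possible shape for such a cache, assuming SQLite (as the removed `.db` paths hint) and the same assumed MPNet encoder; the file name and table layout are hypothetical:

```python
# Sketch only: a hypothetical SQLite cache of unpooled (seq_len, 768) matrices.
# The card prescribes no format; path, schema, and encoder are assumptions.
import io
import sqlite3
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence-transformers/all-mpnet-base-v2"  # assumed MPNet encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id).eval()

conn = sqlite3.connect("my_crypto_cache.db")  # your own path, per the card
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)"
)

posts = ["ETH gas fees dropped sharply overnight.", "Altcoin volume is thinning out."]
for i, post in enumerate(posts):
    inputs = tokenizer(post, return_tensors="pt", truncation=True)
    with torch.no_grad():
        emb = encoder(**inputs).last_hidden_state.squeeze(0)  # (seq_len, 768)
    buf = io.BytesIO()
    torch.save(emb, buf)  # serialize the token-level matrix as a blob
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)", (i, post, buf.getvalue()))
conn.commit()
conn.close()
```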