ruitao-edward-chen committed on
Commit 3a12737 · 1 Parent(s): 4bab501

Clarify wording: social-media posts instead of tweets

Files changed (1):
  1. README.md +233 -3
README.md CHANGED
@@ -1,4 +1,234 @@
- # Aparecium v2 – Pooled MPNet Reverser (S1 baseline)
-
- S1 supervised baseline for pooled all-mpnet-base-v2 embeddings.
- Requires the Aparecium codebase to load and run (see your training repo).
 
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - text-generation
+ - transformer-decoder
+ - embeddings
+ - mpnet
+ - crypto
+ - pooled-embeddings
+ - social-media
+ library_name: pytorch
+ pipeline_tag: text-generation
+ base_model: sentence-transformers/all-mpnet-base-v2
+ ---
+
+ # Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)
+
+ ## Summary
+
+ - **Task**: Reconstruct natural-language crypto social-media posts from a **single pooled MPNet embedding** (reverse embedding).
+ - **Focus**: Crypto domain (social-media posts / short-form content).
+ - **Checkpoint**: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
+ - **Input contract**: a **pooled** `all-mpnet-base-v2` vector of shape `(768,)`, *not* a token-level `(seq_len, 768)` matrix.
+ - **Code**: this repo only hosts weights; loading & decoding are implemented in the Aparecium codebase
+   (the v2 training repo and service are analogous in spirit to the v1 project
+   [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser)).
+
+ This is a **pooled-embedding variant** of Aparecium, distinct from the original token-level seq2seq reverser described in
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
+
+ ---
+
+ ## Intended use
+
+ - **Research / engineering**:
+   - Study how much crypto-domain information is recoverable from a single pooled embedding.
+   - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
+ - **Not intended** for:
+   - Reconstructing private, user-identifying, or sensitive content.
+   - Any de-anonymization of embedding corpora.
+
+ Reconstruction quality depends heavily on:
+
+ - The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
+ - Domain match (crypto social-media posts vs. your data),
+ - Decode settings (beam vs. sampling, constraints, reranking).
+
+ ---
+
+ ## Model architecture
+
+ On the encoder side, we assume a **pooled MPNet** encoder:
+
+ - Recommended: `sentence-transformers/all-mpnet-base-v2` (768-D pooled output).
+
+ On the decoder side, v2 uses the Aparecium components:
+
+ - **EmbAdapter**:
+   - Input: pooled vector `e ∈ R^768`.
+   - Output: pseudo-sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi-scale).
+ - **Sketcher**:
+   - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
+   - In the S1 baseline checkpoint, it is trained but only lightly used at inference.
+ - **RealizerDecoder**:
+   - Transformer decoder (GPT-style) with:
+     - `d_model = 768`
+     - `n_layer = 12`
+     - `n_head = 8`
+     - `d_ff = 3072`
+     - Dropout ≈ 0.1
+   - Consumes `H` as cross-attention memory and generates text tokens.
+
75
+ Decoding:
76
+
77
+ - Deterministic beam search or sampling, with optional:
78
+ - **Constraints** (e.g., require certain tickers/hashtags/amounts based on a plan).
79
+ - **Surrogate similarity scorer `r(x, e)`** for reranking candidates.
80
+ - **Final MPNet cosine rerank** across top‑K candidates.
81
+
82
+ The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.
83
+
84
+ ---
85
+
86
+ ## Training data and provenance
87
+
88
+ - **Source**: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
89
+ - **Domain**:
90
+ - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
91
+ - **Preparation (v2 pipeline)**:
92
+ 1. Extract raw text from the DB into JSONL.
93
+ 2. Embed each tweet with `sentence-transformers/all-mpnet-base-v2`:
94
+ - `embedding ∈ R^768` (pooled), L2‑normalized.
95
+ - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
96
+ 3. Split into train/val/test and shard into JSONL files.
97
+
98
+ No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project
99
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
100
+
101
+ ---
102
+
103
+ ## Training procedure (S1 baseline regimen)
104
+
105
+ This checkpoint corresponds to **S1 supervised training only** (no SCST/RL):
106
+
107
+ - Objective: teacher‑forcing cross‑entropy over the crypto tweet text, given the pooled embedding.
108
+ - Optimizer: AdamW
109
+ - Typical hyperparameters (baseline run):
110
+ - Batch size: 64
111
+ - Max length: 96 tokens (tweets)
112
+ - Learning rate: 3e‑4 (cosine decay), warmup ~1k steps
113
+ - Weight decay: 0.01
114
+ - Grad clip: 1.0
115
+ - Dropout: 0.1
116
+ - Data:
117
+ - ~100k synthetic crypto tweets (train/val split).
118
+ - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
119
+ - Checkpointing:
120
+ - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross‑entropy.
121
+
122
+ Future work (not in this checkpoint):
123
+
124
+ - SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
125
+ - Stronger constraints and rerank policies as described in the training plan.
126
+
127
+ ---
128
+
129
+ ## Evaluation protocol (baseline qualitative)
130
+
131
+ This repo does **not** include a full eval harness. The S1 baseline was validated qualitatively:
132
+
133
+ - Sample 10–20 crypto sentences (held‑out).
134
+ - For each:
135
+ 1. Embed text with `all-mpnet-base-v2` (pooled, normalized).
136
+ 2. Invert with Aparecium v2 S1 (beam search + rerank).
137
+ 3. Re‑embed the generated text with MPNet and compute cosine with the original embedding.
138
+
139
+ For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card:
140
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
141
+
142
+ ---
143
+
144
+ ## Input contract and usage
145
+
146
+ **Input** (v2, S1 baseline):
147
+
148
+ - A **single pooled MPNet embedding** (crypto tweet) of shape `(768,)`, L2‑normalized.
149
+ - Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.
150
+
151
+ Do **not** pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model
152
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser), not this checkpoint.
153
+
154
+ **Usage pattern (high level, pseudocode)**:
155
+
156
+ ```python
157
+ import torch, json
158
+ from sentence_transformers import SentenceTransformer
159
+
160
+ # 1) Pooled MPNet embedding
161
+ mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
162
+ device="cuda" if torch.cuda.is_available() else "cpu")
163
+ text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
164
+ e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0] # (768,)
165
+
166
+ # 2) Load Aparecium v2 S1 checkpoint
167
+ ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")
168
+
169
+ # 3) Recreate models from the Aparecium codebase (not included in this HF repo)
170
+ # from aparecium.aparecium.models.emb_adapter import EmbAdapter
171
+ # from aparecium.aparecium.models.decoder import RealizerDecoder
172
+ # from aparecium.aparecium.models.sketcher import Sketcher
173
+ # from aparecium.aparecium.utils.tokens import build_tokenizer
174
+ # and run the same decoding logic as in `aparecium/infer/service.py` or
175
+ # `aparecium/scripts/invert_once.py`.
176
+
177
+ # 4) Use beam search / constraints / reranking as in the training repo.
178
+ ```
179
+
180
+ To actually use the model, you need the Aparecium codebase (training repo) where the `EmbAdapter`, `Sketcher`, `RealizerDecoder`, constraints, and decoding functions are defined.
181
+
182
+ ---
183
+
184
+ ## Limitations and responsible use
185
+
186
+ - Outputs are *approximations* of the original text under the MPNet embedding and LM prior:
187
+ - They aim to preserve semantic gist and domain entities,
188
+ - They are **not exact reconstructions**.
189
+ - The model can:
190
+ - Produce generic phrasing,
191
+ - Over‑use crypto buzzwords/hashtags,
192
+ - Occasionally show noisy punctuation/emoji.
193
+ - Data are synthetic; domain semantics might differ from real social‑media distributions.
194
+ - Do **not** use this model to attempt to reconstruct sensitive or private user content from embeddings.
195
+
196
+ ---
197
+
198
+ ## Reproducibility (high‑level)
199
+
200
+ To reproduce or extend this checkpoint:
201
+
202
+ 1. **Prepare data**:
203
+ - Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
204
+ - Extract raw text to `train/val/test` JSONL.
205
+ - Embed with `all-mpnet-base-v2` (pooled 768‑D) and save as JSONL with `{"text","embedding","plan"}` fields.
206
+ 2. **Train S1**:
207
+ - Use the Aparecium v2 trainer (S1 supervised) with:
208
+ - `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
209
+ - Train until validation cross‑entropy and cosine proxy metrics plateau.
210
+ 3. **Optional**:
211
+ - Train surrogate similarity scorer `r` for reranking.
212
+ - Add SCST RL (S2) if you implement the safe reward/decoding policies.
213
+ 4. **Evaluate**:
214
+ - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift.
215
+
216
+ ---
217
+
218
+ ## License
219
+
220
+ - **Code**: MIT (per Aparecium repositories).
221
+ - **Weights**: MIT, same as the code, unless explicitly overridden.
222
+
223
+ ---
224
+
225
+ ## Citation
226
+
227
+ If you use this model or the Aparecium codebase, please cite:
228
+
229
+ > Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
230
+ > SentiChain (Aparecium project)
231
+
232
+ You may also reference the v1 baseline model card:
233
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
234