---
language:
- en
license: mit
tags:
- text-generation
- transformer-decoder
- embeddings
- mpnet
- crypto
- pooled-embeddings
- social-media
library_name: pytorch
pipeline_tag: text-generation
base_model: sentence-transformers/all-mpnet-base-v2
---

# Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)

## Summary

- **Task**: Reconstruct natural-language crypto social-media posts from a **single pooled MPNet embedding** (reverse embedding).
- **Focus**: Crypto domain (social-media posts / short-form content).
- **Checkpoint**: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
- **Input contract**: a **pooled** `all-mpnet-base-v2` vector of shape `(768,)`, *not* a token-level `(seq_len, 768)` matrix.
- **Code**: this repo only hosts weights; loading & decoding are implemented in the Aparecium codebase
  (the v2 training repo and service are analogous in spirit to the v1 project  
  [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser)).

This is a **pooled-embedding variant** of Aparecium, distinct from the original token-level seq2seq reverser described in  
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

---

## Intended use

- **Research / engineering**:
  - Study how much crypto-domain information is recoverable from a single pooled embedding.
  - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
- **Not intended** for:
  - Reconstructing private, user-identifying, or sensitive content.
  - Any de‑anonymization of embedding corpora.

Reconstruction quality depends heavily on:

- The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
- Domain match (crypto social-media posts vs. your data),
- Decode settings (beam vs. sampling, constraints, reranking).

---

## Model architecture

On the encoder side, we assume a **pooled MPNet** encoder:

- Recommended: `sentence-transformers/all-mpnet-base-v2` (768‑D pooled output).

On the decoder side, v2 uses the Aparecium components:

- **EmbAdapter**:
  - Input: pooled vector `e ∈ R^768`.
  - Output: pseudo‑sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi‑scale).
- **Sketcher**:
  - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
  - In the S1 baseline checkpoint, it is trained but only lightly used at inference.
- **RealizerDecoder**:
  - Transformer decoder (GPT‑style) with:
    - `d_model = 768`
    - `n_layer = 12`
    - `n_head = 8`
    - `d_ff = 3072`
    - Dropout ≈ 0.1
  - Consumes `H` as cross‑attention memory and generates text tokens.
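The adapter and decoder shapes above can be sketched in PyTorch. This is an illustrative sketch only, not the actual Aparecium implementation: `EmbAdapterSketch` and the pseudo-sequence length `seq_len=8` are assumptions (the card only specifies a multi-scale memory `H ∈ R^{B × S × D}`), while `d_model=768`, 12 layers, 8 heads, and `d_ff=3072` come from the hyperparameters listed above.

```python
import torch
import torch.nn as nn

class EmbAdapterSketch(nn.Module):
    """Illustrative adapter: pooled (B, 768) -> pseudo-sequence memory (B, S, D).

    The real EmbAdapter is multi-scale; a single linear projection is the
    simplest shape-compatible stand-in.
    """
    def __init__(self, d_in: int = 768, d_model: int = 768, seq_len: int = 8):
        super().__init__()
        self.seq_len = seq_len
        self.d_model = d_model
        self.proj = nn.Linear(d_in, seq_len * d_model)

    def forward(self, e: torch.Tensor) -> torch.Tensor:   # e: (B, d_in)
        h = self.proj(e)                                  # (B, S * D)
        return h.view(e.size(0), self.seq_len, self.d_model)

# Decoder with the card's hyperparameters; H is consumed as cross-attention memory.
layer = nn.TransformerDecoderLayer(
    d_model=768, nhead=8, dim_feedforward=3072, dropout=0.1, batch_first=True
)
decoder = nn.TransformerDecoder(layer, num_layers=12)

e = torch.randn(2, 768)            # batch of pooled MPNet embeddings
H = EmbAdapterSketch()(e)          # (2, 8, 768) pseudo-sequence memory
tgt = torch.randn(2, 16, 768)      # embedded target tokens (teacher forcing)
out = decoder(tgt=tgt, memory=H)   # (2, 16, 768)
```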

Decoding:

- Deterministic beam search or sampling, with optional:
  - **Constraints** (e.g., require certain tickers/hashtags/amounts based on a plan).
  - **Surrogate similarity scorer `r(x, e)`** for reranking candidates.
  - **Final MPNet cosine rerank** across top‑K candidates.
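The final cosine rerank step reduces to a dot product over L2-normalized vectors. A minimal sketch, using random placeholder embeddings in place of real MPNet outputs (`cosine_rerank` is an illustrative helper name, not an Aparecium API):

```python
import numpy as np

def cosine_rerank(target_e: np.ndarray, candidate_embs: np.ndarray):
    """Pick the candidate whose embedding is closest (cosine) to the target.

    target_e: (768,) L2-normalized pooled embedding of the source text.
    candidate_embs: (K, 768) L2-normalized embeddings of the K decoded
    candidates (in practice produced by re-embedding each candidate with MPNet).
    """
    scores = candidate_embs @ target_e      # cosine == dot for unit vectors
    best = int(np.argmax(scores))
    return best, float(scores[best])

rng = np.random.default_rng(0)
target = rng.normal(size=768)
target /= np.linalg.norm(target)
cands = rng.normal(size=(5, 768))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
cands[3] = target                           # plant a perfect match for illustration
idx, score = cosine_rerank(target, cands)   # idx == 3, score == 1.0
```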

The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.

---

## Training data and provenance

- **Source**: synthetic crypto social-media posts generated with OpenAI models and stored in a DB (e.g., `tweets.db`).
- **Domain**:
  - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
- **Preparation (v2 pipeline)**:
  1. Extract raw text from the DB into JSONL.
  2. Embed each tweet with `sentence-transformers/all-mpnet-base-v2`:
     - `embedding ∈ R^768` (pooled), L2‑normalized.
     - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
  3. Split into train/val/test and shard into JSONL files.
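Step 2 produces one JSONL record per tweet with `{"text","embedding","plan"}` fields and an L2-normalized pooled vector. A minimal sketch of one record (the `make_record` helper and the `plan` shape are illustrative, not the actual pipeline code):

```python
import json
import numpy as np

def make_record(text: str, embedding, plan=None) -> str:
    """Build one JSONL line: text, L2-normalized pooled embedding, optional plan."""
    v = np.asarray(embedding, dtype=np.float32)
    v = v / np.linalg.norm(v)                  # L2-normalize, as in the pipeline
    return json.dumps({"text": text, "embedding": v.tolist(), "plan": plan or {}})

# In practice `embedding` comes from all-mpnet-base-v2; ones(768) is a stand-in.
rec = make_record("gm, $ETH looking strong", np.ones(768),
                  plan={"tickers": ["$ETH"]})
row = json.loads(rec)
```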

No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project  
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

---

## Training procedure (S1 baseline regimen)

This checkpoint corresponds to **S1 supervised training only** (no SCST/RL):

- Objective: teacher‑forcing cross‑entropy over the crypto tweet text, given the pooled embedding.
- Optimizer: AdamW
- Typical hyperparameters (baseline run):
  - Batch size: 64
  - Max length: 96 tokens (tweets)
  - Learning rate: 3e‑4 (cosine decay), warmup ~1k steps
  - Weight decay: 0.01
  - Grad clip: 1.0
  - Dropout: 0.1
- Data:
  - ~100k synthetic crypto tweets (train/val split).
  - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
- Checkpointing:
  - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross‑entropy.
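The learning-rate policy above (3e-4 peak, ~1k warmup steps, cosine decay) can be sketched as a schedule function. The total step budget `total=50_000` is an assumed value for illustration, not from the card:

```python
import math

def lr_at(step: int, base_lr: float = 3e-4, warmup: int = 1000,
          total: int = 50_000) -> float:
    """Linear warmup to base_lr over `warmup` steps, then cosine decay to 0."""
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / max(1, total - warmup)   # progress in [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(t, 1.0)))
```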

Future work (not in this checkpoint):

- SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
- Stronger constraints and rerank policies as described in the training plan.

---

## Evaluation protocol (baseline qualitative)

This repo does **not** include a full eval harness. The S1 baseline was validated qualitatively:

- Sample 10–20 crypto sentences (held‑out).
- For each:
  1. Embed text with `all-mpnet-base-v2` (pooled, normalized).
  2. Invert with Aparecium v2 S1 (beam search + rerank).
  3. Re‑embed the generated text with MPNet and compute cosine with the original embedding.
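Step 3 aggregates to simple cosine statistics over the held-out set. A sketch of that aggregation, with random unit vectors standing in for the real original and re-embedded MPNet vectors (`eval_cosine` is an illustrative helper, not part of the eval harness):

```python
import numpy as np

def eval_cosine(originals: np.ndarray, reembedded: np.ndarray) -> dict:
    """Row-wise cosine between original pooled embeddings and re-embedded
    reconstructions (both (N, 768), assumed L2-normalized)."""
    sims = np.sum(originals * reembedded, axis=1)
    return {"mean": float(sims.mean()), "min": float(sims.min())}

rng = np.random.default_rng(1)
o = rng.normal(size=(4, 768))
o /= np.linalg.norm(o, axis=1, keepdims=True)
stats = eval_cosine(o, o)   # perfect reconstruction -> all cosines 1.0
```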

For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card:  
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

---

## Input contract and usage

**Input** (v2, S1 baseline):

- A **single pooled MPNet embedding** (crypto tweet) of shape `(768,)`, L2‑normalized.
- Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.

Do **not** pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model  
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser), not this checkpoint.
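Since passing a token-level matrix is the most likely mistake, it is worth checking the contract up front. A minimal validation sketch (the `validate_pooled` helper is illustrative, not an Aparecium API):

```python
import numpy as np

def validate_pooled(e, dim: int = 768, tol: float = 1e-3) -> np.ndarray:
    """Reject token-level (seq_len, 768) matrices; normalize if needed."""
    e = np.asarray(e, dtype=np.float32)
    if e.ndim != 1 or e.shape[0] != dim:
        raise ValueError(
            f"expected a pooled vector of shape ({dim},), got {e.shape}; "
            "token-level matrices belong to the v1 seq2seq model"
        )
    n = float(np.linalg.norm(e))
    if abs(n - 1.0) > tol:
        e = e / n        # re-normalize rather than fail
    return e
```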

**Usage pattern (high level, pseudocode)**:

```python
import torch
from sentence_transformers import SentenceTransformer

# 1) Pooled MPNet embedding
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
                            device="cuda" if torch.cuda.is_available() else "cpu")
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]  # (768,)

# 2) Load Aparecium v2 S1 checkpoint
ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")

# 3) Recreate models from the Aparecium codebase (not included in this HF repo)
# from aparecium.aparecium.models.emb_adapter import EmbAdapter
# from aparecium.aparecium.models.decoder import RealizerDecoder
# from aparecium.aparecium.models.sketcher import Sketcher
# from aparecium.aparecium.utils.tokens import build_tokenizer
# and run the same decoding logic as in `aparecium/infer/service.py` or
# `aparecium/scripts/invert_once.py`.

# 4) Use beam search / constraints / reranking as in the training repo.
```

To actually use the model, you need the Aparecium codebase (training repo) where the `EmbAdapter`, `Sketcher`, `RealizerDecoder`, constraints, and decoding functions are defined.

---

## Limitations and responsible use

- Outputs are *approximations* of the original text under the MPNet embedding and LM prior:
  - They aim to preserve semantic gist and domain entities,
  - They are **not exact reconstructions**.
- The model can:
  - Produce generic phrasing,
  - Over‑use crypto buzzwords/hashtags,
  - Occasionally show noisy punctuation/emoji.
- Data are synthetic; domain semantics might differ from real social‑media distributions.
- Do **not** use this model to attempt to reconstruct sensitive or private user content from embeddings.

---

## Reproducibility (high‑level)

To reproduce or extend this checkpoint:

1. **Prepare data**:
   - Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
   - Extract raw text to `train/val/test` JSONL.
   - Embed with `all-mpnet-base-v2` (pooled 768‑D) and save as JSONL with `{"text","embedding","plan"}` fields.
2. **Train S1**:
   - Use the Aparecium v2 trainer (S1 supervised) with:
     - `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
   - Train until validation cross‑entropy and cosine proxy metrics plateau.
3. **Optional**:
   - Train surrogate similarity scorer `r` for reranking.
   - Add SCST RL (S2) if you implement the safe reward/decoding policies.
4. **Evaluate**:
   - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift.

---

## License

- **Code**: MIT (per Aparecium repositories).
- **Weights**: MIT, same as the code, unless explicitly overridden.

---

## Citation

If you use this model or the Aparecium codebase, please cite:

> Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets  
> SentiChain (Aparecium project)

You may also reference the v1 baseline model card:  
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).