---
language:
- en
license: mit
tags:
- text-generation
- transformer-decoder
- embeddings
- mpnet
- crypto
- pooled-embeddings
- social-media
library_name: pytorch
pipeline_tag: text-generation
base_model: sentence-transformers/all-mpnet-base-v2
---
# Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)
## Summary
- **Task**: Reconstruct natural-language crypto social-media posts from a **single pooled MPNet embedding** (reverse embedding).
- **Focus**: Crypto domain (social-media posts / short-form content).
- **Checkpoint**: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
- **Input contract**: a **pooled** `all-mpnet-base-v2` vector of shape `(768,)`, *not* a token-level `(seq_len, 768)` matrix.
- **Code**: this repo hosts only the weights; loading and decoding are implemented in the Aparecium codebase
  (the v2 training repo and service are analogous in spirit to the v1 project
  [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser)).
This is a **pooled-embedding variant** of Aparecium, distinct from the original token-level seq2seq reverser described in
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
---
## Intended use
- **Research / engineering**:
- Study how much crypto-domain information is recoverable from a single pooled embedding.
- Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
- **Not intended** for:
- Reconstructing private, user-identifying, or sensitive content.
- Any de‑anonymization of embedding corpora.
Reconstruction quality depends heavily on:
- The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
- Domain match (crypto social-media posts vs. your data),
- Decode settings (beam vs. sampling, constraints, reranking).
---
## Model architecture
On the encoder side, we assume a **pooled MPNet** encoder:
- Recommended: `sentence-transformers/all-mpnet-base-v2` (768‑D pooled output).
On the decoder side, v2 uses the Aparecium components:
- **EmbAdapter**:
- Input: pooled vector `e ∈ R^768`.
- Output: pseudo‑sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi‑scale).
- **Sketcher**:
- Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
- In the S1 baseline checkpoint, it is trained but only lightly used at inference.
- **RealizerDecoder**:
- Transformer decoder (GPT‑style) with:
- `d_model = 768`
- `n_layer = 12`
- `n_head = 8`
- `d_ff = 3072`
- Dropout ≈ 0.1
- Consumes `H` as cross‑attention memory and generates text tokens.
Decoding:
- Deterministic beam search or sampling, with optional:
- **Constraints** (e.g., require certain tickers/hashtags/amounts based on a plan).
- **Surrogate similarity scorer `r(x, e)`** for reranking candidates.
- **Final MPNet cosine rerank** across top‑K candidates.
The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.
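To make the adapter's role concrete, here is a minimal sketch of how a pooled 768-D vector can be expanded into a pseudo-sequence memory `H ∈ R^{B × S × D}` for cross-attention. The class name, memory length, and internals are illustrative assumptions, not the actual Aparecium `EmbAdapter` code:

```python
import torch
import torch.nn as nn

class EmbAdapterSketch(nn.Module):
    """Illustrative sketch (NOT the real Aparecium module): expand a pooled
    768-D embedding into a pseudo-sequence memory of shape (B, S, D)."""

    def __init__(self, d_model: int = 768, mem_len: int = 16):
        super().__init__()
        self.mem_len = mem_len
        # Project the pooled vector to S memory slots in a single matmul.
        self.proj = nn.Linear(d_model, mem_len * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (B, 768) pooled MPNet embedding, L2-normalized
        batch, d = e.shape
        h = self.proj(e).view(batch, self.mem_len, d)  # (B, S, D)
        return self.norm(h)

adapter = EmbAdapterSketch()
e = torch.nn.functional.normalize(torch.randn(2, 768), dim=-1)
H = adapter(e)
print(tuple(H.shape))  # (2, 16, 768)
```

The decoder would then attend over `H` via cross-attention exactly as it would over an encoder's token states.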
---
## Training data and provenance
- **Source**: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
- **Domain**:
- Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
- **Preparation (v2 pipeline)**:
1. Extract raw text from the DB into JSONL.
2. Embed each tweet with `sentence-transformers/all-mpnet-base-v2`:
- `embedding ∈ R^768` (pooled), L2‑normalized.
- Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
3. Split into train/val/test and shard into JSONL files.
No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
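A minimal sketch of the JSONL record produced by steps 2–3. The field names `{"text", "embedding", "plan"}` follow the v2 layout described above; the random vector here is only a stand-in for the real `mpnet.encode(...)` call:

```python
import json
import random

def make_record(text: str, embedding, plan=None) -> str:
    """Serialize one training example to a JSONL line
    with the {"text", "embedding", "plan"} fields."""
    return json.dumps({
        "text": text,
        "embedding": [round(x, 6) for x in embedding],
        "plan": plan or {},
    })

# Stand-in for the pooled MPNet output: a random L2-normalized 768-D vector.
raw = [random.gauss(0, 1) for _ in range(768)]
norm = sum(x * x for x in raw) ** 0.5
emb = [x / norm for x in raw]

line = make_record(
    "gm, $ETH gas is cheap today #DeFi", emb,
    plan={"tickers": ["$ETH"], "hashtags": ["#DeFi"]},
)
rec = json.loads(line)
print(len(rec["embedding"]))  # 768
```

One record per line, sharded across `train/val/test` JSONL files.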
---
## Training procedure (S1 baseline regimen)
This checkpoint corresponds to **S1 supervised training only** (no SCST/RL):
- Objective: teacher‑forcing cross‑entropy over the crypto tweet text, given the pooled embedding.
- Optimizer: AdamW
- Typical hyperparameters (baseline run):
- Batch size: 64
- Max length: 96 tokens (tweets)
- Learning rate: 3e‑4 (cosine decay), warmup ~1k steps
- Weight decay: 0.01
- Grad clip: 1.0
- Dropout: 0.1
- Data:
- ~100k synthetic crypto tweets (train/val split).
- Embeddings precomputed via `all-mpnet-base-v2` and normalized.
- Checkpointing:
- Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross‑entropy.
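The schedule above (peak LR 3e-4, ~1k warmup steps, cosine decay) can be sketched as a plain function; the total-step horizon is an assumed value, not a documented one:

```python
import math

def lr_at(step: int, base_lr: float = 3e-4,
          warmup: int = 1000, total: int = 50_000) -> float:
    """Linear warmup to base_lr over `warmup` steps,
    then cosine decay to zero by `total` steps (assumed horizon)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

print(lr_at(500))   # halfway through warmup -> 1.5e-4
print(lr_at(1000))  # peak -> 3e-4
```

In PyTorch this is typically wired up with `torch.optim.lr_scheduler.LambdaLR` over an `AdamW` optimizer.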
Future work (not in this checkpoint):
- SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
- Stronger constraints and rerank policies as described in the training plan.
---
## Evaluation protocol (baseline qualitative)
This repo does **not** include a full eval harness. The S1 baseline was validated qualitatively:
- Sample 10–20 crypto sentences (held‑out).
- For each:
1. Embed text with `all-mpnet-base-v2` (pooled, normalized).
2. Invert with Aparecium v2 S1 (beam search + rerank).
3. Re‑embed the generated text with MPNet and compute cosine with the original embedding.
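Step 3 reduces to a cosine similarity between the original pooled embedding and the re-embedded reconstruction. A self-contained sketch, with random stand-ins where the real protocol would call `mpnet.encode(...)`:

```python
import math
import random

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-ins for mpnet.encode(original) and mpnet.encode(reconstruction):
# a faithful reconstruction should re-embed close to the original vector.
random.seed(0)
e_orig = [random.gauss(0, 1) for _ in range(768)]
e_recon = [x + 0.1 * random.gauss(0, 1) for x in e_orig]

score = cosine(e_orig, e_recon)
assert score > 0.95  # small perturbation -> high cosine
```

Averaging this score over the held-out sample gives the qualitative signal used to validate the S1 baseline.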
For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card:
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
---
## Input contract and usage
**Input** (v2, S1 baseline):
- A **single pooled MPNet embedding** (crypto tweet) of shape `(768,)`, L2‑normalized.
- Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.
Do **not** pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser), not this checkpoint.
**Usage pattern (high level, pseudocode)**:
```python
import torch
from sentence_transformers import SentenceTransformer

# 1) Pooled MPNet embedding
mpnet = SentenceTransformer(
    "sentence-transformers/all-mpnet-base-v2",
    device="cuda" if torch.cuda.is_available() else "cpu",
)
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]  # (768,)

# 2) Load the Aparecium v2 S1 checkpoint
ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")

# 3) Recreate the models from the Aparecium codebase (not included in this HF repo):
# from aparecium.aparecium.models.emb_adapter import EmbAdapter
# from aparecium.aparecium.models.decoder import RealizerDecoder
# from aparecium.aparecium.models.sketcher import Sketcher
# from aparecium.aparecium.utils.tokens import build_tokenizer
# and run the same decoding logic as in `aparecium/infer/service.py` or
# `aparecium/scripts/invert_once.py`.

# 4) Use beam search / constraints / reranking as in the training repo.
```
To actually use the model, you need the Aparecium codebase (training repo) where the `EmbAdapter`, `Sketcher`, `RealizerDecoder`, constraints, and decoding functions are defined.
---
## Limitations and responsible use
- Outputs are *approximations* of the original text under the MPNet embedding and LM prior:
- They aim to preserve semantic gist and domain entities,
- They are **not exact reconstructions**.
- The model can:
- Produce generic phrasing,
- Over‑use crypto buzzwords/hashtags,
- Occasionally show noisy punctuation/emoji.
- Data are synthetic; domain semantics might differ from real social‑media distributions.
- Do **not** use this model to attempt to reconstruct sensitive or private user content from embeddings.
---
## Reproducibility (high‑level)
To reproduce or extend this checkpoint:
1. **Prepare data**:
- Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
- Extract raw text to `train/val/test` JSONL.
- Embed with `all-mpnet-base-v2` (pooled 768‑D) and save as JSONL with `{"text","embedding","plan"}` fields.
2. **Train S1**:
- Use the Aparecium v2 trainer (S1 supervised) with:
- `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
- Train until validation cross‑entropy and cosine proxy metrics plateau.
3. **Optional**:
- Train surrogate similarity scorer `r` for reranking.
- Add SCST RL (S2) if you implement the safe reward/decoding policies.
4. **Evaluate**:
- Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift.
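A simple degeneracy proxy for the harness in step 4 is the fraction of repeated n-grams in a generated text; the metric name and form are illustrative assumptions, not the v1 project's exact definition:

```python
def repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of duplicated n-grams in a text: 0.0 means all
    n-grams are unique, values near 1.0 indicate degenerate loops."""
    toks = text.split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

print(repetition_rate("gm gm gm gm gm gm"))                       # 0.75
print(repetition_rate("blob fees spiked after EIP-4844 on L2s"))  # 0.0
```

Averaged over a held-out set, this complements the cosine score: a model can score a high cosine while still looping, so both should be tracked.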
---
## License
- **Code**: MIT (per Aparecium repositories).
- **Weights**: MIT, same as the code, unless explicitly overridden.
---
## Citation
If you use this model or the Aparecium codebase, please cite:
> Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
> SentiChain (Aparecium project)
You may also reference the v1 baseline model card:
[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).