---
language:
- en
license: mit
tags:
- text-generation
- transformer-decoder
- embeddings
- mpnet
- crypto
- pooled-embeddings
- social-media
library_name: pytorch
pipeline_tag: text-generation
base_model: sentence-transformers/all-mpnet-base-v2
---

# Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)

## Summary

- **Task**: Reconstruct natural-language crypto social-media posts from a **single pooled MPNet embedding** (reverse embedding).
- **Focus**: Crypto domain (social-media posts / short-form content).
- **Checkpoint**: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
- **Input contract**: a **pooled** `all-mpnet-base-v2` vector of shape `(768,)`, *not* a token-level `(seq_len, 768)` matrix.
- **Code**: this repo hosts only the weights; loading and decoding are implemented in the Aparecium codebase (the v2 training repo and service are analogous in spirit to the v1 project [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser)).

This is a **pooled-embedding variant** of Aparecium, distinct from the original token-level seq2seq reverser described in [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

---

## Intended use

- **Research / engineering**:
  - Study how much crypto-domain information is recoverable from a single pooled embedding.
  - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
- **Not intended** for:
  - Reconstructing private, user-identifying, or sensitive content.
  - Any de-anonymization of embedding corpora.

Reconstruction quality depends heavily on:

- The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
- Domain match (crypto social-media posts vs. your data),
- Decode settings (beam vs. sampling, constraints, reranking).

---

## Model architecture

On the encoder side, we assume a **pooled MPNet** encoder:

- Recommended: `sentence-transformers/all-mpnet-base-v2` (768-D pooled output).

On the decoder side, v2 uses the Aparecium components:

- **EmbAdapter**:
  - Input: pooled vector `e ∈ R^768`.
  - Output: pseudo-sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi-scale).
- **Sketcher**:
  - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
  - In the S1 baseline checkpoint it is trained, but only lightly used at inference.
- **RealizerDecoder**:
  - Transformer decoder (GPT-style) with:
    - `d_model = 768`
    - `n_layer = 12`
    - `n_head = 8`
    - `d_ff = 3072`
    - Dropout ≈ 0.1
  - Consumes `H` as cross-attention memory and generates text tokens.
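
The decoder-side shapes can be sketched in PyTorch. This is a minimal illustration, not the actual Aparecium implementation: `EmbAdapterSketch`, the memory length `S = 8`, and the reduced `num_layers = 2` (the checkpoint uses 12) are assumptions.

```python
import torch
import torch.nn as nn

class EmbAdapterSketch(nn.Module):
    """Toy adapter: expand a pooled 768-D vector into a pseudo-sequence
    memory H of shape (B, S, 768). Memory length S is assumed here; the
    real EmbAdapter is multi-scale."""
    def __init__(self, d_model: int = 768, mem_len: int = 8):
        super().__init__()
        self.mem_len = mem_len
        self.proj = nn.Linear(d_model, mem_len * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, e: torch.Tensor) -> torch.Tensor:  # e: (B, 768)
        b, d = e.shape
        h = self.proj(e).view(b, self.mem_len, d)  # (B, S, 768)
        return self.norm(h)

# A GPT-style decoder with the card's dimensions cross-attends to H
# (num_layers reduced to 2 to keep the sketch light; the card says 12):
layer = nn.TransformerDecoderLayer(
    d_model=768, nhead=8, dim_feedforward=3072, dropout=0.1, batch_first=True
)
decoder = nn.TransformerDecoder(layer, num_layers=2)

e = torch.randn(2, 768)           # batch of pooled MPNet embeddings
memory = EmbAdapterSketch()(e)    # H: (2, 8, 768)
tgt = torch.randn(2, 16, 768)     # embedded target tokens during decoding
out = decoder(tgt, memory)        # (2, 16, 768) -> project to vocab logits
```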

Decoding:

- Deterministic beam search or sampling, with optional:
  - **Constraints** (e.g., require certain tickers/hashtags/amounts based on a plan).
  - **Surrogate similarity scorer `r(x, e)`** for reranking candidates.
  - **Final MPNet cosine rerank** across top-K candidates.
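
The final cosine rerank reduces to picking the candidate whose re-embedding is closest to the target vector. A minimal sketch (the `rerank_by_cosine` helper is hypothetical, and random vectors stand in for re-encoding each candidate with `all-mpnet-base-v2`):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank_by_cosine(target_e, candidates, cand_embs):
    """Return the candidate text whose embedding best matches target_e."""
    scores = [cosine(target_e, ce) for ce in cand_embs]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy example: synthetic 768-D vectors stand in for MPNet embeddings.
rng = np.random.default_rng(0)
target = rng.normal(size=768)
cand_texts = ["candidate A", "candidate B", "candidate C"]
cand_embs = [rng.normal(size=768) for _ in cand_texts]
cand_embs[1] = target + 0.01 * rng.normal(size=768)  # near-duplicate of target
best_text, best_score = rerank_by_cosine(target, cand_texts, cand_embs)
```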

The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.

---

## Training data and provenance

- **Source**: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
- **Domain**:
  - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
- **Preparation (v2 pipeline)**:
  1. Extract raw text from the DB into JSONL.
  2. Embed each tweet with `sentence-transformers/all-mpnet-base-v2`:
     - `embedding ∈ R^768` (pooled), L2-normalized.
     - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
  3. Split into train/val/test and shard into JSONL files.
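
Steps 2–3 can be sketched as follows. `embed_stub`, `extract_plan`, and `write_shard` are illustrative stand-ins, not Aparecium APIs; the real pipeline encodes with `SentenceTransformer.encode`, and its plan may also track amounts and addresses:

```python
import json
import re
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def extract_plan(text: str) -> dict:
    """Toy plan: tickers like $ETH and #hashtags via regex."""
    return {
        "tickers": re.findall(r"\$[A-Za-z]{2,6}", text),
        "hashtags": re.findall(r"#\w+", text),
    }

def embed_stub(text: str) -> np.ndarray:
    # Stand-in for SentenceTransformer("all-mpnet-base-v2").encode(...).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=768)

def write_shard(texts, path):
    """Write one JSONL shard with {"text","embedding","plan"} records."""
    with open(path, "w") as f:
        for t in texts:
            rec = {
                "text": t,
                "embedding": l2_normalize(embed_stub(t)).tolist(),
                "plan": extract_plan(t),
            }
            f.write(json.dumps(rec) + "\n")

write_shard(["$ETH gas fell after the upgrade #DeFi"], "train_000.jsonl")
```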

No real social-media content is used; all posts are synthetic, similar in spirit to the v1 project [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

---

## Training procedure (S1 baseline regimen)

This checkpoint corresponds to **S1 supervised training only** (no SCST/RL):

- Objective: teacher-forcing cross-entropy over the crypto tweet text, given the pooled embedding.
- Optimizer: AdamW
- Typical hyperparameters (baseline run):
  - Batch size: 64
  - Max length: 96 tokens (tweets)
  - Learning rate: 3e-4 (cosine decay), warmup ~1k steps
  - Weight decay: 0.01
  - Grad clip: 1.0
  - Dropout: 0.1
- Data:
  - ~100k synthetic crypto tweets (train/val split).
  - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
- Checkpointing:
  - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross-entropy.
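
The optimizer and schedule above can be sketched as follows; `total_steps` and the `nn.Linear` stand-in for the adapter/decoder stack are assumptions for illustration:

```python
import math
import torch

base_lr, warmup_steps, total_steps = 3e-4, 1000, 50_000  # total_steps assumed

def lr_at(step: int) -> float:
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(768, 768)  # stand-in for adapter + sketcher + decoder
opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: lr_at(s) / base_lr)

# One training step; a teacher-forcing cross-entropy would replace this toy loss.
loss = model(torch.randn(4, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad clip 1.0
opt.step()
sched.step()
```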

Future work (not in this checkpoint):

- SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
- Stronger constraints and rerank policies as described in the training plan.

---

## Evaluation protocol (baseline qualitative)

This repo does **not** include a full eval harness. The S1 baseline was validated qualitatively:

- Sample 10–20 crypto sentences (held-out).
- For each:
  1. Embed text with `all-mpnet-base-v2` (pooled, normalized).
  2. Invert with Aparecium v2 S1 (beam search + rerank).
  3. Re-embed the generated text with MPNet and compute cosine with the original embedding.
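
A minimal version of this loop, with `embed_stub` and `invert_stub` as hypothetical stand-ins for the real MPNet encoder and the Aparecium decoder:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_stub(text):
    # Stand-in for pooled, normalized all-mpnet-base-v2 encoding.
    rng = np.random.default_rng(sum(map(ord, text)) % (2**32))
    v = rng.normal(size=768)
    return v / np.linalg.norm(v)

def invert_stub(e):
    # Stand-in for Aparecium v2 S1 decoding (beam search + rerank).
    return "reconstructed crypto post"

held_out = ["$BTC funding rates flipped negative", "L2 fees fell sharply this week"]
scores = []
for text in held_out:
    e = embed_stub(text)                         # 1) embed the held-out text
    recon = invert_stub(e)                       # 2) invert the embedding
    scores.append(cosine(e, embed_stub(recon)))  # 3) re-embed and compare

report = {"mean": float(np.mean(scores)), "min": float(np.min(scores))}
```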

For a v1-style, large-scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card: [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

---

## Input contract and usage

**Input** (v2, S1 baseline):

- A **single pooled MPNet embedding** (crypto tweet) of shape `(768,)`, L2-normalized.
- Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.

Do **not** pass a token-level `(seq_len, 768)` matrix; that is the contract for the v1 seq2seq model [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser), not this checkpoint.

**Usage pattern (high level, pseudocode)**:

```python
import torch
from sentence_transformers import SentenceTransformer

# 1) Pooled MPNet embedding
mpnet = SentenceTransformer(
    "sentence-transformers/all-mpnet-base-v2",
    device="cuda" if torch.cuda.is_available() else "cpu",
)
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]  # (768,)

# 2) Load the Aparecium v2 S1 checkpoint
ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")

# 3) Recreate the models from the Aparecium codebase (not included in this HF repo):
# from aparecium.aparecium.models.emb_adapter import EmbAdapter
# from aparecium.aparecium.models.decoder import RealizerDecoder
# from aparecium.aparecium.models.sketcher import Sketcher
# from aparecium.aparecium.utils.tokens import build_tokenizer
# and run the same decoding logic as in `aparecium/infer/service.py` or
# `aparecium/scripts/invert_once.py`.

# 4) Use beam search / constraints / reranking as in the training repo.
```

To actually use the model, you need the Aparecium codebase (training repo), where `EmbAdapter`, `Sketcher`, `RealizerDecoder`, the constraints, and the decoding functions are defined.

---
## Limitations and responsible use

- Outputs are *approximations* of the original text under the MPNet embedding and LM prior:
  - They aim to preserve semantic gist and domain entities,
  - They are **not exact reconstructions**.
- The model can:
  - Produce generic phrasing,
  - Over-use crypto buzzwords/hashtags,
  - Occasionally show noisy punctuation/emoji.
- Data are synthetic; domain semantics might differ from real social-media distributions.
- Do **not** use this model to attempt to reconstruct sensitive or private user content from embeddings.

---

## Reproducibility (high-level)

To reproduce or extend this checkpoint:

1. **Prepare data**:
   - Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
   - Extract raw text to `train/val/test` JSONL.
   - Embed with `all-mpnet-base-v2` (pooled 768-D) and save as JSONL with `{"text","embedding","plan"}` fields.
2. **Train S1**:
   - Use the Aparecium v2 trainer (S1 supervised) with:
     - `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
   - Train until validation cross-entropy and cosine proxy metrics plateau.
3. **Optional**:
   - Train the surrogate similarity scorer `r` for reranking.
   - Add SCST RL (S2) if you implement the safe reward/decoding policies.
4. **Evaluate**:
   - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift.

---

## License

- **Code**: MIT (per the Aparecium repositories).
- **Weights**: MIT, same as the code, unless explicitly overridden.

---

## Citation

If you use this model or the Aparecium codebase, please cite:

> Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
> SentiChain (Aparecium project)

You may also reference the v1 baseline model card: [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).