	---
	language:
	- en
	license: mit
	tags:
	- text-generation
	- transformer-decoder
	- embeddings
	- mpnet
	- crypto
	- pooled-embeddings
	- social-media
	library_name: pytorch
	pipeline_tag: text-generation
	base_model: sentence-transformers/all-mpnet-base-v2
	---

	# Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)

	## Summary

	- Task: Reconstruct natural-language crypto social-media posts from a single pooled MPNet embedding (reverse embedding).
	- Focus: Crypto domain (social-media posts / short-form content).
	- Checkpoint: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
	- Input contract: a pooled `all-mpnet-base-v2` vector of shape `(768,)`, not a token-level `(seq_len, 768)` matrix.
	- Code: this repo only hosts weights; loading & decoding are implemented in the Aparecium codebase
	(the v2 training repo and service are analogous in spirit to the v1 project
	[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser)).

	This is a pooled-embedding variant of Aparecium, distinct from the original token-level seq2seq reverser described in
	[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

	---

	## Intended use

	- Research / engineering:
	  - Study how much crypto-domain information is recoverable from a single pooled embedding.
	  - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
	- Not intended for:
	  - Reconstructing private, user-identifying, or sensitive content.
	  - Any de-anonymization of embedding corpora.

	Reconstruction quality depends heavily on:

	- The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
	- Domain match (crypto social-media posts vs. your data),
	- Decode settings (beam vs. sampling, constraints, reranking).

	---

	## Model architecture

	On the encoder side, we assume a pooled MPNet encoder:

	- Recommended: `sentence-transformers/all-mpnet-base-v2` (768‑D pooled output).

	On the decoder side, v2 uses the Aparecium components:

	- EmbAdapter:
	  - Input: pooled vector `e ∈ R^768`.
	  - Output: pseudo-sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi-scale).
	- Sketcher:
	  - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
	  - In the S1 baseline checkpoint, it is trained but only lightly used at inference.
	- RealizerDecoder:
	  - Transformer decoder (GPT-style) with:
	    - `d_model = 768`
	    - `n_layer = 12`
	    - `n_head = 8`
	    - `d_ff = 3072`
	    - Dropout ≈ 0.1
	  - Consumes `H` as cross-attention memory and generates text tokens.
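
The decoder hyperparameters listed above imply a per-head width and a rough weight budget. The following is a minimal sketch, assuming each layer holds self-attention, cross-attention over `H`, and a two-matrix feed-forward block, with biases, layer norms, and token embeddings ignored. `DecoderConfig` and its methods are hypothetical names for illustration, not classes from the Aparecium codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical container for the hyperparameters listed in the card."""

    d_model: int = 768
    n_layer: int = 12
    n_head: int = 8
    d_ff: int = 3072
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        # Attention heads split d_model evenly, so it must divide cleanly.
        if self.d_model % self.n_head != 0:
            raise ValueError("d_model must be divisible by n_head")
        return self.d_model // self.n_head

    def approx_layer_params(self) -> int:
        # Assumed layout: Q, K, V, O projections for self-attention and
        # again for cross-attention, plus two feed-forward matrices.
        attention = 2 * 4 * self.d_model * self.d_model
        feed_forward = 2 * self.d_model * self.d_ff
        return attention + feed_forward

    def approx_decoder_params(self) -> int:
        # Weight count for the layer stack only (no embeddings, no adapter).
        return self.n_layer * self.approx_layer_params()


cfg = DecoderConfig()
print(cfg.head_dim, cfg.approx_decoder_params())
```

Under these assumptions each head is 96 dimensions wide and the layer stack holds roughly 113M weights; the real checkpoint may differ.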

	Decoding:

	- Deterministic beam search or sampling, with optional:
	  - Constraints (e.g., require certain tickers/hashtags/amounts based on a plan).
	  - Surrogate similarity scorer `r(x, e)` for reranking candidates.
	  - Final MPNet cosine rerank across top-K candidates.
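
The final cosine rerank can be illustrated with a small, dependency-free sketch. It assumes the candidates are already decoded strings and that an `embed` callable (in practice, the pooled MPNet encoder) is supplied by the caller. The names `rerank` and `satisfies_plan` are hypothetical, and plan constraints are applied here as a post-hoc filter, which may differ from how the Aparecium decoder enforces them.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def satisfies_plan(text: str, required: Iterable[str]) -> bool:
    """True if every required surface form (ticker, hashtag, ...) appears."""
    lowered = text.lower()
    return all(item.lower() in lowered for item in required)


def rerank(
    target: Sequence[float],
    candidates: List[str],
    embed: Callable[[str], Sequence[float]],
    required: Iterable[str] = (),
) -> List[Tuple[float, str]]:
    """Sort candidates by cosine to `target`, preferring plan-satisfying ones."""
    required = list(required)
    kept = [c for c in candidates if satisfies_plan(c, required)]
    # Fall back to the full list if the constraint filters out everything.
    pool = kept or candidates
    scored = [(cosine(target, embed(c)), c) for c in pool]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The first element of the returned list is the candidate whose re-embedding lies closest to the target vector.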

	The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.

	---

	## Training data and provenance

	- Source: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
	- Domain:
	  - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
	- Preparation (v2 pipeline):
	  1. Extract raw text from the DB into JSONL.
	  2. Embed each post with `sentence-transformers/all-mpnet-base-v2`:
	     - `embedding ∈ R^768` (pooled), L2-normalized.
	     - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
	  3. Split into train/val/test and shard into JSONL files.
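
As a rough illustration of the preparation step, the sketch below builds one JSONL record with `text`, `embedding`, and `plan` fields. The regular expressions and the plan sub-keys (`tickers`, `hashtags`, `amounts`, `addresses`) are assumptions made for this example; the actual plan extractor in the Aparecium pipeline may use different rules. The embedding is passed in by the caller, so no encoder is needed to run it.

```python
import json
import math
import re
from typing import Dict, List, Sequence

# Illustrative patterns only; the real plan extractor may differ.
TICKER_RE = re.compile(r"\$[A-Z]{2,10}\b")
HASHTAG_RE = re.compile(r"#\w+")
AMOUNT_RE = re.compile(
    r"(?<![\w$#-])\$?\d+(?:,\d{3})*(?:\.\d+)?[kKmMbB%]?(?!\w)"
)
ADDRESS_RE = re.compile(r"\b0x[a-fA-F0-9]{40}\b")


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def extract_plan(text: str) -> Dict[str, List[str]]:
    """Collect simple surface entities that a decoder could be asked to keep."""
    return {
        "tickers": TICKER_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
        "addresses": ADDRESS_RE.findall(text),
    }


def make_record(text: str, embedding: Sequence[float]) -> str:
    """Serialize one training example as a single JSONL line."""
    record = {
        "text": text,
        "embedding": l2_normalize(embedding),
        "plan": extract_plan(text),
    }
    return json.dumps(record, ensure_ascii=False)
```

Each call to `make_record` yields one line; writing the lines to separate files per split gives the sharded train/val/test layout.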

	No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project
	[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

	---

	## Training procedure (S1 baseline regimen)

	This checkpoint corresponds to S1 supervised training only (no SCST/RL):

	- Objective: teacher-forcing cross-entropy over the post text, given the pooled embedding.
	- Optimizer: AdamW
	- Typical hyperparameters (baseline run):
	  - Batch size: 64
	  - Max length: 96 tokens (short posts)
	  - Learning rate: 3e-4 (cosine decay), warmup ~1k steps
	  - Weight decay: 0.01
	  - Grad clip: 1.0
	  - Dropout: 0.1
	- Data:
	  - ~100k synthetic crypto social-media posts (train/val split).
	  - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
	- Checkpointing:
	  - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross-entropy.
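
The learning-rate schedule named above (warmup followed by cosine decay) can be sketched as a plain function of the step index. The linear warmup shape, the `min_lr` floor, and the `total_steps` argument are assumptions for illustration; the card states only the peak rate of 3e-4 and roughly 1k warmup steps.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-4,
    warmup_steps: int = 1000,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical `total_steps=10_000`, the rate reaches 3e-4 at the end of warmup and has fallen to half of that midway through the decay phase.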

	Future work (not in this checkpoint):

	- SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
	- Stronger constraints and rerank policies as described in the training plan.
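
To make the planned S2 reward more concrete, here is a hedged sketch of how the four listed terms might be combined. Everything in it is hypothetical: the weights, the trigram-based repetition measure, and the substring-based entity coverage are illustrative choices, not the project's reward. The `cosine` and `surrogate` scores are assumed to be computed elsewhere and passed in.

```python
from collections import Counter
from typing import Iterable, Sequence


def repetition_rate(tokens: Sequence[str], n: int = 3) -> float:
    """Fraction of n-grams that repeat an earlier n-gram (0.0 = no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    repeats = sum(count - 1 for count in Counter(ngrams).values())
    return repeats / len(ngrams)


def entity_coverage(text: str, entities: Iterable[str]) -> float:
    """Share of required entities that appear in the text (1.0 if none)."""
    entities = list(entities)
    if not entities:
        return 1.0
    lowered = text.lower()
    hits = sum(1 for entity in entities if entity.lower() in lowered)
    return hits / len(entities)


def scst_reward(
    cosine: float,
    surrogate: float,
    text: str,
    entities: Iterable[str],
    w_cos: float = 1.0,   # hypothetical weights, not tuned values
    w_sur: float = 0.5,
    w_rep: float = 0.5,
    w_ent: float = 0.5,
) -> float:
    """Weighted sum: similarity terms and coverage add, repetition subtracts."""
    tokens = text.split()
    return (
        w_cos * cosine
        + w_sur * surrogate
        - w_rep * repetition_rate(tokens)
        + w_ent * entity_coverage(text, entities)
    )
```

In self-critical sequence training, the learning signal is usually the reward of a sampled output minus the reward of the greedy output for the same embedding, so a function like this would be called twice per example.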

	---

	## Evaluation protocol (baseline qualitative)

	This repo does not include a full eval harness. The S1 baseline was validated qualitatively:

	- Sample 10-20 crypto sentences (held-out).
	- For each:
	  1. Embed text with `all-mpnet-base-v2` (pooled, normalized).
	  2. Invert with Aparecium v2 S1 (beam search + rerank).
	  3. Re-embed the generated text with MPNet and compute cosine with the original embedding.
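
The three steps above amount to a round-trip check. A minimal sketch follows, assuming the caller supplies two callables: `embed` (the pooled MPNet encoder) and `invert` (the Aparecium v2 decoding routine). Both names are placeholders; neither is an API provided by this repo.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def round_trip(
    texts: Iterable[str],
    embed: Callable[[str], Vector],
    invert: Callable[[Vector], str],
) -> List[Tuple[str, str, float]]:
    """Return (original, reconstruction, cosine) for each held-out text."""
    results = []
    for text in texts:
        original_vec = embed(text)        # step 1: pooled embedding
        guess = invert(original_vec)      # step 2: embedding -> text
        score = cosine(original_vec, embed(guess))  # step 3: re-embed, compare
        results.append((text, guess, score))
    return results
```

Reading the original and the reconstruction side by side, together with the cosine, is what makes this a qualitative check rather than a benchmark.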

	For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card:
	[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).

	---

	## Input contract and usage

	Input (v2, S1 baseline):

	- A single pooled MPNet embedding of one crypto social-media post, of shape `(768,)`, L2-normalized.
	- Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.

	Do not pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model
	[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser), not this checkpoint.
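
A small guard can catch the most common contract violations before decoding. This is a sketch under stated assumptions: `check_pooled_embedding` is a hypothetical helper, and the norm tolerance of `1e-3` is an arbitrary choice rather than a value from the Aparecium codebase.

```python
import math
from typing import Sequence

EXPECTED_DIM = 768


def check_pooled_embedding(vec: Sequence[float], tol: float = 1e-3) -> None:
    """Raise ValueError unless `vec` is a flat, unit-norm 768-D vector."""
    # A token-level matrix shows up as a sequence whose items are sequences.
    if len(vec) > 0 and hasattr(vec[0], "__len__"):
        raise ValueError(
            "expected a pooled (768,) vector, got a token-level matrix"
        )
    if len(vec) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} dimensions, got {len(vec)}")
    norm = math.sqrt(sum(float(x) * float(x) for x in vec))
    if abs(norm - 1.0) > tol:
        raise ValueError(f"expected an L2-normalized vector, got norm {norm:.4f}")
```

Calling it on the output of `encode(..., normalize_embeddings=True)[0]` should pass silently, while a `(seq_len, 768)` matrix or an unnormalized vector raises.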

	Usage pattern (high level, pseudocode):

	```python
	import torch
	from sentence_transformers import SentenceTransformer

	# 1) Pooled MPNet embedding
	mpnet = SentenceTransformer(
	    "sentence-transformers/all-mpnet-base-v2",
	    device="cuda" if torch.cuda.is_available() else "cpu",
	)
	text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
	e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0] # (768,)

	# 2) Load Aparecium v2 S1 checkpoint
	ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")

	# 3) Recreate models from the Aparecium codebase (not included in this HF repo)
	# from aparecium.aparecium.models.emb_adapter import EmbAdapter
	# from aparecium.aparecium.models.decoder import RealizerDecoder
	# from aparecium.aparecium.models.sketcher import Sketcher
	# from aparecium.aparecium.utils.tokens import build_tokenizer
	# and run the same decoding logic as in `aparecium/infer/service.py` or
	# `aparecium/scripts/invert_once.py`.

	# 4) Use beam search / constraints / reranking as in the training repo.
	```

	To actually use the model, you need the Aparecium codebase (training repo) where the `EmbAdapter`, `Sketcher`, `RealizerDecoder`, constraints, and decoding functions are defined.

	---

	## Limitations and responsible use

	- Outputs are approximations of the original text under the MPNet embedding and LM prior:
	  - They aim to preserve semantic gist and domain entities,
	  - They are not exact reconstructions.
	- The model can:
	  - Produce generic phrasing,
	  - Over-use crypto buzzwords/hashtags,
	  - Occasionally show noisy punctuation/emoji.
	- Data are synthetic; domain semantics might differ from real social-media distributions.
	- Do not use this model to attempt to reconstruct sensitive or private user content from embeddings.

	---

	## Reproducibility (high‑level)

	To reproduce or extend this checkpoint:

	1. Prepare data:
	   - Generate synthetic crypto social-media posts (or text from your own domain) into a DB (e.g., SQLite).
	   - Extract raw text to `train/val/test` JSONL.
	   - Embed with `all-mpnet-base-v2` (pooled 768-D) and save as JSONL with `{"text","embedding","plan"}` fields.
	2. Train S1:
	   - Use the Aparecium v2 trainer (S1 supervised) with:
	     - `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
	   - Train until validation cross-entropy and cosine proxy metrics plateau.
	3. Optional:
	   - Train surrogate similarity scorer `r` for reranking.
	   - Add SCST RL (S2) if you implement the safe reward/decoding policies.
	4. Evaluate:
	   - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift.
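
For the evaluation step, a harness could start from a summary function like the one sketched below. The degeneracy heuristic (share of repeated whitespace tokens above a threshold, or empty output) is an assumption made for this example and may not match the definition used in the v1 project; domain drift is left out because the card does not define it.

```python
import statistics
from typing import Dict, Sequence


def is_degenerate(text: str, max_repeat_ratio: float = 0.5) -> bool:
    """Flag empty outputs or outputs dominated by repeated tokens."""
    tokens = text.split()
    if not tokens:
        return True
    repeat_ratio = 1.0 - len(set(tokens)) / len(tokens)
    return repeat_ratio > max_repeat_ratio


def summarize(cosines: Sequence[float], outputs: Sequence[str]) -> Dict[str, float]:
    """Aggregate per-example cosines and the share of degenerate outputs."""
    if not cosines or len(cosines) != len(outputs):
        raise ValueError("need one cosine per output, and at least one example")
    return {
        "cosine_mean": statistics.fmean(cosines),
        "cosine_median": statistics.median(cosines),
        "cosine_min": min(cosines),
        "degeneracy_rate": sum(is_degenerate(t) for t in outputs) / len(outputs),
    }
```

The per-example cosines would come from re-embedding each reconstruction and comparing it with the original embedding, as in the qualitative protocol.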

	---

	## License

	- Code: MIT (per Aparecium repositories).
	- Weights: MIT, same as the code, unless explicitly overridden.

	---

	## Citation

	If you use this model or the Aparecium codebase, please cite:

	> Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
	> SentiChain (Aparecium project)

	You may also reference the v1 baseline model card:
	[`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).