ruitao-edward-chen committed on
Commit 3a12737 · 1 Parent(s): 4bab501

Clarify wording: social-media posts instead of tweets

Files changed (1):
  1. README.md +233 -3
README.md CHANGED
@@ -1,4 +1,234 @@
- # Aparecium v2 – Pooled MPNet Reverser (S1 baseline)
-
- S1 supervised baseline for pooled all-mpnet-base-v2 embeddings.
- Requires the Aparecium codebase to load and run (see your training repo).
 
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - text-generation
+ - transformer-decoder
+ - embeddings
+ - mpnet
+ - crypto
+ - pooled-embeddings
+ - social-media
+ library_name: pytorch
+ pipeline_tag: text-generation
+ base_model: sentence-transformers/all-mpnet-base-v2
+ ---
+
+ # Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)
+
+ ## Summary
+
+ - **Task**: Reconstruct natural-language crypto social-media posts from a **single pooled MPNet embedding** (reverse embedding).
+ - **Focus**: Crypto domain (social-media posts / short-form content).
+ - **Checkpoint**: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
+ - **Input contract**: a **pooled** `all-mpnet-base-v2` vector of shape `(768,)`, *not* a token-level `(seq_len, 768)` matrix.
+ - **Code**: this repo only hosts weights; loading & decoding are implemented in the Aparecium codebase
+   (the v2 training repo and service are analogous in spirit to the v1 project
+   [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser)).
+
+ This is a **pooled-embedding variant** of Aparecium, distinct from the original token-level seq2seq reverser described in
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
+
+ ---
+
+ ## Intended use
+
+ - **Research / engineering**:
+   - Study how much crypto-domain information is recoverable from a single pooled embedding.
+   - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
+ - **Not intended** for:
+   - Reconstructing private, user-identifying, or sensitive content.
+   - Any de-anonymization of embedding corpora.
+
+ Reconstruction quality depends heavily on:
+
+ - The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
+ - Domain match (crypto social-media posts vs. your data),
+ - Decode settings (beam vs. sampling, constraints, reranking).
+
+ ---
+
+ ## Model architecture
+
+ On the encoder side, we assume a **pooled MPNet** encoder:
+
+ - Recommended: `sentence-transformers/all-mpnet-base-v2` (768-D pooled output).
+
+ On the decoder side, v2 uses the Aparecium components:
+
+ - **EmbAdapter**:
+   - Input: pooled vector `e ∈ R^768`.
+   - Output: pseudo-sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi-scale).
+ - **Sketcher**:
+   - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
+   - In the S1 baseline checkpoint, it is trained but only lightly used at inference.
+ - **RealizerDecoder**:
+   - Transformer decoder (GPT-style) with:
+     - `d_model = 768`
+     - `n_layer = 12`
+     - `n_head = 8`
+     - `d_ff = 3072`
+     - Dropout ≈ 0.1
+   - Consumes `H` as cross-attention memory and generates text tokens.
+
75
+ Decoding:
76
+
77
+ - Deterministic beam search or sampling, with optional:
78
+ - **Constraints** (e.g., require certain tickers/hashtags/amounts based on a plan).
79
+ - **Surrogate similarity scorer `r(x, e)`** for reranking candidates.
80
+ - **Final MPNet cosine rerank** across top‑K candidates.
81
+
82
+ The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.
83
+
84
+ ---
85
+
86
+ ## Training data and provenance
87
+
88
+ - **Source**: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
89
+ - **Domain**:
90
+ - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
91
+ - **Preparation (v2 pipeline)**:
92
+ 1. Extract raw text from the DB into JSONL.
93
+ 2. Embed each tweet with `sentence-transformers/all-mpnet-base-v2`:
94
+ - `embedding ∈ R^768` (pooled), L2‑normalized.
95
+ - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
96
+ 3. Split into train/val/test and shard into JSONL files.
97
+
98
+ No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project
99
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
100
+
101
+ ---
102
+
103
+ ## Training procedure (S1 baseline regimen)
104
+
105
+ This checkpoint corresponds to **S1 supervised training only** (no SCST/RL):
106
+
107
+ - Objective: teacher‑forcing cross‑entropy over the crypto tweet text, given the pooled embedding.
108
+ - Optimizer: AdamW
109
+ - Typical hyperparameters (baseline run):
110
+ - Batch size: 64
111
+ - Max length: 96 tokens (tweets)
112
+ - Learning rate: 3e‑4 (cosine decay), warmup ~1k steps
113
+ - Weight decay: 0.01
114
+ - Grad clip: 1.0
115
+ - Dropout: 0.1
116
+ - Data:
117
+ - ~100k synthetic crypto tweets (train/val split).
118
+ - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
119
+ - Checkpointing:
120
+ - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross‑entropy.
121
+
122
+ Future work (not in this checkpoint):
123
+
124
+ - SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
125
+ - Stronger constraints and rerank policies as described in the training plan.
126
+
127
+ ---
128
+
129
+ ## Evaluation protocol (baseline qualitative)
130
+
131
+ This repo does **not** include a full eval harness. The S1 baseline was validated qualitatively:
132
+
133
+ - Sample 10–20 crypto sentences (held‑out).
134
+ - For each:
135
+ 1. Embed text with `all-mpnet-base-v2` (pooled, normalized).
136
+ 2. Invert with Aparecium v2 S1 (beam search + rerank).
137
+ 3. Re‑embed the generated text with MPNet and compute cosine with the original embedding.
138
+
139
+ For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card:
140
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
141
+
142
+ ---
143
+
144
+ ## Input contract and usage
145
+
146
+ **Input** (v2, S1 baseline):
147
+
148
+ - A **single pooled MPNet embedding** (crypto tweet) of shape `(768,)`, L2‑normalized.
149
+ - Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.
150
+
151
+ Do **not** pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model
152
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser), not this checkpoint.
153
+
154
+ **Usage pattern (high level, pseudocode)**:
155
+
156
+ ```python
157
+ import torch, json
158
+ from sentence_transformers import SentenceTransformer
159
+
160
+ # 1) Pooled MPNet embedding
161
+ mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
162
+ device="cuda" if torch.cuda.is_available() else "cpu")
163
+ text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
164
+ e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0] # (768,)
165
+
166
+ # 2) Load Aparecium v2 S1 checkpoint
167
+ ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")
168
+
169
+ # 3) Recreate models from the Aparecium codebase (not included in this HF repo)
170
+ # from aparecium.aparecium.models.emb_adapter import EmbAdapter
171
+ # from aparecium.aparecium.models.decoder import RealizerDecoder
172
+ # from aparecium.aparecium.models.sketcher import Sketcher
173
+ # from aparecium.aparecium.utils.tokens import build_tokenizer
174
+ # and run the same decoding logic as in `aparecium/infer/service.py` or
175
+ # `aparecium/scripts/invert_once.py`.
176
+
177
+ # 4) Use beam search / constraints / reranking as in the training repo.
178
+ ```
179
+
180
+ To actually use the model, you need the Aparecium codebase (training repo) where the `EmbAdapter`, `Sketcher`, `RealizerDecoder`, constraints, and decoding functions are defined.
181
+
182
+ ---
183
+
184
+ ## Limitations and responsible use
185
+
186
+ - Outputs are *approximations* of the original text under the MPNet embedding and LM prior:
187
+ - They aim to preserve semantic gist and domain entities,
188
+ - They are **not exact reconstructions**.
189
+ - The model can:
190
+ - Produce generic phrasing,
191
+ - Over‑use crypto buzzwords/hashtags,
192
+ - Occasionally show noisy punctuation/emoji.
193
+ - Data are synthetic; domain semantics might differ from real social‑media distributions.
194
+ - Do **not** use this model to attempt to reconstruct sensitive or private user content from embeddings.
195
+
196
+ ---
197
+
198
+ ## Reproducibility (high‑level)
199
+
200
+ To reproduce or extend this checkpoint:
201
+
202
+ 1. **Prepare data**:
203
+ - Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
204
+ - Extract raw text to `train/val/test` JSONL.
205
+ - Embed with `all-mpnet-base-v2` (pooled 768‑D) and save as JSONL with `{"text","embedding","plan"}` fields.
206
+ 2. **Train S1**:
207
+ - Use the Aparecium v2 trainer (S1 supervised) with:
208
+ - `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
209
+ - Train until validation cross‑entropy and cosine proxy metrics plateau.
210
+ 3. **Optional**:
211
+ - Train surrogate similarity scorer `r` for reranking.
212
+ - Add SCST RL (S2) if you implement the safe reward/decoding policies.
213
+ 4. **Evaluate**:
214
+ - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift.
215
+
216
+ ---
217
+
218
+ ## License
219
+
220
+ - **Code**: MIT (per Aparecium repositories).
221
+ - **Weights**: MIT, same as the code, unless explicitly overridden.
222
+
223
+ ---
224
+
225
+ ## Citation
226
+
227
+ If you use this model or the Aparecium codebase, please cite:
228
+
229
+ > Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
230
+ > SentiChain (Aparecium project)
231
+
232
+ You may also reference the v1 baseline model card:
233
+ [`SentiChain/aparecium-seq2seq-reverser`](https://huggingface.co/SentiChain/aparecium-seq2seq-reverser).
234