---
language: en
license: mit
library_name: pytorch
tags:
- transformer-decoder
- seq2seq
- embeddings
- mpnet
- text-reconstruction
- crypto
pipeline_tag: text-generation
---
|
|
|
|
|
### Aparecium Baseline Model Card |
|
|
|
|
|
#### Summary |
|
|
- **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding). |
|
|
- **Focus**: Crypto domain, with equities as auxiliary domain. |
|
|
- **Checkpoint**: Baseline model trained with a phased schedule and early stopping. |
|
|
- **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used. |
|
|
- **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector. |
|
|
|
|
|
--- |
|
|
|
|
|
### Intended use |
|
|
- Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability. |
|
|
- Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match. |
|
|
|
|
|
--- |
|
|
|
|
|
### Model architecture |
|
|
- Encoder side: external; an MPNet‑family encoder (default: `sentence-transformers/all-mpnet-base-v2`) is assumed to produce the token‑level embeddings.
|
|
- Decoder: Transformer decoder consuming the MPNet memory: |
|
|
- d_model: 768 |
|
|
- Decoder layers: 2 |
|
|
- Attention heads: 8 |
|
|
- FFN dim: 2048 |
|
|
- Token and positional embeddings; GELU activations |
|
|
- Decoding: |
|
|
- Supports greedy, sampling, and beam search. |
|
|
- Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target). |
|
|
- Optional lightweight constraints for hashtag/cashtag/URL continuity. |
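The decoder configuration above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the class name, wiring, and the vocabulary size (30527, MPNet's) are assumptions.

```python
import torch
import torch.nn as nn

class ApareciumDecoderSketch(nn.Module):
    """Hypothetical 2-layer Transformer decoder over MPNet token memory."""

    def __init__(self, vocab_size=30527, d_model=768, n_layers=2,
                 n_heads=8, ffn_dim=2048, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim,
            activation="gelu", batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_ids, memory):
        # memory: (batch, src_len, 768) token-level MPNet embeddings
        seq_len = tgt_ids.size(1)
        pos = torch.arange(seq_len, device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(pos)
        # Causal mask so each position only attends to earlier targets.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.lm_head(h)

model = ApareciumDecoderSketch()
memory = torch.randn(2, 40, 768)        # two posts, 40 MPNet tokens each
tgt = torch.randint(0, 30527, (2, 16))  # teacher-forced target prefixes
logits = model(tgt, memory)
print(logits.shape)  # torch.Size([2, 16, 30527])
```

Note that the decoder cross‑attends to the full token‑level memory, which is why a pooled 768‑dim vector does not satisfy the input contract.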
|
|
|
|
|
Recommended inference defaults: |
|
|
- `num_beams=8` |
|
|
- `length_penalty_alpha=0.6` |
|
|
- `lambda_sim=0.6` |
|
|
- `rescore_every_k=4`, `rescore_top_m=8` |
|
|
- `beta=10.0` |
|
|
- `enable_constraints=True` |
|
|
- `deterministic=True` |
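One plausible reading of how these settings combine is a GNMT‑style length penalty (with `length_penalty_alpha=0.6`) plus a `lambda_sim`-weighted cosine bonus from the embedding‑aware rescoring. The exact formula in the released code may differ; this is an assumed sketch.

```python
import math

def length_penalty(length, alpha=0.6):
    # GNMT-style penalty: ((5 + len) / 6) ** alpha (assumed form).
    return ((5.0 + length) / 6.0) ** alpha

def rescored(log_prob_sum, length, cosine_to_target, lambda_sim=0.6):
    # Higher is better: length-normalized LM score plus a similarity bonus
    # from re-embedding the candidate and comparing to the pooled target.
    return log_prob_sum / length_penalty(length) + lambda_sim * cosine_to_target

# A candidate that aligns well with the target embedding gets a boost:
print(rescored(-10.0, 10, cosine_to_target=0.9))
print(rescored(-10.0, 10, cosine_to_target=0.0))
```

Under this reading, `rescore_every_k=4` and `rescore_top_m=8` would limit how often and over how many candidates the (relatively expensive) re‑embedding step runs.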
|
|
|
|
|
--- |
|
|
|
|
|
### Training data and provenance |
|
|
- 1,000,000 synthetic posts total: |
|
|
- 500,000 crypto‑domain posts |
|
|
- 500,000 equities‑domain posts |
|
|
- All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used. |
|
|
- Embeddings: |
|
|
- Token‑level MPNet (default: `sentence-transformers/all-mpnet-base-v2`). |
|
|
- Cached to SQLite to avoid recomputation and allow resumable training. |
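A minimal sketch of such a cache is shown below, assuming each post's token‑level matrix is serialized to a float32 blob keyed by post id. The real schema and serialization may differ.

```python
import sqlite3
import numpy as np

# In-memory DB for illustration; the real cache would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS embeddings (
    post_id TEXT PRIMARY KEY,
    seq_len INTEGER,
    matrix  BLOB)""")

def put(post_id, mat):
    # Store the (seq_len, 768) matrix as raw float32 bytes.
    conn.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
                 (post_id, mat.shape[0], mat.astype(np.float32).tobytes()))

def get(post_id):
    row = conn.execute(
        "SELECT seq_len, matrix FROM embeddings WHERE post_id = ?",
        (post_id,)).fetchone()
    if row is None:
        return None  # cache miss: compute the embedding, then put()
    seq_len, blob = row
    return np.frombuffer(blob, dtype=np.float32).reshape(seq_len, 768)

put("post-0", np.random.randn(37, 768))
print(get("post-0").shape)  # (37, 768)
```

Because each post is keyed independently, a restarted training run can skip already‑cached posts, which is what makes the pipeline resumable.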
|
|
|
|
|
--- |
|
|
|
|
|
### Training procedure (baseline regimen) |
|
|
- Domain emphasis: 80% crypto / 20% equities per training phase. |
|
|
- Phased training (10% of available chunks per phase), evaluate after each phase: |
|
|
- In‑sample: small subset from the phase’s chunks |
|
|
- Out‑of‑sample: small hold‑out from both domains (not seen in the phase) |
|
|
- Early‑stop condition: stop if out‑of‑sample cosine degrades relative to prior phase. |
|
|
- Optimizer: AdamW |
|
|
- Learning rate (baseline finetune): 5e‑5 |
|
|
- Batch size: 16 |
|
|
- Input `max_source_length`: 256 |
|
|
- Target `max_target_length`: 128 |
|
|
- Checkpointing: every 2,000 steps and at phase end. |
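The phased schedule with early stopping can be sketched as follows. `train_phase` and `eval_oos_cosine` are hypothetical stand‑ins for the real training and evaluation code; only the control flow is illustrated.

```python
def run_phases(chunks, train_phase, eval_oos_cosine, max_phases=10):
    """Train on 10% of chunks per phase; stop when OOS cosine degrades."""
    phase_size = max(1, len(chunks) // 10)
    best_cos, kept_checkpoint = float("-inf"), None
    for phase in range(max_phases):
        phase_chunks = chunks[phase * phase_size:(phase + 1) * phase_size]
        if not phase_chunks:
            break
        ckpt = train_phase(phase_chunks)  # 80:20 crypto:equities inside
        cos = eval_oos_cosine(ckpt)       # held-out samples, both domains
        if cos < best_cos:
            break                         # OOS cosine degraded: early stop
        best_cos, kept_checkpoint = cos, ckpt
    return kept_checkpoint, best_cos

# Toy run: OOS cosine improves for three phases, then degrades, so the
# third phase's checkpoint is kept.
scores = iter([0.60, 0.68, 0.71, 0.69])
ckpt, cos = run_phases(list(range(100)),
                       train_phase=lambda c: f"ckpt-{c[0]}",
                       eval_oos_cosine=lambda _: next(scores))
print(ckpt, cos)  # ckpt-20 0.71
```

This matches the note below that the released checkpoint is the latest non‑degrading phase.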
|
|
|
|
|
Notes |
|
|
- Training used early stopping based on out‑of‑sample cosine. |
|
|
|
|
|
--- |
|
|
|
|
|
### Evaluation protocol (for the metrics below) |
|
|
- Sample size: 1,000 examples per domain drawn from cached embedding databases. |
|
|
- Decode config: `num_beams=8`, `length_penalty_alpha=0.6`, `lambda_sim=0.6`, `rescore_every_k=4`, `rescore_top_m=8`, `beta=10.0`, `enable_constraints=True`, `deterministic=True`. |
|
|
- Metrics: |
|
|
- `cosine_mean/median/p10/p90`: cosine between pooled MPNet embedding of generated text and the pooled MPNet target vector (higher is better). |
|
|
- `score_norm_mean`: length‑penalized language model score (higher is better; negative values are expected for log‑probability scores).
|
|
- `degenerate_pct`: % of clearly degenerate generations (very short/blank/only hashtags). |
|
|
- `domain_drift_pct`: share of crypto outputs containing equity‑like terms (and, for equities, the share containing crypto‑like terms). This is a heuristic text filter, intended only as a rough indicator.
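The two simplest of these metrics can be sketched directly. `pooled` mean‑pools a token‑level matrix, and the degeneracy check is an assumed heuristic (very short, or hashtags/cashtags only), not the exact filter used.

```python
import numpy as np

def pooled(token_matrix):
    # Mean-pool a (seq_len, 768) matrix to a single 768-dim vector.
    return token_matrix.mean(axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_degenerate(text, min_words=3):
    # Assumed heuristic: too few words once hashtags/cashtags are removed.
    words = [w for w in text.split() if not w.startswith(("#", "$"))]
    return len(words) < min_words

target = pooled(np.random.randn(40, 768))
print(cosine(target, target))      # ~1.0 for identical vectors
print(is_degenerate("#BTC #ETH"))  # True: hashtags only
```

`cosine_mean/median/p10/p90` are then simple summary statistics of this cosine over the 1,000 samples per domain.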
|
|
|
|
|
Results (current `models/baseline` checkpoint) |
|
|
- Crypto (n=1000) |
|
|
- cosine_mean: 0.681 |
|
|
- cosine_median: 0.843 |
|
|
- cosine_p10: 0.000 |
|
|
- cosine_p90: 0.984 |
|
|
- score_norm_mean: −1.977 |
|
|
- degenerate_pct: 5.2% |
|
|
- domain_drift_pct: 0.0% |
|
|
- Equities (n=1000) |
|
|
- cosine_mean: 0.778 |
|
|
- cosine_median: 0.901 |
|
|
- cosine_p10: 0.326 |
|
|
- cosine_p90: 0.986 |
|
|
- score_norm_mean: −1.344 |
|
|
- degenerate_pct: 2.2% |
|
|
- domain_drift_pct: 4.4% |
|
|
|
|
|
Interpretation |
|
|
- The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains). |
|
|
- Equities shows higher average/median cosine and lower degeneracy than crypto, despite its auxiliary (20%) training share; this likely reflects differences in data characteristics between the two domains.
|
|
- A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%). |
|
|
- Domain drift is minimal from crypto→equities (0.0%) and present at a modest rate from equities→crypto (~4.4%) under the chosen heuristic. |
|
|
|
|
|
--- |
|
|
|
|
|
### Input contract and usage |
|
|
- **Input**: MPNet token‑level matrix `(seq_len, 768)` for a single post. Do not pass a pooled vector.
|
|
- **Tokenizer/model alignment** matters: use the same MPNet tokenizer/model version that produced the embeddings. |
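A small guard for the input contract is sketched below. With `sentence-transformers`, a token‑level matrix can be obtained via `model.encode(text, output_value="token_embeddings")`; verify that parameter against your installed version. The validator itself is a hypothetical helper.

```python
import numpy as np

def check_input(embedding):
    """Reject pooled vectors; accept a (seq_len, 768) token-level matrix."""
    arr = np.asarray(embedding)
    if arr.ndim != 2 or arr.shape[1] != 768:
        raise ValueError(
            f"expected a (seq_len, 768) token-level matrix, got {arr.shape}")
    return arr

check_input(np.zeros((42, 768)))  # OK: token-level matrix
try:
    check_input(np.zeros(768))    # pooled vector -> rejected
except ValueError as e:
    print(e)
```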
|
|
|
|
|
--- |
|
|
|
|
|
### Limitations and responsible use |
|
|
- Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring. |
|
|
- The model can produce generic or incomplete outputs (see `degenerate_pct`). |
|
|
- Domain drift can occur depending on decode settings (see `domain_drift_pct`). |
|
|
- Data are synthetic programmatic generations, not real social‑media posts. Domain semantics may differ from real‑world distributions. |
|
|
- Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool. |
|
|
|
|
|
--- |
|
|
|
|
|
### Reproducibility (high‑level) |
|
|
- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths. |
|
|
- Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation. |
|
|
- Evaluation: 1,000 samples/domain with the decode settings shown above. |
|
|
- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping. |
|
|
|
|
|
--- |
|
|
|
|
|
### License |
|
|
- Code: MIT (per repository). |
|
|
- Model weights: same as code unless declared otherwise upon release. |
|
|
|
|
|
--- |
|
|
|
|
|
### Citation |
|
|
If you use this model or codebase, please cite the Aparecium project and this baseline report. |
|
|
|
|
|
|
|
|
|