---
language: en
license: mit
library_name: pytorch
tags:
- transformer-decoder
- seq2seq
- embeddings
- mpnet
- text-reconstruction
- crypto
pipeline_tag: text-generation
---
### Aparecium Baseline Model Card
#### Summary
- **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
- **Focus**: Crypto domain, with equities as auxiliary domain.
- **Checkpoint**: Baseline model trained with a phased schedule and early stopping.
- **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via the OpenAI API. No real social‑media content was used.
- **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.
---
### Intended use
- Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability.
- Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match.
---
### Model architecture
- Encoder side: external. An MPNet‑family encoder (default: `sentence-transformers/all-mpnet-base-v2`) is assumed to produce the token‑level embeddings.
- Decoder: Transformer decoder consuming the MPNet memory:
- d_model: 768
- Decoder layers: 2
- Attention heads: 8
- FFN dim: 2048
- Token and positional embeddings; GELU activations
- Decoding:
- Supports greedy, sampling, and beam search.
- Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target).
- Optional lightweight constraints for hashtag/cashtag/URL continuity.
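As a rough sketch, the decoder described above corresponds to a small `nn.TransformerDecoder` conditioned on the MPNet memory; the class and parameter names here are illustrative, not the project's actual code (the vocabulary size in particular is an assumption):

```python
import torch
import torch.nn as nn

class BaselineDecoder(nn.Module):
    """Illustrative sketch of the 2-layer decoder described above."""

    def __init__(self, vocab_size: int = 30527, d_model: int = 768,
                 n_layers: int = 2, n_heads: int = 8, ffn_dim: int = 2048,
                 max_len: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim,
            activation="gelu", batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_ids: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # memory: token-level MPNet embeddings, shape (batch, seq_len, 768)
        tgt_len = tgt_ids.size(1)
        pos = torch.arange(tgt_len, device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(pos)
        # additive causal mask so each position only attends to its prefix
        causal = torch.triu(
            torch.full((tgt_len, tgt_len), float("-inf"), device=tgt_ids.device),
            diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)  # (batch, tgt_len, vocab_size)
```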
**Recommended inference defaults:**
- `num_beams=8`
- `length_penalty_alpha=0.6`
- `lambda_sim=0.6`
- `rescore_every_k=4`, `rescore_top_m=8`
- `beta=10.0`
- `enable_constraints=True`
- `deterministic=True`
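For convenience, the defaults above can be collected into a single config; the keyword names mirror the list, but the actual Aparecium decode API may differ:

```python
# Hypothetical keyword arguments mirroring the recommended defaults above.
DECODE_DEFAULTS = dict(
    num_beams=8,
    length_penalty_alpha=0.6,
    lambda_sim=0.6,           # weight of embedding-similarity rescoring
    rescore_every_k=4,        # rescore candidates every k decode steps
    rescore_top_m=8,          # re-embed only the top-m beam candidates
    beta=10.0,                # sharpness of the similarity term
    enable_constraints=True,  # hashtag/cashtag/URL continuity constraints
    deterministic=True,       # disable sampling for reproducible output
)
```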
---
### Training data and provenance
- 1,000,000 synthetic posts total:
- 500,000 crypto‑domain posts
- 500,000 equities‑domain posts
- All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used.
- Embeddings:
- Token‑level MPNet (default: `sentence-transformers/all-mpnet-base-v2`).
- Cached to SQLite to avoid recomputation and allow resumable training.
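A minimal sketch of such a resumable SQLite cache; the table schema and helper names are assumptions, not the project's actual layout:

```python
import sqlite3
import numpy as np

def open_cache(path: str) -> sqlite3.Connection:
    # One row per post; token-level matrices stored as raw float32 blobs.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "post_id TEXT PRIMARY KEY, seq_len INTEGER, data BLOB)")
    return conn

def put(conn: sqlite3.Connection, post_id: str, emb: np.ndarray) -> None:
    # emb: token-level MPNet matrix of shape (seq_len, 768)
    conn.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
                 (post_id, emb.shape[0], emb.astype(np.float32).tobytes()))
    conn.commit()

def get(conn: sqlite3.Connection, post_id: str):
    row = conn.execute(
        "SELECT seq_len, data FROM embeddings WHERE post_id = ?",
        (post_id,)).fetchone()
    if row is None:
        return None  # not cached yet: compute, then put()
    seq_len, blob = row
    return np.frombuffer(blob, dtype=np.float32).reshape(seq_len, 768)
```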
---
### Training procedure (baseline regimen)
- Domain emphasis: 80% crypto / 20% equities per training phase.
- Phased training (10% of available chunks per phase), evaluate after each phase:
- In‑sample: small subset from the phase’s chunks
- Out‑of‑sample: small hold‑out from both domains (not seen in the phase)
- Early‑stop condition: stop if out‑of‑sample cosine degrades relative to prior phase.
- Optimizer: AdamW
- Learning rate (baseline finetune): 5e‑5
- Batch size: 16
- Input `max_source_length`: 256
- Target `max_target_length`: 128
- Checkpointing: every 2,000 steps and at phase end.
**Notes**
- Training used early stopping based on out‑of‑sample cosine.
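The phase loop above can be sketched schematically; `train_one_phase` and `evaluate_oos` are placeholders for the project's real training and evaluation routines, and the sizing arithmetic is illustrative:

```python
import random

def phased_train(crypto_chunks, equity_chunks, train_one_phase, evaluate_oos,
                 n_phases=10, phase_frac=0.10, crypto_weight=0.80, seed=0):
    """Run training phases, stopping when out-of-sample cosine degrades."""
    rng = random.Random(seed)
    phase_size = max(2, int(phase_frac * (len(crypto_chunks) + len(equity_chunks))))
    n_crypto = max(1, int(crypto_weight * phase_size))  # 80% crypto emphasis
    n_equity = max(1, phase_size - n_crypto)            # 20% equities
    prev_cosine = float("-inf")
    phases_run = 0
    for _ in range(n_phases):
        chunks = (rng.sample(crypto_chunks, min(n_crypto, len(crypto_chunks)))
                  + rng.sample(equity_chunks, min(n_equity, len(equity_chunks))))
        train_one_phase(chunks)
        cosine = evaluate_oos()        # hold-out cosine over both domains
        if cosine < prev_cosine:       # out-of-sample cosine degraded: stop
            break
        prev_cosine = cosine
        phases_run += 1
    return phases_run
```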
---
### Evaluation protocol (for the metrics below)
- Sample size: 1,000 examples per domain drawn from cached embedding databases.
- Decode config: `num_beams=8`, `length_penalty_alpha=0.6`, `lambda_sim=0.6`, `rescore_every_k=4`, `rescore_top_m=8`, `beta=10.0`, `enable_constraints=True`, `deterministic=True`.
- Metrics:
- `cosine_mean/median/p10/p90`: cosine between pooled MPNet embedding of generated text and the pooled MPNet target vector (higher is better).
- `score_norm_mean`: length‑penalized language model score (more positive is better; negative values are common for log‑scores).
- `degenerate_pct`: % of clearly degenerate generations (very short/blank/only hashtags).
- `domain_drift_pct`: % of equity‑like terms in crypto outputs (or crypto‑like terms in equities outputs). Heuristic text filter; intended as a rough indicator only.
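The pooled-cosine metric can be computed as follows; this sketch assumes mean pooling over the token axis (the default pooling for the MPNet sentence-embedding model named above):

```python
import numpy as np

def pooled_cosine(gen_tokens: np.ndarray, tgt_tokens: np.ndarray) -> float:
    # Each input: token-level MPNet matrix of shape (seq_len, 768).
    g = gen_tokens.mean(axis=0)   # pooled vector of generated text
    t = tgt_tokens.mean(axis=0)   # pooled target vector
    denom = np.linalg.norm(g) * np.linalg.norm(t)
    return float(g @ t / denom) if denom > 0 else 0.0

def summarize(cosines) -> dict:
    # Aggregate per-example cosines into the statistics reported below.
    c = np.asarray(cosines, dtype=np.float64)
    return {"cosine_mean": float(c.mean()),
            "cosine_median": float(np.median(c)),
            "cosine_p10": float(np.percentile(c, 10)),
            "cosine_p90": float(np.percentile(c, 90))}
```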
**Results** (current `models/baseline` checkpoint)
- Crypto (n=1000)
- cosine_mean: 0.681
- cosine_median: 0.843
- cosine_p10: 0.000
- cosine_p90: 0.984
- score_norm_mean: −1.977
- degenerate_pct: 5.2%
- domain_drift_pct: 0.0%
- Equities (n=1000)
- cosine_mean: 0.778
- cosine_median: 0.901
- cosine_p10: 0.326
- cosine_p90: 0.986
- score_norm_mean: −1.344
- degenerate_pct: 2.2%
- domain_drift_pct: 4.4%
**Interpretation**
- The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains).
- Equities shows higher average/median cosine and lower degeneracy than crypto, consistent with the auxiliary‑domain role and data characteristics.
- A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%).
- Under the chosen heuristic, drift of equity‑like terms into crypto outputs is absent (0.0%), while crypto‑like terms appear in equities outputs at a modest rate (~4.4%).
---
### Input contract and usage
- **Input**: MPNet token‑level matrix `(seq_len, 768)` for a single post. Do not pass a pooled vector.
- **Tokenizer/model alignment** matters: use the same MPNet tokenizer/model version that produced the embeddings.
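A minimal example of producing the expected input with Hugging Face `transformers`, assuming the default encoder listed above (the sample text is illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Use the same MPNet tokenizer/model version that produced the cached
# embeddings; mean pooling would discard the token axis the decoder needs.
name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

text = "BTC is breaking out above resistance #crypto"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state[0]  # (seq_len, 768)
# Pass `token_embeddings` (not a pooled vector) to the decoder as memory.
```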
---
### Limitations and responsible use
- Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring.
- The model can produce generic or incomplete outputs (see `degenerate_pct`).
- Domain drift can occur depending on decode settings (see `domain_drift_pct`).
- Data are synthetic programmatic generations, not real social‑media posts. Domain semantics may differ from real‑world distributions.
- Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool.
---
### Reproducibility (high‑level)
- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
- Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
- Evaluation: 1,000 samples/domain with the decode settings shown above.
- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.
---
### License
- Code: MIT (per repository).
- Model weights: same as code unless declared otherwise upon release.
---
### Citation
If you use this model or codebase, please cite the Aparecium project and this baseline report.