---
language: en
license: mit
library_name: pytorch
tags:
  - transformer-decoder
  - seq2seq
  - embeddings
  - mpnet
  - text-reconstruction
  - crypto
pipeline_tag: text-generation
---

### Aparecium Baseline Model Card

#### Summary
- **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
- **Focus**: Crypto domain, with equities as auxiliary domain.
- **Checkpoint**: Baseline model trained with a phased schedule and early stopping.
- **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used.
- **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.

---

### Intended use
- Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability.
- Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match.

---

### Model architecture
- Encoder side: external. An MPNet‑family encoder (default: `sentence-transformers/all-mpnet-base-v2`) is assumed to produce the token‑level embeddings.
- Decoder: Transformer decoder consuming the MPNet memory:
  - d_model: 768
  - Decoder layers: 2
  - Attention heads: 8
  - FFN dim: 2048
  - Token and positional embeddings; GELU activations
- Decoding:
  - Supports greedy, sampling, and beam search.
  - Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target).
  - Optional lightweight constraints for hashtag/cashtag/URL continuity.
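The decoder dimensions above can be sketched as a small PyTorch module. This is an illustrative reconstruction from the listed hyperparameters, not the released implementation; the class name, vocabulary size, and maximum length are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingDecoder(nn.Module):
    """Transformer decoder over a token-level MPNet memory (illustrative sketch)."""

    def __init__(self, vocab_size=30527, d_model=768, n_layers=2,
                 n_heads=8, ffn_dim=2048, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim,
            activation="gelu", batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_ids, memory):
        # memory: (batch, src_len, 768) token-level MPNet matrix
        seq_len = tgt_ids.size(1)
        pos = torch.arange(seq_len, device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(pos)
        # standard causal mask so each position attends only to the past
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tgt_ids.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.lm_head(h)  # (batch, tgt_len, vocab_size)
```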

Recommended inference defaults:
- `num_beams=8`
- `length_penalty_alpha=0.6`
- `lambda_sim=0.6`
- `rescore_every_k=4`, `rescore_top_m=8`
- `beta=10.0`
- `enable_constraints=True`
- `deterministic=True`
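How `length_penalty_alpha` and `lambda_sim` interact during embedding‑aware rescoring can be shown with a minimal scoring function. This is a hypothetical combination for illustration; the released decoder may weight or schedule these terms differently (and `beta` is not modeled here).

```python
import numpy as np

def combined_score(sum_logprob, length, cand_emb, target_emb,
                   length_penalty_alpha=0.6, lambda_sim=0.6):
    """Length-penalized LM score plus weighted cosine to the pooled target.

    Illustrative only: candidates with equal LM scores are preferred when
    their re-embedded text lies closer to the pooled MPNet target vector.
    """
    score_norm = sum_logprob / (length ** length_penalty_alpha)
    cos = float(np.dot(cand_emb, target_emb) /
                (np.linalg.norm(cand_emb) * np.linalg.norm(target_emb) + 1e-12))
    return score_norm + lambda_sim * cos
```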

---

### Training data and provenance
- 1,000,000 synthetic posts total:
  - 500,000 crypto‑domain posts
  - 500,000 equities‑domain posts
- All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used.
- Embeddings:
  - Token‑level MPNet (default: `sentence-transformers/all-mpnet-base-v2`).
  - Cached to SQLite to avoid recomputation and allow resumable training.
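A resumable SQLite cache of token‑level matrices can be as simple as one table of serialized arrays. The schema and function names below are assumptions for illustration; the project's actual cache layout may differ.

```python
import io
import sqlite3

import numpy as np

def cache_embedding(db_path, post_id, matrix):
    """Store a (seq_len, 768) float32 matrix keyed by post id (illustrative schema)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS emb (id TEXT PRIMARY KEY, data BLOB)")
    buf = io.BytesIO()
    np.save(buf, np.asarray(matrix, dtype=np.float32))  # shape travels with the blob
    con.execute("INSERT OR REPLACE INTO emb VALUES (?, ?)", (post_id, buf.getvalue()))
    con.commit()
    con.close()

def load_embedding(db_path, post_id):
    """Return the cached matrix, or None on a cache miss (triggering recomputation)."""
    con = sqlite3.connect(db_path)
    row = con.execute("SELECT data FROM emb WHERE id = ?", (post_id,)).fetchone()
    con.close()
    return None if row is None else np.load(io.BytesIO(row[0]))
```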

---

### Training procedure (baseline regimen)
- Domain emphasis: 80% crypto / 20% equities per training phase.
- Phased training (10% of available chunks per phase), with evaluation after each phase:
  - In‑sample: small subset from the phase’s chunks
  - Out‑of‑sample: small hold‑out from both domains (not seen in the phase)
  - Early‑stop condition: stop if out‑of‑sample cosine degrades relative to prior phase.
- Optimizer: AdamW
- Learning rate (baseline finetune): 5e‑5
- Batch size: 16
- Input `max_source_length`: 256
- Target `max_target_length`: 128
- Checkpointing: every 2,000 steps and at phase end.

Notes
- Training used early stopping based on out‑of‑sample cosine.
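The phase loop with this early‑stop rule can be sketched as follows. `train_fn` and `eval_oos_cosine` are hypothetical hooks standing in for the project's training and evaluation code, which are not part of this card.

```python
def train_with_early_stop(phases, train_fn, eval_oos_cosine):
    """Train phase by phase; halt when out-of-sample cosine degrades
    relative to the prior phase (illustrative control flow only)."""
    prev_cos = float("-inf")
    last_kept = None
    for i, chunks in enumerate(phases):
        train_fn(chunks)           # train on this 10% slice (80:20 crypto:equities)
        cos = eval_oos_cosine()    # held-out cosine across both domains
        if cos < prev_cos:         # degradation -> keep the prior checkpoint
            break
        last_kept, prev_cos = i, cos
    return last_kept, prev_cos
```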

---

### Evaluation protocol (for the metrics below)
- Sample size: 1,000 examples per domain drawn from cached embedding databases.
- Decode config: `num_beams=8`, `length_penalty_alpha=0.6`, `lambda_sim=0.6`, `rescore_every_k=4`, `rescore_top_m=8`, `beta=10.0`, `enable_constraints=True`, `deterministic=True`.
- Metrics:
  - `cosine_mean/median/p10/p90`: cosine between pooled MPNet embedding of generated text and the pooled MPNet target vector (higher is better).
  - `score_norm_mean`: length‑penalized language model score (more positive is better; negative values are common for log‑scores).
  - `degenerate_pct`: % of clearly degenerate generations (very short/blank/only hashtags).
  - `domain_drift_pct`: % of equity‑like terms in crypto outputs (or crypto‑like terms in equities outputs). Heuristic text filter; intended as a rough indicator only.
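A straightforward reading of the cosine statistics is the aggregation below; exact percentile interpolation in the original evaluation may differ.

```python
import numpy as np

def summarize_cosines(cosines):
    """Aggregate per-example cosines into the reported summary statistics."""
    c = np.asarray(cosines, dtype=np.float64)
    return {
        "cosine_mean": float(c.mean()),
        "cosine_median": float(np.median(c)),
        "cosine_p10": float(np.percentile(c, 10)),
        "cosine_p90": float(np.percentile(c, 90)),
    }
```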

Results (current `models/baseline` checkpoint)
- Crypto (n=1000)
  - cosine_mean: 0.681
  - cosine_median: 0.843
  - cosine_p10: 0.000
  - cosine_p90: 0.984
  - score_norm_mean: −1.977
  - degenerate_pct: 5.2%
  - domain_drift_pct: 0.0%
- Equities (n=1000)
  - cosine_mean: 0.778
  - cosine_median: 0.901
  - cosine_p10: 0.326
  - cosine_p90: 0.986
  - score_norm_mean: −1.344
  - degenerate_pct: 2.2%
  - domain_drift_pct: 4.4%

Interpretation
- The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains).
- Equities shows higher average/median cosine and lower degeneracy than crypto, consistent with the auxiliary‑domain role and data characteristics.
- A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%).
- Under the chosen heuristic, domain drift is absent in crypto outputs (0.0% contain equity‑like terms) and modest in equities outputs (~4.4% contain crypto‑like terms).

---

### Input contract and usage
- **Input**: MPNet token‑level matrix `(seq_len × 768)` for a single post. Do not pass a pooled vector.
- **Tokenizer/model alignment** matters: use the same MPNet tokenizer/model version that produced the embeddings.
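Using the default encoder, the token‑level matrix can be produced as below (a sketch using the Hugging Face `transformers` API; the example text is illustrative). Note that `last_hidden_state` is taken, not a pooled vector.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Produce the (seq_len, 768) token-level matrix the decoder expects.
# Keep the tokenizer and encoder versions identical to those used when
# the training embeddings were built.
NAME = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME)
encoder.eval()

batch = tokenizer("gm, $BTC looking strong into the weekend",
                  return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    memory = encoder(**batch).last_hidden_state[0]  # (seq_len, 768), not pooled
```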

---

### Limitations and responsible use
- Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring.
- The model can produce generic or incomplete outputs (see `degenerate_pct`).
- Domain drift can occur depending on decode settings (see `domain_drift_pct`).
- Data are synthetic programmatic generations, not real social‑media posts. Domain semantics may differ from real‑world distributions.
- Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool.

---

### Reproducibility (high‑level)
- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths.
- Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
- Evaluation: 1,000 samples/domain with the decode settings shown above.
- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping.

---

### License
- Code: MIT (per repository).
- Model weights: same as code unless declared otherwise upon release.

---

### Citation
If you use this model or codebase, please cite the Aparecium project and this baseline report.