ruitao-edward-chen committed
Commit f69735f · 1 Parent(s): 81729ef

Overwrite with new baseline checkpoint, tokenizer, and model card
.gitattributes CHANGED
@@ -1,35 +1 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,93 +1,141 @@
- ---
- license: mit
- language:
- - en
- ---
- # Aparecium Seq2Seq Reverser Model
-
- This model is part of the [Aparecium](https://github.com/SentiChain/aparecium) project, designed to reveal text from embedding vectors, particularly for SentiChain embeddings.
-
- ## Model Description
-
- The Seq2Seq Reverser model is a specialized sequence-to-sequence model trained to reconstruct original text from embedding vectors, with a particular focus on crypto market-related content.
-
- ### Training Data
-
- - **Dataset Size**: 10,000 sentences
- - **Data Source**: Generated using OpenAI's API
- - **Domain**: Cryptocurrency market events and related content
- - **Language**: English
-
- ### Limitations
-
- ⚠️ **Important Note**: This model is specifically trained on cryptocurrency market-related content. Its performance may be significantly limited when:
- - Processing text from other domains
- - Handling general-purpose text
- - Working with technical content unrelated to crypto markets
-
- ### Model Architecture
-
- The model uses a sequence-to-sequence architecture with:
- - Transformer decoder with 2 layers
- - 8 attention heads
- - 768-dimensional embeddings
- - 2048-dimensional feed-forward networks
- - Specialized tokenizer for crypto market terminology
- - Optimized for embedding vector reconstruction
-
- ## Usage
-
- The model can be used through the Aparecium Python package:
-
- ```python
- from aparecium import Seq2SeqReverser
-
- # Load the pre-trained model from Hugging Face Hub
- reverser = Seq2SeqReverser.from_pretrained("SentiChain/aparecium-seq2seq-reverser")
-
- # Generate text from embedding vectors
- recovered_text = reverser.generate_text(source_rep)
- print(recovered_text)
- ```
-
- ### Installation
-
- ```bash
- pip install aparecium
- ```
-
- ## Performance and Limitations
-
- The model performs best on:
- - Cryptocurrency market news and updates
- - Trading-related content
- - Market analysis text
- - Blockchain technology discussions
-
- Performance may degrade on:
- - General news articles
- - Technical documentation
- - Social media content
- - Non-financial text
-
- ## License
-
- This model is released under the MIT License.
-
- ## Citation
-
- If you use this model in your research, please cite:
-
- ```bibtex
- @software{aparecium2025,
-   author = {Chen, Edward},
-   title = {Aparecium: Text Reconstruction from Embedding Vectors},
-   year = {2025},
-   publisher = {GitHub},
-   url = {https://github.com/SentiChain/aparecium}
- }
- ```
-
- ## Contact
-
- For issues and questions, please use the [GitHub issue tracker](https://github.com/SentiChain/aparecium/issues).
+ ### Aparecium Baseline (Crypto‑focused) — Model Card
+
+ #### Summary
+ - **Task**: Reconstruct natural-language posts from token‑level MPNet embeddings (reverse embedding).
+ - **Focus**: Crypto domain, with equities as an auxiliary domain.
+ - **Current checkpoint**: `models/baseline` reflects Phase 3 (early stop triggered after Phase 3 due to an out‑of‑sample drop). Phase 2 performed best; consider publishing the Phase 2 checkpoint if available.
+ - **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via the OpenAI API. No real social‑media content was used.
+ - **Input contract**: a token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.
+
+ ---
+
+ ### Intended use
+ - Research and engineering use: studying the reversibility of embedding spaces and building diagnostics and tools around embedding interpretability.
+ - Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match.
+
+ ---
+
+ ### Model architecture
+ - Encoder side: external. An MPNet‑family encoder (default: `sentence-transformers/all-mpnet-base-v2`) is assumed to produce the token‑level embeddings.
+ - Decoder: a Transformer decoder consuming the MPNet memory:
+   - d_model: 768
+   - Decoder layers: 2
+   - Attention heads: 8
+   - FFN dim: 2048
+   - Token and positional embeddings; GELU activations
+ - Decoding:
+   - Supports greedy, sampling, and beam search.
+   - Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target); see the rescoring sketch after the defaults below.
+   - Optional lightweight constraints for hashtag/cashtag/URL continuity.
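+
+ For concreteness, a minimal sketch of a decoder with these hyperparameters, built on `torch.nn.TransformerDecoder`; the vocabulary size and module layout are illustrative assumptions, not the actual Aparecium implementation:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ VOCAB, D_MODEL, MAX_LEN = 30527, 768, 128  # vocab size assumed from the MPNet tokenizer
+
+ class ReverserDecoder(nn.Module):
+     """Causal Transformer decoder cross-attending to a (src_len, 768) MPNet memory."""
+
+     def __init__(self) -> None:
+         super().__init__()
+         self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
+         self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
+         layer = nn.TransformerDecoderLayer(
+             d_model=D_MODEL, nhead=8, dim_feedforward=2048,
+             activation="gelu", batch_first=True)
+         self.decoder = nn.TransformerDecoder(layer, num_layers=2)
+         self.lm_head = nn.Linear(D_MODEL, VOCAB)
+
+     def forward(self, tokens: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
+         # tokens: (batch, tgt_len) token ids; memory: (batch, src_len, 768) MPNet states.
+         pos = torch.arange(tokens.size(1), device=tokens.device)
+         x = self.tok_emb(tokens) + self.pos_emb(pos)
+         causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
+         return self.lm_head(self.decoder(x, memory, tgt_mask=causal))
+ ```
+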
+ Recommended inference defaults:
+ - `num_beams=8`
+ - `length_penalty_alpha=0.6`
+ - `lambda_sim=0.6`
+ - `rescore_every_k=4`, `rescore_top_m=8`
+ - `beta=10.0`
+ - `enable_constraints=True`
+ - `deterministic=True`
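+
+ A minimal sketch of the embedding‑aware rescoring step, assuming a `score / length**alpha` length penalty and mean‑pooled MPNet re‑embedding; the exact penalty form, the role of `beta`, and the candidate representation are assumptions here, not the documented Aparecium API:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+
+ MODEL = "sentence-transformers/all-mpnet-base-v2"
+ tok = AutoTokenizer.from_pretrained(MODEL)
+ enc = AutoModel.from_pretrained(MODEL).eval()
+
+ def pooled(text: str) -> torch.Tensor:
+     """Mean-pooled MPNet embedding of `text` -> (768,)."""
+     batch = tok(text, return_tensors="pt", truncation=True)
+     with torch.no_grad():
+         hidden = enc(**batch).last_hidden_state[0]     # (seq_len, 768)
+     return hidden.mean(dim=0)
+
+ def rescore(candidates, target_pooled, lambda_sim=0.6, alpha=0.6):
+     """Pick the hypothesis maximizing length-normalized LM score plus
+     lambda_sim times cosine alignment with the pooled MPNet target."""
+     best, best_score = None, float("-inf")
+     for text, lm_logprob, n_tokens in candidates:      # (str, float, int) triples
+         score_norm = lm_logprob / (n_tokens ** alpha)  # assumed penalty form
+         sim = F.cosine_similarity(pooled(text), target_pooled, dim=0).item()
+         total = score_norm + lambda_sim * sim
+         if total > best_score:
+             best, best_score = text, total
+     return best, best_score
+ ```
+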
+ ---
 
 
 
 
 
 
+ ### Training data and provenance
+ - 1,000,000 synthetic posts total:
+   - 500,000 crypto‑domain posts
+   - 500,000 equities‑domain posts
+ - All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used.
+ - Embeddings:
+   - Token‑level MPNet (default: `sentence-transformers/all-mpnet-base-v2`).
+   - Cached to SQLite to avoid recomputation and to allow resumable training; a cache sketch follows this list.
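+
+ A minimal sketch of this caching pattern, keyed by a hash of the post text; the schema and helper names are illustrative, not the actual pipeline schema:
+
+ ```python
+ import hashlib
+ import io
+ import sqlite3
+
+ import numpy as np
+
+ def open_cache(path: str) -> sqlite3.Connection:
+     """Open (or create) an embedding cache with one row per post."""
+     db = sqlite3.connect(path)
+     db.execute("CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, emb BLOB)")
+     return db
+
+ def get_or_compute(db: sqlite3.Connection, text: str, embed_fn) -> np.ndarray:
+     """Return the cached (seq_len, 768) matrix, computing and storing it on a miss."""
+     key = hashlib.sha256(text.encode("utf-8")).hexdigest()
+     row = db.execute("SELECT emb FROM embeddings WHERE key = ?", (key,)).fetchone()
+     if row is not None:
+         return np.load(io.BytesIO(row[0]))
+     emb = np.asarray(embed_fn(text), dtype=np.float32)  # (seq_len, 768)
+     buf = io.BytesIO()
+     np.save(buf, emb)
+     db.execute("INSERT INTO embeddings VALUES (?, ?)", (key, buf.getvalue()))
+     db.commit()
+     return emb
+ ```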
 
+ ---
+
+ ### Training procedure (baseline regimen)
+ - Domain emphasis: 80% crypto / 20% equities per training phase.
+ - Phased training (10% of the available chunks per phase), with evaluation after each phase (see the phase‑loop sketch below):
+   - In‑sample: a small subset from the phase’s chunks
+   - Out‑of‑sample: a small hold‑out from both domains (not seen in the phase)
+   - Early‑stop condition: stop if out‑of‑sample cosine degrades relative to the prior phase.
+ - Optimizer: AdamW
+ - Learning rate (baseline finetune): 5e‑5
+ - Batch size: 16
+ - Input `max_source_length`: 256
+ - Target `max_target_length`: 128
+ - Checkpointing: every 2,000 steps and at phase end.
+
+ Notes:
+ - In this run, Phase 1 → Phase 2 showed clear out‑of‑sample improvements; Phase 3 degraded, and the early stop triggered.
+ - Best observed checkpoint: Phase 2 (if retained). The directory currently contains Phase 3; consider re‑exporting Phase 2.
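+
+ A sketch of the phase loop and early‑stop rule described above; `train_fn` and `eval_oos_cosine` are hypothetical stand‑ins for the actual training and evaluation code:
+
+ ```python
+ import random
+
+ def run_phases(chunks, train_fn, eval_oos_cosine, phase_frac=0.1):
+     """Train on ~10% of the chunks per phase; stop once out-of-sample cosine drops."""
+     random.shuffle(chunks)
+     phase_size = max(1, int(len(chunks) * phase_frac))
+     prev = float("-inf")
+     for phase, start in enumerate(range(0, len(chunks), phase_size), 1):
+         train_fn(chunks[start:start + phase_size])  # 80:20 crypto:equities mix inside
+         oos = eval_oos_cosine()                     # mean pooled cosine on the hold-out
+         print(f"phase {phase}: out-of-sample cosine {oos:.3f}")
+         if oos < prev:                              # degradation vs. the prior phase
+             break                                   # early stop; keep prior checkpoint
+         prev = oos
+ ```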
 
+ ---
+
+ ### Evaluation protocol (for the metrics below)
+ - Sample size: 1,000 examples per domain, drawn from the cached embedding databases.
+ - Decode config: `num_beams=8`, `length_penalty_alpha=0.6`, `lambda_sim=0.6`, `rescore_every_k=4`, `rescore_top_m=8`, `beta=10.0`, `enable_constraints=True`, `deterministic=True`.
+ - Metrics (an aggregation sketch follows this list):
+   - `cosine_mean/median/p10/p90`: cosine between the pooled MPNet embedding of the generated text and the pooled MPNet target vector (higher is better).
+   - `score_norm_mean`: length‑penalized language‑model score (more positive is better; negative values are common for log‑scores).
+   - `degenerate_pct`: % of clearly degenerate generations (very short, blank, or hashtag‑only).
+   - `domain_drift_pct`: % of equity‑like terms in crypto outputs (or crypto‑like terms in equities outputs). A heuristic text filter, intended as a rough indicator only.
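+
+ A sketch of the aggregation, with an assumed degeneracy heuristic (the exact filter behind `degenerate_pct` is not specified above):
+
+ ```python
+ import numpy as np
+
+ def summarize(cosines, texts):
+     """Aggregate per-example pooled-cosine scores into the reported statistics."""
+     c = np.asarray(cosines, dtype=np.float64)
+     degenerate = sum(                               # assumed heuristic: very short,
+         1 for t in texts                            # blank, or hashtag-only outputs
+         if len(t.split()) < 3 or all(w.startswith("#") for w in t.split())
+     )
+     return {
+         "cosine_mean": float(c.mean()),
+         "cosine_median": float(np.median(c)),
+         "cosine_p10": float(np.percentile(c, 10)),
+         "cosine_p90": float(np.percentile(c, 90)),
+         "degenerate_pct": 100.0 * degenerate / len(texts),
+     }
+ ```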
+
+ Results (current `models/baseline` checkpoint):
+
+ | Metric | Crypto (n=1000) | Equities (n=1000) |
+ | --- | --- | --- |
+ | cosine_mean | 0.681 | 0.778 |
+ | cosine_median | 0.843 | 0.901 |
+ | cosine_p10 | 0.000 | 0.326 |
+ | cosine_p90 | 0.984 | 0.986 |
+ | score_norm_mean | −1.977 | −1.344 |
+ | degenerate_pct | 5.2% | 2.2% |
+ | domain_drift_pct | 0.0% | 4.4% |
+
+ Interpretation:
+ - The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains).
+ - Equities shows a higher average/median cosine and lower degeneracy than crypto, consistent with its auxiliary‑domain role and data characteristics.
+ - A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%).
+ - Domain drift is minimal from crypto→equities (0.0%) and modest from equities→crypto (~4.4%) under the chosen heuristic.
 
+ ---
+
+ ### Input contract and usage
+ - **Input**: an MPNet token‑level matrix `(seq_len × 768)` for a single post; do not pass a pooled vector. A sketch follows below.
+ - **Tokenizer/model alignment** matters: use the same MPNet tokenizer/model version that produced the embeddings.
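+
+ A minimal sketch of producing a conforming input with `transformers`, using `max_length=256` to match `max_source_length`; whether special or padding tokens should be trimmed is an assumption to verify against the pipeline that produced your cached embeddings:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ MODEL = "sentence-transformers/all-mpnet-base-v2"
+ tok = AutoTokenizer.from_pretrained(MODEL)
+ enc = AutoModel.from_pretrained(MODEL).eval()
+
+ def token_level_embedding(post: str) -> torch.Tensor:
+     """Token-level MPNet matrix of shape (seq_len, 768) for a single post."""
+     batch = tok(post, return_tensors="pt", truncation=True, max_length=256)
+     with torch.no_grad():
+         hidden = enc(**batch).last_hidden_state  # (1, seq_len, 768)
+     return hidden[0]                             # keep the token level; do NOT pool
+
+ source_rep = token_level_embedding("BTC reclaims $60k as ETF inflows accelerate")
+ print(source_rep.shape)                          # e.g. torch.Size([12, 768])
+ ```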
 
+ ---
+
+ ### Limitations and responsible use
+ - Reconstruction is not guaranteed to match the original post text; the model optimizes alignment within the MPNet embedding space together with LM scoring.
+ - The model can produce generic or incomplete outputs (see `degenerate_pct`).
+ - Domain drift can occur depending on decode settings (see `domain_drift_pct`).
+ - The data are synthetic programmatic generations, not real social‑media posts; domain semantics may differ from real‑world distributions.
+ - Do not use this model to reconstruct sensitive or private content or to attempt to de‑anonymize embedding corpora. It is a research/diagnostic tool.
+
+ ---
+
+ ### Reproducibility (high‑level)
+ - Prepare caches:
+   - crypto: `data/pipeline/aparecium_crypto_500k.db`
+   - equities: `data/pipeline/aparecium_equities_500k.db`
+ - Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early stop on out‑of‑sample cosine degradation.
+ - Evaluation: 1,000 samples per domain with the decode settings shown above.
+ - Best observed baseline: Phase 2 (the early stop triggered after Phase 3). The directory currently contains Phase 3 unless a Phase 2 copy is retained.
+
+ ---
+
+ ### License
+ - Code: MIT (per the repository).
+ - Model weights: same as the code unless declared otherwise upon release.
+
+ ---
+
+ ### Citation
+ If you use this model or codebase, please cite the Aparecium project and this baseline report.
reverser_seq2seq_state.pt CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b8842c3b99a2746a1a086adcdb86b1934ff146d616e62517584c001021184405
- size 252291890
+ oid sha256:7e77d93e56c50d95f25a7301f0b5431307cd7b0ee05830f071cbd7c116ef6888
+ size 252292530
special_tokens_map.json CHANGED
@@ -1,51 +1,51 @@
- {
-   "bos_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "cls_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "</s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "mask_token": {
-     "content": "<mask>",
-     "lstrip": true,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": {
-     "content": "<pad>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "sep_token": {
-     "content": "</s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "unk_token": {
-     "content": "[UNK]",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   }
- }
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
  "version": "1.0",
  "truncation": {
    "direction": "Right",
-   "max_length": 256,
+   "max_length": 128,
    "strategy": "LongestFirst",
    "stride": 0
  },
tokenizer_config.json CHANGED
@@ -1,73 +1,73 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "<s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "<pad>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "</s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "<unk>",
-       "lstrip": false,
-       "normalized": true,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "104": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30526": {
-       "content": "<mask>",
-       "lstrip": true,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "bos_token": "<s>",
-   "clean_up_tokenization_spaces": false,
-   "cls_token": "<s>",
-   "do_lower_case": true,
-   "eos_token": "</s>",
-   "extra_special_tokens": {},
-   "mask_token": "<mask>",
-   "max_length": 128,
-   "model_max_length": 512,
-   "pad_to_multiple_of": null,
-   "pad_token": "<pad>",
-   "pad_token_type_id": 0,
-   "padding_side": "right",
-   "sep_token": "</s>",
-   "stride": 0,
-   "strip_accents": null,
-   "tokenize_chinese_chars": true,
-   "tokenizer_class": "MPNetTokenizer",
-   "truncation_side": "right",
-   "truncation_strategy": "longest_first",
-   "unk_token": "[UNK]"
- }
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30526": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "MPNetTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }