ruitao-edward-chen committed
Commit f69735f · Parent(s): 81729ef
Overwrite with new baseline checkpoint, tokenizer, and model card
Browse files:
- .gitattributes +0 -34
- README.md +118 -70
- reverser_seq2seq_state.pt +2 -2
- special_tokens_map.json +51 -51
- tokenizer.json +1 -1
- tokenizer_config.json +73 -73
.gitattributes
CHANGED
@@ -1,35 +1 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,93 +1,141 @@
----
-license: mit
-language:
-- en
----
-# Aparecium Seq2Seq Reverser Model
-
-This model is part of the [Aparecium](https://github.com/SentiChain/aparecium) project, designed to reveal text from embedding vectors, particularly for SentiChain embeddings.
-
-## Model Description

-

-
-
-- **Dataset Size**: 10,000 sentences
-- **Data Source**: Generated using OpenAI's API
-- **Domain**: Cryptocurrency market events and related content
-- **Language**: English

-###

-
-- Processing text from other domains
-- Handling general-purpose text
-- Working with technical content unrelated to crypto markets

-### Model

-
-- Transformer decoder with 2 layers
-- 8 attention heads
-- 768-dimensional embeddings
-- 2048-dimensional feed-forward networks
-- Specialized tokenizer for crypto market terminology
-- Optimized for embedding vector reconstruction

-

-

-
-

-
-reverser = Seq2SeqReverser.from_pretrained("SentiChain/aparecium-seq2seq-reverser")

-
-
-
-

-

-
-
-

-

-
--
--
--
--

-
-- General news articles
-- Technical documentation
-- Social media content
-- Non-financial text

-

-

-

-

-
-
-author = {Chen, Edward},
-title = {Aparecium: Text Reconstruction from Embedding Vectors},
-year = {2025},
-publisher = {GitHub},
-url = {https://github.com/SentiChain/aparecium}
-}
-```

-## Contact

-For issues and questions, please use the [GitHub issue tracker](https://github.com/SentiChain/aparecium/issues).
+### Aparecium Baseline (Crypto‑focused) — Model Card

+#### Summary
+- **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding).
+- **Focus**: Crypto domain, with equities as auxiliary domain.
+- **Current checkpoint**: `models/baseline` reflects Phase 3 (early stop triggered after Phase 3 due to out‑of‑sample drop). Phase 2 performed best; consider publishing the Phase 2 checkpoint if available.
+- **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used.
+- **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector.

+---

+### Intended use
+- Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability.
+- Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match.

+---

+### Model architecture
+- Encoder side: external; an MPNet‑family encoder (default: `sentence-transformers/all-mpnet-base-v2`) is assumed to produce token‑level embeddings.
+- Decoder: Transformer decoder consuming the MPNet memory (sketched after this section):
+  - d_model: 768
+  - Decoder layers: 2
+  - Attention heads: 8
+  - FFN dim: 2048
+  - Token and positional embeddings; GELU activations
+- Decoding:
+  - Supports greedy, sampling, and beam search.
+  - Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target).
+  - Optional lightweight constraints for hashtag/cashtag/URL continuity.

+Recommended inference defaults:
+- `num_beams=8`
+- `length_penalty_alpha=0.6`
+- `lambda_sim=0.6`
+- `rescore_every_k=4`, `rescore_top_m=8`
+- `beta=10.0`
+- `enable_constraints=True`
+- `deterministic=True`

+---
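
For concreteness, here is a minimal PyTorch sketch of a decoder with the hyperparameters listed above (d_model 768, 2 layers, 8 heads, FFN 2048, GELU). The class name is hypothetical and the actual `Seq2SeqReverser` internals may differ:

```python
import torch
import torch.nn as nn

class ReverserDecoderSketch(nn.Module):  # hypothetical name, not the project's class
    def __init__(self, vocab_size: int, d_model: int = 768, max_len: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)     # positional embeddings
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=2048,
            activation="gelu", batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_ids: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # memory: (batch, src_len, 768) token-level MPNet embeddings
        pos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(pos)
        n = tgt_ids.size(1)  # causal mask so each position sees only its prefix
        causal = torch.triu(
            torch.full((n, n), float("-inf"), device=tgt_ids.device), diagonal=1
        )
        return self.lm_head(self.decoder(x, memory, tgt_mask=causal))
```

The recommended defaults above would then be passed as keyword arguments to the project's decode entry point.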

+### Training data and provenance
+- 1,000,000 synthetic posts total:
+  - 500,000 crypto‑domain posts
+  - 500,000 equities‑domain posts
+- All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used.
+- Embeddings:
+  - Token‑level MPNet (default: `sentence-transformers/all-mpnet-base-v2`).
+  - Cached to SQLite to avoid recomputation and allow resumable training (see the sketch below).

+---
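
A sketch of the SQLite caching idea, assuming a simple one-table schema and raw float32 blobs; the project's actual schema and serialization may differ:

```python
import sqlite3
import numpy as np

def open_cache(path: str) -> sqlite3.Connection:
    # One row per post: id, token count, and the raw (seq_len, 768) matrix.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings "
        "(post_id TEXT PRIMARY KEY, seq_len INTEGER, data BLOB)"
    )
    return conn

def put_embedding(conn: sqlite3.Connection, post_id: str, emb: np.ndarray) -> None:
    # emb: (seq_len, 768) float32 token-level MPNet matrix
    conn.execute(
        "INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
        (post_id, emb.shape[0], emb.astype(np.float32).tobytes()),
    )
    conn.commit()

def get_embedding(conn: sqlite3.Connection, post_id: str):
    row = conn.execute(
        "SELECT seq_len, data FROM embeddings WHERE post_id = ?", (post_id,)
    ).fetchone()
    if row is None:
        return None  # cache miss: compute the embedding, then put_embedding()
    seq_len, blob = row
    return np.frombuffer(blob, dtype=np.float32).reshape(seq_len, 768)
```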

+### Training procedure (baseline regimen)
+- Domain emphasis: 80% crypto / 20% equities per training phase.
+- Phased training (10% of available chunks per phase), evaluated after each phase:
+  - In‑sample: a small subset drawn from the phase’s chunks.
+  - Out‑of‑sample: a small hold‑out from both domains (not seen in the phase).
+- Early‑stop condition: stop if out‑of‑sample cosine degrades relative to the prior phase (see the sketch after this section).
+- Optimizer: AdamW
+- Learning rate (baseline finetune): 5e‑5
+- Batch size: 16
+- Input `max_source_length`: 256
+- Target `max_target_length`: 128
+- Checkpointing: every 2,000 steps and at phase end.

+Notes
+- In this run, Phase 1 → Phase 2 showed clear out‑of‑sample improvements; Phase 3 degraded and the early stop triggered.
+- Best observed checkpoint: Phase 2 (if retained). The directory currently contains Phase 3; consider re‑exporting Phase 2.

+---
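
A sketch of the phase loop and early-stop rule described above; `train_one_phase` and `eval_out_of_sample_cosine` stand in for project code:

```python
def run_phases(phases, train_one_phase, eval_out_of_sample_cosine):
    """Train phase by phase; stop when out-of-sample cosine degrades."""
    best_cosine = float("-inf")
    best_phase = None
    for i, phase_chunks in enumerate(phases, start=1):
        train_one_phase(phase_chunks)         # 10% of chunks, 80:20 crypto:equities
        cosine = eval_out_of_sample_cosine()  # hold-out from both domains
        if cosine < best_cosine:              # degradation vs. the prior phase
            break                             # early stop (Phase 3 in the reported run)
        best_cosine, best_phase = cosine, i   # Phase 2 was best in the reported run
    return best_phase, best_cosine
```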

+### Evaluation protocol (for the metrics below)
+- Sample size: 1,000 examples per domain, drawn from the cached embedding databases.
+- Decode config: `num_beams=8`, `length_penalty_alpha=0.6`, `lambda_sim=0.6`, `rescore_every_k=4`, `rescore_top_m=8`, `beta=10.0`, `enable_constraints=True`, `deterministic=True`.
+- Metrics:
+  - `cosine_mean/median/p10/p90`: cosine between the pooled MPNet embedding of the generated text and the pooled MPNet target vector (higher is better); see the sketch after this section.
+  - `score_norm_mean`: length‑penalized language‑model score (more positive is better; negative values are common for log‑scores).
+  - `degenerate_pct`: % of clearly degenerate generations (very short/blank/only hashtags).
+  - `domain_drift_pct`: % of equity‑like terms in crypto outputs (or crypto‑like terms in equities outputs). A heuristic text filter, intended as a rough indicator only.

+Results (current `models/baseline` checkpoint)
+- Crypto (n=1000)
+  - cosine_mean: 0.681
+  - cosine_median: 0.843
+  - cosine_p10: 0.000
+  - cosine_p90: 0.984
+  - score_norm_mean: −1.977
+  - degenerate_pct: 5.2%
+  - domain_drift_pct: 0.0%
+- Equities (n=1000)
+  - cosine_mean: 0.778
+  - cosine_median: 0.901
+  - cosine_p10: 0.326
+  - cosine_p90: 0.986
+  - score_norm_mean: −1.344
+  - degenerate_pct: 2.2%
+  - domain_drift_pct: 4.4%

+Interpretation
+- The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains).
+- Equities shows higher average/median cosine and lower degeneracy than crypto, consistent with its auxiliary‑domain role and data characteristics.
+- A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%).
+- Domain drift is minimal from crypto→equities (0.0%) and modest from equities→crypto (~4.4%) under the chosen heuristic.

+---
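
A sketch of the `cosine_*` metric, assuming the `sentence-transformers` package for the pooled MPNet embedding:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def pooled_cosine(generated_text: str, target_pooled: np.ndarray) -> float:
    # Re-embed the generated text and compare it to the pooled MPNet target vector.
    v = encoder.encode(generated_text)  # pooled 768-d sentence vector
    denom = float(np.linalg.norm(v) * np.linalg.norm(target_pooled)) + 1e-12
    return float(np.dot(v, target_pooled)) / denom
```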

+### Input contract and usage
+- **Input**: an MPNet token‑level matrix `(seq_len × 768)` for a single post. Do not pass a pooled vector; a minimal extraction sketch follows this section.
+- **Tokenizer/model alignment matters**: use the same MPNet tokenizer/model version that produced the embeddings.

+---
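
A minimal sketch of producing the expected input, assuming the Hugging Face `transformers` API; the decoder consumes the encoder's last hidden state, not the pooled output:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
mpnet = AutoModel.from_pretrained(MODEL)

def token_level_embedding(post: str) -> torch.Tensor:
    # Returns the (seq_len, 768) token-level matrix for a single post.
    batch = tokenizer(post, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        out = mpnet(**batch)
    return out.last_hidden_state[0]  # NOT the pooled sentence vector
```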

+### Limitations and responsible use
+- Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring.
+- The model can produce generic or incomplete outputs (see `degenerate_pct`).
+- Domain drift can occur depending on decode settings (see `domain_drift_pct`).
+- Data are synthetic programmatic generations, not real social‑media posts. Domain semantics may differ from real‑world distributions.
+- Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool.

+---

+### Reproducibility (high‑level)
+- Prepare caches:
+  - crypto: `data/pipeline/aparecium_crypto_500k.db`
+  - equities: `data/pipeline/aparecium_equities_500k.db`
+- Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation.
+- Evaluation: 1,000 samples/domain with the decode settings shown above.
+- Best observed baseline: Phase 2 (early‑stop triggered after Phase 3). The directory currently contains Phase 3 unless a Phase 2 copy is retained.

+---

+### License
+- Code: MIT (per repository).
+- Model weights: same as code unless declared otherwise upon release.

+---

+### Citation
+If you use this model or codebase, please cite the Aparecium project and this baseline report.
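
The BibTeX fields from the previous card still describe the project; a reconstructed entry follows (the `@misc` type and the entry key are assumptions, since the diff elides them):

```bibtex
@misc{aparecium,
  author    = {Chen, Edward},
  title     = {Aparecium: Text Reconstruction from Embedding Vectors},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/SentiChain/aparecium}
}
```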
reverser_seq2seq_state.pt
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:7e77d93e56c50d95f25a7301f0b5431307cd7b0ee05830f071cbd7c116ef6888
+size 252292530
special_tokens_map.json
CHANGED
@@ -1,51 +1,51 @@
-{
-  "bos_token": {
-    "content": "<s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "cls_token": {
-    "content": "<s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "mask_token": {
-    "content": "<mask>",
-    "lstrip": true,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<pad>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "sep_token": {
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "[UNK]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
-}
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
CHANGED
@@ -2,7 +2,7 @@
   "version": "1.0",
   "truncation": {
     "direction": "Right",
-    "max_length":
+    "max_length": 128,
     "strategy": "LongestFirst",
     "stride": 0
   },
tokenizer_config.json
CHANGED
@@ -1,73 +1,73 @@
-{
-  "added_tokens_decoder": {
-    "0": {
-      "content": "<s>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "1": {
-      "content": "<pad>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "2": {
-      "content": "</s>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "3": {
-      "content": "<unk>",
-      "lstrip": false,
-      "normalized": true,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "104": {
-      "content": "[UNK]",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "30526": {
-      "content": "<mask>",
-      "lstrip": true,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    }
-  },
-  "bos_token": "<s>",
-  "clean_up_tokenization_spaces": false,
-  "cls_token": "<s>",
-  "do_lower_case": true,
-  "eos_token": "</s>",
-  "extra_special_tokens": {},
-  "mask_token": "<mask>",
-  "max_length": 128,
-  "model_max_length": 512,
-  "pad_to_multiple_of": null,
-  "pad_token": "<pad>",
-  "pad_token_type_id": 0,
-  "padding_side": "right",
-  "sep_token": "</s>",
-  "stride": 0,
-  "strip_accents": null,
-  "tokenize_chinese_chars": true,
-  "tokenizer_class": "MPNetTokenizer",
-  "truncation_side": "right",
-  "truncation_strategy": "longest_first",
-  "unk_token": "[UNK]"
-}
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "104": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "30526": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "do_lower_case": true,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "max_length": 128,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "</s>",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "MPNetTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}