---
language: en
license: mit
library_name: pytorch
tags:
- transformer-decoder
- seq2seq
- embeddings
- mpnet
- text-reconstruction
- crypto
pipeline_tag: text-generation
---
|
|
|
|
|
### Aparecium Baseline Model Card |
|
|
|
|
|
#### Summary |
|
|
- **Task**: Reconstruct natural language posts from token‑level MPNet embeddings (reverse embedding). |
|
|
- **Focus**: Crypto domain, with equities as auxiliary domain. |
|
|
- **Checkpoint**: Baseline model trained with a phased schedule and early stopping. |
|
|
- **Data**: 1.0M synthetic posts (500k crypto + 500k equities), programmatically generated via OpenAI API. No real social‑media content used. |
|
|
- **Input contract**: token‑level MPNet matrix of shape `(seq_len, 768)`, not a pooled vector. |
|
|
|
|
|
--- |
|
|
|
|
|
### Intended use |
|
|
- Research and engineering use for studying reversibility of embedding spaces and for building diagnostics/tools around embedding interpretability. |
|
|
- Not intended to reconstruct private or sensitive content; reconstruction accuracy depends on embedding fidelity and domain match. |
|
|
|
|
|
--- |
|
|
|
|
|
### Model architecture |
|
|
- Encoder side: external; an MPNet‑family encoder (default: `sentence-transformers/all-mpnet-base-v2`) is assumed to produce the token‑level embeddings.
|
|
- Decoder: Transformer decoder consuming the MPNet memory: |
|
|
- d_model: 768 |
|
|
- Decoder layers: 2 |
|
|
- Attention heads: 8 |
|
|
- FFN dim: 2048 |
|
|
- Token and positional embeddings; GELU activations |
|
|
- Decoding: |
|
|
- Supports greedy, sampling, and beam search. |
|
|
- Optional embedding‑aware rescoring (cosine similarity between the candidate’s re‑embedded sentence and the pooled MPNet target). |
|
|
- Optional lightweight constraints for hashtag/cashtag/URL continuity. |
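The decoder configuration above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the class name, wiring, and the vocabulary size (30527, MPNet's) are assumptions.

```python
import torch
import torch.nn as nn

class ApareciumDecoderSketch(nn.Module):
    """Hypothetical 2-layer Transformer decoder over MPNet token memory."""

    def __init__(self, vocab_size=30527, d_model=768, n_layers=2,
                 n_heads=8, ffn_dim=2048, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ffn_dim,
            activation="gelu", batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_ids, memory):
        # memory: (batch, src_len, 768) token-level MPNet embeddings
        seq_len = tgt_ids.size(1)
        pos = torch.arange(seq_len, device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(pos)
        # Causal mask so each position only attends to earlier targets.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.decoder(x, memory, tgt_mask=mask)
        return self.lm_head(h)

model = ApareciumDecoderSketch()
memory = torch.randn(2, 40, 768)        # two posts, 40 MPNet tokens each
tgt = torch.randint(0, 30527, (2, 16))  # teacher-forced target prefixes
logits = model(tgt, memory)
print(logits.shape)  # torch.Size([2, 16, 30527])
```

Note that the decoder cross‑attends to the full token‑level memory, which is why a pooled 768‑dim vector does not satisfy the input contract.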
|
|
|
|
|
Recommended inference defaults: |
|
|
- `num_beams=8` |
|
|
- `length_penalty_alpha=0.6` |
|
|
- `lambda_sim=0.6` |
|
|
- `rescore_every_k=4`, `rescore_top_m=8` |
|
|
- `beta=10.0` |
|
|
- `enable_constraints=True` |
|
|
- `deterministic=True` |
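One plausible reading of how these settings combine is a GNMT‑style length penalty (with `length_penalty_alpha=0.6`) plus a `lambda_sim`-weighted cosine bonus from the embedding‑aware rescoring. The exact formula in the released code may differ; this is an assumed sketch.

```python
import math

def length_penalty(length, alpha=0.6):
    # GNMT-style penalty: ((5 + len) / 6) ** alpha (assumed form).
    return ((5.0 + length) / 6.0) ** alpha

def rescored(log_prob_sum, length, cosine_to_target, lambda_sim=0.6):
    # Higher is better: length-normalized LM score plus a similarity bonus
    # from re-embedding the candidate and comparing to the pooled target.
    return log_prob_sum / length_penalty(length) + lambda_sim * cosine_to_target

# A candidate that aligns well with the target embedding gets a boost:
print(rescored(-10.0, 10, cosine_to_target=0.9))
print(rescored(-10.0, 10, cosine_to_target=0.0))
```

Under this reading, `rescore_every_k=4` and `rescore_top_m=8` would limit how often and over how many candidates the (relatively expensive) re‑embedding step runs.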
|
|
|
|
|
--- |
|
|
|
|
|
### Training data and provenance |
|
|
- 1,000,000 synthetic posts total: |
|
|
- 500,000 crypto‑domain posts |
|
|
- 500,000 equities‑domain posts |
|
|
- All posts were programmatically generated via the OpenAI API (synthetic). No real social‑media content was used. |
|
|
- Embeddings: |
|
|
- Token‑level MPNet (default: `sentence-transformers/all-mpnet-base-v2`). |
|
|
- Cached to SQLite to avoid recomputation and allow resumable training. |
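A minimal sketch of such a cache is shown below, assuming each post's token‑level matrix is serialized to a float32 blob keyed by post id. The real schema and serialization may differ.

```python
import sqlite3
import numpy as np

# In-memory DB for illustration; the real cache would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS embeddings (
    post_id TEXT PRIMARY KEY,
    seq_len INTEGER,
    matrix  BLOB)""")

def put(post_id, mat):
    # Store the (seq_len, 768) matrix as raw float32 bytes.
    conn.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
                 (post_id, mat.shape[0], mat.astype(np.float32).tobytes()))

def get(post_id):
    row = conn.execute(
        "SELECT seq_len, matrix FROM embeddings WHERE post_id = ?",
        (post_id,)).fetchone()
    if row is None:
        return None  # cache miss: compute the embedding, then put()
    seq_len, blob = row
    return np.frombuffer(blob, dtype=np.float32).reshape(seq_len, 768)

put("post-0", np.random.randn(37, 768))
print(get("post-0").shape)  # (37, 768)
```

Because each post is keyed independently, a restarted training run can skip already‑cached posts, which is what makes the pipeline resumable.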
|
|
|
|
|
--- |
|
|
|
|
|
### Training procedure (baseline regimen) |
|
|
- Domain emphasis: 80% crypto / 20% equities per training phase. |
|
|
- Phased training (10% of available chunks per phase), evaluate after each phase: |
|
|
- In‑sample: small subset from the phase’s chunks |
|
|
- Out‑of‑sample: small hold‑out from both domains (not seen in the phase) |
|
|
- Early‑stop condition: stop if out‑of‑sample cosine degrades relative to prior phase. |
|
|
- Optimizer: AdamW |
|
|
- Learning rate (baseline finetune): 5e‑5 |
|
|
- Batch size: 16 |
|
|
- Input `max_source_length`: 256 |
|
|
- Target `max_target_length`: 128 |
|
|
- Checkpointing: every 2,000 steps and at phase end. |
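The phased schedule with early stopping can be sketched as follows. `train_phase` and `eval_oos_cosine` are hypothetical stand‑ins for the real training and evaluation code; only the control flow is illustrated.

```python
def run_phases(chunks, train_phase, eval_oos_cosine, max_phases=10):
    """Train on 10% of chunks per phase; stop when OOS cosine degrades."""
    phase_size = max(1, len(chunks) // 10)
    best_cos, kept_checkpoint = float("-inf"), None
    for phase in range(max_phases):
        phase_chunks = chunks[phase * phase_size:(phase + 1) * phase_size]
        if not phase_chunks:
            break
        ckpt = train_phase(phase_chunks)  # 80:20 crypto:equities inside
        cos = eval_oos_cosine(ckpt)       # held-out samples, both domains
        if cos < best_cos:
            break                         # OOS cosine degraded: early stop
        best_cos, kept_checkpoint = cos, ckpt
    return kept_checkpoint, best_cos

# Toy run: OOS cosine improves for three phases, then degrades, so the
# third phase's checkpoint is kept.
scores = iter([0.60, 0.68, 0.71, 0.69])
ckpt, cos = run_phases(list(range(100)),
                       train_phase=lambda c: f"ckpt-{c[0]}",
                       eval_oos_cosine=lambda _: next(scores))
print(ckpt, cos)  # ckpt-20 0.71
```

This matches the note below that the released checkpoint is the latest non‑degrading phase.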
|
|
|
|
|
Notes |
|
|
- Training used early stopping based on out‑of‑sample cosine. |
|
|
|
|
|
--- |
|
|
|
|
|
### Evaluation protocol (for the metrics below) |
|
|
- Sample size: 1,000 examples per domain drawn from cached embedding databases. |
|
|
- Decode config: `num_beams=8`, `length_penalty_alpha=0.6`, `lambda_sim=0.6`, `rescore_every_k=4`, `rescore_top_m=8`, `beta=10.0`, `enable_constraints=True`, `deterministic=True`. |
|
|
- Metrics: |
|
|
- `cosine_mean/median/p10/p90`: cosine between pooled MPNet embedding of generated text and the pooled MPNet target vector (higher is better). |
|
|
- `score_norm_mean`: length‑penalized language model score (higher is better; negative values are expected for log‑probability scores).
|
|
- `degenerate_pct`: % of clearly degenerate generations (very short/blank/only hashtags). |
|
|
- `domain_drift_pct`: share of crypto outputs containing equity‑like terms (and, for equities, the share containing crypto‑like terms). This is a heuristic text filter, intended only as a rough indicator.
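The two simplest of these metrics can be sketched directly. `pooled` mean‑pools a token‑level matrix, and the degeneracy check is an assumed heuristic (very short, or hashtags/cashtags only), not the exact filter used.

```python
import numpy as np

def pooled(token_matrix):
    # Mean-pool a (seq_len, 768) matrix to a single 768-dim vector.
    return token_matrix.mean(axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_degenerate(text, min_words=3):
    # Assumed heuristic: too few words once hashtags/cashtags are removed.
    words = [w for w in text.split() if not w.startswith(("#", "$"))]
    return len(words) < min_words

target = pooled(np.random.randn(40, 768))
print(cosine(target, target))      # ~1.0 for identical vectors
print(is_degenerate("#BTC #ETH"))  # True: hashtags only
```

`cosine_mean/median/p10/p90` are then simple summary statistics of this cosine over the 1,000 samples per domain.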
|
|
|
|
|
Results (current `models/baseline` checkpoint) |
|
|
- Crypto (n=1000) |
|
|
- cosine_mean: 0.681 |
|
|
- cosine_median: 0.843 |
|
|
- cosine_p10: 0.000 |
|
|
- cosine_p90: 0.984 |
|
|
- score_norm_mean: −1.977 |
|
|
- degenerate_pct: 5.2% |
|
|
- domain_drift_pct: 0.0% |
|
|
- Equities (n=1000) |
|
|
- cosine_mean: 0.778 |
|
|
- cosine_median: 0.901 |
|
|
- cosine_p10: 0.326 |
|
|
- cosine_p90: 0.986 |
|
|
- score_norm_mean: −1.344 |
|
|
- degenerate_pct: 2.2% |
|
|
- domain_drift_pct: 4.4% |
|
|
|
|
|
Interpretation |
|
|
- The model reconstructs many posts with strong embedding alignment (p90 ≈ 0.98 cosine in both domains). |
|
|
- Equities shows higher average/median cosine and lower degeneracy than crypto, despite its auxiliary (20%) training share; this likely reflects differences in data characteristics between the two domains.
|
|
- A small fraction of degenerate outputs exists in both domains (crypto ~5.2%, equities ~2.2%). |
|
|
- Domain drift is minimal from crypto→equities (0.0%) and present at a modest rate from equities→crypto (~4.4%) under the chosen heuristic. |
|
|
|
|
|
--- |
|
|
|
|
|
### Input contract and usage |
|
|
- **Input**: MPNet token‑level matrix `(seq_len, 768)` for a single post. Do not pass a pooled vector.
|
|
- **Tokenizer/model alignment** matters: use the same MPNet tokenizer/model version that produced the embeddings. |
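A small guard for the input contract is sketched below. With `sentence-transformers`, a token‑level matrix can be obtained via `model.encode(text, output_value="token_embeddings")`; verify that parameter against your installed version. The validator itself is a hypothetical helper.

```python
import numpy as np

def check_input(embedding):
    """Reject pooled vectors; accept a (seq_len, 768) token-level matrix."""
    arr = np.asarray(embedding)
    if arr.ndim != 2 or arr.shape[1] != 768:
        raise ValueError(
            f"expected a (seq_len, 768) token-level matrix, got {arr.shape}")
    return arr

check_input(np.zeros((42, 768)))  # OK: token-level matrix
try:
    check_input(np.zeros(768))    # pooled vector -> rejected
except ValueError as e:
    print(e)
```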
|
|
|
|
|
--- |
|
|
|
|
|
### Limitations and responsible use |
|
|
- Reconstruction is not guaranteed to match the original post text; it optimizes alignment within the MPNet embedding space and LM scoring. |
|
|
- The model can produce generic or incomplete outputs (see `degenerate_pct`). |
|
|
- Domain drift can occur depending on decode settings (see `domain_drift_pct`). |
|
|
- Data are synthetic programmatic generations, not real social‑media posts. Domain semantics may differ from real‑world distributions. |
|
|
- Do not use for reconstructing sensitive/private content or for attempting to de‑anonymize embedding corpora. This model is a research/diagnostic tool. |
|
|
|
|
|
--- |
|
|
|
|
|
### Reproducibility (high‑level) |
|
|
- Prepare embedding caches (not included): build local token‑level MPNet embedding caches for your corpora (e.g., via a data prep script) and store them in your own paths. |
|
|
- Baseline training: iterative 10% phases, 80:20 (crypto:equities), LR=5e‑5, BS=16, early‑stop on out‑of‑sample cosine degradation. |
|
|
- Evaluation: 1,000 samples/domain with the decode settings shown above. |
|
|
- The released checkpoint corresponds to the latest non‑degrading phase under early‑stopping. |
|
|
|
|
|
--- |
|
|
|
|
|
### License |
|
|
- Code: MIT (per repository). |
|
|
- Model weights: same as code unless declared otherwise upon release. |
|
|
|
|
|
--- |
|
|
|
|
|
### Citation |
|
|
If you use this model or codebase, please cite the Aparecium project and this baseline report. |
|
|
|
|
|
|
|
|
|