TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space

Can we generate entire sentences in parallel by treating token embeddings as a 2D image?

TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings as 2D latent images and generates them all at once using image diffusion, with no step-by-step autoregressive decoding.

📄 Read the full paper (PDF)
💻 GitHub: code + experiments


How It Works

```
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
                                                    ↓
                                          DiT diffusion (200 DDIM steps)
                                                    ↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
```

64 tokens are arranged in a 16×16 grid of 2×2 patches. The VQ-GAN compresses each patch to a 16-channel latent. The DiT generates the full latent image in a fixed 200 steps regardless of sequence length.
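
In code, the generation path is short. Below is a minimal sketch under assumptions of mine: `dit_sample`, `vqgan`, and `emb_table` are hypothetical stand-ins for the repo's modules, and cosine similarity is used for the nearest-neighbour lookup (the paper may use a different distance).

```python
import torch.nn.functional as F

# Hypothetical stand-ins for the repo's modules:
#   dit_sample(shape, steps) -> denoised (1, 16, 16, 16) latent
#   vqgan.decode(latent)     -> (1, 64, emb_dim) token embeddings
#   emb_table                -> (vocab_size, emb_dim) Qwen2.5-1.5B table

def generate_text(dit_sample, vqgan, emb_table, tokenizer):
    # 1. The DiT samples a latent image in a fixed 200 DDIM steps.
    latent = dit_sample(shape=(1, 16, 16, 16), steps=200)

    # 2. The VQ-GAN decodes the latent back to 64 token embeddings.
    emb_hat = vqgan.decode(latent)[0]                       # (64, emb_dim)

    # 3. Nearest-neighbour lookup against the embedding table.
    sims = F.normalize(emb_hat, dim=-1) @ F.normalize(emb_table, dim=-1).T
    token_ids = sims.argmax(dim=-1)                         # (64,)
    return tokenizer.decode(token_ids.tolist())
```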

Pipeline

[Figure: end-to-end pipeline diagram]


Results (Phase 4-A, Epoch 200)

[Figure: results dashboard]

| Metric | Value |
|---|---|
| VQ-GAN roundtrip accuracy | 89.8% |
| Composite score (best sample) | 0.372 |
| Composite score, mean (n=64) | 0.104 |
| Bigram coherence (best sample) | 0.831 |
| Real-word ratio (mean) | 0.683 |
| Median perplexity | 197 |
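
For context on what these numbers measure, here is a minimal sketch of two of the simpler metrics. Everything in it is an assumption of mine rather than the paper's code: real-word ratio is taken as the fraction of whitespace-separated words found in a reference wordlist, and perplexity is scored with an off-the-shelf GPT-2.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def real_word_ratio(text: str, wordlist: set) -> float:
    """Fraction of whitespace-separated words found in a reference wordlist."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    return sum(w in wordlist for w in words) / max(len(words), 1)

@torch.no_grad()
def perplexity(text: str) -> float:
    """exp(mean next-token negative log-likelihood) under the scoring LM."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean cross-entropy over positions
    return math.exp(loss.item())
```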

Top Generated Outputs

[Figure: top 5 generated samples]

Best sample (composite = 0.372, bigram = 0.831):

"a simulated adversary engagement. Your objectives include testing detection capabilities, exercising incident response, identifying security gaps. You employ realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security, and adapt your approach based on blue team responses."


Architecture

34-Channel DiT Input

[Figure: channel layout]

| Channels | Name | Role |
|---|---|---|
| ch 0 | position | 0→1 gradient in reading order |
| ch 1 | boundary | 1.0 at 2×2 patch edges; prevents token bleed |
| ch 2–17 | self-cond | previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33 | noisy latent | current x_t from forward diffusion |
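
A minimal sketch of how a 34-channel input with this layout could be assembled; the function name is hypothetical, and the exact boundary-mask pattern (here, the first row/column of each 2×2 patch) is my assumption.

```python
import torch

def build_dit_input(x_t, x0_prev):
    """Assemble the 34-channel DiT input.

    x_t:     (B, 16, 16, 16) noisy latent at the current step  -> ch 18-33
    x0_prev: (B, 16, 16, 16) previous step's x0 prediction     -> ch 2-17
    """
    B, _, H, W = x_t.shape

    # ch 0: 0->1 gradient over the grid in reading (row-major) order.
    pos = torch.arange(H * W, dtype=x_t.dtype).view(1, 1, H, W) / (H * W - 1)
    pos = pos.expand(B, 1, H, W)

    # ch 1: 1.0 at 2x2 patch edges (assumed: first row/column of each patch).
    ii = torch.arange(H).view(H, 1).expand(H, W)
    jj = torch.arange(W).view(1, W).expand(H, W)
    boundary = ((ii % 2 == 0) | (jj % 2 == 0)).to(x_t.dtype)
    boundary = boundary.view(1, 1, H, W).expand(B, 1, H, W)

    return torch.cat([pos, boundary, x0_prev, x_t], dim=1)  # (B, 34, 16, 16)
```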

Key Components

| Component | Parameters | Role |
|---|---|---|
| VQ-GAN (tokence_big_long) | 17.6M | encode/decode token embeddings ↔ latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | denoise the latent image |
| LSTM SequencePredictor | 239.7K | sequence-order auxiliary loss (weight=0.5) |
| Total | 58.0M | |
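
A minimal sketch of how the auxiliary loss could combine with the diffusion objective at weight 0.5; `SequencePredictor` here is a hypothetical stand-in for the repo's module, and the MSE targets are my assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequencePredictor(nn.Module):
    """Hypothetical stand-in: an LSTM that predicts each token embedding
    from the embeddings before it, rewarding left-to-right order."""
    def __init__(self, emb_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, emb_dim)

    def forward(self, emb_seq):            # (B, 64, emb_dim)
        h, _ = self.lstm(emb_seq[:, :-1])  # token t sees only tokens < t
        return self.head(h)                # (B, 63, emb_dim)

def total_loss(x0_pred, x0, emb_pred, emb_true, seq_weight=0.5):
    # Diffusion term on the latent image + sequence-order term on the
    # decoded token embeddings (targets shifted by one position).
    diffusion = F.mse_loss(x0_pred, x0)
    sequence = F.mse_loss(emb_pred, emb_true[:, 1:])
    return diffusion + seq_weight * sequence
```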

Denoising Process

[Figure: denoising trajectory]


Critical Findings

  1. LSTM sequence loss is mandatory: reducing its weight from 0.5 to 0.2 causes complete collapse
  2. Self-conditioning enables refinement: the biggest quality jump of all phases (see the sampling sketch after this list)
  3. Token boundary channel prevents bleed: the clearest visual improvement in latent space
  4. Best checkpoint = epoch 200 (not 300; overtraining is real)
  5. DDIM sweet spot = 200 steps: mode-collapse cliff at ≥300 steps
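
To make findings 2 and 5 concrete, here is a minimal sketch of a deterministic DDIM loop with x0 self-conditioning, reusing the hypothetical build_dit_input layout from the Architecture section. The x0-prediction parameterisation, scheduler handling, and dit signature are assumptions; only the 200-step budget and the feed-the-previous-x0 pattern come from this card.

```python
import torch

@torch.no_grad()
def ddim_sample(dit, alphas_cumprod, steps=200, shape=(1, 16, 16, 16)):
    """Deterministic DDIM (eta=0) with self-conditioning on x0."""
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    x_t = torch.randn(shape)
    x0_prev = torch.zeros(shape)  # self-conditioning channels start at zero

    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # The DiT sees [position, boundary, x0_prev, x_t] and predicts x0.
        x0_pred = dit(build_dit_input(x_t, x0_prev), t)
        eps = (x_t - a_t.sqrt() * x0_pred) / (1 - a_t).sqrt()

        # Step to the previous timestep (a_prev = 1 on the final step).
        a_prev = (alphas_cumprod[timesteps[i + 1]] if i + 1 < steps
                  else torch.tensor(1.0))
        x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
        x0_prev = x0_pred  # feed this prediction into the next step's input

    return x_t
```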

Training

  • Hardware: Apple Mac Mini M4, 64GB unified memory (MPS backend)
  • Base LM: Qwen/Qwen2.5-1.5B (embedding table only β€” not fine-tuned)
  • Corpus: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
  • Training time: ~2h VQ-GAN + ~22h DiT (300 epochs)

Citation

@misc{cj2026texitex,
  title  = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
  author = {Jean Paul, C J},
  year   = {2026},
  url    = {https://github.com/PurpleS3Cf0X/TexITex}
}

Author: Jean Paul C J (Unaffiliated)
