TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space

Can we generate entire sentences in parallel by treating token embeddings as a 2D image?

TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings as 2D latent images and generates them all at once using image diffusion, with no step-by-step autoregressive decoding.

📄 Read the full paper (PDF)
💻 GitHub: code + experiments


How It Works

```
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
                                                    ↓
                                          DiT diffusion (200 DDIM steps)
                                                    ↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
```

64 tokens are arranged in a 16×16 grid of 2×2 patches. The VQ-GAN compresses each patch to a 16-channel latent. The DiT generates the full latent image in a fixed 200 steps regardless of sequence length.
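
In code, the generation path is short. Below is a minimal sketch under assumptions of mine: `dit_sample`, `vqgan`, and `emb_table` are hypothetical stand-ins for the repo's modules, and cosine similarity is used for the nearest-neighbour lookup (the paper may use a different distance).

```python
import torch.nn.functional as F

# Hypothetical stand-ins for the repo's modules:
#   dit_sample(shape, steps) -> denoised (1, 16, 16, 16) latent
#   vqgan.decode(latent)     -> (1, 64, emb_dim) token embeddings
#   emb_table                -> (vocab_size, emb_dim) Qwen2.5-1.5B table

def generate_text(dit_sample, vqgan, emb_table, tokenizer):
    # 1. The DiT samples a latent image in a fixed 200 DDIM steps.
    latent = dit_sample(shape=(1, 16, 16, 16), steps=200)

    # 2. The VQ-GAN decodes the latent back to 64 token embeddings.
    emb_hat = vqgan.decode(latent)[0]                       # (64, emb_dim)

    # 3. Nearest-neighbour lookup against the embedding table.
    sims = F.normalize(emb_hat, dim=-1) @ F.normalize(emb_table, dim=-1).T
    token_ids = sims.argmax(dim=-1)                         # (64,)
    return tokenizer.decode(token_ids.tolist())
```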

Pipeline

[Figure: end-to-end pipeline diagram]


Results (Phase 4-A, Epoch 200)

[Figure: results dashboard]

| Metric | Value |
|---|---|
| VQ-GAN roundtrip accuracy | 89.8% |
| Composite score (best sample) | 0.372 |
| Composite score, mean (n=64) | 0.104 |
| Bigram coherence (best sample) | 0.831 |
| Real-word ratio (mean) | 0.683 |
| Median perplexity | 197 |
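
For context on what these numbers measure, here is a minimal sketch of two of the simpler metrics. Everything in it is an assumption of mine rather than the paper's code: real-word ratio is taken as the fraction of whitespace-separated words found in a reference wordlist, and perplexity is scored with an off-the-shelf GPT-2.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def real_word_ratio(text: str, wordlist: set) -> float:
    """Fraction of whitespace-separated words found in a reference wordlist."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    return sum(w in wordlist for w in words) / max(len(words), 1)

@torch.no_grad()
def perplexity(text: str) -> float:
    """exp(mean next-token negative log-likelihood) under the scoring LM."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean cross-entropy over positions
    return math.exp(loss.item())
```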

Top Generated Outputs

[Figure: top 5 generated samples]

Best sample (composite = 0.372, bigram = 0.831):

"a simulated adversary engagement. Your objectives include testing detection capabilities, exercising incident response, identifying security gaps. You employ realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security, and adapt your approach based on blue team responses."


Architecture

34-Channel DiT Input

[Figure: channel layout]

| Channels | Name | Role |
|---|---|---|
| ch 0 | position | 0→1 gradient in reading order |
| ch 1 | boundary | 1.0 at 2×2 patch edges; prevents token bleed |
| ch 2–17 | self-cond | previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33 | noisy latent | current x_t from forward diffusion |
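
A minimal sketch of how a 34-channel input with this layout could be assembled; the function name is hypothetical, and the exact boundary-mask pattern (here, the first row/column of each 2×2 patch) is my assumption.

```python
import torch

def build_dit_input(x_t, x0_prev):
    """Assemble the 34-channel DiT input.

    x_t:     (B, 16, 16, 16) noisy latent at the current step  -> ch 18-33
    x0_prev: (B, 16, 16, 16) previous step's x0 prediction     -> ch 2-17
    """
    B, _, H, W = x_t.shape

    # ch 0: 0->1 gradient over the grid in reading (row-major) order.
    pos = torch.arange(H * W, dtype=x_t.dtype).view(1, 1, H, W) / (H * W - 1)
    pos = pos.expand(B, 1, H, W)

    # ch 1: 1.0 at 2x2 patch edges (assumed: first row/column of each patch).
    ii = torch.arange(H).view(H, 1).expand(H, W)
    jj = torch.arange(W).view(1, W).expand(H, W)
    boundary = ((ii % 2 == 0) | (jj % 2 == 0)).to(x_t.dtype)
    boundary = boundary.view(1, 1, H, W).expand(B, 1, H, W)

    return torch.cat([pos, boundary, x0_prev, x_t], dim=1)  # (B, 34, 16, 16)
```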

Key Components

| Component | Parameters | Role |
|---|---|---|
| VQ-GAN (tokence_big_long) | 17.6M | encode/decode token embeddings ↔ latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | denoise the latent image |
| LSTM SequencePredictor | 239.7K | sequence-order auxiliary loss (weight=0.5) |
| Total | 58.0M | |
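
A minimal sketch of how the auxiliary loss could combine with the diffusion objective at weight 0.5; `SequencePredictor` here is a hypothetical stand-in for the repo's module, and the MSE targets are my assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequencePredictor(nn.Module):
    """Hypothetical stand-in: an LSTM that predicts each token embedding
    from the embeddings before it, rewarding left-to-right order."""
    def __init__(self, emb_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, emb_dim)

    def forward(self, emb_seq):            # (B, 64, emb_dim)
        h, _ = self.lstm(emb_seq[:, :-1])  # token t sees only tokens < t
        return self.head(h)                # (B, 63, emb_dim)

def total_loss(x0_pred, x0, emb_pred, emb_true, seq_weight=0.5):
    # Diffusion term on the latent image + sequence-order term on the
    # decoded token embeddings (targets shifted by one position).
    diffusion = F.mse_loss(x0_pred, x0)
    sequence = F.mse_loss(emb_pred, emb_true[:, 1:])
    return diffusion + seq_weight * sequence
```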

Denoising Process

[Figure: denoising trajectory]


Critical Findings

  1. LSTM sequence loss is mandatory: reducing its weight from 0.5 to 0.2 causes complete collapse
  2. Self-conditioning enables refinement: the biggest quality jump of all phases (see the sampling sketch after this list)
  3. Token boundary channel prevents bleed: the clearest visual improvement in latent space
  4. Best checkpoint = epoch 200 (not 300; overtraining is real)
  5. DDIM sweet spot = 200 steps: mode-collapse cliff at ≥300 steps
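
To make findings 2 and 5 concrete, here is a minimal sketch of a deterministic DDIM loop with x0 self-conditioning, reusing the hypothetical build_dit_input layout from the Architecture section. The x0-prediction parameterisation, scheduler handling, and dit signature are assumptions; only the 200-step budget and the feed-the-previous-x0 pattern come from this card.

```python
import torch

@torch.no_grad()
def ddim_sample(dit, alphas_cumprod, steps=200, shape=(1, 16, 16, 16)):
    """Deterministic DDIM (eta=0) with self-conditioning on x0."""
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    x_t = torch.randn(shape)
    x0_prev = torch.zeros(shape)  # self-conditioning channels start at zero

    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # The DiT sees [position, boundary, x0_prev, x_t] and predicts x0.
        x0_pred = dit(build_dit_input(x_t, x0_prev), t)
        eps = (x_t - a_t.sqrt() * x0_pred) / (1 - a_t).sqrt()

        # Step to the previous timestep (a_prev = 1 on the final step).
        a_prev = (alphas_cumprod[timesteps[i + 1]] if i + 1 < steps
                  else torch.tensor(1.0))
        x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
        x0_prev = x0_pred  # feed this prediction into the next step's input

    return x_t
```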

Training

  • Hardware: Apple Mac Mini M4, 64GB unified memory (MPS backend)
  • Base LM: Qwen/Qwen2.5-1.5B (embedding table only β€” not fine-tuned)
  • Corpus: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
  • Training time: ~2h VQ-GAN + ~22h DiT (300 epochs)

Citation

@misc{cj2026texitex,
  title  = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
  author = {Jean Paul, C J},
  year   = {2026},
  url    = {https://github.com/PurpleS3Cf0X/TexITex}
}

Author: Jean Paul C J (Unaffiliated)
