TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space
Can we generate entire sentences in parallel by treating token embeddings as a 2D image?
TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings as 2D latent images and generates them all at once using image diffusion, with no step-by-step autoregressive decoding.
Read the full paper (PDF)
GitHub: code + experiments
How It Works
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
↓
DiT diffusion (200 DDIM steps)
↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
64 tokens are arranged in a 16×16 grid of 2×2 patches. The VQ-GAN compresses each patch to a 16-channel latent. The DiT generates the full latent image in a fixed 200 steps regardless of sequence length.
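The glue code around the VQ-GAN is not shown in this README, so the sketch below is only an assumed illustration of the two bookend steps: tiling 64 per-token 2×2 patches into a 16×16 image in reading order, and mapping generated embeddings back to token ids by nearest-neighbour lookup against the base LM's embedding table. The function names, the reshape of a token embedding into a 2×2 patch, and the use of cosine similarity (rather than L2 distance) are assumptions.

```python
import torch
import torch.nn.functional as F

def tokens_to_grid(token_patches: torch.Tensor) -> torch.Tensor:
    """Tile 64 per-token 2x2 patches into one 16x16 image in reading order.

    token_patches: (64, C, 2, 2) -- one 2x2 patch per token
                   (assumption: each token embedding is reshaped into such a patch).
    returns:       (C, 16, 16) image, an 8x8 arrangement of 2x2 patches.
    """
    n, c, ph, pw = token_patches.shape
    g = int(n ** 0.5)                              # 8 patches per side
    x = token_patches.reshape(g, g, c, ph, pw)     # (row, col, C, 2, 2)
    x = x.permute(2, 0, 3, 1, 4)                   # (C, row, 2, col, 2)
    return x.reshape(c, g * ph, g * pw)            # (C, 16, 16)

def nearest_tokens(generated_embeddings: torch.Tensor,
                   embedding_table: torch.Tensor) -> torch.Tensor:
    """Map generated embeddings back to token ids by nearest-neighbour lookup.

    generated_embeddings: (64, D) embeddings recovered from the VQ-GAN decode.
    embedding_table:      (vocab, D) frozen embedding table of the base LM.
    returns:              (64,) token ids.
    """
    # Cosine similarity is one reasonable choice; the paper may use L2 instead.
    sims = F.normalize(generated_embeddings, dim=-1) @ \
           F.normalize(embedding_table, dim=-1).T      # (64, vocab)
    return sims.argmax(dim=-1)
```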
Results (Phase 4-A, Epoch 200)
| Metric | Value |
|---|---|
| VQ-GAN roundtrip accuracy | 89.8% |
| Composite score (best sample) | 0.372 |
| Composite score (mean, n=64) | 0.104 |
| Bigram coherence (best sample) | 0.831 |
| Real-word ratio (mean) | 0.683 |
| Median perplexity | 197 |
Top Generated Outputs
Best sample (composite = 0.372, bigram = 0.831):
"a simulated adversary engagement. Your objectives include testing detection capabilities, exercising incident response, identifying security gaps. You employ realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security, and adapt your approach based on blue team responses."
Architecture
34-Channel DiT Input
| Channels | Role |
|---|---|
| ch 0 (position) | 0–1 gradient in reading order |
| ch 1 (boundary) | 1.0 at 2×2 patch edges; prevents token bleed |
| ch 2–17 (self-conditioning) | Previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33 (noisy latent) | Current x_t from forward diffusion |
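Concretely, the DiT input can be assembled by concatenating these four channel groups over a batch of 16×16 latents. The helper below (`make_dit_input`) is a hypothetical sketch: the channel layout follows the table above, but the exact boundary pattern and any normalisation are assumptions.

```python
import torch

def make_dit_input(x_t: torch.Tensor, x0_selfcond: torch.Tensor) -> torch.Tensor:
    """Build the 34-channel DiT input for a batch of (16, 16, 16) latents.

    x_t:          (B, 16, 16, 16) noisy latent at the current timestep.
    x0_selfcond:  (B, 16, 16, 16) previous DDIM step's x0 prediction
                  (zeros on the first step).
    returns:      (B, 34, 16, 16)
    """
    B, _, H, W = x_t.shape

    # ch 0: 0-1 position gradient in reading order (row-major).
    pos = torch.arange(H * W, dtype=x_t.dtype, device=x_t.device)
    pos = (pos / (H * W - 1)).reshape(1, 1, H, W).expand(B, 1, H, W)

    # ch 1: token-boundary mask. Assumed pattern: 1.0 on the first row and
    # column of every 2x2 patch; the actual mask may differ.
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    boundary = ((yy % 2 == 0) | (xx % 2 == 0)).to(x_t.dtype).to(x_t.device)
    boundary = boundary.reshape(1, 1, H, W).expand(B, 1, H, W)

    # ch 2-17: self-conditioning; ch 18-33: noisy latent.
    return torch.cat([pos, boundary, x0_selfcond, x_t], dim=1)
```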
Key Components
| Component | Parameters | Role |
|---|---|---|
| VQ-GAN (tokence_big_long) | 17.6M | Encode/decode token embeddings to/from the latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | Denoise the latent image |
| LSTM SequencePredictor | 239.7K | Sequence-order auxiliary loss (weight=0.5) |
| Total | 58.0M | |
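The LSTM SequencePredictor contributes a sequence-order auxiliary loss at weight 0.5, but its exact target is not spelled out here. The sketch below shows one plausible shape of the combined objective; the helper names (`per_token_latents`, `training_loss`, `seq_head`) and the auxiliary target (predicting the next token's latent from the previous ones in reading order) are assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LAMBDA_SEQ = 0.5  # auxiliary-loss weight from this README

def per_token_latents(z: torch.Tensor) -> torch.Tensor:
    """Split a (B, 16, 16, 16) latent image into (B, 64, 64) per-token vectors.

    Each token occupies one 2x2 spatial patch of the 16-channel latent,
    read out in row-major (reading) order.
    """
    B, C, H, W = z.shape
    z = z.reshape(B, C, H // 2, 2, W // 2, 2)       # (B, C, 8, 2, 8, 2)
    z = z.permute(0, 2, 4, 1, 3, 5)                 # (B, 8, 8, C, 2, 2)
    return z.reshape(B, 64, C * 4)                  # (B, 64, 64)

def training_loss(dit_input, t, x0, dit, seq_lstm: nn.LSTM, seq_head: nn.Linear):
    """Diffusion loss + weighted sequence-order auxiliary loss (sketch).

    seq_lstm is assumed to be built with batch_first=True.
    """
    x0_pred = dit(dit_input, t)                     # (B, 16, 16, 16)
    diffusion_loss = F.mse_loss(x0_pred, x0)

    # Assumed auxiliary task: predict each token's latent from the preceding
    # ones, penalising samples whose tokens are not order-consistent.
    seq = per_token_latents(x0_pred)                # (B, 64, 64)
    target = per_token_latents(x0)                  # (B, 64, 64)
    hidden, _ = seq_lstm(seq[:, :-1])
    seq_loss = F.mse_loss(seq_head(hidden), target[:, 1:])

    return diffusion_loss + LAMBDA_SEQ * seq_loss
```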
Denoising Process
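The sampling loop itself is not included in this README, so the following is a minimal sketch of 200-step deterministic DDIM sampling (η = 0) with self-conditioning. It assumes the DiT predicts x0 (consistent with the self-conditioning channel carrying an x0 prediction) and reuses the hypothetical `make_dit_input` helper sketched above; the project's actual update rule and schedule handling may differ.

```python
import torch

@torch.no_grad()
def ddim_sample(dit, alphas_cumprod, steps=200, shape=(1, 16, 16, 16), device="cpu"):
    """Deterministic DDIM sampling with self-conditioning (sketch).

    alphas_cumprod: (T,) cumulative alpha-bar schedule from training.
    Returns the predicted clean latent x0, to be passed to the VQ-GAN decoder.
    """
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, steps, device=device).long()

    x_t = torch.randn(shape, device=device)
    x0_selfcond = torch.zeros(shape, device=device)   # empty on the first step

    for i, t in enumerate(timesteps):
        # 34-channel input: position, boundary, self-conditioning, noisy latent.
        inp = make_dit_input(x_t, x0_selfcond)
        x0_pred = dit(inp, t.expand(shape[0]))
        x0_selfcond = x0_pred                          # feed back into next step

        a_t = alphas_cumprod[t]
        eps = (x_t - a_t.sqrt() * x0_pred) / (1 - a_t).sqrt()

        if i + 1 < len(timesteps):
            a_prev = alphas_cumprod[timesteps[i + 1]]
            # Deterministic DDIM update (eta = 0).
            x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
        else:
            x_t = x0_pred

    return x_t
```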
Critical Findings
- LSTM sequence loss is mandatory: reducing the weight from 0.5 to 0.2 causes complete collapse
- Self-conditioning enables refinement: the biggest quality jump of all phases
- The token-boundary channel prevents bleed: the clearest visual improvement in latent space
- Best checkpoint is epoch 200, not 300: overtraining is real
- DDIM sweet spot is 200 steps, with a mode-collapse cliff at ≥300 steps
Training
- Hardware: Apple Mac Mini M4, 64GB unified memory (MPS backend)
- Base LM: Qwen/Qwen2.5-1.5B (embedding table only, not fine-tuned; see the sketch after this list)
- Corpus: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
- Training time: ~2h VQ-GAN + ~22h DiT (300 epochs)
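Since only the base LM's embedding table is used (frozen, never fine-tuned), it can be pulled out of the Hugging Face checkpoint once and reused both for building token patches and for the nearest-neighbour decode. A minimal sketch using the `transformers` library; the variable names are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Only the embedding table is needed; the transformer layers are never run.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B",
                                             torch_dtype=torch.float32)

embedding_table = model.get_input_embeddings().weight.detach().clone()  # (vocab_size, hidden_dim)
del model  # the rest of the LM is not used

# Example: embed (up to) a 64-token training sequence.
ids = tokenizer("example red-team playbook text", return_tensors="pt").input_ids[0][:64]
token_embeddings = embedding_table[ids]   # (<=64, hidden_dim)
```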
Citation
@misc{cj2026texitex,
title = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
author = {Jean Paul, C J},
year = {2026},
url = {https://github.com/PurpleS3Cf0X/TexITex}
}
Author: Jean Paul C J (Unaffiliated)




