---
license: apache-2.0
language:
- en
tags:
- diffusion
- text-generation
- non-autoregressive
- token-embedding
- cybersecurity
- DiT
- VQ-GAN
pipeline_tag: text-generation
---
# TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space
> **Can we generate entire sentences in parallel by treating token embeddings as a 2D image?**

TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings
as 2D latent images and generates them all at once using image diffusion, with no
step-by-step autoregressive decoding.

**[Read the full paper (PDF)](paper.pdf)**

**[GitHub: code + experiments](https://github.com/PurpleS3Cf0X/TexITex)**

---
## How It Works
```
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
                                                  ↓
                                   DiT diffusion (200 DDIM steps)
                                                  ↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
```
64 tokens are arranged in a **16×16 grid** of 2×2 patches. The VQ-GAN compresses
each patch to a 16-channel latent. The DiT generates the full latent image in a
fixed 200 steps regardless of sequence length.
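
The final nearest-neighbour lookup in the diagram can be sketched as follows. This is a minimal illustration and not the repository's code: the function name `nearest_tokens`, the cosine-similarity metric, and the toy table sizes are all assumptions, since the card does not say which distance the lookup uses.

```python
import numpy as np

def nearest_tokens(decoded: np.ndarray, table: np.ndarray) -> np.ndarray:
    """For each decoded vector, return the index of the closest table row.

    ASSUMPTION: closeness is cosine similarity; the real lookup may use L2.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    d = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
    t = table / np.linalg.norm(table, axis=1, keepdims=True)
    return np.argmax(d @ t.T, axis=1)

# Toy stand-in for an embedding table: 1000 "tokens", 32 dims (hypothetical sizes).
rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 32))
true_ids = rng.integers(0, 1000, size=64)

# Simulate an imperfect VQ-GAN decode by adding small noise to the true embeddings.
decoded = table[true_ids] + 0.05 * rng.normal(size=(64, 32))

recovered = nearest_tokens(decoded, table)
accuracy = float((recovered == true_ids).mean())
print(f"token recovery accuracy: {accuracy:.3f}")
```

With a real model, `table` would be the frozen embedding table of the base LM and `decoded` the 64 vectors produced by the VQ-GAN decoder.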

---
## Results (Phase 4-A, Epoch 200)

| Metric | Value |
|--------|-------|
| VQ-GAN roundtrip accuracy | **89.8%** |
| Composite score (best sample) | **0.372** |
| Composite score (mean, n=64) | 0.104 |
| Bigram coherence (best sample) | **0.831** |
| Real-word ratio β mean | 0.683 |
| Median perplexity | 197 |
### Top Generated Outputs

**Best sample** (composite = 0.372, bigram = 0.831):
> *"a simulated adversary engagement. Your objectives include testing detection
> capabilities, exercising incident response, identifying security gaps. You employ
> realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security,
> and adapt your approach based on blue team responses."*
---
## Architecture
### 34-Channel DiT Input

| Channels | Role |
|----------|------|
| ch 0: position | 0 to 1 gradient in reading order |
| ch 1: boundary | 1.0 at 2×2 patch edges, prevents token bleed |
| ch 2–17: self-cond | Previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33: noisy latent | Current x_t from forward diffusion |
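
As a rough sketch of how the 34 channels in the table could be stacked, the snippet below builds the two fixed channels and concatenates them with the self-conditioning and noisy-latent tensors. The helper name `build_dit_input` is hypothetical, and the boundary rule (marking the first row and first column of every 2×2 patch) is only one possible reading of "patch edges"; the actual implementation may differ.

```python
import numpy as np

GRID = 16       # latent height and width, from the (16,16,16) latent shape
LATENT_CH = 16  # latent channels
PATCH = 2       # 2x2 patch size

def build_dit_input(x_t: np.ndarray, x0_prev: np.ndarray) -> np.ndarray:
    """Stack position, boundary, self-conditioning and noisy latent channels."""
    # ch 0: position, a linear ramp from 0 to 1 in row-major reading order.
    position = np.linspace(0.0, 1.0, GRID * GRID, dtype=np.float32)
    position = position.reshape(1, GRID, GRID)

    # ch 1: boundary. ASSUMPTION: 1.0 on the first row and first column
    # of each patch, 0.0 elsewhere.
    idx = np.arange(GRID)
    on_edge = idx % PATCH == 0
    boundary = (on_edge[:, None] | on_edge[None, :]).astype(np.float32)[None]

    # ch 2-17: previous x0 estimate; ch 18-33: current noisy latent x_t.
    return np.concatenate([position, boundary, x0_prev, x_t], axis=0)

rng = np.random.default_rng(0)
x_t = rng.normal(size=(LATENT_CH, GRID, GRID)).astype(np.float32)
x0_prev = np.zeros_like(x_t)  # no previous estimate on the first step
dit_input = build_dit_input(x_t, x0_prev)
print(dit_input.shape)  # (34, 16, 16)
```
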
### Key Components
| Component | Parameters | Role |
|-----------|-----------|------|
| VQ-GAN (tokence_big_long) | 17.6M | Encode/decode token embeddings ↔ latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | Denoise the latent image |
| LSTM SequencePredictor | 239.7K | Sequence-order auxiliary loss (weight=0.5) |
| **Total (DiT + LSTM, excluding VQ-GAN)** | **58.0M** | |
### Denoising Process
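
A self-conditioned DDIM loop of the kind described in the channel table can be sketched as below. This is an illustration under stated assumptions, not the project's sampler: it assumes the network predicts x0 directly, uses deterministic DDIM (eta = 0), and picks an arbitrary linear `alpha_bar` schedule. The toy predictor `toy_predict_x0` stands in for the DiT.

```python
import numpy as np

def ddim_sample(predict_x0, shape, alpha_bar, rng):
    """Deterministic DDIM (eta = 0) that feeds the previous x0 estimate back in."""
    x_t = rng.normal(size=shape)   # start from pure noise
    x0_prev = np.zeros(shape)      # no estimate yet on the first step
    for i in range(len(alpha_bar) - 1, -1, -1):
        a_t = alpha_bar[i]
        a_prev = alpha_bar[i - 1] if i > 0 else 1.0
        # The model sees the noisy latent and its own previous x0 guess.
        x0 = predict_x0(x_t, x0_prev, i)
        # Noise implied by the current x0 estimate.
        eps = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
        # Move to the next (less noisy) timestep.
        x_t = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
        x0_prev = x0
    return x_t

rng = np.random.default_rng(0)
shape = (16, 16, 16)
target = rng.normal(size=shape)  # pretend "clean" latent

def toy_predict_x0(x_t, x0_prev, step):
    # Hypothetical stand-in for the DiT: refine the previous guess toward target.
    return 0.5 * x0_prev + 0.5 * target

# ASSUMPTION: 200 steps with a simple linear schedule; index 0 is least noisy.
alpha_bar = np.linspace(0.999, 0.01, 200)
sample = ddim_sample(toy_predict_x0, shape, alpha_bar, rng)
print(np.abs(sample - target).max())
```

Because the last update uses `a_prev = 1.0`, the returned latent equals the final x0 estimate, which is then passed to the VQ-GAN decoder.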

---
## Critical Findings
1. **LSTM sequence loss is mandatory**: reducing the weight from 0.5 to 0.2 causes complete collapse
2. **Self-conditioning enables refinement**: biggest quality jump of all phases
3. **Token boundary channel prevents bleed**: clearest visual improvement in latent space
4. **Best checkpoint = epoch 200** (not 300; overtraining is real)
5. **DDIM sweet spot = 200 steps**: mode-collapse cliff at ≥300 steps
---
## Training
- **Hardware**: Apple Mac Mini M4, 64GB unified memory (MPS backend)
- **Base LM**: Qwen/Qwen2.5-1.5B (embedding table only, not fine-tuned)
- **Corpus**: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
- **Training time**: ~2h VQ-GAN + ~22h DiT (300 epochs)
---
## Citation
```bibtex
@misc{cj2026texitex,
title = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
author = {Jean Paul, C J},
year = {2026},
url = {https://github.com/PurpleS3Cf0X/TexITex}
}
```
---
*Author: Jean Paul C J (Unaffiliated)*