---
license: apache-2.0
language:
- en
tags:
- diffusion
- text-generation
- non-autoregressive
- token-embedding
- cybersecurity
- DiT
- VQ-GAN
pipeline_tag: text-generation
---

# TexITex — Parallel Text Generation via Token Embedding Diffusion in 2D Image Space

> **Can we generate entire sentences in parallel by treating token embeddings as a 2D image?**

TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings
as 2D latent images and generates all of them at once with image diffusion, with no
step-by-step autoregressive decoding.

📄 **[Read the full paper (PDF)](paper.pdf)**  
💻 **[GitHub — code + experiments](https://github.com/PurpleS3Cf0X/TexITex)**

---

## How It Works

```
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
                                                    ↓
                                          DiT diffusion (200 DDIM steps)
                                                    ↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
```

Each of the 64 tokens occupies a 2×2 patch, and the resulting 8×8 arrangement of patches
fills a **16×16 grid**. The VQ-GAN compresses each patch to a 16-channel latent. The DiT
generates the full latent image in a fixed 200 DDIM steps, regardless of sequence length.
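
The grid arithmetic above (64 tokens × one 2×2 patch each = 16×16 cells) can be sketched as follows. This is a minimal illustration, not the repository's code: the raster (left-to-right, top-to-bottom) ordering of patches and the helper names `token_to_patch_slices` and `token_index_map` are assumptions made for this example.

```python
import numpy as np

NUM_TOKENS = 64  # sequence length stated above
PATCH = 2        # each token occupies a 2x2 patch
GRID = 16        # 16x16 grid, i.e. 8x8 patches


def token_to_patch_slices(i):
    """Return (row_slice, col_slice) for token i, ASSUMING raster order."""
    patches_per_row = GRID // PATCH  # 8 patches per row
    r, c = divmod(i, patches_per_row)
    return (slice(r * PATCH, (r + 1) * PATCH),
            slice(c * PATCH, (c + 1) * PATCH))


def token_index_map():
    """Build a 16x16 map where each cell holds the index of the token that owns it."""
    grid = np.full((GRID, GRID), -1, dtype=np.int64)
    for i in range(NUM_TOKENS):
        rows, cols = token_to_patch_slices(i)
        grid[rows, cols] = i
    return grid


owner = token_index_map()
print(owner[:4, :6])  # top-left corner of the ownership map
```

Every cell ends up owned by exactly one token, which is why 64 tokens fill the grid with no padding.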

![Pipeline](fig_pipeline.png)
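
The last stage of the pipeline, nearest-neighbour lookup, maps each decoded embedding back to a token id. The sketch below uses cosine similarity against a small random table; the similarity metric, the function name `nearest_token_ids`, and the toy table are assumptions for illustration. In the real pipeline the table would be the Qwen2.5-1.5B embedding matrix, and the metric used by the repository may differ.

```python
import numpy as np


def nearest_token_ids(decoded, table):
    """Map decoded embeddings (n, d) to token ids via cosine similarity.

    table has shape (vocab, d). Cosine similarity is an ASSUMPTION here;
    the model card does not state which distance is used.
    """
    d = decoded / (np.linalg.norm(decoded, axis=1, keepdims=True) + 1e-8)
    t = table / (np.linalg.norm(table, axis=1, keepdims=True) + 1e-8)
    return np.argmax(d @ t.T, axis=1)


# Toy check: perturb known rows of a random table and try to recover them.
rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 32))           # stand-in for a real embedding table
true_ids = rng.integers(0, 1000, size=64)     # 64 tokens, as in the card
noisy = table[true_ids] + 0.05 * rng.normal(size=(64, 32))
recovered = nearest_token_ids(noisy, table)
accuracy = float((recovered == true_ids).mean())
print(f"toy roundtrip accuracy: {accuracy:.3f}")
```

The toy accuracy printed here says nothing about the model itself; the measured VQ-GAN roundtrip accuracy is the figure reported in the results table.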

---

## Results (Phase 4-A, Epoch 200)

![Results Dashboard](fig_results_dashboard.png)

| Metric | Value |
|--------|-------|
| VQ-GAN roundtrip accuracy | **89.8%** |
| Composite score — best sample | **0.372** |
| Composite score — mean (n=64) | 0.104 |
| Bigram coherence — best sample | **0.831** |
| Real-word ratio — mean | 0.683 |
| Median perplexity | 197 |

### Top Generated Outputs

![Top 5 Samples](fig_top5_samples.png)

**Best sample** (composite = 0.372, bigram = 0.831):
> *"a simulated adversary engagement. Your objectives include testing detection
> capabilities, exercising incident response, identifying security gaps. You employ
> realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security,
> and adapt your approach based on blue team responses."*

---

## Architecture

### 34-Channel DiT Input

![Channel Layout](fig_channels.png)

| Channels | Role |
|----------|------|
| ch 0 — position | 0→1 gradient in reading order |
| ch 1 — boundary | 1.0 at 2×2 patch edges, prevents token bleed |
| ch 2–17 — self-cond | Previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33 — noisy latent | Current x_t from forward diffusion |
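
As a rough sketch of how the 34 channels in the table could be assembled: the channel-first layout, the per-cell raster ramp for the position channel, the choice to mark the first row and column of each 2×2 patch as the "edge", and the all-zero self-conditioning input on the first step are all assumptions for illustration. The card does not specify these details, and the function names are hypothetical.

```python
import numpy as np

GRID, PATCH, LATENT_CH = 16, 2, 16


def position_channel():
    """ch 0: ramp from 0 to 1 in raster (reading) order. Per-cell ramp is ASSUMED."""
    return np.linspace(0.0, 1.0, GRID * GRID, dtype=np.float32).reshape(GRID, GRID)


def boundary_channel():
    """ch 1: 1.0 on the first row/column of every 2x2 patch (ASSUMED pattern)."""
    starts = (np.arange(GRID) % PATCH) == 0
    return (starts[:, None] | starts[None, :]).astype(np.float32)


def build_dit_input(x_t, x0_prev=None):
    """Stack position, boundary, self-cond (16 ch) and noisy latent (16 ch)."""
    if x0_prev is None:
        # No previous x0 prediction on the first step: use zeros (ASSUMPTION).
        x0_prev = np.zeros_like(x_t)
    return np.concatenate(
        [position_channel()[None], boundary_channel()[None], x0_prev, x_t],
        axis=0,
    )


x_t = np.random.default_rng(0).normal(size=(LATENT_CH, GRID, GRID)).astype(np.float32)
dit_input = build_dit_input(x_t)
print(dit_input.shape)  # (34, 16, 16)
```

The channel count works out as 1 + 1 + 16 + 16 = 34, matching the section heading.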

### Key Components

| Component | Parameters | Role |
|-----------|-----------|------|
| VQ-GAN (tokence_big_long) | 17.6M | Encode/decode token embeddings ↔ latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | Denoise the latent image |
| LSTM SequencePredictor | 239.7K | Sequence-order auxiliary loss (weight=0.5) |
| **Total (DiT + LSTM; VQ-GAN not included)** | **58.0M** | |

### Denoising Process

![Denoising](fig_denoising.png)
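
The loop below is a minimal sketch of deterministic DDIM sampling (eta = 0) with self-conditioning, where each step feeds the previous x0 prediction back to the model. It is not the repository's sampler: the linear beta schedule, the 1000 training timesteps, the x0-prediction parameterisation, and the `predict_x0` callable (a stand-in for the DiT) are assumptions made for this example.

```python
import numpy as np


def ddim_sample(predict_x0, shape, num_steps=200, train_steps=1000, seed=0):
    """Deterministic DDIM sampling with self-conditioning.

    predict_x0(x_t, x0_prev, t) is a hypothetical stand-in for the DiT and
    must return an x0 estimate with the same shape as x_t.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, train_steps)  # ASSUMED linear schedule
    alpha_bar = np.cumprod(1.0 - betas)
    timesteps = np.linspace(train_steps - 1, 0, num_steps).round().astype(int)

    x_t = rng.normal(size=shape)   # start from pure noise
    x0_prev = np.zeros(shape)      # no previous prediction on the first step
    for i, t in enumerate(timesteps):
        x0 = predict_x0(x_t, x0_prev, t)
        a_t = alpha_bar[t]
        # Noise implied by the current x_t and the predicted x0.
        eps = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
        # The last step jumps to alpha_bar = 1, i.e. the clean sample.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x_t = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
        x0_prev = x0               # self-conditioning input for the next step
    return x_t


# Toy usage: a "model" that always predicts the same target latent.
target = np.full((16, 16, 16), 0.5)
sample = ddim_sample(lambda x_t, x0_prev, t: target, target.shape, num_steps=200)
print(np.allclose(sample, target))
```

Because the step count is a sampler argument rather than a function of the text, the cost of generation stays fixed at `num_steps` model calls.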

---

## Critical Findings

1. **LSTM sequence loss is mandatory** — reducing its weight from 0.5 to 0.2 causes complete collapse
2. **Self-conditioning enables refinement** — biggest quality jump of all phases
3. **Token boundary channel prevents bleed** — clearest visual improvement in latent space
4. **Best checkpoint = epoch 200** (not 300; training beyond epoch 200 overtrains the model)
5. **DDIM sweet spot = 200 steps** — mode-collapse cliff at ≥300 steps

---

## Training

- **Hardware**: Apple Mac Mini M4, 64GB unified memory (MPS backend)
- **Base LM**: Qwen/Qwen2.5-1.5B (embedding table only — not fine-tuned)
- **Corpus**: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
- **Training time**: ~2h VQ-GAN + ~22h DiT (300 epochs)

---

## Citation

```bibtex
@misc{cj2026texitex,
  title  = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
  author = {Jean Paul, C J},
  year   = {2026},
  url    = {https://github.com/PurpleS3Cf0X/TexITex}
}
```

---

*Author: Jean Paul C J (Unaffiliated)*