---
license: apache-2.0
language:
- en
tags:
- diffusion
- text-generation
- non-autoregressive
- token-embedding
- cybersecurity
- DiT
- VQ-GAN
pipeline_tag: text-generation
---

# TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space

> **Can we generate entire sentences in parallel by treating token embeddings as a 2D image?**

TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings
as 2D latent images and generates them all at once using image diffusion, with no
step-by-step autoregressive decoding.

📄 **[Read the full paper (PDF)](paper.pdf)**  
💻 **[GitHub: code + experiments](https://github.com/PurpleS3Cf0X/TexITex)**

---

## How It Works

```
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
                                                    ↓
                                          DiT diffusion (200 DDIM steps)
                                                    ↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
```

64 tokens are arranged in a **16×16 grid** of 2×2 patches. The VQ-GAN compresses
each patch to a 16-channel latent. The DiT generates the full latent image in a
fixed 200 steps regardless of sequence length.
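
The grid arithmetic can be checked with a short sketch. It assumes each token occupies one 2×2 patch and that patches are placed in row-major reading order on an 8×8 patch grid (8 × 2 = 16 cells per side). The README does not state the ordering, and `token_to_cells` is a hypothetical helper, not a function from the repository.

```python
# Sketch of the layout arithmetic: 64 tokens x (2x2 cells) = 256 cells = 16x16.
# Assumption: row-major (reading-order) patch placement; not confirmed by the source.

NUM_TOKENS = 64
PATCH = 2                         # each token covers a PATCH x PATCH block of cells
GRID = 16                         # side length of the cell grid
PATCHES_PER_ROW = GRID // PATCH   # 8 patches per row


def token_to_cells(token_index):
    """Return the (row, col) cells covered by one token's 2x2 patch."""
    if not 0 <= token_index < NUM_TOKENS:
        raise ValueError("token index out of range")
    patch_row, patch_col = divmod(token_index, PATCHES_PER_ROW)
    top, left = patch_row * PATCH, patch_col * PATCH
    return [(top + dr, left + dc) for dr in range(PATCH) for dc in range(PATCH)]


# Every cell of the 16x16 grid is covered exactly once.
covered = [cell for t in range(NUM_TOKENS) for cell in token_to_cells(t)]
assert len(covered) == GRID * GRID
assert len(set(covered)) == GRID * GRID

print(token_to_cells(0))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(token_to_cells(9))   # [(2, 2), (2, 3), (3, 2), (3, 3)]
```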

![Pipeline](fig_pipeline.png)
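
The last stage of the pipeline, nearest-neighbour lookup, maps each decoded embedding back to a token id. A minimal sketch, assuming cosine similarity as the metric (the README does not say which one is used) and a toy three-entry table standing in for the Qwen embedding table; `cosine` and `nearest_token` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_token(vector, table):
    """Return the index of the table row most similar to `vector`."""
    return max(range(len(table)), key=lambda i: cosine(vector, table[i]))


# Toy embedding table (assumption): row i is the embedding of token id i.
table = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Two slightly noisy "decoded" embeddings, as a VQ-GAN decoder might emit.
decoded = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]
token_ids = [nearest_token(v, table) for v in decoded]
print(token_ids)  # [0, 2]
```

With a real vocabulary this brute-force scan would normally be replaced by a batched matrix product, but the selection rule stays the same.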

---

## Results (Phase 4-A, Epoch 200)

![Results Dashboard](fig_results_dashboard.png)

| Metric | Value |
|--------|-------|
| VQ-GAN roundtrip accuracy | **89.8%** |
| Composite score (best sample) | **0.372** |
| Composite score (mean, n=64) | 0.104 |
| Bigram coherence (best sample) | **0.831** |
| Real-word ratio (mean) | 0.683 |
| Median perplexity | 197 |

### Top Generated Outputs

![Top 5 Samples](fig_top5_samples.png)

**Best sample** (composite = 0.372, bigram = 0.831):
> *"a simulated adversary engagement. Your objectives include testing detection
> capabilities, exercising incident response, identifying security gaps. You employ
> realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security,
> and adapt your approach based on blue team responses."*

---

## Architecture

### 34-Channel DiT Input

![Channel Layout](fig_channels.png)

| Channels | Role |
|----------|------|
| ch 0: position | 0→1 gradient in reading order |
| ch 1: boundary | 1.0 at 2×2 patch edges, prevents token bleed |
| ch 2–17: self-cond | Previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33: noisy latent | Current x_t from forward diffusion |
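
A minimal sketch of how the four channel groups could be stacked, using plain nested lists to stay dependency-free. Several details are assumptions, since the table gives only one-line descriptions: the position gradient is taken per cell in row-major order, a cell counts as a boundary cell when it is the first row or first column of a 2×2 patch, and the self-conditioning channels are zero on the first step. `build_dit_input` and the other names are hypothetical.

```python
import random

GRID, PATCH, LATENT_CH = 16, 2, 16


def position_channel():
    """ch 0: values rising from 0.0 (first cell) to 1.0 (last cell), row-major."""
    last = GRID * GRID - 1
    return [[(r * GRID + c) / last for c in range(GRID)] for r in range(GRID)]


def boundary_channel():
    """ch 1: 1.0 where a cell starts a new patch row or column (assumed pattern)."""
    return [
        [1.0 if (r % PATCH == 0 or c % PATCH == 0) else 0.0 for c in range(GRID)]
        for r in range(GRID)
    ]


def build_dit_input(self_cond, noisy_latent):
    """Stack [position, boundary, 16 self-cond, 16 noisy] into 34 channels."""
    if len(self_cond) != LATENT_CH or len(noisy_latent) != LATENT_CH:
        raise ValueError("expected 16 channels for each latent")
    return [position_channel(), boundary_channel()] + list(self_cond) + list(noisy_latent)


def zero_latent():
    """All-zero (16, 16, 16) latent, used as self-conditioning on the first step."""
    return [[[0.0] * GRID for _ in range(GRID)] for _ in range(LATENT_CH)]


rng = random.Random(0)
noisy = [
    [[rng.gauss(0.0, 1.0) for _ in range(GRID)] for _ in range(GRID)]
    for _ in range(LATENT_CH)
]
x = build_dit_input(zero_latent(), noisy)
print(len(x), len(x[0]), len(x[0][0]))  # 34 16 16
```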

### Key Components

| Component | Parameters | Role |
|-----------|-----------|------|
| VQ-GAN (tokence_big_long) | 17.6M | Encode/decode token embeddings ↔ latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | Denoise the latent image |
| LSTM SequencePredictor | 239.7K | Sequence-order auxiliary loss (weight=0.5) |
| **Total (DiT + LSTM)** | **58.0M** | |
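
The auxiliary-loss weight in the table can be read as a simple weighted sum. This is a sketch under the assumption that the two terms are combined additively, which the README does not spell out; `total_loss` is a hypothetical name.

```python
SEQ_LOSS_WEIGHT = 0.5  # weight reported for the LSTM SequencePredictor loss


def total_loss(diffusion_loss, sequence_loss, weight=SEQ_LOSS_WEIGHT):
    """Assumed combination: denoising loss plus weighted sequence-order loss."""
    return diffusion_loss + weight * sequence_loss


print(total_loss(0.8, 0.4))  # 1.0
```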

### Denoising Process

![Denoising](fig_denoising.png)

---

## Critical Findings

1. **LSTM sequence loss is mandatory**: reducing the weight from 0.5 to 0.2 causes complete collapse
2. **Self-conditioning enables refinement**: biggest quality jump of all phases
3. **Token boundary channel prevents bleed**: clearest visual improvement in latent space
4. **Best checkpoint = epoch 200** (not 300; overtraining is real)
5. **DDIM sweet spot = 200 steps**: mode-collapse cliff at ≥300 steps
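
Findings 2 and 5 both concern the sampling loop. The sketch below shows a deterministic DDIM update (eta = 0) with self-conditioning, where each step receives the previous step's x0 prediction. It is an illustration, not the repository's sampler: the 1000 training timesteps, the cosine-style `alpha_bar` schedule, the zero initial self-conditioning, and `toy_denoiser` (a stand-in for the DiT) are all assumptions.

```python
import math
import random

T = 1000          # assumed number of training timesteps
NUM_STEPS = 200   # DDIM steps reported in this README


def alpha_bar(t):
    """Assumed cumulative signal level: exactly 1 at t=0, close to 0 at t=T."""
    return math.cos(0.5 * math.pi * t / T) ** 2


def toy_denoiser(x_t, self_cond, t):
    """Stand-in for the DiT: moves the previous x0 estimate halfway to 0.5."""
    return [0.5 * (0.5 + sc) for sc in self_cond]


def ddim_sample(denoiser, size, num_steps=NUM_STEPS, seed=0):
    """Deterministic DDIM sampling over a flat latent of `size` values."""
    if not 0 < num_steps <= T:
        raise ValueError("num_steps must be in 1..T")
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]   # x_T ~ N(0, I)
    x0_prev = [0.0] * size                           # no estimate before step 1
    ts = [T * i // num_steps for i in range(num_steps, -1, -1)]  # T, ..., 0
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alpha_bar(t), alpha_bar(t_prev)
        x0 = denoiser(x, x0_prev, t)                 # self-conditioned x0 prediction
        # Noise implied by the current sample and the predicted clean latent.
        eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
               for xi, x0i in zip(x, x0)]
        # Re-noise the prediction to the next (lower) noise level.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
        x0_prev = x0                                 # feed forward to the next step
    return x


sample = ddim_sample(toy_denoiser, size=4)
print([round(v, 6) for v in sample])
```

Because the step count is fixed, the cost of this loop does not depend on how many tokens the latent image encodes.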

---

## Training

- **Hardware**: Apple Mac Mini M4, 64GB unified memory (MPS backend)
- **Base LM**: Qwen/Qwen2.5-1.5B (embedding table only, not fine-tuned)
- **Corpus**: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
- **Training time**: ~2h VQ-GAN + ~22h DiT (300 epochs)

---

## Citation

```bibtex
@misc{cj2026texitex,
  title  = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
  author = {Jean Paul, C J},
  year   = {2026},
  url    = {https://github.com/PurpleS3Cf0X/TexITex}
}
```

---

*Author: Jean Paul C J (Unaffiliated)*