---
license: apache-2.0
language:
- en
tags:
- diffusion
- text-generation
- non-autoregressive
- token-embedding
- cybersecurity
- DiT
- VQ-GAN
pipeline_tag: text-generation
---

# TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space

> **Can we generate entire sentences in parallel by treating token embeddings as a 2D image?**

TexITex (Token-Image-Token) is a research proof-of-concept that encodes token embeddings
as 2D latent images and generates them all at once using image diffusion, with no
step-by-step autoregressive decoding.

📄 **[Read the full paper (PDF)](paper.pdf)**
💻 **[GitHub: code + experiments](https://github.com/PurpleS3Cf0X/TexITex)**

---

## How It Works

```
Text → token embeddings → VQ-GAN encode → (16,16,16) latent image
                                                  ↓
                                   DiT diffusion (200 DDIM steps)
                                                  ↓
Text ← nearest-neighbour lookup ← VQ-GAN decode ← generated latent
```
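
The last step in the diagram, the nearest-neighbour lookup, can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the function name `nearest_token_ids`, the use of cosine similarity, and the toy table sizes are all assumptions. The real pipeline would compare against the Qwen2.5 embedding table instead of random vectors.

```python
import numpy as np

def nearest_token_ids(decoded, table):
    """Map each decoded embedding to the id of its closest row in `table`.

    decoded: (seq_len, dim) embeddings coming out of the decoder (assumed shape).
    table:   (vocab_size, dim) reference embedding table.
    Cosine similarity is an assumption; the project may use another metric.
    """
    d = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
    t = table / np.linalg.norm(table, axis=1, keepdims=True)
    return np.argmax(d @ t.T, axis=1)

# Toy stand-ins: a random 1000-entry "vocabulary" and 64 slightly noisy embeddings.
rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 32))
true_ids = rng.integers(0, 1000, size=64)
decoded = table[true_ids] + 0.05 * rng.normal(size=(64, 32))

recovered = nearest_token_ids(decoded, table)
```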

Each of the 64 tokens occupies a 2×2 patch, and the patches tile a **16×16 grid**
(8×8 patches). The VQ-GAN compresses each patch to a 16-channel latent. The DiT generates
the full latent image in a fixed 200 DDIM steps, regardless of sequence length.

![Pipeline Overview](figures/fig1_pipeline_overview.png)

---

## Results (Phase 4-A, Epoch 200)

![Generation Quality](figures/fig3_generation_quality.png)

| Metric | Value |
|--------|-------|
| VQ-GAN roundtrip accuracy | **89.8%** |
| Composite score, best sample | **0.372** |
| Composite score, mean (n=64) | 0.104 |
| Bigram coherence, best sample | **0.831** |
| Real-word ratio, mean | 0.683 |
| Median perplexity | 197 |

### Top Generated Outputs

![Top Samples](figures/fig5_top_samples.png)

**Best sample** (composite = 0.372, bigram = 0.831):
> *"a simulated adversary engagement. Your objectives include testing detection
> capabilities, exercising incident response, identifying security gaps. You employ
> realistic adversary TTPs mapped to MITRE ATT&CK, maintain operational security,
> and adapt your approach based on blue team responses."*

---

## Architecture

### 34-Channel DiT Input

![DiT Conditioning Channels](figures/fig4_channels.png)

| Channels | Role |
|----------|------|
| ch 0: position | 0 to 1 gradient in reading order |
| ch 1: boundary | 1.0 at 2×2 patch edges; prevents token bleed |
| ch 2–17: self-cond | Previous DDIM step's x0 prediction (iterative refinement) |
| ch 18–33: noisy latent | Current x_t from forward diffusion |
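
To make the channel layout in the table concrete, here is a hedged sketch of how the 34-channel input could be assembled. The helper names (`position_channel`, `boundary_channel`, `build_dit_input`) are hypothetical. The exact definitions are also assumptions: the position gradient is taken over cells in raster order, the boundary channel marks the first row and column of each 2×2 patch, and the self-conditioning slot is zero-filled when no previous estimate exists. The actual implementation may differ on all three.

```python
import numpy as np

H = W = 16        # latent grid size, from the (16,16,16) latent
LATENT_CH = 16    # latent channels
PATCH = 2         # each token covers a 2x2 patch

def position_channel():
    # Assumption: the 0-to-1 gradient runs over cells in raster (reading) order.
    return np.linspace(0.0, 1.0, H * W, dtype=np.float32).reshape(1, H, W)

def boundary_channel():
    # Assumption: mark the first row and first column of every 2x2 patch.
    rows = (np.arange(H) % PATCH == 0)[:, None]
    cols = (np.arange(W) % PATCH == 0)[None, :]
    return (rows | cols).astype(np.float32)[None]

def build_dit_input(x_t, x0_prev=None):
    """Stack position, boundary, self-conditioning, and noisy latent channels."""
    if x0_prev is None:
        # Assumption: zeros stand in when there is no previous x0 estimate.
        x0_prev = np.zeros_like(x_t)
    return np.concatenate(
        [position_channel(), boundary_channel(), x0_prev, x_t], axis=0
    )

x_t = np.random.default_rng(0).normal(size=(LATENT_CH, H, W)).astype(np.float32)
dit_input = build_dit_input(x_t)   # ch 0, ch 1, ch 2-17, ch 18-33
```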

### Key Components

| Component | Parameters | Role |
|-----------|-----------|------|
| VQ-GAN (tokence_big_long) | 17.6M | Encode/decode token embeddings ↔ latent image |
| DiT (depth=12, dim=512, heads=8) | 57.8M | Denoise the latent image |
| LSTM SequencePredictor | 239.7K | Sequence-order auxiliary loss (weight=0.5) |
| **Total (DiT + LSTM)** | **58.0M** | |

### Denoising Process

![Denoising Steps](figures/fig2_denoising_steps.png)

---

## Critical Findings

1. **LSTM sequence loss is mandatory**: reducing its weight from 0.5 to 0.2 causes complete collapse
2. **Self-conditioning enables refinement**: the largest quality jump of any phase
3. **Token boundary channel prevents bleed**: the clearest visual improvement in latent space
4. **Best checkpoint = epoch 200** (not 300; overtraining is real)
5. **DDIM sweet spot = 200 steps**: mode-collapse cliff at ≥300 steps
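
The interaction between DDIM sampling and self-conditioning in findings 2 and 5 can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the project's sampler: `ddim_sample`, `predict_x0`, and `toy_denoiser` are hypothetical names, the linear beta schedule with 1000 training timesteps is assumed, the model is assumed to predict x0 directly, and the update is the deterministic (eta = 0) DDIM step. A trivial function stands in for the DiT.

```python
import numpy as np

def ddim_sample(predict_x0, shape, num_steps=200, train_steps=1000, seed=0):
    """Deterministic DDIM sampling with self-conditioning (illustrative only)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, train_steps)   # assumed linear schedule
    alpha_bar = np.cumprod(1.0 - betas)
    # Evenly spaced timesteps from most noisy to least noisy.
    timesteps = np.linspace(train_steps - 1, 0, num_steps).round().astype(int)

    x = rng.normal(size=shape)      # start from pure Gaussian noise
    x0_prev = np.zeros(shape)       # no x0 estimate before the first step
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # Self-conditioning: the previous x0 estimate is fed back to the model.
        x0 = predict_x0(x, x0_prev, t)
        eps = (x - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
        a_next = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps
        x0_prev = x0
    return x

# Toy stand-in for the DiT: refines the previous estimate halfway toward a target.
target = np.ones((16, 16, 16))

def toy_denoiser(x_t, x0_prev, t):
    return 0.5 * x0_prev + 0.5 * target

sample = ddim_sample(toy_denoiser, shape=(16, 16, 16), num_steps=200)
```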

---

## Training

- **Hardware**: Apple Mac Mini M4, 64 GB unified memory (MPS backend)
- **Base LM**: Qwen/Qwen2.5-1.5B (embedding table only, not fine-tuned)
- **Corpus**: Cybersecurity domain (red-team TTPs + blue-team playbooks, 50K sequences)
- **Training time**: ~2h VQ-GAN + ~22h DiT (300 epochs)

---

## Citation

```bibtex
@misc{cj2026texitex,
  title  = {TexITex: Parallel Text Generation via Token Embedding Diffusion in 2D Image Space},
  author = {Jean Paul, C J},
  year   = {2026},
  url    = {https://github.com/PurpleS3Cf0X/TexITex}
}
```

---

*Author: Jean Paul C J (Unaffiliated)*