---
datasets:
- dvruette/lm1b
papers:
- arxiv: 2604.11748
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- perplexity
pipeline_tag: text-generation
---
# LangFlow
LangFlow is a continuous diffusion language model. Unlike discrete diffusion models such as MDLM, SEDD, and DUO, which corrupt and denoise token identities, LangFlow runs the diffusion process directly on continuous token embeddings, enabling smoother denoising dynamics.
For more details, please see our paper: [LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling](https://arxiv.org/abs/2604.11748).
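To make "diffusion on continuous token embeddings" concrete, here is a toy sketch of the general idea: token embeddings are blended with Gaussian noise, and a continuous vector is mapped back to a token by nearest-neighbour lookup. Everything in it is an assumption chosen for illustration (the 4-token vocabulary, the 2-D embeddings, the linear interpolation schedule, and the helper names `add_noise` and `nearest_tokens`); it is not LangFlow's training or sampling procedure, which is described in the paper.

```python
import math
import random

# Hypothetical 4-token vocabulary with hand-picked 2-D embeddings (illustration only).
VOCAB = ["the", "cat", "sat", "down"]
EMBEDDINGS = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]


def add_noise(token_ids, t, rng):
    """Blend each token embedding with Gaussian noise; t=0 is clean, t=1 is pure noise."""
    noisy = []
    for tid in token_ids:
        x0 = EMBEDDINGS[tid]
        eps = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        noisy.append(tuple((1.0 - t) * a + t * e for a, e in zip(x0, eps)))
    return noisy


def nearest_tokens(vectors):
    """Map each continuous vector to the id of the closest embedding (Euclidean distance)."""
    return [
        min(range(len(EMBEDDINGS)), key=lambda i: math.dist(v, EMBEDDINGS[i]))
        for v in vectors
    ]


rng = random.Random(0)
ids = [0, 1, 2, 3]
clean = add_noise(ids, t=0.0, rng=rng)  # t=0 leaves the embeddings untouched
print([VOCAB[i] for i in nearest_tokens(clean)])  # ['the', 'cat', 'sat', 'down']
```

In a discrete diffusion model the corrupted state would instead be another token sequence (for example with some positions masked), so there is no intermediate vector to round back to a token.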
## Using LangFlow
The following snippet loads the pre-trained model and generates text:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# LangFlow uses the bert-base-uncased tokenizer (see Model Details).
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# trust_remote_code=True lets transformers load the custom model code shipped with the checkpoint.
model = AutoModelForMaskedLM.from_pretrained('Continuous-Rivals-Discrete/langflow-lm1b', trust_remote_code=True)

# Generate samples and decode the token ids back to text
samples = model.generate_samples(num_samples=5, num_steps=128)
texts = tokenizer.batch_decode(samples, skip_special_tokens=True)
for text in texts:
    print(text)
```
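If you need more samples than comfortably fit in one call, one option is to request them in smaller batches. The helper below is a minimal sketch, not part of the LangFlow code: `sample_in_batches` is a hypothetical name, and it assumes `generate_samples` accepts the `num_samples` and `num_steps` keyword arguments shown above and returns a batch that yields one token-id sequence per sample when iterated. A stand-in generator is used so the sketch runs without downloading the model.

```python
def sample_in_batches(generate_fn, total, batch_size, num_steps=128):
    """Collect `total` samples by calling `generate_fn` in chunks of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    samples = []
    remaining = total
    while remaining > 0:
        n = min(batch_size, remaining)
        batch = generate_fn(num_samples=n, num_steps=num_steps)
        samples.extend(batch)  # iterating a 2-D batch yields one sequence per sample
        remaining -= n
    return samples


# Stand-in for model.generate_samples (assumption); returns arbitrary token ids.
def fake_generate(num_samples, num_steps):
    return [[101, 7592, 102] for _ in range(num_samples)]


samples = sample_in_batches(fake_generate, total=5, batch_size=2, num_steps=16)
print(len(samples))  # 5
```

With the real model, the call would look like `sample_in_batches(model.generate_samples, total=64, batch_size=8)`, and the resulting list can be passed to `tokenizer.batch_decode` as in the snippet above.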
## Model Details
- **Architecture**: DiT (Diffusion Transformer) backbone with adaptive layer normalization
- **Context Length**: 128 tokens
- **Parameters**: ~130M (similar to GPT-2 small)
- **Training**: 1M steps on the LM1B corpus
- **Tokenizer**: `bert-base-uncased` (vocabulary size 30,522)
## Citation
```bibtex
@article{chen2026langflow,
title={LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling},
author={Chen, Yuxin and Liang, Chumeng and Sui, Hangke and Guo, Ruihan and Cheng, Chaoran and You, Jiaxuan and Liu, Ge},
journal={arXiv preprint arXiv:2604.11748},
year={2026}
}
```
## Model Card Contact
Chumeng Liang (chumengl@illinois.edu)