ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
Abstract
ByteFlow Net is a tokenizer-free hierarchical architecture that lets language models learn adaptive segmentation of raw byte streams through compression-driven chunking while maintaining a static computation graph.
Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
Community
We propose ByteFlow Net, a new architecture that challenges one of the most entrenched assumptions in modern language models: the need for a fixed tokenizer. Instead of relying on BPE or SentencePiece, ByteFlow Net learns directly from raw bytes while dynamically forming tokens inside the model through an information-theoretic compression principle. The key mechanism, coding-rate chunking, groups byte representations when doing so reduces their representational cost, effectively allowing the model to discover its own segmentation of text during training. This adaptive hierarchy combines a local byte encoder, a chunking module that promotes informative byte spans into higher-level units, and a global Transformer operating on these learned segments, enabling the model to allocate compute where information density is highest. Experiments show improved bits-per-byte and competitive downstream performance compared to traditional tokenized Transformers, suggesting that tokenization may not be a necessary preprocessing step at all. More broadly, ByteFlow Net reframes tokenization as a learned compression problem, pointing toward future LLMs that operate fully end-to-end on raw data while dynamically discovering the structure of language.
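The chunking mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the log-det coding-rate proxy, the boundary-scoring rule, and mean-pooling are all assumptions chosen to make the idea concrete, and the function names (`coding_rate`, `topk_chunk`) are hypothetical. The one property it does reproduce faithfully is the Top-K selection: the number of chunks is fixed at K+1 regardless of the input, so the computation graph stays static while the boundary *positions* adapt to the data.

```python
import numpy as np

def coding_rate(H, eps=0.1):
    # Proxy for the representational cost of a span of byte vectors H (n x d):
    # the log-det coding rate from rate-distortion theory. This is an assumed
    # stand-in for the paper's unspecified compression objective.
    n, d = H.shape
    cov = H.T @ H / n
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * cov)[1]

def topk_chunk(H, K):
    """Split byte representations H (T x d) into K + 1 learned chunks.

    Each candidate boundary is scored by how much the coding rate grows when
    that byte is appended to the running prefix; the Top-K highest-scoring
    positions become chunk boundaries, so the graph shape is input-independent.
    """
    T = H.shape[0]
    scores = np.empty(T - 1)
    for t in range(1, T):
        # Marginal cost of absorbing byte t into the span ending just before it.
        scores[t - 1] = coding_rate(H[: t + 1]) - coding_rate(H[:t])
    # Static Top-K selection: always exactly K boundaries, sorted into order.
    boundaries = np.sort(np.argpartition(scores, -K)[-K:] + 1)
    segments = np.split(H, boundaries)
    # Mean-pool each chunk into one higher-level unit for the global Transformer.
    return np.stack([seg.mean(axis=0) for seg in segments])
```

For evaluation, the bits-per-byte metric mentioned above is simply the model's cross-entropy in nats divided by ln 2 and by the number of bytes predicted, which is what makes byte-level and subword models directly comparable.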
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Distilling Token-Trained Models into Byte-Level Models (2026)
- You Can Learn Tokenization End-to-End with Reinforcement Learning (2026)
- Proxy Compression for Language Modeling (2026)
- Adaptive Loops and Memory in Transformers: Think Harder or Know More? (2026)
- An Information-Theoretic Perspective on LLM Tokenizers (2026)
- ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation (2026)
- Context Compression via Explicit Information Transmission (2026)