ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
Abstract
ByteFlow Net is a tokenizer-free hierarchical architecture that lets language models learn adaptive segmentation of raw byte streams through compression-driven chunking while maintaining a static computation graph.
Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
Community
We propose ByteFlow Net, a new architecture that challenges one of the most entrenched assumptions in modern language models: the need for a fixed tokenizer. Instead of relying on BPE or SentencePiece, ByteFlow Net learns directly from raw bytes while dynamically forming tokens inside the model through an information-theoretic compression principle. The key mechanism, coding-rate chunking, groups byte representations when doing so reduces their representational cost, effectively allowing the model to discover its own segmentation of text during training. This adaptive hierarchy combines a local byte encoder, a chunking module that promotes informative byte spans into higher-level units, and a global Transformer operating on these learned segments, enabling the model to allocate compute where information density is highest. Experiments show improved bits-per-byte and competitive downstream performance compared to traditional tokenized Transformers, suggesting that tokenization may not be a necessary preprocessing step at all. More broadly, ByteFlow Net reframes tokenization as a learned compression problem, pointing toward future LLMs that operate fully end-to-end on raw data while dynamically discovering the structure of language.
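The chunking mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the log-det coding-rate proxy, the boundary-scoring rule, and mean-pooling are all assumptions chosen to make the idea concrete, and the function names (`coding_rate`, `topk_chunk`) are hypothetical. The one property it does reproduce faithfully is the Top-K selection: the number of chunks is fixed at K+1 regardless of the input, so the computation graph stays static while the boundary *positions* adapt to the data.

```python
import numpy as np

def coding_rate(H, eps=0.1):
    # Proxy for the representational cost of a span of byte vectors H (n x d):
    # the log-det coding rate from rate-distortion theory. This is an assumed
    # stand-in for the paper's unspecified compression objective.
    n, d = H.shape
    cov = H.T @ H / n
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * cov)[1]

def topk_chunk(H, K):
    """Split byte representations H (T x d) into K + 1 learned chunks.

    Each candidate boundary is scored by how much the coding rate grows when
    that byte is appended to the running prefix; the Top-K highest-scoring
    positions become chunk boundaries, so the graph shape is input-independent.
    """
    T = H.shape[0]
    scores = np.empty(T - 1)
    for t in range(1, T):
        # Marginal cost of absorbing byte t into the span ending just before it.
        scores[t - 1] = coding_rate(H[: t + 1]) - coding_rate(H[:t])
    # Static Top-K selection: always exactly K boundaries, sorted into order.
    boundaries = np.sort(np.argpartition(scores, -K)[-K:] + 1)
    segments = np.split(H, boundaries)
    # Mean-pool each chunk into one higher-level unit for the global Transformer.
    return np.stack([seg.mean(axis=0) for seg in segments])
```

For evaluation, the bits-per-byte metric mentioned above is simply the model's cross-entropy in nats divided by ln 2 and by the number of bytes predicted, which is what makes byte-level and subword models directly comparable.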
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Distilling Token-Trained Models into Byte-Level Models (2026)
- You Can Learn Tokenization End-to-End with Reinforcement Learning (2026)
- Proxy Compression for Language Modeling (2026)
- Adaptive Loops and Memory in Transformers: Think Harder or Know More? (2026)
- An Information-Theoretic Perspective on LLM Tokenizers (2026)
- ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation (2026)
- Context Compression via Explicit Information Transmission (2026)