🌙 Project Luna (150M)

An experimental language model exploring syllable-based, phonetic tokenization as an alternative to BPE.

Project Luna is an experimental transformer language model that replaces standard BPE tokenization with a syllable-based phonetic representation. Instead of operating over opaque subword tokens, Luna processes text using linguistically meaningful units such as syllables and phonetic components (onset, nucleus, coda).

This design introduces a strong inductive bias toward compositional structure, trading surface-level fluency for explicit construction of linguistic forms.

🧠 What Makes Luna Different?

| Aspect | BPE models (GPT, Mistral, etc.) | Luna |
|---|---|---|
| Tokenization | Statistical subwords | Linguistically meaningful syllables |
| Representation | Single-token identity | Multi-feature phonetic structure |
| Unknown forms | Often fragmented or skipped | Constructed from phonetic parts |
| Errors | Repetition or incoherence | Structured phonetic/morphological errors |

For example, when prompted with “mortal,” Luna may generate phrases such as “more dead than men,” reflecting a decomposition of meaning rather than retrieval of a memorized surface form.

Rather than prioritizing immediate surface fluency, Luna emphasizes structural representation. As a result, improvements with training manifest as more stable construction of words and concepts, rather than increased memorization of frequent token sequences.

📊 Key Results

Metrics below are from a 1-epoch training run, except where noted.

| Metric | Luna | Mistral BPE | Δ (relative) |
|---|---|---|---|
| BPC (weighted avg, all 8 heads) | 0.760 | 1.667 | 54% lower |
| BPC (syllable-only) | 1.387 | 1.667 | 17% lower |
| LAMBADA (first token) | 15.10% | 13.18% | +14% |
| Parameters | 150.5M | 151.9M | ~same |
| Val loss (4 epochs) | 1.4798 | — | — |

Lower BPC is better; Δ is relative to the Mistral BPE baseline.

The syllable-head BPC provides the fairest direct comparison to conventional BPE models, as it predicts the main linguistic unit. The weighted average benefits from near-deterministic auxiliary heads and is included to illustrate the multi-task training signal.
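
The weighted average above can be reproduced mechanically; a minimal sketch (the auxiliary per-head BPC values and weights here are hypothetical placeholders, not the values used in training):

```python
def weighted_avg_bpc(bpc_per_head, weights):
    """Weighted average of per-head bits-per-character scores."""
    assert len(bpc_per_head) == len(weights)
    return sum(b * w for b, w in zip(bpc_per_head, weights)) / sum(weights)

# Hypothetical illustration: one high-entropy syllable head plus several
# near-deterministic auxiliary heads pulls the average well below the
# syllable-only score.
per_head = [1.387, 0.40, 0.30, 0.50, 0.05, 0.02, 0.03, 0.02]  # made-up aux values
weights = [1.0] * len(per_head)
print(round(weighted_avg_bpc(per_head, weights), 3))  # well below 1.387
```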

Trained on 1.17B tokens from FineWeb-Edu.

🔬 Architecture

Nine-Feature Token Representation

Each token is represented by 9 linguistic features:

| Feature | Description | Vocab size |
|---|---|---|
| syllable_id | Whole syllable | ~32K |
| onset_id | Consonants before the vowel | ~1.5K |
| nucleus_id | Vowel core | ~500 |
| coda_id | Consonants after the vowel | ~2K |
| position | Position within word (start/mid/end/single) | 4 |
| is_capitalized | Capitalization flag | 2 |
| token_type | Token type (syllable/number/punct/special) | 4 |
| has_space_after | Word-boundary marker | 2 |
| is_word_end | End-of-word flag | 2 |
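
To make the table concrete, here is a toy sketch of one token's feature record. The splitting heuristic and field values are illustrative only; Luna's actual phonetizer and vocabularies live in the repo:

```python
VOWELS = set("aeiouy")

def decompose_syllable(syl):
    """Toy onset/nucleus/coda split: onset = consonants before the first
    vowel, nucleus = the vowel run, coda = whatever remains."""
    i = 0
    while i < len(syl) and syl[i].lower() not in VOWELS:
        i += 1
    j = i
    while j < len(syl) and syl[j].lower() in VOWELS:
        j += 1
    return syl[:i], syl[i:j], syl[j:]

def token_features(syl, position, has_space_after, is_word_end):
    """Assemble one nine-feature record (strings stand in for vocab ids)."""
    onset, nucleus, coda = decompose_syllable(syl)
    return {
        "syllable_id": syl.lower(),   # would be an index into the ~32K vocab
        "onset_id": onset.lower(),
        "nucleus_id": nucleus.lower(),
        "coda_id": coda.lower(),
        "position": position,         # start / mid / end / single
        "is_capitalized": syl[:1].isupper(),
        "token_type": "syllable",
        "has_space_after": has_space_after,
        "is_word_end": is_word_end,
    }

print(token_features("Mor", "start", False, False))
```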

Dual-Stream Embedding

Input → ┬─ Semantic Stream (syllable embedding) ──┬─ Gated Fusion → Transformer
        └─ Phonetic Stream (onset+nucleus+coda) ──┘
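
In code, the gated fusion step amounts to an elementwise interpolation between the two streams. A minimal plain-Python sketch (the real model computes the gate with learned projections over the embeddings):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(semantic, phonetic, gate_logits):
    """Elementwise gate: fused = g * semantic + (1 - g) * phonetic,
    where g = sigmoid(gate_logits). In Luna the gate logits would come
    from a learned projection of both streams."""
    assert len(semantic) == len(phonetic) == len(gate_logits)
    out = []
    for s, p, z in zip(semantic, phonetic, gate_logits):
        g = sigmoid(z)
        out.append(g * s + (1.0 - g) * p)
    return out

# A gate logit of 0 gives g = 0.5, an even mix of the two streams.
print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))  # [2.0, 3.0]
```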

Multi-Head Output (8 Prediction Heads)

The model predicts all 8 features simultaneously, providing richer training signal than single-head BPE models.
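
A multi-head output is trained with one loss per feature head, combined into a single objective. A toy version with hypothetical head names and uniform weights:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def multi_head_loss(head_probs, targets, weights=None):
    """Weighted sum of per-head cross-entropies, one head per feature."""
    if weights is None:
        weights = {name: 1.0 for name in head_probs}
    return sum(
        weights[name] * cross_entropy(head_probs[name], targets[name])
        for name in head_probs
    )

# Toy step: the syllable head is uncertain while a flag head is confident,
# so most of the gradient signal comes from the syllable prediction.
head_probs = {
    "syllable": {"mor": 0.25, "tal": 0.75},
    "is_word_end": {0: 0.9, 1: 0.1},
}
targets = {"syllable": "tal", "is_word_end": 0}
print(round(multi_head_loss(head_probs, targets), 4))
```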

🎯 Emergent Behaviors

The following behaviors are qualitative observations from early and mid-training checkpoints. They are not benchmark claims, but recurring patterns that differ from those typically observed in small BPE-based models.

Construction vs Retrieval

When Luna fails to produce an exact surface form, it often generates outputs that follow recognizable phonetic or morphological patterns rather than collapsing into random or repetitive noise.

This behavior suggests that errors arise from uncertainty in phonetic assembly rather than from missing semantic associations.

| Prompt | Typical BPE output | Luna's output |
|---|---|---|
| "Socrates is mortal" | "Socrates is mortal" | "more dead than men", "mor man" |
| "Gravity" | Wikipedia-style citation | "grav waves trav through space and time" |
| "Friendship" | "friendship" | "friendsship" (constructed) |

Phonetic Word Construction

Luna synthesizes linguistically plausible forms:

  • "namessake" — plural morphology preserved
  • "dif-fi-cult" — accurate syllable decomposition
  • "mor man" — phonetic approximation of "mortal"

Domain Knowledge (Despite Articulation Struggles)

  • Biology → circulatory → heart, lungs ✅
  • Astronomy → moon → orbit around earth ✅
  • Color theory → blue + yellow → green ✅
  • Physics → gravity → waves, spacetime ✅

⚠️ Current Limitations

| Limitation | Description |
|---|---|
| Stuttering/loops | Repeats syllables under uncertainty ("mu mu mu") |
| Symbolic gap | No arithmetic reasoning (numbers are phonetic tokens) |
| Fluency | Fragmented output compared to BPE models |
| Scale | Only tested at 150M parameters |

These limitations are most likely related to model scale and training duration, though further experiments are required to separate scaling effects from architectural constraints. In addition, increasing repetition_penalty to 1.3+ is recommended to mitigate some syllable-level loops.

🚀 Usage

Installation

pip install torch numpy pyphen

Quick Generation

# Note: This requires model.py and generate_text.py from the GitHub repo
# You cannot import this directly from 'transformers' yet!
from model import Luna, LunaConfig
from generate_text import load_model, generate

# Load model
model, tokenizer, _ = load_model(
    checkpoint_path="./model_best.pt",
    data_dir="./data_fineweb_1b"
)

# Generate
output = generate(
    model=model,
    tokenizer=tokenizer,
    prompt="The scientist discovered that",
    max_new_tokens=100,
    temperature=0.7,
    repetition_penalty=1.3
)
print(output)

Recommended Generation Settings

| Setting | Value | Notes |
|---|---|---|
| temperature | 0.7 | Lower values give more coherent output |
| top_k | 50 | Limits sampling to the 50 most likely tokens |
| top_p | 0.9 | Nucleus sampling |
| repetition_penalty | 1.3 | Reduces syllable loops |
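
These settings combine in the sampling loop roughly as follows. This is an illustrative helper, not the repo's actual `generate` implementation:

```python
import math
import random

def sample_next(logits, recent_ids, temperature=0.7, top_k=50,
                repetition_penalty=1.3, rng=random):
    """Apply repetition penalty, temperature, and top-k, then sample.
    `logits` maps token id -> raw score; `recent_ids` holds recently
    emitted tokens to discourage syllable loops."""
    scored = {}
    for tid, logit in logits.items():
        if tid in recent_ids:
            # Standard penalty: shrink positive scores, amplify negative ones.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        scored[tid] = logit / temperature
    # Keep only the top_k highest-scoring candidates.
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    m = max(v for _, v in top)
    weights = [(tid, math.exp(v - m)) for tid, v in top]
    z = sum(w for _, w in weights)
    r, acc = rng.random() * z, 0.0
    for tid, w in weights:
        acc += w
        if acc >= r:
            return tid
    return weights[-1][0]

random.seed(0)
print(sample_next({"mu": 5.0, "sic": 4.5, "tal": 1.0}, recent_ids={"mu"}))
```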

📁 Files

| File | Description |
|---|---|
| model_best.pt | Best checkpoint (val_loss 1.4798) |
| config.json | Dataset configuration |
| vocab.json | Syllable vocabularies |

🔮 Future Work

  1. Scale to 1B+ parameters — May resolve stuttering
  2. Add symbolic reasoning heads — Numeric/arithmetic capability
  3. Multilingual training — Test cross-linguistic transfer
  4. Fine-tuning for chat — LoRA adaptation for dialogue

Project Luna demonstrates that syllable-based, phonetic tokenization is a viable alternative to BPE for training transformer language models. At small scale, this approach sacrifices surface fluency but produces structured, interpretable error patterns that are suggestive of compositional generalization.

Further work is required to evaluate how these properties scale with larger models, longer training, and multilingual data. Luna is intended as a research prototype, not a production-ready language model.

📄 License

Apache License 2.0

👤 Author

Jakub Sykała (2026)

An independent research prototype developed by a single author. This work represents an initial exploration into non-BPE architectures.

📚 Citation

If you use this code, architecture, or ideas in research, please cite:

Jakub Sykala (2026). Project Luna: Syllable-Based Multi-Feature Tokenization for Transformer Language Models. https://github.com/JMSykala/ProjectLuna

🔗 Links

Evaluation Report
