# 🌙 Project Luna (150M)

An experimental language model exploring syllable-based, phonetic tokenization as an alternative to BPE.
Project Luna is an experimental transformer language model that replaces standard BPE tokenization with a syllable-based phonetic representation. Instead of operating over opaque subword tokens, Luna processes text using linguistically meaningful units such as syllables and phonetic components (onset, nucleus, coda).
This design introduces a strong inductive bias toward compositional structure, trading surface-level fluency for explicit construction of linguistic forms.
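The onset/nucleus/coda decomposition can be illustrated with a toy splitter: take the leading consonants, the vowel run, and whatever follows. This is a simplified sketch for intuition only, not Luna's actual tokenizer (which handles `y`, digraphs, and edge cases far more carefully):

```python
import re

def split_syllable(syllable: str):
    """Toy onset/nucleus/coda split: leading consonants, vowel run, remainder."""
    m = re.match(r"([^aeiouy]*)([aeiouy]+)(.*)$", syllable.lower())
    if m is None:                          # no vowel at all (e.g. "hmm")
        return syllable.lower(), "", ""
    return m.group(1), m.group(2), m.group(3)

# "mortal" -> syllables "mor" + "tal"
print(split_syllable("mor"))  # ('m', 'o', 'r')
print(split_syllable("tal"))  # ('t', 'a', 'l')
```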
## 🧠 What Makes Luna Different?
| Aspect | BPE Models (GPT, Mistral, etc.) | Luna |
|---|---|---|
| Tokenization | Statistical subwords | Linguistically meaningful syllables |
| Representation | Single-token identity | Multi-feature phonetic structure |
| Unknown forms | Often fragmented or skipped | Constructed from phonetic parts |
| Errors | Repetition or incoherence | Structured phonetic/morphological |
For example, when prompted with “mortal,” Luna may generate phrases such as “more dead than men,” reflecting a decomposition of meaning rather than retrieval of a memorized surface form.
Rather than prioritizing immediate surface fluency, Luna emphasizes structural representation. As a result, improvements with training manifest as more stable construction of words and concepts, rather than increased memorization of frequent token sequences.
## 📊 Key Results

Benchmark metrics below were taken after one epoch of training (validation loss is reported at 4 epochs).
| Metric | Luna | Mistral BPE | Δ (Luna improvement) |
|---|---|---|---|
| BPC (weighted avg, all 8 heads) | 0.760 | 1.667 | 54% lower |
| BPC (syllable-only) | 1.387 | 1.667 | 17% lower |
| LAMBADA (first token) | 15.10% | 13.18% | +14% |
| Parameters | 150.5M | 151.9M | ~same |
| Val loss (4 epochs) | 1.4798 | — | — |
The syllable-head BPC provides the fairest direct comparison to conventional BPE models, as it predicts the main linguistic unit. The weighted average benefits from near-deterministic auxiliary heads and is included to illustrate the multi-task training signal.
Trained on 1.17B tokens from FineWeb-Edu
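For readers comparing BPC numbers across tokenizers: bits-per-character can be derived from the mean cross-entropy (in nats per token) and the average characters per token. A minimal sketch with hypothetical numbers (not Luna's measured values):

```python
import math

def bits_per_char(nats_per_token: float, chars_per_token: float) -> float:
    """Convert mean cross-entropy in nats/token to bits/character."""
    return nats_per_token / math.log(2) / chars_per_token

# Illustrative only: 1.5 nats/token at 2.0 chars/token
print(round(bits_per_char(1.5, 2.0), 3))  # 1.082
```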
## 🔬 Architecture

### Nine-Feature Token Representation

Each token is represented by 9 linguistic features:

| Feature | Description | Vocab Size |
|---|---|---|
| `syllable_id` | Whole syllable | ~32K |
| `onset_id` | Consonants before the vowel | ~1.5K |
| `nucleus_id` | Vowel core | ~500 |
| `coda_id` | Consonants after the vowel | ~2K |
| `position` | Word position (start/mid/end/single) | 4 |
| `is_capitalized` | Capitalization flag | 2 |
| `token_type` | Type (syllable/number/punct/special) | 4 |
| `has_space_after` | Word boundary marker | 2 |
| `is_word_end` | End-of-word flag | 2 |
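One way to picture this representation is a 9-field record per token. The field names mirror the table above; the ID values and categorical encodings below are placeholders, not Luna's actual vocabulary indices:

```python
from typing import NamedTuple

class LunaToken(NamedTuple):
    """Illustrative 9-feature token record (values are placeholders)."""
    syllable_id: int
    onset_id: int
    nucleus_id: int
    coda_id: int
    position: int         # e.g. 0=start, 1=mid, 2=end, 3=single (assumed encoding)
    is_capitalized: int
    token_type: int       # e.g. 0=syllable, 1=number, 2=punct, 3=special (assumed)
    has_space_after: int
    is_word_end: int

# e.g. the first syllable of "Mortal" -> "Mor"
tok = LunaToken(syllable_id=123, onset_id=4, nucleus_id=1, coda_id=7,
                position=0, is_capitalized=1, token_type=0,
                has_space_after=0, is_word_end=0)
print(len(tok))  # 9
```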
### Dual-Stream Embedding

```
Input → ┬─ Semantic Stream (syllable embedding) ──┬─ Gated Fusion → Transformer
        └─ Phonetic Stream (onset+nucleus+coda) ──┘
```
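A gated fusion of two embedding streams can be sketched as follows. This is a generic pattern, not Luna's actual module: the layer names, dimensions, and the sigmoid-gate formulation are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of a gated dual-stream embedding (all sizes are assumptions)."""
    def __init__(self, d_model=64, n_syll=32000, n_onset=1500,
                 n_nucleus=500, n_coda=2000):
        super().__init__()
        self.sem = nn.Embedding(n_syll, d_model)        # semantic stream
        self.onset = nn.Embedding(n_onset, d_model)     # phonetic stream parts
        self.nucleus = nn.Embedding(n_nucleus, d_model)
        self.coda = nn.Embedding(n_coda, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, syll, onset, nucleus, coda):
        s = self.sem(syll)                              # (batch, seq, d_model)
        p = self.onset(onset) + self.nucleus(nucleus) + self.coda(coda)
        g = torch.sigmoid(self.gate(torch.cat([s, p], dim=-1)))
        return g * s + (1 - g) * p                      # per-dimension mix

fused = GatedFusion()(torch.tensor([[1]]), torch.tensor([[2]]),
                      torch.tensor([[3]]), torch.tensor([[4]]))
print(fused.shape)  # torch.Size([1, 1, 64])
```

The learned gate lets the model lean on the semantic stream for familiar syllables and on phonetic composition for rare or novel ones.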
### Multi-Head Output (8 Prediction Heads)
The model predicts all 8 features simultaneously, providing richer training signal than single-head BPE models.
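Training with several prediction heads typically means summing (possibly weighted) per-head cross-entropies. A minimal sketch of this pattern; head names and weighting are hypothetical, not Luna's actual loss:

```python
import torch
import torch.nn.functional as F

def multi_head_loss(logits: dict, targets: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-head cross-entropies (weights default to 1.0)."""
    total = torch.tensor(0.0)
    for name, lg in logits.items():
        ce = F.cross_entropy(lg.view(-1, lg.size(-1)), targets[name].view(-1))
        total = total + weights.get(name, 1.0) * ce
    return total
```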
## 🎯 Emergent Behaviors
The following behaviors are qualitative observations from early and mid-training checkpoints. They are not benchmark claims, but recurring patterns that differ from those typically observed in small BPE-based models.
### Construction vs. Retrieval
When Luna fails to produce an exact surface form, it often generates outputs that follow recognizable phonetic or morphological patterns rather than collapsing into random or repetitive noise.
This behavior suggests that errors arise from uncertainty in phonetic assembly rather than from missing semantic associations.
| Prompt | BPE Would Say | Luna Said |
|---|---|---|
| "Socrates is mortal" | "Socrates is mortal" | "more dead than men", "mor man" |
| "Gravity" | Wikipedia citation | "grav waves trav through space and time" |
| "Friendship" | "friendship" | "friendsship" (constructed) |
### Phonetic Word Construction
Luna synthesizes linguistically plausible forms:
- "namessake" — plural morphology preserved
- "dif-fi-cult" — accurate syllable decomposition
- "mor man" — phonetic approximation of "mortal"
### Domain Knowledge (Despite Articulation Struggles)
- Biology → circulatory → heart, lungs ✅
- Astronomy → moon → orbit around earth ✅
- Color theory → blue + yellow → green ✅
- Physics → gravity → waves, spacetime ✅
## ⚠️ Current Limitations
| Limitation | Description |
|---|---|
| Stuttering/Loops | Repeats syllables under uncertainty ("mu mu mu") |
| Symbolic Gap | No arithmetic reasoning (numbers = phonetic tokens) |
| Fluency | Fragmented output compared to BPE models |
| Scale | Only tested at 150M parameters |
These limitations are most likely related to model scale and training duration, though further experiments are needed to separate scaling effects from architectural constraints. In the meantime, increasing `repetition_penalty` to 1.3 or higher is recommended to mitigate syllable-level loops.
## 🚀 Usage

### Installation

```bash
pip install torch numpy pyphen
```
### Quick Generation

```python
# Note: this requires model.py and generate_text.py from the GitHub repo.
# You cannot import Luna directly from `transformers` yet!
from model import Luna, LunaConfig
from generate_text import load_model, generate

# Load model
model, tokenizer, _ = load_model(
    checkpoint_path="./model_best.pt",
    data_dir="./data_fineweb_1b",
)

# Generate
output = generate(
    model=model,
    tokenizer=tokenizer,
    prompt="The scientist discovered that",
    max_new_tokens=100,
    temperature=0.7,
    repetition_penalty=1.3,
)
print(output)
```
### Recommended Generation Settings

| Setting | Value | Notes |
|---|---|---|
| `temperature` | 0.7 | Lower = more coherent |
| `top_k` | 50 | Limits vocabulary sampling |
| `top_p` | 0.9 | Nucleus sampling |
| `repetition_penalty` | 1.3 | Reduces loops |
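The `repetition_penalty` setting can be understood as a CTRL-style logit penalty: token ids that have already been generated are made less likely before sampling. A minimal sketch over a plain list (a real implementation would operate on tensors; this is not Luna's actual code):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """CTRL-style penalty: damp logits of already-generated token ids."""
    out = list(logits)
    for tok in set(generated_ids):
        # Divide positive logits, multiply negative ones, so both move down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

At 1.3+, this is usually enough to break Luna's "mu mu mu" syllable loops without distorting the rest of the distribution.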
## 📁 Files

| File | Description |
|---|---|
| `model_best.pt` | Best checkpoint (val_loss: 1.4798) |
| `config.json` | Dataset configuration |
| `vocab.json` | Syllable vocabularies |
## 🔮 Future Work
- Scale to 1B+ parameters — May resolve stuttering
- Add symbolic reasoning heads — Numeric/arithmetic capability
- Multilingual training — Test cross-linguistic transfer
- Fine-tuning for chat — LoRA adaptation for dialogue
Project Luna demonstrates that syllable-based, phonetic tokenization is a viable alternative to BPE for training transformer language models. At small scale, this approach sacrifices surface fluency but produces structured, interpretable error patterns that are suggestive of compositional generalization.
Further work is required to evaluate how these properties scale with larger models, longer training, and multilingual data. Luna is intended as a research prototype, not a production-ready language model.
## 📄 License
Apache License 2.0
## 👤 Author
Jakub Sykała (2026)
An independent research prototype developed by a single author. This work represents an initial exploration into non-BPE architectures.
## 📚 Citation
If you use this code, architecture, or ideas in research, please cite:
Jakub Sykala (2026). Project Luna: Syllable-Based Multi-Feature Tokenization for Transformer Language Models. https://github.com/JMSykala/ProjectLuna
## 🔗 Links
- GitHub: github.com/JMSykala/ProjectLuna
- Full Evaluation Report: