🌙 Project Luna (150M)

An experimental language model exploring syllable-based, phonetic tokenization as an alternative to BPE.

Project Luna is an experimental transformer language model that replaces standard BPE tokenization with a syllable-based phonetic representation. Instead of operating over opaque subword tokens, Luna processes text using linguistically meaningful units such as syllables and phonetic components (onset, nucleus, coda).

This design introduces a strong inductive bias toward compositional structure, trading surface-level fluency for explicit construction of linguistic forms.

🧠 What Makes Luna Different?

| Aspect | BPE models (GPT, Mistral, etc.) | Luna |
|---|---|---|
| Tokenization | Statistical subwords | Linguistically meaningful syllables |
| Representation | Single-token identity | Multi-feature phonetic structure |
| Unknown forms | Often fragmented or skipped | Constructed from phonetic parts |
| Errors | Repetition or incoherence | Structured phonetic/morphological errors |

For example, when prompted with “mortal,” Luna may generate phrases such as “more dead than men,” reflecting a decomposition of meaning rather than retrieval of a memorized surface form.

Rather than prioritizing immediate surface fluency, Luna emphasizes structural representation. As a result, improvements with training manifest as more stable construction of words and concepts, rather than increased memorization of frequent token sequences.

📊 Key Results

Metrics below are from a 1-epoch training run, except where noted.

| Metric | Luna | Mistral BPE | Δ (relative) |
|---|---|---|---|
| BPC (weighted avg, all 8 heads) | 0.760 | 1.667 | 54% lower |
| BPC (syllable-only) | 1.387 | 1.667 | 17% lower |
| LAMBADA (first token) | 15.10% | 13.18% | +14% |
| Parameters | 150.5M | 151.9M | ~same |
| Val loss (4 epochs) | 1.4798 | — | — |

Lower BPC is better; Δ is relative to the Mistral BPE baseline.

The syllable-head BPC provides the fairest direct comparison to conventional BPE models, as it predicts the main linguistic unit. The weighted average benefits from near-deterministic auxiliary heads and is included to illustrate the multi-task training signal.
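
The weighted average above can be reproduced mechanically; a minimal sketch (the auxiliary per-head BPC values and weights here are hypothetical placeholders, not the values used in training):

```python
def weighted_avg_bpc(bpc_per_head, weights):
    """Weighted average of per-head bits-per-character scores."""
    assert len(bpc_per_head) == len(weights)
    return sum(b * w for b, w in zip(bpc_per_head, weights)) / sum(weights)

# Hypothetical illustration: one high-entropy syllable head plus several
# near-deterministic auxiliary heads pulls the average well below the
# syllable-only score.
per_head = [1.387, 0.40, 0.30, 0.50, 0.05, 0.02, 0.03, 0.02]  # made-up aux values
weights = [1.0] * len(per_head)
print(round(weighted_avg_bpc(per_head, weights), 3))  # well below 1.387
```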

Trained on 1.17B tokens from FineWeb-Edu.

🔬 Architecture

Nine-Feature Token Representation

Each token is represented by 9 linguistic features:

| Feature | Description | Vocab size |
|---|---|---|
| syllable_id | Whole syllable | ~32K |
| onset_id | Consonants before the vowel | ~1.5K |
| nucleus_id | Vowel core | ~500 |
| coda_id | Consonants after the vowel | ~2K |
| position | Position within word (start/mid/end/single) | 4 |
| is_capitalized | Capitalization flag | 2 |
| token_type | Token type (syllable/number/punct/special) | 4 |
| has_space_after | Word-boundary marker | 2 |
| is_word_end | End-of-word flag | 2 |
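
To make the table concrete, here is a toy sketch of one token's feature record. The splitting heuristic and field values are illustrative only; Luna's actual phonetizer and vocabularies live in the repo:

```python
VOWELS = set("aeiouy")

def decompose_syllable(syl):
    """Toy onset/nucleus/coda split: onset = consonants before the first
    vowel, nucleus = the vowel run, coda = whatever remains."""
    i = 0
    while i < len(syl) and syl[i].lower() not in VOWELS:
        i += 1
    j = i
    while j < len(syl) and syl[j].lower() in VOWELS:
        j += 1
    return syl[:i], syl[i:j], syl[j:]

def token_features(syl, position, has_space_after, is_word_end):
    """Assemble one nine-feature record (strings stand in for vocab ids)."""
    onset, nucleus, coda = decompose_syllable(syl)
    return {
        "syllable_id": syl.lower(),   # would be an index into the ~32K vocab
        "onset_id": onset.lower(),
        "nucleus_id": nucleus.lower(),
        "coda_id": coda.lower(),
        "position": position,         # start / mid / end / single
        "is_capitalized": syl[:1].isupper(),
        "token_type": "syllable",
        "has_space_after": has_space_after,
        "is_word_end": is_word_end,
    }

print(token_features("Mor", "start", False, False))
```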

Dual-Stream Embedding

Input → ┬─ Semantic Stream (syllable embedding) ──┬─ Gated Fusion → Transformer
        └─ Phonetic Stream (onset+nucleus+coda) ──┘
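
In code, the gated fusion step amounts to an elementwise interpolation between the two streams. A minimal plain-Python sketch (the real model computes the gate with learned projections over the embeddings):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(semantic, phonetic, gate_logits):
    """Elementwise gate: fused = g * semantic + (1 - g) * phonetic,
    where g = sigmoid(gate_logits). In Luna the gate logits would come
    from a learned projection of both streams."""
    assert len(semantic) == len(phonetic) == len(gate_logits)
    out = []
    for s, p, z in zip(semantic, phonetic, gate_logits):
        g = sigmoid(z)
        out.append(g * s + (1.0 - g) * p)
    return out

# A gate logit of 0 gives g = 0.5, an even mix of the two streams.
print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))  # [2.0, 3.0]
```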

Multi-Head Output (8 Prediction Heads)

The model predicts all 8 features simultaneously, providing richer training signal than single-head BPE models.
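
A multi-head output is trained with one loss per feature head, combined into a single objective. A toy version with hypothetical head names and uniform weights:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])

def multi_head_loss(head_probs, targets, weights=None):
    """Weighted sum of per-head cross-entropies, one head per feature."""
    if weights is None:
        weights = {name: 1.0 for name in head_probs}
    return sum(
        weights[name] * cross_entropy(head_probs[name], targets[name])
        for name in head_probs
    )

# Toy step: the syllable head is uncertain while a flag head is confident,
# so most of the gradient signal comes from the syllable prediction.
head_probs = {
    "syllable": {"mor": 0.25, "tal": 0.75},
    "is_word_end": {0: 0.9, 1: 0.1},
}
targets = {"syllable": "tal", "is_word_end": 0}
print(round(multi_head_loss(head_probs, targets), 4))
```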

🎯 Emergent Behaviors

The following behaviors are qualitative observations from early and mid-training checkpoints. They are not benchmark claims, but recurring patterns that differ from those typically observed in small BPE-based models.

Construction vs Retrieval

When Luna fails to produce an exact surface form, it often generates outputs that follow recognizable phonetic or morphological patterns rather than collapsing into random or repetitive noise.

This behavior suggests that errors arise from uncertainty in phonetic assembly rather than from missing semantic associations.

| Prompt | Typical BPE output | Luna's output |
|---|---|---|
| "Socrates is mortal" | "Socrates is mortal" | "more dead than men", "mor man" |
| "Gravity" | Wikipedia-style citation | "grav waves trav through space and time" |
| "Friendship" | "friendship" | "friendsship" (constructed) |

Phonetic Word Construction

Luna synthesizes linguistically plausible forms:

  • "namessake" — plural morphology preserved
  • "dif-fi-cult" — accurate syllable decomposition
  • "mor man" — phonetic approximation of "mortal"

Domain Knowledge (Despite Articulation Struggles)

  • Biology → circulatory → heart, lungs ✅
  • Astronomy → moon → orbit around earth ✅
  • Color theory → blue + yellow → green ✅
  • Physics → gravity → waves, spacetime ✅

⚠️ Current Limitations

| Limitation | Description |
|---|---|
| Stuttering/loops | Repeats syllables under uncertainty ("mu mu mu") |
| Symbolic gap | No arithmetic reasoning (numbers are phonetic tokens) |
| Fluency | Fragmented output compared to BPE models |
| Scale | Only tested at 150M parameters |

These limitations are most likely related to model scale and training duration, though further experiments are required to separate scaling effects from architectural constraints. In addition, increasing repetition_penalty to 1.3+ is recommended to mitigate some syllable-level loops.

🚀 Usage

Installation

pip install torch numpy pyphen

Quick Generation

# Note: This requires model.py and generate_text.py from the GitHub repo
# You cannot import this directly from 'transformers' yet!
from model import Luna, LunaConfig
from generate_text import load_model, generate

# Load model
model, tokenizer, _ = load_model(
    checkpoint_path="./model_best.pt",
    data_dir="./data_fineweb_1b"
)

# Generate
output = generate(
    model=model,
    tokenizer=tokenizer,
    prompt="The scientist discovered that",
    max_new_tokens=100,
    temperature=0.7,
    repetition_penalty=1.3
)
print(output)

Recommended Generation Settings

| Setting | Value | Notes |
|---|---|---|
| temperature | 0.7 | Lower values give more coherent output |
| top_k | 50 | Limits sampling to the 50 most likely tokens |
| top_p | 0.9 | Nucleus sampling |
| repetition_penalty | 1.3 | Reduces syllable loops |
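
These settings combine in the sampling loop roughly as follows. This is an illustrative helper, not the repo's actual `generate` implementation:

```python
import math
import random

def sample_next(logits, recent_ids, temperature=0.7, top_k=50,
                repetition_penalty=1.3, rng=random):
    """Apply repetition penalty, temperature, and top-k, then sample.
    `logits` maps token id -> raw score; `recent_ids` holds recently
    emitted tokens to discourage syllable loops."""
    scored = {}
    for tid, logit in logits.items():
        if tid in recent_ids:
            # Standard penalty: shrink positive scores, amplify negative ones.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        scored[tid] = logit / temperature
    # Keep only the top_k highest-scoring candidates.
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    m = max(v for _, v in top)
    weights = [(tid, math.exp(v - m)) for tid, v in top]
    z = sum(w for _, w in weights)
    r, acc = rng.random() * z, 0.0
    for tid, w in weights:
        acc += w
        if acc >= r:
            return tid
    return weights[-1][0]

random.seed(0)
print(sample_next({"mu": 5.0, "sic": 4.5, "tal": 1.0}, recent_ids={"mu"}))
```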

📁 Files

| File | Description |
|---|---|
| model_best.pt | Best checkpoint (val_loss 1.4798) |
| config.json | Dataset configuration |
| vocab.json | Syllable vocabularies |

🔮 Future Work

  1. Scale to 1B+ parameters — May resolve stuttering
  2. Add symbolic reasoning heads — Numeric/arithmetic capability
  3. Multilingual training — Test cross-linguistic transfer
  4. Fine-tuning for chat — LoRA adaptation for dialogue

Project Luna demonstrates that syllable-based, phonetic tokenization is a viable alternative to BPE for training transformer language models. At small scale, this approach sacrifices surface fluency but produces structured, interpretable error patterns that are suggestive of compositional generalization.

Further work is required to evaluate how these properties scale with larger models, longer training, and multilingual data. Luna is intended as a research prototype, not a production-ready language model.

📄 License

Apache License 2.0

👤 Author

Jakub Sykała (2026)

An independent research prototype developed by a single author. This work represents an initial exploration into non-BPE architectures.

📚 Citation

If you use this code, architecture, or ideas in research, please cite:

Jakub Sykala (2026). Project Luna: Syllable-Based Multi-Feature Tokenization for Transformer Language Models. https://github.com/JMSykala/ProjectLuna

🔗 Links

Evaluation Report
