NeuroLex: Hierarchical Phonotactic Character Transformer for Creative Name Generation
A novel, domain-specific AI architecture that generates truly creative, pronounceable brand names, YouTube channel names, and social media handles across 25+ languages.
Why LLMs Fail at Creative Naming
- Token-level vocabulary trap: LLMs generate from a fixed vocabulary of ~32K-128K subwords. They CANNOT create truly novel morphemes; they can only recombine known pieces like "Lumi" + "vibe" = "Lumivibe" (generic, predictable).
- RLHF alignment kills diversity: Models trained with RLHF gravitate toward "safe", average-acceptable outputs. (Paper: "Creativity Has Left the Chat", arxiv:2406.05587)
- Top-p sampling eliminates rare forms: Standard sampling filters mathematically remove low-frequency creative words from candidates. (Paper: "Lost in Sampling", arxiv:2605.27268)
- Morphological blindness: BPE tokenization fragments novel words unpredictably, destroying morphological coherence. (Paper: "Counting the Bugs in ChatGPT's Wugs", arxiv:2310.15113)
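To make the top-p point concrete, here is a minimal sketch of nucleus filtering over a toy next-token distribution. The `top_p_filter` helper and the probabilities are invented for illustration and are not taken from any of the cited papers; the sketch only shows how low-probability candidates fall outside the nucleus and can never be sampled.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-likely tokens whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break  # everything after this point is discarded
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Toy distribution (made-up numbers): two rare, unusual pieces sit in the tail.
toy_probs = {"Lumi": 0.50, "Nova": 0.30, "Vibe": 0.15, "Zyph": 0.04, "Kolv": 0.01}
kept = top_p_filter(toy_probs, p=0.9)
dropped = sorted(set(toy_probs) - set(kept))
print("kept:", sorted(kept))
print("dropped:", dropped)
```

With these numbers the two rarest pieces end up in `dropped`, so they have zero probability after filtering regardless of how many samples are drawn.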
Our Solution: NeuroLex Architecture
A purpose-built character-level generative model (~8M parameters) that:
- Generates character-by-character, so it can create ANY word, not just known tokens
- Learns phonotactic constraints from 25+ languages (what combinations of sounds are valid/pleasing)
- Uses sound symbolism (sharp consonants for tech brands, round vowels for warm/friendly brands)
- Supports controllable generation via category, language, length, and style/vibe selectors
- Trains on free Google Colab in ~30 minutes with streaming datasets
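As an illustration of why character-by-character generation is not limited to known tokens, the following sketch encodes names as UTF-8 bytes plus three special tokens, matching the 259-entry vocabulary listed in the architecture overview. The function names and the specific ids chosen for padding, start, and end are assumptions; the real `CharTokenizer` in dataset.py may be organized differently.

```python
# Assumed id layout: 0-255 are raw UTF-8 bytes, 256-258 are the three specials.
PAD_ID, BOS_ID, EOS_ID = 256, 257, 258
VOCAB_SIZE = 259


def encode(name: str) -> list[int]:
    """Turn a name into byte ids wrapped in start and end markers."""
    return [BOS_ID] + list(name.encode("utf-8")) + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Drop special ids and decode the remaining bytes back to text."""
    payload = bytes(i for i in ids if i < 256)
    return payload.decode("utf-8", errors="replace")


ids = encode("Zyphra")
print(ids)
print(decode(ids))
```

Any string round-trips through this scheme, including names with non-ASCII letters, because every character is representable as one or more byte ids.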
Architecture Overview
┌──────────────────────────────────────────────────────┐
│ CONDITION ENCODER                                    │
│ [Category] + [Language] + [Length] + [Vibe/Style]    │
│ → Condition Vector (d_model=256)                     │
└───────────────────────────┬──────────────────────────┘
                            │ cross-attention
                            ▼
┌──────────────────────────────────────────────────────┐
│ PHONOTACTIC CHARACTER DECODER (Causal Transformer)   │
│ - Vocab: 259 (256 UTF-8 bytes + 3 specials)          │
│ - Positional: Sinusoidal Encoding                    │
│ - 6 layers, 8 heads, d_model=256, d_ff=1024          │
│ - Cross-attention to condition at each layer         │
│ - Generates char-by-char until <EOS>                 │
└───────────────────────────┬──────────────────────────┘
                            ▼
┌──────────────────────────────────────────────────────┐
│ MULTI-SIGNAL REWARD SCORING                          │
│ R1: Phonotactic validity (char n-gram log-prob)      │
│ R2: Sound symbolism alignment (bouba/kiki scores)    │
│ R3: Novelty (1 - max Jaccard overlap with dict)      │
│ R4: Memorability (length, rhythm, pronounceability)  │
│ R5: Length conformity to target                      │
└──────────────────────────────────────────────────────┘
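As a rough illustration of the R1 and R3 signals named in the diagram, the sketch below scores a candidate with an add-one-smoothed character bigram model and measures novelty as one minus the highest Jaccard overlap between bigram sets. The `BigramScorer` class, the `char_ngrams` and `novelty` helpers, the choice of bigrams, and the toy word list are all assumptions made for this example; the functions in rewards.py may be defined differently.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Character n-grams of a word, with ^ and $ marking its start and end."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class BigramScorer:
    """Add-one-smoothed character bigram model (illustrative stand-in for R1)."""

    def __init__(self, corpus):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.alphabet = set("^$")
        for word in corpus:
            self.alphabet.update(word.lower())
            for gram in char_ngrams(word):
                self.bigrams[gram] += 1
                self.contexts[gram[0]] += 1

    def avg_log_prob(self, word):
        """Mean log-probability per bigram; higher means more familiar sound patterns."""
        grams = char_ngrams(word)
        v = len(self.alphabet)
        total = sum(
            math.log((self.bigrams[g] + 1) / (self.contexts[g[0]] + v)) for g in grams
        )
        return total / len(grams)


def novelty(word, dictionary):
    """1 - max Jaccard overlap of bigram sets with any dictionary word (stand-in for R3)."""
    grams = set(char_ngrams(word))
    best = 0.0
    for known in dictionary:
        other = set(char_ngrams(known))
        best = max(best, len(grams & other) / len(grams | other))
    return 1.0 - best


toy_dictionary = ["lumen", "nova", "vista", "solar", "luna"]
scorer = BigramScorer(toy_dictionary)
for candidate in ["lunova", "xqzk"]:
    print(candidate, round(scorer.avg_log_prob(candidate), 3), round(novelty(candidate, toy_dictionary), 3))
```

The two signals pull in opposite directions: a string of unseen letter pairs maximizes novelty but scores poorly on the bigram model, which is why a combined reward is needed rather than either score alone.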
Key Innovation: Why This Works
Character-level = infinite vocabulary: Unlike GPT/Llama, which pick from a fixed set of tokens, we generate one character at a time. This means we can create words such as "Zyphra", "Kolvani", and "Ashtrex" that never existed before.
Phonotactic prior from real languages: We train on IPA pronunciations from 25 languages, so the model learns what sound sequences are pronounceable and pleasant in each language family.
Sound symbolism as inductive bias: Cross-linguistic research (27 languages, arxiv:2512.12245) suggests that certain sounds evoke similar impressions across languages:
- Sharp/tech: p, t, k, s, x + vowels i, e
- Warm/friendly: m, n, l, b + vowels o, u, a
- We encode this as a learnable reward signal.
Conditional control: Category embeddings let you specify "tech startup" vs "food brand" vs "gaming channel" and get appropriately-vibed names.
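The sharp and warm sound classes listed above can be turned into a simple score. The sketch below is a minimal, letter-based approximation: the `vibe_score` function and its normalization are assumptions for illustration, and it works on spelling rather than IPA, so it is only a stand-in for whatever the R2 reward actually computes.

```python
# Letter sets copied from the sharp/tech and warm/friendly classes above.
SHARP = set("ptksx") | set("ie")
WARM = set("mnlb") | set("oua")


def vibe_score(name: str) -> float:
    """Return a value in [-1, 1]: positive leans sharp, negative leans warm."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return 0.0
    sharp = sum(c in SHARP for c in letters)
    warm = sum(c in WARM for c in letters)
    return (sharp - warm) / len(letters)


for name in ["Kixtep", "Molabu", "Ashtrex"]:
    print(name, round(vibe_score(name), 2))
```

A requested vibe such as "sharp" could then be rewarded when the score is positive and penalized when it is negative; letters outside both sets simply dilute the score.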
Training Data (All Free, Streamable)
| Dataset | What | Size | Languages |
|---|---|---|---|
| omneity-labs/ipa-dict | Word-to-IPA pronunciation pairs | 74.6 MB | 25 |
| wikimedia/wikipedia | Article titles (proper nouns) | Streamed | 20+ |
| AdamLucek/youtube-titles | YouTube channel names | 1.8 MB | EN |
| bigpictureio/companies-2023-q4-sm | Company/brand names | Streamed | 100+ countries |
Quick Start
from model import NeuroLexModel, NeuroLexConfig
from dataset import CharTokenizer, CATEGORIES, LANGUAGES, VIBES
from generate import NeuroLexGenerator, generate_names
# After training:
gen = NeuroLexGenerator("checkpoints/neurolex_trained.pt")
names = gen.generate(category="technology", language="en", vibe="sharp", length_range=(5, 9), num_names=10)
Installation
pip install torch numpy datasets
Training (Free Colab)
Open neurolex_train.ipynb in Google Colab (free tier T4 GPU) and run all cells. Training completes in ~30 minutes.
Files
| File | Description |
|---|---|
| model.py | NeuroLex architecture (Condition Encoder + Character Decoder) |
| dataset.py | Streaming multilingual dataset pipeline |
| rewards.py | Multi-signal reward scoring functions |
| train.py | Full training script |
| generate.py | Inference/generation with controls |
| neurolex_train.ipynb | Complete Colab notebook (ready to run) |
| __init__.py | Package init |
| requirements.txt | Dependencies |
Controllable Generation Parameters
| Parameter | Options | Effect |
|---|---|---|
| Category | technology, food, gaming, fashion, music, sports, education, health, finance, travel, entertainment, science, art, fitness, ai, crypto, luxury, comedy, podcast, etc. | Guides the semantic domain |
| Language | en, fr, de, es, ja, ko, zh, ar, ru, pt, it, nl, pl, sv, tr, fa, hi, he, vi, fi + 40 more | Targets specific phonotactic rules |
| Vibe | sharp, warm, elegant, playful, powerful, mystical, minimal, exotic, retro, futuristic, natural, urban, academic, rebellious, cosmic, neutral | Controls sound-symbolic feel |
| Length | 4-20 characters | Target output length |
| Temperature | 0.5-1.2 | Controls creativity vs safety |
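To show what the Temperature row means in practice, here is a small sketch of temperature-scaled softmax sampling over toy logits. The helper names and the logits are illustrative assumptions, not code from generate.py; the point is only that a lower temperature concentrates probability on the most likely character, while a higher one spreads it toward rarer choices.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
cool = softmax_with_temperature(toy_logits, 0.5)
warm = softmax_with_temperature(toy_logits, 1.2)
print("T=0.5:", [round(p, 3) for p in cool])
print("T=1.2:", [round(p, 3) for p in warm])
print("sampled index:", sample_index(toy_logits, 1.2, random.Random(0)))
```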
Two-Stage Training
Stage 1 (Character Language Modeling, ~20 min): The model learns phonotactic patterns by predicting the next character in real words from 25 languages. This teaches it what character sequences are natural and pronounceable.
Stage 2 (Reward-Weighted Fine-tuning, ~10 min): The model is fine-tuned with a reward signal that scores generated names on phonotactic validity, sound symbolism alignment, novelty, and memorability.
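The text does not spell out how the reward enters the Stage 2 loss, so the following is only one plausible reading: each sampled name's negative log-likelihood is weighted by a softmax over its reward, so high-reward names contribute more to the update. The function names, the `beta` sharpness parameter, and the numbers are assumptions for illustration, and train.py may use a different scheme.

```python
import math


def reward_weights(rewards, beta=1.0):
    """Softmax over rewards; smaller beta focuses weight on the best samples."""
    peak = max(rewards)
    exps = [math.exp((r - peak) / beta) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_nll(nlls, rewards, beta=1.0):
    """Weighted sum of per-sample negative log-likelihoods."""
    weights = reward_weights(rewards, beta)
    return sum(w * nll for w, nll in zip(weights, nlls))


# Two sampled names: the first has a higher reward and a lower NLL (made-up values).
toy_nlls = [2.0, 4.0]
toy_rewards = [0.9, 0.1]
weights = reward_weights(toy_rewards)
loss = reward_weighted_nll(toy_nlls, toy_rewards)
print([round(w, 3) for w in weights], round(loss, 3))
```

With equal rewards the weights are uniform and the loss reduces to the plain mean NLL, which is the Stage 1 objective averaged over the batch.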
Research References
- Hierarchical Autoregressive Transformers (arxiv:2501.10322, DeepMind 2025)
- Sound Symbolism across 27 Languages (arxiv:2512.12245, 2025)
- ByT5: Token-free byte-level models (arxiv:2105.13626, 2021)
- Lost in Sampling: Word Coverage Score (arxiv:2605.27268, 2025)
- Creativity Has Left the Chat (arxiv:2406.05587, 2024)
- T-FREE Tokenizer-Free LLMs (arxiv:2406.19223, 2024)
- Phoneme-Based Baby Llamas (arxiv:2410.01487, 2024)
- Kiki or Bouba? Sound Symbolism (arxiv:2310.16781, 2023)
- Counting the Bugs in ChatGPT's Wugs (arxiv:2310.15113, 2023)
License
MIT
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "krystv/neurolex-creative-name-generator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.