NeuroLex v3: Morpheme-Aware Creative Name Generator

A novel, domain-specific AI architecture that generates truly creative, pronounceable brand names, YouTube channel names, and social media handles across 25+ languages.

~2.7M parameters | Trains in ~30 min on free Colab T4 | 25+ languages

🚀 Quick Start

Open neurolex_train.ipynb in Google Colab (free tier T4 GPU) and run all cells. Training completes in ~30 minutes. No authentication needed — all datasets are public.

Why LLMs Fail at Creative Naming

Failure Mode	Paper	Impact
BPE vocabulary trap — can only recombine known tokens	Wug Test (arxiv:2310.15113)	Can't create truly novel morphemes
RLHF kills diversity — alignment creates attractor states	Creativity Has Left the Chat (arxiv:2406.05587)	Outputs are generic, predictable
Sampling prunes novelty — top-p/k removes rare forms	Lost in Sampling (arxiv:2605.27268)	Creative words unreachable
Analogical memorization — morphology via pattern matching, not rules	arxiv:2411.07990	Fails on novel morphological forms
No phonotactic awareness — doesn't model sound-feel mappings	Sound Symbolism (arxiv:2512.12245)	Can't target specific vibes

Our Solution: Hybrid Morpheme+Character Transformer

Architecture Overview (v3)

Input: <BOS> <STRATEGY> <CATEGORY> <VIBE> [generated tokens...]

┌────────────────────────────────────────────────────────────┐
│ HYBRID TOKENIZER                                           │
│ Priority: Control tokens > Morphemes (longest) > UTF-8 bytes│
│ Vocab: 256 bytes + ~200 morphemes + 35 control tokens      │
└────────────────────────────────┬───────────────────────────┘
                                 ↓
┌────────────────────────────────────────────────────────────┐
│ CAUSAL TRANSFORMER (6 layers, 256-dim, 8 heads)            │
│ • Token embedding (weight-tied with output)                 │
│ • Sinusoidal positional encoding                           │
│ • Pre-norm residual blocks                                  │
│ • Multi-head causal self-attention                          │
│ • GELU feed-forward (1024-dim)                              │
│ • Final LayerNorm → tied linear output                      │
└────────────────────────────────┬───────────────────────────┘
                                 ↓
┌────────────────────────────────────────────────────────────┐
│ GENERATION (top-k + nucleus sampling)                       │
│ • Temperature-controlled creativity                         │
│ • Top-k filtering (diversity)                               │
│ • Nucleus (top-p) filtering (coherence)                     │
│ • EOS detection for variable-length output                  │
└────────────────────────────────────────────────────────────┘

Key Design Choices (Research-Backed)

Choice	Rationale	Evidence
Byte-level + morphemes	Infinite vocabulary + efficient morpheme learning	ByT5 (arxiv:2105.13626) beats token-level on morphological tasks
In-sequence control tokens	Better gradient flow than cross-attention conditioning	Neologism Learning (arxiv:2510.08506, ICLR 2025)
Causal LM (not enc-dec)	Simpler, proven for controlled generation	GPT-style architecture
Morfessor morpheme discovery	Unsupervised extraction of productive patterns	MDL-based segmentation
Sound symbolism labels	Universal cross-linguistic signal	27-language study (arxiv:2512.12245)
Weight-tied embeddings	30% param reduction, better generalization	Press & Wolf 2017

Controllable Generation

Strategy (how the name is created)

Token	Method	Example
`<BLEND>`	Combine two meaningful morphemes	Cloudify, Datavex, Nexaflow
`<MORPH>`	Add productive suffixes/prefixes	Boldster, Craftium, Questify
`<PHONETIC>`	Novel sound combinations	Zyphra, Kolvex, Lumara
`<CLIP>`	Short, clipped forms	Nex, Zyp, Flox, Drex
`<CROSSLANG>`	Mix language roots	Kazeflow, Blitzcraft, Terranova

Category (20 domains)

<TECH>, <FOOD>, <GAMING>, <FASHION>, <MUSIC>, <HEALTH>, <FINANCE>, <TRAVEL>, <SCIENCE>, <ART>, <FITNESS>, <LUXURY>, <SOCIAL>, <CRYPTO>, <AI>, <ECO>, <KIDS>, <SPORTS>, <EDUCATION>, <GENERAL>

Vibe (10 sound-symbolic feels)

<SHARP>, <WARM>, <ELEGANT>, <PLAYFUL>, <POWERFUL>, <MYSTICAL>, <MINIMAL>, <FUTURISTIC>, <NATURAL>, <COSMIC>

Training Data

Dataset	Purpose	Size	Languages
`omneity-labs/ipa-dict`	Phonotactic patterns	5.3M words	25
`AdamLucek/youtube-titles`	Real creative names	~50 channels	EN
Curated brand examples	Quality signal (200x weighted)	65 examples	Multi

All datasets load via streaming or are small enough for Colab's 12GB RAM.

v3 Changes (Fixes from v2)

Bug Fixes

Morfessor API — Fixed load_data([[w]]) → load_data([(1, w) for w in words]). The API expects (count, word) tuples.
Tokenizer decode — Fixed control token skipping in decode using proper reverse lookup dicts instead of fragile index comparisons.
Memory streaming — IPA dict now loads via streaming to avoid 78MB download blocking.

Architecture Improvements

Expanded curated examples — 65 brand examples (was 30) covering all 5 strategies with proper category/vibe diversity.
Top-p (nucleus) sampling — Added alongside top-k for better generation quality.
Proper save/load — Model saves config alongside weights for clean reloading.
Quality analysis cell — Added generation metrics (uniqueness, length distribution, morpheme usage, V/C ratio).
Comparison generation — New cell to compare all strategy×vibe combinations side-by-side.

Sound Symbolism: Why Names "Feel" Right

Cross-linguistic research proves universal patterns in how sounds map to feelings:

SHARP/TECH: p, t, k, s, z, x, f, h, c + vowels i, e
  → "Apex", "Zyphra", "Kolvex" (precise, cutting-edge)

WARM/FRIENDLY: m, n, l, b, d, g, w, r, y + vowels o, u, a  
  → "Moluna", "Bloom", "Lumara" (approachable, organic)

We use these mappings to automatically label training data with vibe tags, so the model learns sound→feel correlations directly.

Files

File	Description
`neurolex_train.ipynb`	Complete Colab notebook — run this!
`model.py`	Architecture (Condition Encoder + Character Decoder)
`dataset.py`	Streaming multilingual dataset pipeline
`rewards.py`	Multi-signal reward scoring
`train.py`	Training script
`generate.py`	Inference/generation
`requirements.txt`	Dependencies

Research References

Hierarchical Autoregressive Transformers (arxiv:2501.10322, DeepMind 2025)
Sound Symbolism across 27 Languages (arxiv:2512.12245, 2025)
ByT5: Token-free byte-level models (arxiv:2105.13626, 2021)
Neologism Learning for Controllability (arxiv:2510.08506, ICLR 2025)
Lost in Sampling: Word Coverage Score (arxiv:2605.27268, 2025)
Creativity Has Left the Chat (arxiv:2406.05587, 2024)
T-FREE Tokenizer-Free LLMs (arxiv:2406.19223, 2024)
Kiki or Bouba? Sound Symbolism (arxiv:2310.16781, 2023)
Counting the Bugs in ChatGPT's Wugs (arxiv:2310.15113, 2023)

License

MIT

Generated with ML Intern

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Papers for krystv/neurolex-creative-name-generator

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Paper • 2605.27268 • Published May 26 • 13

Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

Paper • 2512.12245 • Published Dec 13, 2025

Neologism Learning for Controllability and Self-Verbalization

Paper • 2510.08506 • Published Oct 9, 2025

Hierarchical Autoregressive Transformers: Combining Byte-~and Word-Level Processing for Robust, Adaptable Language Models

Paper • 2501.10322 • Published Jan 17, 2025 • 2

Derivational Morphology Reveals Analogical Generalization in Large Language Models

Paper • 2411.07990 • Published Nov 12, 2024