NeuroLex: Hierarchical Phonotactic Character Transformer for Creative Name Generation

A novel, domain-specific AI architecture that generates truly creative, pronounceable brand names, YouTube channel names, and social media handles across 25+ languages.

Why LLMs Fail at Creative Naming

  1. Token-level vocabulary trap: LLMs generate from a fixed vocabulary of ~32K-128K subwords. In practice they rarely coin novel morphemes and mostly recombine known pieces, e.g. "Lumi" + "vibe" = "Lumivibe" (generic, predictable).
  2. RLHF alignment kills diversity: Models trained with RLHF gravitate toward "safe", average-acceptable outputs. (Paper: "Creativity Has Left the Chat", arxiv:2406.05587)
  3. Top-p sampling eliminates rare forms: Nucleus (top-p) and similar truncation filters zero out the low-probability tail of the distribution, which is where unusual, creative word forms tend to sit. (Paper: "Lost in Sampling", arxiv:2605.27268)
  4. Morphological blindness: BPE tokenization fragments novel words unpredictably, destroying morphological coherence. (Paper: "Counting the Bugs in ChatGPT's Wugs", arxiv:2310.15113)
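
To make point 3 concrete, here is a minimal sketch of nucleus (top-p) filtering over a toy next-token distribution. The token strings and probabilities are invented for illustration and do not come from any real model; the sketch only shows that every candidate outside the cumulative-probability cutoff ends up with probability zero.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    # Renormalise so the surviving tokens form a distribution again.
    return {token: prob / total for token, prob in kept.items()}


# Hypothetical distribution: three common continuations and two rare ones.
toy = {"vibe": 0.50, "tech": 0.30, "ly": 0.15, "zyx": 0.03, "qor": 0.02}
filtered = top_p_filter(toy, p=0.9)
print(sorted(filtered))  # the rare "zyx" and "qor" are no longer candidates
```

With p=0.9 the two rare continuations can never be sampled, regardless of how many names are drawn.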

Our Solution: NeuroLex Architecture

A purpose-built character-level generative model (~8M parameters) that:

  • Generates character-by-character → can create ANY word, not just known tokens
  • Learns phonotactic constraints from 25+ languages (what combinations of sounds are valid/pleasing)
  • Uses sound symbolism (sharp consonants → tech brands, round vowels → warm/friendly brands)
  • Supports controllable generation via category, language, length, and style/vibe selectors
  • Trains on free Google Colab in ~30 minutes with streaming datasets
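
The ~8M figure can be sanity-checked against the decoder sizes given in the architecture overview (6 layers, d_model=256, d_ff=1024, vocab 259). The sketch below is a back-of-the-envelope estimate, not the actual count from model.py: it assumes standard Transformer decoder layers with biases, three LayerNorms per layer, an untied output head, and parameter-free sinusoidal positions, and it leaves out the condition encoder entirely.

```python
def decoder_param_estimate(vocab=259, d_model=256, d_ff=1024, n_layers=6):
    """Rough parameter count for the character decoder (assumed layer layout)."""
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer feed-forward network with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Three LayerNorms per layer, each with a scale and a shift vector.
    norms = 3 * 2 * d_model
    # Each layer has self-attention and cross-attention to the condition.
    per_layer = 2 * attn + ffn + norms
    embedding = vocab * d_model
    head = d_model * vocab + vocab  # untied output projection (assumption)
    return n_layers * per_layer + embedding + head


estimate = decoder_param_estimate()
print(f"{estimate:,} parameters (~{estimate / 1e6:.1f}M)")
```

Under these assumptions the decoder alone comes to roughly 6.5M parameters, so the remainder of the stated ~8M would have to sit in the condition encoder or in details not listed here.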

Architecture Overview

┌──────────────────────────────────────────────────────┐
│ CONDITION ENCODER                                    │
│ [Category] + [Language] + [Length] + [Vibe/Style]    │
│ → Condition Vector (d_model=256)                     │
└──────────────────────┬───────────────────────────────┘
                       ↓ cross-attention
┌──────────────────────────────────────────────────────┐
│ PHONOTACTIC CHARACTER DECODER (Causal Transformer)   │
│ - Vocab: 259 (256 UTF-8 bytes + 3 specials)          │
│ - Positional: Sinusoidal Encoding                    │
│ - 6 layers, 8 heads, d_model=256, d_ff=1024          │
│ - Cross-attention to condition at each layer         │
│ - Generates char-by-char until <EOS>                 │
└──────────────────────┬───────────────────────────────┘
                       ↓
┌──────────────────────────────────────────────────────┐
│ MULTI-SIGNAL REWARD SCORING                          │
│ R1: Phonotactic validity (char n-gram log-prob)      │
│ R2: Sound symbolism alignment (bouba/kiki scores)    │
│ R3: Novelty (1 - max Jaccard overlap with dict)      │
│ R4: Memorability (length, rhythm, pronounceability)  │
│ R5: Length conformity to target                      │
└──────────────────────────────────────────────────────┘
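
As an illustration of the R3 signal, the following sketch computes "1 - max Jaccard overlap with dict" for a candidate name. The source does not say which sets the Jaccard overlap is taken over, so this example assumes character bigram sets; the function names and the tiny dictionary are hypothetical and are not taken from rewards.py.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a lowercased word (bigrams by default)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def novelty(candidate, dictionary, n=2):
    """1 minus the highest n-gram overlap with any dictionary word."""
    cand = char_ngrams(candidate, n)
    best = max((jaccard(cand, char_ngrams(w, n)) for w in dictionary), default=0.0)
    return 1.0 - best


known = ["lumina", "vibe"]  # stand-in for a real dictionary
print(novelty("lumina", known))  # exact match with a dictionary word
print(novelty("zyphra", known))  # shares no bigram with either word
```

A score near 0 means the candidate closely resembles an existing word, while a score near 1 means it shares almost no n-grams with the dictionary.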

Key Innovation: Why This Works

  1. Character-level = infinite vocabulary: Unlike GPT/Llama, which pick from fixed tokens, we generate one character at a time. This lets the model produce names such as "Zyphra", "Kolvani", or "Ashtrex", which need not appear in any token vocabulary.

  2. Phonotactic prior from real languages: We train on IPA pronunciations from 25 languages, so the model learns what sound sequences are pronounceable and pleasant in each language family.

  3. Sound symbolism as inductive bias: Cross-linguistic research (27 languages, arxiv:2512.12245) suggests that certain sounds tend to evoke similar associations across languages:

    • Sharp/tech: p, t, k, s, x + vowels i, e
    • Warm/friendly: m, n, l, b + vowels o, u, a
    • We encode this as a learnable reward signal.
  4. Conditional control: Category embeddings let you specify "tech startup" vs "food brand" vs "gaming channel" and get appropriately-vibed names.
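
The sharp/warm split in point 3 can be sketched as a simple letter-based score. This is a fixed heuristic for illustration only: the project describes a learnable reward, and the scoring formula, set names, and function name below are assumptions. Only the letter groupings themselves come from the list above.

```python
# Letter groups taken from the sharp/tech and warm/friendly lists above.
SHARP = set("ptksx") | set("ie")
WARM = set("mnlb") | set("oua")


def symbolism_score(name):
    """Return a value in [-1, 1]: positive leans sharp, negative leans warm."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return 0.0
    sharp = sum(c in SHARP for c in letters)
    warm = sum(c in WARM for c in letters)
    return (sharp - warm) / len(letters)


for candidate in ["Kixtep", "Moluna", "Zyphra"]:
    print(candidate, round(symbolism_score(candidate), 2))
```

A name requested with vibe="sharp" would be rewarded for a positive score and a "warm" name for a negative one; letters in neither group simply dilute the score.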

Training Data (All Free, Streamable)

Dataset                           | Contents                      | Size     | Languages
----------------------------------|-------------------------------|----------|---------------
omneity-labs/ipa-dict             | Word→IPA pronunciation pairs  | 74.6 MB  | 25
wikimedia/wikipedia               | Article titles (proper nouns) | Streamed | 20+
AdamLucek/youtube-titles          | YouTube channel names         | 1.8 MB   | EN
bigpictureio/companies-2023-q4-sm | Company/brand names           | Streamed | 100+ countries

Quick Start

from model import NeuroLexModel, NeuroLexConfig
from dataset import CharTokenizer, CATEGORIES, LANGUAGES, VIBES
from generate import NeuroLexGenerator, generate_names

# After training, load the checkpoint and sample ten names:
gen = NeuroLexGenerator("checkpoints/neurolex_trained.pt")
names = gen.generate(
    category="technology", language="en", vibe="sharp",
    length_range=(5, 9), num_names=10,
)
print(names)

Installation

pip install torch numpy datasets

Training (Free Colab)

Open neurolex_train.ipynb in Google Colab (free tier T4 GPU) and run all cells. Training completes in ~30 minutes.

Files

File                 | Description
---------------------|--------------------------------------------------------------
model.py             | NeuroLex architecture (Condition Encoder + Character Decoder)
dataset.py           | Streaming multilingual dataset pipeline
rewards.py           | Multi-signal reward scoring functions
train.py             | Full training script
generate.py          | Inference/generation with controls
neurolex_train.ipynb | Complete Colab notebook (ready to run)
__init__.py          | Package init
requirements.txt     | Dependencies

Controllable Generation Parameters

Parameter   | Options | Effect
------------|---------|-------
Category    | technology, food, gaming, fashion, music, sports, education, health, finance, travel, entertainment, science, art, fitness, ai, crypto, luxury, comedy, podcast, etc. | Guides the semantic domain
Language    | en, fr, de, es, ja, ko, zh, ar, ru, pt, it, nl, pl, sv, tr, fa, hi, he, vi, fi + 40 more | Targets specific phonotactic rules
Vibe        | sharp, warm, elegant, playful, powerful, mystical, minimal, exotic, retro, futuristic, natural, urban, academic, rebellious, cosmic, neutral | Controls sound-symbolic feel
Length      | 4-20 characters | Target output length
Temperature | 0.5-1.2 | Controls creativity vs safety
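
The effect of the Temperature setting can be illustrated with a temperature-scaled softmax. This is a generic sketch of the standard technique, not code from generate.py; the logits are made up, and the two temperatures are simply the ends of the range listed above.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three next characters
cool = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 1.2)
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

At 0.5 most of the probability mass sits on the top-scoring character, which gives safer and more repetitive names; at 1.2 the distribution is flatter, so less likely characters are sampled more often.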

Two-Stage Training

  1. Stage 1 - Character Language Modeling (~20 min): The model learns phonotactic patterns by predicting the next character in real words from 25 languages. This teaches it what character sequences are natural and pronounceable.

  2. Stage 2 - Reward-Weighted Fine-tuning (~10 min): The model is fine-tuned with a reward signal that scores generated names on phonotactic validity, sound symbolism alignment, novelty, and memorability.
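
For Stage 1, the next-character objective amounts to shifting each encoded word by one position. The sketch below assumes the 259-entry vocabulary from the architecture overview (256 UTF-8 bytes plus 3 specials). The specific ids chosen for PAD, BOS, and EOS and the function names are hypothetical, since the source does not specify them and the real CharTokenizer in dataset.py may differ.

```python
PAD, BOS, EOS = 256, 257, 258  # assumed ids; the source only says "3 specials"


def encode(word):
    """UTF-8 bytes of the word, wrapped in begin and end markers."""
    return [BOS] + list(word.encode("utf-8")) + [EOS]


def decode(ids):
    """Drop special ids and turn the remaining bytes back into text."""
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")


def next_char_pairs(word):
    """Inputs and targets for next-character prediction (targets shifted by one)."""
    ids = encode(word)
    return ids[:-1], ids[1:]


inputs, targets = next_char_pairs("ab")
print(inputs)   # BOS followed by the bytes of "ab"
print(targets)  # the bytes of "ab" followed by EOS
print(decode(encode("Zyphra")))
```

At each position the model sees the input prefix and is trained to predict the corresponding target, ending with EOS so that it also learns where names stop.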

Research References

  • Hierarchical Autoregressive Transformers (arxiv:2501.10322, DeepMind 2025)
  • Sound Symbolism across 27 Languages (arxiv:2512.12245, 2025)
  • ByT5: Token-free byte-level models (arxiv:2105.13626, 2021)
  • Lost in Sampling: Word Coverage Score (arxiv:2605.27268, 2025)
  • Creativity Has Left the Chat (arxiv:2406.05587, 2024)
  • T-FREE Tokenizer-Free LLMs (arxiv:2406.19223, 2024)
  • Phoneme-Based Baby Llamas (arxiv:2410.01487, 2024)
  • Kiki or Bouba? Sound Symbolism (arxiv:2310.16781, 2023)
  • Counting the Bugs in ChatGPT's Wugs (arxiv:2310.15113, 2023)

License

MIT

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krystv/neurolex-creative-name-generator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.
