Braille256-v3: Grade Infinity Universal Braille Model

A 27.8M-parameter language model trained directly on Braille Unicode text. It uses a SentencePiece Unigram tokenizer with a 4096-piece vocabulary, so that frequent multi-cell sequences are encoded as single tokens.

Key Features

SentencePiece Unigram: Likelihood-optimized tokenization, evaluated here as an alternative to BPE
4096-Token Vocabulary: Contractions learned from text in 7 languages
Multilingual: English, French, German, Spanish, Italian, Portuguese, Dutch
Cross-Linguistic Patterns: Some learned contractions are shared by several languages (for example ⠙⠑, "de")
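
To make "likelihood-optimized tokenization" concrete, here is a minimal sketch of how a Unigram tokenizer picks a segmentation: among all ways to split the text into known pieces, it chooses the one with the highest total log-probability (a Viterbi search). The piece probabilities below are hypothetical values chosen for illustration, not the model's learned scores, and `viterbi_segment` is an illustrative helper rather than part of the released code.

```python
import math

THE = "⠞⠓⠑"      # "the"
AND = "⠁⠝⠙"      # "and"
BLANK = "\u2800"  # Braille blank cell, used as the word separator

# Hypothetical piece probabilities (illustrative only, not learned values).
logp = {THE: math.log(0.20), AND: math.log(0.15), BLANK: math.log(0.20)}
for ch in THE + AND:
    logp[ch] = math.log(0.05)  # single-cell fallback pieces

def viterbi_segment(text, logp):
    """Return the highest-scoring split of `text` into pieces found in `logp`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(p) for p in logp)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

text = THE + BLANK + AND
pieces = viterbi_segment(text, logp)
print(pieces, f"{len(text) / len(pieces):.2f}x")
```

With these toy scores the two whole-word pieces win over six single-cell pieces, so 7 cells become 3 tokens. A real SentencePiece model also learns the piece probabilities themselves, which this sketch does not attempt.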

Learned Contractions

Token ID	Braille (gloss)	Languages
15	⠞⠓⠑ (the)	English
22	⠟⠥⠑ (que)	Spanish/French/Portuguese
23	⠁⠝⠙ (and)	English
25	⠕⠋ (of)	English
17	⠙⠑ (de)	Spanish/French/Italian/Portuguese
18	⠇⠁ (la)	Spanish/French/Italian
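
The glosses in the table are consistent with a letter-for-cell mapping from Latin letters to Unicode Braille patterns. The sketch below reproduces that mapping from the standard dot numbering (U+2800 plus one bit per dot). It is an assumption that the training corpus was produced this way, since the pipeline is not described here, and `to_braille` is a hypothetical helper that handles only unaccented lowercase letters and spaces.

```python
# Dots (1-6) for a-j; k-t add dot 3; u, v, x, y, z add dots 3 and 6; w is irregular.
BASE = {
    "a": (1,), "b": (1, 2), "c": (1, 4), "d": (1, 4, 5), "e": (1, 5),
    "f": (1, 2, 4), "g": (1, 2, 4, 5), "h": (1, 2, 5), "i": (2, 4), "j": (2, 4, 5),
}

def cell(dots):
    """Unicode Braille pattern: U+2800 is blank, dot n sets bit n-1."""
    return chr(0x2800 + sum(1 << (d - 1) for d in dots))

def build_table():
    first = "abcdefghij"
    table = {" ": cell(())}
    for ch in first:
        table[ch] = cell(BASE[ch])
    for i, ch in enumerate("klmnopqrst"):
        table[ch] = cell(BASE[first[i]] + (3,))
    for i, ch in enumerate("uvxyz"):
        table[ch] = cell(BASE[first[i]] + (3, 6))
    table["w"] = cell((2, 4, 5, 6))
    return table

TABLE = build_table()

def to_braille(text):
    """Map lowercase a-z and spaces to Braille cells, one cell per character."""
    return "".join(TABLE[ch] for ch in text.lower())

for word in ("the", "que", "and", "of", "de", "la"):
    print(word, to_braille(word))
```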

Training Details

Metric	Value
Parameters	27.8M
Vocabulary	4096 (Unigram)
Training Steps	15,000
Final Loss	2.17
Training Time	3h 22m (Apple MPS backend)
Corpus	32M Braille chars (7 languages)
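
As a rough reading of these figures, the snippet below converts the final loss to a perplexity and the wall-clock time to a throughput. It assumes the reported loss is a mean cross-entropy in nats per token, which is the common convention but is not stated here.

```python
import math

final_loss = 2.17                 # from the table; assumed to be nats per token
steps = 15_000
seconds = 3 * 3600 + 22 * 60      # 3h 22m

perplexity = math.exp(final_loss)
steps_per_second = steps / seconds

print(f"perplexity ~ {perplexity:.2f}")
print(f"throughput ~ {steps_per_second:.2f} steps/s")
```

Under that assumption the perplexity is about 8.76 per token and training ran at about 1.24 steps per second.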

Architecture

Hidden Size: 512
Layers: 8
Attention Heads: 8
Max Sequence Length: 1024
Tokenizer: SentencePiece Unigram
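
As a sanity check, the listed dimensions can be turned into an approximate parameter count. The sketch below assumes a GPT-style decoder with learned positional embeddings, a 4x MLP expansion, and tied input/output embeddings. None of these choices are confirmed here, and biases and layer-norm weights are ignored.

```python
def estimate_params(vocab=4096, d_model=512, n_layers=8, max_len=1024, mlp_ratio=4):
    """Approximate weight count for a GPT-style decoder (assumed layout)."""
    token_emb = vocab * d_model                 # shared with the output layer (assumed tied)
    pos_emb = max_len * d_model                 # learned positions (assumed)
    attn = 4 * d_model * d_model                # Q, K, V and output projections
    mlp = 2 * d_model * (mlp_ratio * d_model)   # up and down projections
    return token_emb + pos_emb + n_layers * (attn + mlp)

total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the estimate is 27,787,264 weights, which is in line with the reported 27.8M.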

Usage

from braille_unigram_model import Braille256UnigramModel, BrailleUnigramTokenizer

model = Braille256UnigramModel.from_pretrained("ryanscottbarrett/braille256-v3")
tokenizer = BrailleUnigramTokenizer.from_pretrained("ryanscottbarrett/braille256-v3/tokenizer")

# Encode a Braille Unicode string ("the quick brown fox")
text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅⠀⠃⠗⠕⠺⠝⠀⠋⠕⠭"
tokens = tokenizer.encode(text)
# Compression ratio: Braille characters (including blanks) per token
print(f"Compression: {len(text)}/{len(tokens)} = {len(text)/len(tokens):.2f}x")

Research Goals

This model is part of the Grade Infinity Braille research project:

Can neural networks discover universal Braille contractions?
Do cross-linguistic patterns emerge from multilingual training?
Is SentencePiece Unigram superior to BPE for Braille?

Citation

@misc{braille256v3,
  author = {Ryan Barrett},
  title = {Braille256-v3: Grade Infinity Universal Braille Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ryanscottbarrett/braille256-v3}
}

License

MIT
