Braille256-v3: Grade Infinity Universal Braille Model
A 27.8M-parameter language model trained natively on Braille Unicode text. It uses SentencePiece Unigram tokenization to learn multi-cell tokens, with the aim of compressing Braille sequences more effectively than BPE.
Key Features
- SentencePiece Unigram: Likelihood-optimized tokenization, evaluated here as an alternative to BPE
- 4096-Token Vocabulary: Learned contractions across 7 languages
- Multilingual: English, French, German, Spanish, Italian, Portuguese, Dutch
- Cross-Linguistic Patterns: Learns multi-cell tokens that are shared across languages, such as ⠙⠑ (de) and ⠇⠁ (la)
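To make "likelihood-optimized" concrete: a Unigram tokenizer assigns each vocabulary piece a probability and chooses the segmentation whose pieces have the highest total log-probability. The sketch below is a minimal pure-Python version of that search, not the SentencePiece implementation. The function name `viterbi_segment` and the toy probabilities are hypothetical and are not taken from this model's vocabulary.

```python
import math


def viterbi_segment(text, log_probs, max_piece_len=8):
    """Return the highest-likelihood split of `text` into known pieces."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of any split of text[:i]
    best_start = [0] * (n + 1)          # start index of the last piece in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[best_start[pos]:pos])
        pos = best_start[pos]
    return pieces[::-1]


# Hypothetical probabilities, chosen only to illustrate the search.
toy_log_probs = {
    "⠞": math.log(0.10),
    "⠓": math.log(0.10),
    "⠑": math.log(0.10),
    "⠞⠓": math.log(0.05),
    "⠞⠓⠑": math.log(0.20),
}

pieces = viterbi_segment("⠞⠓⠑", toy_log_probs)
print(pieces, f"{len('⠞⠓⠑') / len(pieces):.2f}x")
```

With these toy values the single three-cell piece wins, because its probability (0.20) exceeds the product of any split into smaller pieces. That is the mechanism by which frequent cell sequences end up behaving like learned contractions.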
Learned Contractions
| Token ID | Braille (Latin equivalent) | Languages |
|---|---|---|
| 15 | ⠞⠓⠑ (the) | English |
| 22 | ⠟⠥⠑ (que) | Spanish/French/Portuguese |
| 23 | ⠁⠝⠙ (and) | English |
| 25 | ⠕⠋ (of) | English |
| 17 | ⠙⠑ (de) | Spanish/French/Italian/Portuguese |
| 18 | ⠇⠁ (la) | Spanish/French/Italian |
Training Details
| Metric | Value |
|---|---|
| Parameters | 27.8M |
| Vocabulary | 4096 (Unigram) |
| Training Steps | 15,000 |
| Final Loss | 2.17 |
| Training Time | 3h 22m (Apple MPS backend) |
| Corpus | 32M Braille characters (7 languages) |
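Two derived figures help put the table in context: the average time per training step and the perplexity implied by the final loss. The short calculation below is a sketch; it assumes the reported loss is a mean cross-entropy in nats per token, which the table does not state.

```python
import math

training_steps = 15_000
training_seconds = 3 * 3600 + 22 * 60  # 3h 22m
final_loss = 2.17                      # assumed: mean cross-entropy in nats per token

seconds_per_step = training_seconds / training_steps
perplexity = math.exp(final_loss)

print(f"{seconds_per_step:.3f} s/step")
print(f"perplexity ~ {perplexity:.2f}")
```

Under that assumption the run averages about 0.81 seconds per step and ends at a per-token perplexity of roughly 8.8. Because each token can span several Braille cells, this is not directly comparable to a per-character figure.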
Architecture
Hidden Size: 512
Layers: 8
Attention Heads: 8
Max Sequence Length: 1024
Tokenizer: SentencePiece Unigram
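As a sanity check, the parameter count can be estimated from these hyperparameters. The sketch below assumes a GPT-2-style decoder with a 4x MLP expansion, learned positional embeddings, and input/output embeddings that share weights; it ignores biases and layer-norm parameters. None of these details are stated in this card, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(vocab_size, hidden, layers, max_seq_len, mlp_ratio=4):
    """Rough weight count for a decoder-only transformer (assumptions above)."""
    token_emb = vocab_size * hidden        # shared with the output projection
    pos_emb = max_seq_len * hidden         # learned positional embeddings
    attention = 4 * hidden * hidden        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden * hidden  # up- and down-projection
    return token_emb + pos_emb + layers * (attention + mlp)


total = estimate_params(vocab_size=4096, hidden=512, layers=8, max_seq_len=1024)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

The estimate comes to 27,787,264, which rounds to the reported 27.8M. That agreement is consistent with the assumed layout but does not confirm it.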
Usage
from braille_unigram_model import Braille256UnigramModel, BrailleUnigramTokenizer
model = Braille256UnigramModel.from_pretrained("ryanscottbarrett/braille256-v3")
tokenizer = BrailleUnigramTokenizer.from_pretrained("ryanscottbarrett/braille256-v3/tokenizer")
# Encode Braille text
text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅⠀⠃⠗⠕⠺⠝⠀⠋⠕⠭"
tokens = tokenizer.encode(text)
print(f"Compression: {len(text)}/{len(tokens)} = {len(text)/len(tokens):.2f}x")
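The input in the example above is uncontracted Braille written as Unicode Braille Patterns. The sketch below shows one way to produce such input from lowercase Latin text, using the standard dot numbers for the 26 basic letters. The helper names `to_braille` and `from_braille` are hypothetical and are not part of `braille_unigram_model`. Mapping spaces to the blank cell U+2800 is an assumption that appears to match the example string. Accented letters, digits, and punctuation are not handled.

```python
# Dot numbers for the 26 basic Latin letters in uncontracted braille.
LETTER_DOTS = {
    "a": "1", "b": "12", "c": "14", "d": "145", "e": "15",
    "f": "124", "g": "1245", "h": "125", "i": "24", "j": "245",
    "k": "13", "l": "123", "m": "134", "n": "1345", "o": "135",
    "p": "1234", "q": "12345", "r": "1235", "s": "234", "t": "2345",
    "u": "136", "v": "1236", "w": "2456", "x": "1346", "y": "13456",
    "z": "1356",
}

BRAILLE_BASE = 0x2800  # U+2800 is the blank Braille pattern


def cell(dots):
    """Build a Braille character: dot n sets bit n-1 above U+2800."""
    offset = sum(1 << (int(d) - 1) for d in dots)
    return chr(BRAILLE_BASE + offset)


TO_BRAILLE = {letter: cell(dots) for letter, dots in LETTER_DOTS.items()}
TO_BRAILLE[" "] = chr(BRAILLE_BASE)  # assumption: spaces become the blank cell
FROM_BRAILLE = {braille: letter for letter, braille in TO_BRAILLE.items()}


def to_braille(text):
    """Transliterate a-z and spaces; raises KeyError on anything else."""
    return "".join(TO_BRAILLE[ch] for ch in text.lower())


def from_braille(cells):
    """Inverse of to_braille for the same 27 characters."""
    return "".join(FROM_BRAILLE[c] for c in cells)


print(to_braille("the quick brown fox"))
```

The resulting string can be passed to `tokenizer.encode` in place of a hand-typed Braille literal, and `from_braille` is handy for reading decoded model output.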
Research Goals
This model is part of the Grade Infinity Braille research project, which investigates three questions:
- Can neural networks discover universal Braille contractions?
- Do cross-linguistic patterns emerge from multilingual training?
- Is SentencePiece Unigram superior to BPE for Braille?
Citation
@misc{braille256v3,
author = {Ryan Barrett},
title = {Braille256-v3: Grade Infinity Universal Braille Model},
year = {2024},
publisher = {HuggingFace},
url = {https://huggingface.co/ryanscottbarrett/braille256-v3}
}
License
MIT