# Braille256-v1
A language model trained exclusively on Unicode Braille characters (U+2800-U+28FF), demonstrating emergent discovery of Braille contraction patterns.
## Model Description
Braille256 is a transformer-based language model whose vocabulary consists of the 256 Unicode Braille cells plus a handful of special tokens. Unlike traditional language models that use subword tokenization, Braille256 treats each Braille cell as a single token, enabling the model to learn structural patterns inherent to the Braille writing system.
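Because every Braille cell is a single token, encoding can be as simple as subtracting the U+2800 base codepoint. The snippet below is a minimal sketch of that idea (ignoring special tokens); the released Braille256Tokenizer may use a different id layout.

```python
# Sketch: one token per Braille cell, assuming ids mirror the U+2800 codepoint offset.
# The actual Braille256Tokenizer mapping may differ (e.g., special tokens come first).
BRAILLE_BASE = 0x2800

def encode_cells(text: str) -> list[int]:
    """Map each Braille cell to an integer in [0, 255]."""
    return [ord(ch) - BRAILLE_BASE for ch in text if 0 <= ord(ch) - BRAILLE_BASE < 256]

def decode_cells(ids: list[int]) -> str:
    """Map cell ids back to Braille characters."""
    return "".join(chr(BRAILLE_BASE + i) for i in ids)

print(encode_cells("⠞⠓⠑"))      # [30, 19, 17] under this assumed mapping
print(decode_cells([30, 19, 17]))  # "⠞⠓⠑"
```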
## Key Features
- Pure Braille Vocabulary: 256 Braille characters + 5 special tokens
- Dot-Pattern Embeddings: Custom initialization based on Braille dot patterns (see the sketch after this list)
- Emergent Contractions: Model independently discovers patterns similar to Grade-2 Braille
- Lightweight: ~5M parameters, runs on CPU
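The dot-pattern initialization is not specified in detail in this card. The sketch below shows one way such an initialization could look, seeding the first 8 dimensions of each cell's embedding with its dot pattern (bit *i* of the U+2800 codepoint offset corresponds to dot *i*+1); the index layout for the 5 special tokens is an assumption.

```python
import torch

def dot_pattern_init(vocab_size: int = 261, hidden_size: int = 256,
                     num_special: int = 5) -> torch.Tensor:
    """Illustrative embedding init: write each cell's 8-bit dot pattern into the
    first 8 dimensions of its embedding row, leaving the rest small and random.
    The model's actual initialization scheme may differ."""
    emb = torch.randn(vocab_size, hidden_size) * 0.02
    for offset in range(256):                          # Braille cells U+2800..U+28FF
        bits = [(offset >> d) & 1 for d in range(8)]   # dot 1..8 as bits 0..7
        emb[num_special + offset, :8] = torch.tensor(bits, dtype=torch.float)
    return emb

embeddings = dot_pattern_init()
print(embeddings.shape)  # torch.Size([261, 256])
```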
## Architecture
| Parameter | Value |
|---|---|
| Parameters | 4,940,544 |
| Vocabulary | 261 |
| Hidden Size | 256 |
| Layers | 6 |
| Attention Heads | 4 |
| Max Sequence Length | 256 |
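For reference, a hypothetical configuration object mirroring the table above; the real Braille256Config field names are not documented in this card and may differ.

```python
from dataclasses import dataclass

@dataclass
class Braille256ConfigSketch:
    """Assumed field names; values taken from the architecture table."""
    vocab_size: int = 261              # 256 Braille cells + 5 special tokens
    hidden_size: int = 256
    num_hidden_layers: int = 6
    num_attention_heads: int = 4
    max_position_embeddings: int = 256

print(Braille256ConfigSketch())
```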
## Emergent Patterns
The model learned to recognize common letter combinations that mirror official Grade-2 Braille contractions:
| Pattern | Meaning | Learned Frequency |
|---|---|---|
| ⠞⠓ | th | High |
| ⠞⠓⠑ | the | High |
| ⠊⠎ | is | High |
| ⠋⠕⠗ | for | High |
| ⠺⠊⠞⠓ | with | Medium |
| ⠁⠝⠙ | and | Medium |
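The card does not define how "learned frequency" was measured. One simple probe is to count cell n-grams in generated samples, as in the sketch below; the hard-coded strings stand in for actual model output.

```python
from collections import Counter

def ngram_counts(samples: list[str], n: int = 2) -> Counter:
    """Count Braille cell n-grams across a list of generated strings."""
    counts = Counter()
    for text in samples:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

# In practice `samples` would be decoded model generations.
samples = ["⠞⠓⠑⠀⠋⠕⠗", "⠞⠓⠑⠀⠊⠎"]
print(ngram_counts(samples, n=2).most_common(3))
```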
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (a causal-LM head is needed for .generate();
# a custom architecture like this may also require trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("your-username/braille256-v1")
tokenizer = AutoTokenizer.from_pretrained("your-username/braille256-v1")

# Generate text
prompt = "⠞⠓⠑"  # "the" in Braille
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
### Direct Usage

```python
import torch

from braille256_model import Braille256Model, Braille256Config
from braille256_tokenizer import Braille256Tokenizer

model = Braille256Model.from_pretrained("path/to/model")
tokenizer = Braille256Tokenizer.from_pretrained("path/to/model")

# Encode Braille text
text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅⠀⠃⠗⠕⠺⠝⠀⠋⠕⠭"
tokens = tokenizer.encode(text)

# Generate from the encoded prompt
output = model.generate(torch.tensor([tokens]), max_length=100)
print(tokenizer.decode(output[0].tolist()))
```
## Training
The model was trained on:
- Sample corpus of English text converted to Braille
- 2,000 training steps
- Batch size 8, learning rate 5e-4
- ~55 minutes on CPU
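A minimal sketch of that recipe, assuming `model` is the causal LM loaded as in the Usage section (returning an HF-style output with a `.loss`) and `token_ids` is a pre-encoded `(num_examples, seq_len)` tensor of Braille token ids; the actual training script is not included in this card.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stated hyperparameters: batch size 8, learning rate 5e-4, 2,000 steps.
dataset = TensorDataset(token_ids)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

step, max_steps = 0, 2000
while step < max_steps:
    for (batch,) in loader:
        outputs = model(input_ids=batch, labels=batch)  # next-token prediction
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= max_steps:
            break
```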
### Training Loss Curve
- Initial loss: 3.76
- Final loss: 0.0022
## Limitations
- Trained on limited corpus (sample texts only)
- May generate repetitive patterns
- Not suitable for production accessibility applications without further training
- English-only training data
## Intended Use
This model is intended for:
- Research into emergent linguistic patterns
- Exploring Braille representation learning
- Educational demonstrations of language model training
- Foundation for larger Braille-native models
## Citation

```bibtex
@misc{braille256,
  title        = {Braille256: A Language Model That Rediscovers Braille Contractions},
  author       = {Your Name},
  year         = {2024},
  howpublished = {HuggingFace Hub}
}
```
## License
MIT License