braille-byt5-v3: Grade 2 Contracted Braille to English
The first open ML model for Grade 2 (contracted) Braille to English translation.
This model translates Unicode Braille text (U+2800–U+283F) into English, handling the 180+ contraction rules used in Grade 2 Unified English Braille (UEB). Grade 2 is used in 90–95% of real-world Braille documents, yet no prior model supports it.
Key Results
| Dataset | Exact Match | CER | BLEU |
|---|---|---|---|
| Real-world held-out (42 samples) | 92.9% | 0.004 | 0.838 |
| Synthetic test (1,396 samples) | 89.8% | 0.019 | 0.834 |
| Liblouis back-translation baseline | 10.3% | 0.260 | – |
The model is ~9x more accurate than liblouis back-translation on exact match and ~13x better on character error rate.
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("prasanthmj/braille-byt5-v3")
model = AutoModelForSeq2SeqLM.from_pretrained("prasanthmj/braille-byt5-v3")

# Input: task prefix + Unicode Braille
text = "translate Braille to English: ⠠⠺⠁⠽⠀⠙⠪⠝⠀⠎⠳⠹⠀⠐⠱⠀⠮⠀⠚⠥⠝⠛⠇⠑⠀⠛⠗⠪⠎⠂"
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(**inputs, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: "Way down south where the jungle grows,"
```
Input Format
The model expects a task prefix followed by Unicode Braille characters:
```
translate Braille to English: ⠠⠺⠁⠽⠀⠙⠪⠝
```
Each Unicode Braille character (U+2800–U+283F) represents one 6-dot Braille cell. The model processes these as raw UTF-8 bytes: each Braille character becomes 3 bytes internally.
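For illustration, the cell-to-byte relationship described above can be checked with plain Python, no model required (the variable names below are just for this sketch):

```python
# Each Braille cell U+2800-U+283F encodes its dot pattern in the low
# 6 bits of the codepoint and serializes to 3 UTF-8 bytes.
cell = "\u283a"  # dots 2-4-5-6, the letter "w"

raw = cell.encode("utf-8")
print(len(raw))                 # 3 bytes per cell
print([hex(b) for b in raw])    # ['0xe2', '0xa0', '0xba']

# Recover the 6-bit dot pattern (0-63) from the codepoint:
dots = ord(cell) - 0x2800
print(dots)                     # 58 == 0b111010 -> dots 2, 4, 5, 6
```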
Model Details
| Parameter | Value |
|---|---|
| Base model | google/byt5-small |
| Parameters | 300M |
| Architecture | T5 encoder-decoder (byte-level, 259-token vocab) |
| Encoder layers | 12 |
| Decoder layers | 4 |
| Hidden size | 1472 |
| Attention heads | 6 |
| Training hardware | NVIDIA A100 80GB |
| Training precision | bf16 |
| Epochs | 10 |
| Learning rate | 1e-4 (cosine decay, 10% warmup) |
| Effective batch size | 32 (batch 8 x grad accum 4) |
| Training time | ~3 hours |
| Final train loss | 0.049 |
| Final val loss | 0.008 |
Training Data
25,138 synthetic sentence pairs generated from 5 public domain books (Project Gutenberg) using Liblouis Grade 2 forward translation:
- Moby-Dick – Herman Melville
- Pride and Prejudice – Jane Austen
- Alice's Adventures in Wonderland – Lewis Carroll
- A Christmas Carol – Charles Dickens
- The Adventures of Sherlock Holmes – Arthur Conan Doyle
Pipeline: English text → Liblouis UEB Grade 2 → BRF → cell codes (0–63) → Unicode Braille (U+2800–U+283F)
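The final pipeline step is a direct offset into the Unicode Braille Patterns block; a minimal sketch (the helper name is illustrative, not from the repository):

```python
def cells_to_unicode(cell_codes):
    """Map 6-dot cell codes (0-63) to Unicode Braille Patterns (U+2800-U+283F)."""
    return "".join(chr(0x2800 + code) for code in cell_codes)

# 0 is the blank cell; 58 (0x3A) is "w" (dots 2-4-5-6)
print(cells_to_unicode([58, 0, 58]))  # '⠺⠀⠺'
```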
The model was not trained on the real-world evaluation data (jellybean held-out set), which was manually transcribed by a human. The fact that synthetic training data generalizes to real human-transcribed Braille is a key finding.
Evaluation Details
Jellybean held-out set (42 human-transcribed samples from a children's book):
- Raw exact match: 76.2% (before quote normalization)
- Normalized exact match: 92.9% (after normalizing smart quotes to ASCII)
- Only 3 errors after normalization, all minor (CER < 0.1)
- Zero hallucinations: every prediction, including the three misses, is grounded in its input rather than fabricated
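The exact normalization script is not reproduced here, but a typical smart-quote-to-ASCII mapping of the kind described above looks like this (illustrative sketch, an assumption about the evaluation's normalization):

```python
# Map curly quotes to their ASCII equivalents before exact-match scoring.
SMART_QUOTES = {
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
}

def normalize_quotes(text: str) -> str:
    return text.translate(str.maketrans(SMART_QUOTES))

print(normalize_quotes("\u201cDon\u2019t\u201d"))  # "Don't"
```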
Synthetic test set (1,396 samples, same distribution as training):
- Normalized exact match: 89.8%
- 142 errors, dominated by output truncation on long sentences (avg miss length: 265 chars vs 99 chars overall)
Liblouis back-translation baseline:
- 10.3% exact match (case-insensitive), 0.260 CER
- Produces garbled output with escape sequences for many contractions
- Essentially unusable for real text compared to the ML model
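For reference, the character error rate used above is edit distance divided by reference length; a minimal self-contained implementation (not the project's exact evaluation script):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return edit_distance(prediction, reference) / max(len(reference), 1)

print(cer("jungle", "jungle"))  # 0.0
print(cer("jumgle", "jungle"))  # one substitution over six characters
```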
Known Limitations
- Long sentence truncation. Sentences longer than ~200 characters may be truncated during generation. Use `max_length=512` to mitigate.
- Number encoding. Braille number indicators (`⠼` + letter) are sometimes decoded to wrong digits. Chapter headings like "CHAPTER 69" may produce wrong numbers.
- Special characters. Ligatures (ล, รฆ), currency symbols (ยฃ), and non-ASCII punctuation may not translate correctly.
- English only. Trained on UEB (Unified English Braille) Grade 2 contractions. Does not support other languages or Braille codes.
- No Nemeth (math). Mathematical Braille notation is not supported.
- Stage 2 only. This model translates Braille text (Unicode), not Braille images. A separate detection model (e.g., YOLOv8) is needed to extract cell patterns from photographs.
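One workaround for the truncation limitation is to split long input at blank cells and translate each chunk independently; a rough sketch (the helper name and 200-cell threshold are assumptions, not part of the model):

```python
def split_braille(text: str, max_cells: int = 200) -> list[str]:
    """Greedily pack blank-cell-separated Braille words into chunks
    of at most max_cells characters, splitting only at word boundaries."""
    chunks, current = [], ""
    for word in text.split("\u2800"):   # U+2800 is the blank cell
        candidate = word if not current else current + "\u2800" + word
        if len(candidate) <= max_cells:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be run through `model.generate` separately and the English outputs joined with spaces.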
Why ByT5?
Previous attempts with T5-small failed:
- v1 (T5 + custom tokens): 64 randomly-initialized embeddings for Braille cells → 2.9% accuracy
- v2 (T5 + Unicode Braille): T5's SentencePiece tokenizer maps all Braille characters to `<unk>` → 0% accuracy
- v3 (ByT5 + Unicode Braille): byte-level processing handles Unicode Braille natively (each char = 3 UTF-8 bytes with pre-trained embeddings) → 92.9% accuracy
ByT5's byte-level architecture is uniquely suited for Braille because it requires no tokenizer modifications and has pre-trained representations for all possible byte values.
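To make "no tokenizer modifications" concrete: ByT5's vocabulary is essentially the 256 byte values offset by 3 special-token ids (pad, eos, unk), so tokenization is deterministic and can be sketched without the tokenizer class (a simplified illustration, ignoring the eos token the real tokenizer appends):

```python
def byt5_ids(text: str) -> list[int]:
    # ByT5 reserves ids 0-2 for pad/eos/unk; byte value b maps to id b + 3.
    return [b + 3 for b in text.encode("utf-8")]

print(byt5_ids("\u283a"))  # [229, 163, 189] -> the 3 bytes of one cell
```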
Important: ByT5 requires bf16 or fp32 precision. fp16 causes numerical overflow in attention softmax on long byte sequences.
Intended Use
- Braille OCR pipelines: Stage 2 interpretation after cell detection from images
- Accessibility tools: Help sighted users (teachers, parents) read Grade 2 Braille
- Research: Baseline for Grade 2 Braille interpretation methods
- Education: Understanding contracted Braille patterns
Citation
If you use this model, please cite the repository:
```bibtex
@software{braille_byt5_v3_2026,
  title={braille-byt5-v3: Grade 2 Contracted Braille to English Translation},
  author={Prasanth Janardhanan},
  year={2026},
  url={https://github.com/braille-reader/braille-transcriber}
}
```
License
MIT License: free for any use, including commercial applications.
Links
- GitHub: braille-transcriber
- Base model: google/byt5-small
- Evaluation report: evaluation-report-v3.md