braille-byt5-v3: Grade 2 Contracted Braille to English
The first open ML model for Grade 2 (contracted) Braille to English translation.
This model translates Unicode Braille text (U+2800–U+283F) into English, handling the 180+ contraction rules used in Grade 2 Unified English Braille (UEB). Grade 2 is used in 90–95% of real-world Braille documents, yet no prior model supports it.
Key Results
| Dataset | Exact Match | CER | BLEU |
|---|---|---|---|
| Real-world held-out (42 samples) | 92.9% | 0.004 | 0.838 |
| Synthetic test (1,396 samples) | 89.8% | 0.019 | 0.834 |
| Liblouis back-translation baseline | 10.3% | 0.260 | – |
The model is ~9x more accurate than liblouis back-translation on exact match and ~13x better on character error rate.
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("prasanthmj/braille-byt5-v3")
model = AutoModelForSeq2SeqLM.from_pretrained("prasanthmj/braille-byt5-v3")

# Input: task prefix + Unicode Braille
text = "translate Braille to English: ⠠⠺⠁⠽⠀⠙⠪⠝⠀⠎⠳⠹⠀⠐⠱⠀⠮⠀⠚⠥⠝⠛⠇⠑⠀⠛⠗⠪⠎⠂"
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(**inputs, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: "Way down south where the jungle grows,"
```
Input Format
The model expects a task prefix followed by Unicode Braille characters:
```
translate Braille to English: ⠠⠺⠁⠽⠀⠙⠪⠝
```
Each Unicode Braille character (U+2800–U+283F) represents one 6-dot Braille cell. The model processes these as raw UTF-8 bytes: each Braille character becomes 3 bytes internally.
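For illustration, the cell-to-byte relationship described above can be checked with plain Python, no model required (the variable names below are just for this sketch):

```python
# Each Braille cell U+2800-U+283F encodes its dot pattern in the low
# 6 bits of the codepoint and serializes to 3 UTF-8 bytes.
cell = "\u283a"  # dots 2-4-5-6, the letter "w"

raw = cell.encode("utf-8")
print(len(raw))                 # 3 bytes per cell
print([hex(b) for b in raw])    # ['0xe2', '0xa0', '0xba']

# Recover the 6-bit dot pattern (0-63) from the codepoint:
dots = ord(cell) - 0x2800
print(dots)                     # 58 == 0b111010 -> dots 2, 4, 5, 6
```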
Model Details
| Parameter | Value |
|---|---|
| Base model | google/byt5-small |
| Parameters | 300M |
| Architecture | T5 encoder-decoder (byte-level, 259-token vocab) |
| Encoder layers | 12 |
| Decoder layers | 4 |
| Hidden size | 1472 |
| Attention heads | 6 |
| Training hardware | NVIDIA A100 80GB |
| Training precision | bf16 |
| Epochs | 10 |
| Learning rate | 1e-4 (cosine decay, 10% warmup) |
| Effective batch size | 32 (batch 8 x grad accum 4) |
| Training time | ~3 hours |
| Final train loss | 0.049 |
| Final val loss | 0.008 |
Training Data
25,138 synthetic sentence pairs generated from 5 public domain books (Project Gutenberg) using Liblouis Grade 2 forward translation:
- Moby-Dick – Herman Melville
- Pride and Prejudice – Jane Austen
- Alice's Adventures in Wonderland – Lewis Carroll
- A Christmas Carol – Charles Dickens
- The Adventures of Sherlock Holmes – Arthur Conan Doyle
Pipeline: English text → Liblouis UEB Grade 2 → BRF → cell codes (0–63) → Unicode Braille (U+2800–U+283F)
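The final pipeline step is a direct offset into the Unicode Braille Patterns block; a minimal sketch (the helper name is illustrative, not from the repository):

```python
def cells_to_unicode(cell_codes):
    """Map 6-dot cell codes (0-63) to Unicode Braille Patterns (U+2800-U+283F)."""
    return "".join(chr(0x2800 + code) for code in cell_codes)

# 0 is the blank cell; 58 (0x3A) is "w" (dots 2-4-5-6)
print(cells_to_unicode([58, 0, 58]))  # '⠺⠀⠺'
```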
The model was not trained on the real-world evaluation data (jellybean held-out set), which was manually transcribed by a human. The fact that synthetic training data generalizes to real human-transcribed Braille is a key finding.
Evaluation Details
Jellybean held-out set (42 human-transcribed samples from a children's book):
- Raw exact match: 76.2% (before quote normalization)
- Normalized exact match: 92.9% (after normalizing smart quotes to ASCII)
- Only 3 errors after normalization, all minor (CER < 0.1)
- Zero hallucinations: every prediction, including the three misses, is grounded in its input rather than fabricated
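The exact normalization script is not reproduced here, but a typical smart-quote-to-ASCII mapping of the kind described above looks like this (illustrative sketch, an assumption about the evaluation's normalization):

```python
# Map curly quotes to their ASCII equivalents before exact-match scoring.
SMART_QUOTES = {
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
}

def normalize_quotes(text: str) -> str:
    return text.translate(str.maketrans(SMART_QUOTES))

print(normalize_quotes("\u201cDon\u2019t\u201d"))  # "Don't"
```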
Synthetic test set (1,396 samples, same distribution as training):
- Normalized exact match: 89.8%
- 142 errors, dominated by output truncation on long sentences (avg miss length: 265 chars vs 99 chars overall)
Liblouis back-translation baseline:
- 10.3% exact match (case-insensitive), 0.260 CER
- Produces garbled output with escape sequences for many contractions
- Essentially unusable for real text compared to the ML model
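For reference, the character error rate used above is edit distance divided by reference length; a minimal self-contained implementation (not the project's exact evaluation script):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return edit_distance(prediction, reference) / max(len(reference), 1)

print(cer("jungle", "jungle"))  # 0.0
print(cer("jumgle", "jungle"))  # one substitution over six characters
```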
Known Limitations
- Long sentence truncation. Sentences longer than ~200 characters may be truncated during generation. Use `max_length=512` to mitigate.
- Number encoding. Braille number indicators (`⠼` + letter) are sometimes decoded to wrong digits. Chapter headings like "CHAPTER 69" may produce wrong numbers.
- Special characters. Ligatures (ล, รฆ), currency symbols (ยฃ), and non-ASCII punctuation may not translate correctly.
- English only. Trained on UEB (Unified English Braille) Grade 2 contractions. Does not support other languages or Braille codes.
- No Nemeth (math). Mathematical Braille notation is not supported.
- Stage 2 only. This model translates Braille text (Unicode), not Braille images. A separate detection model (e.g., YOLOv8) is needed to extract cell patterns from photographs.
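One workaround for the truncation limitation is to split long input at blank cells and translate each chunk independently; a rough sketch (the helper name and 200-cell threshold are assumptions, not part of the model):

```python
def split_braille(text: str, max_cells: int = 200) -> list[str]:
    """Greedily pack blank-cell-separated Braille words into chunks
    of at most max_cells characters, splitting only at word boundaries."""
    chunks, current = [], ""
    for word in text.split("\u2800"):   # U+2800 is the blank cell
        candidate = word if not current else current + "\u2800" + word
        if len(candidate) <= max_cells:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be run through `model.generate` separately and the English outputs joined with spaces.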
Why ByT5?
Previous attempts with T5-small failed:
- v1 (T5 + custom tokens): 64 randomly-initialized embeddings for Braille cells → 2.9% accuracy
- v2 (T5 + Unicode Braille): T5's SentencePiece tokenizer maps all Braille characters to `<unk>` → 0% accuracy
- v3 (ByT5 + Unicode Braille): byte-level processing handles Unicode Braille natively (each char = 3 UTF-8 bytes with pre-trained embeddings) → 92.9% accuracy
ByT5's byte-level architecture is uniquely suited for Braille because it requires no tokenizer modifications and has pre-trained representations for all possible byte values.
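To make "no tokenizer modifications" concrete: ByT5's vocabulary is essentially the 256 byte values offset by 3 special-token ids (pad, eos, unk), so tokenization is deterministic and can be sketched without the tokenizer class (a simplified illustration, ignoring the eos token the real tokenizer appends):

```python
def byt5_ids(text: str) -> list[int]:
    # ByT5 reserves ids 0-2 for pad/eos/unk; byte value b maps to id b + 3.
    return [b + 3 for b in text.encode("utf-8")]

print(byt5_ids("\u283a"))  # [229, 163, 189] -> the 3 bytes of one cell
```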
Important: ByT5 requires bf16 or fp32 precision. fp16 causes numerical overflow in attention softmax on long byte sequences.
Intended Use
- Braille OCR pipelines: Stage 2 interpretation after cell detection from images
- Accessibility tools: Help sighted users (teachers, parents) read Grade 2 Braille
- Research: Baseline for Grade 2 Braille interpretation methods
- Education: Understanding contracted Braille patterns
Citation
If you use this model, please cite the repository:
```bibtex
@software{braille_byt5_v3_2026,
  title={braille-byt5-v3: Grade 2 Contracted Braille to English Translation},
  author={Prasanth Janardhanan},
  year={2026},
  url={https://github.com/braille-reader/braille-transcriber}
}
```
License
MIT License: free for any use, including commercial applications.
Links
- GitHub: braille-transcriber
- Base model: google/byt5-small
- Evaluation report: evaluation-report-v3.md