Upload folder using huggingface_hub
- README.md +102 -0
- config.json +16 -0
- pytorch_model.bin +3 -0
- tokenizer.model +3 -0
- tokenizer.vocab +0 -0
README.md
ADDED
@@ -0,0 +1,102 @@
---
license: mit
language:
- en
- fr
- de
- es
- it
- pt
- nl
tags:
- braille
- accessibility
- language-model
- grade-infinity
- sentencepiece
- unigram
datasets:
- project-gutenberg
pipeline_tag: text-generation
---

# Braille256-v3: Grade Infinity Universal Braille Model

A **27.8M-parameter** language model trained natively on Braille Unicode, using **SentencePiece Unigram** tokenization to improve compression.

## Key Features

- **SentencePiece Unigram**: Likelihood-optimized tokenization, evaluated as an alternative to BPE
- **4096 Vocabulary**: Learned contractions across 7 languages
- **Multilingual**: English, French, German, Spanish, Italian, Portuguese, Dutch
- **Cross-Linguistic Patterns**: Learns contractions shared across languages, such as ⠙⠑ (de) and ⠇⠁ (la)

### Learned Contractions

| Token ID | Braille (Latin) | Languages |
|----------|-----------------|-----------|
| 15 | ⠞⠓⠑ (the) | English |
| 22 | ⠟⠥⠑ (que) | Spanish/French/Portuguese |
| 23 | ⠁⠝⠙ (and) | English |
| 25 | ⠕⠋ (of) | English |
| 17 | ⠙⠑ (de) | Spanish/French/Italian/Portuguese |
| 18 | ⠇⠁ (la) | Spanish/French/Italian |

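The Latin glosses in parentheses can be checked mechanically. The sketch below assumes they are a letter-by-letter (uncontracted) transliteration of six-dot cells, with the blank cell U+2800 as the word separator; that is inferred from the table and the usage example rather than stated in this README. `DOTS`, `to_braille`, and `from_braille` are illustrative names, not part of the project's code.

```python
# Illustrative helpers, not part of the braille256 codebase.
# Standard six-dot patterns for the basic Latin letters a-z.
DOTS = {
    "a": "1", "b": "12", "c": "14", "d": "145", "e": "15",
    "f": "124", "g": "1245", "h": "125", "i": "24", "j": "245",
    "k": "13", "l": "123", "m": "134", "n": "1345", "o": "135",
    "p": "1234", "q": "12345", "r": "1235", "s": "234", "t": "2345",
    "u": "136", "v": "1236", "w": "2456", "x": "1346", "y": "13456",
    "z": "1356",
}


def cell(dots: str) -> str:
    """Map dot numbers to a Unicode Braille cell: dot n sets bit n-1 above U+2800."""
    return chr(0x2800 + sum(1 << (int(d) - 1) for d in dots))


LETTER_TO_CELL = {letter: cell(dots) for letter, dots in DOTS.items()}
LETTER_TO_CELL[" "] = "\u2800"  # assumption: blank cell separates words
CELL_TO_LETTER = {c: letter for letter, c in LETTER_TO_CELL.items()}


def to_braille(text: str) -> str:
    """Transliterate a-z and spaces; any other character raises KeyError."""
    return "".join(LETTER_TO_CELL[ch] for ch in text.lower())


def from_braille(cells: str) -> str:
    """Inverse of to_braille for the same limited alphabet."""
    return "".join(CELL_TO_LETTER[c] for c in cells)


# Print each table entry next to its transliteration.
for cells in ["⠞⠓⠑", "⠟⠥⠑", "⠁⠝⠙", "⠕⠋", "⠙⠑", "⠇⠁"]:
    print(cells, from_braille(cells))
```

Accented letters, digits, and punctuation are deliberately left out, so this only covers the plain a-z entries shown in the table.
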
### Training Details

| Metric | Value |
|--------|-------|
| Parameters | 27.8M |
| Vocabulary | 4096 (Unigram) |
| Training Steps | 15,000 |
| Final Loss | 2.17 |
| Training Time | 3h 22m (MPS) |
| Corpus | 32M Braille chars (7 languages) |

### Architecture

```
Hidden Size: 512
Layers: 8
Attention Heads: 8
Max Sequence Length: 1024
Tokenizer: SentencePiece Unigram
```

## Usage

```python
from braille_unigram_model import Braille256UnigramModel, BrailleUnigramTokenizer

model = Braille256UnigramModel.from_pretrained("ryanscottbarrett/braille256-v3")
tokenizer = BrailleUnigramTokenizer.from_pretrained("ryanscottbarrett/braille256-v3/tokenizer")

# Encode Braille text
text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅⠀⠃⠗⠕⠺⠝⠀⠋⠕⠭"
tokens = tokenizer.encode(text)
print(f"Compression: {len(text)}/{len(tokens)} = {len(text)/len(tokens):.2f}x")
```

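Since `tokenizer.model` is uploaded as a SentencePiece model file, the same compression figure can presumably be measured without the project's wrapper classes. The following is a sketch under that assumption, using the public `sentencepiece` and `huggingface_hub` packages; `compression_ratio` and `load_sentencepiece` are illustrative names. Whether the wrapper's `encode` adds special tokens or normalizes text differently is not documented here, so the two numbers may not match exactly.

```python
from typing import Callable, Sequence


def compression_ratio(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Braille cells per token; a higher value means stronger compression."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text) / len(tokens)


def load_sentencepiece(repo_id: str = "ryanscottbarrett/braille256-v3"):
    """Download tokenizer.model from the Hub and load it (needs network access)."""
    # Imported lazily so compression_ratio stays usable without these packages.
    import sentencepiece as spm
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.model")
    return spm.SentencePieceProcessor(model_file=model_path)
```

With both packages installed, `sp = load_sentencepiece()` followed by `compression_ratio(text, sp.encode)` would give the ratio for the sample sentence above.
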
## Research Goals

This model is part of the **Grade Infinity Braille** research project:

1. Can neural networks discover universal Braille contractions?
2. Do cross-linguistic patterns emerge from multilingual training?
3. Is SentencePiece Unigram superior to BPE for Braille?

## Citation

```bibtex
@misc{braille256v3,
  author = {Ryan Barrett},
  title = {Braille256-v3: Grade Infinity Universal Braille Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ryanscottbarrett/braille256-v3}
}
```

## License

MIT

config.json
ADDED
@@ -0,0 +1,16 @@
{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-06,
  "max_position_embeddings": 1024,
  "model_type": "braille256_unigram",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "pad_token_id": 0,
  "transformers_version": "4.57.1",
  "vocab_size": 4096
}

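As a rough consistency check, the parameter count can be estimated from these fields. The sketch below assumes a GPT-style decoder with learned position embeddings, biased linear layers, two LayerNorms per block plus a final one, and an output head tied to the token embedding. None of that is stated in the config, and `estimate_parameters` is an illustrative name.

```python
def estimate_parameters(cfg: dict, tied_head: bool = True) -> int:
    """Estimate the weight count of an assumed GPT-style decoder."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    vocab = cfg["vocab_size"]
    positions = cfg["max_position_embeddings"]
    layers = cfg["num_hidden_layers"]

    embeddings = vocab * h + positions * h   # token + learned position tables
    attention = 4 * (h * h + h)              # Q, K, V, output projections with biases
    mlp = (h * ffn + ffn) + (ffn * h + h)    # up- and down-projection with biases
    norms = 2 * (2 * h)                      # two LayerNorms (scale + shift) per block
    final_norm = 2 * h
    head = 0 if tied_head else vocab * h     # untied head adds another vocab x h matrix

    return embeddings + layers * (attention + mlp + norms) + final_norm + head


# Values copied from the config.json above.
config = {
    "hidden_size": 512,
    "intermediate_size": 2048,
    "max_position_embeddings": 1024,
    "num_hidden_layers": 8,
    "vocab_size": 4096,
}

total = estimate_parameters(config)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate is 27,841,536, in line with the 27.8M figure in the README; with an untied output head it would be about 29.9M.
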
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ba4301c0fb126d5b6559e428fb23c68830cfc8a94781018fc7f2500e3e0953c6
size 111412467

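The pointer records only the object's hash and byte size, but the size alone allows a quick sanity check on the checkpoint. A sketch, assuming the weights are stored as 32-bit floats (the commit does not state the dtype) and ignoring serialization overhead; `parse_lfs_pointer` is an illustrative helper:

```python
# Pointer text copied from the pytorch_model.bin entry above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ba4301c0fb126d5b6559e428fb23c68830cfc8a94781018fc7f2500e3e0953c6
size 111412467
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    fields["size"] = int(fields["size"])
    return fields


pointer = parse_lfs_pointer(POINTER)
size_mib = pointer["size"] / 2**20
implied_params = pointer["size"] // 4  # assumption: 4 bytes per float32 value

print(f"{size_mib:.1f} MiB, at most ~{implied_params / 1e6:.2f}M float32 values")
```

The size works out to roughly 27.85M four-byte values, which is consistent with the 27.8M parameters stated in the README once a small amount of file overhead is allowed for.
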
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45dbe04483f0b4ca187663baa07a7cd157ccb7b10ef460cd98b7a5d2c0ed67b2
size 338049

tokenizer.vocab
ADDED
The diff for this file is too large to render.