ryanscottbarrett committed on
Commit 9320f42 · verified · 1 Parent(s): d7e5ae0

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +102 -0
  2. config.json +16 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.model +3 -0
  5. tokenizer.vocab +0 -0
README.md ADDED
@@ -0,0 +1,102 @@
---
license: mit
language:
- en
- fr
- de
- es
- it
- pt
- nl
tags:
- braille
- accessibility
- language-model
- grade-infinity
- sentencepiece
- unigram
datasets:
- project-gutenberg
pipeline_tag: text-generation
---

# Braille256-v3: Grade Infinity Universal Braille Model

A **27.8M parameter** language model trained natively on Braille Unicode, using **SentencePiece Unigram** tokenization for improved compression.

## Key Features

- **SentencePiece Unigram**: likelihood-optimized tokenization (often outperforms BPE on compression)
- **4096-Token Vocabulary**: learned contractions across 7 languages
- **Multilingual**: English, French, German, Spanish, Italian, Portuguese, Dutch
- **Cross-Linguistic Patterns**: discovers Braille compressions shared across languages

### Learned Contractions

| Token ID | Braille | Languages |
|----------|---------|-----------|
| 15 | ⠞⠓⠑ (the) | English |
| 22 | ⠟⠥⠑ (que) | Spanish/French/Portuguese |
| 23 | ⠁⠝⠙ (and) | English |
| 25 | ⠕⠋ (of) | English |
| 17 | ⠙⠑ (de) | Spanish/French/Italian/Portuguese |
| 18 | ⠇⠁ (la) | Spanish/French/Italian |

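The "likelihood-optimized" property behind contractions like these can be illustrated with a toy segmenter (not the shipped tokenizer; the vocabulary and probabilities below are invented for the example): a Unigram model considers every way to split the input into vocabulary pieces and keeps the segmentation with the highest total log-probability, which is what lets multi-cell contractions win over character-by-character splits.

```python
import math

# Toy Unigram segmentation via Viterbi search. Vocabulary and piece
# probabilities are made up; the real model learns them from the corpus.
vocab = {
    "⠞⠓⠑": 0.05, "⠁⠝⠙": 0.04,                     # multi-cell contractions
    "⠞": 0.08, "⠓": 0.06, "⠑": 0.07, "⠁": 0.05, "⠝": 0.04, "⠙": 0.05,
}

def segment(text, vocab):
    # best[i] = highest log-probability of any segmentation of text[:i]
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in vocab:
                score = best[j] + math.log(vocab[piece])
                if score > best[i]:
                    best[i], back[i] = score, j
    # Walk the backpointers to recover the winning segmentation
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(segment("⠞⠓⠑⠁⠝⠙", vocab))  # → ['⠞⠓⠑', '⠁⠝⠙'] — contractions beat single cells
```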
### Training Details

| Metric | Value |
|--------|-------|
| Parameters | 27.8M |
| Vocabulary | 4096 (Unigram) |
| Training Steps | 15,000 |
| Final Loss | 2.17 |
| Training Time | 3h 22m (Apple MPS) |
| Corpus | 32M Braille chars (7 languages) |

### Architecture

```
Hidden Size: 512
Layers: 8
Attention Heads: 8
Max Sequence Length: 1024
Tokenizer: SentencePiece Unigram
```
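The 27.8M figure is consistent with these hyperparameters under a standard GPT-style block layout with learned positional embeddings and tied input/output embeddings (an assumption; the card does not spell out the exact layer layout):

```python
# Back-of-envelope parameter count for the listed architecture.
# Assumes tied embeddings and a conventional attention + MLP block;
# the actual Braille256 layout may differ slightly.
V, H, L, I, S = 4096, 512, 8, 2048, 1024  # vocab, hidden, layers, FFN, seq len

tok_emb = V * H                       # token embeddings (tied with LM head)
pos_emb = S * H                       # learned positional embeddings
attn    = 4 * (H * H + H)             # Q, K, V, O projections with biases
mlp     = (H * I + I) + (I * H + H)   # up- and down-projections with biases
norms   = 2 * 2 * H                   # two LayerNorms (weight + bias) per block

total = tok_emb + pos_emb + L * (attn + mlp + norms) + 2 * H  # + final LayerNorm
print(f"{total / 1e6:.1f}M parameters")  # → 27.8M parameters
```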

## Usage

```python
from braille_unigram_model import Braille256UnigramModel, BrailleUnigramTokenizer

model = Braille256UnigramModel.from_pretrained("ryanscottbarrett/braille256-v3")
tokenizer = BrailleUnigramTokenizer.from_pretrained("ryanscottbarrett/braille256-v3/tokenizer")

# Encode Braille text
text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅⠀⠃⠗⠕⠺⠝⠀⠋⠕⠭"
tokens = tokenizer.encode(text)
print(f"Compression: {len(text)}/{len(tokens)} = {len(text)/len(tokens):.2f}x")
```

## Research Goals

This model is part of the **Grade Infinity Braille** research project:

1. Can neural networks discover universal Braille contractions?
2. Do cross-linguistic patterns emerge from multilingual training?
3. Is SentencePiece Unigram superior to BPE for Braille?

## Citation

```bibtex
@misc{braille256v3,
  author = {Ryan Barrett},
  title = {Braille256-v3: Grade Infinity Universal Braille Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ryanscottbarrett/braille256-v3}
}
```

## License

MIT
config.json ADDED
@@ -0,0 +1,16 @@
{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-06,
  "max_position_embeddings": 1024,
  "model_type": "braille256_unigram",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "pad_token_id": 0,
  "transformers_version": "4.57.1",
  "vocab_size": 4096
}
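A quick sanity check one can run against this config (the JSON fragment below mirrors the relevant fields verbatim): `hidden_size` must divide evenly by `num_attention_heads`, and `intermediate_size` follows the conventional 4× expansion.

```python
import json

# Relevant fields copied from config.json above
config = json.loads("""{
  "hidden_size": 512,
  "intermediate_size": 2048,
  "max_position_embeddings": 1024,
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "vocab_size": 4096
}""")

# Attention requires the hidden size to split evenly across heads
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_dim = config["hidden_size"] // config["num_attention_heads"]
print(f"per-head dimension: {head_dim}")  # → per-head dimension: 64

# Standard 4x FFN expansion, matching the Architecture section
assert config["intermediate_size"] == 4 * config["hidden_size"]
```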
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ba4301c0fb126d5b6559e428fb23c68830cfc8a94781018fc7f2500e3e0953c6
size 111412467
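The file above is a Git LFS pointer, not the weights themselves: three `key value` lines that git-lfs resolves to the real ~111 MB blob at checkout. A minimal sketch of reading one (the field names come from the LFS pointer spec; the parsing helper itself is ad hoc):

```python
# Parse a Git LFS pointer file: one "key value" pair per line.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ba4301c0fb126d5b6559e428fb23c68830cfc8a94781018fc7f2500e3e0953c6
size 111412467"""

fields = dict(line.split(" ", 1) for line in pointer.splitlines())
algo, digest = fields["oid"].split(":", 1)   # e.g. "sha256", hex digest
size_mb = int(fields["size"]) / 1e6

print(f"{algo} digest {digest[:8]}..., {size_mb:.1f} MB")
```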
tokenizer.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45dbe04483f0b4ca187663baa07a7cd157ccb7b10ef460cd98b7a5d2c0ed67b2
size 338049
tokenizer.vocab ADDED
The diff for this file is too large to render.