ryanscottbarrett committed
Commit 6677fd8 · verified · 1 parent: 132cf0a

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +122 -0
  2. config.json +9 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.model +3 -0
README.md ADDED
@@ -0,0 +1,122 @@
# braille256-v5: Multimodal Universal Braille Model

The first language model trained on **multimodal data encoded as 8-dot Braille**.

## Key Innovation

This model demonstrates that **any data type** (text, images, audio, binary) can be encoded into 8-dot Braille Unicode (U+2800-U+28FF) and processed by a single unified model.

```python
def byte_to_braille(byte: int) -> str:
    """Direct 1:1 mapping: 256 bytes → 256 Braille patterns."""
    return chr(0x2800 + byte)
```
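Because the mapping is a bijection on 0–255, decoding is just the inverse offset. A minimal round-trip sketch; `braille_to_byte` is an illustrative helper name, not part of the released code:

```python
def byte_to_braille(byte: int) -> str:
    # Offset into the Unicode Braille Patterns block: U+2800 + byte value
    return chr(0x2800 + byte)

def braille_to_byte(ch: str) -> int:
    # Inverse mapping: subtract the block base
    return ord(ch) - 0x2800

# Round-trip arbitrary bytes, e.g. a UTF-8 string
data = "hello".encode("utf-8")
encoded = ''.join(byte_to_braille(b) for b in data)
decoded = bytes(braille_to_byte(c) for c in encoded)
assert decoded == data
```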
## Model Details

| Property | Value |
|----------|-------|
| Parameters | 11.5M |
| Architecture | Transformer (4 layers, 4 heads) |
| Hidden Size | 256 |
| Vocabulary | 32,000 (SentencePiece Unigram) |
| Context Length | 512 tokens |
| Training Steps | 5,000 |
| Final Loss | 2.87 |

## Training Data

- **Text**: 218 files from Project Gutenberg (7 languages) encoded as 8-dot Braille
- **Audio**: Synthetic WAV files encoded as Braille
- **Total**: 2M tokens from the multimodal corpus

## Compression

| Tokenizer | Vocab | Chars/Token |
|-----------|-------|-------------|
| braille256-v4 (8k) | 8,192 | 2.24 |
| **braille256-v5 (32k)** | 32,000 | **2.45** |
| GPT-4 (reference) | 100,000 | 4.31 |

## Universal Encoding

The 8-dot Braille encoding enables:

1. **Text** → UTF-8 bytes → Braille
2. **Images** → Raw bytes → Braille
3. **Audio** → WAV bytes → Braille
4. **Any file** → Bytes → Braille

### Modality Headers

```
⣿⠁ = TEXT
⣿⠃ = IMAGE
⣿⠇ = AUDIO
⣿⠏ = BINARY
```
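Combining these headers with the byte mapping gives a tagged stream. A sketch, assuming the two-character header is simply prepended to the Braille payload (the exact framing used during training is not documented here, and `encode_with_header` is an illustrative name):

```python
# Modality headers from the table above, as Unicode escapes
MODALITY_HEADERS = {
    "TEXT": "\u28FF\u2801",    # ⣿⠁
    "IMAGE": "\u28FF\u2803",   # ⣿⠃
    "AUDIO": "\u28FF\u2807",   # ⣿⠇
    "BINARY": "\u28FF\u280F",  # ⣿⠏
}

def encode_with_header(data: bytes, modality: str) -> str:
    # Prepend the two-character modality header to the Braille payload
    payload = ''.join(chr(0x2800 + b) for b in data)
    return MODALITY_HEADERS[modality] + payload

tagged = encode_with_header(b"hi", "TEXT")
print(tagged)
```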
## Usage

```python
import json

import torch
import sentencepiece as spm

from train_multimodal_v5 import Braille256MultimodalModel, MultimodalConfig

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Load model
with open("config.json") as f:
    config = MultimodalConfig.from_dict(json.load(f))

model = Braille256MultimodalModel(config)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# Encode any data as Braille
def bytes_to_braille(data: bytes) -> str:
    return ''.join(chr(0x2800 + b) for b in data)

# Generate from a Braille prompt
braille_text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅"  # "the quick" in Braille
tokens = sp.encode(braille_text)
input_ids = torch.tensor([tokens])
output = model.generate(input_ids, max_length=50)
generated = sp.decode(output[0].tolist())
print(generated)
```

## Model Family

| Version | Type | Patterns | Focus |
|---------|------|----------|-------|
| v3 | 6-dot | 64 | Literary Braille |
| v4 | 8-dot | 256 | Computer Braille |
| **v5** | 8-dot | 256 | **Multimodal** |

## Why 8-dot Braille?

- **256 patterns** = exactly 1 byte
- **Universal encoding**: any data → Braille
- **Tactile AI**: blind users can "feel" any data
- **Cross-modal learning**: a single representation for all modalities
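The 256-pattern claim is easy to check mechanically; a quick sketch verifying that the byte-to-pattern mapping covers the whole Unicode Braille Patterns block exactly once:

```python
# Every byte value maps to a distinct pattern in U+2800–U+28FF,
# and subtracting the block base inverts the mapping cleanly.
patterns = [chr(0x2800 + b) for b in range(256)]
assert len(set(patterns)) == 256
assert all(0x2800 <= ord(p) <= 0x28FF for p in patterns)
assert [ord(p) - 0x2800 for p in patterns] == list(range(256))
```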
## Citation

```bibtex
@misc{braille256v5,
  title={braille256-v5: Multimodal Universal Braille Model},
  author={Barrett, Ryan},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/ryanscottbarrett/braille256-v5}
}
```

## License

MIT
config.json ADDED
@@ -0,0 +1,9 @@
{
  "vocab_size": 32000,
  "hidden_size": 256,
  "num_layers": 4,
  "num_heads": 4,
  "intermediate_size": 1024,
  "max_position_embeddings": 512,
  "dropout": 0.1
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a2148874e22f8a0ce4c7ffb4a75bdc5c789440570ea6c3898e72ee334871dd00
size 45954387
tokenizer.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ec5b8b6fbd8985a97c74d377a83f58ff59f3860d02a343eb15146da467da40ae
size 1155082