ryanscottbarrett committed on
Commit 9320f42 · verified · 1 Parent(s): d7e5ae0

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +102 -0
  2. config.json +16 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.model +3 -0
  5. tokenizer.vocab +0 -0
README.md ADDED
@@ -0,0 +1,102 @@
---
license: mit
language:
- en
- fr
- de
- es
- it
- pt
- nl
tags:
- braille
- accessibility
- language-model
- grade-infinity
- sentencepiece
- unigram
datasets:
- project-gutenberg
pipeline_tag: text-generation
---

# Braille256-v3: Grade Infinity Universal Braille Model

A **27.8M parameter** language model trained natively on Braille Unicode, using **SentencePiece Unigram** tokenization for improved compression.

## Key Features

- **SentencePiece Unigram**: likelihood-optimized tokenization (often outperforms BPE on compression)
- **4096-Token Vocabulary**: learned contractions across 7 languages
- **Multilingual**: English, French, German, Spanish, Italian, Portuguese, Dutch
- **Cross-Linguistic Patterns**: discovers Braille compressions shared across languages

### Learned Contractions

| Token ID | Braille | Languages |
|----------|---------|-----------|
| 15 | ⠞⠓⠑ (the) | English |
| 22 | ⠟⠥⠑ (que) | Spanish/French/Portuguese |
| 23 | ⠁⠝⠙ (and) | English |
| 25 | ⠕⠋ (of) | English |
| 17 | ⠙⠑ (de) | Spanish/French/Italian/Portuguese |
| 18 | ⠇⠁ (la) | Spanish/French/Italian |

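The "likelihood-optimized" property behind contractions like these can be illustrated with a toy segmenter (not the shipped tokenizer; the vocabulary and probabilities below are invented for the example): a Unigram model considers every way to split the input into vocabulary pieces and keeps the segmentation with the highest total log-probability, which is what lets multi-cell contractions win over character-by-character splits.

```python
import math

# Toy Unigram segmentation via Viterbi search. Vocabulary and piece
# probabilities are made up; the real model learns them from the corpus.
vocab = {
    "⠞⠓⠑": 0.05, "⠁⠝⠙": 0.04,                     # multi-cell contractions
    "⠞": 0.08, "⠓": 0.06, "⠑": 0.07, "⠁": 0.05, "⠝": 0.04, "⠙": 0.05,
}

def segment(text, vocab):
    # best[i] = highest log-probability of any segmentation of text[:i]
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in vocab:
                score = best[j] + math.log(vocab[piece])
                if score > best[i]:
                    best[i], back[i] = score, j
    # Walk the backpointers to recover the winning segmentation
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

print(segment("⠞⠓⠑⠁⠝⠙", vocab))  # → ['⠞⠓⠑', '⠁⠝⠙'] — contractions beat single cells
```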
### Training Details

| Metric | Value |
|--------|-------|
| Parameters | 27.8M |
| Vocabulary | 4096 (Unigram) |
| Training Steps | 15,000 |
| Final Loss | 2.17 |
| Training Time | 3h 22m (Apple MPS) |
| Corpus | 32M Braille chars (7 languages) |

### Architecture

```
Hidden Size: 512
Layers: 8
Attention Heads: 8
Max Sequence Length: 1024
Tokenizer: SentencePiece Unigram
```
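The 27.8M figure is consistent with these hyperparameters under a standard GPT-style block layout with learned positional embeddings and tied input/output embeddings (an assumption; the card does not spell out the exact layer layout):

```python
# Back-of-envelope parameter count for the listed architecture.
# Assumes tied embeddings and a conventional attention + MLP block;
# the actual Braille256 layout may differ slightly.
V, H, L, I, S = 4096, 512, 8, 2048, 1024  # vocab, hidden, layers, FFN, seq len

tok_emb = V * H                       # token embeddings (tied with LM head)
pos_emb = S * H                       # learned positional embeddings
attn    = 4 * (H * H + H)             # Q, K, V, O projections with biases
mlp     = (H * I + I) + (I * H + H)   # up- and down-projections with biases
norms   = 2 * 2 * H                   # two LayerNorms (weight + bias) per block

total = tok_emb + pos_emb + L * (attn + mlp + norms) + 2 * H  # + final LayerNorm
print(f"{total / 1e6:.1f}M parameters")  # → 27.8M parameters
```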

## Usage

```python
from braille_unigram_model import Braille256UnigramModel, BrailleUnigramTokenizer

model = Braille256UnigramModel.from_pretrained("ryanscottbarrett/braille256-v3")
tokenizer = BrailleUnigramTokenizer.from_pretrained("ryanscottbarrett/braille256-v3/tokenizer")

# Encode Braille text
text = "⠞⠓⠑⠀⠟⠥⠊⠉⠅⠀⠃⠗⠕⠺⠝⠀⠋⠕⠭"
tokens = tokenizer.encode(text)
print(f"Compression: {len(text)}/{len(tokens)} = {len(text)/len(tokens):.2f}x")
```

## Research Goals

This model is part of the **Grade Infinity Braille** research project:

1. Can neural networks discover universal Braille contractions?
2. Do cross-linguistic patterns emerge from multilingual training?
3. Is SentencePiece Unigram superior to BPE for Braille?

## Citation

```bibtex
@misc{braille256v3,
  author = {Ryan Barrett},
  title = {Braille256-v3: Grade Infinity Universal Braille Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/ryanscottbarrett/braille256-v3}
}
```

## License

MIT
config.json ADDED
@@ -0,0 +1,16 @@
{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "eos_token_id": 3,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-06,
  "max_position_embeddings": 1024,
  "model_type": "braille256_unigram",
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "pad_token_id": 0,
  "transformers_version": "4.57.1",
  "vocab_size": 4096
}
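A quick sanity check one can run against this config (the JSON fragment below mirrors the relevant fields verbatim): `hidden_size` must divide evenly by `num_attention_heads`, and `intermediate_size` follows the conventional 4× expansion.

```python
import json

# Relevant fields copied from config.json above
config = json.loads("""{
  "hidden_size": 512,
  "intermediate_size": 2048,
  "max_position_embeddings": 1024,
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "vocab_size": 4096
}""")

# Attention requires the hidden size to split evenly across heads
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_dim = config["hidden_size"] // config["num_attention_heads"]
print(f"per-head dimension: {head_dim}")  # → per-head dimension: 64

# Standard 4x FFN expansion, matching the Architecture section
assert config["intermediate_size"] == 4 * config["hidden_size"]
```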
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ba4301c0fb126d5b6559e428fb23c68830cfc8a94781018fc7f2500e3e0953c6
size 111412467
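The file above is a Git LFS pointer, not the weights themselves: three `key value` lines that git-lfs resolves to the real ~111 MB blob at checkout. A minimal sketch of reading one (the field names come from the LFS pointer spec; the parsing helper itself is ad hoc):

```python
# Parse a Git LFS pointer file: one "key value" pair per line.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ba4301c0fb126d5b6559e428fb23c68830cfc8a94781018fc7f2500e3e0953c6
size 111412467"""

fields = dict(line.split(" ", 1) for line in pointer.splitlines())
algo, digest = fields["oid"].split(":", 1)   # e.g. "sha256", hex digest
size_mb = int(fields["size"]) / 1e6

print(f"{algo} digest {digest[:8]}..., {size_mb:.1f} MB")
```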
tokenizer.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:45dbe04483f0b4ca187663baa07a7cd157ccb7b10ef460cd98b7a5d2c0ed67b2
size 338049
tokenizer.vocab ADDED
The diff for this file is too large to render.