---
license: apache-2.0
tags:
- tokenizers
- BPE
- sentencepiece
- code-generation
---

# cofos_tokenizer

Specialized SentencePiece BPE tokenizer for the **cofos** programming and logic
language model.

## Configuration

- Vocabulary size: **16384**
- Model type: BPE
- Byte fallback: enabled
- Digit splitting: enabled (each digit 0-9 is always its own token)
- Whitespace normalization: disabled (`identity` rule), so indentation is preserved

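
The settings above correspond roughly to the SentencePiece trainer options below. This is a sketch, not the script used to train this tokenizer: the corpus path, the helper names `build_trainer_kwargs` and `train`, and the `remove_extra_whitespaces=False` setting are assumptions that this card does not state.

```python
def build_trainer_kwargs(corpus_path, model_prefix="cofos_tokenizer"):
    """Collect SentencePiece trainer options matching the Configuration list."""
    return dict(
        input=corpus_path,                   # hypothetical plain-text corpus
        model_prefix=model_prefix,           # writes <prefix>.model and <prefix>.vocab
        vocab_size=16384,
        model_type="bpe",
        byte_fallback=True,                  # unseen characters decompose into byte pieces
        split_digits=True,                   # every digit becomes its own piece
        normalization_rule_name="identity",  # no Unicode normalization
        remove_extra_whitespaces=False,      # assumption: keeps runs of spaces intact
    )


def train(corpus_path):
    """Train a model from corpus_path (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the options can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path))
```

Calling `train("corpus.txt")` would write `cofos_tokenizer.model` next to the script; the list of atomic tokens is left out here because the card does not give it in full.
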
## Special atomic tokens

Keywords (`def`, `class`, `fn`, `struct`, `impl`, `return`, `async`, …),
operators (`==`, `!=`, `=>`, `->`, `::`, `///`, …) and structural tags
(`<python>`, `<code>`, `<explanation>`, …) are all guaranteed single tokens.

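
One way to spot-check this guarantee is to look each token up in the vocabulary. The helper below is a sketch: the name `missing_atomic_tokens` is hypothetical, and it assumes the atomic tokens are stored verbatim as vocabulary pieces (for example as user-defined symbols), which this card does not spell out.

```python
def missing_atomic_tokens(sp, tokens):
    """Return the tokens that have no vocabulary piece of their own.

    `sp` is a loaded sentencepiece.SentencePieceProcessor. A piece that is
    not in the vocabulary maps to the unknown-token id.
    """
    unk = sp.unk_id()
    return [t for t in tokens if sp.piece_to_id(t) == unk]


# Usage (needs cofos_tokenizer.model on disk):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="cofos_tokenizer.model")
#   print(missing_atomic_tokens(sp, ["def", "==", "<python>"]))
```

An empty list means every token checked has its own vocabulary entry.
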
## Usage

```python
import sentencepiece as spm

# Load the trained tokenizer model from the current directory
sp = spm.SentencePieceProcessor()
sp.Load("cofos_tokenizer.model")

# Show the pieces the tokenizer produces for a small snippet
print(sp.EncodeAsPieces("def hello():\n return 42"))
```

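
Because normalization is disabled, encoding a snippet to ids and decoding it again should return the original text, indentation included. The check below is a sketch with a hypothetical helper name, `roundtrips`; whether it holds for every input depends on training options this card does not list.

```python
def roundtrips(sp, text):
    """Return True if decoding the encoded ids reproduces `text` exactly.

    `sp` is a loaded sentencepiece.SentencePieceProcessor.
    """
    ids = sp.encode(text)  # list of integer token ids
    return sp.decode(ids) == text


# Usage (needs cofos_tokenizer.model on disk):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="cofos_tokenizer.model")
#   print(roundtrips(sp, "def hello():\n    return 42"))
```
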