ameforge committed on
Commit 1b00789 · verified · 1 Parent(s): 32b08af

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +33 -0
README.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ license: apache-2.0
+ tags:
+ - tokenizers
+ - BPE
+ - sentencepiece
+ - code-generation
+ ---
+
+ # cofos_tokenizer
+
+ Specialized SentencePiece BPE tokenizer for the **cofos** programming and logic
+ language model.
+
+ ## Configuration
+ - Vocabulary size: **16384**
+ - Model type: BPE
+ - Byte fallback: enabled
+ - Digit splitting: enabled (digits 0-9 are guaranteed atomic)
+ - Whitespace normalization: disabled (`identity` rule), so indentation is preserved
+
+ ## Special atomic tokens
+ Keywords (`def`, `class`, `fn`, `struct`, `impl`, `return`, `async`, …),
+ operators (`==`, `!=`, `=>`, `->`, `::`, `///`, …) and structural tags
+ (`<python>`, `<code>`, `<explanation>`, …) are all guaranteed single tokens.
+
+ ## Usage
+ ```python
+ import sentencepiece as spm
+ sp = spm.SentencePieceProcessor()
+ sp.Load("cofos_tokenizer.model")
+ print(sp.EncodeAsPieces("def hello():\n return 42"))
+ ```
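
The settings listed under Configuration and Special atomic tokens in the README above correspond to standard SentencePiece trainer flags. The README does not include the training command, so the following is only a sketch of how those settings might be expressed. The `TRAINER_OPTIONS` dictionary and the `to_flag_string` helper are hypothetical names introduced here, and the symbol list contains only the entries the README names (its lists are truncated with "…").

```python
# Hypothetical reconstruction: the README states these settings but not the
# command used to train the model, so this mapping is an assumption.
TRAINER_OPTIONS = {
    "model_type": "bpe",                    # "Model type: BPE"
    "vocab_size": 16384,                    # "Vocabulary size: 16384"
    "byte_fallback": True,                  # "Byte fallback: enabled"
    "split_digits": True,                   # "Digit splitting: enabled"
    "normalization_rule_name": "identity",  # "identity" normalization rule
    # Only the symbols the README spells out; the real list is longer.
    "user_defined_symbols": [
        "def", "class", "fn", "struct", "impl", "return", "async",
        "==", "!=", "=>", "->", "::", "///",
        "<python>", "<code>", "<explanation>",
    ],
}


def to_flag_string(options):
    """Render an options dict as a SentencePiece-style flag string."""
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Check bool before anything else: bool is a subclass of int.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # List-valued flags are passed as comma-separated values.
            value = ",".join(value)
        parts.append(f"--{name}={value}")
    return " ".join(parts)


flags = to_flag_string(TRAINER_OPTIONS)
print(flags)
```

With the `sentencepiece` package installed, a string like this could be passed to `spm.SentencePieceTrainer.train(flags + " --input=corpus.txt --model_prefix=cofos_tokenizer")`, where `corpus.txt` is a placeholder for a training corpus that the README does not describe.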