Merlin
tokenizer
tsuberim committed on
Commit 7aea5a3 · verified · 1 Parent(s): e4b6feb

v0: tokenizer trained on pretraining corpus

Files changed (1)
  1. README.md +16 -62
README.md CHANGED
@@ -1,78 +1,32 @@
  ---
  license: apache-2.0
- language:
- - en
  tags:
- - tokenizer
- - code
- - bpe
  ---

  # merlin-tokenizer-v0

- BPE tokenizer for [Merlin](https://github.com/tsuberim/merlin) a small LM purpose-built for agentic coding on Apple Silicon.

- **v0** — trained on Python, Bash, and Markdown from The Stack dedup. Will be retrained clean (v1) after agentic traces are generated.

- ## Specs
-
- | Property | Value |
- |---|---|
- | Algorithm | BPE (byte-level) |
- | Vocab size | 32,016 |
- | BPE tokens | 32,000 |
- | Special tokens | 16 |
- | Backend | HuggingFace `tokenizers` (Rust) |
- | Compression | ~3.57 chars/token on Python |
-
- ## Special tokens
-
- Legacy tokens (original BPE training, IDs 0–13):
-
- | ID | Token |
- |---|---|
- | 0 | `<\|bos\|>` |
- | 1 | `<\|eos\|>` |
- | 2 | `<\|pad\|>` |
- | 3 | `<\|unk\|>` |
- | 4 | `<\|user\|>` |
- | 5 | `<\|assistant\|>` |
- | 6 | `<\|tool_call\|>` |
- | 7 | `<\|end_tool_call\|>` |
- | 8 | `<\|tool_result\|>` |
- | 9 | `<\|sep\|>` |
- | 10 | `<\|end\|>` |
- | 11 | `<\|python\|>` |
- | 12 | `<\|bash\|>` |
- | 13 | `<\|markdown\|>` |
-
- Agent protocol tokens (patched in, IDs 32000–32015):
-
- | ID | Token | Role |
- |---|---|---|
- | 32000 | `<\|task\|>` | task open |
- | 32001 | `<\|/task\|>` | task close |
- | 32002 | `<\|think\|>` | thinking open |
- | 32003 | `<\|/think\|>` | thinking close |
- | 32004 | `<\|/tool_call\|>` | tool call close |
- | 32005 | `<\|/tool_result\|>` | tool result close |
- | 32006 | `<\|spawn\|>` | spawn agent open |
- | 32007 | `<\|/spawn\|>` | spawn agent close |
- | 32008 | `<\|agent_id\|>` | agent ID open |
- | 32009 | `<\|/agent_id\|>` | agent ID close |
- | 32010 | `<\|wait\|>` | wait open |
- | 32011 | `<\|/wait\|>` | wait close |
- | 32012 | `<\|wait_result\|>` | wait result open |
- | 32013 | `<\|/wait_result\|>` | wait result close |
- | 32014 | `<\|done\|>` | done open |
- | 32015 | `<\|/done\|>` | done close |

  ## Usage

  ```python
  from tokenizers import Tokenizer
-
- tok = Tokenizer.from_pretrained("tsuberim/merlin-tokenizer-v0")
  ids = tok.encode("def hello(): pass").ids
- text = tok.decode(ids, skip_special_tokens=False)
  ```
 
 
 
  ---
  license: apache-2.0
  tags:
+ - merlin
+ - tokenizer
  ---

  # merlin-tokenizer-v0

+ BPE tokenizer for [Merlin](https://github.com/tsuberim/mllm), trained on Python, Bash, Markdown,
+ Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.

+ ## Vocab

+ - 32,016 tokens total: 32,000 BPE + 16 special tokens
+ - Special tokens:
+   - IDs 0–13: legacy slots (unused)
+   - ID 32000: `<|bos|>`
+   - ID 32001: `<|eos|>`
+   - ID 32002–32013: agent protocol tokens (`<|tool_call|>`, `<|/tool_call|>`, etc.)
+   - ID 32014: `<|done|>`
+   - ID 32015: `<|pad|>`

  ## Usage

  ```python
  from tokenizers import Tokenizer
+ tok = Tokenizer.from_file("tokenizer.json")
  ids = tok.encode("def hello(): pass").ids
  ```
+
+ > **Note:** Will be retrained after agentic trace generation. This is v0 — trained on pretraining corpus only.
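
The ID layout listed in the new README's Vocab section can be written down as a small lookup. This is only a sketch built from the numbers stated there. The names `NAMED_SPECIALS` and `describe_id` are hypothetical and not part of the Merlin repo, and the individual agent protocol tokens in 32002–32013 are left unnamed because the README does not map them to specific IDs.

```python
# Illustrative sketch only: constants come from the README text,
# helper names are hypothetical and not part of the Merlin codebase.
BPE_TOKENS = 32_000
SPECIAL_TOKENS = 16
VOCAB_SIZE = BPE_TOKENS + SPECIAL_TOKENS  # 32,016 tokens total

# Special tokens whose IDs the README states explicitly.
NAMED_SPECIALS = {
    32000: "<|bos|>",
    32001: "<|eos|>",
    32014: "<|done|>",
    32015: "<|pad|>",
}


def describe_id(token_id: int) -> str:
    """Classify a token ID according to the layout described in the README."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside the vocabulary")
    if token_id in NAMED_SPECIALS:
        return NAMED_SPECIALS[token_id]
    if 32002 <= token_id <= 32013:
        return "agent protocol token"
    if token_id <= 13:
        return "legacy slot (unused)"
    return "BPE token"
```

A lookup like this could serve as a sanity check against a loaded `tokenizer.json`, for example by comparing `describe_id(32000)` with what the tokenizer itself reports for that ID.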