---
task: token-classification  # task type (the widget treats tokenization as token classification)
widget:
- text: "Hello, this is a test of my BPE tokenizer!"  # default test text
  example_title: "Basic tokenization example"
- text: "Natural language processing is fun and useful."  # second example
  example_title: "NLP-related text"
inference:
  parameters:
    add_special_tokens: false  # optional: whether to add special tokens such as [CLS]
---

# My BPE Tokenizer

This is a custom tokenizer trained with the BPE algorithm; it supports tokenizing English text.

## Try it: enter text to see the tokenization result

Use the inference widget on this page (configured by the metadata above) to test the tokenizer interactively.

## Tokenization behavior

- Continuous text is split into subwords (e.g. "tokenizer" → ["token", "izer"])
- Punctuation and spaces are recognized (e.g. "," and "!" are split off as separate tokens)

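For reference, a tokenizer with this behavior can be trained in a few lines with the `tokenizers` library. This is a minimal sketch, not the exact script used to train this model; the in-memory corpus and vocabulary size are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Empty BPE model with an unknown-token fallback
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation before applying BPE merges
tokenizer.pre_tokenizer = Whitespace()

# Train on a tiny in-memory corpus (placeholder for a real training set)
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    [
        "Hello, this is a test of my BPE tokenizer!",
        "Natural language processing is fun and useful.",
    ],
    trainer,
)

print(tokenizer.encode("Hello, this is a test!").tokens)
```

With a corpus this small, most words end up fully merged into single tokens; on a real corpus, rarer words stay split into subwords as described above.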
## Usage (calling from code)

```python
from tokenizers import Tokenizer

# Load the tokenizer from the Hugging Face Hub
# (replace "your-username" with your actual Hub username)
tokenizer = Tokenizer.from_pretrained("your-username/my-Tokenizer")

# Tokenization example
text = "Hello, world!"
output = tokenizer.encode(text)
print("Tokens:", output.tokens)  # the list of subword tokens
```