Johnblick187 committed on
Commit
1603b99
·
verified ·
1 Parent(s): b0bca5a

Update README.md

Files changed (1)
  1. README.md +0 -68
README.md CHANGED
@@ -1,68 +0,0 @@
- ---
- library_name: transformers
- tags:
- - tokenizers
- - sglang
- license: other
- license_name: grok-2
- license_link: https://huggingface.co/xai-org/grok-2/blob/main/LICENSE
- ---
-
- # Grok-2 Tokenizer
-
- A 🤗-compatible version of the **Grok-2 tokenizer** (adapted from [xai-org/grok-2](https://huggingface.co/xai-org/grok-2)).
-
- This means it can be used with Hugging Face libraries including [Transformers](https://github.com/huggingface/transformers),
- [Tokenizers](https://github.com/huggingface/tokenizers), and [Transformers.js](https://github.com/xenova/transformers.js).
-
- ## Motivation
-
- Grok 2.5, a.k.a. [xai-org/grok-2](https://github.com/xai-org/grok-2), was recently released on the 🤗 Hub with native [SGLang](https://github.com/sgl-project/sglang)
- support. However, the checkpoints on the Hub don't come with a Hugging Face compatible tokenizer, but rather with a `tiktoken`-based
- JSON export, which is [internally read and patched in SGLang](https://github.com/sgl-project/sglang/blob/fd71b11b1d96d385b09cb79c91a36f1f01293639/python/sglang/srt/tokenizer/tiktoken_tokenizer.py#L29-L108).
-
- This repository contains the Hugging Face compatible export so that users can easily interact and play around with the Grok-2 tokenizer.
- It also allows using the tokenizer via SGLang without having to pull the repository manually from the Hub, mount it, and point directly
- to the local tokenizer path, so that Grok-2 can be deployed as:
-
- ```bash
- python3 -m sglang.launch_server --model-path xai-org/grok-2 --tokenizer-path alvarobartt/grok-2-tokenizer --tp-size 8 --quantization fp8 --attention-backend triton
- ```
-
- Rather than the former 2-step process:
-
- ```bash
- hf download xai-org/grok-2 --local-dir /local/grok-2
-
- python3 -m sglang.launch_server --model-path /local/grok-2 --tokenizer-path /local/grok-2/tokenizer.tok.json --tp-size 8 --quantization fp8 --attention-backend triton
- ```
-
- ## Example
-
- ```py
- from transformers import AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained("alvarobartt/grok-2-tokenizer")
-
- assert tokenizer.encode("Human: What is Deep Learning?<|separator|>\n\n") == [
-     35406,
-     186,
-     2171,
-     458,
-     17454,
-     14803,
-     191,
-     1,
-     417,
- ]
-
- assert (
-     tokenizer.apply_chat_template(
-         [{"role": "user", "content": "What is the capital of France?"}], tokenize=False
-     )
-     == "Human: What is the capital of France?<|separator|>\n\n"
- )
- ```
-
- > [!NOTE]
- > This repository has been inspired by earlier similar work by [Xenova](https://huggingface.co/Xenova) in [`Xenova/grok-1-tokenizer`](https://huggingface.co/Xenova/grok-1-tokenizer).
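
The chat-template assertion in the removed README's Example section implies a simple prompt layout for a single user message: the prefix `Human: `, the message content, the `<|separator|>` token, and two newlines. The following is a minimal sketch of that layout in plain Python, useful for checking the expected string without downloading the tokenizer. The helper names `render_user_turn` and `render_chat` are hypothetical and not part of any library, and the sketch only covers the single-user-message case confirmed by the README; it is not the tokenizer's actual chat template, which may handle other roles or multi-turn conversations differently.

```python
from typing import Dict, List

# Separator token exactly as it appears in the README's expected output.
SEPARATOR = "<|separator|>"


def render_user_turn(content: str) -> str:
    """Format one user message the way the README's assertion expects."""
    return f"Human: {content}{SEPARATOR}\n\n"


def render_chat(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: render a conversation with exactly one user message.

    Other roles and multi-turn inputs are rejected, because the README
    does not show how the real chat template formats them.
    """
    if len(messages) != 1 or messages[0].get("role") != "user":
        raise ValueError("this sketch only covers a single user message")
    return render_user_turn(messages[0]["content"])


if __name__ == "__main__":
    prompt = render_chat(
        [{"role": "user", "content": "What is the capital of France?"}]
    )
    print(repr(prompt))
```

Comparing the output of `render_chat` with `tokenizer.apply_chat_template(..., tokenize=False)` for the same single message would be one way to confirm that the sketch still matches the published template.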