# Arcade100kTokenizer
Arcade100k is a BPE tokenizer extended from OpenAI’s [`tiktoken.cl100k_base`](https://github.com/openai/tiktoken). It
adds special tokens for code and splits numbers into individual digits.
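```python
from transformers import AutoTokenizer

# trust_remote_code=True allows loading the custom tokenizer code hosted in the repository
tokenizer = AutoTokenizer.from_pretrained("stabilityai/arcade100k", trust_remote_code=True)
# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed
tokenizer("hello, world!", return_tensors="pt")
```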
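To illustrate what individual digit-splitting means, here is a minimal sketch of a pre-tokenization step that isolates every digit before any BPE merges are applied. The pattern `_PIECE` and the helper `split_digits` are hypothetical names made up for this example; this is not Arcade100k's actual pattern or implementation, only a simplified picture of the idea.

```python
import re
from typing import List

# Hypothetical, simplified pattern (an assumption, not the tokenizer's real one):
# - a single digit is always its own piece
# - runs of non-digit, non-whitespace characters stay together
# - runs of whitespace stay together
_PIECE = re.compile(r"\d|[^\d\s]+|\s+")


def split_digits(text: str) -> List[str]:
    """Split text into pieces so that no piece holds more than one digit."""
    return _PIECE.findall(text)


print(split_digits("year 2024"))
```

Because each digit ends up in a separate piece, a number such as `2024` can never be merged into a single multi-digit token in this sketch, and joining the pieces reproduces the input unchanged.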
# Citation
```bibtex
@article{bellagente2024stable,
title={Stable LM 2 1.6B Technical Report},
author={Bellagente, Marco and Tow, Jonathan and Mahan, Dakota and Phung, Duy and Zhuravinskyi, Maksym and Adithyan, Reshinth and Baicoianu, James and Brooks, Ben and Cooper, Nathan and Datta, Ashish and others},
journal={arXiv preprint arXiv:2402.17834},
year={2024}
}
```