---
tags:
- tokie
library_name: tokie
---

<p align="center">
  <img src="tokie-banner.png" alt="tokie" width="600">
</p>
|
|
# roberta-base
|
|
Pre-built [tokie](https://github.com/chonkie-inc/tokie) tokenizer for [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base).
|
|
## Quick Start (Python)

```bash
pip install tokie
```

```python
import tokie

tokenizer = tokie.Tokenizer.from_pretrained("tokiers/roberta-base")
encoding = tokenizer.encode("Hello, world!")
print(encoding.ids)
print(encoding.attention_mask)
```
|
|
## Quick Start (Rust)

```toml
[dependencies]
tokie = { version = "0.0.4", features = ["hf"] }
```

```rust
use tokie::Tokenizer;

fn main() {
    let tokenizer = Tokenizer::from_pretrained("tokiers/roberta-base").unwrap();
    let encoding = tokenizer.encode("Hello, world!", true);
    println!("{:?}", encoding.ids);
}
```
|
|
## Files

- `tokenizer.tkz` — tokie binary format (~10x smaller than the JSON, loads in ~5ms)
- `tokenizer.json` — the original HuggingFace tokenizer file
|
|
## About tokie

**50x faster tokenization, 10x smaller model files, 100% accurate.**

tokie is a drop-in replacement for HuggingFace tokenizers, built in Rust. See [GitHub](https://github.com/chonkie-inc/tokie) for benchmarks and documentation.
|
|
## License

MIT OR Apache-2.0 (tokie library). The model files retain the license of the original [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) repository.
|
|