Build GitHub


NodeJS implementation of today's most used tokenizers, with a focus on performance and versatility. Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation. If you are interested in the High-level design, you can go check it there. ## Main features - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions). - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU. - Easy to use, but also extremely versatile. - Designed for research and production. - Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token. - Does all the pre-processing: Truncate, Pad, add the special tokens your model needs. ## Installation ```bash npm install tokenizers@latest ``` ## Basic example ```ts import { Tokenizer } from "tokenizers"; const tokenizer = await Tokenizer.fromFile("tokenizer.json"); const wpEncoded = await tokenizer.encode("Who is John?"); console.log(wpEncoded.getLength()); console.log(wpEncoded.getTokens()); console.log(wpEncoded.getIds()); console.log(wpEncoded.getAttentionMask()); console.log(wpEncoded.getOffsets()); console.log(wpEncoded.getOverflowing()); console.log(wpEncoded.getSpecialTokensMask()); console.log(wpEncoded.getTypeIds()); console.log(wpEncoded.getWordIds()); ``` ## License [Apache License 2.0](../../LICENSE)