# Models
## BPE[[tokenizers.models.BPE]]
#### tokenizers.models.BPE[[tokenizers.models.BPE]]
An implementation of the BPE (Byte-Pair Encoding) algorithm
**Parameters:**
vocab (`Dict[str, int]`, *optional*) : A dictionary of string keys and their ids `{"am": 0,...}`
merges (`List[Tuple[str, str]]`, *optional*) : A list of pairs of tokens (`Tuple[str, str]`) `[("a", "b"),...]`
cache_capacity (`int`, *optional*) : The number of words that the BPE cache can contain. The cache speeds up the process by keeping the result of the merge operations for a number of words.
dropout (`float`, *optional*) : A float between 0 and 1 that represents the BPE dropout to use.
unk_token (`str`, *optional*) : The unknown token to be used by the model.
continuing_subword_prefix (`str`, *optional*) : The prefix to attach to subword units that don't represent a beginning of word.
end_of_word_suffix (`str`, *optional*) : The suffix to attach to subword units that represent an end of word.
fuse_unk (`bool`, *optional*) : Whether to fuse any subsequent unknown tokens into a single one
byte_fallback (`bool`, *optional*, defaults to `False`) : Whether to use the SentencePiece byte-fallback trick
ignore_merges (`bool`, *optional*) : Whether or not to match tokens with the vocab before using merges.
Example:
```python
>>> from tokenizers.models import BPE
>>> # Build an empty model (to be trained)
>>> model = BPE(unk_token="<unk>")
>>> # Load from vocabulary and merges files
>>> model = BPE.from_file("vocab.json", "merges.txt")
```
#### from_file[[tokenizers.models.BPE.from_file]]
Instantiate a BPE model from the given files.
This method is roughly equivalent to doing:
```python
vocab, merges = BPE.read_file(vocab_filename, merges_filename)
bpe = BPE(vocab, merges)
```
If you don't need to keep the `vocab, merges` values lying around,
this method is more optimized than manually calling
`read_file()` to initialize a [BPE](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.BPE).
**Parameters:**
vocab (`str`) : The path to a `vocab.json` file
merges (`str`) : The path to a `merges.txt` file
**Returns:**
[BPE](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.BPE)
An instance of BPE loaded from these files
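As an illustration, here is a minimal sketch of loading a BPE model with `from_file` and wrapping it in a `Tokenizer`; the `vocab.json` and `merges.txt` paths are placeholders for files you already have on disk:
```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> # Load the model directly from the standard files (hypothetical paths)
>>> model = BPE.from_file("vocab.json", "merges.txt", unk_token="<unk>")
>>> # Wrap it in a Tokenizer to get the full encode/decode pipeline
>>> tokenizer = Tokenizer(model)
>>> encoding = tokenizer.encode("hello world")
>>> print(encoding.tokens)
```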
#### read_file[[tokenizers.models.BPE.read_file]]
Read a `vocab.json` and a `merges.txt` file
This method provides a way to read and parse the content of these files,
returning the relevant data structures. If you want to instantiate some BPE models
from memory, this method gives you the expected input from the standard files.
**Parameters:**
vocab (`str`) : The path to a `vocab.json` file
merges (`str`) : The path to a `merges.txt` file
**Returns:**
A `Tuple` with the vocab and the merges
The vocabulary and merges loaded into memory
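For example, a minimal sketch of the manual route (assuming `vocab.json` and `merges.txt` exist on disk):
```python
>>> from tokenizers.models import BPE
>>> # Parse the standard files into in-memory structures
>>> vocab, merges = BPE.read_file("vocab.json", "merges.txt")
>>> # vocab is a Dict[str, int], merges a List[Tuple[str, str]]
>>> model = BPE(vocab, merges, unk_token="<unk>")
```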
## Model[[tokenizers.models.Model]]
#### tokenizers.models.Model[[tokenizers.models.Model]]
Base class for all models
The model represents the actual tokenization algorithm. This is the part that
will contain and manage the learned vocabulary.
This class cannot be constructed directly. Please use one of the concrete models.
#### get_trainer[[tokenizers.models.Model.get_trainer]]
Get the associated `Trainer`
Retrieve the `Trainer` associated to this
[Model](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.Model).
**Returns:**
`Trainer`
The Trainer used to train this model
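A minimal sketch of how this might be used, assuming a small in-memory corpus and a fresh BPE model:
```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> tokenizer = Tokenizer(BPE(unk_token="<unk>"))
>>> # Ask the model for its matching trainer (a BpeTrainer in this case)
>>> trainer = tokenizer.model.get_trainer()
>>> corpus = ["hello world", "hello tokenizers"]
>>> tokenizer.train_from_iterator(corpus, trainer=trainer)
```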
#### id_to_token[[tokenizers.models.Model.id_to_token]]
Get the token associated to an ID
**Parameters:**
id (`int`) : An ID to convert to a token
**Returns:**
`str`
The token associated to the ID
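A small sketch, using a toy `WordLevel` vocabulary as the concrete model (the vocabulary content is made up for illustration):
```python
>>> from tokenizers.models import WordLevel
>>> model = WordLevel(vocab={"hello": 0, "world": 1, "<unk>": 2}, unk_token="<unk>")
>>> model.id_to_token(1)
'world'
```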
#### save[[tokenizers.models.Model.save]]
Save the current model
Save the current model in the given folder, using the given prefix for the various
files that will get created.
Any file with the same name that already exists in this folder will be overwritten.
**Parameters:**
folder (`str`) : The path to the target folder in which to save the various files
prefix (`str`, *optional*) : An optional prefix, used to prefix each file name
**Returns:**
`List[str]`
The list of saved files
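For instance, a minimal sketch (the input files, output folder, and prefix are placeholders; for a BPE model the saved files are its vocabulary and merges):
```python
>>> import os
>>> from tokenizers.models import BPE
>>> model = BPE.from_file("vocab.json", "merges.txt")
>>> # Make sure the target folder exists, then save the model files into it
>>> os.makedirs("my-model", exist_ok=True)
>>> saved_files = model.save("my-model", prefix="bpe")
>>> print(saved_files)
```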
#### token_to_id[[tokenizers.models.Model.token_to_id]]
Get the ID associated to a token
**Parameters:**
token (`str`) : A token to convert to an ID
**Returns:**
`int`
The ID associated to the token
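A small sketch of the reverse lookup, reusing the same toy `WordLevel` vocabulary as above:
```python
>>> from tokenizers.models import WordLevel
>>> model = WordLevel(vocab={"hello": 0, "world": 1, "<unk>": 2}, unk_token="<unk>")
>>> # A plain lookup in the learned vocabulary
>>> model.token_to_id("world")
1
```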
#### tokenize[[tokenizers.models.Model.tokenize]]
Tokenize a sequence
**Parameters:**
sequence (`str`) : A sequence to tokenize
**Returns:**
A `List` of `Token`
The generated tokens
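A minimal sketch, using a tiny hand-written `WordPiece` vocabulary to show the returned `Token` objects (only the `value` attribute is inspected here):
```python
>>> from tokenizers.models import WordPiece
>>> vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}
>>> model = WordPiece(vocab, unk_token="[UNK]")
>>> tokens = model.tokenize("unaffable")
>>> [t.value for t in tokens]
['un', '##aff', '##able']
```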
## Unigram[[tokenizers.models.Unigram]]
#### tokenizers.models.Unigram[[tokenizers.models.Unigram]]
An implementation of the Unigram algorithm
The Unigram algorithm is a subword tokenization algorithm based on unigram language
models, as used in SentencePiece. It learns a vocabulary by starting with a large
initial vocabulary and iteratively pruning it using the EM algorithm.
Example:
```python
>>> from tokenizers.models import Unigram
>>> # Build an empty model (to be trained)
>>> model = Unigram()
>>> # Build from a vocabulary list
>>> vocab = [("<unk>", 0.0), ("hello", -1.0), ("world", -1.5)]
>>> model = Unigram(vocab=vocab, unk_id=0)
```
**Parameters:**
vocab (`List[Tuple[str, float]]`, *optional*) : A list of vocabulary items and their log-probability scores, e.g. `[("am", -0.2442), ...]`. If not provided, an empty model is created.
unk_id (`int`, *optional*) : The index of the unknown token in the vocabulary list.
byte_fallback (`bool`, *optional*, defaults to `False`) : Whether to use SentencePiece byte fallback for characters not in the vocabulary.
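Continuing the example above, a small sketch of tokenizing with such a model (the vocabulary and its scores are made up for illustration):
```python
>>> from tokenizers.models import Unigram
>>> vocab = [("<unk>", 0.0), ("hello", -1.0), ("world", -1.5)]
>>> model = Unigram(vocab=vocab, unk_id=0)
>>> # Viterbi segmentation picks the highest-scoring split of the input
>>> [t.value for t in model.tokenize("hello")]
['hello']
```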
## WordLevel[[tokenizers.models.WordLevel]]
#### tokenizers.models.WordLevel[[tokenizers.models.WordLevel]]
An implementation of the WordLevel algorithm
Most simple tokenizer model based on mapping tokens to their corresponding id.
**Parameters:**
vocab (`Dict[str, int]`, *optional*) : A dictionary of string keys and their ids `{"am": 0,...}`
unk_token (`str`, *optional*) : The unknown token to be used by the model.
Example:
```python
>>> from tokenizers.models import WordLevel
>>> # Build from a vocabulary dictionary
>>> vocab = {"hello": 0, "world": 1, "<unk>": 2}
>>> model = WordLevel(vocab=vocab, unk_token="<unk>")
>>> # Load from file
>>> model = WordLevel.from_file("vocab.json", unk_token="<unk>")
```
#### from_file[[tokenizers.models.WordLevel.from_file]]
Instantiate a WordLevel model from the given file
This method is roughly equivalent to doing:
```python
vocab = WordLevel.read_file(vocab_filename)
wordlevel = WordLevel(vocab)
```
If you don't need to keep the `vocab` values lying around, this method is
more optimized than manually calling `read_file()` to
initialize a [WordLevel](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.WordLevel).
**Parameters:**
vocab (`str`) : The path to a `vocab.json` file
unk_token (`str`, *optional*) : The unknown token to be used by the model.
**Returns:**
[WordLevel](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.WordLevel)
An instance of WordLevel loaded from file
#### read_file[[tokenizers.models.WordLevel.read_file]]
Read a `vocab.json`
This method provides a way to read and parse the content of a vocabulary file,
returning the relevant data structures. If you want to instantiate some WordLevel models
from memory, this method gives you the expected input from the standard files.
**Parameters:**
vocab (`str`) : The path to a `vocab.json` file
**Returns:**
`Dict[str, int]`
The vocabulary as a `dict`
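A minimal sketch of the manual route (assuming a `vocab.json` file is available):
```python
>>> from tokenizers.models import WordLevel
>>> # Parse the vocabulary file into a Dict[str, int]
>>> vocab = WordLevel.read_file("vocab.json")
>>> model = WordLevel(vocab, unk_token="<unk>")
```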
## WordPiece[[tokenizers.models.WordPiece]]
#### tokenizers.models.WordPiece[[tokenizers.models.WordPiece]]
An implementation of the WordPiece algorithm
**Parameters:**
vocab (`Dict[str, int]`, *optional*) : A dictionary of string keys and their ids `{"am": 0,...}`
unk_token (`str`, *optional*) : The unknown token to be used by the model.
max_input_chars_per_word (`int`, *optional*) : The maximum number of characters to authorize in a single word.
Example:
```python
>>> from tokenizers.models import WordPiece
>>> # Build an empty model (to be trained)
>>> model = WordPiece(unk_token="[UNK]")
>>> # Load from a vocabulary file
>>> model = WordPiece.from_file("vocab.txt")
```
#### from_file[[tokenizers.models.WordPiece.from_file]]
Instantiate a WordPiece model from the given file
This method is roughly equivalent to doing:
```python
vocab = WordPiece.read_file(vocab_filename)
wordpiece = WordPiece(vocab)
```
If you don't need to keep the `vocab` values lying around, this method is
more optimized than manually calling `read_file()` to
initialize a [WordPiece](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.WordPiece).
**Parameters:**
vocab (`str`) : The path to a `vocab.txt` file
**Returns:**
[WordPiece](/docs/tokenizers/pr_2003/en/api/models#tokenizers.models.WordPiece)
An instance of WordPiece loaded from file
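Since `from_file` also accepts keyword arguments that are forwarded to the constructor (a reasonable reading of the `**kwargs` in its signature), a minimal sketch would be:
```python
>>> from tokenizers.models import WordPiece
>>> # "vocab.txt" is a placeholder path to an existing WordPiece vocabulary
>>> model = WordPiece.from_file("vocab.txt", unk_token="[UNK]", max_input_chars_per_word=100)
```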
#### read_file[[tokenizers.models.WordPiece.read_file]]
Read a `vocab.txt` file
This method provides a way to read and parse the content of a standard *vocab.txt*
file as used by the WordPiece Model, returning the relevant data structures. If you
want to instantiate some WordPiece models from memory, this method gives you the
expected input from the standard files.
**Parameters:**
vocab (`str`) : The path to a `vocab.txt` file
**Returns:**
`Dict[str, int]`
The vocabulary as a `dict`
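As before, a minimal sketch of the manual route (assuming a `vocab.txt` file with one token per line):
```python
>>> from tokenizers.models import WordPiece
>>> # Each line of vocab.txt becomes a token; its line index becomes the id
>>> vocab = WordPiece.read_file("vocab.txt")
>>> model = WordPiece(vocab, unk_token="[UNK]")
```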
The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
The node API has not been documented yet.