	# Input Sequences

	<tokenizerslangcontent>
	<python>
	These types represent all the kinds of sequence that can be passed as input to a Tokenizer.
	An input sequence is either a string or a list of strings, depending on the tokenizer's
	operating mode: `raw text` or `pre-tokenized`.

	## TextInputSequence[[tokenizers.TextInputSequence]]

	<code>tokenizers.TextInputSequence</code>

	A `str` that represents a raw-text input sequence.

	## PreTokenizedInputSequence[[tokenizers.PreTokenizedInputSequence]]

	<code>tokenizers.PreTokenizedInputSequence</code>

	A pre-tokenized input sequence. Can be one of:
	- A `List` of `str`
	- A `Tuple` of `str`

	Alias of `Union[List[str], Tuple[str]]`.

	## InputSequence[[tokenizers.InputSequence]]

	<code>tokenizers.InputSequence</code>

	Represents all the possible types of input sequences for encoding. Can be:
	- When `is_pretokenized=False`: [TextInputSequence](#tokenizers.TextInputSequence)
	- When `is_pretokenized=True`: [PreTokenizedInputSequence](#tokenizers.PreTokenizedInputSequence)

	Alias of `Union[str, List[str], Tuple[str]]`.
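
As a rough illustration of how the three aliases relate, the sketch below redefines them locally (an assumption, so that it runs without the `tokenizers` package installed) and adds a hypothetical helper, `check_input_sequence`, which is not part of the library. It only mirrors the rule stated above: a `str` when `is_pretokenized=False`, a list or tuple of `str` when `is_pretokenized=True`.

```python
from typing import List, Tuple, Union

# Local mirrors of the documented aliases. Assumption: redefined here for
# illustration instead of being imported from the tokenizers package.
TextInputSequence = str
PreTokenizedInputSequence = Union[List[str], Tuple[str]]
InputSequence = Union[TextInputSequence, PreTokenizedInputSequence]


def check_input_sequence(sequence: InputSequence, is_pretokenized: bool = False) -> str:
    """Hypothetical helper: name the alias that `sequence` matches, or raise TypeError."""
    if not is_pretokenized:
        # Raw-text mode accepts a single string.
        if isinstance(sequence, str):
            return "TextInputSequence"
        raise TypeError("expected a str when is_pretokenized=False")
    # Pre-tokenized mode accepts a list or tuple whose items are all strings.
    if isinstance(sequence, (list, tuple)) and all(isinstance(w, str) for w in sequence):
        return "PreTokenizedInputSequence"
    raise TypeError("expected a list or tuple of str when is_pretokenized=True")


raw_kind = check_input_sequence("Hello world")
split_kind = check_input_sequence(["Hello", "world"], is_pretokenized=True)
tuple_kind = check_input_sequence(("Hello", "world"), is_pretokenized=True)
```

In the library itself, the same distinction is selected through the `is_pretokenized` argument of `Tokenizer.encode`; the helper above only imitates that check and does not perform any tokenization.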
	</python>
	<rust>
	The Rust API Reference is available directly on the [Docs.rs](https://docs.rs/tokenizers/latest/tokenizers/) website.
	</rust>
	<node>
	The Node API has not been documented yet.
	</node>
	</tokenizerslangcontent>