OLMO_Smiles_aware_tokenizer / tokenizer_config.json

Add SPE tokenizer (1125 SMILES subword tokens) + <|start_of_smiles|>/<|end_of_smiles|> special tokens. Trained on 2M ZINC20 + 2M ChEMBL canonical SMILES. SPE vocab_size=1000, min_freq=4000.

f6eb1d8 verified 10 days ago

{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": null,
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "extra_special_tokens": [
    "<|start_of_smiles|>",
    "<|end_of_smiles|>"
  ],
  "is_local": false,
  "local_files_only": false,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<|padding|>",
  "tokenizer_class": "GPTNeoXTokenizer",
  "trim_offsets": true,
  "unk_token": null
}
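
The following is a minimal sketch of how the fields above might be read and used, relying only on Python's standard `json` module. It embeds a subset of the config as a string rather than reading the file from disk. The helper `wrap_smiles` is hypothetical, and placing the delimiters directly around the SMILES string with no whitespace is an assumption; the config itself does not say how the tokens are meant to be positioned. The value of `model_max_length` equals `int(1e30)`, which is commonly used as a sentinel for "no explicit limit" rather than a real context length.

```python
import json

# Subset of the tokenizer_config.json shown above, embedded for illustration.
CONFIG_TEXT = """
{
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": ["<|start_of_smiles|>", "<|end_of_smiles|>"],
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<|padding|>"
}
"""

config = json.loads(CONFIG_TEXT)

# The two extra special tokens, in the order they appear in the config.
smiles_start, smiles_end = config["extra_special_tokens"]


def wrap_smiles(smiles: str) -> str:
    """Hypothetical helper: surround a SMILES string with the delimiter tokens.

    Assumption: no whitespace between the delimiters and the SMILES string.
    """
    return f"{smiles_start}{smiles}{smiles_end}"


# int(1e30) acts as a "no explicit limit" sentinel, not a usable length.
has_length_limit = config["model_max_length"] != int(1e30)

wrapped = wrap_smiles("CCO")  # ethanol
print(wrapped)
print("explicit max length set:", has_length_limit)
```

Because no real length limit is recorded here, any truncation length would have to come from the model's own configuration or be passed explicitly at tokenization time.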