Nj-1111
/

Copernicus-Tokenizer

Model card Files Files and versions

Copernicus-Tokenizer / README.md

Nj-1111's picture

Upload folder using huggingface_hub

6cba08c verified 5 days ago

|

history blame contribute delete

1.67 kB

	---
	license: apache-2.0
	tags:
	- tokenizer
	- bpe
	- nlp
	- llm
	library_name: transformers
	---

	# Copernicus Tokenizer

	Domain-general BPE tokenizer trained from scratch on 3.96 million documents
	spanning natural language, code, mathematics, and scientific text.

	\| Parameter \| Value \|
	\|---\|---\|
	\| Algorithm \| Byte-Pair Encoding (BPE) \|
	\| Vocabulary size \| 32,685 \|
	\| Merges \| 32,493 \|
	\| Byte encoding \| GPT-2 byte-level (256-char alphabet) \|
	\| Min frequency \| 3 \|

	## Quick start

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")

	ids = tokenizer("Hello, world!")
	print(ids)
	```

	## Use in a training loop

	```python
	from transformers import PreTrainedTokenizerFast

	tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")

	inputs = tokenizer(
	["Hello world", "def foo(): pass"],
	truncation=True,
	max_length=2048,
	padding="max_length",
	return_tensors="pt",
	)
	```

	## Special tokens

	\| Token \| Role \|
	\|---\|---\|
	\| `<\\|endoftext\\|>` \| BOS / EOS \|
	\| `<\\|unk\\|>` \| Unknown \|
	\| `<\\|pad\\|>` \| Padding \|
	\| `<think>` / `</think>` \| Chain-of-thought delimiters \|
	\| `<\\|user\\|>` / `<\\|assistant\\|>` / `<\\|system\\|>` \| Chat roles \|
	\| `<\\|im_start\\|>` / `<\\|im_end\\|>` \| ChatML-style markers \|
	\| `<\\|tool_call\\|>` / `<\\|tool_result\\|>` \| Tool use \|

	## Training data

	\| Domain \| Source \|
	\|---\|---\|
	\| Natural language \| Wikipedia (multilingual), Common Crawl \|
	\| Code \| The Stack \|
	\| Mathematics \| MATH dataset, arXiv \|
	\| Science \| PubMed, S2ORC \|

	Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)