mjbommar
/

magic-bert-tokenizer-4k

binary-analysis

file-type-detection

Model card Files Files and versions

magic-bert-tokenizer-4k / README.md

mjbommar's picture

Fix usage example with correct repo ID

e69b18d verified about 1 month ago

|

history blame contribute delete

1.83 kB

	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- tokenizer
	- bpe
	- binary-analysis
	- file-type-detection
	- magic-bert
	---

	# Magic-BERT Tokenizer (4K vocabulary)

	A byte-level BPE tokenizer trained on binary file data for the Magic-BERT project.
	This tokenizer is designed for binary file classification and analysis tasks.

	## Features

	- Vocabulary Size: 4,096 tokens
	- Type: Byte-level BPE (Byte Pair Encoding)
	- Training Data: Binary files from diverse sources (executables, documents, archives, media files, etc.)
	- Encoding: Latin-1 (ISO-8859-1) - treats each byte as a character

	## Special Tokens

	\| Token \| ID \| Purpose \|
	\|-------\|-----\|---------\|
	\| `<\\|start\\|>` \| 0 \| Beginning of sequence (BOS) \|
	\| `<\\|end\\|>` \| 1 \| End of sequence (EOS) \|
	\| `<\\|pad\\|>` \| 2 \| Padding \|
	\| `<\\|unk\\|>` \| 3 \| Unknown token \|
	\| `<\\|cls\\|>` \| 4 \| Classification token \|
	\| `<\\|sep\\|>` \| 5 \| Separator token \|
	\| `<\\|mask\\|>` \| 6 \| Mask token (for MLM) \|

	## Usage

	```python
	from transformers import AutoTokenizer

	# Load the tokenizer
	tokenizer = AutoTokenizer.from_pretrained("mjbommar/magic-bert-tokenizer-4k")

	# Tokenize binary data (read as latin-1)
	with open("some_file.bin", "rb") as f:
	binary_data = f.read()
	text = binary_data.decode("latin-1")

	# Encode
	tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	print(tokens)
	```

	## Training Details

	This tokenizer was trained using the HuggingFace `tokenizers` library with:
	- Byte-level BPE algorithm
	- Training on ~100K+ binary file samples
	- Diverse file types: executables, documents, archives, media, source code, etc.

	## Model

	This tokenizer is designed to be used with Magic-BERT models for binary file
	MIME type classification. See the main model repository for more details.

	## License

	Apache 2.0