---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- binary-analysis
- file-type-detection
- magic-bert
---

# Magic-BERT Tokenizer (16K vocabulary)

A byte-level BPE tokenizer trained on binary file data for the Magic-BERT project.
It is intended for binary file classification and analysis tasks.

## Features

- **Vocabulary Size:** 16,384 tokens
- **Type:** Byte-level BPE (Byte Pair Encoding)
- **Training Data:** Binary files from diverse sources (executables, documents, archives, media files, etc.)
- **Encoding:** Latin-1 (ISO-8859-1), which maps every byte value to a single character
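
Because latin-1 assigns a character to every byte value 0–255, decoding arbitrary binary data to a string and re-encoding it is lossless. A quick check:

```python
# latin-1 maps every byte value 0-255 to exactly one character, so the
# bytes -> str -> bytes round trip is lossless for arbitrary binary data.
data = bytes(range(256))
text = data.decode("latin-1")
assert len(text) == 256                  # one character per byte
assert text.encode("latin-1") == data    # exact round trip
```

This is why the tokenizer can treat raw file contents as ordinary text.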

## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Beginning of sequence (BOS) |
| `<\|end\|>` | 1 | End of sequence (EOS) |
| `<\|pad\|>` | 2 | Padding |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification token |
| `<\|sep\|>` | 5 | Separator token |
| `<\|mask\|>` | 6 | Mask token (for MLM) |
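
The same mapping, written out as a plain Python dict for quick reference (the IDs follow the table above; the dict name itself is illustrative):

```python
# Special-token IDs as listed in the table above.
SPECIAL_TOKENS = {
    "<|start|>": 0,  # BOS
    "<|end|>": 1,    # EOS
    "<|pad|>": 2,
    "<|unk|>": 3,
    "<|cls|>": 4,
    "<|sep|>": 5,
    "<|mask|>": 6,   # MLM mask
}
```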

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mjbommar/magic-bert-tokenizer-16k")

# Tokenize binary data (read as latin-1)
with open("some_file.bin", "rb") as f:
    binary_data = f.read()
text = binary_data.decode("latin-1")

# Encode
tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(tokens)
```

## Training Details

This tokenizer was trained using the HuggingFace `tokenizers` library with:
- Byte-level BPE algorithm
- Training on roughly 100K binary file samples
- Diverse file types: executables, documents, archives, media, source code, etc.
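
As a toy illustration of the byte-pair-encoding idea (not the actual training code, which uses the `tokenizers` library), a single BPE merge step over a byte sequence can be sketched as:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent ID pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

data = list(b"MZMZMZ\x90\x00")   # toy byte string with a repeated "MZ" pair
pair = most_frequent_pair(data)   # (77, 90), i.e. the bytes of "MZ"
merged = merge(data, pair, 256)   # new token ID just past the 256 byte range
```

Real training repeats this merge loop until the vocabulary reaches its target size (here, 16,384 tokens), with each merge adding one new token ID above the 256 base byte tokens.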

## Model

This tokenizer is designed to be used with Magic-BERT models for binary file
MIME type classification. See the main model repository for more details.

## License

Apache 2.0