---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- binary-analysis
- file-type-detection
- magic-bert
---
# Magic-BERT Tokenizer (4K vocabulary)
A byte-level BPE tokenizer trained on binary file data for the Magic-BERT project.
This tokenizer is designed for binary file classification and analysis tasks.
## Features
- **Vocabulary Size:** 4,096 tokens
- **Type:** Byte-level BPE (Byte Pair Encoding)
- **Training Data:** Binary files from diverse sources (executables, documents, archives, media files, etc.)
- **Encoding:** Latin-1 (ISO-8859-1) - treats each byte as a character
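Latin-1 works here because it maps every byte value 0–255 to exactly one character, so binary data survives a decode/encode round trip unchanged. A quick self-contained check:

```python
# Latin-1 (ISO-8859-1) assigns one character to each byte value 0-255,
# so raw bytes can be decoded to text and re-encoded without loss.
data = bytes(range(256))               # every possible byte value
text = data.decode("latin-1")          # one character per byte
assert len(text) == 256
assert text.encode("latin-1") == data  # lossless round trip
```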
## Special Tokens
| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Beginning of sequence (BOS) |
| `<\|end\|>` | 1 | End of sequence (EOS) |
| `<\|pad\|>` | 2 | Padding |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification token |
| `<\|sep\|>` | 5 | Separator token |
| `<\|mask\|>` | 6 | Mask token (for MLM) |
## Usage
```python
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mjbommar/magic-bert-tokenizer-4k")
# Tokenize binary data (read as latin-1)
with open("some_file.bin", "rb") as f:
    binary_data = f.read()
text = binary_data.decode("latin-1")
# Encode
tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(tokens)
```
## Training Details
This tokenizer was trained using the HuggingFace `tokenizers` library with:
- Byte-level BPE algorithm
- Training on approximately 100K binary file samples
- Diverse file types: executables, documents, archives, media, source code, etc.
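As a rough illustration of what one BPE merge step does (this is not the actual training code, which uses the HuggingFace `tokenizers` library), each round finds the most frequent adjacent token pair and fuses it into a new vocabulary entry:

```python
from collections import Counter

def bpe_merge_once(seq):
    """Perform one BPE merge: fuse the most frequent adjacent pair.

    `seq` is a list of string tokens (initially one per byte/character).
    Returns the new sequence and the pair that was merged.
    """
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq, None
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            merged.append(a + b)   # replace the pair with one token
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged, (a, b)

tokens, pair = bpe_merge_once(list("ABABCABAB"))
print(pair, tokens)  # ('A', 'B') ['AB', 'AB', 'C', 'AB', 'AB']
```

Repeating this until the vocabulary reaches 4,096 entries yields tokens that capture frequent byte sequences (magic numbers, common headers) in the training files.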
## Model
This tokenizer is designed to be used with Magic-BERT models for binary file
MIME type classification. See the main model repository for more details.
## License
Apache 2.0