---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- binary-analysis
- file-type-detection
- magic-bert
---

# Magic-BERT Tokenizer (32K vocabulary)

A byte-level BPE tokenizer trained on raw binary file data for the Magic-BERT project,
designed for binary file classification and analysis tasks.

## Features

- **Vocabulary Size:** 32,768 tokens
- **Type:** Byte-level BPE (Byte Pair Encoding)
- **Training Data:** Binary files from diverse sources (executables, documents, archives, media files, etc.)
- **Encoding:** Latin-1 (ISO-8859-1) - treats each byte as a character
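
Latin-1 assigns every byte value 0–255 its own codepoint, so arbitrary binary data round-trips losslessly through this encoding. A quick sanity check:

```python
# Latin-1 (ISO-8859-1) maps each byte value 0-255 to a single character,
# so any binary blob can be decoded and re-encoded without loss.
data = bytes(range(256))           # every possible byte value
text = data.decode("latin-1")      # 256 one-character codepoints
assert len(text) == 256
assert text.encode("latin-1") == data  # lossless round trip
```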

## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Beginning of sequence (BOS) |
| `<\|end\|>` | 1 | End of sequence (EOS) |
| `<\|pad\|>` | 2 | Padding |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification token |
| `<\|sep\|>` | 5 | Separator token |
| `<\|mask\|>` | 6 | Mask token (for MLM) |
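
As an illustration of how these tokens frame a sample, the sketch below wraps a decoded binary payload with `<|cls|>` and `<|sep|>`. The template shown is an assumption for illustration only; the actual input format expected by Magic-BERT models may differ.

```python
# Illustrative framing of one binary sample for classification.
# NOTE: this template is an assumption -- check the model repository
# for the input format Magic-BERT models actually expect.
CLS, SEP = "<|cls|>", "<|sep|>"

payload = b"\x7fELF\x01\x01".decode("latin-1")  # first bytes of an ELF header
framed = CLS + payload + SEP
print(framed.startswith(CLS), framed.endswith(SEP))
```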

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mjbommar/magic-bert-tokenizer-32k")

# Read the raw bytes, then decode as latin-1 (each byte maps to one character)
with open("some_file.bin", "rb") as f:
    binary_data = f.read()
text = binary_data.decode("latin-1")

# Encode
tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(tokens)
```

## Training Details

This tokenizer was trained using the HuggingFace `tokenizers` library with:
- Byte-level BPE algorithm
- A training corpus of roughly 100K binary file samples
- Diverse file types: executables, documents, archives, media, source code, etc.
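
The core of the BPE algorithm is easy to sketch: start from individual bytes and repeatedly merge the most frequent adjacent symbol pair. The toy, pure-Python illustration below is not the actual training code (which used the `tokenizers` library), but it shows why recurring byte patterns such as file magic numbers become single tokens.

```python
from collections import Counter

def bpe_merges(samples, num_merges):
    """Learn up to `num_merges` byte-pair merges from raw byte samples (toy BPE)."""
    # Start with each byte as its own symbol, as a 1-char latin-1 string.
    seqs = [[chr(b) for b in s] for s in samples]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across all sequences.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the winning pair with the merged symbol.
        for i, seq in enumerate(seqs):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == a and seq[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return merges

# The recurring ELF magic bytes b"\x7fELF" get absorbed into one symbol first.
samples = [b"\x7fELF\x01\x01", b"\x7fELF\x02\x01", b"MZ\x90\x00"]
print(bpe_merges(samples, 3))
```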

## Model

This tokenizer is designed to be used with Magic-BERT models for binary file
MIME type classification. See the main model repository for more details.

## License

Apache 2.0