---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- tokenizer
- bpe
- binary-analysis
- file-type-detection
- magic-bert
---

# Magic-BERT Tokenizer (4K vocabulary)

A byte-level BPE tokenizer trained on binary file data for the Magic-BERT project.
This tokenizer is designed for binary file classification and analysis tasks.

## Features

- **Vocabulary Size:** 4,096 tokens
- **Type:** Byte-level BPE (Byte Pair Encoding)
- **Training Data:** Binary files from diverse sources (executables, documents, archives, media files, etc.)
- **Encoding:** Latin-1 (ISO-8859-1), which maps each byte value (0x00-0xFF) to the character with the same code point, so arbitrary binary data round-trips losslessly through `str`
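
The Latin-1 round trip above is the key property that lets a text tokenizer consume raw binary data. A quick stdlib-only sanity check:

```python
# Latin-1 maps every byte value 0-255 to the Unicode code point of the
# same value, so bytes -> str -> bytes is a lossless round trip.
data = bytes(range(256))          # every possible byte value
text = data.decode("latin-1")     # 256 one-character mappings
assert text.encode("latin-1") == data
assert all(ord(ch) == b for ch, b in zip(text, data))
```

This is why UTF-8 is not used here: many byte sequences are invalid UTF-8 and would raise or require lossy replacement, while Latin-1 decoding never fails.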

## Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Beginning of sequence (BOS) |
| `<\|end\|>` | 1 | End of sequence (EOS) |
| `<\|pad\|>` | 2 | Padding |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification token |
| `<\|sep\|>` | 5 | Separator token |
| `<\|mask\|>` | 6 | Mask token (for MLM) |

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mjbommar/magic-bert-tokenizer-4k")

# Tokenize binary data (read as latin-1)
with open("some_file.bin", "rb") as f:
    binary_data = f.read()
text = binary_data.decode("latin-1")

# Encode
tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(tokens)
```
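
With `truncation=True, max_length=512`, anything past the first 512 tokens of a large file is dropped. One common workaround is to split the file into fixed-size byte windows and tokenize each window separately; `byte_windows` below is a hypothetical helper (not part of this repository), and the 512/256 window/stride values are illustrative:

```python
def byte_windows(data: bytes, size: int = 512, stride: int = 256):
    """Yield Latin-1-decoded windows of a binary blob for chunked tokenization."""
    for start in range(0, max(len(data) - size, 0) + 1, stride):
        yield data[start:start + size].decode("latin-1")

blob = bytes(range(256)) * 8      # 2048-byte dummy payload
windows = list(byte_windows(blob))
# Each window can then be passed to the tokenizer as in the Usage example.
```

An overlapping stride (stride < size) keeps byte patterns that straddle a window boundary visible in at least one window.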

## Training Details

This tokenizer was trained using the HuggingFace `tokenizers` library with:
- Byte-level BPE algorithm
- A training corpus of roughly 100K binary file samples
- Diverse file types: executables, documents, archives, media, source code, etc.
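
The setup above can be sketched with the `tokenizers` library. The vocabulary size and special tokens follow this card; the corpus here (synthetic Latin-1 strings) and `min_frequency=2` are illustrative assumptions, not the actual training configuration:

```python
from tokenizers import ByteLevelBPETokenizer

# Special tokens in the ID order documented above (IDs 0-6).
SPECIAL_TOKENS = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>",
                  "<|cls|>", "<|sep|>", "<|mask|>"]

# Stand-in corpus: binary blobs decoded as Latin-1 strings.
corpus = [bytes(range(256)).decode("latin-1")] * 100

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=4096,
    min_frequency=2,
    special_tokens=SPECIAL_TOKENS,
)
```

Because the trainer registers special tokens first, they receive IDs 0-6 in the order given, matching the table in this card.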

## Model

This tokenizer is designed to be used with Magic-BERT models for binary file
MIME type classification. See the main model repository for more details.

## License

Apache 2.0