Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,41 @@
# Custom Tokenizer
## Examples
Example sentence: `This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces.
and a backslash!! Eléonore est un prénom français. __name__ isInstance`

Encoded sentence: `['▁This', '▁is', '▁a', '▁test', '▁sent', 'ence.', '▁On', '▁va', '▁voir', '▁comment', '▁elle', '▁est', '▁g', 'érée', '▁....', '▁', '1', '2', '3', '▁+', '▁', '5', '6', '▁=', '▁', '2', '5', '6', '7', '.', "▁Let's", '▁go', '!', '▁Im', 'ag', 'ine', '▁I', '▁have', '▁code', '▁', '▁', '▁', '▁', '4', '▁spaces', '.\n', '▁and', '▁a', '▁', '▁', '▁', '▁', '▁', '▁back', 'sl', 'ash', '!!', '▁El', 'éon', 'ore', '▁est', '▁un', '▁prénom', '▁français.', '▁__name__', '▁is', 'Instance']`

Decoded sentence: `<s> This is a test sentence. On va voir comment elle est gérée .... 123 + 56 = 2567. Let's go! Imagine I have code 4 spaces.
and a backslash!! Eléonore est un prénom français. __name__ isInstance`
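In the encoded output, the `▁` marker stands for a space that preceded the piece, following the SentencePiece convention. As a rough sketch of how such pieces map back to text, the hypothetical helper `pieces_to_text` below joins pieces copied from two parts of the encoded example. It is not the tokenizer's actual decoder, which also handles special tokens such as `<s>` and byte-fallback pieces.

```python
# Illustrative helper (hypothetical name); not part of transformers.
MARKER = "▁"  # SentencePiece word-boundary marker, as seen in the encoded output


def pieces_to_text(pieces):
    """Join SentencePiece-style pieces and turn each marker into a space."""
    text = "".join(pieces).replace(MARKER, " ")
    # Drop only the single space contributed by the first piece's marker.
    return text[1:] if text.startswith(" ") else text


# Pieces copied from two parts of the encoded example above.
pieces = ['▁This', '▁is', '▁a', '▁test', '▁sent', 'ence.',
          '▁', '1', '2', '3', '▁+', '▁', '5', '6', '▁=',
          '▁', '2', '5', '6', '7', '.']
text = pieces_to_text(pieces)
print(text)
```

Each digit arrives as its own piece, so the numbers are rebuilt simply by concatenation.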
## Usage
```python
from transformers import LlamaTokenizerFast

# '<tok_name>' is a placeholder for this tokenizer's repository name
tok = LlamaTokenizerFast.from_pretrained('<tok_name>')

# Split a string into tokens
tok.tokenize('This is a test sentence')
```
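To inspect the same three views shown in the Examples section for any input, a small helper along these lines may be useful. It is only a sketch: `show_roundtrip` is a hypothetical name, and it assumes the object passed in exposes the standard `tokenize`, `encode`, and `decode` methods of Hugging Face tokenizers.

```python
def show_roundtrip(tok, text):
    """Return the pieces, ids, and decoded text for `text`."""
    pieces = tok.tokenize(text)  # string pieces, e.g. '▁This'
    ids = tok.encode(text)       # integer ids; may include special tokens such as <s>
    decoded = tok.decode(ids)    # text rebuilt from the ids
    return {"pieces": pieces, "ids": ids, "decoded": decoded}


# Example call, assuming `tok` was loaded as in the snippet above:
# result = show_roundtrip(tok, "This is a test sentence")
# print(result["pieces"])
```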
## Dataset Stats
The tokenizer was trained on the `manu/tok-corpus-shuffled` dataset.

The dataset consists of French, English, and code samples.

More information is available on the [dataset page](https://huggingface.co/datasets/manu/tok-corpus-shuffled).

For speed, the tokenizer was trained on a subset of the dataset: only the first samples were selected.

Sample size: 2000000

Size of sampled data: 7.0 GB
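Selecting only the first samples of a large corpus can be done lazily, without reading the rest. The sketch below shows the idea with `itertools.islice`. `take_first` is a hypothetical helper and the corpus is a stand-in; how the data was actually loaded and sampled is not documented here.

```python
from itertools import islice


def take_first(samples, n):
    """Lazily yield the first `n` items of any iterable."""
    return islice(samples, n)


# Stand-in corpus for illustration; the real run used 2000000 samples.
corpus = (f"document {i}" for i in range(10))
subset = list(take_first(corpus, 3))
print(subset)
```

Any iterable works as `samples`, so a streamed dataset could presumably be passed in the same way.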
## Tokenizer Configs
Built from scratch: True

Pretrained tokenizer: None

The tokenizer is trained with digit separation, whitespace handling (for code), byte fallback...
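Two of these options can be illustrated in a few lines. This is a sketch of the concepts, not the tokenizer's implementation, and the helper names are hypothetical. Digit separation means each digit becomes its own token, as in `'2', '5', '6', '7'` in the Examples section. Byte fallback means a character missing from the vocabulary is emitted as its UTF-8 bytes, written `<0xNN>` in SentencePiece-style vocabularies.

```python
import re


def split_digits(text):
    """Split text so that every digit is a separate piece."""
    return re.findall(r"\d|\D+", text)


def byte_fallback(char):
    """Represent a character as SentencePiece-style UTF-8 byte tokens."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


print(split_digits("= 2567."))
print(byte_fallback("\u00e9"))  # 'é' as two byte tokens
```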