---
license: mit
language:
- ar
library_name: tokenizers
pipeline_tag: summarization
tags:
- arabic
- summarization
- tokenizers
- BPE
---

## Byte Level (BPE) Tokenizer for Arabic

A robust byte-level tokenizer designed to handle Arabic text with precision and efficiency.
It uses a `Byte-Pair Encoding (BPE)` approach to build a vocabulary of `32,000` tokens, catering specifically to the intricacies of the Arabic language.

### Goal

This tokenizer was created as part of the development of an Arabic BART transformer model for summarization, built from scratch using `PyTorch`.
In adherence to the configurations outlined in the official [BART](https://arxiv.org/abs/1910.13461) paper, which specifies the use of BPE tokenization, I sought a BPE tokenizer specifically tailored for Arabic.
While Arabic-only tokenizers and multilingual BPE tokenizers exist, a dedicated Arabic BPE tokenizer was not available. This gap inspired the creation of a `BPE` tokenizer focused solely on Arabic, ensuring alignment with BART's recommended configuration and enhancing the effectiveness of Arabic NLP tasks.

### Checkpoint Information

- **Name**: `IsmaelMousa/arabic-bpe-tokenizer`
- **Vocabulary Size**: `32,000`

### Overview

The byte-level tokenizer is optimized to manage Arabic text, which often includes a range of diacritics, different forms of the same word, and various prefixes and suffixes. This tokenizer addresses these challenges by breaking text down into byte-level tokens, ensuring that it can effectively process and understand the nuances of the Arabic language.

### Features

- **Byte-Pair Encoding (BPE)**: Efficiently manages a large vocabulary size while maintaining accuracy.
- **Comprehensive Coverage**: Handles Arabic script, including diacritics and various word forms.
- **Flexible Integration**: Easily integrates with the `tokenizers` library for seamless tokenization.

### Installation

To use this tokenizer, you need to install the `tokenizers` library. If you haven't installed it yet, you can do so using pip:

```bash
pip install tokenizers
```

### Example Usage

Here is an example of how to use the tokenizer with the `tokenizers` library.
It demonstrates tokenization of the Arabic sentence "لاشيء يعجبني, أريد أن أبكي":

```python
from tokenizers import Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("IsmaelMousa/arabic-bpe-tokenizer")

text = "لاشيء يعجبني, أريد أن أبكي"

encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

print("Encoded Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)
print("Decoded Text:", decoded)
```

Output:

```bash
Encoded Tokens: ['<s>', 'ÙĦا', 'ĠØ´ÙĬØ¡', 'ĠÙĬع', 'جب', 'ÙĨÙĬ', ',', 'ĠأرÙĬد', 'ĠØ£ÙĨ', 'Ġأب', 'ÙĥÙĬ', '</s>']
Token IDs: [0, 419, 1773, 667, 2281, 489, 16, 7578, 331, 985, 1344, 2]
Decoded Text: لا شيء يعجبني, أريد أن أبكي
```
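
The encoded token strings above look garbled by design: byte-level BPE represents each UTF-8 byte of the Arabic text with a printable stand-in character (for example, `Ġ` stands for a leading space). The `tokenizers` library's `ByteLevel` decoder maps such token strings back to readable text; a minimal illustration with ASCII tokens:

```python
from tokenizers import decoders

# The ByteLevel decoder reverses the byte-to-character mapping
# used to print byte-level BPE token strings.
decoder = decoders.ByteLevel()

# "Ġ" is the byte-level stand-in for a leading space (byte 0x20).
print(decoder.decode(["Ġhello", "Ġworld"]))  # -> " hello world"
```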

### Tokenizer Details

- **Byte-Level Tokenization**: This method ensures that every byte of input text is considered, making it suitable for languages with complex scripts.
- **Adaptability**: Can be fine-tuned or used as-is, depending on your specific needs and application scenarios.
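
For readers who want to adapt or reproduce such a tokenizer, the sketch below trains a small byte-level BPE tokenizer with the `tokenizers` library. The corpus, vocabulary size, and special tokens here are illustrative assumptions, not the settings used to train this checkpoint:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE: pre-tokenize text into bytes, then learn BPE merges on top.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Illustrative corpus and settings -- not the checkpoint's actual training data.
corpus = ["لاشيء يعجبني, أريد أن أبكي", "أريد أن أقرأ كتابا جديدا"]
trainer = trainers.BpeTrainer(
    vocab_size=300,  # the real checkpoint uses 32,000
    special_tokens=["<s>", "</s>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # cover all 256 bytes
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Round-trip: encoding then decoding should recover the original sentence.
encoded = tokenizer.encode(corpus[0])
print(tokenizer.decode(encoded.ids))
```

Because the initial alphabet covers all 256 byte values, the trained tokenizer can encode any input text without unknown tokens, which is the main practical advantage of byte-level BPE.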

### License

This project is licensed under the `MIT` License.