Create README.md
README.md
ADDED
@@ -0,0 +1,42 @@
---
language:
- is
library_name: transformers
---

# Icelandic Tokenizer README

## Overview
This byte pair encoding (BPE) tokenizer is built for the Icelandic GPT model, available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022), so its vocabulary is fitted to Icelandic text.

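To illustrate what byte pair encoding does, here is a minimal sketch of BPE merge learning in plain Python. It is a toy, character-level version that assumes whitespace-separated words; the real tokenizer follows the GPT-2 design, which works on bytes, and its actual merges come from the IGC-2022 training run, not from this code. The function names `most_frequent_pair`, `merge_pair`, and `learn_bpe` are hypothetical and exist only for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` is one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy corpus: after three merges, "heim" has become a single symbol.
merges, words = learn_bpe("heimur heimur heim", 3)
print(merges)
print(words)
```

Frequent character sequences end up as single tokens, which is why a tokenizer trained on Icelandic text tends to split Icelandic words into fewer pieces than one trained on another language.
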
## Usage
Use this tokenizer to preprocess Icelandic text in your NLP pipeline. The following example loads it and encodes a sentence:

```python
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-gpt")
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Encode a sentence and print its token IDs
print(tokenizer("Halló heimur!")["input_ids"])
```
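The padding assignment in the example matters once several texts of different lengths are encoded together. The sketch below shows, in plain Python, what right-padding a batch with the end-of-sequence ID and building the matching attention mask amounts to. The helper `pad_batch`, the constant `EOS_ID`, and the token IDs are all hypothetical; in practice, calling the tokenizer on a list of texts with `padding=True` handles this for you.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-ID lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs; real values depend on the tokenizer's vocabulary.
EOS_ID = 0  # assumed end-of-sequence ID, reused as the padding ID
batch = pad_batch([[17, 42, 5], [8]], pad_id=EOS_ID)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding ID is the same as the end-of-sequence ID, the attention mask is what tells padded positions apart from a genuine end-of-sequence token.
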

## Citation
If you use this tokenizer in your work, please cite the original source of the training data:

```bibtex
@misc{20.500.12537/254,
 title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
 author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
 url = {http://hdl.handle.net/20.500.12537/254},
 note = {{CLARIN}-{IS}},
 year = {2022}
}
```

## Feedback
Feedback that helps improve the tokenizer is welcome. Feel free to reach out with your insights and suggestions.

Happy tokenizing!
Siggi, Senior Programmer