---
language:
- is
library_name: transformers
---
# Icelandic Tokenizer README
## Overview
This Byte Pair Encoding (BPE) tokenizer was built for the Icelandic GPT model available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and segments Icelandic text into subword tokens.
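To illustrate the idea behind BPE training, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy corpus. This is a simplified, hypothetical illustration of the algorithm, not the actual training code or vocabulary of this tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent one.
    `words` maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is split into characters, mapped to its frequency
words = {tuple("halló"): 5, tuple("heimur"): 3, tuple("heim"): 2}
merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
```

Each learned merge rule joins two symbols into a longer subword; the real tokenizer applies thousands of such merges learned from the IGC-2022 corpus.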
## Usage
Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:
```python
from transformers import GPT2Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")

# GPT-2-style tokenizers define no pad token by default; reuse the EOS token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Tokenize a string and inspect the resulting token IDs
input_ids = tokenizer("Halló heimur!")["input_ids"]
print(input_ids)
```
## Citation
If you use this tokenizer in your work, please cite the original source of the training data:
```bibtex
@misc{20.500.12537/254,
title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
url = {http://hdl.handle.net/20.500.12537/254},
note = {{CLARIN}-{IS}},
year = {2022}
}
```
## Feedback
We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.
Happy tokenizing!
Sigurdur Haukur Birgisson
(README created with ChatGPT)