---
language:
- is
library_name: transformers
---

# Icelandic Tokenizer README

## Overview

This BPE (Byte Pair Encoding) tokenizer is designed for the Icelandic GPT model, available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and segments Icelandic text into subword tokens.
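
For readers new to BPE, the idea can be sketched in a few lines: repeatedly find the most frequent adjacent pair of symbols in the corpus and fuse it into a new symbol. The code below is a toy illustration under simplifying assumptions, not the code used to train this tokenizer. The names `learn_bpe`, `merge_pair`, and `apply_bpe` and the four-word corpus are hypothetical, and the sketch works on characters and ignores the byte-level encoding and word-boundary handling that GPT-2 style tokenizers use.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every non-overlapping occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus of Icelandic word forms.
corpus = ["hestur", "hestar", "hestum", "heimur"]
merges = learn_bpe(corpus, num_merges=3)
print(merges)
print(apply_bpe("hestur", merges))
```

With only three merges, the shared stem of the `hest-` forms is already fused into one token while the rarer endings stay split into characters. This is the behaviour that makes BPE a reasonable fit for a richly inflected language such as Icelandic.
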

## Usage

Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:

```python
from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Encode a sample sentence ("Hello world!") into token ids
input_ids = tokenizer("Halló heimur!")["input_ids"]
print(input_ids)
```
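
The example sets the padding id to the end-of-sequence id, presumably because GPT-2 style tokenizers do not define a padding token by default. As a rough illustration of what padding then does to a batch, here is a pure-Python sketch. The `pad_batch` helper and the token ids are hypothetical and not part of `transformers`, and right-padding is assumed.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    # Append the pad id until every sequence has the same length.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Made-up token ids; pad_id=0 stands in for tokenizer.eos_token_id.
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)
```

In practice, calling the tokenizer on a list of strings with `padding=True` performs this step and returns the attention mask alongside the input ids.
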

## Citation

If you use this tokenizer in your work, please cite the original source of the training data:

```bibtex
@misc{20.500.12537/254,
  title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
  url = {http://hdl.handle.net/20.500.12537/254},
  note = {{CLARIN}-{IS}},
  year = {2022}
}
```

## Feedback

Feedback that helps improve the tokenizer is welcome. Feel free to reach out with your insights and suggestions.

Happy tokenizing!

Sigurdur Haukur Birgisson

(README created with ChatGPT)