---
language:
- is
library_name: transformers
---

# Icelandic Tokenizer README

## Overview
This BPE (Byte Pair Encoding) tokenizer was built for the Icelandic GPT model available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and segments Icelandic text into meaningful subword tokens.
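To illustrate the idea behind BPE, here is a minimal, self-contained sketch of one training step: count adjacent token pairs and merge the most frequent one. This is purely illustrative; the actual merges of this tokenizer were learned from the Icelandic Gigaword Corpus, not computed by this toy function.

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One illustrative BPE training step: merge the most frequent
    adjacent pair of tokens. Not the real tokenizer's merge table."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    best = max(pairs, key=pairs.get)
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best

# Start from the characters of "heimur" (Icelandic for "world")
# and apply two merge steps.
tokens = list("heimur")          # ['h', 'e', 'i', 'm', 'u', 'r']
tokens, _ = bpe_merge_step(tokens)
tokens, _ = bpe_merge_step(tokens)
print(tokens)                    # merges accumulate into larger subwords
```

Repeating this step over a large corpus, and keeping the ordered list of merges, yields the vocabulary and merge rules that a trained BPE tokenizer applies at inference time.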

## Usage
Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:

```python
from transformers import GPT2Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")

# GPT-2-style tokenizers have no pad token by default; reuse EOS
tokenizer.pad_token_id = tokenizer.eos_token_id

# Tokenize an Icelandic greeting ("Hello world!")
print(tokenizer("Halló heimur!")["input_ids"])
```

## Citation
If you use this tokenizer in your work, please cite the original source of the training data:

```bibtex
@misc{20.500.12537/254,
  title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
  url = {http://hdl.handle.net/20.500.12537/254},
  note = {{CLARIN}-{IS}},
  year = {2022}
}
```

## Feedback
We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.

Happy tokenizing!

Sigurdur Haukur Birgisson


(readme created with chatgpt)