Sigurdur committed · Commit 5281580 · 1 Parent(s): a07fd18

Create README.md

  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ language:
+ - is
+ library_name: transformers
+ ---
+
+ # Icelandic Tokenizer README
+
+ ## Overview
+ This BPE (byte-pair encoding) tokenizer is designed for the Icelandic GPT model, available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and is intended to segment Icelandic text into subword tokens.
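For readers unfamiliar with byte-pair encoding, the idea can be illustrated with a toy sketch. This is not the training code used for this tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the sample corpus are hypothetical, and the sketch merges characters for readability, whereas GPT-2-style tokenizers operate on bytes.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy corpus (an assumption for illustration, not taken from IGC-2022)
merges, words = learn_bpe("halló halló heimur", num_merges=4)
print(merges)
print(words)
```

Frequent words collapse into a single symbol after a few merges, while rarer words stay split into smaller pieces; a tokenizer trained on a large Icelandic corpus applies the same principle at scale.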
+
+ ## Usage
+ Use this tokenizer to preprocess Icelandic text in an NLP pipeline. The following example loads it and encodes a short sentence:
+
+ ```python
+ from transformers import GPT2Tokenizer
+
+ # Load the tokenizer
+ tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-gpt")
+ # Reuse the end-of-sequence ID for padding (GPT-2 tokenizers usually ship without a pad token)
+ tokenizer.pad_token_id = tokenizer.eos_token_id
+
+ # Encode a sample sentence and print its token IDs
+ print(tokenizer("Halló heimur!")["input_ids"])
+ ```
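The pad token set in the example above only matters once sequences of different lengths are batched together. The following is a minimal pure-Python sketch of right-padding with an attention mask. The helper `pad_batch` and the token IDs are hypothetical and serve only as illustration; in practice the tokenizer performs this step itself when called with `padding=True`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Made-up token IDs; real ones come from the tokenizer
batch = [[12, 7, 301], [45]]
input_ids, attention_mask = pad_batch(batch, pad_id=0)
print(input_ids)
print(attention_mask)
```

Because padded positions are masked out, reusing the end-of-sequence ID as the pad ID does not change what the model attends to.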
+
+ ## Citation
+ If you use this tokenizer in your work, please cite the original source of the training data:
+
+ ```bibtex
+ @misc{20.500.12537/254,
+   title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
+   author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
+   url = {http://hdl.handle.net/20.500.12537/254},
+   note = {{CLARIN}-{IS}},
+   year = {2022}
+ }
+ ```
+
+ ## Feedback
+ We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.
+
+ Happy tokenizing!
+ Siggi, Senior Programmer