Sigurdur/ice-tokenizer


Sigurdur committed on Nov 21, 2023

Commit 5281580 · 1 Parent(s): a07fd18

Create README.md

Files changed (1)

README.md +42 -0

README.md ADDED

	@@ -0,0 +1,42 @@

+---
+language:
+- is
+library_name: transformers
+---
+# Icelandic Tokenizer README
+## Overview
+This BPE (Byte Pair Encoding) tokenizer is designed for the Icelandic GPT model, available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and splits Icelandic text into subword tokens.
+## Usage
+Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:
+```python
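+from transformers import GPT2Tokenizer
+# Load the tokenizer from the Hugging Face Hub
+tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-gpt")
+# Reuse the end-of-sequence token for padding
+tokenizer.pad_token = tokenizer.eos_token
+# Encode a sentence and print its token IDs
+print(tokenizer("Halló heimur!")["input_ids"])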
+```
+## Citation
+If you use this tokenizer in your work, please cite the original source of the training data:
+```bibtex
+@misc{20.500.12537/254,
+  title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
+  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
+  url = {http://hdl.handle.net/20.500.12537/254},
+  note = {{CLARIN}-{IS}},
+  year = {2022}
+}
+```
+## Feedback
+We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.
+Happy tokenizing!
+Siggi, Senior Programmer
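
The Usage example in the README above sets the padding token to the end-of-sequence token but does not show what padding does. The following is a minimal sketch of that idea in plain Python, not the tokenizer's actual implementation: `pad_batch`, `EOS_ID`, and the token IDs are hypothetical names and values chosen only for illustration.

```python
from typing import List, Tuple


def pad_batch(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Hypothetical token IDs and EOS ID, chosen only for illustration
EOS_ID = 0
input_ids, attention_mask = pad_batch([[15, 27, 4], [8]], pad_id=EOS_ID)
print(input_ids)       # [[15, 27, 4], [8, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```

With the real tokenizer, calling it on a list of strings with `padding=True` performs the equivalent step and returns an `attention_mask` alongside `input_ids`. Because the padding ID equals the end-of-sequence ID here, the mask is what distinguishes padding from a genuine end-of-sequence token.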