---
language:
- is
library_name: transformers
---

# Icelandic Tokenizer README

## Overview
This BPE (Byte Pair Encoding) tokenizer was built for the Icelandic GPT model available at [Sigurdur/ice-gpt](https://huggingface.co/Sigurdur/ice-gpt). It was trained on the annotated version of the Icelandic Gigaword Corpus (IGC-2022) and is intended for segmenting Icelandic text into subword tokens.
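
For readers unfamiliar with BPE, the following is a minimal character-level sketch of how merge rules can be learned from text. It is a toy illustration only, not the procedure used to train this tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the sample corpus are hypothetical, and GPT-2 style tokenizers operate on bytes rather than characters.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts out as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical sample text, chosen only for illustration.
merges, words = learn_bpe("heimur heimur heim heima", num_merges=4)
print(merges)
print(words)
```

Each learned merge joins the pair of symbols that occurs together most often, so frequent character sequences gradually become single tokens.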

## Usage
Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:

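```python
from transformers import GPT2Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")
# Reuse the end-of-sequence token for padding
# (GPT-2 style tokenizers define no padding token by default)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Encode a sentence and print its token IDs
input_ids = tokenizer("Halló heimur!")["input_ids"]
print(input_ids)
```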

## Citation
If you use this tokenizer in your work, please cite the original source of the training data:

```bibtex
@misc{20.500.12537/254,
  title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
  url = {http://hdl.handle.net/20.500.12537/254},
  note = {{CLARIN}-{IS}},
  year = {2022}
}
```

## Feedback
Feedback is welcome and helps improve the tokenizer. Feel free to reach out with your insights and suggestions.

Happy tokenizing!

Sigurdur Haukur Birgisson


(README created with ChatGPT)