---
language: pl
tags:
- word2vec
datasets:
- KGR10
---

# KGR10 word2vec Polish word embeddings

Distributional language models for Polish trained on the KGR10 corpora.

## Models

The repository contains two models, chosen after evaluation (see the table below).
The best-performing model is the default model/config (see `default_config.json`).

| method | dimension | hs | mwe | default |
|---|---|---|---|---|
| cbow | 300 | false | true | yes |
| skipgram | 300 | true | true | no |

Here `hs` stands for hierarchical softmax and `mwe` for multi-word expressions.

## Usage

The easiest way to use these embedding models is to install [embeddings](https://github.com/CLARIN-PL/embeddings).

```bash
pip install clarinpl-embeddings
```

### Utilising the default model (the easiest way)

Word embedding:

```python
from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

# Load the default model from the hub and embed every token in place.
embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
```

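A common next step with per-token vectors is to compare two of them. The sketch below is a minimal, dependency-free cosine similarity; the helper name `cosine_similarity` and the toy vectors are illustrative assumptions and are not part of the `embeddings` library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real 300-dimensional word vectors.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(parallel, orthogonal)
```

With a real model, `token.embedding` is a tensor in Flair, so converting it first (for example with `token.embedding.tolist()`) should give the plain list this helper expects.
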
Document embedding (averaged over words):

```python
from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

print(sentence.embedding)
```

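To make "averaged over words" concrete, the following sketch averages a list of word vectors component-wise into a single document vector. It illustrates the idea only and is not the library's implementation; `mean_pool` and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def mean_pool(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average word vectors component-wise into one document vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    if any(len(v) != dim for v in word_vectors):
        raise ValueError("all word vectors must share one dimension")
    count = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / count for i in range(dim)]


# Toy 3-dimensional vectors standing in for the 300-dimensional model output.
toy_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
document_vector = mean_pool(toy_vectors)
print(document_vector)  # [2.0, 2.0, 1.0]
```

The result has the same dimension as each word vector, so for these models a document vector would have 300 components.
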
### Customisable way

Word embedding:

```python
from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
from flair.data import Sentence

# hs=True selects the skipgram model listed in the Models table.
config = KGR10Word2VecConfig(method='skipgram', hs=True)
embedding = AutoStaticWordEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
```

Document embedding (averaged over words):

```python
from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
from flair.data import Sentence

# hs=True selects the skipgram model listed in the Models table.
config = KGR10Word2VecConfig(method='skipgram', hs=True)
embedding = AutoStaticDocumentEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

print(sentence.embedding)
```

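Because only the two parameter combinations in the Models table are published, it may help to check a combination before building a config. The sketch below encodes the table as a lookup; `PUBLISHED_MODELS` and `check_published` are hypothetical names, not part of the `embeddings` library, and the labels are illustrative.

```python
from typing import Dict, Tuple

# (method, hs) pairs copied from the Models table; both rows use
# dimension=300 and mwe=True. The labels are illustrative only.
PUBLISHED_MODELS: Dict[Tuple[str, bool], str] = {
    ("cbow", False): "default",
    ("skipgram", True): "alternative",
}


def check_published(method: str, hs: bool) -> str:
    """Return the label of a published (method, hs) pair, or raise ValueError."""
    try:
        return PUBLISHED_MODELS[(method, hs)]
    except KeyError:
        known = ", ".join(f"{m}/hs={h}" for m, h in PUBLISHED_MODELS)
        raise ValueError(
            f"no published model for {method}/hs={hs}; known: {known}"
        ) from None


print(check_published("skipgram", True))
```

A call such as `check_published("skipgram", False)` raises `ValueError`, which flags a combination that the table does not list before any download is attempted.
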
## Citation

```
Piasecki, Maciej; Janz, Arkadiusz; Kaszewski, Dominik; et al., 2017, Word Embeddings for Polish, CLARIN-PL digital repository, http://hdl.handle.net/11321/442.
```

or

```
@misc{11321/442,
  title = {Word Embeddings for Polish},
  author = {Piasecki, Maciej and Janz, Arkadiusz and Kaszewski, Dominik and Czachor, Gabriela},
  url = {http://hdl.handle.net/11321/442},
  note = {{CLARIN}-{PL} digital repository},
  copyright = {{GNU} {GPL3}},
  year = {2017}
}
```