---
license: apache-2.0
language:
- hu
base_model:
- SZTAKI-HLT/hubert-base-cc
- FacebookAI/xlm-roberta-base
---
# 🧠 Static Word Embeddings for Hungarian (huBERT & XLM-RoBERTa)
This repository contains static word embeddings for Hungarian, extracted from the following BERT-style contextual models:
- [`SZTAKI-HLT/hubert-base-cc`](https://huggingface.co/SZTAKI-HLT/hubert-base-cc)
- [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base)
## 📦 Available Embedding Variants
Each model is provided in three static embedding variants:
- **Decontextualized**: Token embeddings extracted without any surrounding context.
- **Aggregate**: Static embeddings computed by averaging a word's contextual token representations across the different contexts in which it appears.
- **X2Static**: Static embeddings trained with the **X2Static** method, which is designed to learn optimized static representations from contextual models.
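
To make the **Aggregate** idea above concrete, here is a minimal sketch of the averaging step only. It assumes the contextual vectors for each occurrence of a word have already been extracted; the function name `aggregate_embeddings` and the toy 2-dimensional vectors are hypothetical and do not come from this repository or the paper's code.

```python
from collections import defaultdict


def aggregate_embeddings(occurrences):
    """Average the contextual vectors observed for each word.

    `occurrences` is an iterable of (word, vector) pairs, with one pair
    per occurrence of the word in some context.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in occurrences:
        if word not in sums:
            sums[word] = [0.0] * len(vec)
        # Accumulate the vector component-wise.
        for i, x in enumerate(vec):
            sums[word][i] += x
        counts[word] += 1
    # Divide each accumulated vector by the number of occurrences.
    return {w: [s / counts[w] for s in total] for w, total in sums.items()}


# Toy "contextual" vectors, made up purely for illustration.
toy = [("alma", [1.0, 0.0]), ("alma", [0.0, 1.0]), ("körte", [2.0, 2.0])]
static = aggregate_embeddings(toy)
```

In practice the per-occurrence vectors would come from running a contextual model over a corpus; that step is omitted here because its details (layer choice, subword pooling) are not specified in this README.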
## 🧪 Use Case
These embeddings were developed and evaluated as part of the paper: **_A Comparative Analysis of Static Word Embeddings for Hungarian_** by *Máté Gedeon*. They can be used for intrinsic tasks (e.g., word analogies) and extrinsic tasks (e.g., POS tagging, NER) in Hungarian NLP applications.
The paper can be found here: https://arxiv.org/abs/2505.07809
The corresponding GitHub repository: https://github.com/gedeonmate/hungarian_static_embeddings
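
As an illustration of the word-analogy use mentioned above, the sketch below solves an analogy with the common vector-offset approach and cosine similarity. It assumes the embeddings have already been loaded into a plain dictionary mapping words to vectors; the helper names and the toy vectors are hypothetical, and this is not necessarily the evaluation procedure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, c):
    """Return the word d that best completes a : b :: c : d."""
    # Vector offset: target = b - a + c
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    # Exclude the three query words from the candidate set.
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Made-up toy vectors, not taken from the released models.
toy_emb = {
    "király": [1.0, 1.0, 0.0],
    "férfi": [1.0, 0.0, 0.0],
    "nő": [0.0, 0.0, 1.0],
    "királynő": [0.0, 1.0, 1.0],
    "alma": [0.5, -1.0, 0.2],
}
answer = solve_analogy(toy_emb, "férfi", "király", "nő")
```

With real embeddings, `toy_emb` would be replaced by the vectors loaded from one of the released variants; the loading step depends on the file format and is left out here.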
## 🙏 Citation
If you use these models, code, or any part of the accompanying materials in your research, please cite:
```bibtex
@article{Gedeon_2025,
title={A Comparative Analysis of Static Word Embeddings for Hungarian},
volume={17},
ISSN={2061-2079},
url={http://dx.doi.org/10.36244/ICJ.2025.2.4},
DOI={10.36244/icj.2025.2.4},
number={2},
journal={Infocommunications Journal},
publisher={Infocommunications Journal},
author={Gedeon, Máté},
year={2025},
pages={28–34}
}
```