---
license: apache-2.0
language:
- hu
base_model:
- SZTAKI-HLT/hubert-base-cc
- FacebookAI/xlm-roberta-base
---

# 🧠 Static Word Embeddings for Hungarian (huBERT & XLM-RoBERTa)

This repository contains static word embeddings for Hungarian, extracted from the following Transformer encoder models:

- [`SZTAKI-HLT/hubert-base-cc`](https://huggingface.co/SZTAKI-HLT/hubert-base-cc)
- [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base)

## 📦 Available Embedding Variants

Each model is provided in three static embedding variants:

- **Decontextualized**: Embeddings obtained by encoding each word in isolation, without any surrounding context.
- **Aggregate**: Static embeddings computed by averaging a word's contextual representations across the different contexts in which it appears.
- **X2Static**: Static embeddings trained with the **X2Static** method, which learns static representations from a contextual model.
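As a rough illustration of the **Aggregate** idea, the sketch below averages per-occurrence vectors into one static vector per word. It is a minimal example under stated assumptions, not the procedure used in the paper: the function name `aggregate_embeddings`, the `(word, vector)` input format, and the toy 3-dimensional vectors are all hypothetical stand-ins for real contextual model outputs.

```python
from collections import defaultdict

import numpy as np


def aggregate_embeddings(occurrences):
    """Average contextual vectors per word (hypothetical helper).

    occurrences: iterable of (word, vector) pairs, one pair for each
    occurrence of a word in some context.
    Returns a dict mapping each word to the mean of its vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in occurrences:
        vec = np.asarray(vec, dtype=np.float64)
        # Accumulate a running sum so the vectors need not be stored.
        sums[word] = sums[word] + vec if word in sums else vec.copy()
        counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}


# Toy vectors standing in for contextual representations (assumption).
toy_occurrences = [
    ("alma", [1.0, 0.0, 2.0]),   # "alma" in a first context
    ("alma", [3.0, 2.0, 0.0]),   # "alma" in a second context
    ("körte", [0.0, 1.0, 1.0]),  # "körte" seen only once
]
static = aggregate_embeddings(toy_occurrences)
```

In a real pipeline, each vector would come from running a sentence through the contextual model and pooling the subword representations of the target word; that step is omitted here.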

## 🧪 Use Case

These embeddings were developed and evaluated as part of the paper: **_A Comparative Analysis of Static Word Embeddings for Hungarian_** by *Máté Gedeon*. They can be used for intrinsic tasks (e.g., word analogies) and extrinsic tasks (e.g., POS tagging, NER) in Hungarian NLP applications.

The paper can be found here: https://arxiv.org/abs/2505.07809

The corresponding GitHub repository: https://github.com/gedeonmate/hungarian_static_embeddings
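For the word-analogy use case mentioned above, a common approach is vector-offset search with cosine similarity. The sketch below is only an illustration: `solve_analogy` is a hypothetical helper, and the 2-dimensional vectors are toy values rather than embeddings from this repository. How the released files are loaded is not covered here; the example assumes the embeddings are already available as a dict of NumPy arrays.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" by vector offset (hypothetical helper).

    emb: dict mapping words to 1-D NumPy arrays.
    Returns the word whose vector is most similar to emb[b] - emb[a] + emb[c],
    excluding the three query words themselves.
    """
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy 2-dimensional vectors (assumption, not real embeddings).
toy_emb = {
    "férfi": np.array([1.0, 0.0]),
    "nő": np.array([1.0, 1.0]),
    "király": np.array([3.0, 0.0]),
    "királynő": np.array([3.0, 1.0]),
    "alma": np.array([0.0, 5.0]),
}
answer = solve_analogy(toy_emb, "férfi", "nő", "király")
```

Excluding the query words from the candidate set is a common convention in analogy evaluation, since the offset vector often lies closest to one of the inputs.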

## 🙏 Citation

If you use these models, code, or any part of the accompanying materials in your research, please cite:

```bibtex
@article{Gedeon_2025,
   title={A Comparative Analysis of Static Word Embeddings for Hungarian},
   volume={17},
   ISSN={2061-2079},
   url={http://dx.doi.org/10.36244/ICJ.2025.2.4},
   DOI={10.36244/icj.2025.2.4},
   number={2},
   journal={Infocommunications Journal},
   publisher={Infocommunications Journal},
   author={Gedeon, Máté},
   year={2025},
   pages={28--34}
}
```