---
license: apache-2.0
language:
- hu
base_model:
- SZTAKI-HLT/hubert-base-cc
- FacebookAI/xlm-roberta-base
---
# 🧠 Static Word Embeddings for Hungarian (huBERT & XLM-RoBERTa)
This repository contains static word embeddings for Hungarian, extracted from the following BERT-style contextual models:
- [`SZTAKI-HLT/hubert-base-cc`](https://huggingface.co/SZTAKI-HLT/hubert-base-cc)
- [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base)
## 📦 Available Embedding Variants
Each model is provided in three static embedding variants:
- **Decontextualized**: Token embeddings extracted without any surrounding context.
- **Aggregate**: Static embeddings computed by averaging a word's contextual representations across the different contexts in which it appears.
- **X2Static**: Static embeddings trained with the **X2Static** method, which learns static representations from a contextual model.
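The idea behind the **Aggregate** variant can be illustrated with a minimal sketch. It assumes the contextual vectors for each occurrence of a word have already been computed; the function name `aggregate_embeddings` and the toy vectors are hypothetical, and details such as subword pooling or layer selection used for the released models are not covered here.

```python
from collections import defaultdict


def aggregate_embeddings(contextual_vectors):
    """Average per-occurrence vectors into one static vector per word.

    contextual_vectors: iterable of (word, vector) pairs, one pair for each
    occurrence of the word in some context (hypothetical input format).
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in contextual_vectors:
        if word not in sums:
            sums[word] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[word][i] += x
        counts[word] += 1
    # Divide each accumulated sum by the number of occurrences of the word.
    return {w: [x / counts[w] for x in s] for w, s in sums.items()}


# Toy 3-dimensional vectors standing in for real contextual representations.
occurrences = [
    ("alma", [1.0, 0.0, 2.0]),
    ("alma", [3.0, 2.0, 0.0]),
    ("körte", [0.5, 0.5, 0.5]),
]
static = aggregate_embeddings(occurrences)
print(static["alma"])  # [2.0, 1.0, 1.0]
```

A word that occurs only once keeps its single contextual vector unchanged, as `körte` does in the toy data.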
## 🧪 Use Case
These embeddings were developed and evaluated as part of the paper: **_A Comparative Analysis of Static Word Embeddings for Hungarian_** by *Máté Gedeon*. They can be used for intrinsic tasks (e.g., word analogies) and extrinsic tasks (e.g., POS tagging, NER) in Hungarian NLP applications.
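As an illustration of the word analogy task, the sketch below solves an analogy with vector arithmetic and cosine similarity. It is not the evaluation code from the paper: the helper names `cosine` and `solve_analogy` and the toy vectors are assumptions made for this example, and loading the released embedding files is left out because their format is not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cos(d, b - a + c).

    The three query words are excluded from the candidates.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy vectors (hypothetical), standing in for real static embeddings.
toy = {
    "király": [1.0, 1.0, 0.0],    # king
    "királynő": [1.0, 0.0, 1.0],  # queen
    "férfi": [0.0, 1.0, 0.0],     # man
    "nő": [0.0, 0.0, 1.0],        # woman
    "alma": [-1.0, 0.2, 0.2],     # apple (distractor)
}
print(solve_analogy(toy, "férfi", "nő", "király"))  # királynő
```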
The paper can be found here: https://arxiv.org/abs/2505.07809
The corresponding GitHub repository: https://github.com/gedeonmate/hungarian_static_embeddings
## 🙏 Citation
If you use these models, code, or any part of the accompanying materials in your research, please cite:
```bibtex
@article{Gedeon_2025,
title={A Comparative Analysis of Static Word Embeddings for Hungarian},
volume={17},
ISSN={2061-2079},
url={http://dx.doi.org/10.36244/ICJ.2025.2.4},
DOI={10.36244/icj.2025.2.4},
number={2},
journal={Infocommunications Journal},
publisher={Infocommunications Journal},
author={Gedeon, Máté},
year={2025},
pages={28–34}
}
```