---
license: apache-2.0
language:
- hu
base_model:
- SZTAKI-HLT/hubert-base-cc
- FacebookAI/xlm-roberta-base
---

# 🧠 Static Word Embeddings for Hungarian (huBERT & XLM-RoBERTa)

This repository contains static word embedding models extracted from the following BERT-based models:

- [`SZTAKI-HLT/hubert-base-cc`](https://huggingface.co/SZTAKI-HLT/hubert-base-cc)
- [`FacebookAI/xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base)

## 📦 Available Embedding Variants

Each model is provided in three static embedding variants:

- **Decontextualized**: Token embeddings extracted without any surrounding context.
- **Aggregate**: Static embeddings computed by averaging token representations of different contexts the word appears in.
- **X2Static**: Learned static embeddings trained via the **X2Static** method, designed to optimize static representations from contextual models.

## 🧪 Use Case

These embeddings were developed and evaluated as part of the paper:

**_A Comparative Analysis of Static Word Embeddings for Hungarian_** by *Máté Gedeon*.

They can be used for intrinsic tasks (e.g., word analogies) and extrinsic tasks (e.g., POS tagging, NER) in Hungarian NLP applications.

The paper can be found here: https://arxiv.org/abs/2505.07809

The corresponding GitHub repository: https://github.com/gedeonmate/hungarian_static_embeddings

## 🙏 Citation

If you use these models, code, or any part of the accompanying materials in your research, please cite:

```bibtex
@article{Gedeon_2025,
  title={A Comparative Analysis of Static Word Embeddings for Hungarian},
  volume={17},
  ISSN={2061-2079},
  url={http://dx.doi.org/10.36244/ICJ.2025.2.4},
  DOI={10.36244/icj.2025.2.4},
  number={2},
  journal={Infocommunications Journal},
  publisher={Infocommunications Journal},
  author={Gedeon, Máté},
  year={2025},
  pages={28–34}
}
```