	---
	library_name: pytorch
	tags:
	- hungarian
	- transformer
	- encoder
	- tokenization-free
	- character-based
	- glass-box
	license: cc-by-sa-4.0
	datasets:
	- webkorpusz-2.0
	metrics:
	- pos-accuracy
	- word-reconstruction
	---

	# 🧠 HuBrain: Tokenization-free Hungarian Semantic Encoder / Tokenizáció-mentes magyar szemantikai encoder

	[English Version](#english) \| [Magyar Változat](#magyar)

	🔗 GitHub Repository: [https://github.com/BraienStorm/hubrain-encoder](https://github.com/BraienStorm/hubrain-encoder)

	---

	<a name="english"></a>
	## 🌍 English Description

	HuBrain is an experimental, character-based Glass-Box Semantic Encoder designed to model the morphological richness and semantic relationships of the Hungarian language without traditional tokenization (e.g., BPE).

	### 🚀 Live Visualization
	View a PCA/t-SNE projection of the 1280-dimensional semantic latent space here:
	👉 [HuBrain Projector Visualization](https://projector.tensorflow.org/?config=https://jevcsak.hu/model/hubrain.json)

	![Latent Space Projection](latens_space.png)
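The projector link above loads a JSON config that points to tab-separated vector and metadata files. The following is a minimal sketch of producing that layout, not the contents of `export_projector.py`; the function names and example URLs are hypothetical, and the config keys follow the standalone Embedding Projector format as we understand it.

```python
import json


def to_projector_tsv(words, vectors):
    """Return (vectors_tsv, metadata_tsv) strings with one row per word."""
    if len(words) != len(vectors):
        raise ValueError("words and vectors must have the same length")
    # One embedding per line, components separated by tabs.
    vec_lines = ["\t".join(f"{x:.6f}" for x in vec) for vec in vectors]
    # Single-column metadata: one label per line, no header row.
    return "\n".join(vec_lines) + "\n", "\n".join(words) + "\n"


def projector_config(name, shape, tensor_url, metadata_url):
    """Build the JSON config that the `?config=` URL parameter points to."""
    entry = {
        "tensorName": name,
        "tensorShape": list(shape),
        "tensorPath": tensor_url,
        "metadataPath": metadata_url,
    }
    return json.dumps({"embeddings": [entry]}, indent=2)


# Toy two-dimensional vectors; real HuBrain vectors would have 1280 components.
vectors_tsv, metadata_tsv = to_projector_tsv(
    ["alma", "körte"], [[0.5, 1.0], [2.0, -1.0]]
)
config_json = projector_config(
    "HuBrain (toy)", (2, 2),
    "https://example.org/vectors.tsv",    # hypothetical hosting location
    "https://example.org/metadata.tsv",   # hypothetical hosting location
)
```

The three outputs would be written to files and hosted at publicly reachable URLs for the projector to load them.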

	### 📈 Training Progress (latest logs)
	The model is currently in Phase 2 (Joint Training). Recent logs suggest stable training and emerging factual knowledge:
	- POS Accuracy (Pm): ~91.5% - 97.1%
	- Word Reconstruction (Wm): ~30% - 74% (Emerging)
	- Latent Stability (Mag): ~100 (Balanced vector magnitude)
	- Learning Rate: 2.4e-05

	### 🛠️ Technical Specifications
	- Architecture: Transformer encoder with rotary position embeddings (RoPE).
	- Dimensions: 1536 (256 anchor + 1280 semantic context).
	- Depth: 18 layers, 24 attention heads.
	- Input: Raw characters (fixed word length of 64 characters).
	- Vocabulary: No out-of-vocabulary (OOV) issues, thanks to character-level coverage.
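The numbers above can be made concrete with a small sketch. It is not the model's actual preprocessing: mapping characters to Unicode code points, using `0` as the padding id, right-padding, and placing the anchor dimensions first are all assumptions made for illustration.

```python
WORD_LEN = 64   # fixed word length from the specification
PAD_ID = 0      # assumption: 0 is reserved for padding

MODEL_DIM, ANCHOR_DIM, SEMANTIC_DIM = 1536, 256, 1280
NUM_HEADS = 24
HEAD_DIM = MODEL_DIM // NUM_HEADS   # 1536 / 24 = 64 dimensions per head


def encode_word(word, word_len=WORD_LEN):
    """Map a word to a fixed-length list of character ids (code points)."""
    ids = [ord(ch) for ch in word[:word_len]]       # truncate long words
    return ids + [PAD_ID] * (word_len - len(ids))   # right-pad short words


def encode_sentence(sentence):
    """Encode whitespace-separated words; no subword vocabulary is needed."""
    return [encode_word(word) for word in sentence.split()]


def split_embedding(vector):
    """Split a 1536-dim vector into (anchor, semantic) parts.

    Assumption: the 256 anchor dimensions come first.
    """
    if len(vector) != MODEL_DIM:
        raise ValueError(f"expected {MODEL_DIM} dimensions, got {len(vector)}")
    return vector[:ANCHOR_DIM], vector[ANCHOR_DIM:]


batch = encode_sentence("árvíztűrő tükörfúrógép")
anchor, semantic = split_embedding([0.0] * MODEL_DIM)
```

Because every character maps directly to an id, any word, including rare inflected forms, can be encoded without an out-of-vocabulary fallback.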

	### 📥 Model Download
	The model weight files (`.pth`) are hosted on Hugging Face because of their size (6.7 GB). You can download them with the following command:

	```bash
	# Required: pip install huggingface_hub
	python download_model.py
	```
	Or download them manually from: [https://huggingface.co/Braien/HuBrain-Encoder](https://huggingface.co/Braien/HuBrain-Encoder)
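The contents of `download_model.py` are not reproduced here. As a rough equivalent, the weights could be fetched directly with `huggingface_hub.snapshot_download`; in this sketch the target directory `weights` and the function name `download_weights` are hypothetical.

```python
from pathlib import Path

REPO_ID = "Braien/HuBrain-Encoder"   # repository from the link above
LOCAL_DIR = Path("weights")          # hypothetical target directory


def download_weights(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download only the .pth files from the Hub and return the local folder."""
    # Imported lazily so this module loads before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=repo_id,
        allow_patterns=["*.pth"],   # skip everything except the weight files
        local_dir=str(local_dir),
    )
    return Path(path)
```

Calling `download_weights()` transfers about 6.7 GB, so it is worth checking free disk space first.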

	### ⚙️ Requirements
	```bash
	pip install torch numpy huggingface_hub
	```

	### 🧪 Diagnostic Tools
	- `test_mask_prediction.py`: Context-based word recovery.
	- `test_analogy.py`: Semantic analogies (e.g., king - man + woman).
	- `export_projector.py`: Export to TF Projector format.
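The analogy test mentioned above is commonly implemented as vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below illustrates that general technique with hand-made four-dimensional toy vectors; it is not the code in `test_analogy.py`, and the vectors are invented for the example rather than taken from the model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Toy axes: (royalty, male, female, fruit). Purely illustrative values.
TOY = {
    "király":   [1, 1, 0, 0],   # king
    "férfi":    [0, 1, 0, 0],   # man
    "nő":       [0, 0, 1, 0],   # woman
    "királynő": [1, 0, 1, 0],   # queen
    "alma":     [0, 0, 0, 1],   # apple
}

result = analogy("király", "férfi", "nő", TOY)
print(result)  # → királynő
```

With real embeddings the same function would be called on the 1280-dimensional semantic vectors, where the nearest neighbour is rarely an exact match.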

	### ⚖️ Licensing & Data Sources
	This model was trained using the Webkorpusz 2.0 dataset. By using this model, you agree to comply with the following licenses:
	- Common Crawl subcorpus: Used under the same terms as [Common Crawl](https://commoncrawl.org/terms-of-use/) itself.
	- Wikipedia subcorpus & processed data: Licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
	- Disclaimer: The training data originates from automated web crawling; the model creator assumes no responsibility for the content.

	---

	<a name="magyar"></a>
	## 🇭🇺 Magyar leírás

	🔗 Forráskód (GitHub): [https://github.com/BraienStorm/hubrain-encoder](https://github.com/BraienStorm/hubrain-encoder)

	A HuBrain egy kísérleti, karakter-alapú Glass-Box Szemantikai Encoder, amely a magyar nyelv morfológiai gazdagságát és szemantikai összefüggéseit modellezi hagyományos tokenizáció (pl. BPE) használata nélkül.

	### 🚀 Élő Vizualizáció
	A modell látens terének 1280 dimenziós szemantikai leképzése megtekinthető itt:
	👉 [HuBrain Projector Visualization](https://projector.tensorflow.org/?config=https://jevcsak.hu/model/hubrain.json)

	![Látens tér projekció](latens_space.png)

	### 📈 Tréning Állapot (utolsó logok)
	A modell jelenleg a Phase 2 (Joint Training) fázisban van. Az utolsó logok stabil tanulást és kialakuló tudást mutatnak:
	- POS Pontosság (Pm): ~91.5% - 97.1%
	- Szó Rekonstrukció (Wm): ~30% - 74% (Folyamatosan javul)
	- Látens Stabilitás (Mag): ~100 (Kiegyensúlyozott vektor magnitúdó)
	- Tanulási ráta: 2.4e-05

	### 🛠️ Technikai adatok
	- Architektúra: Transformer Encoder RoPE támogatással.
	- Dimenziók: 1536 (256 Horgony + 1280 Szemantikai kontextus).
	- Rétegszám: 18 réteg, 24 fej.
	- Bemenet: Nyers karakterek (64 karakteres fix szóhossz).
	- Vocab: Nincs OOV (szótáron kívüli szó) probléma a karakter-szintű lefedettség miatt.

	### 📥 Modell letöltése
	A nagyméretű modellfájlok (`.pth`, összesen 6.7 GB) a Hugging Face-en tárolódnak. Az alábbi parancs futtatásával töltheted le őket:

	```bash
	# Szükséges: pip install huggingface_hub
	python download_model.py
	```
	Vagy manuálisan innen: [https://huggingface.co/Braien/HuBrain-Encoder](https://huggingface.co/Braien/HuBrain-Encoder)

	### ⚙️ Követelmények
	```bash
	pip install torch numpy huggingface_hub
	```

	### 🧪 Diagnosztikai eszközök
	- `test_mask_prediction.py`: Környezet alapú szó-visszafejtés.
	- `test_analogy.py`: Szemantikai analógiák (pl. király - férfi + nő).
	- `export_projector.py`: Exportálás TF Projector vizualizációhoz.

	### ⚖️ Licenc és Adatforrások
	A modell tanításához a Webkorpusz 2.0 adatbázist használtuk fel. A modell használatával Ön elfogadja az alábbi licencfeltételeket:
	- Common Crawl alkorpusz: A [Common Crawl](https://commoncrawl.org/terms-of-use/) saját felhasználási feltételei szerint került felhasználásra.
	- Wikipedia alkorpusz és feldolgozott adatok: A Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licenc alá tartoznak.
	- Felelősségkizárás: Az adatok automatizált webes gyűjtésből származnak, a tartalmukért a modell készítője nem vállal felelősséget.