HuBrainV5 - Hungarian "Turbo" Encoder (Developer Preview)
HuBrainV5 is a high-capacity, character-level encoder for the Hungarian language. Its custom architecture is designed for semantic stability and precise character reconstruction.
Key Features
- Turbo Architecture: 1536-dimensional embeddings (12 layers, 24 attention heads).
- Hybrid Embeddings: 256-dimensional "Anchor" (for character reconstruction) + 1280-dimensional "Context" (for deep semantics).
- VICReg Regularization: Decorrelates embedding dimensions and keeps the latent space well spread, preventing representation collapse.
- Character-Level: Processes raw characters via 2-bit quantization anchors, making it robust to the rich morphology of Hungarian.
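The VICReg objective mentioned above combines three terms: invariance (two views of the same input should match), variance (each dimension's spread stays above a floor, preventing collapse), and covariance (off-diagonal covariances are penalized, decorrelating dimensions). A minimal NumPy sketch of that loss, for illustration only; HuBrainV5's actual loss coefficients and implementation are not part of this preview:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Illustrative VICReg loss over two batches of embeddings, shape (n, d)."""
    n, d = z1.shape
    # Invariance: embeddings of two views of the same input should match.
    sim = np.mean((z1 - z2) ** 2)
    # Variance: hinge keeps each dimension's std above 1 (prevents collapse).
    std1 = np.sqrt(z1.var(axis=0) + eps)
    std2 = np.sqrt(z2.var(axis=0) + eps)
    var = np.mean(np.maximum(0.0, 1.0 - std1)) + np.mean(np.maximum(0.0, 1.0 - std2))
    # Covariance: penalize off-diagonal covariance (decorrelates dimensions).
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return (off_diag ** 2).sum() / d
    cov = cov_term(z1) + cov_term(z2)
    return sim_w * sim + var_w * var + cov_w * cov
```

Note how a fully collapsed batch (every row identical) is punished by the variance term even though its invariance term is zero.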
Performance (Checkpoint 115k)
- Analogy Accuracy: ~99% success rate on grammatical and semantic analogies (when measured in centered/relative space).
- Geometric Health: Near-zero cosine similarity between random vectors in centered space (high isotropy).
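The isotropy figure above can be probed with a simple centered-space check: subtract the mean embedding, then average the cosine similarity over random vector pairs. A sketch of that measurement, using random stand-in data rather than actual model output:

```python
import numpy as np

def mean_centered_cosine(emb, n_pairs=1000, seed=0):
    """Average cosine similarity of random vector pairs after mean-centering."""
    rng = np.random.default_rng(seed)
    centered = emb - emb.mean(axis=0)        # remove the shared mean direction
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = centered / np.clip(norms, 1e-9, None)
    i = rng.integers(0, len(emb), n_pairs)   # random pair indices
    j = rng.integers(0, len(emb), n_pairs)
    return float(np.mean(np.sum(unit[i] * unit[j], axis=1)))
```

For a healthy, isotropic embedding space this value should sit near zero, matching the "Geometric Health" claim above.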
How to use
This is a developer preview. To run the model, you need to provide a checkpoint file (mlm_step_XXXX.pth).
Installation
pip install torch numpy
Quick Start
Just run the demo script. It will automatically detect any .pth checkpoint in the same directory:
python demo_analogy.py
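The auto-detection step presumably amounts to scanning the script's directory for .pth files. A minimal sketch of that logic, assuming the mlm_step_XXXX.pth naming from above; the helper name find_checkpoint is illustrative, not the demo script's actual API:

```python
import re
from pathlib import Path

def find_checkpoint(directory="."):
    """Return the .pth checkpoint with the highest step number, or None.

    Falls back to the most recently modified .pth file when no filename
    matches the mlm_step_XXXX.pth pattern.
    """
    files = sorted(Path(directory).glob("*.pth"))
    if not files:
        return None
    step = re.compile(r"mlm_step_(\d+)\.pth$")
    numbered = [(int(m.group(1)), f) for f in files if (m := step.search(f.name))]
    if numbered:
        return max(numbered)[1]  # highest training step wins
    return max(files, key=lambda f: f.stat().st_mtime)
```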
Architecture Details
The model splits each word's embedding into two parts:
- Anchor (dims 0-255): Contains the POS tag and the first 63 characters of the word, quantized into 2-bit chunks. This part is optimized for Masked Language Modeling (MLM).
- Context (dims 256-1535): A deep latent representation regularized by VICReg to capture semantic and grammatical relationships.
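Given this layout, consuming only the semantic part of a vector is a simple slice. A sketch, assuming embeddings arrive as 1536-dimensional NumPy arrays; the dimension constants come from the description above, while the helper name is illustrative:

```python
import numpy as np

ANCHOR_DIM = 256   # dims 0-255: POS tag + 2-bit-quantized characters (MLM target)
TOTAL_DIM = 1536   # full embedding width

def split_embedding(vec):
    """Split a HuBrainV5 embedding into its anchor and context parts."""
    assert vec.shape[-1] == TOTAL_DIM, "expected a 1536-dim embedding"
    anchor = vec[..., :ANCHOR_DIM]    # character-reconstruction anchor
    context = vec[..., ANCHOR_DIM:]   # VICReg-regularized semantic context
    return anchor, context
```

Downstream semantic tasks (e.g. the analogy benchmark above) would use only the context slice, while character reconstruction reads the anchor.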
Creators
Designed and trained by the HuBrain Team on a 10-billion-word Hungarian corpus.