HuBrainV5 - Hungarian "Turbo" Encoder (Developer Preview)

HuBrainV5 is a high-capacity, character-level encoder for the Hungarian language. It uses a custom architecture designed for maximum semantic stability and precise character reconstruction.

Key Features

  • Turbo Architecture: 1536-dimensional embeddings (12 layers, 24 attention heads).
  • Hybrid Embeddings: 256-dimensional "Anchor" (for character reconstruction) + 1280-dimensional "Context" (for deep semantics).
  • VICReg Regularization: Enforces an orthogonal, well-spread latent space, preventing vector collapse.
  • Character-Level: Processes raw characters via 2-bit quantization anchors, making it robust to the rich morphology of Hungarian.
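To make the VICReg bullet concrete, here is a minimal sketch of the two regularization terms (variance and covariance) that keep a batch of embeddings well spread and decorrelated. This is an illustrative implementation of the published VICReg penalties in numpy, not HuBrainV5's training code; the function name and hyperparameters are assumptions.

```python
import numpy as np

def vicreg_penalty(z, gamma=1.0, eps=1e-4):
    """Illustrative VICReg variance + covariance terms for a batch z of shape (n, d)."""
    # Variance term: hinge loss keeps each dimension's std above gamma,
    # so no dimension collapses to a constant.
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = float(np.mean(np.maximum(0.0, gamma - std)))

    # Covariance term: pushes off-diagonal covariances toward zero,
    # decorrelating dimensions (the "orthogonal, well-spread" property).
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = float((off_diag ** 2).sum() / d)
    return var_loss, cov_loss
```

In the full VICReg objective these two terms are added (with weighting coefficients) to an invariance term between paired views; only the collapse-prevention half is shown here.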

Performance (Checkpoint 115k)

  • Analogy Accuracy: ~99% success rate on grammatical and semantic analogies (when measured in centered/relative space).
  • Geometric Health: Near-zero cosine similarity between random vectors in centered space (high isotropy).
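The "centered/relative space" measurement above can be sketched as follows: subtract the mean embedding, then check that random pairs of centered vectors have near-zero cosine similarity (isotropy). The random matrix here is a stand-in for the model's context embeddings; dimensions and thresholds are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 1280))   # stand-in for 1280-dim context embeddings
E_centered = E - E.mean(axis=0)     # centering gives the "relative space"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Isotropy check: average |cos| between random centered vector pairs
# should be close to zero in a healthy, well-spread space.
sims = [cosine(E_centered[i], E_centered[i + 500]) for i in range(500)]
mean_abs_sim = float(np.mean(np.abs(sims)))
print(mean_abs_sim)
```

Analogy accuracy is measured the same way: offsets like `v(b) - v(a) + v(c)` are computed in this centered space before nearest-neighbor lookup.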

How to use

This is a developer preview. To run the model, you need to provide a checkpoint file (mlm_step_XXXX.pth).

Installation

pip install torch numpy

Quick Start

Just run the demo script. It will automatically detect any .pth checkpoint in the same directory:

python demo_analogy.py
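The checkpoint auto-detection the demo performs can be sketched like this (a minimal stand-in using only the standard library; the actual logic in demo_analogy.py may differ):

```python
from pathlib import Path

def find_checkpoint(directory="."):
    """Return the most recently modified .pth file in directory, or None."""
    candidates = sorted(
        Path(directory).glob("*.pth"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None
```

If several checkpoints are present, this picks the newest one by modification time; the returned path can then be passed to `torch.load`.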

Architecture Details

The model splits each word into two parts:

  1. Anchors (dims 0-255): Encode the POS tag and the first 63 characters of the word, quantized into 2-bit chunks. This part is optimized for Masked Language Modeling (MLM).
  2. Context (dims 256-1535): A deep latent representation regularized by VICReg to capture semantic and grammatical relationships.
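One plausible reading of the anchor layout is that each byte of the word splits into four 2-bit chunks (63 chars x 4 chunks = 252 dims, leaving 4 dims for the POS tag, 256 total). The sketch below illustrates that reading; the exact slot assignment, POS encoding, and chunk scaling in HuBrainV5 are assumptions, not the model's documented format.

```python
import numpy as np

ANCHOR_DIM = 256   # assumed: 4 POS slots + 63 chars * 4 chunks each
MAX_CHARS = 63

def two_bit_chunks(byte):
    """Split one byte into four 2-bit values, most significant chunk first."""
    return [(byte >> shift) & 0b11 for shift in (6, 4, 2, 0)]

def encode_anchor(word: str, pos_id: int) -> np.ndarray:
    """Hypothetical anchor encoding: POS tag + 2-bit-chunked UTF-8 bytes."""
    vec = np.zeros(ANCHOR_DIM, dtype=np.float32)
    vec[0:4] = two_bit_chunks(pos_id & 0xFF)
    for i, b in enumerate(word.encode("utf-8")[:MAX_CHARS]):
        vec[4 + 4 * i : 8 + 4 * i] = two_bit_chunks(b)
    return vec
```

Because each chunk is exactly recoverable from the vector, an MLM head over the anchor part can reconstruct the original characters byte by byte, which is what makes precise character reconstruction possible alongside the semantic context half.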

Creators

Designed and trained by the HuBrain Team on a 10-billion-word Hungarian corpus.
