Update README.md
README.md
CHANGED

- gqa
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceFW/fineweb
- bigcode/starcoderdata
- open-web-math/open-web-math
metrics:
- loss
---

# Zenyx-Vanta 350M (Omni-Mix)

Zenyx-Vanta is a modernized **Bidirectional Encoder** (BERT-style) model. This iteration uses the **Omni-Mix** dataset strategy, designed to provide the encoder with a balance of high-quality educational text, general web knowledge, Pythonic logic, and mathematical reasoning.

## Architecture Details

* **Model Type:** Masked Language Model (MLM)
* **Parameters:** ~350 million
* **Tokenizer:** Qwen 2.5 (151,646-token vocabulary)
* **Positional Encoding:** Rotary Positional Embeddings (RoPE) with a 10k base
* **Activation:** SwiGLU (SiLU-gated MLP)
* **Attention:** Grouped Query Attention (GQA) with 12 heads (4 KV heads); a shape-level sketch follows below
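
Concretely, the head layout above means each key/value head is shared by three of the twelve query heads, which shrinks the KV projections without changing the attention math. A minimal, shape-level JAX sketch of this pattern (illustrative only, not the checkpoint's actual implementation):

```python
# Shape-level sketch of GQA with 12 query heads and 4 shared KV heads.
# Illustrative only; not the model's actual attention code.
import jax
import jax.numpy as jnp

def gqa_attention(q, k, v):
    # q: (seq, 12, head_dim); k, v: (seq, 4, head_dim)
    group = q.shape[1] // k.shape[1]        # 3 query heads per KV head
    k = jnp.repeat(k, group, axis=1)        # expand KV heads to (seq, 12, head_dim)
    v = jnp.repeat(v, group, axis=1)
    scores = jnp.einsum("qhd,khd->hqk", q, k) / jnp.sqrt(q.shape[-1])
    return jnp.einsum("hqk,khd->qhd", jax.nn.softmax(scores, axis=-1), v)

# hidden_size=768 over 12 heads gives head_dim=64:
q = jnp.zeros((128, 12, 64))
kv = jnp.zeros((128, 4, 64))
out = gqa_attention(q, kv, kv)              # (128, 12, 64)
```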

## Training Data: The "Omni-Mix"

Vanta was trained on a balanced 4-way distribution to maximize cross-domain reasoning (one way to build the interleaved mix is sketched after the list):

1. **FineWeb-Edu (25%):** High-signal educational content.
2. **FineWeb (25%):** General linguistic context from broad web crawls.
3. **StarCoderData - Python (25%):** Source code for logic and syntax understanding.
4. **Open-Web-Math (25%):** Mathematical text and LaTeX for symbolic reasoning.
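
The exact sampling recipe is not reproduced here, but a 25/25/25/25 interleave like the one described can be sketched with the `datasets` library. The split and `data_dir` arguments below are assumptions based on the public dataset cards, not the training code:

```python
# Hedged sketch of a 4-way 25% interleave using Hugging Face `datasets`.
# Loader arguments are assumptions from the dataset cards, not the
# actual training configuration.
from datasets import load_dataset, interleave_datasets

sources = [
    load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True),
    load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True),
    load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True),
    load_dataset("open-web-math/open-web-math", split="train", streaming=True),
]
omni_mix = interleave_datasets(sources, probabilities=[0.25, 0.25, 0.25, 0.25], seed=42)
```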

## Technical Specifications

| Parameter | Value |
| ----- | ----- |
| `hidden_size` | 768 |
| `num_hidden_layers` | 12 |
| `num_attention_heads` | 12 |
| `num_key_value_heads` | 4 |
| `intermediate_size` | 3072 |
| `max_position_embeddings` | 2048 |
| `hidden_act` | SwiGLU (SiLU) |
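
For readers mapping the table onto code, the `hidden_act` row describes a SiLU-gated MLP whose gate and up projections expand 768 to 3072 before projecting back down. A minimal Flax sketch (module and layer names are illustrative, not the checkpoint's parameter names):

```python
# Illustrative SwiGLU MLP matching the table's dimensions; not the
# model's actual module or parameter naming.
import flax.linen as nn

class SwiGLUMLP(nn.Module):
    hidden_size: int = 768
    intermediate_size: int = 3072

    @nn.compact
    def __call__(self, x):
        gate = nn.Dense(self.intermediate_size, use_bias=False)(x)  # gate projection
        up = nn.Dense(self.intermediate_size, use_bias=False)(x)    # up projection
        return nn.Dense(self.hidden_size, use_bias=False)(nn.silu(gate) * up)
```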

## Quick Start / Inference

To use Zenyx-Vanta for mask filling, you can use the following snippet (requires `jax`, `flax`, and `transformers`):

```python
from transformers import AutoTokenizer
import jax.numpy as jnp

# Note: Ensure your local ZenyxVanta architecture definition matches the model weights
# model = ZenyxVanta(vocab_size=151646)

tokenizer = AutoTokenizer.from_pretrained("Arko007/zenyx-vanta-bert")
text = "The powerhouse of the cell is the ___."
prompt = text.replace("___", "<|MASK|>")

inputs = tokenizer(prompt, return_tensors="np")
# logits = model.apply({'params': params}, inputs['input_ids'])
# ... (Standard JAX inference logic)
```
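
The forward pass above is left commented out because the `ZenyxVanta` module is defined locally rather than shipped through `transformers`. Assuming `model` and `params` have been set up to match the checkpoint, and that `<|MASK|>` maps to a single token id, the elided inference step typically looks like this sketch (hypothetical, not an official API):

```python
# Hypothetical continuation of the snippet above; `model`, `params`,
# `tokenizer`, and `inputs` are assumed to exist as defined earlier.
import jax.numpy as jnp

logits = model.apply({"params": params}, inputs["input_ids"])  # (1, seq, vocab)

mask_id = tokenizer.convert_tokens_to_ids("<|MASK|>")
mask_pos = int(jnp.argmax(inputs["input_ids"][0] == mask_id))  # first mask slot
top_id = int(jnp.argmax(logits[0, mask_pos]))
print(tokenizer.decode([top_id]))                              # predicted fill
```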

## Credits

Developed by **Arko007** and the **Zenyx** team.