Arko007 committed on
Commit 7876ce2 · verified · 1 Parent(s): 7b52f5f

Update README.md

Files changed (1)
  1. README.md +47 -30

README.md CHANGED
@@ -12,47 +12,64 @@ tags:
  - gqa
  datasets:
  - HuggingFaceFW/fineweb-edu
- - open-web-math/open-web-math
- - bigcode/starcoderdata
  - HuggingFaceFW/fineweb
  metrics:
  - loss
  ---

- # Zenyx-Vanta 350M (Fresh Start / v2)

- Zenyx-Vanta is a modernized **Bidirectional Encoder** (BERT-style) model. This repository is currently undergoing a full retraining (v2) following a planned reset of the "Omni-Mix" checkpoints to eliminate mode collapse and mapping issues.

- The goal for v2 is to produce a high-fidelity encoder optimized strictly for educational reasoning and linguistic precision.

- ## Status: Active Retraining
- - **Iteration:** v2
- - **Focus:** High-signal Educational Data (FineWeb-Edu)
- - **Status:** Initializing Step 0
- - **Target Loss:** < 1.20

- ## Architecture Details
- - **Model Type:** Masked Language Model (MLM)
- - **Parameters:** ~350 Million
- - **Tokenizer:** Qwen 2.5 (151,646 vocab size)
- - **Positioning:** Rotary Positional Embeddings (RoPE) with 10k base
- - **Activation:** SwiGLU (SiLU-gated MLP)
- - **Attention:** Grouped Query Attention (GQA) with 12 Heads (4 KV Heads)

- ## Training Data: The "Pure" Strategy
- Vanta v2 has transitioned to a pure data strategy to maximize reasoning capabilities:
- 1. **FineWeb-Edu (100%):** Utilizing the 100BT sample, filtered for the highest educational scores.

  ## Technical Specifications
- | Parameter | Value |
- | :--- | :--- |
- | `hidden_size` | 768 |
- | `num_hidden_layers` | 12 |
- | `num_attention_heads` | 12 |
- | `num_key_value_heads` | 4 |
- | `intermediate_size` | 3072 |
- | `max_position_embeddings` | 2048 |
- | `hidden_act` | SwiGLU (SiLU) |

  ## Credits
- Developed by **Arko007** and the Zenyx team. Built with JAX/Flax on TPU infrastructure.
  - gqa
  datasets:
  - HuggingFaceFW/fineweb-edu
  - HuggingFaceFW/fineweb
+ - bigcode/starcoderdata
+ - open-web-math/open-web-math
  metrics:
  - loss
  ---

+ # Zenyx-Vanta 350M (Omni-Mix)

+ Zenyx-Vanta is a modernized **Bidirectional Encoder** (BERT-style) model. This iteration uses the **Omni-Mix** dataset strategy, designed to give the encoder a balance of high-quality educational text, general web knowledge, Pythonic logic, and mathematical reasoning.

+ ## Architecture Details

+ * **Model Type:** Masked Language Model (MLM)
+ * **Parameters:** ~350 million
+ * **Tokenizer:** Qwen 2.5 (151,646 vocab size)
+ * **Positional Encoding:** Rotary Positional Embeddings (RoPE) with a 10k base
+ * **Activation:** SwiGLU (SiLU-gated MLP)
+ * **Attention:** Grouped Query Attention (GQA) with 12 heads (4 KV heads)
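The GQA configuration above (12 query heads sharing 4 KV heads) means each group of 3 consecutive query heads attends with the same key/value head. A minimal, dependency-free sketch of that mapping (illustrative only, not code from this repository):

```python
# Illustration of Grouped Query Attention head sharing
# (hypothetical helper, not part of the Zenyx-Vanta codebase).
NUM_ATTENTION_HEADS = 12
NUM_KEY_VALUE_HEADS = 4

def kv_head_for_query_head(q_head: int) -> int:
    """Map a query-head index to the KV head it shares."""
    group_size = NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS  # 3 query heads per KV head
    return q_head // group_size

# Query heads 0-2 share KV head 0, 3-5 share KV head 1, and so on.
mapping = {h: kv_head_for_query_head(h) for h in range(NUM_ATTENTION_HEADS)}
```

Because only 4 of the 12 heads carry K/V projections, the K and V weight matrices (and the KV cache at inference time) shrink to a third of their multi-head size.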

+ ## Training Data: The "Omni-Mix"

+ Vanta was trained on a balanced 4-way distribution to maximize cross-domain reasoning:
+
+ 1. **FineWeb-Edu (25%):** High-signal educational content.
+ 2. **FineWeb (25%):** General linguistic context from broad web crawls.
+ 3. **StarCoderData - Python (25%):** Source code for logic and syntax understanding.
+ 4. **Open-Web-Math (25%):** Mathematical text and LaTeX for symbolic reasoning.
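The 25/25/25/25 split above can be sketched as a weighted source sampler. This is an illustrative sketch only, not the repository's actual data loader, and the dataset labels are just strings:

```python
import random

# Hypothetical sketch of the 4-way Omni-Mix sampling described above.
MIX = {
    "fineweb-edu": 0.25,
    "fineweb": 0.25,
    "starcoderdata-python": 0.25,
    "open-web-math": 0.25,
}

def sample_source(rng: random.Random) -> str:
    """Pick the dataset for the next training document, proportional to MIX."""
    names = list(MIX)
    weights = [MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Each source receives roughly 2,500 of the 10,000 draws.
```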

  ## Technical Specifications

+ | Parameter | Value |
+ | ----- | ----- |
+ | `hidden_size` | 768 |
+ | `num_hidden_layers` | 12 |
+ | `num_attention_heads` | 12 |
+ | `num_key_value_heads` | 4 |
+ | `intermediate_size` | 3072 |
+ | `max_position_embeddings` | 2048 |
+ | `hidden_act` | SwiGLU (SiLU) |
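The table supports a back-of-the-envelope parameter count. The sketch below counts weight matrices only (no norms or biases) and assumes an untied MLM output head; it is a sanity check against the advertised ~350M, not an exact figure:

```python
# Rough parameter count from the config table above (assumptions:
# weight matrices only, untied MLM head, no norms/biases).
vocab_size = 151_646
hidden = 768
layers = 12
heads = 12
kv_heads = 4
intermediate = 3072

head_dim = hidden // heads        # 64
kv_dim = kv_heads * head_dim      # 256: GQA shrinks the K/V projections

attn = hidden * hidden * 2 + hidden * kv_dim * 2  # Q and O full-width, K and V reduced
mlp = hidden * intermediate * 3                   # gate, up, and down projections (SwiGLU)
block = attn + mlp                                # ~8.65M per layer

embeddings = vocab_size * hidden                  # ~116.5M
total = embeddings * 2 + layers * block           # input embeddings + untied MLM head

print(f"~{total / 1e6:.0f}M parameters")          # → ~337M parameters
```

The remaining gap to ~350M would come from norms, biases, and any head-specific extras not counted here.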
+
+ ## Quick Start / Inference
+
+ To use Zenyx-Vanta for mask filling, you can use the following snippet (requires `jax`, `flax`, and `transformers`):
+
+ ```python
+ from transformers import AutoTokenizer
+ import jax.numpy as jnp
+
+ # Note: ensure your local ZenyxVanta architecture definition matches the model weights
+ # model = ZenyxVanta(vocab_size=151646)
+
+ tokenizer = AutoTokenizer.from_pretrained("Arko007/zenyx-vanta-bert")
+ text = "The powerhouse of the cell is the ___."
+ prompt = text.replace("___", "<|MASK|>")
+
+ inputs = tokenizer(prompt, return_tensors="np")
+ # logits = model.apply({'params': params}, inputs['input_ids'])
+ # mask_id = tokenizer.convert_tokens_to_ids("<|MASK|>")
+ # mask_pos = int(jnp.argmax(inputs['input_ids'][0] == mask_id))
+ # print(tokenizer.decode([int(jnp.argmax(logits[0, mask_pos]))]))
+ ```
+
  ## Credits
+ Developed by **Arko007** and the **Zenyx** team.