Update README.md
---
license: apache-2.0
datasets:
- open-web-math/open-web-math
- bigcode/starcoderdata
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-edu
- HuggingFaceFW/fineweb-2
metrics:
- perplexity
- accuracy
- precision
pipeline_tag: text-generation
tags:
- text_generation
- next_token_prediction
- Zenyx
- transformers
- modern
---

# finetuned-fineweb-model

> Short summary:
> - Architecture: Custom GPT-like Transformer (Flax/Linen)
> - Tokenizer: Qwen/Qwen2-0.5B-Instruct
> - Latest saved checkpoint: step 146500
> - License: Apache-2.0

---

## Model Card

- Model name: Zenyx-Base-220M
- Repository: Arko007/Zenyx-Base-220M

Purpose
- Final polishing run of a fineweb pretraining/fine-tuning pipeline using an "Infinite Omni Mix" of web, code, multilingual, and math datasets sampled proportionally.
- Intended for research, downstream fine-tuning, or evaluation where a compact, well-polished Flax causal LM is desirable.

Caveats and recommendations
- This model was trained/continued on TPU v5e-8 hardware using Flax/JAX. Loading and inference on CPU/GPU is possible but may require converting weights or using the Flax runtime.

---

These are the properties of the exact model that was trained in the provided training session:

- Model type: Causal Transformer Language Model (Flax/Linen implementation)
- Tokenizer: Qwen/Qwen2-0.5B-Instruct (uses the tokenizer from the Hugging Face Hub; a pad token is added if missing)
- Vocab size (configured): 151,646 (the script sets VOCAB_SIZE initially; the effective vocab size is taken from the tokenizer at runtime)
- Context length (block size / max sequence length): 2048 tokens
- Number of Transformer layers: 12
- Embedding dimension: 768
- Number of attention heads: 12
- Head dimension: 64 (embed_dim / num_heads)
- MLP hidden dim: 3072
- Number of KV heads: 4 (grouped-query attention)
- Rotary embeddings: RoPE, cached per head dimension
- Normalization: RMSNorm
- Activation in FFN: SwiGLU (SiLU gating)
- Dropout: 0.1 (training)
- Approximate total parameters: ~220M including the tied embedding matrix (the exact count is computed at init in the training logs)

Notes:
- The implementation uses grouped-query attention with num_kv_heads=4 (K/V heads are shared/repeated to match the query heads).
- The output head is tied to the token embedding (logits are computed with the transpose of the embedding matrix).
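The tied output head can be sketched in a few lines. This is an illustrative NumPy snippet showing only the shapes involved, not the actual training code; the tiny vocabulary here stands in for the real 151,646-token table.

```python
import numpy as np

# Illustrative only: a small vocab stands in for the real 151,646-token table.
vocab_size, embed_dim, seq_len = 1000, 768, 4

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))  # token embedding table
hidden = rng.normal(size=(seq_len, embed_dim))        # final hidden states

# Tied head: logits come from the transpose of the embedding matrix,
# so no separate output projection is stored.
logits = hidden @ embedding.T
print(logits.shape)  # (4, 1000)
```

Weight tying saves roughly vocab_size * embed_dim parameters, which at this scale is the single largest block of weights in the model.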

- Seed: 42
- Hardware used: TPU v5e-8

- Micro-batch size (per step): 16
- Gradient accumulation steps: 32
- Global batch size: 512 (MICRO_BATCH_SIZE * GRADIENT_ACCUM_STEPS)
- Per-core batch size (MICRO_BATCH_SIZE // TPU_CORES): derived at runtime from the detected TPU_CORES

- Optimizer: AdamW (Optax)
- Learning rate: 1e-6 (constant schedule, final polish)
- Beta1 / Beta2: 0.9 / 0.95
- Epsilon: 1e-8
- Weight decay: 0.1
- Gradient clipping: clip_by_global_norm(1.0)
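The batch configuration above fixes the token throughput per optimizer step, so the total training budget can be sanity-checked with plain arithmetic using only the numbers on this card:

```python
micro_batch, grad_accum, seq_len = 16, 32, 2048

global_batch = micro_batch * grad_accum   # sequences per optimizer step
tokens_per_step = global_batch * seq_len  # tokens per optimizer step

steps = 146_500                           # latest saved checkpoint
total_tokens = steps * tokens_per_step

print(global_batch)                  # 512
print(tokens_per_step)               # 1048576 (~1M tokens/step)
print(round(total_tokens / 1e9, 1))  # ~153.6 billion tokens
```

This reproduces the ~153B-token figure cited elsewhere in this card.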

- EVAL Step: 146500
- Validation Loss: 2.3880
- Validation Perplexity (PPL): 10.9
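Perplexity is the exponential of the mean cross-entropy loss, so the two figures above can be cross-checked directly:

```python
import math

val_loss = 2.3880          # mean validation loss at step 146500
ppl = math.exp(val_loss)   # perplexity = exp(loss)
print(round(ppl, 1))       # 10.9
```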

## Datasets used for mixing

Primary datasets used
- HuggingFaceFW/fineweb-edu (name: sample-350BT), training split (text)
- HuggingFaceFW/fineweb (name: sample-350BT), training split (text)
- bigcode/starcoderdata (python), training split (column: content, renamed to text)
- open-web-math/open-web-math, training split (text)
- HuggingFaceFW/fineweb-2 (multilingual subsets):
  - hin_Deva (Hindi, Devanagari)
  - cmn_Hani (Chinese, Han)
  - rus_Cyrl (Russian, Cyrillic)
  - jpn_Jpan (Japanese)
  - fra_Latn (French)
  - spa_Latn (Spanish)

Mix ratios (as used in the script for the main interleave)
- edu: 16.5%
- raw web: 16.5%
- code (StarCoder python): 33%
- multilingual mix: 24% (interleaved across the listed languages)
- math: 10%

---
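The ratios above can be realized with probability-weighted interleaving. This is a minimal pure-Python sketch of proportional sampling (the source names are stand-ins for the actual dataset streams; in practice this is what `datasets.interleave_datasets(..., probabilities=[...])` does):

```python
import random

# Mix ratios from this card; they must sum to 1.0 for probability-weighted interleaving.
mix = {"edu": 0.165, "web": 0.165, "code": 0.33, "multilingual": 0.24, "math": 0.10}
assert abs(sum(mix.values()) - 1.0) < 1e-9

random.seed(42)
# Each draw picks which source supplies the next document.
draws = random.choices(list(mix), weights=list(mix.values()), k=10_000)
fractions = {name: draws.count(name) / len(draws) for name in mix}
print(fractions)  # empirical fractions close to the configured ratios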

- Latest evaluation (EVAL Step 146500):
  - Mean validation loss: 2.3880
  - Validation perplexity: 10.9
  - The model was checkpointed at this step and uploaded to the Hub (per the training script's behavior)

---

1) Install prerequisites (example):
```bash
pip install transformers datasets flax "jax[tpu]"  # or jaxlib for CPU/GPU
pip install huggingface_hub
```

2) Authenticate (if the repo is private):
```bash
huggingface-cli login
# or
export HF_TOKEN="your_token_here"
```

3) Example (recommended approach: load the tokenizer and run a simple tokenization + scoring):
```python
from transformers import AutoTokenizer

# Load the tokenizer used during training; add a pad token if it is missing.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Simple tokenization check.
text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer(text)["input_ids"]
print(len(ids), ids[:8])

# Scoring with the model itself requires loading the Flax checkpoint
# (see "Files of interest in the repo" below).
```

---

## Files of interest in the repo

- checkpoints/state_step{STEP}.msgpack: Flax-serialized train state (to be loaded with flax.serialization).

## Contact / Maintainers

- Maintainer: the account that owns the model repo on the Hugging Face Hub (Arko007)
- For questions about training, dataset choices, or how to reuse the training script, open an issue on the model repo or contact the maintainer via their Hugging Face profile.

---
---
language:
- en
- fr
- es
- zh
- hi
- ja
- ru
license: apache-2.0
library_name: flax
tags:
- jax
- flax
- tpu
- text-generation
- base-model
- custom-architecture
datasets:
- HuggingFaceFW/fineweb-edu
- bigcode/starcoderdata
- HuggingFaceFW/fineweb-2
- open-web-math/open-web-math
metrics:
- loss
- perplexity
pipeline_tag: text-generation
inference: false
---

# Zenyx-Base-220M: High-Density Nano Foundation Model

**Zenyx-Base-220M** is a 220 million parameter causal language model built from scratch using JAX/Flax on Kaggle TPU v5e-8.

Unlike typical small models trained on limited data, Zenyx-Base was trained on **~153 billion tokens**, far exceeding the Chinchilla-optimal budget for this parameter count. This "over-training" strategy was employed to maximize the information density and logic capabilities of the weights, creating a robust foundation for reasoning tasks.
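For context, the commonly cited Chinchilla heuristic is roughly 20 training tokens per parameter. A back-of-the-envelope check (not taken from the training logs) shows how far past that point this run went:

```python
params = 220e6   # model size from this card
tokens = 153e9   # training tokens from this card

# Chinchilla heuristic: ~20 tokens per parameter.
chinchilla_optimal = 20 * params
ratio = tokens / chinchilla_optimal

print(round(chinchilla_optimal / 1e9, 1))  # 4.4 (billion tokens)
print(round(ratio, 1))                     # ~34.8x the Chinchilla-optimal budget
```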

## 🧠 Model Description

* **Architecture:** Custom Llama-style Transformer (RoPE, SwiGLU, RMSNorm, grouped-query attention).
* **Tokenizer:** Qwen 2.5 tokenizer (151,646-token vocabulary) for high compression efficiency.
* **Context Window:** 2048 tokens.
* **Training Hardware:** TPU v5e-8.
* **Final Validation Loss:** **~2.38** (exceptional convergence for 220M).

### Technical Specifications

| Hyperparameter | Value |
| :--- | :--- |
| **Layers** | 12 |
| **Hidden Dim** | 768 |
| **MLP Dim** | 3072 |
| **Attention Heads** | 12 |
| **KV Heads** | 4 (GQA) |
| **Vocab Size** | 151,646 |
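The ~220M figure can be reproduced from the table above. This is an approximate count under stated assumptions: tied input/output embeddings, SwiGLU with three projections, K/V width of kv_heads * head_dim for GQA, and norm parameters ignored:

```python
vocab, d, layers, heads, kv_heads, mlp = 151_646, 768, 12, 12, 4, 3072
head_dim = d // heads  # 64

embed = vocab * d  # token embeddings (tied with the output head)
# Q and O are d x d; K and V are d x (kv_heads * head_dim) under GQA.
attn = d * d + 2 * d * (kv_heads * head_dim) + d * d
ffn = 3 * d * mlp  # SwiGLU: gate, up, and down projections

total = embed + layers * (attn + ffn)
print(round(total / 1e6, 1))  # ~220.3 (million parameters)
```

Note that the tied embedding table alone accounts for roughly 116M of the 220M parameters.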

## 📚 Training Curriculum (The "Omni-Mix")

The model was trained using a rigorous 4-stage curriculum designed to layer capabilities sequentially:

1. **Phase 1: Fundamentals (FineWeb-Edu)**
   * Focus on high-quality educational English text to establish linguistic baselines.
2. **Phase 2: Logic & Structure (StarCoder, Python)**
   * Introduction of code data to enforce logical indentation, syntax, and structured thinking.
3. **Phase 3: Multilingualism (FineWeb-2)**
   * Exposure to six major languages (Hindi, Chinese, Russian, Japanese, French, Spanish) to expand the semantic embedding space.
4. **Phase 4: The Infinite Polish (Omni-Mix)**
   * A weighted interleaving of all previous datasets plus **OpenWebMath** to converge the model's logic and language capabilities.

## 💻 Usage

This model is a raw **JAX/Flax** checkpoint saved in `.safetensors` format. It uses a custom architecture definition and requires `flax` and `jax` to run.

### Loading with JAX/Flax

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
from safetensors.flax import load_file
from transformers import AutoTokenizer

# 1. Define the architecture (must match the training config).
class TransformerLM(nn.Module):
    vocab_size: int
    embed_dim: int = 768
    num_layers: int = 12
    num_heads: int = 12
    num_kv_heads: int = 4
    mlp_dim: int = 3072
    max_length: int = 2048
    dropout_rate: float = 0.0

    # ... (Insert full model class definition here from the training script) ...

# 2. Load resources.
repo_id = "Arko007/Zenyx_Base_220M"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", trust_remote_code=True)

# 3. Initialize the model and load the weights.
model = TransformerLM(vocab_size=len(tokenizer))
dummy_input = jnp.ones((1, 1), dtype=jnp.int32)
params = model.init(jax.random.PRNGKey(0), dummy_input)["params"]

# Load the safetensors checkpoint.
# Ensure model.safetensors is downloaded locally first.
loaded_params = load_file("model.safetensors")
print("Weights loaded successfully!")
```

## ⚠️ Limitations

- Size: At 220M parameters, the model's knowledge-retrieval capacity is limited compared to 7B+ models.
- Base model: This is a pre-trained base; it has not been fine-tuned for chat or instruction following (see Zenyx-DeepSeek-220M for the instruct version).
- Hallucinations: While often logically consistent, it may generate factually incorrect statements.

## 📜 Citation

```bibtex
@misc{ZenyxBase220M,
  title     = {Zenyx-Base-220M: High-Density Foundation Model},
  author    = {Arko007},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Arko007/Zenyx_Base_220M}
}
```