Bexamask-v1-228M
Bexamask is a highly efficient 228-million parameter Large Language Model (LLM) developed by Pynatic IT Solutions.
Bexamask was built completely from scratch using JAX/Flax on Google Cloud TPUs, featuring a modern architecture heavily optimized for fast inference (Grouped Query Attention, RMSNorm). It went through a rigorous 4-stage training pipeline (Pretraining → SFT → DPO → Identity Injection).
How to Run Bexamask locally
Bexamask is built on JAX/Flax using Google's MaxText framework. To run it, you need the model weights (bexamask_hf.safetensors), the YAML configuration (sft.yml), and the required JAX inference script.
Research Model Notice: This model is released primarily as a research-based model for studying Direct Preference Optimization (DPO) and end-to-end RLHF pipelines within the JAX/MaxText ecosystem. It serves as a proof-of-concept for successfully aligning small (sub-1B) parameter models using DPO locally on TPUs.
Prerequisites
- Clone the MaxText repository:
git clone https://github.com/google/maxtext.git
- Install dependencies:
pip install jax flax safetensors transformers
Running Inference
Download the bexamask_hf.safetensors file and use the provided safetensors_chat.py script.
# Start an interactive chat session
python3 safetensors_chat.py
Or pass a prompt directly:
python3 safetensors_chat.py "What is the capital of India?"
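The inference script itself is not reproduced here, but the kind of greedy decoding loop that safetensors_chat.py presumably runs can be sketched with a stub model. Everything below is illustrative: next_token_logits is a hypothetical stand-in for the real JAX forward pass, and the vocabulary/EOS ids are taken from the GPT-2 tokenizer mentioned in the architecture section.

```python
import numpy as np

def next_token_logits(token_ids):
    # Hypothetical stand-in for the real 228M-parameter forward pass:
    # returns deterministic pseudo-logits over the 50,257-token vocabulary.
    rng = np.random.default_rng(token_ids[-1])
    return rng.standard_normal(50_257)

def greedy_decode(prompt_ids, max_new_tokens=8, eos_id=50_256):
    # Repeatedly append the highest-probability next token (greedy search),
    # stopping early if the model emits the end-of-sequence token.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        nxt = int(np.argmax(logits))
        if nxt == eos_id:
            break
        ids.append(nxt)
    return ids

out = greedy_decode([1, 2, 3], max_new_tokens=4)
```

A real chat session would additionally tokenize the user turn, wrap it in the model's instruction template, and detokenize the generated ids.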
Example Output
You: [INST] Who are you? [/INST]
Bexamask: I am Bexamask, a virtual AI assistant created by Pynatic IT Solutions. I'm here to help answer your questions!
Model Architecture details
- Parameters: 227,649,024 (228M)
- Hidden Size: 512
- Layers: 24
- Attention: Grouped Query Attention (16 Query Heads / 8 KV Heads)
- Head Dimension: 128
- MLP size: 4,096
- Context Length: 4,096 tokens
- Vocabulary Size: 50,257 (GPT-2 based)
- Normalization: RMSNorm (eps=1e-6)
- Activation: GELU
- Precision (dtype): float32 (FP32)
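The hyperparameters above reproduce the stated parameter count exactly if one assumes bias-free projections, a two-matrix GELU MLP, two RMSNorms per layer plus a final norm, and an untied output head; these assumptions are inferred here, not confirmed by the release.

```python
# Rough parameter audit from the listed hyperparameters.
vocab, hidden, layers = 50_257, 512, 24
q_heads, kv_heads, head_dim, mlp = 16, 8, 128, 4_096

embed = vocab * hidden                      # token embedding table
attn = (hidden * q_heads * head_dim         # Q projection
        + 2 * hidden * kv_heads * head_dim  # K and V projections (GQA: fewer KV heads)
        + q_heads * head_dim * hidden)      # output projection
ffn = 2 * hidden * mlp                      # up- and down-projections of the GELU MLP
norms = 2 * hidden                          # two RMSNorm scale vectors per layer
per_layer = attn + ffn + norms

# embeddings + transformer stack + final RMSNorm + untied LM head
total = embed + layers * per_layer + hidden + vocab * hidden
print(total)  # 227649024, matching the 228M figure above
```

Note that with Grouped Query Attention the K/V projections are half the size of the Q projection here (8 KV heads vs 16 query heads), which is where the inference-speed benefit comes from: a smaller KV cache.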
Training Pipeline & Datasets
Bexamask was trained in 4 distinct algorithmic stages to transform it from random weights into a highly conversational, safe, personality-driven AI.
1. Pretraining
Dataset: HuggingFaceFW/fineweb-edu
Taught the exact structure of human language, grammar, and fundamental world knowledge using next-token-prediction.
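Next-token prediction is the standard cross-entropy objective: shift the sequence by one and score each position's predicted distribution against the token that actually follows. A minimal numpy illustration (toy inputs, not the training code):

```python
import numpy as np

def next_token_loss(logits, tokens):
    # logits: (seq_len, vocab) predictions; tokens: (seq_len,) input ids.
    # Position t predicts token t+1, so drop the last logit row and the
    # first token when pairing predictions with targets.
    preds, targets = logits[:-1], tokens[1:]
    # Numerically stable log-softmax.
    logp = preds - preds.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

vocab = 50_257
uniform = np.zeros((5, vocab))  # a maximally uncertain model
loss = next_token_loss(uniform, np.array([3, 1, 4, 1, 5]))
```

A model that assigns uniform probability over the vocabulary scores loss = ln(50257) ≈ 10.8; pretraining drives this down as the model learns language structure.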
2. Supervised Fine-Tuning (SFT)
Dataset: HuggingFaceH4/ultrachat_200k
Transitioned the model from a "document autocomplete" engine into a chat engine that responds strictly to [INST] prompt [/INST] framing.
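Because SFT taught a strict template, inference prompts should match it. A small helper, with the formatting inferred from the example transcript above:

```python
def format_prompt(user_message: str) -> str:
    # Wrap the user turn in the [INST] ... [/INST] template used during SFT.
    return f"[INST] {user_message} [/INST]"

prompt = format_prompt("Who are you?")
```

Prompts sent without this framing may fall back toward document-completion behavior, since only templated inputs were seen during fine-tuning.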
3. Direct Preference Optimization (DPO)
Taught the model to prefer helpful, harmless, and high-quality responses by training it on human preference pairs (Chosen vs Rejected).
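DPO pushes the policy's log-probability margin between the chosen and rejected response above the frozen reference (SFT) model's margin, with no separate reward model. A numpy sketch of the per-pair loss; the beta value is illustrative, not the value used in training:

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are sequence log-probabilities under the trained policy and the
    # frozen reference model. Loss = -log(sigmoid(beta * margin_gap)).
    margin_gap = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin_gap)))

# When policy and reference agree, the margin gap is 0 and the loss is ln 2.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Preferring the chosen response more than the reference did lowers the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The gradient of this loss increases the likelihood of chosen responses and decreases that of rejected ones, scaled by how wrong the current margin is.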
4. Identity & Boundary Fine-Tuning (Mix-SFT)
The final stage locked in the model's persona without catastrophic forgetting. It used a massively oversampled blend of:
- Custom Identity Data: Hand-written personality data embedding the knowledge that it is Bexamask from Pynatic IT Solutions.
- Custom Refusal Data: Strict boundaries teaching the model to refuse physical/external tasks (e.g. "I am an AI, I cannot make coffee").
- General Conversation: 10,000 human-written diverse Q&A pairs heavily sampled from databricks-dolly-15k to preserve reasoning.
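The identity and refusal sets are described but not published; the records below are a hypothetical sketch of how such data might look in the [INST] template, with all three example pairs invented for illustration.

```python
# Hypothetical Mix-SFT records blending identity, refusal, and general data.
mix_sft_examples = [
    {"prompt": "[INST] Who are you? [/INST]",
     "response": "I am Bexamask, a virtual AI assistant created by "
                 "Pynatic IT Solutions."},
    {"prompt": "[INST] Can you make me a coffee? [/INST]",
     "response": "I am an AI, I cannot make coffee, but I can help you "
                 "find a recipe."},
    {"prompt": "[INST] What does a CPU do? [/INST]",
     "response": "A CPU fetches, decodes, and executes a program's "
                 "instructions."},
]
```

Oversampling the small identity/refusal sets against a larger general pool (the document cites 10,000 dolly-derived pairs) is what lets the persona dominate without erasing general reasoning.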
Limitations & Bias
As a 228M parameter model, Bexamask is highly efficient and conversational but lacks the broad encyclopedic knowledge of larger models such as 8B- or 70B-parameter systems. Its responses should be fact-checked, especially in complex STEM domains. The model is intentionally conditioned to refuse actions implying physical intervention or real-time internet access.