Bexamask-v1-228M
Bexamask is a highly efficient 228-million parameter Large Language Model (LLM) developed by Pynatic IT Solutions.
Bexamask was built completely from scratch using JAX/Flax on Google Cloud TPUs, featuring a modern architecture heavily optimized for fast inference (Grouped Query Attention, RMSNorm). It went through a rigorous 4-stage training pipeline (Pretraining → SFT → DPO → Identity Injection).
How to Run Bexamask locally
Bexamask is built on JAX/Flax using Google's MaxText framework. To run it, you need the model weights (bexamask_hf.safetensors), the YAML configuration (sft.yml), and the required JAX inference script.
Research Model Notice: This model is released primarily as a research-based model for studying Direct Preference Optimization (DPO) and end-to-end RLHF pipelines within the JAX/MaxText ecosystem. It serves as a proof-of-concept for successfully aligning small (sub-1B) parameter models using DPO locally on TPUs.
Prerequisites
- Clone the MaxText repository:
git clone https://github.com/google/maxtext.git
- Install dependencies:
pip install jax flax safetensors transformers
Running Inference
Download the bexamask_hf.safetensors file and use the provided safetensors_chat.py script.
# Start an interactive chat session
python3 safetensors_chat.py
Or pass a prompt directly:
python3 safetensors_chat.py "What is the capital of India?"
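The inference script itself is not reproduced here, but the kind of greedy decoding loop that safetensors_chat.py presumably runs can be sketched with a stub model. Everything below is illustrative: next_token_logits is a hypothetical stand-in for the real JAX forward pass, and the vocabulary/EOS ids are taken from the GPT-2 tokenizer mentioned in the architecture section.

```python
import numpy as np

def next_token_logits(token_ids):
    # Hypothetical stand-in for the real 228M-parameter forward pass:
    # returns deterministic pseudo-logits over the 50,257-token vocabulary.
    rng = np.random.default_rng(token_ids[-1])
    return rng.standard_normal(50_257)

def greedy_decode(prompt_ids, max_new_tokens=8, eos_id=50_256):
    # Repeatedly append the highest-probability next token (greedy search),
    # stopping early if the model emits the end-of-sequence token.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        nxt = int(np.argmax(logits))
        if nxt == eos_id:
            break
        ids.append(nxt)
    return ids

out = greedy_decode([1, 2, 3], max_new_tokens=4)
```

A real chat session would additionally tokenize the user turn, wrap it in the model's instruction template, and detokenize the generated ids.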
Example Output
You: [INST] Who are you? [/INST]
Bexamask: I am Bexamask, a virtual AI assistant created by Pynatic IT Solutions. I'm here to help answer your questions!
Model Architecture details
- Parameters: 227,649,024 (228M)
- Hidden Size: 512
- Layers: 24
- Attention: Grouped Query Attention (16 Query Heads / 8 KV Heads)
- Head Dimension: 128
- MLP size: 4,096
- Context Length: 4,096 tokens
- Vocabulary Size: 50,257 (GPT-2 based)
- Normalization: RMSNorm (eps=1e-6)
- Activation: GELU
- Precision (dtype): float32 (FP32)
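The hyperparameters above reproduce the stated parameter count exactly if one assumes bias-free projections, a two-matrix GELU MLP, two RMSNorms per layer plus a final norm, and an untied output head; these assumptions are inferred here, not confirmed by the release.

```python
# Rough parameter audit from the listed hyperparameters.
vocab, hidden, layers = 50_257, 512, 24
q_heads, kv_heads, head_dim, mlp = 16, 8, 128, 4_096

embed = vocab * hidden                      # token embedding table
attn = (hidden * q_heads * head_dim         # Q projection
        + 2 * hidden * kv_heads * head_dim  # K and V projections (GQA: fewer KV heads)
        + q_heads * head_dim * hidden)      # output projection
ffn = 2 * hidden * mlp                      # up- and down-projections of the GELU MLP
norms = 2 * hidden                          # two RMSNorm scale vectors per layer
per_layer = attn + ffn + norms

# embeddings + transformer stack + final RMSNorm + untied LM head
total = embed + layers * per_layer + hidden + vocab * hidden
print(total)  # 227649024, matching the 228M figure above
```

Note that with Grouped Query Attention the K/V projections are half the size of the Q projection here (8 KV heads vs 16 query heads), which is where the inference-speed benefit comes from: a smaller KV cache.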
Training Pipeline & Datasets
Bexamask was trained in 4 distinct algorithmic stages to transform it from random weights into a highly conversational, safe, personality-driven AI.
1. Pretraining
Dataset: HuggingFaceFW/fineweb-edu
Taught the exact structure of human language, grammar, and fundamental world knowledge using next-token-prediction.
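Next-token prediction is the standard cross-entropy objective: shift the sequence by one and score each position's predicted distribution against the token that actually follows. A minimal numpy illustration (toy inputs, not the training code):

```python
import numpy as np

def next_token_loss(logits, tokens):
    # logits: (seq_len, vocab) predictions; tokens: (seq_len,) input ids.
    # Position t predicts token t+1, so drop the last logit row and the
    # first token when pairing predictions with targets.
    preds, targets = logits[:-1], tokens[1:]
    # Numerically stable log-softmax.
    logp = preds - preds.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

vocab = 50_257
uniform = np.zeros((5, vocab))  # a maximally uncertain model
loss = next_token_loss(uniform, np.array([3, 1, 4, 1, 5]))
```

A model that assigns uniform probability over the vocabulary scores loss = ln(50257) ≈ 10.8; pretraining drives this down as the model learns language structure.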
2. Supervised Fine-Tuning (SFT)
Dataset: HuggingFaceH4/ultrachat_200k
Transitioned the model from a "document autocomplete" engine into a chat engine that responds strictly to [INST] prompt [/INST] framing.
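Because SFT taught a strict template, inference prompts should match it. A small helper, with the formatting inferred from the example transcript above:

```python
def format_prompt(user_message: str) -> str:
    # Wrap the user turn in the [INST] ... [/INST] template used during SFT.
    return f"[INST] {user_message} [/INST]"

prompt = format_prompt("Who are you?")
```

Prompts sent without this framing may fall back toward document-completion behavior, since only templated inputs were seen during fine-tuning.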
3. Direct Preference Optimization (DPO)
Taught the model to prefer helpful, harmless, and high-quality responses by training it on human preference pairs (Chosen vs Rejected).
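DPO pushes the policy's log-probability margin between the chosen and rejected response above the frozen reference (SFT) model's margin, with no separate reward model. A numpy sketch of the per-pair loss; the beta value is illustrative, not the value used in training:

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are sequence log-probabilities under the trained policy and the
    # frozen reference model. Loss = -log(sigmoid(beta * margin_gap)).
    margin_gap = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin_gap)))

# When policy and reference agree, the margin gap is 0 and the loss is ln 2.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Preferring the chosen response more than the reference did lowers the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The gradient of this loss increases the likelihood of chosen responses and decreases that of rejected ones, scaled by how wrong the current margin is.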
4. Identity & Boundary Fine-Tuning (Mix-SFT)
The final stage locked in the model's persona without catastrophic forgetting. It used a massively oversampled blend of:
- Custom Identity Data: Hand-written personality data embedding the knowledge that it is Bexamask from Pynatic IT Solutions.
- Custom Refusal Data: Strict boundaries teaching the model to refuse physical/external tasks (e.g. "I am an AI, I cannot make coffee").
- General Conversation: 10,000 human-written diverse Q&A pairs heavily sampled from databricks-dolly-15k to preserve reasoning.
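The identity and refusal sets are described but not published; the records below are a hypothetical sketch of how such data might look in the [INST] template, with all three example pairs invented for illustration.

```python
# Hypothetical Mix-SFT records blending identity, refusal, and general data.
mix_sft_examples = [
    {"prompt": "[INST] Who are you? [/INST]",
     "response": "I am Bexamask, a virtual AI assistant created by "
                 "Pynatic IT Solutions."},
    {"prompt": "[INST] Can you make me a coffee? [/INST]",
     "response": "I am an AI, I cannot make coffee, but I can help you "
                 "find a recipe."},
    {"prompt": "[INST] What does a CPU do? [/INST]",
     "response": "A CPU fetches, decodes, and executes a program's "
                 "instructions."},
]
```

Oversampling the small identity/refusal sets against a larger general pool (the document cites 10,000 dolly-derived pairs) is what lets the persona dominate without erasing general reasoning.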
Limitations & Bias
As a 228M parameter model, Bexamask is highly efficient and conversational but lacks the broad encyclopedic knowledge of larger models such as 8B- or 70B-parameter systems. Its responses should be fact-checked, especially in complex STEM domains. The model is intentionally conditioned to refuse actions implying physical intervention or real-time internet access.