🎭 Bexamask-v1-228M

Bexamask is a highly efficient 228-million parameter Large Language Model (LLM) developed by Pynatic IT Solutions.

Bexamask was built from scratch in JAX/Flax on Google Cloud TPUs, featuring a modern architecture optimized for fast inference (Grouped Query Attention, RMSNorm). It went through a rigorous four-stage training pipeline (Pretraining → SFT → DPO → Identity Injection).

🚀 How to Run Bexamask Locally

Bexamask is built on JAX/Flax using Google's MaxText framework. To run it, you need the model weights (bexamask_hf.safetensors), the YAML configuration (sft.yml), and the required JAX inference script.

Research Model Notice: This model is released primarily as a research artifact for studying Direct Preference Optimization (DPO) and end-to-end RLHF pipelines within the JAX/MaxText ecosystem. It serves as a proof of concept for aligning small (sub-1B-parameter) models with DPO locally on TPUs.

Prerequisites

  1. Clone the MaxText repository: git clone https://github.com/google/maxtext.git
  2. Install dependencies: pip install jax flax safetensors transformers

Running Inference

Download the bexamask_hf.safetensors file and use the provided safetensors_chat.py script.

# Start an interactive chat session
python3 safetensors_chat.py

Or pass a prompt directly:

python3 safetensors_chat.py "What is the capital of India?"

Example Output

You: [INST] Who are you? [/INST] 
Bexamask: I am Bexamask, a virtual AI assistant created by Pynatic IT Solutions. I'm here to help answer your questions!

🧠 Model Architecture Details

  • Parameters: 227,649,024 (228M)
  • Hidden Size: 512
  • Layers: 24
  • Attention: Grouped Query Attention (16 Query Heads / 8 KV Heads)
  • Head Dimension: 128
  • MLP size: 4,096
  • Context Length: 4,096 tokens
  • Vocabulary Size: 50,257 (GPT-2 based)
  • Normalization: RMSNorm (eps=1e-6)
  • Activation: GELU
  • Precision (dtype): float32 (FP32)
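The hyperparameters above reproduce the reported parameter count exactly, assuming untied input/output embeddings and bias-free linear layers (neither is stated in the card, so both are assumptions in this quick sanity check):

```python
# Hyperparameters from the list above
vocab, hidden, layers = 50_257, 512, 24
q_heads, kv_heads, head_dim, mlp = 16, 8, 128, 4_096

attn = hidden * (q_heads * head_dim)        # Q projection: 512 -> 2048
attn += 2 * hidden * (kv_heads * head_dim)  # K and V: 512 -> 1024 each
attn += (q_heads * head_dim) * hidden       # output projection: 2048 -> 512
mlp_params = 2 * hidden * mlp               # up and down projections
norms = 2 * hidden                          # two RMSNorm scales per layer

per_layer = attn + mlp_params + norms
# embedding + untied unembedding + final RMSNorm
total = layers * per_layer + 2 * vocab * hidden + hidden
print(f"{total:,}")  # 227,649,024 -- matches the reported count
```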

📚 Training Pipeline & Datasets

Bexamask was trained in four distinct stages to transform it from random weights into a conversational, safety-aware, personality-driven assistant.

1. Pretraining

  • Dataset: HuggingFaceFW/fineweb-edu

  • Taught the model the structure of human language, grammar, and fundamental world knowledge via next-token prediction.
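The next-token-prediction objective can be illustrated with a toy example (the token IDs here are made up; in practice they come from the GPT-2 tokenizer):

```python
def next_token_pairs(tokens):
    """Turn a token sequence into (context, target) training pairs:
    each prefix of the sequence is trained to predict the next token ID."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

print(next_token_pairs([5, 17, 3, 42]))
# [([5], 17), ([5, 17], 3), ([5, 17, 3], 42)]
```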

2. Supervised Fine-Tuning (SFT)

  • Dataset: HuggingFaceH4/ultrachat_200k

  • Transitioned the model from a "document autocomplete" engine into a chat engine that responds strictly to [INST] prompt [/INST] framing.
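Inference-time prompts must reproduce this framing. A minimal helper (the exact whitespace is inferred from the example transcript in the inference section, so treat it as an assumption):

```python
def format_prompt(user_message: str) -> str:
    # Wrap a user turn in the [INST] ... [/INST] framing used during SFT.
    return f"[INST] {user_message.strip()} [/INST] "

print(format_prompt("Who are you?"))  # [INST] Who are you? [/INST]  (trailing space)
```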

3. Direct Preference Optimization (DPO)

  • Aligned the model's outputs with human preferences by training directly on pairs of chosen and rejected responses, with no separate reward model.
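The core of the DPO loss can be sketched in a few lines (a generic illustration of the published algorithm, not this project's actual training code; function and variable names are mine):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from log-probabilities of the chosen and rejected
    responses under the policy (pi_*) and a frozen reference model (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): low when the policy prefers the chosen response
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Before any training (policy == reference) the loss is log(2) ~= 0.693
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```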

4. Identity & Boundary Fine-Tuning (Mix-SFT)

The final stage locked in the model's persona without catastrophic forgetting. It used a heavily oversampled blend of:

  • Custom Identity Data: Hand-written personality data embedding the knowledge that it is Bexamask from Pynatic IT Solutions.
  • Custom Refusal Data: Strict boundaries teaching the model to refuse physical/external tasks (e.g. "I am an AI, I cannot make coffee").
  • General Conversation: 10,000 human-written diverse Q&A pairs heavily sampled from databricks-dolly-15k to preserve reasoning.
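Oversampling small persona datasets so they are not drowned out by the larger general-conversation pool can be done by simple repetition before shuffling. A sketch (the multipliers and dataset contents here are illustrative, not the actual training configuration):

```python
import random

def build_mix(identity, refusal, general, identity_x=20, refusal_x=10, seed=0):
    """Repeat the small identity/refusal sets, blend with general data,
    and shuffle deterministically for a reproducible training order."""
    mix = identity * identity_x + refusal * refusal_x + list(general)
    random.Random(seed).shuffle(mix)
    return mix

mix = build_mix(["id_ex"] * 3, ["refusal_ex"] * 5, ["general_ex"] * 100)
print(len(mix))  # 3*20 + 5*10 + 100 = 210
```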

⚠️ Limitations & Bias

As a 228M-parameter model, Bexamask is efficient and conversational but lacks the encyclopedic knowledge of larger 8B- or 70B-parameter systems. Its responses should be fact-checked, especially in complex STEM domains. The model is intentionally conditioned to refuse requests that imply physical intervention or real-time internet access.
