---
license: apache-2.0
tags:
- text-generation
- custom-architecture
- qwen
- base-model
- pretrained-from-scratch
- fineweb
language:
- en
pipeline_tag: text-generation
---

# 🥑 Nutral Base

**Nutral Base** is a lightweight, highly optimized custom language model **trained completely from scratch**. It has been meticulously configured and tuned for ultra-fast loss convergence and peak hardware throughput. 

The tokenizer and base layers have been pre-injected with specialized tokens to ensure seamless downstream alignment for structured **Chain-of-Thought (CoT) Reasoning** and instruction-following tasks.

> ⚠️ **Important Note:** This is a pure **Base Model**. While it is highly capable of fast text generation and raw pattern completion, it is not fine-tuned to follow conversational instructions or step-by-step reasoning prompts out of the box. For the structured reasoning and chat-aligned variant, please refer to the fine-tuned version: [Nutral Reasoning Instruct](./).

---

## 📌 Model Details

* **Base Architecture:** **Qwen2** (`Qwen2ForCausalLM`)
* **Training Type:** Pre-trained from Scratch
* **Natural Language:** English (`en`)
* **Programming Language:** Python
* **Primary Task:** Causal Language Modeling (Text Generation)

---

## 📊 Architecture & Parameters

Configured dynamically via a custom `Qwen2Config` setup, the precise structural layout and parameter breakdown of the model are detailed below:

| Hyperparameter | Configuration Value |
| :--- | :--- |
| **Total Parameters** | **~17.5 Million (17,498,368)** |
| **Embedding Dimension ($d_{model}$)** | 512 |
| **Number of Layers ($n_{layers}$)** | 8 *(Optimized from 10 for hyper-speed training)* |
| **Attention Heads ($n_{heads}$)** | 8 |
| **Intermediate / FFN Size** | 1,024 |
| **Context Window ($seq_{len}$)** | 256 tokens |
| **Vocabulary Size** | 50,304 *(GPT-2 Base + Integrated Custom Reasoner Tokens)* |
| **Tie Word Embeddings** | True *(Enforces strict parameter sharing for weight compactness)* |

---

## 🛠️ Pre-training Dataset & Strategy

To bypass typical data-loading bottlenecks and keep the GPUs fully saturated at 100% compute capacity, a custom high-speed RAM buffering methodology was utilized to stream data.

* **Dataset Name:** `HuggingFaceFW/fineweb` (Specifically using the high-quality `sample-10BT` split)
* **Total Tokens Trained:** **20,000,000 (20 Million Tokens)**
* **Sequence Length:** 256 tokens
* **Training Objective:** Next-token prediction

---

## ⚙️ Hardware & Infrastructure Details

The pre-training phase was executed on consumer-enterprise hybrid infrastructure within a multi-GPU Kaggle environment:

* **Hardware Used:** **2x NVIDIA T4 Tensor Core GPUs**
* **Parallelization Framework:** PyTorch `nn.DataParallel`
* **Precision Mode:** Automatic Mixed Precision (`torch.amp.autocast` FP16)
* **Optimizer:** AdamW ($\beta_1 = 0.9, \beta_2 = 0.999$, Weight Decay = 0.01)
* **Peak Learning Rate:** `2e-3` *(Elevated aggressively for rapid early-stage loss drop)*
* **LR Scheduler:** Cosine Decay paired with a 10% Linear Warmup phase
* **Effective Batch Size:** 16 (Micro-batch) × 8 (Gradient Accumulation Steps) = 128 sequences per step

---

## 📦 Core Technical Libraries Used

The entire lifecycle of building, tokenizing, and pre-training this model from scratch relies on the following core ecosystem libraries natively built on **Python**:

* **`transformers`** (v4.x+) - Utilized for the foundational structural layout (`Qwen2Config`), core model assembly (`Qwen2ForCausalLM`), and tokenization pipeline management.
* **`datasets`** - Implemented to stream, parse, and handle the Hugging Face `fineweb` dataset efficiently at scale.
* **`accelerate`** - Used to configure multi-GPU scaling mechanics and stabilize the distributed runtime environment.
* **`torch` (PyTorch)** - Serving as the primary deep learning framework for tensor computations, mixed-precision handling (`torch.amp`), and `DataParallel` coordination.

---

## 🧪 Special Tokens Injected (Base Level)

To prepare the foundational layers for seamless, bug-free SFT and Reasoning phases, the tokenizer was permanently injected with the following structural tokens during base initialization:
* `<|im_start|>` and `<|im_end|>` (Strict ChatML format alignment)
* `<think>` and `</think>` (Explicit reasoning encapsulation blocks)

---

## 🚀 Quickstart: How to Use in Transformers

You can initialize and test this raw base model using the standard Hugging Face pipeline as shown below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "Nebullixlabs/Nutral-Base"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Base ChatML prompt structure
prompt = "<|im_start|>user\nExplain data science in short.<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))