| --- |
| license: apache-2.0 |
| tags: |
| - text-generation |
| - custom-architecture |
| - qwen |
| - base-model |
| - pretrained-from-scratch |
| - fineweb |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # Nutral Base |
|
|
| **Nutral Base** is a lightweight custom language model **trained entirely from scratch**. Its configuration was tuned for fast loss convergence and high hardware throughput. |
|
|
| The tokenizer and embedding layer already include special tokens, added before pre-training, to simplify downstream fine-tuning for structured **Chain-of-Thought (CoT) reasoning** and instruction-following tasks. |
|
|
| > ⚠️ **Important Note:** This is a pure **base model**. It can generate text and complete raw patterns quickly, but it is not fine-tuned to follow conversational instructions or step-by-step reasoning prompts out of the box. For the structured reasoning and chat-aligned variant, see the fine-tuned version: [Nutral Reasoning Instruct](./). |
|
|
| --- |
|
|
| ## Model Details |
|
|
| * **Base Architecture:** **Qwen2** (`Qwen2ForCausalLM`) |
| * **Training Type:** Pre-trained from Scratch |
| * **Natural Language:** English (`en`) |
| * **Programming Language:** Python |
| * **Primary Task:** Causal Language Modeling (Text Generation) |
|
|
| --- |
|
|
| ## Architecture & Parameters |
|
|
| The model is built from a custom `Qwen2Config`. Its structural layout and parameter count are listed below: |
|
|
| | Hyperparameter | Configuration Value | |
| | :--- | :--- | |
| | **Total Parameters** | **~17.5 Million (17,498,368)** | |
| | **Embedding Dimension ($d_{model}$)** | 512 | |
| | **Number of Layers ($n_{layers}$)** | 8 *(Reduced from 10 to speed up training)* | |
| | **Attention Heads ($n_{heads}$)** | 8 | |
| | **Intermediate / FFN Size** | 1,024 | |
| | **Context Window ($seq_{len}$)** | 256 tokens | |
| | **Vocabulary Size** | 50,304 *(GPT-2 Base + Integrated Custom Reasoner Tokens)* | |
| | **Tie Word Embeddings** | True *(The input embedding and output head share one weight matrix)* | |
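
As a rough illustration, the table above could be expressed as `Qwen2Config` keyword arguments. This is a sketch, not the configuration file shipped with the model: mapping the 256-token context window to `max_position_embeddings` and setting `num_key_value_heads` equal to the number of attention heads (plain multi-head attention) are assumptions that the card does not state.

```python
# Values copied from the hyperparameter table; the keyword names are
# standard Qwen2Config arguments.
config_kwargs = {
    "vocab_size": 50304,
    "hidden_size": 512,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "num_key_value_heads": 8,        # assumption: no grouped-query attention
    "intermediate_size": 1024,
    "max_position_embeddings": 256,  # assumption: context window == position limit
    "tie_word_embeddings": True,
}

# Per-head dimension implied by the table: 512 / 8.
head_dim = config_kwargs["hidden_size"] // config_kwargs["num_attention_heads"]

try:
    from transformers import Qwen2Config

    config = Qwen2Config(**config_kwargs)
except ImportError:
    # transformers is not installed; the dict above still documents the layout.
    config = None
```

With these values each attention head has a dimension of 64.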
|
|
| --- |
|
|
| ## Pre-training Dataset & Strategy |
|
|
| To avoid data-loading bottlenecks and keep the GPUs fully utilized, the training data was streamed through a custom in-RAM buffer. |
|
|
| * **Dataset Name:** `HuggingFaceFW/fineweb` (the `sample-10BT` subset) |
| * **Total Tokens Trained:** **20,000,000 (20 Million Tokens)** |
| * **Sequence Length:** 256 tokens |
| * **Training Objective:** Next-token prediction |
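
The buffering code itself is not included in this card. The sketch below shows one common way to turn a stream of tokenized documents into fixed-length 256-token training sequences using an in-memory buffer. The function name `pack_sequences` and the choice to drop the final partial sequence are hypothetical, not the author's implementation.

```python
from typing import Iterable, Iterator, List


def pack_sequences(token_stream: Iterable[List[int]], seq_len: int = 256) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield fixed-length sequences."""
    buffer: List[int] = []
    for doc_tokens in token_stream:
        buffer.extend(doc_tokens)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover shorter than seq_len is dropped (an assumption of this sketch).


# Toy token ids standing in for tokenizer output.
docs = [[1] * 300, [2] * 100, [3] * 150]
chunks = list(pack_sequences(docs, seq_len=256))
```

In a real pipeline the `token_stream` would come from the tokenizer applied to a streamed dataset; here it is a small in-memory list so the example runs on its own.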
|
|
| --- |
|
|
| ## Hardware & Infrastructure Details |
|
|
| Pre-training was run in a multi-GPU Kaggle environment: |
|
|
| * **Hardware Used:** **2x NVIDIA T4 Tensor Core GPUs** |
| * **Parallelization Framework:** PyTorch `nn.DataParallel` |
| * **Precision Mode:** Automatic Mixed Precision (`torch.amp.autocast` FP16) |
| * **Optimizer:** AdamW ($\beta_1 = 0.9, \beta_2 = 0.999$, Weight Decay = 0.01) |
| * **Peak Learning Rate:** `2e-3` *(Set high for a fast early-stage loss drop)* |
| * **LR Scheduler:** Cosine Decay paired with a 10% Linear Warmup phase |
| * **Effective Batch Size:** 16 (Micro-batch) × 8 (Gradient Accumulation Steps) = 128 sequences per step |
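
The settings above imply a token budget per optimizer step and a learning-rate curve, which can be sketched as follows. This is illustrative arithmetic, not the training script. It assumes each of the 20 million tokens is seen exactly once, that the schedule is stepped once per optimizer step, and that the cosine decay ends at zero; the card does not state a minimum learning rate.

```python
import math

SEQ_LEN = 256
MICRO_BATCH = 16
GRAD_ACCUM_STEPS = 8
TOTAL_TOKENS = 20_000_000

sequences_per_step = MICRO_BATCH * GRAD_ACCUM_STEPS  # 128
tokens_per_step = sequences_per_step * SEQ_LEN       # 32,768
total_steps = TOTAL_TOKENS // tokens_per_step        # 610 full optimizer steps


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-3,
               warmup_frac: float = 0.10) -> float:
    """Linear warmup over the first 10% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, total_steps) for s in range(total_steps)]
```

Under these assumptions the run takes about 610 optimizer steps, of which the first 61 are warmup.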
|
|
| --- |
|
|
| ## Core Technical Libraries Used |
|
|
| Building, tokenizing, and pre-training this model from scratch relied on the following **Python** libraries: |
|
|
| * **`transformers`** (v4.x+) - Provides the model configuration (`Qwen2Config`), the model class (`Qwen2ForCausalLM`), and tokenizer handling. |
| * **`datasets`** - Streams and parses the Hugging Face `fineweb` dataset. |
| * **`accelerate`** - Used to configure multi-GPU scaling and stabilize the distributed runtime environment. |
| * **`torch` (PyTorch)** - The primary deep learning framework, used for tensor computation, mixed precision (`torch.amp`), and `DataParallel` coordination. |
|
|
| --- |
|
|
| ## Special Tokens Injected (Base Level) |
|
|
| To prepare for the later SFT and reasoning phases, the following structural tokens were added to the tokenizer permanently at base initialization: |
| * `<|im_start|>` and `<|im_end|>` (Strict ChatML format alignment) |
| * `<think>` and `</think>` (Explicit reasoning encapsulation blocks) |
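
To show how these tokens fit together, here is a minimal sketch of a prompt builder. The helper name `build_chatml_prompt` is hypothetical, and the placement of the `<think>` block inside the assistant turn is an assumption about the fine-tuned variant's format, which this card does not specify.

```python
from typing import Optional


def build_chatml_prompt(user_message: str, reasoning: Optional[str] = None) -> str:
    """Format a single-turn ChatML prompt that ends at the assistant turn."""
    prompt = (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if reasoning is not None:
        # Assumed layout: reasoning is wrapped in <think> ... </think>
        # before the final answer.
        prompt += f"<think>\n{reasoning}\n</think>\n"
    return prompt


prompt = build_chatml_prompt("Explain data science in short.")
```

Because this is a base model, it has only seen these tokens as vocabulary entries. It is not expected to answer in this format until it has been fine-tuned.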
|
|
| --- |
|
|
| ## Quickstart: How to Use in Transformers |
|
|
| You can load and test this raw base model with the standard `transformers` API as shown below: |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| model_path = "Nebullixlabs/Nutral-Base" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_path) |
| model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto") |
| |
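| # ChatML prompt structure (the base model only completes text; it is not chat-tuned) |
| prompt = "<|im_start|>user\nExplain data science in short.<|im_end|>\n<|im_start|>assistant\n" |
| # Move inputs to wherever device_map="auto" placed the model instead of hard-coding "cuda" |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |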
| |
| outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.7, do_sample=True) |
| print(tokenizer.decode(outputs[0], skip_special_tokens=False)) |