--- license: apache-2.0 tags: - text-generation - custom-architecture - qwen - base-model - pretrained-from-scratch - fineweb language: - en pipeline_tag: text-generation --- # ๐Ÿฅ‘ Nutral Base **Nutral Base** is a lightweight, highly optimized custom language model **trained completely from scratch**. It has been meticulously configured and tuned for ultra-fast loss convergence and peak hardware throughput. The tokenizer and base layers have been pre-injected with specialized tokens to ensure seamless downstream alignment for structured **Chain-of-Thought (CoT) Reasoning** and instruction-following tasks. > โš ๏ธ **Important Note:** This is a pure **Base Model**. While it is highly capable of fast text generation and raw pattern completion, it is not fine-tuned to follow conversational instructions or step-by-step reasoning prompts out of the box. For the structured reasoning and chat-aligned variant, please refer to the fine-tuned version: [Nutral Reasoning Instruct](./). --- ## ๐Ÿ“Œ Model Details * **Base Architecture:** **Qwen2** (`Qwen2ForCausalLM`) * **Training Type:** Pre-trained from Scratch * **Natural Language:** English (`en`) * **Programming Language:** Python * **Primary Task:** Causal Language Modeling (Text Generation) --- ## ๐Ÿ“Š Architecture & Parameters Configured dynamically via a custom `Qwen2Config` setup, the precise structural layout and parameter breakdown of the model are detailed below: | Hyperparameter | Configuration Value | | :--- | :--- | | **Total Parameters** | **~17.5 Million (17,498,368)** | | **Embedding Dimension ($d_{model}$)** | 512 | | **Number of Layers ($n_{layers}$)** | 8 *(Optimized from 10 for hyper-speed training)* | | **Attention Heads ($n_{heads}$)** | 8 | | **Intermediate / FFN Size** | 1,024 | | **Context Window ($seq_{len}$)** | 256 tokens | | **Vocabulary Size** | 50,304 *(GPT-2 Base + Integrated Custom Reasoner Tokens)* | | **Tie Word Embeddings** | True *(Enforces strict parameter sharing for weight compactness)* | --- ## ๐Ÿ› ๏ธ Pre-training Dataset & Strategy To bypass typical data-loading bottlenecks and keep the GPUs fully saturated at 100% compute capacity, a custom high-speed RAM buffering methodology was utilized to stream data. * **Dataset Name:** `HuggingFaceFW/fineweb` (Specifically using the high-quality `sample-10BT` split) * **Total Tokens Trained:** **20,000,000 (20 Million Tokens)** * **Sequence Length:** 256 tokens * **Training Objective:** Next-token prediction --- ## โš™๏ธ Hardware & Infrastructure Details The pre-training phase was executed on consumer-enterprise hybrid infrastructure within a multi-GPU Kaggle environment: * **Hardware Used:** **2x NVIDIA T4 Tensor Core GPUs** * **Parallelization Framework:** PyTorch `nn.DataParallel` * **Precision Mode:** Automatic Mixed Precision (`torch.amp.autocast` FP16) * **Optimizer:** AdamW ($\beta_1 = 0.9, \beta_2 = 0.999$, Weight Decay = 0.01) * **Peak Learning Rate:** `2e-3` *(Elevated aggressively for rapid early-stage loss drop)* * **LR Scheduler:** Cosine Decay paired with a 10% Linear Warmup phase * **Effective Batch Size:** 16 (Micro-batch) ร— 8 (Gradient Accumulation Steps) = 128 sequences per step --- ## ๐Ÿ“ฆ Core Technical Libraries Used The entire lifecycle of building, tokenizing, and pre-training this model from scratch relies on the following core ecosystem libraries natively built on **Python**: * **`transformers`** (v4.x+) - Utilized for the foundational structural layout (`Qwen2Config`), core model assembly (`Qwen2ForCausalLM`), and tokenization pipeline management. * **`datasets`** - Implemented to stream, parse, and handle the Hugging Face `fineweb` dataset efficiently at scale. * **`accelerate`** - Used to configure multi-GPU scaling mechanics and stabilize the distributed runtime environment. * **`torch` (PyTorch)** - Serving as the primary deep learning framework for tensor computations, mixed-precision handling (`torch.amp`), and `DataParallel` coordination. --- ## ๐Ÿงช Special Tokens Injected (Base Level) To prepare the foundational layers for seamless, bug-free SFT and Reasoning phases, the tokenizer was permanently injected with the following structural tokens during base initialization: * `<|im_start|>` and `<|im_end|>` (Strict ChatML format alignment) * `` and `` (Explicit reasoning encapsulation blocks) --- ## ๐Ÿš€ Quickstart: How to Use in Transformers You can initialize and test this raw base model using the standard Hugging Face pipeline as shown below: ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch model_path = "Nebullixlabs/Nutral-Base" tokenizer = AutoTokenizer.from_pretrained(model_path) model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto") # Base ChatML prompt structure prompt = "<|im_start|>user\nExplain data science in short.<|im_end|>\n<|im_start|>assistant\n" inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.7, do_sample=True) print(tokenizer.decode(outputs[0], skip_special_tokens=False))