Update README.md

26840d8 verified 21 days ago

5.15 kB

	---
	license: apache-2.0
	tags:
	- text-generation
	- custom-architecture
	- qwen
	- base-model
	- pretrained-from-scratch
	- fineweb
	language:
	- en
	pipeline_tag: text-generation
	---

	# 🥑 Nutral Base

	Nutral Base is a lightweight, highly optimized custom language model trained completely from scratch. It has been meticulously configured and tuned for ultra-fast loss convergence and peak hardware throughput.

	The tokenizer and base layers have been pre-injected with specialized tokens to ensure seamless downstream alignment for structured Chain-of-Thought (CoT) Reasoning and instruction-following tasks.

	> ⚠️ Important Note: This is a pure Base Model. While it is highly capable of fast text generation and raw pattern completion, it is not fine-tuned to follow conversational instructions or step-by-step reasoning prompts out of the box. For the structured reasoning and chat-aligned variant, please refer to the fine-tuned version: [Nutral Reasoning Instruct](./).

	---

	## 📌 Model Details

	* Base Architecture: Qwen2 (`Qwen2ForCausalLM`)
	* Training Type: Pre-trained from Scratch
	* Natural Language: English (`en`)
	* Programming Language: Python
	* Primary Task: Causal Language Modeling (Text Generation)

	---

	## 📊 Architecture & Parameters

	Configured dynamically via a custom `Qwen2Config` setup, the precise structural layout and parameter breakdown of the model are detailed below:

	\| Hyperparameter \| Configuration Value \|
	\| :--- \| :--- \|
	\| Total Parameters \| ~17.5 Million (17,498,368) \|
	\| Embedding Dimension ($d_{model}$) \| 512 \|
	\| Number of Layers ($n_{layers}$) \| 8 (Optimized from 10 for hyper-speed training) \|
	\| Attention Heads ($n_{heads}$) \| 8 \|
	\| Intermediate / FFN Size \| 1,024 \|
	\| Context Window ($seq_{len}$) \| 256 tokens \|
	\| Vocabulary Size \| 50,304 (GPT-2 Base + Integrated Custom Reasoner Tokens) \|
	\| Tie Word Embeddings \| True (Enforces strict parameter sharing for weight compactness) \|

	---

	## 🛠️ Pre-training Dataset & Strategy

	To bypass typical data-loading bottlenecks and keep the GPUs fully saturated at 100% compute capacity, a custom high-speed RAM buffering methodology was utilized to stream data.

	* Dataset Name: `HuggingFaceFW/fineweb` (Specifically using the high-quality `sample-10BT` split)
	* Total Tokens Trained: 20,000,000 (20 Million Tokens)
	* Sequence Length: 256 tokens
	* Training Objective: Next-token prediction

	---

	## ⚙️ Hardware & Infrastructure Details

	The pre-training phase was executed on consumer-enterprise hybrid infrastructure within a multi-GPU Kaggle environment:

	* Hardware Used: 2x NVIDIA T4 Tensor Core GPUs
	* Parallelization Framework: PyTorch `nn.DataParallel`
	* Precision Mode: Automatic Mixed Precision (`torch.amp.autocast` FP16)
	* Optimizer: AdamW ($\beta_1 = 0.9, \beta_2 = 0.999$, Weight Decay = 0.01)
	* Peak Learning Rate: `2e-3` (Elevated aggressively for rapid early-stage loss drop)
	* LR Scheduler: Cosine Decay paired with a 10% Linear Warmup phase
	* Effective Batch Size: 16 (Micro-batch) × 8 (Gradient Accumulation Steps) = 128 sequences per step

	---

	## 📦 Core Technical Libraries Used

	The entire lifecycle of building, tokenizing, and pre-training this model from scratch relies on the following core ecosystem libraries natively built on Python:

	* `transformers` (v4.x+) - Utilized for the foundational structural layout (`Qwen2Config`), core model assembly (`Qwen2ForCausalLM`), and tokenization pipeline management.
	* `datasets` - Implemented to stream, parse, and handle the Hugging Face `fineweb` dataset efficiently at scale.
	* `accelerate` - Used to configure multi-GPU scaling mechanics and stabilize the distributed runtime environment.
	* `torch` (PyTorch) - Serving as the primary deep learning framework for tensor computations, mixed-precision handling (`torch.amp`), and `DataParallel` coordination.

	---

	## 🧪 Special Tokens Injected (Base Level)

	To prepare the foundational layers for seamless, bug-free SFT and Reasoning phases, the tokenizer was permanently injected with the following structural tokens during base initialization:
	* `<\|im_start\|>` and `<\|im_end\|>` (Strict ChatML format alignment)
	* `<think>` and `</think>` (Explicit reasoning encapsulation blocks)

	---

	## 🚀 Quickstart: How to Use in Transformers

	You can initialize and test this raw base model using the standard Hugging Face pipeline as shown below:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_path = "Nebullixlabs/Nutral-Base"

	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

	# Base ChatML prompt structure
	prompt = "<\|im_start\|>user\nExplain data science in short.<\|im_end\|>\n<\|im_start\|>assistant\n"
	inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

	outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.7, do_sample=True)
	print(tokenizer.decode(outputs[0], skip_special_tokens=False))