---
license: apache-2.0
tags:
- text-generation
- custom-architecture
- qwen
- base-model
- pretrained-from-scratch
- fineweb
language:
- en
pipeline_tag: text-generation
---
# 🥑 Nutral Base
**Nutral Base** is a lightweight custom language model **trained from scratch**. Its configuration was chosen for fast loss convergence and high training throughput on modest hardware.
The tokenizer vocabulary already includes the special tokens needed for downstream **Chain-of-Thought (CoT) reasoning** and instruction-following fine-tuning, so later stages do not need to change the vocabulary.
> ⚠️ **Important Note:** This is a pure **base model**. It can generate text quickly and complete raw patterns, but it is not fine-tuned to follow conversational instructions or step-by-step reasoning prompts. For the chat-aligned, structured-reasoning variant, see the fine-tuned version: [Nutral Reasoning Instruct](./).
---
## 📌 Model Details
* **Base Architecture:** **Qwen2** (`Qwen2ForCausalLM`)
* **Training Type:** Pre-trained from Scratch
* **Natural Language:** English (`en`)
* **Programming Language:** Python
* **Primary Task:** Causal Language Modeling (Text Generation)
---
## 📊 Architecture & Parameters
The model is built from a custom `Qwen2Config`. Its structure and parameter count are listed below:
| Hyperparameter | Configuration Value |
| :--- | :--- |
| **Total Parameters** | **~17.5 Million (17,498,368)** |
| **Embedding Dimension ($d_{model}$)** | 512 |
| **Number of Layers ($n_{layers}$)** | 8 *(Reduced from 10 for faster training)* |
| **Attention Heads ($n_{heads}$)** | 8 |
| **Intermediate / FFN Size** | 1,024 |
| **Context Window ($seq_{len}$)** | 256 tokens |
| **Vocabulary Size** | 50,304 *(GPT-2 Base + Integrated Custom Reasoner Tokens)* |
| **Tie Word Embeddings** | True *(Input and output embeddings share one weight matrix)* |
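
As a rough cross-check of the table, the sketch below estimates the parameter count of a Qwen2-style decoder from these hyperparameters. It assumes the usual Qwen2 layout in `transformers` (biases on the query/key/value projections only, a three-matrix gated FFN, two RMSNorm weights per layer plus a final norm) and `n_kv_heads = 8`, which the table does not state. The function name is hypothetical. Under these assumptions the embedding matrix alone (50,304 × 512) comes to about 25.8M parameters, which is more than the total listed above, so the released `config.json` should be treated as authoritative where the two disagree.

```python
def qwen2_style_param_count(vocab_size, d_model, n_layers, n_heads,
                            n_kv_heads, ffn_size, tie_embeddings=True):
    """Estimate parameters of a Qwen2-style decoder (hypothetical helper)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim

    embedding = vocab_size * d_model
    # q/k/v projections carry a bias, the output projection does not
    attention = (
        (d_model * d_model + d_model)        # q_proj
        + 2 * (d_model * kv_dim + kv_dim)    # k_proj and v_proj
        + d_model * d_model                  # o_proj
    )
    mlp = 3 * d_model * ffn_size             # gate, up, and down projections
    norms = 2 * d_model                      # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    total = embedding + n_layers * per_layer + d_model  # + final RMSNorm
    if not tie_embeddings:
        total += vocab_size * d_model        # separate output head
    return {"embedding": embedding, "per_layer": per_layer, "total": total}


# Values from the table; n_kv_heads=8 is an assumption (not listed in the card)
counts = qwen2_style_param_count(
    vocab_size=50_304, d_model=512, n_layers=8,
    n_heads=8, n_kv_heads=8, ffn_size=1_024,
)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```
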
---
## 🛠️ Pre-training Dataset & Strategy
To avoid data-loading bottlenecks and keep both GPUs saturated, training data was streamed through a custom in-RAM buffer.
* **Dataset Name:** `HuggingFaceFW/fineweb` (specifically the `sample-10BT` subset)
* **Total Tokens Trained:** **20,000,000 (20 Million Tokens)**
* **Sequence Length:** 256 tokens
* **Training Objective:** Next-token prediction
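
The buffering code itself is not shown in this card. The following is a minimal sketch of one common approach, assuming tokenized documents are simply concatenated in memory and cut into fixed 256-token blocks; `pack_tokens` and its `eos_id` parameter are hypothetical names, not the author's implementation.

```python
from typing import Iterable, Iterator, List, Optional

SEQ_LEN = 256  # sequence length stated in this card


def pack_tokens(token_streams: Iterable[List[int]],
                seq_len: int = SEQ_LEN,
                eos_id: Optional[int] = None) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield fixed-length blocks."""
    buffer: List[int] = []
    for ids in token_streams:
        buffer.extend(ids)
        if eos_id is not None:
            buffer.append(eos_id)  # optional document separator
        # Emit as many full blocks as the buffer currently holds
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any trailing partial block is dropped in this sketch


# Toy usage: two "documents" of 300 token ids each
docs = [list(range(300)), list(range(300))]
blocks = list(pack_tokens(docs))
print(len(blocks), [len(b) for b in blocks])
```

For next-token prediction, each block can serve as both input and label, since causal language-model heads in `transformers` shift the labels internally.
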
---
## ⚙️ Hardware & Infrastructure Details
Pre-training ran in a multi-GPU Kaggle environment:
* **Hardware Used:** **2x NVIDIA T4 Tensor Core GPUs**
* **Parallelization Framework:** PyTorch `nn.DataParallel`
* **Precision Mode:** Automatic Mixed Precision (`torch.amp.autocast` FP16)
* **Optimizer:** AdamW ($\beta_1 = 0.9, \beta_2 = 0.999$, Weight Decay = 0.01)
* **Peak Learning Rate:** `2e-3` *(Deliberately high for a fast early loss drop)*
* **LR Scheduler:** Cosine Decay paired with a 10% Linear Warmup phase
* **Effective Batch Size:** 16 (Micro-batch) × 8 (Gradient Accumulation Steps) = 128 sequences per step
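
The batch and schedule settings above imply a fairly short run. The sketch below works through the arithmetic and shows one way to write a cosine schedule with a 10% linear warmup. It assumes the learning rate decays to zero and that the step count is derived from the 20M-token budget stated earlier; `lr_at` is a hypothetical helper, not the training script's code.

```python
import math

PEAK_LR = 2e-3
TOTAL_TOKENS = 20_000_000
SEQ_LEN = 256
MICRO_BATCH = 16
GRAD_ACCUM = 8

sequences_per_step = MICRO_BATCH * GRAD_ACCUM    # 128
tokens_per_step = sequences_per_step * SEQ_LEN   # 32,768
total_steps = TOTAL_TOKENS // tokens_per_step    # 610 optimizer steps
warmup_steps = total_steps // 10                 # 61 steps (10% warmup)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"tokens/step={tokens_per_step}, steps={total_steps}, warmup={warmup_steps}")
for step in (0, warmup_steps - 1, total_steps // 2, total_steps - 1):
    print(step, f"{lr_at(step):.6f}")
```
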
---
## 📦 Core Technical Libraries Used
Building, tokenizing, and pre-training this model from scratch relies on the following **Python** libraries:
* **`transformers`** (v4.x+) - Utilized for the foundational structural layout (`Qwen2Config`), core model assembly (`Qwen2ForCausalLM`), and tokenization pipeline management.
* **`datasets`** - Implemented to stream, parse, and handle the Hugging Face `fineweb` dataset efficiently at scale.
* **`accelerate`** - Used to configure multi-GPU scaling mechanics and stabilize the distributed runtime environment.
* **`torch` (PyTorch)** - Serving as the primary deep learning framework for tensor computations, mixed-precision handling (`torch.amp`), and `DataParallel` coordination.
---
## 🧪 Special Tokens Injected (Base Level)
To prepare for later SFT and reasoning fine-tuning, the following structural tokens were added to the tokenizer at base initialization:
* `<|im_start|>` and `<|im_end|>` (Strict ChatML format alignment)
* `<think>` and `</think>` (Explicit reasoning encapsulation blocks)
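
To illustrate how the ChatML markers frame a conversation, here is a small sketch that assembles a prompt string by hand. `build_chatml_prompt` is a hypothetical helper written for this example; it follows the common ChatML layout (role on the line after `<|im_start|>`, each turn closed by `<|im_end|>`) and is not part of this repository.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} messages as a ChatML string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Explain data science in short."}]
)
print(repr(prompt))
```
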
---
## 🚀 Quickstart: How to Use in Transformers
You can load and test this raw base model with the standard `transformers` API as shown below:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "Nebullixlabs/Nutral-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Load in FP16 and let Accelerate place the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Base ChatML prompt structure
prompt = "<|im_start|>user\nExplain data science in short.<|im_end|>\n<|im_start|>assistant\n"
# Move inputs to wherever the model was placed instead of hard-coding "cuda"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample up to 64 new tokens, well within the 256-token context window
outputs = model.generate(**inputs, max_new_tokens=64, temperature=0.7, do_sample=True)
# Keep special tokens visible so the ChatML markers can be inspected
print(tokenizer.decode(outputs[0], skip_special_tokens=False))