---
language:
- en
license: apache-2.0
tags:
- llama
- causal-lm
- from-scratch
- dpo
- chat
- text-generation
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Transformer-1B-Chat
  results: []
---

# Transformer-1B-Chat

A **1.1 billion parameter** decoder-only language model trained **entirely from scratch** -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.

## Model Details

| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style Decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 |
| Vocab Size | 32,003 |
| Precision | BFloat16 |

### Architecture Highlights

- **RoPE** (Rotary Position Embeddings) with theta=10,000
- **Grouped Query Attention** (GQA) -- 4:1 query-to-KV head ratio for efficient inference
- **SwiGLU** Feed-Forward Network
- **RMSNorm** in a pre-norm configuration
- **Flash Attention 2** via PyTorch SDPA

## Training Pipeline

This model was built through a complete 3-stage training pipeline:

### Stage 1: Pretraining

| Detail | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Tokens Trained | ~20B tokens |
| Steps | 19,070 |
| Duration | ~12.3 hours |
| Optimizer | AdamW (lr=3e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | WSD (Warmup-Stable-Decay), warmup=1000 steps |
| Batch Size | 512 sequences (8 GPUs x 8 micro x 8 grad accum) |
| Final Loss | 2.43 |
| Throughput | ~338K tokens/sec |

### Stage 2: Supervised Fine-Tuning (SFT)

| Detail | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (207,865 conversations) |
| Steps | 3,240 (2 epochs) |
| Duration | ~52 minutes |
| Optimizer | AdamW (lr=2e-5, cosine decay) |
| Batch Size | 256 sequences |
| Final Loss | 1.20 |

### Stage 3: Direct Preference Optimization (DPO)

| Detail | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (60,917 pairs) |
| Steps | 952 (1 epoch) |
| Duration | ~14 minutes |
| Optimizer | AdamW (lr=5e-7, cosine decay) |
| Beta | 0.1 |
| Batch Size | 64 pairs |
| Final Loss | 0.49 |
| Final Accuracy | 72.5% (chosen preferred over rejected) |
| Final Reward Margin | 0.84 |

### Hardware

- **8x NVIDIA H100 80GB HBM3**
- **Distributed Strategy**: PyTorch DDP (DistributedDataParallel)
- **Communication**: NCCL
- **Mixed Precision**: BF16 autocast
- **Total Training Time**: ~13.5 hours (all 3 stages)

## Chat Template

The model uses a simple chat template with special tokens:

```
<|user|>
Your message here
<|end|>
<|assistant|>
Model response here
<|end|>
```

### Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|user\|>` | 32000 | Start of user turn |
| `<\|assistant\|>` | 32001 | Start of assistant turn |
| `<\|end\|>` | 32002 | End of turn |

## Limitations

- **1.1B parameters** -- smaller models have inherent limitations in reasoning depth and factual accuracy
- Trained on English data only
- May generate plausible-sounding but incorrect information
- The DPO alignment is single-epoch; additional iterations could improve quality
- Not safety-tuned beyond what the UltraFeedback dataset provides

## Training Code

The full training code is open-sourced alongside this model.

```
model/
  config.py          # Model and training hyperparameters
  transformer.py     # Full transformer implementation from scratch
  data.py            # Pretraining data pipeline (FineWeb-Edu)
  sft_data.py        # SFT data pipeline (UltraChat)
  dpo_data.py        # DPO data pipeline (UltraFeedback)
  train.py           # Pretraining script (DDP, 8-GPU)
  train_sft.py       # SFT script
  train_dpo.py       # DPO script
  chat.py            # Interactive chat interface
  export_to_hf.py    # Export to HuggingFace format
```

## License

Apache 2.0
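
## Example: Parameter Count Check

The parameter count in the Model Details table can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope check, not the project's own code. It assumes no bias terms, one weight vector per RMSNorm, and untied input and output embeddings; none of these are stated in this card, but under them the total comes out to exactly the reported figure.

```python
# Sketch: rebuild the reported parameter count from the Model Details table.
# Assumptions (not stated in the card): no biases, untied embeddings,
# and a single weight vector per RMSNorm.

hidden = 2048
intermediate = 5504
layers = 22
n_heads = 32
n_kv_heads = 8
head_dim = 64
vocab = 32_003

# Attention: full-width Q and O projections, narrower K and V under GQA.
q_proj = hidden * n_heads * head_dim
k_proj = hidden * n_kv_heads * head_dim
v_proj = hidden * n_kv_heads * head_dim
o_proj = n_heads * head_dim * hidden
attention = q_proj + k_proj + v_proj + o_proj

# SwiGLU feed-forward: gate, up, and down projections.
mlp = 3 * hidden * intermediate

# Two RMSNorm weight vectors per layer (pre-attention and pre-MLP).
norms = 2 * hidden
per_layer = attention + mlp + norms

embed = vocab * hidden      # token embedding table
lm_head = vocab * hidden    # separate output projection (assumed untied)
final_norm = hidden

total = embed + layers * per_layer + final_norm + lm_head
print(f"{total:,}")  # 1,105,827,840
```

With tied embeddings the same arithmetic would give 1,040,285,696, so the reported count is consistent with a separate output projection.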
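
## Example: Token and Step Arithmetic

The pretraining token count and the DPO step count follow from the batch sizes in the training tables. This is a rough check, assuming every pretraining sequence is packed to the full 2048-token context and that the final partial DPO batch is kept; both are assumptions, not details taken from the training scripts.

```python
import math

# Pretraining: 8 GPUs x 8 micro-batch x 8 gradient-accumulation steps.
seq_len = 2048                      # assumes fully packed sequences
pretrain_batch = 8 * 8 * 8
pretrain_steps = 19_070
tokens_per_step = pretrain_batch * seq_len
pretrain_tokens = pretrain_steps * tokens_per_step

# DPO: one epoch over the preference pairs.
dpo_pairs = 60_917
dpo_batch = 64
dpo_steps = math.ceil(dpo_pairs / dpo_batch)  # assumes the last partial batch is kept

print(pretrain_batch)               # 512
print(f"{tokens_per_step:,}")       # 1,048,576
print(f"{pretrain_tokens:,}")       # 19,996,344,320
print(dpo_steps)                    # 952
```

Both results line up with the "~20B tokens" and "952 (1 epoch)" entries in the Stage 1 and Stage 3 tables.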
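
## Example: Building a Prompt

The chat template can be assembled with plain string formatting. The helper below is a hypothetical illustration: `build_prompt` is not part of the released code, and the exact newline placement between special tokens is an assumption that should be checked against the tokenizer configuration or `chat.py`.

```python
# Hypothetical helper, not part of the released code.
# Newline placement around the special tokens is an assumption.
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def build_prompt(messages):
    """Format (role, text) pairs and leave the assistant turn open.

    `messages` is a list of tuples whose role is "user" or "assistant".
    """
    role_tokens = {"user": USER, "assistant": ASSISTANT}
    parts = []
    for role, text in messages:
        parts.append(f"{role_tokens[role]}\n{text}\n{END}\n")
    # End with an open assistant turn so the model writes the reply.
    parts.append(f"{ASSISTANT}\n")
    return "".join(parts)


prompt = build_prompt([("user", "What is RoPE?")])
print(prompt)
```

Generation would normally stop when the model emits `<|end|>` (ID 32002 in the Special Tokens table), so that ID is the natural end-of-sequence token to pass to the generation call.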
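
## Example: Reading the DPO Metrics

The Stage 3 table reports a loss, an accuracy, and a reward margin. The sketch below shows how these three quantities are defined in the standard DPO formulation, using plain Python so it runs without a deep learning framework. Whether `train_dpo.py` computes them in exactly this way is an assumption; `dpo_stats` and its arguments are hypothetical names.

```python
import math


def dpo_stats(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (mean loss, accuracy, mean reward margin) for a batch.

    Each argument is a list of summed sequence log-probabilities, one
    entry per preference pair. Hypothetical helper for illustration.
    """
    losses, margins = [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        # Implicit rewards: scaled log-ratio of policy to reference model.
        chosen_reward = beta * (pc - rc)
        rejected_reward = beta * (pr - rr)
        margin = chosen_reward - rejected_reward
        # -log(sigmoid(margin)), written in a numerically stable form.
        loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        losses.append(loss)
        margins.append(margin)
    n = len(losses)
    accuracy = sum(m > 0 for m in margins) / n
    return sum(losses) / n, accuracy, sum(margins) / n


# A policy identical to the reference has zero margin on every pair.
loss, acc, margin = dpo_stats([-50.0], [-60.0], [-50.0], [-60.0])
print(round(loss, 4), acc, margin)  # 0.6931 0.0 0.0
```

A policy that equals the reference starts at a loss of ln 2 (about 0.693), so the reported final loss of 0.49 with a positive mean margin of 0.84 indicates the policy moved toward the chosen responses.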