---
language:
- en
license: apache-2.0
tags:
- llama
- causal-lm
- from-scratch
- dpo
- chat
- text-generation
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Transformer-1B-Chat
  results: []
---

# Transformer-1B-Chat

A **1.1 billion parameter** decoder-only language model trained **entirely from scratch** -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.

## Model Details

| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style Decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 |
| Vocab Size | 32,003 |
| Precision | BFloat16 |
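
The reported parameter count can be reproduced from the table values. The sketch below assumes a bias-free LLaMA-style layout, one weight vector per RMSNorm, and an output head that is not tied to the input embedding. None of these assumptions is stated in this card, but together they match the reported total. The name `count_parameters` is illustrative, not part of the released code.

```python
def count_parameters(hidden=2048, intermediate=5504, layers=22,
                     heads=32, kv_heads=8, head_dim=64, vocab=32003,
                     tied_embeddings=False):
    """Count weights for a bias-free LLaMA-style decoder (assumed layout)."""
    q_proj = hidden * heads * head_dim            # query projection
    kv_proj = 2 * hidden * kv_heads * head_dim    # key and value projections (GQA)
    o_proj = heads * head_dim * hidden            # attention output projection
    attention = q_proj + kv_proj + o_proj
    mlp = 3 * hidden * intermediate               # SwiGLU: gate, up, and down
    norms = 2 * hidden                            # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    embeddings = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + lm_head + final_norm


total = count_parameters()
print(f"{total:,}")  # 1,105,827,840
```

With `tied_embeddings=True` the same arithmetic gives 1,040,285,696, which is why the untied layout is assumed here.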

### Architecture Highlights

- **RoPE** (Rotary Position Embeddings) with theta=10,000
- **Grouped Query Attention** (GQA) -- 4:1 query-to-KV head ratio for efficient inference
- **SwiGLU** Feed-Forward Network
- **RMSNorm** in a pre-norm configuration
- **Flash Attention 2** via PyTorch SDPA
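
To make the RoPE and GQA settings above concrete, here is a minimal pure-Python sketch. It assumes the interleaved pairing of channels `(x[2i], x[2i+1])`; the released `transformer.py` may use the half-split layout instead, and all function names here are illustrative.

```python
import math


def rope_inverse_frequencies(head_dim=64, theta=10_000.0):
    """One rotation frequency per channel pair: theta ** (-2i / head_dim)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, theta=10_000.0):
    """Rotate each channel pair by position * frequency (interleaved layout assumed)."""
    out = list(vec)
    for i, freq in enumerate(rope_inverse_frequencies(len(vec), theta)):
        angle = position * freq
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * cos - y * sin
        out[2 * i + 1] = x * sin + y * cos
    return out


def kv_head_for_query_head(query_head, n_heads=32, n_kv_heads=8):
    """GQA: each group of n_heads // n_kv_heads query heads shares one KV head."""
    group_size = n_heads // n_kv_heads  # 4 for this model
    return query_head // group_size
```

Because each pair is only rotated, the dot product between a rotated query and a rotated key depends on the difference of their positions, not on the absolute positions.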

## Training Pipeline

This model was built through a complete 3-stage training pipeline:

### Stage 1: Pretraining

| Detail | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Tokens Trained | ~20B tokens |
| Steps | 19,070 |
| Duration | ~12.3 hours |
| Optimizer | AdamW (lr=3e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | WSD (Warmup-Stable-Decay), warmup=1000 steps |
| Batch Size | 512 sequences (8 GPUs x 8 micro x 8 grad accum) |
| Final Loss | 2.43 |
| Throughput | ~338K tokens/sec |
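
The token budget follows from the step count and batch size, and the WSD schedule can be sketched in a few lines. The example below assumes every sequence is filled to the 2048-token maximum. The decay length (`decay_steps`), the linear decay shape, and the final learning rate are hypothetical placeholders, since the card only states the peak learning rate and the warmup length.

```python
def tokens_seen(steps=19_070, sequences_per_step=8 * 8 * 8, seq_len=2048):
    """Total tokens processed, assuming full-length (packed) sequences."""
    return steps * sequences_per_step * seq_len


def wsd_learning_rate(step, peak_lr=3e-4, warmup_steps=1_000,
                      total_steps=19_070, decay_steps=1_907, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, then linear decay.

    decay_steps and min_lr are assumed values, not taken from the model card.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


print(f"{tokens_seen():,}")  # 19,996,344,320
```

Under these assumptions the step count and batch size account for the reported ~20B tokens.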

### Stage 2: Supervised Fine-Tuning (SFT)

| Detail | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (207,865 conversations) |
| Steps | 3,240 (2 epochs) |
| Duration | ~52 minutes |
| Optimizer | AdamW (lr=2e-5, cosine decay) |
| Batch Size | 256 sequences |
| Final Loss | 1.20 |

### Stage 3: Direct Preference Optimization (DPO)

| Detail | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (60,917 pairs) |
| Steps | 952 (1 epoch) |
| Duration | ~14 minutes |
| Optimizer | AdamW (lr=5e-7, cosine decay) |
| Beta | 0.1 |
| Batch Size | 64 pairs |
| Final Loss | 0.49 |
| Final Accuracy | 72.5% (chosen preferred over rejected) |
| Final Reward Margin | 0.84 |
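
For readers unfamiliar with the loss, accuracy, and reward-margin columns, the sketch below shows the standard per-pair DPO objective with beta=0.1. It is a plain-Python illustration of the published formula, not the code in `train_dpo.py`; the function name and the summed log-probability inputs are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, reward_margin) for a single preference pair.

    Inputs are summed log-probabilities of each response under the policy
    and under the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin


# One epoch over 60,917 pairs at 64 pairs per step
steps_per_epoch = math.ceil(60_917 / 64)  # 952
```

A pair counts toward accuracy when its margin is positive. The reported loss is a mean over pairs, so it need not equal the loss evaluated at the mean margin.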

### Hardware

- **8x NVIDIA H100 80GB HBM3**
- **Distributed Strategy**: PyTorch DDP (DistributedDataParallel)
- **Communication**: NCCL
- **Mixed Precision**: BF16 autocast
- **Total Training Time**: ~13.5 hours (all 3 stages)

## Chat Template

The model uses a simple chat template with special tokens:

```
<|user|>
Your message here
<|end|>
<|assistant|>
Model response here
<|end|>
```

### Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|user\|>` | 32000 | Start of user turn |
| `<\|assistant\|>` | 32001 | Start of assistant turn |
| `<\|end\|>` | 32002 | End of turn |
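
A prompt in this format can be assembled with a small helper. The sketch below assumes a single newline after each special token and after each message, as the template suggests; the exact whitespace expected by the tokenizer is not specified in this card, and `build_prompt` is a hypothetical helper rather than part of the released code.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def build_prompt(messages):
    """Render a conversation and open a new assistant turn for generation.

    messages: list of {"role": "user" | "assistant", "content": str}.
    Newline placement is an assumption based on the template above.
    """
    role_tokens = {"user": USER, "assistant": ASSISTANT}
    parts = []
    for message in messages:
        token = role_tokens[message["role"]]
        parts.append(f"{token}\n{message['content']}\n{END}\n")
    parts.append(f"{ASSISTANT}\n")  # the model continues from here
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "What is RoPE?"}])
```

Generation would then typically stop when the model emits `<|end|>` (ID 32002).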

## Limitations

- **1.1B parameters** -- smaller models have inherent limitations in reasoning depth and factual accuracy
- Trained on English data only
- May generate plausible-sounding but incorrect information
- The DPO alignment is single-epoch; additional iterations could improve quality
- Not safety-tuned beyond what the UltraFeedback dataset provides

## Training Code

The full training code is open-sourced alongside this model.

```
model/
config.py # Model and training hyperparameters
transformer.py # Full transformer implementation from scratch
data.py # Pretraining data pipeline (FineWeb-Edu)
sft_data.py # SFT data pipeline (UltraChat)
dpo_data.py # DPO data pipeline (UltraFeedback)
train.py # Pretraining script (DDP, 8-GPU)
train_sft.py # SFT script
train_dpo.py # DPO script
chat.py # Interactive chat interface
export_to_hf.py # Export to HuggingFace format
```

## License

Apache 2.0