---
language:
- en
license: apache-2.0
tags:
- llama
- causal-lm
- from-scratch
- dpo
- chat
- text-generation
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Transformer-1B-Chat
results: []
---
# Transformer-1B-Chat
A **1.1 billion parameter** decoder-only language model trained **entirely from scratch** -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.
## Model Details
| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style Decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 |
| Vocab Size | 32,003 |
| Precision | BFloat16 |
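The parameter count follows directly from the table above. A quick sanity check, assuming no bias terms and untied input/output embeddings (both standard for LLaMA-style models):

```python
# Derive the 1,105,827,840 parameter count from the architecture table,
# assuming no biases and untied embedding / lm_head weights.
vocab, hidden, inter, layers, kv_heads, head_dim = 32003, 2048, 5504, 22, 8, 64

embed = vocab * hidden                            # input embedding
attn = 2 * (hidden * hidden) \
     + 2 * (hidden * kv_heads * head_dim)         # Q/O projections + K/V (GQA)
mlp = 3 * hidden * inter                          # gate, up, down (SwiGLU)
norms = 2 * hidden                                # two RMSNorms per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden + vocab * hidden
#       ^embed   ^transformer block  ^final norm  ^lm_head
print(total)  # 1105827840
```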
### Architecture Highlights
- **RoPE** (Rotary Position Embeddings) with theta=10,000
- **Grouped Query Attention** (GQA) -- 4:1 query-to-KV head ratio for efficient inference
- **SwiGLU** Feed-Forward Network
- **RMSNorm** in a pre-norm configuration
- **Flash Attention 2** via PyTorch SDPA
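One practical payoff of the 4:1 GQA ratio is KV-cache size at inference time. A back-of-the-envelope calculation using the dimensions above and 2 bytes per BF16 value:

```python
layers, head_dim, bytes_bf16 = 22, 64, 2

def kv_cache_bytes_per_token(n_kv_heads):
    # Per layer, the cache holds one K and one V tensor of
    # n_kv_heads * head_dim values each.
    return 2 * layers * n_kv_heads * head_dim * bytes_bf16

mha = kv_cache_bytes_per_token(32)  # full multi-head attention baseline
gqa = kv_cache_bytes_per_token(8)   # this model's 8 KV heads
print(mha // gqa)  # 4  -> 4x smaller cache per generated token
```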
## Training Pipeline
The model was built through a three-stage training pipeline:
### Stage 1: Pretraining
| Detail | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Tokens Trained | ~20B tokens |
| Steps | 19,070 |
| Duration | ~12.3 hours |
| Optimizer | AdamW (lr=3e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | WSD (Warmup-Stable-Decay), warmup=1000 steps |
| Batch Size | 512 sequences (8 GPUs x 8 micro x 8 grad accum) |
| Final Loss | 2.43 |
| Throughput | ~338K tokens/sec |
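The WSD schedule holds the learning rate flat between a linear warmup and a final decay phase. A minimal sketch using the values in the table; the decay-phase length (`decay_frac`) and linear decay shape are assumptions, with the actual values living in `config.py`:

```python
def wsd_lr(step, max_lr=3e-4, warmup=1000, total=19070, decay_frac=0.1):
    # Warmup-Stable-Decay: linear warmup to max_lr, constant plateau,
    # then decay over the last `decay_frac` of training.
    # decay_frac and the linear decay shape are illustrative assumptions.
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return max_lr * step / warmup        # linear warmup
    if step < decay_start:
        return max_lr                        # stable plateau
    return max_lr * (total - step) / (total - decay_start)  # linear decay

print(wsd_lr(10000))  # 0.0003  (on the plateau)
```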
### Stage 2: Supervised Fine-Tuning (SFT)
| Detail | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (207,865 conversations) |
| Steps | 3,240 (2 epochs) |
| Duration | ~52 minutes |
| Optimizer | AdamW (lr=2e-5, cosine decay) |
| Batch Size | 256 sequences |
| Final Loss | 1.20 |
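A common detail in chat SFT, and the likely job of `sft_data.py`, is masking the loss so that only assistant tokens contribute; whether this model masks user turns is an assumption, not stated above. A sketch:

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch cross_entropy

def make_labels(token_ids, assistant_mask):
    # Copy input ids as labels, but replace non-assistant positions with
    # IGNORE_INDEX so the loss is computed only on model responses.
    # (Whether this model masks user turns is an assumption.)
    return [t if is_asst else IGNORE_INDEX
            for t, is_asst in zip(token_ids, assistant_mask)]

labels = make_labels([5, 6, 7, 8], [False, False, True, True])
print(labels)  # [-100, -100, 7, 8]
```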
### Stage 3: Direct Preference Optimization (DPO)
| Detail | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (60,917 pairs) |
| Steps | 952 (1 epoch) |
| Duration | ~14 minutes |
| Optimizer | AdamW (lr=5e-7, cosine decay) |
| Beta | 0.1 |
| Batch Size | 64 pairs |
| Final Loss | 0.49 |
| Final Accuracy | 72.5% (chosen preferred over rejected) |
| Final Reward Margin | 0.84 |
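For reference, the objective minimized in this stage is the standard DPO loss (Rafailov et al.) with beta = 0.1, where the reference policy is the frozen SFT checkpoint:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

The reported reward margin (0.84) is the mean gap between the two beta-scaled log-ratio terms, i.e. the implicit rewards of the chosen and rejected responses.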
### Hardware
- **8x NVIDIA H100 80GB HBM3**
- **Distributed Strategy**: PyTorch DDP (DistributedDataParallel)
- **Communication**: NCCL
- **Mixed Precision**: BF16 autocast
- **Total Training Time**: ~13.5 hours (all 3 stages)
## Chat Template
The model uses a simple chat template with special tokens:
```
<|user|>
Your message here
<|end|>
<|assistant|>
Model response here
<|end|>
```
### Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<|user|>` | 32000 | Start of user turn |
| `<|assistant|>` | 32001 | Start of assistant turn |
| `<|end|>` | 32002 | End of turn |
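Formatting a conversation for generation amounts to concatenating turns in the template above and ending with the assistant tag to cue a response. A minimal helper; the function name and exact newline placement are inferred from the example, and the exported tokenizer's chat template (if present) should be treated as authoritative:

```python
def build_prompt(messages):
    # Render a list of {"role": ..., "content": ...} dicts into the
    # template shown above, ending with <|assistant|> to cue generation.
    # Exact newline placement is an assumption based on the example.
    parts = []
    for m in messages:
        tag = "<|user|>" if m["role"] == "user" else "<|assistant|>"
        parts.append(f"{tag}\n{m['content']}\n<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)

print(build_prompt([{"role": "user", "content": "Hi"}]))
# <|user|>
# Hi
# <|end|>
# <|assistant|>
```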
## Limitations
- **1.1B parameters** -- smaller models have inherent limitations in reasoning depth and factual accuracy
- Trained on English data only
- May generate plausible-sounding but incorrect information
- The DPO alignment is single-epoch; additional iterations could improve quality
- Not safety-tuned beyond what the UltraFeedback dataset provides
## Training Code
The full training code is open-sourced alongside this model.
```
model/
config.py # Model and training hyperparameters
transformer.py # Full transformer implementation from scratch
data.py # Pretraining data pipeline (FineWeb-Edu)
sft_data.py # SFT data pipeline (UltraChat)
dpo_data.py # DPO data pipeline (UltraFeedback)
train.py # Pretraining script (DDP, 8-GPU)
train_sft.py # SFT script
train_dpo.py # DPO script
chat.py # Interactive chat interface
export_to_hf.py # Export to HuggingFace format
```
## License
Apache 2.0