---
language:
- en
license: apache-2.0
tags:
- llama
- causal-lm
- from-scratch
- dpo
- chat
- text-generation
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Transformer-1B-Chat
  results: []
---

# Transformer-1B-Chat

A **1.1-billion-parameter** decoder-only language model trained **entirely from scratch** -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.

## Model Details

| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 |
| Vocab Size | 32,003 |
| Precision | BFloat16 |

### Architecture Highlights

- **RoPE** (Rotary Position Embeddings) with theta = 10,000
- **Grouped Query Attention** (GQA) -- a 4:1 query-to-KV head ratio for efficient inference
- **SwiGLU** feed-forward network
- **RMSNorm** in a pre-norm configuration
- **Flash Attention 2** via PyTorch SDPA
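
The parameter count in the table follows directly from these dimensions. The sketch below assumes untied input/output embeddings, bias-free projections, and two RMSNorm weights per layer plus one final norm (all standard for LLaMA-style models); under those assumptions it reproduces the 1,105,827,840 figure exactly:

```python
# Parameter count for the LLaMA-style dimensions in the table above.
# Assumption: untied embedding / lm_head weights and no projection biases.
VOCAB, HIDDEN, INTER, LAYERS, KV_HEADS, HEAD_DIM = 32_003, 2048, 5504, 22, 8, 64

embed = VOCAB * HIDDEN                    # token embedding matrix
kv_dim = KV_HEADS * HEAD_DIM              # 512 with GQA (vs. 2048 for full MHA)
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q, o + narrower k, v projections
mlp = 3 * HIDDEN * INTER                  # SwiGLU: gate, up, down matrices
norms = 2 * HIDDEN                        # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

total = embed + LAYERS * per_layer + HIDDEN + VOCAB * HIDDEN  # + final norm + lm_head
print(f"{total:,}")  # 1,105,827,840 -- matches the table
```

Note how GQA shrinks the k/v projections to a quarter of their MHA size, which is also what shrinks the KV cache at inference time.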
## Training Pipeline

This model was built through a complete three-stage training pipeline:

### Stage 1: Pretraining

| Detail | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Tokens Trained | ~20B |
| Steps | 19,070 |
| Duration | ~12.3 hours |
| Optimizer | AdamW (lr=3e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | WSD (Warmup-Stable-Decay), warmup=1000 steps |
| Batch Size | 512 sequences (8 GPUs x 8 micro-batch x 8 grad accum) |
| Final Loss | 2.43 |
| Throughput | ~338K tokens/sec |
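
A WSD schedule ramps up linearly, holds the learning rate flat, then decays at the end. A minimal sketch of the shape; only warmup=1000 and peak lr=3e-4 come from the table, while the 10% linear-decay tail is an illustrative assumption:

```python
def wsd_lr(step: int, total_steps: int = 19_070, warmup: int = 1_000,
           peak: float = 3e-4, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay.

    The decay_frac=0.1 tail is illustrative, not taken from the table.
    """
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:                 # linear warmup to peak
        return peak * step / warmup
    if step < decay_start:            # stable plateau at peak lr
        return peak
    # linear decay to zero over the final decay_frac of training
    return peak * (total_steps - step) / (total_steps - decay_start)
```

Unlike cosine decay, the flat plateau lets training be extended (or stopped for an early decay) without committing to a total step count up front.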

### Stage 2: Supervised Fine-Tuning (SFT)

| Detail | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (207,865 conversations) |
| Steps | 3,240 (2 epochs) |
| Duration | ~52 minutes |
| Optimizer | AdamW (lr=2e-5, cosine decay) |
| Batch Size | 256 sequences |
| Final Loss | 1.20 |

### Stage 3: Direct Preference Optimization (DPO)

| Detail | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (60,917 pairs) |
| Steps | 952 (1 epoch) |
| Duration | ~14 minutes |
| Optimizer | AdamW (lr=5e-7, cosine decay) |
| Beta | 0.1 |
| Batch Size | 64 pairs |
| Final Loss | 0.49 |
| Final Accuracy | 72.5% (chosen preferred over rejected) |
| Final Reward Margin | 0.84 |
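
The loss, accuracy, and reward-margin rows all come from the standard sigmoid DPO objective: the implicit reward for a completion is the beta-scaled log-prob ratio between the policy and the frozen reference model, and the loss pushes the chosen reward above the rejected one. A per-example scalar sketch (pure Python for clarity; the actual training code batches this in PyTorch):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> tuple[float, float]:
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Implicit rewards: beta-scaled log-prob ratios vs. the reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward   # the "reward margin" reported above
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
    return loss, margin
```

Before training the margin is ~0 (policy equals reference), giving a loss of log 2 ≈ 0.693; the final loss of 0.49 and margin of 0.84 reflect the policy having moved toward the chosen responses.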

### Hardware

- **8x NVIDIA H100 80GB HBM3**
- **Distributed Strategy**: PyTorch DDP (DistributedDataParallel)
- **Communication**: NCCL
- **Mixed Precision**: BF16 autocast
- **Total Training Time**: ~13.5 hours (all 3 stages)

## Chat Template

The model uses a simple chat template with special tokens:

```
<|user|>
Your message here
<|end|>
<|assistant|>
Model response here
<|end|>
```

### Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<|user|>` | 32000 | Start of user turn |
| `<|assistant|>` | 32001 | Start of assistant turn |
| `<|end|>` | 32002 | End of turn |
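
Following the template above, a generation prompt can be assembled by hand; `build_prompt` is an illustrative helper, and the exact newline placement is inferred from the example block. At inference, the final `<|assistant|>` header is left open for the model to complete, and generation should stop at `<|end|>` (token ID 32002):

```python
def build_prompt(messages: list[dict]) -> str:
    """Render a conversation into the model's chat template.

    `messages` is a list of {"role": "user" | "assistant", "content": str};
    the prompt ends with an open `<|assistant|>` header so the model's
    continuation is the reply. Illustrative helper, not from the repo.
    """
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n<|end|>\n")
    parts.append("<|assistant|>\n")  # leave the assistant turn open
    return "".join(parts)

prompt = build_prompt([{"role": "user", "content": "What is RMSNorm?"}])
```

If the exported tokenizer ships a `chat_template`, `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` should produce the equivalent string.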

## Limitations

- At **1.1B parameters**, the model has the limits in reasoning depth and factual accuracy inherent to small models
- Trained on English data only
- May generate plausible-sounding but incorrect information
- The DPO alignment is single-epoch; additional iterations could improve quality
- Not safety-tuned beyond what the UltraFeedback dataset provides

## Training Code

The full training code is open-sourced alongside this model.

```
model/
  config.py        # Model and training hyperparameters
  transformer.py   # Full transformer implementation from scratch
  data.py          # Pretraining data pipeline (FineWeb-Edu)
  sft_data.py      # SFT data pipeline (UltraChat)
  dpo_data.py      # DPO data pipeline (UltraFeedback)
  train.py         # Pretraining script (DDP, 8-GPU)
  train_sft.py     # SFT script
  train_dpo.py     # DPO script
  chat.py          # Interactive chat interface
  export_to_hf.py  # Export to Hugging Face format
```

## License

Apache 2.0