---
language:
- en
license: apache-2.0
tags:
- llama
- causal-lm
- from-scratch
- dpo
- chat
- text-generation
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Transformer-1B-Chat
  results: []
---

# Transformer-1B-Chat

A **1.1 billion parameter** decoder-only language model trained **entirely from scratch** -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.

## Model Details

| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style Decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 |
| Vocab Size | 32,003 |
| Precision | BFloat16 |

### Architecture Highlights

- **RoPE** (Rotary Position Embeddings) with theta=10,000
- **Grouped Query Attention** (GQA) -- 4:1 query-to-KV head ratio for efficient inference
- **SwiGLU** Feed-Forward Network
- **RMSNorm** in a pre-norm configuration
- **Flash Attention 2** via PyTorch SDPA
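As a sketch of the rotary embeddings listed above (theta=10,000, head dim 64), the standard RoPE formulation rotates each consecutive pair of query/key dimensions by a position-dependent angle. This is a minimal reference implementation of the general technique, not the code from this repository:

```python
import torch

def rope_freqs(head_dim: int, seq_len: int, theta: float = 10_000.0) -> torch.Tensor:
    """Precompute complex rotary factors e^{i * angle}, one per (position, dim-pair)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, inv_freq)                    # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)  # complex, same shape

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate query/key pairs. x: (batch, seq, heads, head_dim)."""
    # View consecutive dim pairs as complex numbers, rotate, and view back.
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    out = torch.view_as_real(x_c * freqs[None, :, None, :])
    return out.reshape(x.shape).type_as(x)
```

Position 0 has angle 0, so its vectors pass through unchanged; attention scores between rotated queries and keys then depend only on relative position.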

## Training Pipeline

This model was built through a three-stage training pipeline:

### Stage 1: Pretraining

| Detail | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Tokens Trained | ~20B tokens |
| Steps | 19,070 |
| Duration | ~12.3 hours |
| Optimizer | AdamW (lr=3e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | WSD (Warmup-Stable-Decay), warmup=1000 steps |
| Batch Size | 512 sequences (8 GPUs x 8 micro x 8 grad accum) |
| Final Loss | 2.43 |
| Throughput | ~338K tokens/sec |
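The WSD schedule above can be sketched as a plain function of the step count. The warmup length (1,000) and total steps (19,070) come from the table; the decay fraction (final 10%) and the linear decay shape are assumptions for illustration:

```python
def wsd_lr(step: int, max_lr: float = 3e-4, warmup: int = 1000,
           total: int = 19_070, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, flat plateau, then linear decay to zero.
    decay_frac (the fraction of steps spent decaying) is an assumed value."""
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return max_lr * step / warmup          # linear warmup
    if step < decay_start:
        return max_lr                          # stable plateau
    return max_lr * max(0.0, (total - step) / (total - decay_start))  # decay
```

Unlike cosine decay, WSD keeps the learning rate flat for most of training, so the decay phase can be re-run from a plateau checkpoint to extend training.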

### Stage 2: Supervised Fine-Tuning (SFT)

| Detail | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (207,865 conversations) |
| Steps | 3,240 (2 epochs) |
| Duration | ~52 minutes |
| Optimizer | AdamW (lr=2e-5, cosine decay) |
| Batch Size | 256 sequences |
| Final Loss | 1.20 |
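A common choice in chat SFT is to compute the loss only on assistant tokens, masking user turns with the ignore index. Whether this model's `train_sft.py` does so is not stated, so the following is a hypothetical sketch using the special-token IDs listed under Chat Template below:

```python
import torch

def sft_labels(input_ids: torch.Tensor,
               assistant_id: int = 32001, end_id: int = 32002,
               ignore: int = -100) -> torch.Tensor:
    """Build labels that keep loss only on assistant-turn tokens
    (a common SFT convention; assumed here, not confirmed by the card)."""
    labels = torch.full_like(input_ids, ignore)
    in_assistant = False
    for i, tok in enumerate(input_ids.tolist()):
        if tok == assistant_id:
            in_assistant = True       # marker itself stays masked
            continue
        if tok == end_id:
            if in_assistant:
                labels[i] = tok       # model should learn to emit <|end|>
            in_assistant = False
            continue
        if in_assistant:
            labels[i] = tok
    return labels
```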

### Stage 3: Direct Preference Optimization (DPO)

| Detail | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (60,917 pairs) |
| Steps | 952 (1 epoch) |
| Duration | ~14 minutes |
| Optimizer | AdamW (lr=5e-7, cosine decay) |
| Beta | 0.1 |
| Batch Size | 64 pairs |
| Final Loss | 0.49 |
| Final Accuracy | 72.5% (chosen preferred over rejected) |
| Final Reward Margin | 0.84 |
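The metrics in this table follow directly from the standard DPO objective. A minimal sketch (not the repository's `train_dpo.py`) operating on summed per-sequence log-probabilities, with beta matching the 0.1 above:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss plus the reward-margin and accuracy metrics reported above.
    All inputs are per-sequence summed log-probs, shape (batch,)."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    loss = -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = (chosen_rewards - rejected_rewards).mean()
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    return loss, margin, accuracy
```

When policy and reference agree exactly, the logits are zero and the loss is ln 2 ≈ 0.693; the reported 0.49 final loss indicates the policy learned to separate chosen from rejected responses.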

### Hardware

- **8x NVIDIA H100 80GB HBM3**
- **Distributed Strategy**: PyTorch DDP (DistributedDataParallel)
- **Communication**: NCCL
- **Mixed Precision**: BF16 autocast
- **Total Training Time**: ~13.5 hours (all 3 stages)

## Chat Template

The model uses a simple chat template with special tokens:

```
<|user|>
Your message here
<|end|>
<|assistant|>
Model response here
<|end|>
```

### Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<|user|>` | 32000 | Start of user turn |
| `<|assistant|>` | 32001 | Start of assistant turn |
| `<|end|>` | 32002 | End of turn |
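For generation, the conversation is rendered into this template with a trailing `<|assistant|>` marker so the model continues as the assistant. A small helper illustrating the format (the repository's `chat.py` may differ in detail):

```python
def format_chat(messages: list[dict]) -> str:
    """Render [{'role': ..., 'content': ...}, ...] into the model's chat
    template, ending with an open assistant turn for generation."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)
```

The resulting string can be tokenized and passed to a standard `generate` call, stopping on token 32002 (`<|end|>`) as the end-of-turn marker.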

## Limitations

- **1.1B parameters** -- smaller models have inherent limitations in reasoning depth and factual accuracy
- Trained on English data only
- May generate plausible-sounding but incorrect information
- The DPO alignment is single-epoch; additional iterations could improve quality
- Not safety-tuned beyond what the UltraFeedback dataset provides

## Training Code

The full training code is open-sourced alongside this model.

```
model/
  config.py          # Model and training hyperparameters
  transformer.py     # Full transformer implementation from scratch
  data.py            # Pretraining data pipeline (FineWeb-Edu)
  sft_data.py        # SFT data pipeline (UltraChat)
  dpo_data.py        # DPO data pipeline (UltraFeedback)
train.py             # Pretraining script (DDP, 8-GPU)
train_sft.py         # SFT script
train_dpo.py         # DPO script
chat.py              # Interactive chat interface
export_to_hf.py      # Export to HuggingFace format
```

## License

Apache 2.0