---
language:
- en
license: apache-2.0
tags:
- llama
- causal-lm
- from-scratch
- dpo
- chat
- text-generation
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Transformer-1B-Chat
results: []
---
# Transformer-1B-Chat
A **1.1 billion parameter** decoder-only language model trained **entirely from scratch** -- pretraining, supervised fine-tuning, and preference alignment -- on 8x NVIDIA H100 GPUs.
## Model Details
| Property | Value |
|---|---|
| Parameters | 1,105,827,840 (1.1B) |
| Architecture | LLaMA-style Decoder-only Transformer |
| Hidden Size | 2048 |
| Intermediate Size | 5504 (SwiGLU) |
| Layers | 22 |
| Attention Heads | 32 (Grouped Query Attention) |
| KV Heads | 8 |
| Head Dim | 64 |
| Max Sequence Length | 2048 |
| Vocab Size | 32,003 |
| Precision | BFloat16 |
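The parameter count in the table can be reproduced from the architecture values above. The breakdown below is a sanity check assuming untied input/output embeddings and bias-free linear layers, as is standard for LLaMA-style models; under those assumptions it lands exactly on 1,105,827,840.

```python
# Reproduce the 1,105,827,840 parameter count from the config values above.
# Assumes untied embedding / LM-head weights and bias-free linear layers.
hidden, inter, layers, vocab = 2048, 5504, 22, 32003
kv_dim = 8 * 64                             # 8 KV heads x head_dim 64

embed = vocab * hidden                      # token embeddings
attn = 2 * hidden * hidden                  # q_proj + o_proj
attn += 2 * hidden * kv_dim                 # k_proj + v_proj (narrower under GQA)
mlp = 3 * hidden * inter                    # gate, up, down projections (SwiGLU)
norms = 2 * hidden                          # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

# embeddings + 22 layers + final norm + untied LM head
total = embed + layers * per_layer + hidden + vocab * hidden
print(total)  # 1105827840
```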
### Architecture Highlights
- **RoPE** (Rotary Position Embeddings) with theta=10,000
- **Grouped Query Attention** (GQA) -- 4:1 query-to-KV head ratio for efficient inference
- **SwiGLU** Feed-Forward Network
- **RMSNorm** in a pre-norm configuration
- **Flash Attention 2** via PyTorch SDPA
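The GQA layout can be sketched in a few lines. This is an illustrative NumPy version, not the model's actual PyTorch/SDPA code: each of the 8 KV heads is shared by a group of 4 query heads.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Causal grouped-query attention (NumPy sketch, single sequence).

    q: (n_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim).
    """
    groups = q.shape[0] // k.shape[0]           # 32 // 8 = 4 query heads per KV head
    k = np.repeat(k, groups, axis=0)            # broadcast each KV head to its group
    v = np.repeat(v, groups, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    causal = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(causal, -np.inf, scores)  # mask out future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 5, 64))
k = rng.standard_normal((8, 5, 64))
v = rng.standard_normal((8, 5, 64))
out = gqa_attention(q, k, v)
print(out.shape)  # (32, 5, 64)
```

The practical payoff is the KV cache: at inference only 8 heads of K and V are stored instead of 32, cutting cache memory 4x.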
## Training Pipeline
This model was built through a complete 3-stage training pipeline:
### Stage 1: Pretraining
| Detail | Value |
|---|---|
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT) |
| Tokens Trained | ~20B tokens |
| Steps | 19,070 |
| Duration | ~12.3 hours |
| Optimizer | AdamW (lr=3e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | WSD (Warmup-Stable-Decay), warmup=1000 steps |
| Batch Size | 512 sequences (8 GPUs x 8 micro x 8 grad accum) |
| Final Loss | 2.43 |
| Throughput | ~338K tokens/sec |
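The WSD schedule holds the peak learning rate flat between warmup and a final decay phase. The warmup length and step count below come from the table; the decay fraction (final 10%) and the linear decay shape are assumptions, since the card does not state them.

```python
def wsd_lr(step, max_lr=3e-4, warmup=1000, total_steps=19070, decay_frac=0.1):
    """Warmup-Stable-Decay learning-rate schedule (sketch).

    warmup and total_steps match the table above; decay_frac and the
    linear decay shape are assumptions.
    """
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:                       # linear warmup from ~0 to max_lr
        return max_lr * (step + 1) / warmup
    if step < decay_start:                  # stable plateau at the peak LR
        return max_lr
    # linear decay to zero over the final fraction of training
    return max_lr * (total_steps - step) / (total_steps - decay_start)
```

The appeal of WSD over cosine decay is that the stable phase can be extended or checkpointed mid-run without committing to a total step count in advance.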
### Stage 2: Supervised Fine-Tuning (SFT)
| Detail | Value |
|---|---|
| Dataset | HuggingFaceH4/ultrachat_200k (207,865 conversations) |
| Steps | 3,240 (2 epochs) |
| Duration | ~52 minutes |
| Optimizer | AdamW (lr=2e-5, cosine decay) |
| Batch Size | 256 sequences |
| Final Loss | 1.20 |
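A common detail when fine-tuning on chat data is to compute the loss only on assistant tokens, so the model is not trained to imitate user turns. Whether this run masked labels that way is not stated in the card; the sketch below shows the usual approach, using the special-token IDs documented later in this card (32001 = `<|assistant|>`, 32002 = `<|end|>`).

```python
IGNORE_INDEX = -100                 # ignored by PyTorch's cross-entropy loss
ASSISTANT_ID, END_ID = 32001, 32002

def mask_labels(input_ids):
    """Copy assistant-turn tokens into labels; mask everything else.

    A sketch of standard SFT label masking -- the card does not confirm
    this model used it.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    in_assistant = False
    for i, tok in enumerate(input_ids):
        if tok == ASSISTANT_ID:
            in_assistant = True         # the marker itself is not a target
        elif tok == END_ID and in_assistant:
            labels[i] = tok             # learn to emit the end-of-turn token
            in_assistant = False
        elif in_assistant:
            labels[i] = tok
    return labels
```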
### Stage 3: Direct Preference Optimization (DPO)
| Detail | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (60,917 pairs) |
| Steps | 952 (1 epoch) |
| Duration | ~14 minutes |
| Optimizer | AdamW (lr=5e-7, cosine decay) |
| Beta | 0.1 |
| Batch Size | 64 pairs |
| Final Loss | 0.49 |
| Final Accuracy | 72.5% (chosen preferred over rejected) |
| Final Reward Margin | 0.84 |
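The standard DPO objective behind these numbers can be written in a few lines. The sketch below operates on summed log-probabilities of each response; the "reward margin" in the table corresponds to the beta-scaled gap between the chosen and rejected implicit rewards.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss (sketch) on summed response log-probs.

    Returns (loss, reward_margin). The frozen reference model is the
    SFT checkpoint; beta=0.1 matches the table above.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))
    return loss, margin

# At initialization the policy equals the reference, so the margin is 0
# and the loss starts at log(2) ~= 0.693.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

A final loss of 0.49 with a 0.84 margin is consistent with this picture: as the policy pushes chosen responses above rejected ones relative to the reference, the sigmoid term drives the loss below log(2).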
### Hardware
- **8x NVIDIA H100 80GB HBM3**
- **Distributed Strategy**: PyTorch DDP (DistributedDataParallel)
- **Communication**: NCCL
- **Mixed Precision**: BF16 autocast
- **Total Training Time**: ~13.5 hours (all 3 stages)
## Chat Template
The model uses a simple two-role chat template built from three special tokens:
```
<|user|>
Your message here
<|end|>
<|assistant|>
Model response here
<|end|>
```
### Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<|user|>` | 32000 | Start of user turn |
| `<|assistant|>` | 32001 | Start of assistant turn |
| `<|end|>` | 32002 | End of turn |
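A minimal prompt builder following the template above might look like this. The exact newline placement is inferred from the example and should be verified against the tokenizer shipped with the model.

```python
def build_prompt(messages):
    """Format a conversation per the template above, leaving the prompt
    open at <|assistant|> so generation produces the next reply.

    Newline placement is inferred from the template example -- check it
    against the shipped tokenizer's chat_template before relying on it.
    """
    parts = []
    for msg in messages:  # each msg: {"role": "user" | "assistant", "content": str}
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = build_prompt([{"role": "user", "content": "What is RMSNorm?"}])
print(prompt)
```

With `transformers`, the equivalent string typically comes from `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, assuming the exported tokenizer includes a matching `chat_template`.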
## Limitations
- **1.1B parameters** -- smaller models have inherent limitations in reasoning depth and factual accuracy
- Trained on English data only
- May generate plausible-sounding but incorrect information
- The DPO alignment is single-epoch; additional iterations could improve quality
- Not safety-tuned beyond what the UltraFeedback dataset provides
## Training Code
The full training code is open-sourced alongside this model.
```
model/
config.py # Model and training hyperparameters
transformer.py # Full transformer implementation from scratch
data.py # Pretraining data pipeline (FineWeb-Edu)
sft_data.py # SFT data pipeline (UltraChat)
dpo_data.py # DPO data pipeline (UltraFeedback)
train.py # Pretraining script (DDP, 8-GPU)
train_sft.py # SFT script
train_dpo.py # DPO script
chat.py # Interactive chat interface
export_to_hf.py # Export to HuggingFace format
```
## License
Apache 2.0