---
language: en
license: mit
tags:
- causal-lm
- custom-architecture
- transformer
pipeline_tag: text-generation
---

# Custom 57M Language Model

A custom 57.55M-parameter causal language model built on a modern Transformer architecture.

## Model Details

- **Parameters**: 57,553,632 (57.55M)
- **Architecture**: 12-layer Transformer
- **Hidden Size**: 432
- **Attention Heads**: 8
- **Head Dimension**: 54
- **Intermediate Size**: 1,728
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 1,024

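As a sanity check, the stated total is consistent with a bias-free decoder that ties the input and output embeddings and uses two RMSNorms per layer plus one final norm. The exact layer breakdown below is an assumption rather than the released code, but the arithmetic reproduces the figure above:

```python
# Back-of-the-envelope parameter count, assuming bias-free linear layers,
# tied input/output embeddings, two RMSNorms per layer, and one final RMSNorm.
vocab, hidden, layers, ffn = 50_257, 432, 12, 1_728

embeddings = vocab * hidden       # token embeddings (shared with the LM head)
attention  = 4 * hidden * hidden  # Q, K, V, O projections
swiglu_ffn = 3 * hidden * ffn     # gate, up, and down projections
norms      = 2 * hidden           # two RMSNorm weight vectors per layer

total = embeddings + layers * (attention + swiglu_ffn + norms) + hidden  # + final norm
print(f"{total:,}")  # 57,553,632
```
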
## Architecture Features

- **RoPE Positional Embeddings**: Rotary Position Embedding (θ=10000.0)
- **SwiGLU Activation**: Swish-Gated Linear Unit in the feed-forward networks
- **RMSNorm**: Root Mean Square Layer Normalization (ε=1e-06)
- **Tied Embeddings**: Input and output embeddings share weights
- **Dropout**: 0.1 dropout rate

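The components above follow their standard formulations. A minimal PyTorch sketch of the three less common pieces (RMSNorm, a SwiGLU feed-forward block, and rotary position embeddings) is shown below; it is illustrative only, and the module and argument names are assumptions, not this model's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescale by the RMS of the features, no mean centering."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down(silu(gate(x)) * up(x)), here with hidden=432 and intermediate=1728."""

    def __init__(self, hidden: int = 432, intermediate: int = 1728, dropout: float = 0.1):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.down(F.silu(self.gate(x)) * self.up(x)))


def apply_rope(q, k, theta: float = 10_000.0):
    """Rotate query/key channel pairs by position-dependent angles (RoPE)."""
    # q, k: (batch, heads, seq_len, head_dim); head_dim must be even (here 54)
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x):
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return out.flatten(-2)

    return rotate(q), rotate(k)
```
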
## Training Configuration

- **Dummy Phase**: 2 epochs, 1,000 samples, LR=0.0005
- **C4 Phase**: 3 epochs, 1,000 samples, LR=0.0003
- **Optimizer**: AdamW (weight_decay=0.1)
- **Scheduler**: Cosine Annealing
- **Gradient Clipping**: 1.0

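A minimal PyTorch training-step sketch with these settings is shown below. Only the optimizer, weight decay, cosine schedule, learning rate, and clipping value come from the table above; the stand-in model, data, and loss are placeholders so the snippet runs on its own:

```python
import torch
import torch.nn as nn

# Stand-ins so the snippet is self-contained; swap in the real model and dataloader.
model = nn.Linear(432, 432)
train_loader = [torch.randn(8, 432) for _ in range(10)]
num_epochs = 3                                   # C4 phase: 3 epochs
total_steps = num_epochs * len(train_loader)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model(batch).pow(2).mean()        # placeholder loss; a real run uses the LM loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```
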
## Generation Parameters

- **Temperature**: 0.8
- **Top-K**: 50
- **Top-P**: 0.9
- **Repetition Penalty**: 1.1
- **Max New Tokens**: 100

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/custom-57m-language-model")
model = AutoModelForCausalLM.from_pretrained("your-username/custom-57m-language-model")

input_text = "The future of artificial intelligence"
inputs = tokenizer.encode(input_text, return_tensors="pt")

outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,          # temperature/top-k/top-p only take effect when sampling
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Training Dataset

- **Primary**: C4 (Colossal Clean Crawled Corpus)
- **Warm-up**: Synthetic dummy data for initial training

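If you want to reproduce the data pipeline, C4 can be streamed through the `datasets` library. The exact configuration used for training is not published, so the snippet below is only a plausible setup (English split, streamed, truncated to the 1,024-token context):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup: English C4, streamed so the full corpus is never downloaded.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = c4.map(tokenize)
print(next(iter(tokenized))["input_ids"][:10])
```
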
## License

MIT License

## Model Card

This model was trained as an educational demonstration of a transformer implementation using modern techniques such as RoPE embeddings and SwiGLU activations.