---
language: en
license: mit
tags:
- causal-lm
- custom-architecture
- transformer
pipeline_tag: text-generation
---
# Custom 57M Language Model
A custom 57.55M-parameter causal language model built on a modern transformer architecture.
## Model Details
- **Parameters**: 57,553,632 (57.55M)
- **Architecture**: 12-layer Transformer
- **Hidden Size**: 432
- **Attention Heads**: 8
- **Head Dimension**: 54
- **Intermediate Size**: 1,728
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 1,024
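
The parameter total above can be cross-checked with a quick back-of-the-envelope count. This sketch assumes bias-free linear layers, a three-matrix SwiGLU FFN, and two RMSNorm scales per layer plus one final norm — reasonable defaults for this kind of architecture, though the card does not spell them out:

```python
# Rough parameter count from the figures in the Model Details section.
# Assumes: bias-free linear layers, tied input/output embeddings,
# a three-matrix SwiGLU FFN, and pre-norm RMSNorm (2 per layer + 1 final).
vocab_size, hidden, layers, inter = 50_257, 432, 12, 1_728

embedding = vocab_size * hidden                  # shared with the LM head (tied)
attention = 4 * hidden * hidden                  # Q, K, V, O projections
ffn       = 3 * hidden * inter                   # gate, up, down (SwiGLU)
norms     = 2 * hidden                           # two RMSNorm scales per layer
per_layer = attention + ffn + norms

total = embedding + layers * per_layer + hidden  # + final RMSNorm scale
print(total)  # 57553632 -- matches the reported 57,553,632
```

Under these assumptions the sum lands exactly on the reported figure, which also confirms the head dimension (432 / 8 = 54) and the 4x FFN expansion (432 × 4 = 1,728).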
## Architecture Features
- **RoPE Positional Embeddings**: Rotary Position Embedding (θ=10000.0)
- **SwiGLU Activation**: Swish-Gated Linear Unit in feed-forward networks
- **RMSNorm**: Root Mean Square Layer Normalization (ε=1e-06)
- **Tied Embeddings**: Input and output embeddings share weights
- **Dropout**: 0.1 dropout rate
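
For readers unfamiliar with RMSNorm and SwiGLU, here is a minimal NumPy sketch of both operations. This is illustrative only — function names and initialization are hypothetical, not the model's actual implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the reciprocal root-mean-square (no mean subtraction)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU FFN: SiLU(x @ W_gate) gates (x @ W_up), then project back down."""
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU / Swish activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

hidden, inter = 432, 1_728
rng = np.random.default_rng(0)
x = rng.standard_normal((1, hidden))
out = swiglu_ffn(
    rms_norm(x, np.ones(hidden)),            # pre-norm, as in most modern stacks
    rng.standard_normal((hidden, inter)) * 0.02,
    rng.standard_normal((hidden, inter)) * 0.02,
    rng.standard_normal((inter, hidden)) * 0.02,
)
print(out.shape)  # (1, 432)
```

Note that SwiGLU uses three weight matrices (gate, up, down) where a standard GELU FFN uses two, which is why the intermediate size of 1,728 yields the FFN parameter count it does.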
## Training Configuration
- **Dummy Phase**: 2 epochs, 1,000 samples, LR=0.0005
- **C4 Phase**: 3 epochs, 1,000 samples, LR=0.0003
- **Optimizer**: AdamW (weight_decay=0.1)
- **Scheduler**: Cosine Annealing
- **Gradient Clipping**: 1.0
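
The cosine-annealing schedule listed above can be written as a small closed-form function. This is a generic sketch (step counts and the zero LR floor are assumptions, not values from the training run):

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay from lr_max to lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# C4 phase: LR starts at 3e-4 and decays smoothly toward zero.
print(cosine_annealed_lr(0, 1000, 3e-4))    # 0.0003 at the start
print(cosine_annealed_lr(500, 1000, 3e-4))  # 0.00015 at the midpoint
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR`, paired with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` for the gradient clipping listed above.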
## Generation Parameters
- **Temperature**: 0.8
- **Top-K**: 50
- **Top-P**: 0.9
- **Repetition Penalty**: 1.1
- **Max New Tokens**: 100
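
The top-k and top-p parameters combine as successive filters on the next-token logits. A minimal NumPy sketch of that filtering step follows — illustrative only, not the model's actual sampling code:

```python
import numpy as np

def filter_logits(logits, top_k=50, top_p=0.9):
    """Keep the top-k logits, then the smallest nucleus whose mass reaches top_p."""
    logits = logits.copy()
    # Top-K: mask everything below the k-th largest logit.
    kth = np.sort(logits)[-top_k]
    logits[logits < kth] = -np.inf
    # Top-P (nucleus): keep the most probable tokens covering top_p of the mass.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1   # smallest prefix reaching top_p
    logits[order[cutoff:]] = -np.inf
    return logits

rng = np.random.default_rng(0)
logits = rng.standard_normal(50_257)           # one fake next-token distribution
filtered = filter_logits(logits, top_k=50, top_p=0.9)
kept = int(np.isfinite(filtered).sum())
print(kept)  # at most 50 candidates survive both filters
```

Temperature (0.8) is applied by dividing the logits before this step, sharpening the distribution; the repetition penalty (1.1) additionally down-weights tokens already present in the context.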
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/custom-57m-language-model")
model = AutoModelForCausalLM.from_pretrained("your-username/custom-57m-language-model")

input_text = "The future of artificial intelligence"
inputs = tokenizer(input_text, return_tensors="pt")

# do_sample=True is required for temperature/top-k/top-p to take effect;
# max_new_tokens matches the "Max New Tokens" setting above.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Training Dataset
- **Primary**: C4 (Colossal Clean Crawled Corpus)
- **Warm-up**: Synthetic dummy data for initial training
## License
MIT License
## Model Card
This model was trained as an educational demonstration of transformer architecture implementation with modern techniques like RoPE embeddings and SwiGLU activations.