---
language: en
license: mit
tags:
- causal-lm
- custom-architecture
- transformer
pipeline_tag: text-generation
---
# Custom 57M Language Model
A custom 57.55M parameter causal language model with a modern transformer architecture.
## Model Details
- **Parameters**: 57,553,632 (57.55M; see the parameter breakdown sketched below)
- **Architecture**: 12-layer Transformer
- **Hidden Size**: 432
- **Attention Heads**: 8
- **Head Dimension**: 54
- **Intermediate Size**: 1,728
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 1,024
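
As a rough check, the sketch below reproduces the stated parameter count from the sizes above. The layout it assumes (tied embeddings, bias-free projections, a gate/up/down SwiGLU block, two RMSNorm scale vectors per layer plus a final norm) is an assumption for illustration, not a dump of the actual module tree.
```python
# Back-of-the-envelope parameter count under the assumed layout:
# tied embeddings, no bias terms, SwiGLU (gate/up/down), two RMSNorms per layer.
vocab, d_model, n_layers, d_ff = 50_257, 432, 12, 1_728

embeddings = vocab * d_model          # 21,711,024 (output head tied to input embeddings)
attention  = 4 * d_model * d_model    # q, k, v, o projections
swiglu_ffn = 3 * d_model * d_ff       # gate, up, down projections
norms      = 2 * d_model              # two RMSNorm scale vectors per layer

total = embeddings + n_layers * (attention + swiglu_ffn + norms) + d_model  # + final RMSNorm
print(f"{total:,}")                   # 57,553,632
```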
## Architecture Features
- **RoPE Positional Embeddings**: Rotary Position Embedding (θ=10000.0)
- **SwiGLU Activation**: Swish-Gated Linear Unit in the feed-forward networks (sketched together with RMSNorm after this list)
- **RMSNorm**: Root Mean Square Layer Normalization (ε=1e-06)
- **Tied Embeddings**: Input and output embeddings share weights
- **Dropout**: 0.1 dropout rate
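
For reference, here is a minimal PyTorch sketch of the RMSNorm and SwiGLU blocks listed above. Module and parameter names are illustrative; the shipped implementation may differ in detail.
```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization with eps=1e-6, as listed above."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the inverse RMS over the hidden dimension, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int = 432, hidden: int = 1_728):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

# Pre-norm feed-forward sub-block on a dummy batch.
x = torch.randn(1, 16, 432)
y = SwiGLU()(RMSNorm(432)(x))
print(y.shape)  # torch.Size([1, 16, 432])
```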
## Training Configuration
- **Dummy Phase**: 2 epochs, 1,000 samples, LR=0.0005
- **C4 Phase**: 3 epochs, 1,000 samples, LR=0.0003
- **Optimizer**: AdamW (weight_decay=0.1)
- **Scheduler**: Cosine Annealing
- **Gradient Clipping**: 1.0 (see the training-step sketch below)
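
The following is a minimal sketch of one training step under the configuration above (AdamW with weight_decay=0.1, cosine-annealed learning rate, gradient clipping at 1.0). The toy model, placeholder batch, loss, and step count are assumptions for illustration, not the actual training script.
```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

model = nn.Linear(8, 8)  # stand-in for the 57M model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # C4-phase LR
num_steps = 1_000  # placeholder schedule length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

for step in range(num_steps):
    x, y = torch.randn(4, 8), torch.randn(4, 8)          # placeholder batch
    loss = nn.functional.mse_loss(model(x), y)            # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    clip_grad_norm_(model.parameters(), 1.0)              # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()                                      # cosine-annealed learning rate
```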
## Generation Parameters
- **Temperature**: 0.8
- **Top-K**: 50
- **Top-P**: 0.9
- **Repetition Penalty**: 1.1
- **Max New Tokens**: 100
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Custom architectures may additionally require trust_remote_code=True when loading.
tokenizer = AutoTokenizer.from_pretrained("your-username/custom-57m-language-model")
model = AutoModelForCausalLM.from_pretrained("your-username/custom-57m-language-model")

input_text = "The future of artificial intelligence"
inputs = tokenizer.encode(input_text, return_tensors="pt")

# Sampling must be enabled for temperature / top-k / top-p to take effect.
outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Training Dataset
- **Primary**: C4 (Colossal Clean Crawled Corpus)
- **Warm-up**: Synthetic dummy data for initial training
## License
MIT License
## Model Card
This model was trained as an educational demonstration of implementing a transformer architecture with modern techniques such as RoPE positional embeddings and SwiGLU activations.