---
language: en
license: mit
tags:
- causal-lm
- custom-architecture
- transformer
pipeline_tag: text-generation
---

# Custom 57M Language Model

A custom 57.55M-parameter causal language model with a modern transformer architecture.

## Model Details

- **Parameters**: 57,553,632 (57.55M)
- **Architecture**: 12-layer Transformer
- **Hidden Size**: 432
- **Attention Heads**: 8
- **Head Dimension**: 54
- **Intermediate Size**: 1,728
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 1,024

## Architecture Features

- **RoPE Positional Embeddings**: Rotary Position Embedding (θ = 10000.0)
- **SwiGLU Activation**: Swish-Gated Linear Unit in the feed-forward networks
- **RMSNorm**: Root Mean Square Layer Normalization (ε = 1e-06)
- **Tied Embeddings**: input and output embeddings share weights
- **Dropout**: 0.1

A minimal code sketch of these components appears at the end of this card.

## Training Configuration

- **Dummy Phase**: 2 epochs, 1,000 samples, LR = 0.0005
- **C4 Phase**: 3 epochs, 1,000 samples, LR = 0.0003
- **Optimizer**: AdamW (weight_decay = 0.1)
- **Scheduler**: Cosine Annealing
- **Gradient Clipping**: 1.0

These settings are sketched as a runnable loop at the end of this card.

## Generation Parameters

- **Temperature**: 0.8
- **Top-K**: 50
- **Top-P**: 0.9
- **Repetition Penalty**: 1.1
- **Max New Tokens**: 100

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/custom-57m-language-model")
# trust_remote_code=True is typically required to load a custom architecture
model = AutoModelForCausalLM.from_pretrained(
    "your-username/custom-57m-language-model",
    trust_remote_code=True,
)

input_text = "The future of artificial intelligence"
inputs = tokenizer.encode(input_text, return_tensors="pt")

outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,  # required for temperature/top-k/top-p sampling to take effect
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Training Dataset

- **Primary**: C4 (Colossal Clean Crawled Corpus)
- **Warm-up**: synthetic dummy data for initial training

## License

MIT License

## Model Card

This model was trained as an educational demonstration of a transformer implementation using modern techniques such as RoPE embeddings and SwiGLU activations.
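## Architecture Sketch

For readers who want to see how the components listed under Architecture Features fit together, below is a minimal, self-contained PyTorch sketch of RoPE, RMSNorm, and SwiGLU using this card's dimensions. The class and function names are illustrative, not the model's actual module names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (eps matches the card's 1e-06)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square of the features
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Swish-gated feed-forward: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int = 432, hidden: int = 1728):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def rope_angles(head_dim: int = 54, seq_len: int = 1024, theta: float = 10000.0):
    """Precompute RoPE rotation angles (theta matches the card's 10000.0)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    return torch.outer(pos, inv_freq)  # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x: (..., seq_len, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Smoke test with the card's dimensions
q = torch.randn(1, 8, 1024, 54)  # (batch, heads, seq, head_dim)
q_rot = apply_rope(q, rope_angles())
y = SwiGLU()(RMSNorm(432)(torch.randn(2, 16, 432)))
```

Bias-free linear layers in the SwiGLU block follow the common LLaMA-style convention; whether this model includes biases is not stated in the card.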
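## Training Setup Sketch

The exact training script is not published. The following sketch only wires together the settings listed under Training Configuration (AdamW with weight_decay = 0.1, cosine annealing, gradient clipping at 1.0, C4-phase LR = 0.0003); the model and data here are illustrative stand-ins.

```python
import torch
import torch.nn as nn

# Stand-ins for the real model and dataloader (not published with this card)
model = nn.Linear(432, 50257)
data = [torch.randn(8, 432) for _ in range(10)]
targets = [torch.randint(0, 50257, (8,)) for _ in range(10)]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # C4-phase LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=len(data))
loss_fn = nn.CrossEntropyLoss()

for x, y in zip(data, targets):
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the global gradient norm to 1.0, as listed in Training Configuration
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```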