---
language:
- en
license: apache-2.0
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- fineweb
- undertrained
library_name: transformers
pipeline_tag: text-generation
---

# Llara

Llara is a 91.4M parameter autoregressive language model trained from scratch on English web text. It follows the GPT-2 Small architecture and is trained entirely from random initialisation: no pretrained weights, no distillation, and no fine-tuning of an existing model. It does, however, reuse the GPT-2 BPE tokeniser.

The name **Llara** is original and unrelated to LLaMA or LoRA.

**Note**: The model is undertrained according to the Chinchilla scaling laws (2022), which suggest roughly 20 training tokens per parameter. Llara saw about 256M tokens, which is under 3 per parameter.

---

## Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | ~90-100M |
| Context length | 256 tokens |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Training data | FineWeb (HuggingFaceFW/fineweb) + custom dataset |
| Training tokens | 256,000,000 |
| Epochs | 1 |
| Precision | fp16 |

---

## Training

Llara was trained on 1 million documents sampled from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), a large-scale curated English web dataset. Documents were tokenised with the GPT-2 BPE tokeniser and packed into non-overlapping 1024-token blocks.

**Training configuration:**

| Hyperparameter | Value |
|---|---|
| Optimiser | AdamW |
| Learning rate | 3e-4 |
| LR schedule | Cosine decay |
| Warmup steps | 2,000 |
| Weight decay | 0.1 |
| Effective batch size | 32 |
| Gradient accumulation | 8 steps |
| Dropout | 0.1 (residual, embedding, attention) |

Gradient checkpointing was enabled throughout training to reduce memory usage.

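
The learning-rate settings in the table above (peak 3e-4, 2,000 warmup steps, cosine decay) can be sketched as a small function. This is only an illustration, not the actual training code: it assumes linear warmup followed by cosine decay to zero, and `TOTAL_STEPS` is a hypothetical value, since the card does not state the total number of optimiser steps.

```python
import math

PEAK_LR = 3e-4        # from the training configuration table
WARMUP_STEPS = 2_000  # from the training configuration table
TOTAL_STEPS = 10_000  # hypothetical: the total step count is not given in this card


def lr_at(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimiser step.

    Assumes linear warmup from 0 to peak_lr, then cosine decay to 0.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 1_000, 2_000, 6_000, 10_000):
        print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate climbs linearly to 3e-4 over the first 2,000 steps, passes through half the peak midway through the decay phase, and reaches zero at the final step.
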

---

## Usage

```python
from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline

# Load the model weights and the GPT-2 BPE tokeniser from the Hub.
model = GPT2LMHeadModel.from_pretrained("helloadhavan/llara1.0-100M-base")
tokenizer = AutoTokenizer.from_pretrained("helloadhavan/llara1.0-100M-base")

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Prompt plus generated tokens should fit in the 256-token context window.
output = gen(
    "The history of artificial intelligence",
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(output[0]["generated_text"])
```

---

## Limitations

- Llara is trained on English web text only and performs poorly on other languages.
- Like all autoregressive LMs trained on web data, it may reproduce biases, factual errors, or inappropriate content present in the training corpus.
- It is a research model trained from scratch and is not instruction-tuned or aligned. It should not be used in production or user-facing applications without further fine-tuning and safety work.
- At under 100M parameters and about 256M training tokens, it is smaller and far less trained than models like GPT-2 (which was trained on 40GB of text). Outputs may be incoherent on complex prompts.

---

## Intended Use

Llara is intended for:

- Research and experimentation with small language models
- Learning how GPT-style models are trained from scratch
- Use as a base for fine-tuning on downstream tasks

---

## Training Framework

Trained using the [Hugging Face Transformers](https://github.com/huggingface/transformers) `Trainer` on a single GPU.

---

## License

Apache 2.0

Note: I am an AI hobbyist, not an AI engineer.