---
language:
- en
license: apache-2.0
tags:
- gpt2
- causal-lm
- text-generation
- from-scratch
- fineweb
- undertrained
library_name: transformers
pipeline_tag: text-generation
---

# Llara

<img src="data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4KPHN2ZyB2ZXJzaW9uPSIxLjEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgd2lkdGg9IjUwMCIgaGVpZ2h0PSIyMDAiIHN0eWxlPSJiYWNrZ3JvdW5kLWNvbG9yOiAjRkZGRkZGOyI+CiAgPGRlZnM+CiAgICA8c3R5bGUgdHlwZT0idGV4dC9jc3MiPgogICAgICBAaW1wb3J0IHVybCgnaHR0cHM6Ly9mb250cy5nb29nbGVhcGlzLmNvbS9jc3MyP2ZhbWlseT1JQk0rUGxleCtTYW5zOml0YWwsd2dodEAwLDEwMC4uNzAwOzEsMTAwLi43MDAnKTsKICAgICAgCiAgICAgIC5jdXN0b20tdGV4dCB7CiAgICAgICAgZm9udC1mYW1pbHk6ICdJQk0gUGxleCBTYW5zJywnUm9ib3RvJywgc2Fucy1zZXJpZjsKICAgICAgICBmb250LXNpemU6IDcwcHg7CiAgICAgICAgZmlsbDogIzAwMDAwMDsKICAgICAgICBmb250LXdlaWdodDogNjAwOyAgCiAgICAgIH0KICAgIDwvc3R5bGU+CiAgPC9kZWZzPgo8cGF0aCBkPSJNMCAwIEM2NiAwIDEzMiAwIDIwMCAwIEMyMDAgNjYgMjAwIDEzMiAyMDAgMjAwIEMxMzQgMjAwIDY4IDIwMCAwIDIwMCBDMCAxMzQgMCA2OCAwIDAgWiAiIGZpbGw9IiNGQUZBRkEiIHRyYW5zZm9ybT0idHJhbnNsYXRlKDAsMCkiLz4KPHBhdGggZD0iTTAgMCBDMzkuMjcgMCA3OC41NCAwIDExOSAwIEMxMTkgMzkuMjcgMTE5IDc4LjU0IDExOSAxMTkgQzEwNi4xMyAxMTkgOTMuMjYgMTE5IDgwIDExOSBDODAgOTIuOTMgODAgNjYuODYgODAgNDAgQzUzLjYgNDAgMjcuMiA0MCAwIDQwIEMwIDI2LjggMCAxMy42IDAgMCBaICIgZmlsbD0iIzAxMDEwMSIgdHJhbnNmb3JtPSJ0cmFuc2xhdGUoNDAsNDApIi8+CjxwYXRoIGQ9Ik0wIDAgQzEzLjIgMCAyNi40IDAgNDAgMCBDNDAgMTIuODcgNDAgMjUuNzQgNDAgMzkgQzI2LjggMzkgMTMuNiAzOSAwIDM5IEMwIDI2LjEzIDAgMTMuMjYgMCAwIFogIiBmaWxsPSIjMDIwMjAyIiB0cmFuc2Zvcm09InRyYW5zbGF0ZSg0MCwxMjApIi8+CiAgPHRleHQgeD0iMjAwIiB5PSIxMzUiIGNsYXNzPSJjdXN0b20tdGV4dCI+TGxhcmExLjA8L3RleHQ+Cjwvc3ZnPgo=">


Llara is a 91.4M-parameter autoregressive language model trained from scratch on English web text. It follows the GPT-2 Small architecture and is trained entirely from random initialisation: no pretrained weights, no distillation, no fine-tuning of an existing model. The only component it reuses is the GPT-2 BPE tokeniser.

The name **Llara** is original and unrelated to LLaMA or LoRA.

**Note**: The model is undertrained relative to the Chinchilla scaling laws (Hoffmann et al., 2022), which suggest roughly 20 training tokens per parameter.
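
As a rough illustration of what "undertrained" means here, the sketch below compares the token count reported in this card with a Chinchilla-style budget. The ratio of 20 tokens per parameter is a commonly quoted rule of thumb and is an assumption of this sketch, not an exact law; `chinchilla_budget` is a hypothetical helper name.

```python
# Rough Chinchilla-style estimate. The 20 tokens-per-parameter ratio is a
# rule of thumb (an assumption here), not an exact law.
PARAMS = 91.4e6          # parameter count stated in this card
TRAINED_TOKENS = 256e6   # training tokens stated in this card
TOKENS_PER_PARAM = 20    # assumed heuristic


def chinchilla_budget(params, tokens_per_param=TOKENS_PER_PARAM):
    """Return the approximate compute-optimal number of training tokens."""
    return params * tokens_per_param


optimal_tokens = chinchilla_budget(PARAMS)
fraction_seen = TRAINED_TOKENS / optimal_tokens

print(f"compute-optimal tokens: {optimal_tokens / 1e9:.2f}B")  # 1.83B
print(f"fraction actually seen: {fraction_seen:.0%}")          # 14%
```

Under these assumptions the model has seen roughly one seventh of the tokens a compute-optimal run would use.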

---

## Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 91.4M |
| Context length | 256 tokens |
| Embedding dim | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Training data | FineWeb (HuggingFaceFW/fineweb) + Custom dataset |
| Training tokens | 256,000,000 |
| Epochs | 1 |
| Precision | fp16 |

---

## Training

Llara was trained on 1 million documents sampled from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), a large-scale curated English web dataset. Documents were tokenised with the GPT-2 BPE tokeniser and packed into non-overlapping 1024-token blocks.
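
The packing step described above can be sketched in plain Python. This is a minimal illustration, not the actual preprocessing script: it assumes documents are joined with GPT-2's end-of-text token (id 50256) and that a trailing partial block is dropped. `pack_blocks` is a hypothetical helper name.

```python
from typing import Iterable, List

EOS_ID = 50256      # GPT-2's <|endoftext|> token id
BLOCK_SIZE = 1024   # block length stated in the text above


def pack_blocks(docs: Iterable[List[int]], block_size: int = BLOCK_SIZE,
                eos_id: int = EOS_ID) -> List[List[int]]:
    """Concatenate tokenised documents and cut non-overlapping blocks.

    Assumptions of this sketch: an EOS token is appended after every
    document, and the trailing partial block is dropped.
    """
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy example with a tiny block size so the result is easy to inspect.
blocks = pack_blocks([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Packing avoids padding: every position in every block carries a real token, at the cost of some blocks spanning a document boundary.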

**Training configuration:**

| Hyperparameter | Value |
|---|---|
| Optimiser | AdamW |
| Learning rate | 3e-4 |
| LR schedule | Cosine decay |
| Warmup steps | 2,000 |
| Weight decay | 0.1 |
| Effective batch size | 32 |
| Gradient accumulation | 8 steps |
| Dropout | 0.1 (residual, embedding, attention) |

Gradient checkpointing was enabled throughout training to reduce memory usage.
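
The learning-rate settings in the table can be sketched as a function of the optimiser step. The card does not state the warmup shape, the decay floor, or the total number of steps, so this sketch assumes linear warmup, decay to zero, and a hypothetical `TOTAL_STEPS`; `lr_at` is an illustrative helper, not the training code. It also shows the per-device batch size implied by the table on a single GPU.

```python
import math

PEAK_LR = 3e-4         # from the table above
WARMUP_STEPS = 2_000   # from the table above


def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Effective batch = per-device batch x gradient accumulation steps (one GPU).
EFFECTIVE_BATCH = 32
GRAD_ACCUM_STEPS = 8
per_device_batch = EFFECTIVE_BATCH // GRAD_ACCUM_STEPS  # 4 sequences per forward pass

TOTAL_STEPS = 10_000  # hypothetical value, chosen only for illustration
for step in (0, 1_000, 2_000, 6_000, 10_000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The rate rises linearly to 3e-4 over the first 2,000 steps and then follows half a cosine wave down towards zero at the final step.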

---

## Usage

```python
from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline

model = GPT2LMHeadModel.from_pretrained("helloadhavan/llara1.0-100M-base")
tokenizer = AutoTokenizer.from_pretrained("helloadhavan/llara1.0-100M-base")

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

output = gen(
    "The history of artificial intelligence",
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
)

print(output[0]["generated_text"])
```
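
Because GPT-2 uses learned absolute position embeddings, the prompt plus the generated tokens must fit inside the 256-token context window. The sketch below shows one way to trim a long prompt before generation; `fit_prompt` is a hypothetical helper and keeping the most recent tokens is an assumption about what is usually wanted, not part of the model's API.

```python
from typing import List

CONTEXT_LENGTH = 256  # from the Model Details table


def fit_prompt(prompt_ids: List[int], max_new_tokens: int,
               context_length: int = CONTEXT_LENGTH) -> List[int]:
    """Keep only the most recent prompt tokens that leave room for generation."""
    if max_new_tokens >= context_length:
        raise ValueError("max_new_tokens must be smaller than the context length")
    budget = context_length - max_new_tokens
    return prompt_ids[-budget:]


# A 300-token prompt with max_new_tokens=200 leaves room for 56 prompt tokens.
long_prompt = list(range(300))
trimmed = fit_prompt(long_prompt, max_new_tokens=200)
print(len(trimmed))  # 56
```

Short prompts pass through unchanged, so the helper can be applied to every request.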

---

## Limitations

- Llara is trained on English web text only and performs poorly on other languages.
- Like all autoregressive LMs trained on web data, it may reproduce biases, factual errors, or inappropriate content present in the training corpus.
- It is a research model trained from scratch and is neither instruction-tuned nor aligned. It should not be used in production or user-facing applications without further fine-tuning and safety work.
- At roughly 91.4M parameters and about 256M training tokens, it is smaller and far less trained than models like GPT-2 (which saw 40GB of text). Outputs may be incoherent on complex prompts.

---

## Intended Use

Llara is intended for:

- Research and experimentation with small language models
- Learning how GPT-style models are trained from scratch
- A base for fine-tuning on downstream tasks

---

## Training Framework

Trained using [Hugging Face Transformers](https://github.com/huggingface/transformers) `Trainer` on a single GPU.

---

## License

Apache 2.0

<div>
  <blockquote><strong>Note:</strong> I am an AI hobbyist, not an AI engineer.</blockquote>
</div>