---
language:
- en
license: apache-2.0
tags:
- agora
- causal-lm
- transformer
- gqa
- rope
library_name: transformers
pipeline_tag: text-generation
---

# Agora

**Agora** is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout, combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.

## Architecture

| Parameter           | Value              |
|---------------------|--------------------|
| Hidden size         | 2048               |
| Intermediate size   | 8192               |
| Layers              | 24                 |
| Attention heads     | 16                 |
| KV heads (GQA)      | 8                  |
| Head dimension      | 128                |
| Max sequence length | 4096               |
| Vocabulary size     | 32 000             |
| Activation          | SiLU (SwiGLU gate) |
| Positional encoding | RoPE (θ = 10 000)  |
| Normalisation       | RMSNorm (ε = 1e-5) |
| Precision           | bfloat16           |

Total parameters: **~1.6 B** (estimate computed from the values above, assuming bias-free projections: about 1.58 B with tied input/output embeddings, about 1.64 B untied).

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Scantrack/Agora"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the precision listed in the Architecture table
    device_map="auto",           # place weights on the available device(s)
    trust_remote_code=True,      # required: Agora ships custom config/model classes
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; gradients are not needed for generation.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

> **Note:** Pass `trust_remote_code=True` because the config and model classes are custom (`configuration_agora.py`, `modeling_agora.py`).

## Design Decisions

**GQA (8 KV heads, 16 query heads):** halves the KV cache size versus standard multi-head attention (MHA) while keeping all 16 query heads. The smaller cache eases the memory-bandwidth bottleneck during inference and lets roughly twice the batch size fit in the same KV-cache memory budget.

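
To make the GQA saving concrete, the sketch below estimates the KV cache size from the values in the Architecture table. The helper name `kv_cache_bytes` is hypothetical and not part of `modeling_agora.py`; it assumes a plain, unquantised bfloat16 cache (2 bytes per element) that stores one key tensor and one value tensor per layer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `batch_size` sequences.

    Hypothetical helper for illustration; assumes an unquantised cache.
    """
    # Factor 2: every layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Values taken from the Architecture table.
LAYERS, HEAD_DIM, MAX_SEQ_LEN = 24, 128, 4096

# Agora's configuration: 8 KV heads shared by 16 query heads.
gqa_bytes = kv_cache_bytes(LAYERS, num_kv_heads=8,
                           head_dim=HEAD_DIM, seq_len=MAX_SEQ_LEN)
# Hypothetical MHA baseline: one KV head per query head.
mha_bytes = kv_cache_bytes(LAYERS, num_kv_heads=16,
                           head_dim=HEAD_DIM, seq_len=MAX_SEQ_LEN)

print(f"GQA: {gqa_bytes / 2**20:.0f} MiB per sequence")  # GQA: 384 MiB per sequence
print(f"MHA: {mha_bytes / 2**20:.0f} MiB per sequence")  # MHA: 768 MiB per sequence
```

Under these assumptions, a full 4096-token sequence needs about 384 MiB of cache with GQA versus 768 MiB with MHA, which is where the factor of two in batch size comes from.
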
**RoPE:** query and key vectors are rotated by a position-dependent angle, so attention scores depend on relative position without any learned position embeddings. This makes the model more naturally extensible to longer contexts.

**SwiGLU:** a gated feed-forward layer built on SiLU, computed as `down_proj(SiLU(gate_proj(x)) * up_proj(x))`. Gated variants like this are generally reported to outperform standard FFN layers at equivalent parameter count.

**RMSNorm:** faster than LayerNorm (no mean subtraction), numerically stable, and standard in modern LLMs.

**bfloat16:** preferred over fp16 for training stability (larger dynamic range); inference runs cleanly on any Ampere+ GPU or modern CPU with bfloat16 support.

## Tokenizer

Agora uses the **LLaMA tokenizer** (SentencePiece, BPE, 32 000 vocab). You can swap in any compatible SentencePiece model by replacing `tokenizer.model` and updating `tokenizer_config.json`. A replacement must keep the vocabulary at 32 000 entries, since the embedding matrix is sized to match.

## Training

*(Fill in once training is complete.)*

- Dataset:
- Training compute:
- Optimizer:
- Learning rate schedule:
- Final loss:

## Limitations

This is a research/prototype release. Once pretraining completes, the model card will be updated with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).

## License

Apache 2.0; see `LICENSE`.