---
language:
  - en
license: apache-2.0
tags:
  - agora
  - causal-lm
  - transformer
  - gqa
  - rope
library_name: transformers
pipeline_tag: text-generation
---

# Agora

**Agora** is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout — combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.

## Architecture

| Parameter               | Value        |
|-------------------------|--------------|
| Hidden size             | 2048         |
| Intermediate size       | 8192         |
| Layers                  | 24           |
| Attention heads         | 16           |
| KV heads (GQA)          | 8            |
| Head dimension          | 128          |
| Max sequence length     | 4096         |
| Vocabulary size         | 32 000       |
| Activation              | SiLU (SwiGLU gate) |
| Positional encoding     | RoPE (θ = 10 000) |
| Normalisation           | RMSNorm (ε = 1e-5) |
| Precision               | bfloat16     |

Total parameters: **~1.6 B** (estimate from the table above, assuming bias-free projections; roughly 1.58 B with tied input/output embeddings and 1.64 B untied).
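
The total can be cross-checked by counting weights directly from the table. The following is a minimal sketch, not taken from `modeling_agora.py`: it assumes bias-free linear layers, two RMSNorm weight vectors per layer, and one final norm, none of which the card states explicitly. Under those assumptions the table implies about 1.58 B parameters with tied embeddings and about 1.64 B without.

```python
# Parameter count implied by the architecture table.
# Assumptions (not stated in the card): no bias terms, two RMSNorms per
# layer plus one final norm, separate K and V projections in every layer.
hidden, intermediate, layers = 2048, 8192, 24
n_heads, n_kv_heads, head_dim = 16, 8, 128
vocab = 32_000

q_proj = hidden * n_heads * head_dim          # query projection
kv_proj = 2 * hidden * n_kv_heads * head_dim  # key + value projections
o_proj = n_heads * head_dim * hidden          # attention output projection
mlp = 3 * hidden * intermediate               # gate_proj, up_proj, down_proj
norms = 2 * hidden                            # two RMSNorm weight vectors

per_layer = q_proj + kv_proj + o_proj + mlp + norms
embedding = vocab * hidden

tied = layers * per_layer + embedding + hidden  # "+ hidden" is the final norm
untied = tied + embedding                       # separate output (lm_head) matrix

print(f"per layer: {per_layer:,}")
print(f"tied:      {tied:,}")
print(f"untied:    {untied:,}")
```

If a released checkpoint reports a different total, at least one of these assumptions does not hold for the actual implementation.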

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Scantrack/Agora"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

> **Note:** Pass `trust_remote_code=True` because the config and model classes are custom (`configuration_agora.py`, `modeling_agora.py`).

## Design Decisions

**GQA (8 KV heads, 16 query heads)** — halves the KV cache size versus MHA with 16 KV heads while keeping all 16 query heads. The smaller cache eases the memory-bandwidth bottleneck during decoding and lets roughly twice the batch size fit in the same cache memory.
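
To make the cache saving concrete, here is a rough sketch of the KV cache size computed from the architecture table. The helper `kv_cache_bytes` is hypothetical (it is not part of the model code), and it assumes the cache stores one bfloat16 key and one bfloat16 value vector per token, per KV head, per layer.

```python
def kv_cache_bytes(batch, seq_len, layers=24, kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for all layers (bfloat16 = 2 bytes)."""
    # The leading factor 2 accounts for storing both K and V.
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_value

gqa = kv_cache_bytes(batch=1, seq_len=4096)               # 8 KV heads, as in Agora
mha = kv_cache_bytes(batch=1, seq_len=4096, kv_heads=16)  # hypothetical MHA baseline
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB")
```

At the full 4096-token context this works out to 384 MiB per sequence with GQA against 768 MiB for the MHA baseline, so a batch of two GQA sequences occupies the same cache memory as one MHA sequence.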

**RoPE** — query and key vectors are rotated by position-dependent angles, so attention scores depend on relative position without any learned position embeddings. This makes the model more naturally extensible to longer contexts than one with learned absolute embeddings.
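
The relative-position property can be checked numerically. The sketch below is plain Python for illustration, not the code in `modeling_agora.py`: it rotates adjacent pairs of dimensions, whereas many implementations pair the first half of the vector with the second half, and the 4-dimensional vectors are made up for the demo.

```python
import math

def rope(vec, pos, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * theta ** (-i / dim)  # rotation frequency falls with pair index
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # toy query vector
k = [1.0, 0.4, -0.6, 0.9]   # toy key vector

# The same offset (3) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because only the offset between positions enters the score, no position table has to be learned or stored.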

**SwiGLU** — a gated feed-forward layer with a SiLU-activated gate, `down_proj(SiLU(gate_proj(x)) * up_proj(x))`. It has been reported to outperform standard FFN layers on most benchmarks at equivalent parameter count.
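
As an illustration of the gating, here is a small dependency-free sketch of a SwiGLU feed-forward block. It assumes bias-free projections and uses toy sizes; the helper names `matvec` and `swiglu_mlp` are hypothetical, and the weights are arbitrary numbers chosen for the demo.

```python
import math

def silu(x):
    """SiLU (Swish): x * sigmoid(x). Fine for the small toy inputs used here."""
    return x / (1.0 + math.exp(-x))

def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def swiglu_mlp(x, gate_proj, up_proj, down_proj):
    """Compute down_proj(SiLU(gate_proj x) * up_proj x) without biases."""
    gate = [silu(g) for g in matvec(gate_proj, x)]
    up = matvec(up_proj, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(down_proj, hidden)

# Toy sizes: hidden size 2, intermediate size 3 (Agora uses 2048 and 8192).
x = [1.0, -2.0]
gate_proj = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
up_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
down_proj = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

y = swiglu_mlp(x, gate_proj, up_proj, down_proj)
print(len(y))  # output has the same size as the input
```

The third projection is why a SwiGLU block holds three weight matrices per layer instead of the two in a standard FFN.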

**RMSNorm** — faster than LayerNorm (no mean subtraction), numerically stable, and standard in modern LLMs.
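
For reference, the operation is small enough to sketch in a few lines. This is an illustrative plain-Python version using the ε = 1e-5 from the table; the function name `rms_norm` and the sample vector are assumptions for the demo, not the model's actual code.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale `x` by the reciprocal of its root mean square, then by `weight`."""
    mean_sq = sum(v * v for v in x) / len(x)   # no mean subtraction, unlike LayerNorm
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # eps guards against division by zero
    return [w * v * inv_rms for w, v in zip(weight, x)]

x = [3.0, -4.0, 0.0, 5.0]
y = rms_norm(x, weight=[1.0] * len(x))

# With unit weights the output has a root mean square very close to 1.
print(math.sqrt(sum(v * v for v in y) / len(y)))
```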

**bfloat16** — preferred over fp16 for training stability (larger dynamic range); inference runs cleanly on any Ampere+ GPU or modern CPU with bfloat16 support.
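
The difference in dynamic range follows from the bit layouts: fp16 has 5 exponent bits and 10 mantissa bits, while bfloat16 has 8 exponent bits and 7 mantissa bits. A short sketch of the largest finite value in each format (the helper `max_finite` is written for this illustration):

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = (2 ** exponent_bits - 2) - bias  # all-ones exponent is reserved
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)
print(fp16_max)           # 65504.0
print(f"{bf16_max:.3e}")  # 3.390e+38
```

bfloat16 covers roughly the same range as float32, at the cost of fewer mantissa bits than fp16.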

## Tokenizer

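Agora uses the **LLaMA tokenizer** (SentencePiece, BPE, 32 000 vocab). You can swap in any compatible SentencePiece model by replacing `tokenizer.model` and updating `tokenizer_config.json`; a replacement must keep the 32 000-entry vocabulary unless the model's vocabulary size is changed to match.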

## Training

*(Fill in once training is complete.)*

- Dataset:
- Training compute:
- Optimizer:
- Learning rate schedule:
- Final loss:

## Limitations

This is a research/prototype release. The model card will be updated after pretraining completes with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).

## License

Apache 2.0 — see `LICENSE`.