---
language:
- en
license: apache-2.0
tags:
- agora
- causal-lm
- transformer
- gqa
- rope
library_name: transformers
pipeline_tag: text-generation
---
# Agora
**Agora** is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout, combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.
## Architecture
| Parameter | Value |
|-------------------------|--------------|
| Hidden size | 2048 |
| Intermediate size | 8192 |
| Layers | 24 |
| Attention heads | 16 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Max sequence length | 4096 |
| Vocabulary size | 32 000 |
| Activation | SiLU (SwiGLU gate) |
| Positional encoding | RoPE (θ = 10 000) |
| Normalisation | RMSNorm (ε = 1e-5) |
| Precision | bfloat16 |
Total parameters: **~1.6 B** (estimate from the table above, assuming a bias-free LLaMA-style layout: about 1.58 B with tied embeddings, 1.64 B untied).
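
As a cross-check on the estimate, the table values can be plugged into a rough count. This is a sketch that assumes a bias-free, LLaMA-style layout (separate `q/k/v/o` projections, three SwiGLU matrices, two RMSNorm weights per layer); `count_params` is a hypothetical helper, not part of the repo. Under those assumptions it gives about 1.58 B parameters with tied embeddings and 1.64 B untied, so the headline figure should be verified against the actual checkpoint.

```python
# Rough parameter count from the architecture table.
# Assumes a LLaMA-style, bias-free layout; the real modeling_agora.py may differ.
hidden, inter, layers = 2048, 8192, 24
q_heads, kv_heads, head_dim, vocab = 16, 8, 128, 32_000


def count_params(tied_embeddings: bool) -> int:
    attn = hidden * q_heads * head_dim            # q_proj
    attn += 2 * hidden * kv_heads * head_dim      # k_proj + v_proj (GQA: fewer KV heads)
    attn += q_heads * head_dim * hidden           # o_proj
    mlp = 3 * hidden * inter                      # gate_proj, up_proj, down_proj
    norms = 2 * hidden                            # two RMSNorm weights per layer
    body = layers * (attn + mlp + norms) + hidden # plus the final RMSNorm
    embed = vocab * hidden                        # input embedding (and LM head if untied)
    return body + embed * (1 if tied_embeddings else 2)


print(f"tied:   {count_params(True) / 1e9:.2f} B")
print(f"untied: {count_params(False) / 1e9:.2f} B")
```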
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Scantrack/Agora"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # requires the `accelerate` package
    trust_remote_code=True,   # loads the custom Agora classes from the repo
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; no_grad avoids tracking gradients during inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
> **Note:** Pass `trust_remote_code=True` because the config and model classes are custom (`configuration_agora.py`, `modeling_agora.py`).
## Design Decisions
**GQA (8 KV heads, 16 query heads)**: halves the KV cache relative to multi-head attention (MHA) while keeping all 16 query heads. The smaller cache reduces memory-bandwidth pressure during decoding and allows roughly 2× the batch size within the same cache memory budget.
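
To make the cache saving concrete, here is a back-of-the-envelope calculation using the table values. It is a sketch that assumes a plain per-layer K/V cache stored in bfloat16 with no paging or quantisation; `kv_cache_bytes` is a hypothetical helper.

```python
# KV cache size: 2 tensors (K and V) per layer, each kv_heads * head_dim values per token.
layers, head_dim, bytes_per_value = 24, 128, 2   # bfloat16 = 2 bytes


def kv_cache_bytes(kv_heads: int, seq_len: int, batch: int = 1) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch


gqa = kv_cache_bytes(kv_heads=8, seq_len=4096)    # Agora's GQA configuration
mha = kv_cache_bytes(kv_heads=16, seq_len=4096)   # hypothetical MHA equivalent
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB per 4096-token sequence")
```

Under these assumptions a full 4096-token sequence needs 384 MiB of cache with GQA versus 768 MiB with MHA.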
**RoPE**: queries and keys are rotated by position-dependent angles, so attention scores depend on relative position without any learned position embeddings. This tends to make the model easier to extend to longer contexts.
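
The relative-position property can be checked numerically. The sketch below is a pure-Python illustration, not the code in `modeling_agora.py`: it rotates consecutive pairs of dimensions (the interleaved formulation; many implementations pair dimension `i` with `i + dim/2` instead), and the helper names `rope` and `dot` are hypothetical.

```python
import math


def rope(vec, pos, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * theta ** (-i / dim)   # one frequency per pair
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same offset (3) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```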
**SwiGLU**: a gated feed-forward layer, `down_proj(SiLU(gate_proj(x)) × up_proj(x))`, which has been reported to outperform standard FFN layers at equivalent parameter count on most benchmarks.
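
The data flow through the gated feed-forward layer can be sketched in a few lines. This is a toy pure-Python version with made-up weights and tiny dimensions, intended only to show the order of operations; the function names are hypothetical and the real layer operates on tensors.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))   # x * sigmoid(x)


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]   # SiLU(gate_proj(x))
    up = matvec(w_up, x)                          # up_proj(x)
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w_down, hidden)                 # down_proj(...)


# Toy sizes: hidden=2, intermediate=3 (Agora uses 2048 and 8192).
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[0.2, -0.1, 0.3], [0.0, 0.5, -0.2]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(len(y))   # output has the same width as the input
```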
**RMSNorm**: faster than LayerNorm (no mean subtraction), numerically stable, and standard in modern LLMs.
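
For reference, RMSNorm divides each vector by its root-mean-square and applies a learned per-channel weight. A minimal pure-Python sketch, using the ε = 1e-5 from the table (`rms_norm` is a hypothetical helper, not the repo's module):

```python
import math


def rms_norm(x, weight, eps=1e-5):
    mean_sq = sum(v * v for v in x) / len(x)      # no mean subtraction, unlike LayerNorm
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for w, v in zip(weight, x)]


x = [3.0, -4.0, 0.0, 5.0]
y = rms_norm(x, weight=[1.0] * len(x))

# With unit weights, the output has a root-mean-square of approximately 1.
rms_out = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms_out, 4))
```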
**bfloat16**: preferred over fp16 for training stability because of its larger dynamic range; inference runs on any NVIDIA Ampere or newer GPU, or on a CPU with bfloat16 support.
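
The dynamic-range difference is easy to demonstrate with the standard library. In this sketch, bfloat16 is emulated by truncating a float32 to its top 16 bits (real hardware typically rounds to nearest instead), and fp16 is checked through the `struct` module's half-precision `e` format; both helpers are hypothetical.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by truncating the float32 bit pattern to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_fp16(x: float) -> bool:
    try:
        struct.pack("<e", x)   # "e" is IEEE 754 half precision
        return True
    except OverflowError:
        return False


value = 70_000.0   # above the fp16 maximum of 65504
print(fits_fp16(value), to_bfloat16(value))
```

The value overflows fp16 but survives in bfloat16 with reduced precision, which is the trade-off described above: bfloat16 keeps float32's 8 exponent bits and gives up mantissa bits.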
## Tokenizer
Agora uses the **LLaMA tokenizer** (SentencePiece, BPE, 32 000 vocab). You can swap in any compatible SentencePiece model by replacing `tokenizer.model` and updating `tokenizer_config.json`.
## Training
*(Fill in once training is complete.)*
- Dataset:
- Training compute:
- Optimizer:
- Learning rate schedule:
- Final loss:
## Limitations
This is a research/prototype release. The model card will be updated after pretraining completes with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).
## License
Apache 2.0; see `LICENSE`.