language:
  - en
license: apache-2.0
tags:
  - agora
  - causal-lm
  - transformer
  - gqa
  - rope
library_name: transformers
pipeline_tag: text-generation

Agora

Agora is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout — combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.

Architecture

Parameter	Value
Hidden size	2048
Intermediate size	8192
Layers	24
Attention heads	16
KV heads (GQA)	8
Head dimension	128
Max sequence length	4096
Vocabulary size	32 000
Activation	SiLU (SwiGLU gate)
Positional encoding	RoPE (θ = 10 000)
Normalisation	RMSNorm (ε = 1e-5)
Precision	bfloat16

Total parameters: ~1.6 B (estimate from the table above: about 1.58 B with tied input/output embeddings, about 1.64 B untied).
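
As a rough cross-check, the parameter count can be derived from the table values. The sketch below assumes LLaMA-style blocks (separate q/k/v/o projections, a three-matrix SwiGLU MLP, two RMSNorm weights per layer, no bias terms); the card does not confirm those details, and the variable names are illustrative. Under these assumptions the total comes to about 1.58 B with tied embeddings and 1.64 B untied.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumption: LLaMA-style layers with no bias terms (not confirmed by the card).
hidden = 2048
intermediate = 8192
layers = 24
n_heads = 16
n_kv_heads = 8
head_dim = 128
vocab = 32_000

# Attention: q_proj and o_proj use the full query width,
# k_proj and v_proj use the narrower GQA key/value width.
attn = 2 * hidden * (n_heads * head_dim) + 2 * hidden * (n_kv_heads * head_dim)

# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * hidden * intermediate

# Two RMSNorm weight vectors per layer (pre-attention, pre-MLP).
norms = 2 * hidden

per_layer = attn + mlp + norms
embed = vocab * hidden
final_norm = hidden

tied = layers * per_layer + embed + final_norm
untied = tied + embed  # a separate lm_head adds one more vocab x hidden matrix

print(f"tied embeddings:   {tied / 1e9:.2f} B")
print(f"untied embeddings: {untied / 1e9:.2f} B")
```

The MLP dominates: at these sizes it accounts for roughly 80% of each layer's weights.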

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Scantrack/Agora"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Note: Pass trust_remote_code=True because the config and model classes are custom (configuration_agora.py, modeling_agora.py).

Design Decisions

GQA (8 KV heads, 16 query heads): halves the KV cache relative to standard multi-head attention (MHA) while keeping all 16 query heads. The smaller cache eases the memory-bandwidth bottleneck during inference and fits roughly twice the batch size (or context length) in the same cache memory.
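
To make the saving concrete, here is a minimal sketch of the KV-cache arithmetic using the values from the architecture table. The helper name `kv_cache_bytes` is hypothetical, and the formula assumes one key and one value vector per KV head per token per layer, stored in bfloat16 (2 bytes).

```python
def kv_cache_bytes(n_kv_heads, head_dim=128, layers=24,
                   seq_len=4096, batch=1, bytes_per_value=2):
    """KV-cache size in bytes; the leading 2 counts keys and values."""
    return 2 * layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

mha = kv_cache_bytes(n_kv_heads=16)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8)   # Agora's setting from the table

print(f"MHA: {mha / 2**20:.0f} MiB per sequence")
print(f"GQA: {gqa / 2**20:.0f} MiB per sequence")
```

At the full 4096-token context this works out to 768 MiB per sequence for MHA versus 384 MiB for GQA.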

RoPE: rotates query and key vectors by position-dependent angles, so attention scores depend on the relative offset between tokens without any learned position embeddings. This also makes the model easier to extend to longer contexts.
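
The relative-position property can be checked with a small pure-Python sketch. It uses the interleaved-pair formulation of RoPE with θ = 10 000 from the table; Agora's own modeling code may pair dimensions differently (for example the half-split layout used in LLaMA-style implementations), and the function names here are illustrative.

```python
import math

def rope(vec, pos, theta=10_000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 rotates at frequency theta^(-i/dim).
        angle = pos * theta ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same offset (3 tokens) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

print(abs(score_near - score_far) < 1e-9)
```

Both scores agree up to floating-point error because rotating the query by position m and the key by position n leaves a dot product that depends only on m − n.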

SwiGLU: a gated feed-forward layer that applies SiLU to the gate branch, computing down_proj(SiLU(gate_proj(x)) × up_proj(x)). It outperforms standard FFN layers on most benchmarks at equivalent parameter count.
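
A minimal pure-Python sketch of that computation follows. The weight names mirror the gate_proj, up_proj, and down_proj labels used above; the toy dimensions (hidden size 2, intermediate size 3) and weight values are made up for illustration and have nothing to do with the real checkpoint.

```python
import math

def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    """Multiply matrix `w` (list of rows) by vector `x`."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]   # SiLU(gate_proj(x))
    up = matvec(w_up, x)                          # up_proj(x)
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(w_down, hidden)                 # down_proj(...)

# Toy weights: hidden size 2, intermediate size 3.
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[0.2, -0.1, 0.3], [0.0, 0.5, -0.2]]

y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y)
```

The layer carries three weight matrices instead of the two in a plain FFN, which is why SwiGLU models usually shrink the intermediate size to keep the parameter count comparable.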

RMSNorm: cheaper than LayerNorm because it skips mean subtraction and rescales by the root mean square only; numerically stable and standard in modern LLMs.
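
For reference, the operation is short enough to sketch in a few lines of pure Python. This uses ε = 1e-5 from the architecture table; the function name and the all-ones weight vector are illustrative.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]

x = [1.0, -2.0, 3.0, -4.0]
y = rms_norm(x, weight=[1.0] * len(x))

# With unit weights, the output has a root mean square of roughly 1.
out_rms = math.sqrt(sum(v * v for v in y) / len(y))
print(out_rms)
```

Unlike LayerNorm, nothing is subtracted from the input, so the signs and relative proportions of the elements are preserved.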

bfloat16: preferred over fp16 for training stability because it keeps the 8-bit exponent of fp32 (larger dynamic range, at the cost of fewer mantissa bits); inference runs cleanly on any Ampere+ GPU or modern CPU with bfloat16 support.
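
The range-versus-precision trade-off can be illustrated with the standard library alone. In the sketch below, bfloat16 is emulated by keeping the top 16 bits of the float32 encoding; this truncates rather than rounding to nearest as real hardware typically does, so treat it as an approximation. The helper names are hypothetical.

```python
import struct

def to_bfloat16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_float16(x):
    """Round-trip through IEEE half precision; overflow is reported as inf."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf")

big = 70_000.0   # above fp16's largest finite value (65504)
fine = 1.001     # needs more than 7 mantissa bits to distinguish from 1.0

print("range:    ", to_bfloat16(big), to_float16(big))
print("precision:", to_bfloat16(fine), to_float16(fine))
```

The large value survives in bfloat16 (coarsely) but overflows in fp16, while the small perturbation survives in fp16 but is lost in bfloat16. Activations and gradients that spike during training are the first case, which is why bfloat16 tends to need less loss-scaling care.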

Tokenizer

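Agora uses the LLaMA tokenizer (SentencePiece, BPE, 32 000 vocab). You can swap in any compatible SentencePiece model by replacing tokenizer.model and updating tokenizer_config.json. The replacement must keep the vocabulary at 32 000 entries; otherwise the model's vocabulary size and embedding matrix have to change to match.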

Training

(Fill in once training is complete.)

Dataset:
Training compute:
Optimizer:
Learning rate schedule:
Final loss:

Limitations

This is a research/prototype release. Once pretraining completes, the model card will be updated with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).

License

Apache 2.0 — see LICENSE.