---
language:
  - en
license: apache-2.0
tags:
  - agora
  - causal-lm
  - transformer
  - gqa
  - rope
library_name: transformers
pipeline_tag: text-generation
---

Agora

Agora is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout — combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.

Architecture

| Parameter | Value |
|---|---|
| Hidden size | 2048 |
| Intermediate size | 8192 |
| Layers | 24 |
| Attention heads | 16 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Max sequence length | 4096 |
| Vocabulary size | 32 000 |
| Activation | SiLU (SwiGLU gate) |
| Positional encoding | RoPE (θ = 10 000) |
| Normalisation | RMSNorm (ε = 1e-5) |
| Precision | bfloat16 |

Total parameters: ~1.6 B (estimate from the table above: about 1.58 B with tied input/output embeddings, about 1.64 B without).
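The total can be checked against the table with a short count. This is a minimal sketch, not the model's own code: it assumes no bias terms, three MLP projections per layer (gate, up, down), two RMSNorm weight vectors per layer, and one final norm. The function name and its defaults are illustrative only.

```python
# Rough parameter count from the architecture table.
# Assumptions (not confirmed by the model card): no bias terms,
# two RMSNorm weight vectors per layer, and one final RMSNorm.

def count_parameters(
    hidden=2048,
    intermediate=8192,
    layers=24,
    heads=16,
    kv_heads=8,
    head_dim=128,
    vocab=32_000,
    tie_embeddings=True,
):
    # Attention: full-width query/output projections, narrower K/V (GQA).
    q_proj = hidden * heads * head_dim
    k_proj = hidden * kv_heads * head_dim
    v_proj = hidden * kv_heads * head_dim
    o_proj = heads * head_dim * hidden
    attention = q_proj + k_proj + v_proj + o_proj

    # SwiGLU MLP: gate_proj, up_proj and down_proj.
    mlp = 3 * hidden * intermediate

    norms = 2 * hidden
    per_layer = attention + mlp + norms

    embeddings = vocab * hidden
    lm_head = 0 if tie_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embeddings + lm_head + final_norm


tied = count_parameters(tie_embeddings=True)
untied = count_parameters(tie_embeddings=False)
print(f"tied embeddings:   {tied / 1e9:.2f} B")
print(f"untied embeddings: {untied / 1e9:.2f} B")
```

Under these assumptions the count comes to about 1.58 B parameters with tied embeddings and about 1.64 B untied. The actual checkpoint may differ if the custom modeling code deviates from them.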

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Scantrack/Agora"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Note: Pass trust_remote_code=True because the config and model classes are custom (configuration_agora.py, modeling_agora.py).

Design Decisions

GQA (8 KV heads, 16 query heads): halves the KV cache size versus MHA while keeping all 16 query heads. The smaller cache eases the memory-bandwidth bottleneck during inference, so roughly twice the batch size fits in the same KV-cache memory.
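The KV-cache saving can be worked out directly from the architecture table. The sketch below assumes 2 bytes per cached value (bfloat16) and a cache that stores one key and one value tensor per layer; `kv_cache_bytes` is an illustrative helper, not part of the model code.

```python
def kv_cache_bytes(seq_len, batch_size=1, layers=24, kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size: keys + values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


gqa = kv_cache_bytes(4096, kv_heads=8)    # Agora's GQA configuration
mha = kv_cache_bytes(4096, kv_heads=16)   # hypothetical MHA equivalent

print(f"GQA (8 KV heads):  {gqa / 2**20:.0f} MiB per sequence")
print(f"MHA (16 KV heads): {mha / 2**20:.0f} MiB per sequence")
```

At the full 4096-token context this gives 384 MiB per sequence for GQA versus 768 MiB for the MHA equivalent.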

RoPE: queries and keys are rotated by position-dependent angles, so attention scores depend on the relative offset between tokens without any learned position embeddings. This tends to make the model easier to extend to longer contexts than learned absolute embeddings.
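The relative-position property of RoPE can be illustrated in a few lines of plain Python. This is a sketch only: it rotates adjacent pairs of dimensions (the interleaved convention), which may not match the pairing used in modeling_agora.py, and the vectors are made-up toy values.

```python
import math


def rope(vec, position, theta=10_000.0):
    """Rotate each adjacent pair of `vec` by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency theta^(-i/dim).
        angle = position * theta ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * cos - y * sin, x * sin + y * cos]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 0.25, 2.0]   # toy query vector
k = [1.5, 0.5, -0.75, 1.0]   # toy key vector

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 10), rope(k, 7))
score_far = dot(rope(q, 103), rope(k, 100))
print(score_near, score_far)
```

The two scores agree up to floating-point error, because the rotation applied to the query and the key cancels except for their offset.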

SwiGLU: a gated feed-forward layer, down_proj(SiLU(gate_proj(x)) × up_proj(x)) with an elementwise product, which has been reported to outperform standard FFN layers at equivalent parameter count.
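As a rough illustration of the gate_proj / up_proj / down_proj data flow, here is a dependency-free sketch of a SwiGLU feed-forward layer on toy dimensions. The weights are arbitrary example values and bias terms are assumed absent. The real layer operates on 2048-dimensional inputs with an 8192-dimensional intermediate.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def swiglu_mlp(x, gate_proj, up_proj, down_proj):
    gate = [silu(g) for g in matvec(gate_proj, x)]   # activated gate branch
    up = matvec(up_proj, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]       # elementwise product
    return matvec(down_proj, hidden)                 # back to model width


# Toy sizes: hidden size 2, intermediate size 3.
x = [1.0, -2.0]
gate_proj = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
up_proj = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
down_proj = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

y = swiglu_mlp(x, gate_proj, up_proj, down_proj)
print(y)
```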

RMSNorm: cheaper than LayerNorm because it skips the mean subtraction, numerically stable in practice, and standard in modern LLMs.
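For reference, RMSNorm reduces to a single scaling by the root mean square of the input. The sketch below uses the ε = 1e-5 from the architecture table and a made-up input vector. It is a plain-Python illustration, not the model's implementation.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for w, v in zip(weight, x)]


x = [3.0, -4.0, 12.0, 0.0]
y = rms_norm(x, weight=[1.0] * len(x))

# With unit weights, the output has a root mean square close to 1.
rms_out = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_out)
```

Unlike LayerNorm, the output is not re-centred: the mean of `y` is generally non-zero.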

bfloat16: preferred over fp16 for training stability (larger dynamic range); inference runs cleanly on NVIDIA Ampere or newer GPUs and on modern CPUs with bfloat16 support.
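The dynamic-range difference can be seen with the standard library alone. In this sketch, bfloat16 is emulated by keeping the top 16 bits of the float32 encoding (simple truncation, whereas real hardware typically rounds to nearest), and fp16 uses the `struct` module's `e` format. Both helper functions are illustrative.

```python
import struct


def to_bfloat16(value):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_float16(value):
    """Round-trip through IEEE half precision; overflow becomes infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", value))[0]
    except OverflowError:
        return float("inf")


big = 1.0e10
print("bfloat16:", to_bfloat16(big))   # stays finite (8 exponent bits)
print("float16: ", to_float16(big))    # exceeds the fp16 maximum of 65504

# The trade-off: bfloat16 keeps fewer mantissa bits than fp16.
print(to_bfloat16(1.001), to_float16(1.001))
```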

Tokenizer

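Agora uses the LLaMA tokenizer (SentencePiece BPE, 32 000-token vocabulary). You can swap in any compatible SentencePiece model by replacing tokenizer.model and updating tokenizer_config.json; the replacement must keep the vocabulary size at 32 000, or the embedding matrix has to be resized to match.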

Training

(Fill in once training is complete.)

  • Dataset:
  • Training compute:
  • Optimizer:
  • Learning rate schedule:
  • Final loss:

Limitations

This is a research/prototype release. The model card will be updated after pretraining completes with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).

License

Apache 2.0 — see LICENSE.