Agora
Agora is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout, combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.
Architecture
| Parameter | Value |
|---|---|
| Hidden size | 2048 |
| Intermediate size | 8192 |
| Layers | 24 |
| Attention heads | 16 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Max sequence length | 4096 |
| Vocabulary size | 32 000 |
| Activation | SiLU (SwiGLU gate) |
| Positional encoding | RoPE (θ = 10 000) |
| Normalisation | RMSNorm (ε = 1e-5) |
| Precision | bfloat16 |
Total parameters: ~1.6 B (estimated from the table above: about 1.58 B with tied input/output embeddings, about 1.64 B untied).
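The total can be checked directly against the table. The sketch below assumes LLaMA-style layers with no bias terms, two RMSNorm weights per layer, and one final norm; `count_params` is a hypothetical helper, and the actual modeling_agora.py may differ in these details.

```python
def count_params(hidden=2048, inter=8192, layers=24, heads=16, kv_heads=8,
                 head_dim=128, vocab=32_000, tied=True):
    """Estimate parameters for a bias-free LLaMA-style decoder (assumed layout)."""
    attn = hidden * heads * head_dim            # q_proj
    attn += 2 * hidden * kv_heads * head_dim    # k_proj and v_proj (GQA: fewer KV heads)
    attn += heads * head_dim * hidden           # o_proj
    mlp = 3 * hidden * inter                    # gate_proj, up_proj, down_proj
    norms = 2 * hidden                          # two RMSNorm weight vectors per layer
    body = layers * (attn + mlp + norms) + hidden  # plus the final norm
    embed = vocab * hidden                      # token embedding matrix
    # Untied models carry a separate output projection of the same size.
    return body + embed * (1 if tied else 2)

tied_total = count_params(tied=True)
untied_total = count_params(tied=False)
print(f"tied: {tied_total / 1e9:.2f} B, untied: {untied_total / 1e9:.2f} B")
```

Under these assumptions the table's values come to roughly 1.58 B parameters with tied embeddings and 1.64 B without.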
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Scantrack/Agora"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
Note: Pass trust_remote_code=True because the config and model classes are custom (configuration_agora.py, modeling_agora.py).
Design Decisions
GQA (8 KV heads, 16 query heads): halves the KV cache size versus MHA while keeping full expressiveness on the query side. The smaller cache reduces the memory-bandwidth bottleneck during inference and allows roughly 2× the batch size in the same KV-cache memory.
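The cache saving can be worked out from the architecture table. This is a minimal sketch, assuming the cache stores one key and one value vector per KV head per layer in bfloat16 (2 bytes per element); `kv_cache_bytes` is a hypothetical helper, not part of the model code.

```python
def kv_cache_bytes(seq_len, batch=1, layers=24, kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `batch` sequences of `seq_len` tokens."""
    # Leading factor 2 covers the separate key and value tensors.
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

gqa = kv_cache_bytes(4096)               # 8 KV heads, as in the table
mha = kv_cache_bytes(4096, kv_heads=16)  # hypothetical MHA variant with 16 KV heads
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB per 4096-token sequence")
```

At the full 4096-token context this gives 384 MiB per sequence with GQA versus 768 MiB with MHA, so the same cache budget holds twice as many sequences.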
RoPE: relative position information is injected directly into attention scores without learned position embeddings, which makes the model easier to extend to longer contexts.
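The relative-position property can be illustrated with a small dependency-free sketch. It assumes the interleaved-pair convention, rotating each pair (x[2i], x[2i+1]) by an angle of pos × θ^(-2i/d); Agora's modeling code may use the equivalent half-split layout instead, and the vectors below are made-up illustration values.

```python
import math

def rope(vec, pos, theta=10_000.0):
    """Apply rotary position embedding to one head vector at position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * theta ** (-i / d)   # i = 2j, so this is theta^(-2j/d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2-D rotation of the pair
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 tokens apart, so the scores should match.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because the two rotations compose into a single rotation by the position difference, the attention score depends only on the offset between query and key, not on their absolute positions.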
SwiGLU: the gated variant of SiLU (gate_proj × up_proj → down_proj) outperforms standard FFN layers on most benchmarks at equivalent parameter count.
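The data flow of the gated feed-forward block can be sketched in plain Python. This is an illustration only, assuming bias-free projections applied as down_proj(SiLU(gate_proj(x)) × up_proj(x)); the tiny weight matrices are made-up values, and a real implementation would use tensor operations.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]   # activated gate branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w_down, hidden)                 # project back to model width

# Toy sizes: model width 2, intermediate width 3.
x = [1.0, 2.0]
w_gate = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
w_up = [[1.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y)
```

The three projections are why a SwiGLU block carries three weight matrices per layer rather than the two of a standard FFN.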
RMSNorm: faster than LayerNorm (no mean subtraction), numerically stable, and standard in modern LLMs.
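A minimal sketch of the normalisation, assuming the common form x / sqrt(mean(x²) + ε) scaled by a learned weight, with ε = 1e-5 as in the table. `rms_norm` is a hypothetical stand-alone function, not the model's own layer.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square; no mean is subtracted."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]

ones = [1.0] * 4
y = rms_norm([3.0, -4.0, 0.0, 0.0], ones)
constant = rms_norm([1.0, 1.0, 1.0, 1.0], ones)
print(y)
print(constant)
```

The constant input comes back almost unchanged, whereas LayerNorm would map it to zeros after subtracting the mean. Skipping that step is where the saving comes from.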
bfloat16: preferred over fp16 for training stability (larger dynamic range); inference runs on any Ampere or newer GPU, or on a modern CPU with bfloat16 support.
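The dynamic-range difference follows from the bit layouts: fp16 has 5 exponent bits and 10 mantissa bits, while bfloat16 has 8 exponent bits and 7 mantissa bits. The sketch below derives the largest finite value of each format from those widths; `max_finite` is a hypothetical helper written for this illustration.

```python
def max_finite(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the top usable one is 2^e - 2.
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp

fp16_max = max_finite(5, 10)
bf16_max = max_finite(8, 7)
print(f"fp16 max: {fp16_max:.0f}, bfloat16 max: {bf16_max:.3e}")
```

fp16 tops out at 65504, so large activations or gradients overflow unless loss scaling is used. bfloat16 reaches about 3.4e38, the same order as float32, at the cost of fewer mantissa bits.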
Tokenizer
Agora uses the LLaMA tokenizer (SentencePiece, BPE, 32 000 vocab). You can swap in any compatible SentencePiece model by replacing tokenizer.model and updating tokenizer_config.json.
Training
(Fill in once training is complete.)
- Dataset:
- Training compute:
- Optimizer:
- Learning rate schedule:
- Final loss:
Limitations
This is a research/prototype release. The model card will be updated after pretraining completes with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).
License
Apache 2.0; see LICENSE.