---
language:
- en
license: apache-2.0
tags:
- agora
- causal-lm
- transformer
- gqa
- rope
library_name: transformers
pipeline_tag: text-generation
---

# Agora

**Agora** is a compact decoder-only language model built on a modern transformer architecture. It uses Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm throughout, combining design decisions from LLaMA, Mistral, and Gemma into a clean, efficient baseline.

## Architecture

| Parameter           | Value              |
|---------------------|--------------------|
| Hidden size         | 2048               |
| Intermediate size   | 8192               |
| Layers              | 24                 |
| Attention heads     | 16                 |
| KV heads (GQA)      | 8                  |
| Head dimension      | 128                |
| Max sequence length | 4096               |
| Vocabulary size     | 32 000             |
| Activation          | SiLU (SwiGLU gate) |
| Positional encoding | RoPE (θ = 10 000)  |
| Normalisation       | RMSNorm (ε = 1e-5) |
| Precision           | bfloat16           |

Total parameters: **~1.3 B** (estimate; depends on weight tying).

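
As a sanity check on the estimate, the count can be recomputed from the table. This is a rough sketch under assumptions the card does not confirm: bias-free linear layers, a three-matrix SwiGLU MLP (`gate_proj`, `up_proj`, `down_proj`), two RMSNorm weight vectors per layer, and one final norm. The function name `count_params` is hypothetical.

```python
# Rough parameter count from the Architecture table.
# Assumptions (not confirmed by the model card): no bias terms,
# three MLP matrices per layer, two RMSNorm weights per layer, one final norm.

HIDDEN = 2048
INTERMEDIATE = 8192
LAYERS = 24
Q_HEADS = 16
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32_000


def count_params(tie_embeddings: bool) -> int:
    """Return the parameter count implied by the table under the assumptions above."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM   # k_proj and v_proj
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE              # gate, up, down
    norms = 2 * HIDDEN                           # pre-attention and pre-MLP
    per_layer = q_proj + kv_proj + o_proj + mlp + norms

    embeddings = VOCAB * HIDDEN
    lm_head = 0 if tie_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * per_layer + final_norm + embeddings + lm_head


for tied in (True, False):
    print(f"tied={tied}: {count_params(tied) / 1e9:.2f} B")
```

Under these assumptions the table works out to roughly 1.58 B parameters with tied embeddings and 1.64 B without, which is somewhat above the ~1.3 B quoted. The count reported by the loaded checkpoint is the authority.
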
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Scantrack/Agora"

# Load the tokenizer and the custom model classes shipped with the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "The key to building efficient language models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; gradients are not needed for inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

> **Note:** Pass `trust_remote_code=True` because the config and model classes are custom (`configuration_agora.py`, `modeling_agora.py`).

## Design Decisions

**GQA (8 KV heads, 16 query heads)**: halves the KV cache relative to multi-head attention (MHA) while keeping all 16 query heads. This eases the memory-bandwidth bottleneck during inference and allows roughly 2× the batch size for the same cache budget.

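
The halving can be checked with simple arithmetic. The sketch below assumes the values from the Architecture table (24 layers, head dimension 128, 4096-token context) and 2 bytes per cached value for bfloat16; the helper `kv_cache_bytes` is a hypothetical name.

```python
def kv_cache_bytes(kv_heads: int, seq_len: int = 4096, batch: int = 1,
                   layers: int = 24, head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Size of the K and V caches for one forward context, in bytes."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


mha = kv_cache_bytes(kv_heads=16)  # every query head has its own K/V
gqa = kv_cache_bytes(kv_heads=8)   # two query heads share each K/V head

print(f"MHA: {mha / 2**20:.0f} MiB per sequence")
print(f"GQA: {gqa / 2**20:.0f} MiB per sequence")
```

With these numbers a full 4096-token sequence needs 768 MiB of cache under MHA and 384 MiB under GQA, so the same memory holds twice as many sequences.
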
**RoPE**: relative position information is injected directly into attention scores without learned position embeddings, making the model more naturally extensible to longer contexts.

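
To illustrate the "relative position" property, here is a minimal pure-Python sketch of a rotary embedding. It assumes θ = 10 000 from the Architecture table and rotates adjacent pairs of dimensions; real implementations (possibly including this model's `modeling_agora.py`) may pair dimensions differently, so treat the layout as an assumption. The names `rope` and `dot` are illustrative.

```python
import math


def rope(vec, position, theta=10_000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * theta ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same offset (3 tokens apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

The two scores agree up to floating-point error: the attention score depends only on the offset between query and key positions, not on where they sit in the sequence.
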
**SwiGLU**: a gated feed-forward layer, `down_proj(SiLU(gate_proj(x)) × up_proj(x))`, which is generally reported to outperform standard FFN layers at equivalent parameter count.

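
The data flow through the three projections can be sketched in plain Python. This is a toy illustration with tiny hand-written weight matrices and no bias terms (an assumption); the names `gate_w`, `up_w`, and `down_w` stand in for `gate_proj`, `up_proj`, and `down_proj`, and it is not the model's actual implementation.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def swiglu_mlp(x, gate_w, up_w, down_w):
    gate = [silu(v) for v in matvec(gate_w, x)]   # SiLU(gate_proj(x))
    up = matvec(up_w, x)                          # up_proj(x)
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(down_w, hidden)                 # down_proj(...)


# Toy sizes: hidden size 2, intermediate size 3.
x = [1.0, 2.0]
gate_w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
up_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
down_w = [[0.2, -0.1, 0.4], [0.3, 0.3, -0.2]]

print(swiglu_mlp(x, gate_w, up_w, down_w))
```

The gate branch decides, per intermediate unit, how much of the `up` branch passes through; a gate pre-activation of zero blocks that unit entirely because `silu(0) == 0`.
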
**RMSNorm**: faster than LayerNorm (no mean subtraction), numerically stable, and standard in modern LLMs.

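
For reference, a minimal sketch of the normalisation on a single vector, assuming ε = 1e-5 from the Architecture table and a learned per-dimension `weight`. The function name `rms_norm` is illustrative.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    # Unlike LayerNorm, no mean is subtracted and no bias is added.
    return [w * v * scale for v, w in zip(x, weight)]


x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
print(normed)
```

With unit weights the output has a root mean square of approximately 1, while the direction of the input vector is preserved.
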
**bfloat16**: preferred over fp16 for training stability because of its larger dynamic range; inference runs cleanly on any Ampere or newer GPU, or on a modern CPU with bfloat16 support.

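
The range difference can be demonstrated with the standard library alone. In this sketch bfloat16 is emulated by keeping the top 16 bits of a float32 (simple truncation, whereas hardware typically rounds to nearest), and fp16 uses the `struct` module's half-precision format. The helper names are hypothetical.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its upper 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_float16(x: float) -> float:
    """Round-trip through IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


big = 1.0e5  # larger than the fp16 maximum of 65504

print("bfloat16:", to_bfloat16(big))  # finite, but coarsely rounded
try:
    print("float16:", to_float16(big))
except OverflowError:
    print("float16: overflow")

# The trade-off: bfloat16 keeps fewer mantissa bits than fp16.
print(to_bfloat16(1.001), to_float16(1.001))
```

bfloat16 shares float32's 8-bit exponent, so large activations and gradients stay finite where fp16 overflows, at the cost of precision (7 explicit mantissa bits versus 10).
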
## Tokenizer

Agora uses the **LLaMA tokenizer** (SentencePiece, BPE, 32 000 vocab). You can swap in any compatible SentencePiece model by replacing `tokenizer.model` and updating `tokenizer_config.json`.

## Training

*(Fill in once training is complete.)*

- Dataset:
- Training compute:
- Optimizer:
- Learning rate schedule:
- Final loss:

## Limitations

This is a research/prototype release. After pretraining completes, the model card will be updated with evaluation results on standard benchmarks (HellaSwag, MMLU, ARC, TruthfulQA, etc.).

## License

Apache 2.0; see `LICENSE`.