---
language:
- en
- ar
license: mit
tags:
- silx-ai
- quasar
- foundation-model
- 3b
- moe
- long-context
- bittensor
- sn24
- distillation
- hybrid-transformer
pipeline_tag: text-generation
library_name: transformers
---


# **Quasar Foundation Models (RoPE Base)**

**Quasar Foundation Models** are SILX AI’s core models designed for **long-context reasoning**, **agentic systems**, and **persistent memory-based intelligence**.

This release is **NOT a state-of-the-art final model**. It is a **base pretraining model** designed specifically for **distributed knowledge distillation on Bittensor (SN24 Quasar subnet)**.

The goal is to create a shared architecture where miners continuously **distill knowledge from frontier models (e.g., Qwen, GLM)** into Quasar.

---

## ⚠️ Important Note

This model is:

- A **base model**
- **Pretrained on only a few billion tokens**
- Designed for **distillation and scaling**, not benchmarking

Performance is expected to improve through **iterative subnet training + distillation cycles**.

---

## Model Overview

- **Model Name:** Quasar 3B (RoPE Base)
- **Organization:** SILX AI
- **Architecture:** Quasar-RoPE Hybrid Transformer
- **Total Parameters:** 3B
- **Active Parameters:** ~1B (Mixture-of-Experts)
- **Training Stage:** Stage 1 (Base Pretraining)
- **Sequence Length:** 16K tokens (RoPE phase)

---

## Training Strategy

Quasar follows a **multi-stage training pipeline**:

### **Stage 1: RoPE Pretraining**

- Train using **Rotary Positional Embeddings (RoPE)**
- Context length: **16K tokens**
- Objective: stabilize training and build core reasoning

### **Stage 2: Distillation (SN24)**

- Distributed training on **Bittensor subnet (SN24)**
- Miners distill knowledge from:
  - Qwen
  - GLM
- Target: transfer reasoning + capabilities into Quasar

### **Stage 3: DroPE Long-Context Training**

- Remove positional embeddings entirely (**DroPE phase**)
- Transition to **position-free reasoning**
- Train on **ultra-long contexts (up to 5M tokens)**

This staged approach allows:

- Stable early training
- Efficient knowledge transfer
- Extreme context scaling without positional bottlenecks

---

# **Quasar-RoPE Hybrid Architecture**

Quasar is a **high-throughput hybrid transformer**
designed for **trillion-token scale training**. It combines:

- **Looped computation**
- **Persistent latent memory**
- **Hybrid attention mechanisms**
- **Stable Mixture-of-Experts routing**

---

## 1. Looped Transformer Logic

Instead of increasing depth traditionally, Quasar uses **looped execution**:

- A fixed set of layers is reused multiple times (`num_loops`)
- This multiplies effective depth without increasing VRAM

### Key Mechanism:

- **Anchor P (Input Injection):**
  - Embedding output is stored as `P`
  - `P` is injected into the hidden state at every loop
- **Gradient Stabilization:**
  - Injection gradients are scaled by `1 / num_loops`
  - Prevents instability during recirculation

---

## 2. Hybrid Layer Composition

Each loop contains a mix of:

### **Quasar Layers**

- Use the **Latent Memory Module**
- Handle long-range dependencies
- Read/write persistent state

### **GLA Layers (Gated Linear Attention)**

- Fast, RNN-like recurrence
- Efficient local sequence modeling

---

## 3. Persistent Latent Memory

A defining component of Quasar:

- **Memory Slots:**
  - Fixed parameter banks (e.g., 128–256 slots)
- **Segment Compression:**
  - Tokens are grouped into segments (default: 64 tokens)
  - Reduces noise during updates
- **Saliency Gating:**
  - Learns which information is important
  - Writes only high-value signals to memory

---

## 4. SMEBU (Stability-Maximized Expert Balancing Unit)

Custom Mixture-of-Experts system:

- **Global Bias Buffers**
  - Stored outside the optimizer
  - Prevent routing collapse
- **Zero-Loop Updates**
  - Expert balancing is done in a vectorized pass
  - No recursive instability
- **Sparse Activation**
  - ~1B active parameters per forward pass

---

## 5. Technical Specifications

- **Normalization:** RMSNorm (Pre-Norm)
- **Positional Encoding:** RoPE (`theta = 1,000,000`)
- **Initialization:** Depth-scaled `1/sqrt(2L)`
- **Architecture Type:** Hybrid Transformer + Memory + MoE

---

# Architecture Overview

## Core Data Flow

```
Token IDs
   ↓
Embedding Layer
   ↓
Anchor P Snapshot
   ↓
┌──────────────────────────────────────────────┐
│ Loop (i < num_loops)                         │
│                                              │
│   Quasar Block                               │
│        ↓                                     │
│   GLA Block                                  │
│        ↓                                     │
│   SMEBU MoE                                  │
│        ↓                                     │
│   Inject Anchor P (Residual Conditioning)    │
└──────────────────────────────────────────────┘
   ↓
Next Loop Iteration (state updated)

Final Loop Output
   ↓
RMSNorm
   ↓
LM Head
   ↓
Logits
```

---

## Latent Memory Update Path

```
Hidden States
   ↓
Layer Normalization (RMSNorm)
   ↓
Segment Compressor
   ↓
Segment Representation (Z)
   ↓
   ├──────────────→ Saliency Gate (importance scoring)
   │                      ↓
   │                 Write Signal
   │                      ↓
   └──────────────→ Memory Write Operation
                          ↓
                 Persistent Memory Bank (M)
                          ↓
                 Updated Memory (M')
                          ↓
                 Memory Read Module
                          ↓
                 Memory-Augmented Hidden State
                          ↓
                 Output
```

---

## SMEBU MoE Stability Flow

```
Router Network
   ↓
Token Routing Scores
   ↓
* Global Bias Buffer (non-trainable stability path)
   ↓
Top-K Expert Selection
   ↓
Selected Experts
   ↓
Expert Output Aggregation
   ↓
Final MoE Output
   ↓
Post-Loop Bias Update (vectorized, stabilized)
```

---

# Intended Use

This model is designed as a **foundation base model** for the Quasar ecosystem and is primarily intended for:

- **Bittensor SN24 miners** participating in distributed training and knowledge distillation
- **Distillation pipelines** transferring capabilities from frontier models (e.g., Qwen, GLM)
- **Research on long-context architectures**, especially beyond traditional positional encoding limits
- **Agentic system development**, where persistent memory and long-horizon reasoning are required

---

# Next Steps

- Training on **SN24** begins in the coming days
- Miners distill knowledge into this model
- Training then proceeds to **Run 2 (DroPE training)** at **5M tokens**
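
---

# Example: Looped Execution Sketch

The looped execution and Anchor P injection described under "Looped Transformer Logic" can be illustrated with a minimal, dependency-free sketch. It is not the Quasar implementation: the names `run_looped_stack` and `effective_depth`, the toy layers `halve` and `shift`, and the use of a plain residual add for the injection are all assumptions made for illustration. The `1 / num_loops` gradient scaling is not shown, because this sketch has no autograd.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_looped_stack(embedding: Sequence[float],
                     layers: Sequence[Layer],
                     num_loops: int) -> Vector:
    """Reuse the same `layers` for `num_loops` passes, re-injecting the anchor each pass."""
    anchor_p = list(embedding)  # Anchor P: snapshot of the embedding output
    hidden = list(embedding)
    for _ in range(num_loops):
        for layer in layers:    # same layer objects on every loop, so no new parameters
            hidden = layer(hidden)
        # Inject Anchor P into the hidden state (plain residual add is an assumption)
        hidden = [h + p for h, p in zip(hidden, anchor_p)]
    return hidden


def effective_depth(num_layers: int, num_loops: int) -> int:
    """Number of layer applications per forward pass."""
    return num_layers * num_loops


# Hypothetical toy layers standing in for Quasar / GLA / MoE blocks.
def halve(x: Vector) -> Vector:
    return [0.5 * v for v in x]


def shift(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


out = run_looped_stack([2.0, 4.0], [halve, shift], num_loops=2)
print(out)                    # [5.0, 8.5]
print(effective_depth(2, 2))  # 4
```

With two layers and `num_loops=2`, the input passes through four layer applications while only two layers' worth of parameters exist, which is the sense in which looping multiplies effective depth.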