---
language:
- en
- ar
license: mit
tags:
- silx-ai
- quasar
- foundation-model
- 3b
- moe
- long-context
- bittensor
- sn24
- distillation
- hybrid-transformer
pipeline_tag: text-generation
library_name: transformers
---
# **Quasar Foundation Models (RoPE Base)**
**Quasar Foundation Models** are SILX AI’s core models designed for **long-context reasoning**, **agentic systems**, and **persistent memory-based intelligence**.
This release is **NOT a state-of-the-art final model**.
It is a **base pretraining model** designed specifically for **distributed knowledge distillation on Bittensor (SN24 Quasar subnet)**.
The goal is to create a shared architecture where miners continuously **distill knowledge from frontier models (e.g., Qwen, GLM)** into Quasar.
---
## ⚠️ Important Note
This model is:
- A **base model**
- **Pretrained on only a few billion tokens**
- Designed for **distillation and scaling**, not benchmarking
Performance is expected to improve through **iterative cycles of subnet training and distillation**.
---
## Model Overview
- **Model Name:** Quasar 3B (RoPE Base)
- **Organization:** SILX AI
- **Architecture:** Quasar-RoPE Hybrid Transformer
- **Total Parameters:** 3B
- **Active Parameters:** ~1B (Mixture-of-Experts)
- **Training Stage:** Stage 1 (Base Pretraining)
- **Sequence Length:** 16K tokens (RoPE phase)
---
## Training Strategy
Quasar follows a **multi-stage training pipeline**:
### **Stage 1 — RoPE Pretraining**
- Train using **Rotary Positional Embeddings (RoPE)**
- Context length: **16K tokens**
- Objective: stabilize training and build core reasoning
### **Stage 2 — Distillation (SN24)**
- Distributed training on **Bittensor subnet (SN24)**
- Miners distill knowledge from:
  - Qwen
  - GLM
- Target: transfer reasoning + capabilities into Quasar
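
The distillation objective itself is not specified here. As a minimal sketch, assuming logit-level distillation where teacher and student share a vocabulary (an assumption, since Qwen and GLM ship their own tokenizers), a temperature-softened KL loss might look like the following. The function names and the default temperature are illustrative, not part of the Quasar codebase.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes both logit vectors index the same vocabulary.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# The loss is zero when the student matches the teacher and positive otherwise.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```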
### **Stage 3 — DroPE Long-Context Training**
- Remove positional embeddings entirely (**DroPE phase**)
- Transition to **position-free reasoning**
- Train on **ultra-long context (up to 5M tokens)**
This staged approach allows:
- Stable early training
- Efficient knowledge transfer
- Extreme context scaling without positional bottlenecks
---
# **Quasar-RoPE Hybrid Architecture**
Quasar is a **high-throughput hybrid transformer** designed for **trillion-token scale training**.
It combines:
- **Looped computation**
- **Persistent latent memory**
- **Hybrid attention mechanisms**
- **Stable Mixture-of-Experts routing**
---
## 1. Looped Transformer Logic
Instead of increasing depth traditionally, Quasar uses **looped execution**:
- A fixed set of layers is reused multiple times (`num_loops`)
- This multiplies effective depth without adding parameters, so weight memory (VRAM) stays constant
### Key Mechanism:
- **Anchor P (Input Injection):**
  - Embedding output is stored as `P`
  - `P` is injected into the hidden state at every loop
- **Gradient Stabilization:**
  - Injection gradients are scaled by `1 / num_loops`
  - Prevents instability during recirculation
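
To make the mechanism concrete, here is a toy scalar sketch, assuming the shared layer is just `h -> w * h` and that injection is additive (both are simplifications chosen for illustration; the real layers are the Quasar, GLA, and MoE blocks). It shows the forward loop and, via manual backpropagation, how scaling by `1 / num_loops` keeps the gradient reaching `P` through the injection branches from growing with the loop count.

```python
def looped_forward(p, w, num_loops):
    """Run a toy shared layer (h -> w * h) num_loops times, re-injecting anchor p."""
    h = p  # the embedding output doubles as the initial hidden state
    for _ in range(num_loops):
        h = w * h  # same weights reused on every pass
        h = h + p  # Anchor P injection (assumed additive)
    return h


def anchor_injection_grad(w, num_loops, scale=True):
    """d(output)/d(p) through the injection branches only, via manual backprop."""
    factor = 1.0 / num_loops if scale else 1.0
    grad_h = 1.0  # gradient arriving at the final hidden state
    grad_p = 0.0
    for _ in range(num_loops):
        grad_p += factor * grad_h  # contribution of this loop's injection
        grad_h *= w                # propagate back through the shared layer
    return grad_p


print(looped_forward(1.0, 0.5, 3))
# With w = 1 the unscaled gradient grows linearly with num_loops,
# while the scaled one stays at 1 regardless of loop count.
print(anchor_injection_grad(1.0, 8, scale=False), anchor_injection_grad(1.0, 8))
```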
---
## 2. Hybrid Layer Composition
Each loop contains a mix of:
### **Quasar Layers**
- Use **Latent Memory Module**
- Handle long-range dependencies
- Read/write persistent state
### **GLA Layers (Gated Linear Attention)**
- Fast, RNN-like recurrence
- Efficient local sequence modeling
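
The RNN-like recurrence of gated linear attention can be sketched in a few lines. This is a simplified single-head version, assuming a per-key-dimension decay gate `alpha` and no normalization or chunked parallel form; how Quasar's GLA layers compute their gates is not stated here, so treat the names and shapes as illustrative.

```python
def gla_recurrence(q, k, v, alpha):
    """Single-head gated linear attention as a recurrence.

    state_t = diag(alpha_t) @ state_{t-1} + outer(k_t, v_t)
    out_t   = q_t @ state_t
    q, k, alpha: lists of length-d_k vectors; v: list of length-d_v vectors.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size recurrent state
    outputs = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, alpha):
        for i in range(d_k):
            for j in range(d_v):
                # decay the old state, then write the new key-value outer product
                state[i][j] = a_t[i] * state[i][j] + k_t[i] * v_t[j]
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


q = [[1.0, 0.0], [1.0, 1.0]]
k = [[1.0, 0.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
keep_all = [[1.0, 1.0], [1.0, 1.0]]    # gate = 1: plain causal linear attention
forget_all = [[0.0, 0.0], [0.0, 0.0]]  # gate = 0: each step sees only itself
print(gla_recurrence(q, k, v, keep_all))
print(gla_recurrence(q, k, v, forget_all))
```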
---
## 3. Persistent Latent Memory
A defining component of Quasar:
- **Memory Slots:**
  - Fixed parameter banks (e.g., 128–256 slots)
- **Segment Compression:**
  - Tokens are grouped into segments (default: 64 tokens)
  - Reduces noise during memory updates
- **Saliency Gating:**
  - Learns which information is important
  - Writes only high-value signals to memory
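
The three components above can be sketched together. Everything beyond the segment length is an assumption made for illustration: mean pooling as the compressor, a sigmoid over a linear score as the saliency gate, a hard threshold, and a blend into the most similar slot as the write rule. The actual Latent Memory Module may differ on every one of these points.

```python
import math


def compress_segments(hidden, segment_len=64):
    """Mean-pool consecutive token vectors into one vector per segment (assumed)."""
    segments = []
    for start in range(0, len(hidden), segment_len):
        chunk = hidden[start:start + segment_len]
        dim = len(chunk[0])
        segments.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return segments


def saliency(z, w):
    """Sigmoid importance score from hypothetical gate weights w."""
    return 1.0 / (1.0 + math.exp(-sum(zi * wi for zi, wi in zip(z, w))))


def write_memory(memory, segments, w, threshold=0.5):
    """Blend each salient segment into its most similar slot; skip the rest."""
    memory = [slot[:] for slot in memory]  # do not mutate the caller's bank
    for z in segments:
        g = saliency(z, w)
        if g < threshold:
            continue  # low-value segment: nothing is written
        idx = max(
            range(len(memory)),
            key=lambda s: sum(m * x for m, x in zip(memory[s], z)),
        )
        memory[idx] = [(1.0 - g) * m + g * x for m, x in zip(memory[idx], z)]
    return memory


hidden = [[1.0, 1.0], [3.0, 3.0], [-5.0, -5.0], [-5.0, -5.0]]
segments = compress_segments(hidden, segment_len=2)
print(segments)
print(write_memory([[1.0, 0.0], [0.0, 0.0]], segments, w=[1.0, 1.0]))
```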
---
## 4. SMEBU (Stability-Maximized Expert Balancing Unit)
Custom Mixture-of-Experts system:
- **Global Bias Buffers**
  - Stored outside the optimizer
  - Prevent routing collapse
- **Zero-Loop Updates**
  - Expert balancing is done in a single vectorized pass
  - No recursive instability
- **Sparse Activation**
  - ~1B active parameters per forward pass
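
The exact SMEBU update rule is not given here. As a rough sketch of the general idea, assuming a sign-based rule in the style of auxiliary-loss-free load balancing, the bias is added to the router scores only for expert selection and is nudged after each step toward underused experts. The function names, the step size, and the rule itself are illustrative assumptions.

```python
def route_top_k(scores, bias, k):
    """Pick the top-k experts per token from bias-adjusted router scores."""
    choices = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        ranked = sorted(range(len(biased)), key=lambda e: biased[e], reverse=True)
        choices.append(ranked[:k])
    return choices


def update_bias(bias, choices, step=0.01):
    """Raise the bias of underloaded experts and lower it for overloaded ones.

    The bias is a plain buffer updated by this rule, not by the optimizer.
    In a tensor framework this would be one vectorized operation.
    """
    load = [0] * len(bias)
    for chosen in choices:
        for e in chosen:
            load[e] += 1
    target = sum(load) / len(bias)  # perfectly balanced load per expert
    return [
        b + step * ((load[e] < target) - (load[e] > target))
        for e, b in enumerate(bias)
    ]


scores = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.7, 0.3, 0.0]]
bias = [0.0, 0.0, 0.0]
choices = route_top_k(scores, bias, k=1)  # every token picks expert 0
print(choices)
print(update_bias(bias, choices))         # expert 0 is pushed down, the others up
```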
---
## 5. Technical Specifications
- **Normalization:** RMSNorm (Pre-Norm)
- **Positional Encoding:** RoPE (`theta = 1,000,000`)
- **Initialization:** Depth-scaled `1/sqrt(2L)`
- **Architecture Type:** Hybrid Transformer + Memory + MoE
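
For reference, the listed building blocks can be written out in plain Python. This is a sketch under stated assumptions: the RoPE pairing is the interleaved form (many implementations use a rotate-half layout instead), `L` in `1/sqrt(2L)` is read as the number of layers, and `base_std` is a hypothetical starting value that the source does not give.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: divide by the root-mean-square, then apply an optional gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    weight = weight or [1.0] * len(x)
    return [w * v / rms for v, w in zip(x, weight)]


def rope(x, position, theta=1_000_000.0):
    """Rotate pairs (x[i], x[i+1]) by position * theta^(-i/d); d must be even."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def depth_scaled_std(base_std, num_layers):
    """Scale an initial std by 1/sqrt(2L); base_std is a hypothetical input."""
    return base_std / math.sqrt(2 * num_layers)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# Attention scores depend only on the relative offset (2 in both cases).
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```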
---
# Architecture Overview
## Core Data Flow
```
Token IDs
↓
Embedding Layer
↓
Anchor P Snapshot
↓
┌──────────────────────────────────────────────┐
│ Loop (i < num_loops)                         │
│                                              │
│   Quasar Block                               │
│   ↓                                          │
│   GLA Block                                  │
│   ↓                                          │
│   SMEBU MoE                                  │
│   ↓                                          │
│   Inject Anchor P (Residual Conditioning)    │
└──────────────────────────────────────────────┘
↓
Next Loop Iteration (state updated)
↓
Final Loop Output
↓
RMSNorm
↓
LM Head
↓
Logits
```
---
## Latent Memory Update Path
```
Hidden States
↓
Layer Normalization (RMSNorm)
↓
Segment Compressor
↓
Segment Representation (Z)
↓
├──────────────→ Saliency Gate (importance scoring)
│                ↓
│                Write Signal
│                ↓
└──────────────→ Memory Write Operation
↓
Persistent Memory Bank (M)
↓
Updated Memory (M')
↓
Memory Read Module
↓
Memory-Augmented Hidden State
↓
Output
```
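
The read step at the end of this path is not detailed in the diagram. One plausible reading, sketched here as an assumption, is softmax attention from a hidden vector over the memory slots, with the result added back residually. The function name and the scaling are illustrative.

```python
import math


def read_memory(hidden, memory):
    """Attend from one hidden vector over all slots and add the result back."""
    scale = 1.0 / math.sqrt(len(hidden))
    logits = [scale * sum(h * m for h, m in zip(hidden, slot)) for slot in memory]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # stable softmax numerator
    total = sum(weights)
    weights = [w / total for w in weights]
    read = [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(len(hidden))
    ]
    return [h + r for h, r in zip(hidden, read)]  # memory-augmented hidden state


# Two identical slots: the read vector equals the slot, whatever the weights are.
print(read_memory([2.0, 3.0], [[1.0, 0.0], [1.0, 0.0]]))
```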
---
## SMEBU MoE Stability Flow
```
Router Network
↓
Token Routing Scores
↓
+ Global Bias Buffer (non-trainable stability path)
↓
Top-K Expert Selection
↓
Selected Experts
↓
Expert Output Aggregation
↓
Final MoE Output
↓
Post-Loop Bias Update (vectorized, stabilized)
```
---
# Intended Use
This model is designed as a **foundation base model** for the Quasar ecosystem and is primarily intended for:
- **Bittensor SN24 miners** participating in distributed training and knowledge distillation
- **Distillation pipelines** transferring capabilities from frontier models (e.g., Qwen, GLM)
- **Research on long-context architectures**, especially beyond traditional positional encoding limits
- **Agentic system development**, where persistent memory and long-horizon reasoning are required
---
# Next Steps
- Training on **SN24** starts in the coming days
- Miners distill knowledge into this model
- **Run 2 (DroPE training)** follows, at **5M tokens** of context