---
language:
- en
- ar
license: mit
tags:
- silx-ai
- quasar
- foundation-model
- 3b
- moe
- long-context
- bittensor
- sn24
- distillation
- hybrid-transformer
pipeline_tag: text-generation
library_name: transformers
---

<p align="center">
  <img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
</p>

# **Quasar Foundation Models (RoPE Base)**

**Quasar Foundation Models** are SILX AI’s core models designed for **long-context reasoning**, **agentic systems**, and **persistent memory-based intelligence**.

This release is **NOT a state-of-the-art final model**.
It is a **base pretraining model** designed specifically for **distributed knowledge distillation on Bittensor (SN24 Quasar subnet)**.
The goal is to create a shared architecture where miners continuously **distill knowledge from frontier models (e.g., Qwen, GLM)** into Quasar.

---

## ⚠️ Important Note

This model is:

- A **base model**
- **Pretrained on only a few billion tokens**
- Designed for **distillation and scaling**, not benchmarking

Performance is expected to improve through **iterative subnet training and distillation cycles**.

---

## Model Overview

- **Model Name:** Quasar 3B (RoPE Base)
- **Organization:** SILX AI
- **Architecture:** Quasar-RoPE Hybrid Transformer
- **Total Parameters:** 3B
- **Active Parameters:** ~1B (Mixture-of-Experts)
- **Training Stage:** Stage 1 (Base Pretraining)
- **Sequence Length:** 16K tokens (RoPE phase)

---

## Training Strategy

Quasar follows a **multi-stage training pipeline**:

### **Stage 1: RoPE Pretraining**

- Train using **Rotary Positional Embeddings (RoPE)**
- Context length: **16K tokens**
- Objective: stabilize training and build core reasoning ability

### **Stage 2: Distillation (SN24)**

- Distributed training on the **Bittensor subnet SN24**
- Miners distill knowledge from:
  - Qwen
  - GLM
- Target: transfer reasoning and other capabilities into Quasar
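
The card does not say which objective SN24 miners optimize during distillation. As a minimal sketch, assuming classic soft-target logit distillation (KL divergence from the teacher's temperature-softened next-token distribution to the student's) and assuming teacher and student logits cover the same vocabulary, the per-token loss looks like this. `distillation_kl` and the temperature value are illustrative, not the subnet's scoring code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumes both logit lists index the same vocabulary. The T**2 factor
    keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]
print(distillation_kl(teacher, teacher))          # 0.0: identical distributions
print(distillation_kl(teacher, [0.1, 1.0, 2.0]))  # a positive value: student disagrees
```
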

### **Stage 3: DroPE Long-Context Training**

- Remove positional embeddings entirely (**DroPE phase**)
- Transition to **position-free reasoning**
- Train on **ultra-long contexts (up to 5M tokens)**

This staged approach allows:

- Stable early training
- Efficient knowledge transfer
- Extreme context scaling without positional-encoding bottlenecks
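
Stage 1 trains with RoPE and Stage 3 drops it. The sketch below, assuming the standard rotate-pairs formulation of RoPE with the card's `theta = 1,000,000`, shows why RoPE encodes relative position: the query-key score depends only on the offset between positions. It models DroPE as simply skipping the rotation. The `drope` flag is a hypothetical name, and the card does not describe how Quasar actually removes RoPE in Stage 3.

```python
import math


def apply_rope(vec, position, theta=1_000_000.0, drope=False):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    `vec` must have even length. With drope=True the vector is returned
    unchanged, i.e. no positional signal is injected (stand-in for DroPE).
    """
    if drope:
        return list(vec)
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)          # inverse frequency for pair i // 2
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# RoPE: the score depends on the relative offset (3 here), not on absolute position.
s1 = dot(apply_rope(q, 10), apply_rope(k, 7))
s2 = dot(apply_rope(q, 1003), apply_rope(k, 1000))
print(abs(s1 - s2) < 1e-9)  # True

# DroPE: no rotation, so the score is the raw dot product at any position.
print(dot(apply_rope(q, 10, drope=True), apply_rope(k, 7, drope=True)) == dot(q, k))  # True
```
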

---

# **Quasar-RoPE Hybrid Architecture**

Quasar is a **high-throughput hybrid transformer** designed for **trillion-token-scale training**.

It combines:

- **Looped computation**
- **Persistent latent memory**
- **Hybrid attention mechanisms**
- **Stable Mixture-of-Experts routing**

---

## 1. Looped Transformer Logic

Instead of stacking more distinct layers, Quasar uses **looped execution**:

- A fixed set of layers is reused multiple times (`num_loops`)
- This multiplies effective depth without adding parameters, so the memory needed for weights does not grow

### Key Mechanism

- **Anchor P (Input Injection):**
  - The embedding output is stored as `P`
  - `P` is injected into the hidden state at every loop
- **Gradient Stabilization:**
  - Injection gradients are scaled by `1 / num_loops`
  - This prevents instability as the hidden state is recirculated
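
A minimal sketch of looped execution with anchor injection, assuming the injection is a plain residual add of `P` after each pass through the shared layers and that the `1 / num_loops` factor applies only to gradients flowing back into `P`. The layers are toy Python callables on lists; `looped_forward` and `anchor_grad` are hypothetical names, not Quasar's code.

```python
def looped_forward(embeddings, shared_layers, num_loops):
    """Run the same layer stack `num_loops` times, re-injecting anchor P each loop."""
    anchor_p = list(embeddings)          # Anchor P: snapshot of the embedding output
    hidden = list(embeddings)
    for _ in range(num_loops):
        for layer in shared_layers:      # identical weights on every loop
            hidden = layer(hidden)
        hidden = [h + p for h, p in zip(hidden, anchor_p)]  # input injection
    return hidden


def anchor_grad(per_loop_grads, num_loops):
    """Gradient reaching P through the injections, each contribution scaled by 1/num_loops.

    P receives one gradient contribution per loop; the scaling turns the
    sum into an average so it does not grow with the loop count.
    """
    total = [0.0] * len(per_loop_grads[0])
    for g in per_loop_grads:
        total = [t + gi / num_loops for t, gi in zip(total, g)]
    return total


def halve(xs):
    """Toy layer standing in for a real transformer block."""
    return [0.5 * x for x in xs]


out = looped_forward([1.0, 2.0], [halve], num_loops=3)
print(out)  # [1.875, 3.75]
```

In an autograd framework the `1 / num_loops` factor would typically be applied with a backward hook or a detach-based scaling trick rather than by hand; `anchor_grad` only makes the arithmetic explicit.
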

---

## 2. Hybrid Layer Composition

Each loop contains a mix of:

### **Quasar Layers**

- Use the **Latent Memory Module**
- Handle long-range dependencies
- Read/write persistent state

### **GLA Layers (Gated Linear Attention)**

- Fast, RNN-like recurrence
- Efficient local sequence modeling
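
For the GLA layers, here is a minimal recurrent-form sketch of gated linear attention for a single head, assuming the common formulation with a per-key-dimension decay gate `alpha_t` in (0, 1): the state is a `d_k x d_v` matrix updated as `S_t = diag(alpha_t) S_{t-1} + outer(k_t, v_t)` and read out as `o_t = q_t S_t`. Quasar's actual gate parameterization, head layout, and chunked kernels are not described in the card.

```python
def gla_recurrent(queries, keys, values, gates):
    """Recurrent (RNN-like) form of gated linear attention for one head.

    queries, keys, gates: T x d_k lists; values: T x d_v lists.
    gates[t][i] in (0, 1) decays row i of the state before the new write.
    Returns T x d_v outputs; cost is O(T * d_k * d_v), linear in T.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]   # S: d_k x d_v recurrent memory
    outputs = []
    for q, k, v, a in zip(queries, keys, values, gates):
        # S_t = diag(alpha_t) S_{t-1} + outer(k_t, v_t)
        state = [[a[i] * state[i][j] + k[i] * v[j] for j in range(d_v)]
                 for i in range(d_k)]
        # o_t = q_t S_t
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


q = [[1.0, 0.0], [1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0], [4.0]]
g = [[0.5, 0.5], [0.5, 0.5]]
print(gla_recurrent(q, k, v, g))  # [[2.0], [5.0]]: the first value decays by 0.5
```

With every gate fixed at 1 the recurrence reduces to plain (ungated) linear attention, which is one way to see what the gate adds.
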

---

## 3. Persistent Latent Memory

A defining component of Quasar:

- **Memory Slots:**
  - Fixed parameter banks (e.g., 128–256 slots)
- **Segment Compression:**
  - Tokens are grouped into segments (default: 64 tokens)
  - Compressing each segment reduces noise during memory updates
- **Saliency Gating:**
  - Learns which information is important
  - Writes only high-value signals to memory
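
The write path can be sketched as follows. This assumes mean pooling as the segment compressor, a sigmoid saliency score from a learned weight vector (hand-set here), a hypothetical threshold for skipping low-value segments, and a gated write into the most similar slot. Every name (`compress_segments`, `saliency`, `write_memory`, `w_sal`) and the slot-selection rule are hypothetical; the card names the components but not their equations.

```python
import math


def compress_segments(tokens, segment_len=64):
    """Mean-pool consecutive groups of `segment_len` token vectors into one vector each."""
    segments = []
    for start in range(0, len(tokens), segment_len):
        chunk = tokens[start:start + segment_len]
        segments.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return segments


def saliency(z, w_sal):
    """Importance score in (0, 1): sigmoid of a learned projection (assumed form)."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(z, w_sal))))


def write_memory(memory, z, gate):
    """Gated write: blend z into the slot most similar to it (dot product); return its index."""
    scores = [sum(a * b for a, b in zip(slot, z)) for slot in memory]
    idx = max(range(len(memory)), key=scores.__getitem__)
    memory[idx] = [(1 - gate) * m + gate * x for m, x in zip(memory[idx], z)]
    return idx


memory = [[1.0, 0.0], [0.0, 1.0]]      # 2 slots here; the card cites 128-256
tokens = [[0.0, 2.0], [0.0, 4.0], [2.0, 0.0], [4.0, 0.0]]
w_sal = [1.0, -1.0]                    # hand-set stand-in for learned weights
for z in compress_segments(tokens, segment_len=2):  # tiny segments for the demo
    g = saliency(z, w_sal)
    if g > 0.5:                        # hypothetical threshold: skip low-value segments
        write_memory(memory, z, g)
print(memory[1])  # [0.0, 1.0]: the low-saliency segment [0, 3] was never written
```
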

---

## 4. SMEBU (Stability-Maximized Expert Balancing Unit)

A custom Mixture-of-Experts system:

- **Global Bias Buffers:**
  - Stored outside the optimizer
  - Prevent routing collapse
- **Zero-Loop Updates:**
  - Expert balancing is done in a single vectorized pass
  - No recursive instability
- **Sparse Activation:**
  - ~1B active parameters per forward pass
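
The card says SMEBU keeps global bias buffers outside the optimizer and updates them in one vectorized pass. Below is a minimal sketch that assumes an auxiliary-loss-free balancing rule like the one popularized by DeepSeek-V3: lower the bias of overloaded experts, raise it for underloaded ones, by a fixed step. The step size `gamma` and the sign rule are assumptions, not confirmed details of SMEBU.

```python
def update_bias(bias, expert_counts, gamma=0.001):
    """One balancing step on a non-trainable bias buffer (no optimizer state involved).

    Experts whose token count is above the mean get their bias lowered,
    those below the mean get it raised, and experts exactly at the mean are
    unchanged. This is one pass over all experts; in a real framework it
    would be a single tensor op.
    """
    mean_load = sum(expert_counts) / len(expert_counts)
    return [b - gamma * ((c > mean_load) - (c < mean_load))
            for b, c in zip(bias, expert_counts)]


bias = [0.0, 0.0, 0.0, 0.0]
counts = [10, 2, 4, 4]                       # tokens routed to each expert this step
print(update_bias(bias, counts, gamma=0.5))  # [-0.5, 0.5, 0.5, 0.5]
```
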

---

## 5. Technical Specifications

- **Normalization:** RMSNorm (Pre-Norm)
- **Positional Encoding:** RoPE (`theta = 1,000,000`)
- **Initialization:** Depth-scaled `1/sqrt(2L)`
- **Architecture Type:** Hybrid Transformer + Memory + MoE
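
Two of these specifications are easy to state in code. This minimal sketch assumes the standard RMSNorm definition (divide by the root mean square, then apply a learned per-channel gain) and reads "depth-scaled `1/sqrt(2L)`" as multiplying a base initialization standard deviation by `1/sqrt(2L)` for `L` layers. The base std of 0.02 and which weight matrices receive the scaled init are assumptions, not values from the card.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x^2) + eps) * gain (no mean subtraction, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]


def depth_scaled_std(num_layers, base_std=0.02):
    """Init std scaled by 1/sqrt(2L); base_std is an assumed default."""
    return base_std / math.sqrt(2 * num_layers)


y = rms_norm([3.0, 4.0], gain=[1.0, 1.0], eps=0.0)
print(sum(v * v for v in y) / len(y))  # mean square of the output is (numerically) 1
print(depth_scaled_std(8))             # 0.005
```
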

---

# Architecture Overview

## Core Data Flow

```
Token IDs
    ↓
Embedding Layer
    ↓
Anchor P Snapshot
    ↓
┌──────────────────────────────────────────────┐
│  Loop (i < num_loops)                        │
│                                              │
│    Quasar Block                              │
│        ↓                                     │
│    GLA Block                                 │
│        ↓                                     │
│    SMEBU MoE                                 │
│        ↓                                     │
│    Inject Anchor P (Residual Conditioning)   │
└──────────────────────────────────────────────┘
    ↓
Next Loop Iteration (state updated)
    ↓
Final Loop Output
    ↓
RMSNorm
    ↓
LM Head
    ↓
Logits
```

---

## Latent Memory Update Path

```
Hidden States
    ↓
Layer Normalization (RMSNorm)
    ↓
Segment Compressor
    ↓
Segment Representation (Z)
    │
    ├──────────────→ Saliency Gate (importance scoring)
    │                    ↓
    │                Write Signal
    │                    ↓
    └──────────────→ Memory Write Operation
                         ↓
                     Persistent Memory Bank (M)
                         ↓
                     Updated Memory (M')
                         ↓
                     Memory Read Module
                         ↓
                     Memory-Augmented Hidden State
                         ↓
                     Output
```
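
The read side of this path (Memory Read Module, then Memory-Augmented Hidden State) is sketched here assuming scaled dot-product softmax attention from a hidden vector over the memory slots, added back residually. The card does not specify the read mechanism, so `read_memory` is a hypothetical illustration.

```python
import math


def read_memory(hidden, memory):
    """Attend from one hidden vector over all memory slots and add the result residually."""
    d = len(hidden)
    scores = [sum(h * m for h, m in zip(hidden, slot)) / math.sqrt(d) for slot in memory]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    read = [sum(w * slot[j] for w, slot in zip(weights, memory)) for j in range(d)]
    return [h + r for h, r in zip(hidden, read)]   # memory-augmented hidden state


memory = [[1.0, 0.0], [0.0, 1.0]]
print(read_memory([0.0, 0.0], memory))  # [0.5, 0.5]: equal scores give a uniform read
```
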

---

## SMEBU MoE Stability Flow

```
Router Network
    ↓
Token Routing Scores
    ↓
+ Global Bias Buffer (non-trainable stability path)
    ↓
Top-K Expert Selection
    ↓
Selected Experts
    ↓
Expert Output Aggregation
    ↓
Final MoE Output
    ↓
Post-Loop Bias Update (vectorized, stabilized)
```
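
The routing half of this flow can be sketched for one token as below. It assumes, as in auxiliary-loss-free balancing schemes, that the global bias only influences which experts are selected, while the mixing weights come from a softmax over the unbiased router scores of the chosen experts. `route_token` and its arguments are hypothetical; SMEBU's exact gating function is not given in the card.

```python
import math


def route_token(scores, bias, experts, x, top_k=2):
    """Select top-k experts with biased scores; mix their outputs with unbiased weights."""
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Mixing weights: softmax over the raw (unbiased) scores of the chosen experts only.
    peak = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        out = [o + (exps[i] / total) * yi for o, yi in zip(out, y)]
    return chosen, out


# Four toy experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]
scores = [2.0, 1.9, 0.5, 0.1]
bias = [-2.0, 0.0, 0.0, 0.0]   # expert 0 was overloaded, so its bias was lowered
chosen, out = route_token(scores, bias, experts, x=[1.0], top_k=2)
print(chosen)  # [1, 2]: the bias steers traffic away from expert 0
```
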

---

# Intended Use

This model is designed as a **foundation base model** for the Quasar ecosystem and is primarily intended for:

- **Bittensor SN24 miners** participating in distributed training and knowledge distillation
- **Distillation pipelines** transferring capabilities from frontier models (e.g., Qwen, GLM)
- **Research on long-context architectures**, especially beyond traditional positional-encoding limits
- **Agentic system development**, where persistent memory and long-horizon reasoning are required

---

# Next Steps

- Training on **SN24** begins in the coming days
- Miners distill knowledge into this model
- Then comes **Run 2: DroPE training** with contexts of up to **5M tokens**