Instructions to use protobuga/sn24-v1 with libraries and local apps.

Libraries

How to use protobuga/sn24-v1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="protobuga/sn24-v1", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("protobuga/sn24-v1", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use protobuga/sn24-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "protobuga/sn24-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "protobuga/sn24-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
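
The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "protobuga/sn24-v1", "What is the capital of France?")` should return the generated reply as a string.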

Use Docker

docker model run hf.co/protobuga/sn24-v1

SGLang

How to use protobuga/sn24-v1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "protobuga/sn24-v1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "protobuga/sn24-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "protobuga/sn24-v1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "protobuga/sn24-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use protobuga/sn24-v1 with Docker Model Runner:
```
docker model run hf.co/protobuga/sn24-v1
```

sn24-v1 / README.md

protobuga

Initial upload — track_b_v3/step_300 perturbed (seed 44); validator-aligned corpus

87423e3 verified 1 day ago


	---
	language:
	- en
	- ar
	license: mit
	tags:
	- silx-ai
	- quasar
	- foundation-model
	- 3b
	- moe
	- long-context
	- bittensor
	- sn24
	- distillation
	- hybrid-transformer
	pipeline_tag: text-generation
	library_name: transformers
	---

	<p align="center">
	<img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
	</p>

	# Quasar Foundation Models (RoPE Base)

	Quasar Foundation Models are SILX AI’s core models designed for long-context reasoning, agentic systems, and persistent memory-based intelligence.

	This release is NOT a state-of-the-art final model.
	It is a base pretraining model designed specifically for distributed knowledge distillation on Bittensor (SN24 Quasar subnet).

	The goal is to create a shared architecture where miners continuously distill knowledge from frontier models (e.g., Qwen, GLM) into Quasar.

	---

	## ⚠️ Important Note

	This model is:

	- A base model
	- Pretrained on only a few billion tokens
	- Designed for distillation and scaling, not benchmarking

	Performance is expected to improve through iterative cycles of subnet training and distillation.

	---

	## Model Overview

	- Model Name: Quasar 3B (RoPE Base)
	- Organization: SILX AI
	- Architecture: Quasar-RoPE Hybrid Transformer
	- Total Parameters: 3B
	- Active Parameters: ~1B (Mixture-of-Experts)
	- Training Stage: Stage 1 (Base Pretraining)
	- Sequence Length: 16K tokens (RoPE phase)

	---

	## Training Strategy

	Quasar follows a multi-stage training pipeline:

	### Stage 1 — RoPE Pretraining
	- Train using Rotary Positional Embeddings (RoPE)
	- Context length: 16K tokens
	- Objective: stabilize training and build core reasoning

	### Stage 2 — Distillation (SN24)
	- Distributed training on Bittensor subnet (SN24)
	- Miners distill knowledge from:
	  - Qwen
	  - GLM
	- Target: transfer reasoning + capabilities into Quasar

	### Stage 3 — DroPE Long-Context Training
	- Remove positional embeddings entirely (DroPE phase)
	- Transition to position-free reasoning
	- Train on ultra-long context (up to 5M tokens)

	This staged approach allows:
	- Stable early training
	- Efficient knowledge transfer
	- Extreme context scaling without positional bottlenecks

	---

	# Quasar-RoPE Hybrid Architecture

	Quasar is a high-throughput hybrid transformer designed for trillion-token scale training.

	It combines:
	- Looped computation
	- Persistent latent memory
	- Hybrid attention mechanisms
	- Stable Mixture-of-Experts routing

	---

	## 1. Looped Transformer Logic

	Instead of increasing depth traditionally, Quasar uses looped execution:

	- A fixed set of layers is reused multiple times (`num_loops`)
	- This multiplies effective depth without adding parameters, so weight memory stays constant

	### Key Mechanism:

	- Anchor P (Input Injection):
	  - Embedding output is stored as `P`
	  - Injected into the hidden state at every loop
	- Gradient Stabilization:
	  - Injection gradients scaled by `1 / num_loops`
	  - Prevents instability during recirculation
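
The looping mechanism can be sketched in plain Python. This is an illustration under stated assumptions, not the Quasar implementation: the injection is modelled as a simple residual add at the end of each loop, and `looped_forward`, `injection_grad_scale`, and the toy layer `halve` are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def looped_forward(embedding: Sequence[float], layers: Sequence[Layer], num_loops: int) -> Vector:
    """Run the same layer stack num_loops times, re-injecting the anchor each loop."""
    anchor_p = list(embedding)  # Anchor P: snapshot of the embedding output
    hidden = list(embedding)
    for _ in range(num_loops):
        for layer in layers:  # the same layers (same weights) on every loop
            hidden = layer(hidden)
        # Assumption: injection is a plain residual add at the end of each loop.
        hidden = [h + p for h, p in zip(hidden, anchor_p)]
    return hidden


def injection_grad_scale(num_loops: int) -> float:
    """Factor applied to gradients flowing back through each injection of P."""
    return 1.0 / num_loops


def halve(vector: Vector) -> Vector:
    """Toy stand-in for a real layer."""
    return [0.5 * x for x in vector]


output = looped_forward([2.0], [halve], num_loops=2)
```

In an autograd framework the `1 / num_loops` factor would be applied in the backward pass of the injection only, leaving the forward values unchanged. Here it is exposed as a plain number to keep the sketch dependency-free.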

	---

	## 2. Hybrid Layer Composition

	Each loop contains a mix of:

	### Quasar Layers
	- Use Latent Memory Module
	- Handle long-range dependencies
	- Read/write persistent state

	### GLA Layers (Gated Linear Attention)
	- Fast, RNN-like recurrence
	- Efficient local sequence modeling
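
The "RNN-like recurrence" of a gated linear attention layer can be illustrated with a small pure-Python sketch. It assumes the common formulation in which a matrix-valued state is decayed by a gate and updated with the outer product of key and value. A single scalar gate per step is used for brevity, whereas practical GLA layers typically use data-dependent per-dimension gates and a chunked parallel form. The names `gla_step` and `gla_scan` are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def gla_step(
    state: Matrix, q: Sequence[float], k: Sequence[float], v: Sequence[float], gate: float
) -> Tuple[Matrix, List[float]]:
    """One recurrent step: decay the state, add outer(k, v), then read with q."""
    new_state = [
        [gate * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]
    output = [sum(q[i] * new_state[i][j] for i in range(len(k))) for j in range(len(v))]
    return new_state, output


def gla_scan(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    gates: Sequence[float],
) -> List[List[float]]:
    """Process a sequence token by token with constant-size state."""
    state = [[0.0] * len(values[0]) for _ in keys[0]]
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        state, out = gla_step(state, q, k, v, g)
        outputs.append(out)
    return outputs


outputs = gla_scan(
    queries=[[1.0, 0.0], [1.0, 1.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[3.0, 4.0], [1.0, 1.0]],
    gates=[0.5, 0.5],
)
```

Because the state has a fixed size regardless of sequence length, per-token cost stays constant, which is what makes this kind of layer cheap for sequence modeling.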

	---

	## 3. Persistent Latent Memory

	A defining component of Quasar:

	- Memory Slots:
	  - Fixed parameter banks (e.g., 128–256 slots)

	- Segment Compression:
	  - Tokens are grouped into segments (default: 64 tokens)
	  - Reduces noise during memory updates

	- Saliency Gating:
	  - Learns which information is important
	  - Writes only high-value signals to memory
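
A rough sketch of how these three pieces could fit together is shown below. The model card does not specify the compressor, the gate, or the write rule, so everything here is an assumption chosen for illustration: mean pooling as the segment compressor, a sigmoid over a linear score as the saliency gate, and a gated blend as the write. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def compress_segments(tokens: Sequence[Vector], segment_size: int = 64) -> List[Vector]:
    """Mean-pool consecutive tokens into one vector per segment (assumed compressor)."""
    segments = []
    for start in range(0, len(tokens), segment_size):
        chunk = tokens[start:start + segment_size]
        dim = len(chunk[0])
        segments.append([sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)])
    return segments


def saliency(segment: Vector, weights: Vector) -> float:
    """Sigmoid of a linear score; values near 0 suppress the write (assumed gate)."""
    score = sum(s * w for s, w in zip(segment, weights))
    return 1.0 / (1.0 + math.exp(-score))


def write_memory(memory: List[Vector], slot: int, segment: Vector, gate: float) -> None:
    """Gated blend of a segment into one slot: m <- (1 - g) * m + g * z (assumed rule)."""
    memory[slot] = [(1.0 - gate) * m + gate * z for m, z in zip(memory[slot], segment)]


segments = compress_segments([[1.0], [2.0], [3.0], [4.0]], segment_size=2)
memory = [[0.0]]
write_memory(memory, slot=0, segment=[2.0], gate=saliency([0.0], [1.0]))
```

How a segment is matched to a slot is left out entirely. In this sketch the caller picks the slot index.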

	---

	## 4. SMEBU (Stability-Maximized Expert Balancing Unit)

	Custom Mixture-of-Experts system:

	- Global Bias Buffers
	  - Stored outside the optimizer
	  - Prevent routing collapse

	- Zero-Loop Updates
	  - Expert balancing is done in a single vectorized pass
	  - No recursive instability

	- Sparse Activation
	  - ~1B active parameters per forward pass
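
The description above resembles bias-based load balancing, where a per-expert bias shifts routing scores during expert selection and is adjusted by a fixed rule rather than by gradients. The sketch below illustrates that general idea only. Whether SMEBU uses this exact update is an assumption, the loops are written out instead of vectorized for readability, and `route_top_k`, `route_batch`, and `update_bias` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def route_top_k(scores: Sequence[float], bias: Sequence[float], k: int) -> List[int]:
    """Pick the k experts with the highest biased score for one token."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(biased)), key=lambda e: biased[e], reverse=True)[:k]


def route_batch(
    token_scores: Sequence[Sequence[float]], bias: Sequence[float], k: int
) -> Tuple[List[List[int]], List[int]]:
    """Route every token and count how many tokens each expert received."""
    assignments = [route_top_k(scores, bias, k) for scores in token_scores]
    counts = [0] * len(bias)
    for experts in assignments:
        for expert in experts:
            counts[expert] += 1
    return assignments, counts


def update_bias(bias: Sequence[float], counts: Sequence[int], step: float = 0.1) -> List[float]:
    """Lower the bias of over-loaded experts and raise it for under-loaded ones."""
    mean_load = sum(counts) / len(counts)
    new_bias = []
    for b, c in zip(bias, counts):
        if c > mean_load:
            new_bias.append(b - step)
        elif c < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


assignments, counts = route_batch([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]], [0.0, 0.0, 0.0], k=1)
new_bias = update_bias([0.0, 0.0, 0.0], counts)
```

Because the bias is updated by this rule after the forward pass, it never appears in the optimizer state, which matches the "stored outside the optimizer" property listed above.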

	---

	## 5. Technical Specifications

	- Normalization: RMSNorm (Pre-Norm)
	- Positional Encoding: RoPE (`theta = 1,000,000`)
	- Initialization: Depth-scaled `1/sqrt(2L)`
	- Architecture Type: Hybrid Transformer + Memory + MoE
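
For reference, the three numeric pieces in this list can be written out as follows. This is a sketch of the standard definitions of RMSNorm and RoPE inverse frequencies, plus the `1/sqrt(2L)` factor as stated. How Quasar applies that factor (for example, to which weight matrices) is not specified in this card, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope_inv_freq(head_dim: int, theta: float = 1_000_000.0) -> List[float]:
    """Rotation frequencies theta^(-2i/d) for each of the d/2 dimension pairs."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def depth_scaled_init(num_layers: int) -> float:
    """Depth-dependent initialization scale 1 / sqrt(2L) for L layers."""
    return 1.0 / math.sqrt(2 * num_layers)


normalized = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
frequencies = rope_inv_freq(4)
init_scale = depth_scaled_init(8)
```

A larger `theta` makes the lowest frequencies rotate more slowly, which is a common choice when training with longer context windows.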

	---

	# Architecture Overview

	## Core Data Flow

	```
	Token IDs
	↓
	Embedding Layer
	↓
	Anchor P Snapshot
	↓
	┌──────────────────────────────────────────────┐
	│ Loop (i < num_loops)                         │
	│                                              │
	│   Quasar Block                               │
	│         ↓                                    │
	│   GLA Block                                  │
	│         ↓                                    │
	│   SMEBU MoE                                  │
	│         ↓                                    │
	│   Inject Anchor P (Residual Conditioning)    │
	└──────────────────────────────────────────────┘
	↓
	Next Loop Iteration (state updated)

	Final Loop Output
	↓
	RMSNorm
	↓
	LM Head
	↓
	Logits
	```

	---

	## Latent Memory Update Path

	```
	Hidden States
	      ↓
	Layer Normalization (RMSNorm)
	      ↓
	Segment Compressor
	      ↓
	Segment Representation (Z)
	      ↓
	      ├──────────────→ Saliency Gate (importance scoring)
	      │                        ↓
	      │                   Write Signal
	      │                        ↓
	      └──────────────→ Memory Write Operation
	                               ↓
	                  Persistent Memory Bank (M)
	                               ↓
	                      Updated Memory (M')
	                               ↓
	                      Memory Read Module
	                               ↓
	                Memory-Augmented Hidden State
	                               ↓
	                            Output
	```

	---

	## SMEBU MoE Stability Flow

	```
	Router Network
	↓
	Token Routing Scores
	↓
	* Global Bias Buffer (non-trainable stability path)
	↓
	Top-K Expert Selection
	↓
	Selected Experts
	↓
	Expert Output Aggregation
	↓
	Final MoE Output
	↓
	Post-Loop Bias Update (vectorized, stabilized)
	```

	---

	# Intended Use

	This model is designed as a foundation base model for the Quasar ecosystem and is primarily intended for:

	- Bittensor SN24 miners participating in distributed training and knowledge distillation
	- Distillation pipelines transferring capabilities from frontier models (e.g., Qwen, GLM)
	- Research on long-context architectures, especially beyond traditional positional encoding limits
	- Agentic system development, where persistent memory and long-horizon reasoning are required

	---

	# Next Steps

	- Training on SN24 begins in the coming days
	- Miners distill knowledge into this model
	- Run 2 follows: DroPE training at up to 5M tokens