---
language:
- en
- ar
license: mit
tags:
- silx-ai
- quasar
- foundation-model
- 10b
- long-context
- llm
- gla
- linear-attention
- 2m-context
pipeline_tag: text-generation
library_name: transformers
base_model: Qwen/Qwen3.5-9B-Base
---

<p align="center">
<img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
</p>

# Quasar-10B: Fully Linear Foundation Model

Quasar-10B is a high-performance foundation model developed by **SILX AI**. It is built upon the **Qwen3.5-9B-Base** architecture, fundamentally re-engineered to support extreme long-context reasoning (2 million+ tokens) while maintaining high computational efficiency.

This model marks a major shift in the Quasar training stack, moving from traditional softmax-based attention to a **Hybrid Gated Linear Attention (GLA)** architecture.

---

# Model Overview

* **Model Name:** Quasar-10B
* **Organization:** SILX AI
* **Base Model:** [Qwen3.5-9B-Base](https://huggingface.co/Qwen/Qwen3.5-9B-Base)

### Architecture Evolution
The original Qwen3.5 architecture uses a combination of Gated Delta Attention and Softmax Gated Attention. To support the Quasar design requirements for infinite scaling and efficient state management, we performed a deep architectural swap:
* **GLA Integration**: Replaced the targeted attention layers with **Gated Linear Attention (GLA)**.
* **NOPE (No Positional Embeddings)**: Removed traditional RoPE (Rotary Positional Embeddings) to eliminate positional bias and enable native extrapolation to millions of tokens.

> [!NOTE]
> **GLA** was chosen as the core linear mechanism to maintain exact architectural parity with the **Quasar 22B MoE** design. This model is a direct evolution of [silx-ai/Quasar-V1-Base-Stage1](https://huggingface.co/silx-ai/Quasar-V1-Base-Stage1), utilizing Quasar Continuous Time Attention for state-trajectory optimization.

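To make the mechanism concrete, here is a minimal, single-head sketch of a gated linear attention recurrence: a fixed-size key-value state is decayed by a data-dependent gate and updated with an outer product at every step, so memory stays constant no matter how long the sequence grows. This is an illustrative PyTorch loop, not the fused kernel used in Quasar-10B; the tensor shapes and the elementwise gate form are assumptions.

```python
import torch

def gla_recurrence(q, k, v, g):
    """Minimal single-head gated linear attention recurrence (illustrative only).

    q, k: (T, d_k); v: (T, d_v); g: (T, d_k) data-dependent forget gates in (0, 1).
    Returns per-token outputs of shape (T, d_v).
    """
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)                      # fixed-size recurrent state
    outputs = []
    for t in range(k.shape[0]):
        # Decay the state with the gate, then write the new key/value outer product.
        S = g[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
        # Read out with the query: cost per token is O(d_k * d_v), independent of T.
        outputs.append(q[t] @ S)
    return torch.stack(outputs)

# Tiny smoke test with random tensors.
T, d_k, d_v = 8, 16, 32
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
g = torch.sigmoid(torch.randn(T, d_k))             # gates in (0, 1)
print(gla_recurrence(q, k, v, g).shape)            # torch.Size([8, 32])
```

Nothing in this update references absolute position, which is why dropping RoPE (the NOPE choice above) leaves the layer relying purely on the decaying state for ordering information.
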
---

# Training Methodology

The development of Quasar-10B followed a rigorous two-stage process:

### Stage 1: Structural Distillation (10B Tokens)
To ensure the new GLA layers correctly inherited the capabilities of the original Qwen heads:
* **Process**: Layer-wise structural distillation. We initialized the student with Qwen3.5 weights and replaced specific layers with GLA units.
* **Loss**: Hybrid loss combining MSE (hidden-state mimicry) and cross-entropy (language modeling); see the sketch after this list.
* **Volume**: 10 billion tokens of high-quality reasoning data.
* **Goal**: Minimize structural divergence and transfer pretrained world knowledge into the new linear state.

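As a rough illustration of the Stage 1 objective, the sketch below combines a hidden-state MSE term against the frozen teacher with the usual next-token cross-entropy. The 50/50 weighting `alpha`, the single matched layer, and the tensor shapes are assumptions made for the example, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student_hidden, teacher_hidden,
                             student_logits, labels, alpha=0.5):
    """Hybrid Stage 1 loss: hidden-state mimicry (MSE) + language modeling (CE).

    student_hidden / teacher_hidden: (batch, seq, dim) activations at a matched layer.
    student_logits: (batch, seq, vocab); labels: (batch, seq) token ids.
    alpha: assumed mixing weight between the two terms.
    """
    # Structural term: keep the new GLA layer's activations close to the teacher's.
    mimic = F.mse_loss(student_hidden, teacher_hidden.detach())
    # Language-modeling term: standard next-token cross-entropy (shift by one).
    ce = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    return alpha * mimic + (1.0 - alpha) * ce
```
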
### Stage 2: Native 2M Context Expansion (20B Tokens)
Once structurally sound, the model was pushed to extreme sequence lengths:
* **Positionality**: RoPE was fully removed and replaced with **NOPE** (No Positional Embedding).
* **Context Length**: Native training at a **2,097,152-token (2M)** sequence length.
* **Volume**: 20 billion tokens.
* **Hardware**: Optimized for B200 HBM efficiency, using sub-chunked sequential processing to maintain a 2M-token active state (sketched below).

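The sub-chunked sequential processing mentioned above can be pictured as follows: the 2M-token sequence is split into fixed-size chunks fed through the network one after another, with the compressed recurrent state carried across chunk boundaries so activation memory is bounded by the chunk size rather than the full context. The `past_state` argument and `state` attribute below are hypothetical names used for illustration; the real interface depends on the released modeling code.

```python
import torch

@torch.no_grad()
def forward_in_chunks(model, input_ids, chunk_size=8192):
    """Feed a very long sequence through a recurrent-state model chunk by chunk.

    `past_state` / `out.state` are hypothetical hooks for carrying the GLA state;
    activation memory stays bounded by `chunk_size` instead of the full 2M context.
    """
    state = None
    logits = []
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(chunk, past_state=state)   # hypothetical state-passing interface
        state = out.state                      # carry the compressed state forward
        logits.append(out.logits)
    return torch.cat(logits, dim=1)
```
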
---

# Features
* **Infinite Recurrence**: The GLA architecture allows the model to process sequences far beyond its training window with linear complexity.
* **Reasoning Excellence**: Trained on the **Nemotron-Pretraining-Specialized-v1** mix, focusing on Math, STEM, and code-centric reasoning.
* **B200 Optimized**: Specifically tuned for maximum throughput on NVIDIA Blackwell hardware.

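For basic text generation, the model is intended to work with the standard `transformers` workflow. The snippet below is a minimal sketch: the repository id `silx-ai/Quasar-10B` and the need for `trust_remote_code=True` (for the custom GLA/NOPE layers) are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "silx-ai/Quasar-10B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # custom linear-attention layers likely ship as remote code
)

prompt = "Explain why linear attention keeps memory constant as context grows."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
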
---

# Technical Notes
Quasar-10B represents the first "Recurrent foundation model" in our stack that successfully bridges the gap between Transformer-scale pretraining and RNN-style linear efficiency. By removing positional embeddings, we allow the model to rely entirely on its internal state trajectories for temporal coherence.

---

# Next Steps
The Quasar roadmap continues toward even larger scales and deeper MoE integrations. For technical research and integration support, contact the SILX AI team.

---

> [!IMPORTANT]
> **Strategic Purpose**: Quasar-10B is designed as a foundational high-context engine. It will be used exclusively to **distill knowledge and generate synthetic reasoning data** for the upcoming **Quasar 22B MoE**, ensuring that the larger mixture-of-experts model inherits superior long-context coherence and refined logical state-trajectories from this fully linear base.