---
language:
- en
- ar
license: mit
tags:
- silx-ai
- quasar
- foundation-model
- 10b
- long-context
- llm
- gla
- linear-attention
- 2m-context
pipeline_tag: text-generation
library_name: transformers
base_model: Qwen/Qwen3.5-9B-Base
---

<p align="center">
  <img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
</p>

# Quasar-10B: Fully Linear Foundation Model

Quasar-10B is a high-performance foundation model developed by **SILX AI**. It is built upon the **Qwen3.5-9B-Base** architecture, fundamentally re-engineered to support extreme long-context reasoning (2 Million+ tokens) while maintaining high computational efficiency.

This model marks a major shift in the Quasar training stack, moving from traditional Softmax-based attention to a **Hybrid Gated Linear Attention (GLA)** architecture.

---

# Model Overview

* **Model Name:** Quasar-10B
* **Organization:** SILX AI
* **Base Model:** [Qwen3.5-9B-Base](https://huggingface.co/Qwen/Qwen3.5-9B-Base)

### Architecture Evolution
The original Qwen3.5 architecture uses a combination of Gated Delta Attention and Softmax Gated Attention. To support the Quasar design requirements for infinite scaling and efficient state management, we performed a deep architectural swap:
* **GLA Integration**: Replaced the target attention layers with **Gated Linear Attention (GLA)**; the recurrence is sketched after the note below.
* **NOPE (No Positional Embeddings)**: Removed traditional RoPE (Rotary Positional Embeddings) to eliminate positional bias and enable native extrapolation to millions of tokens.
> [!NOTE]
> **GLA** was chosen as the core linear mechanism to maintain exact architectural parity with the **Quasar 22B MoE** design. This model is a direct evolution of [silx-ai/Quasar-V1-Base-Stage1](https://huggingface.co/silx-ai/Quasar-V1-Base-Stage1), utilizing Quasar Continuous Time Attention for state-trajectory optimization.
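
For intuition, the generic gated linear attention recurrence has the following form (this is the textbook GLA formulation; the exact gating parameterization inside Quasar may differ):

$$
S_t = \operatorname{diag}(\alpha_t)\, S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t
$$

where $S_t \in \mathbb{R}^{d_k \times d_v}$ is the recurrent state, $\alpha_t \in (0,1)^{d_k}$ is a learned per-token decay gate, and $q_t, k_t, v_t$ are the usual query, key, and value vectors. Because $S_t$ has a fixed size, per-token compute and memory are constant in sequence length, which is what makes linear-complexity extrapolation possible once RoPE is removed.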

---

# Training Methodology

The development of Quasar-10B followed a rigorous two-stage process:

### Stage 1: Structural Distillation (10B Tokens)
To ensure the new GLA layers correctly inherited the capabilities of the original Qwen heads:
* **Process**: Layer-wise structural distillation. We initialized the student with Qwen3.5 weights and replaced specific layers with GLA units.
* **Loss**: Hybrid loss combining MSE (Hidden State Mimicry) and Cross-Entropy (Language Modeling), as sketched below.
* **Volume**: 10 Billion tokens of high-quality reasoning data.
* **Goal**: Minimize structural divergence and transfer pretrained world knowledge into the new linear state.
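
A minimal sketch of what such a hybrid objective can look like, assuming per-layer hidden-state supervision against a frozen teacher; the function name, layer selection, and `mse_weight` are illustrative assumptions, not the exact Stage 1 recipe:

```python
import torch.nn.functional as F

def hybrid_distillation_loss(student_hiddens, teacher_hiddens,
                             student_logits, labels, mse_weight=1.0):
    """Illustrative hybrid loss: hidden-state mimicry (MSE) plus
    language modeling (cross-entropy). Shapes and weighting are
    assumptions, not the published Quasar-10B configuration.

    student_hiddens / teacher_hiddens: lists of (B, T, D) tensors taken
    at the GLA-swapped layers; the teacher is frozen.
    student_logits: (B, T, V); labels: (B, T) next-token targets.
    """
    # Structural term: pull each swapped layer toward the teacher's states.
    mse = sum(F.mse_loss(s, t.detach())
              for s, t in zip(student_hiddens, teacher_hiddens))
    # Language-modeling term: standard next-token cross-entropy.
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    return mse_weight * mse + ce
```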

### Stage 2: Native 2M Context Expansion (20B Tokens)
Once structurally sound, the model was pushed to extreme sequence lengths:
* **Positionality**: RoPE was fully removed and replaced with **NOPE** (No Positional Embedding).
* **Context Length**: Native training at a **2,097,152 (2M)** token sequence length.
* **Volume**: 20 Billion tokens.
* **Hardware**: Optimized for B200 HBM efficiency, utilizing sub-chunked sequential processing (sketched below) to maintain a 2M-token active state.
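
The property that makes a 2M-token active state tractable is that the GLA state is fixed-size, so the sequence can be consumed chunk by chunk while only the state is carried forward. A minimal sketch; the chunk size and the per-token inner loop are illustrative assumptions (production kernels process each chunk in parallel on-device):

```python
import torch

def gla_chunked(q, k, v, alpha, chunk_size=8192):
    """Illustrative sub-chunked sequential processing for a GLA layer.

    Only one chunk of activations is live at a time, while the
    (d_k, d_v) state S summarizes the entire prefix, however long.
    q, k, alpha: (T, d_k); v: (T, d_v); alpha entries lie in (0, 1).
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)                 # carried across chunks
    outputs = []
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        chunk_out = torch.empty(end - start, d_v)
        for i, t in enumerate(range(start, end)):
            # Gated decay of the old state + rank-1 key/value write.
            S = alpha[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
            chunk_out[i] = q[t] @ S           # query readout
        outputs.append(chunk_out)             # in practice: consume or offload
    return torch.cat(outputs)
```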

---

# Features
* **Infinite Recurrence**: The GLA architecture allows the model to process sequences far beyond its training window with linear complexity.
* **Reasoning Excellence**: Trained on the **Nemotron-Pretraining-Specialized-v1** mix, focusing on Math, STEM, and code-centric reasoning.
* **B200 Optimized**: Specifically tuned for maximum throughput on NVIDIA Blackwell hardware.

---
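
# Usage

A minimal loading sketch with Hugging Face `transformers`. The repository id `silx-ai/Quasar-10B` and the need for `trust_remote_code` are assumptions; check the repository files for the authoritative snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "silx-ai/Quasar-10B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # the custom GLA/NOPE layers may require this
)

inputs = tokenizer(
    "Explain gated linear attention in one paragraph.",
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

---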

# Technical Notes
Quasar-10B represents the first "recurrent foundation model" in our stack that successfully bridges the gap between Transformer-scale pretraining and RNN-style linear efficiency. By removing positional embeddings, we allow the model to rely entirely on its internal state trajectories for temporal coherence.

---

# Next Steps
The Quasar roadmap continues toward even larger scales and deeper MoE integrations. For technical research and integration support, contact the SILX AI team.

---

> [!IMPORTANT]
> **Strategic Purpose**: Quasar-10B is designed as a foundational high-context engine. It will be used exclusively to **distill knowledge and generate synthetic reasoning data** for the upcoming **Quasar 22B MoE**, ensuring that the larger mixture-of-experts model inherits superior long-context coherence and refined logical state trajectories from this fully linear base.