---
language:
- en
- ar
license: mit
tags:
- silx-ai
- quasar
- foundation-model
- 10b
- long-context
- llm
- gla
- linear-attention
- 2m-context
pipeline_tag: text-generation
library_name: transformers
base_model: Qwen/Qwen3.5-9B-Base
---

<p align="center">
<img src="./Quasar.png" alt="Quasar Foundation Model" width="100%">
</p>

# Quasar-10B: Fully Linear Foundation Model

Quasar-10B is a high-performance foundation model developed by **SILX AI**. It is built upon the **Qwen3.5-9B-Base** architecture, fundamentally re-engineered to support extreme long-context reasoning (2 million+ tokens) while maintaining high computational efficiency.

This model marks a major shift in the Quasar training stack, moving from traditional softmax-based attention to a **Hybrid Gated Linear Attention (GLA)** architecture.

---

# Model Overview

* **Model Name:** Quasar-10B
* **Organization:** SILX AI
* **Base Model:** [Qwen3.5-9B-Base](https://huggingface.co/Qwen/Qwen3.5-9B-Base)

### Architecture Evolution
The original Qwen3.5 architecture uses a combination of Gated Delta Attention and Softmax Gated Attention. To support the Quasar design requirements for infinite scaling and efficient state management, we performed a deep architectural swap:
* **GLA Integration**: Replaced the targeted attention layers with **Gated Linear Attention (GLA)**.
* **NOPE (No Positional Embeddings)**: Removed traditional RoPE (Rotary Positional Embeddings) to eliminate positional bias and enable native extrapolation to millions of tokens.

> [!NOTE]
> **GLA** was chosen as the core linear mechanism to maintain exact architectural parity with the **Quasar 22B MoE** design. This model is a direct evolution of [silx-ai/Quasar-V1-Base-Stage1](https://huggingface.co/silx-ai/Quasar-V1-Base-Stage1), utilizing Quasar Continuous Time Attention for state-trajectory optimization.

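To make the mechanism concrete, here is a minimal, single-head sketch of a gated linear attention recurrence: a fixed-size key-value state is decayed by a data-dependent gate and updated with an outer product at every step, so memory stays constant no matter how long the sequence grows. This is an illustrative PyTorch loop, not the fused kernel used in Quasar-10B; the tensor shapes and the elementwise gate form are assumptions.

```python
import torch

def gla_recurrence(q, k, v, g):
    """Minimal single-head gated linear attention recurrence (illustrative only).

    q, k: (T, d_k); v: (T, d_v); g: (T, d_k) data-dependent forget gates in (0, 1).
    Returns per-token outputs of shape (T, d_v).
    """
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)                      # fixed-size recurrent state
    outputs = []
    for t in range(k.shape[0]):
        # Decay the state with the gate, then write the new key/value outer product.
        S = g[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
        # Read out with the query: cost per token is O(d_k * d_v), independent of T.
        outputs.append(q[t] @ S)
    return torch.stack(outputs)

# Tiny smoke test with random tensors.
T, d_k, d_v = 8, 16, 32
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
g = torch.sigmoid(torch.randn(T, d_k))             # gates in (0, 1)
print(gla_recurrence(q, k, v, g).shape)            # torch.Size([8, 32])
```

Nothing in this update references absolute position, which is why dropping RoPE (the NOPE choice above) leaves the layer relying purely on the decaying state for ordering information.
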
---

# Training Methodology

The development of Quasar-10B followed a rigorous two-stage process:

### Stage 1: Structural Distillation (10B Tokens)
To ensure the new GLA layers correctly inherited the capabilities of the original Qwen heads:
* **Process**: Layer-wise structural distillation. We initialized the student with Qwen3.5 weights and replaced specific layers with GLA units.
* **Loss**: Hybrid loss combining MSE (hidden-state mimicry) and cross-entropy (language modeling); see the sketch after this list.
* **Volume**: 10 billion tokens of high-quality reasoning data.
* **Goal**: Minimize structural divergence and transfer pretrained world knowledge into the new linear state.

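As a rough illustration of the Stage 1 objective, the sketch below combines a hidden-state MSE term against the frozen teacher with the usual next-token cross-entropy. The 50/50 weighting `alpha`, the single matched layer, and the tensor shapes are assumptions made for the example, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student_hidden, teacher_hidden,
                             student_logits, labels, alpha=0.5):
    """Hybrid Stage 1 loss: hidden-state mimicry (MSE) + language modeling (CE).

    student_hidden / teacher_hidden: (batch, seq, dim) activations at a matched layer.
    student_logits: (batch, seq, vocab); labels: (batch, seq) token ids.
    alpha: assumed mixing weight between the two terms.
    """
    # Structural term: keep the new GLA layer's activations close to the teacher's.
    mimic = F.mse_loss(student_hidden, teacher_hidden.detach())
    # Language-modeling term: standard next-token cross-entropy (shift by one).
    ce = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    return alpha * mimic + (1.0 - alpha) * ce
```
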
### Stage 2: Native 2M Context Expansion (20B Tokens)
Once structurally sound, the model was pushed to extreme sequence lengths:
* **Positionality**: RoPE was fully removed and replaced with **NOPE** (No Positional Embedding).
* **Context Length**: Native training at a **2,097,152-token (2M)** sequence length.
* **Volume**: 20 billion tokens.
* **Hardware**: Optimized for B200 HBM efficiency, using sub-chunked sequential processing to maintain a 2M-token active state (sketched below).

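The sub-chunked sequential processing mentioned above can be pictured as follows: the 2M-token sequence is split into fixed-size chunks fed through the network one after another, with the compressed recurrent state carried across chunk boundaries so activation memory is bounded by the chunk size rather than the full context. The `past_state` argument and `state` attribute below are hypothetical names used for illustration; the real interface depends on the released modeling code.

```python
import torch

@torch.no_grad()
def forward_in_chunks(model, input_ids, chunk_size=8192):
    """Feed a very long sequence through a recurrent-state model chunk by chunk.

    `past_state` / `out.state` are hypothetical hooks for carrying the GLA state;
    activation memory stays bounded by `chunk_size` instead of the full 2M context.
    """
    state = None
    logits = []
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(chunk, past_state=state)   # hypothetical state-passing interface
        state = out.state                      # carry the compressed state forward
        logits.append(out.logits)
    return torch.cat(logits, dim=1)
```
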
---

# Features
* **Infinite Recurrence**: The GLA architecture allows the model to process sequences far beyond its training window with linear complexity.
* **Reasoning Excellence**: Trained on the **Nemotron-Pretraining-Specialized-v1** mix, focusing on Math, STEM, and code-centric reasoning.
* **B200 Optimized**: Specifically tuned for maximum throughput on NVIDIA Blackwell hardware.

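For basic text generation, the model is intended to work with the standard `transformers` workflow. The snippet below is a minimal sketch: the repository id `silx-ai/Quasar-10B` and the need for `trust_remote_code=True` (for the custom GLA/NOPE layers) are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "silx-ai/Quasar-10B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # custom linear-attention layers likely ship as remote code
)

prompt = "Explain why linear attention keeps memory constant as context grows."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
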
---

# Technical Notes
Quasar-10B represents the first "Recurrent foundation model" in our stack that successfully bridges the gap between Transformer-scale pretraining and RNN-style linear efficiency. By removing positional embeddings, we allow the model to rely entirely on its internal state trajectories for temporal coherence.

---

# Next Steps
The Quasar roadmap continues toward even larger scales and deeper MoE integrations. For technical research and integration support, contact the SILX AI team.

---

> [!IMPORTANT]
> **Strategic Purpose**: Quasar-10B is designed as a foundational high-context engine. It will be used exclusively to **distill knowledge and generate synthetic reasoning data** for the upcoming **Quasar 22B MoE**, ensuring that the larger mixture-of-experts model inherits superior long-context coherence and refined logical state-trajectories from this fully linear base.