| --- |
| license: mit |
| datasets: |
| - tokyotech-llm/swallow-code-v2 |
| language: |
| - en |
| - zh |
| metrics: |
| - accuracy |
| base_model: |
| - Qwen/Qwen2.5-Coder-1.5B |
| library_name: transformers |
| tags: |
| - Qwen |
| - HybridArch |
| - sinkAttention |
| - MLA |
| - GQA |
| --- |
| |
|
|
| # PyraCode-1.5B |
|
|
| ## 🌟 Model Overview |
| PyraCode-1.5B is a custom-architecture model based on `Qwen2.5-Coder-1.5B`. It introduces an **Asymmetric Hybrid Architecture (GQA + MLA)** with **Cross-Layer Shared Latent Gates** and **Attention Sinks**, which lets deep layers share features across layers and reduces the KV-cache memory footprint. |
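
To give a rough sense of where the KV-cache saving could come from, here is a back-of-envelope sketch. It assumes the base model's GQA settings (2 KV heads, head dimension 128, 28 layers) and takes the rank-320 latent and the L8-L27 layer range from the Architecture Innovations section. It further assumes that each latent-attention layer caches only one rank-320 vector per token and that L0-L7 keep an ordinary GQA cache. The real cache layout of this model may differ, so treat the numbers as illustrative.

```python
# Back-of-envelope KV-cache estimate. All constants are assumptions, not measured values.
GQA_KV_HEADS = 2     # assumed: num_key_value_heads of the Qwen2.5-Coder-1.5B base
HEAD_DIM = 128       # assumed: hidden size 1536 / 12 attention heads
LATENT_RANK = 320    # latent rank quoted in this card
NUM_LAYERS = 28      # L0..L27
LATENT_LAYERS = 20   # L8..L27, assumed to cache only the latent vector

# A GQA layer stores keys and values for every KV head.
gqa_values_per_layer = 2 * GQA_KV_HEADS * HEAD_DIM
# A latent-attention layer is assumed to store one latent vector per token.
latent_values_per_layer = LATENT_RANK

baseline_per_token = NUM_LAYERS * gqa_values_per_layer
hybrid_per_token = (
    (NUM_LAYERS - LATENT_LAYERS) * gqa_values_per_layer
    + LATENT_LAYERS * latent_values_per_layer
)
saving = 1 - hybrid_per_token / baseline_per_token

print(f"baseline: {baseline_per_token} cached values per token")
print(f"hybrid:   {hybrid_per_token} cached values per token")
print(f"estimated saving: {saving:.1%}")
```

Under these assumptions the saving comes entirely from the latent layers, so it scales with how many layers use the low-rank cache and how small the rank is relative to `2 * GQA_KV_HEADS * HEAD_DIM`.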
|
|
| ## 🏗️ Architecture Innovations |
|
|
|  |
|
|
| Unlike standard Qwen2 models, this `Hybrid-v9` backbone features: |
| 1. **Asymmetric Layers:** |
| * **L0-L6:** Standard GQA (Grouped-Query Attention) for robust low-level feature extraction. |
| * **L7 (Shared Hub):** Generates a global latent vector $c_{kv}$ (Rank 320). |
| * **L8-L27:** Soft MLA (Multi-Head Latent Attention) with SVD-initialized low-rank projections. |
| 2. **Shared Latent Gate:** Deep layers can dynamically access the global latent vector from L7 via a learnable gating mechanism (`warmup_alpha`). |
| 3. **HybridCache & Attention Sinks:** Combines a sliding window (8192 tokens) with a 64-token attention sink, which is intended to keep generation stable on sequences far longer than the window while the cache size stays bounded. |
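
The layer layout and cache policy listed above can be sketched in a few lines. This is a minimal illustration, not the model's actual code: `layer_kind` and `kept_positions` are hypothetical helper names, and the sketch assumes the sink consists of the first 64 tokens and is kept in addition to the most recent 8192 tokens. The real `HybridCache` may count the sink inside the window or evict differently.

```python
from typing import List

# Illustrative constants taken from the list above.
NUM_LAYERS = 28      # L0..L27
SINK_TOKENS = 64     # attention-sink size
WINDOW = 8192        # sliding-window size


def layer_kind(idx: int) -> str:
    """Map a layer index to the attention type described above (hypothetical helper)."""
    if not 0 <= idx < NUM_LAYERS:
        raise ValueError(f"layer index out of range: {idx}")
    if idx <= 6:
        return "gqa"   # L0-L6: standard grouped-query attention
    if idx == 7:
        return "hub"   # L7: produces the shared latent vector
    return "mla"       # L8-L27: latent attention


def kept_positions(seq_len: int, sink: int = SINK_TOKENS, window: int = WINDOW) -> List[int]:
    """Return the token positions still cached after `seq_len` tokens.

    Assumption: the first `sink` positions are never evicted, and the most
    recent `window` positions are kept alongside them.
    """
    if seq_len <= sink + window:
        return list(range(seq_len))
    return list(range(sink)) + list(range(seq_len - window, seq_len))


if __name__ == "__main__":
    print([layer_kind(i) for i in (0, 6, 7, 8, 27)])
    print(len(kept_positions(100_000)))
```

Under this policy the cache never holds more than `sink + window` positions per layer, which is why memory stays flat as the sequence grows.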
|
|
| ## 🚀 Quick Start |
|
|
| **⚠️ IMPORTANT:** |
| This project is still in progress, and the current weights do not yet strike a good trade-off. |
| If I obtain new training results, I will update them here. |
|
|
| If you decide to test these imperfect weights anyway, please note: |
| Because this model uses a custom architecture, you **MUST** pass `trust_remote_code=True` when loading it. |
|
|
| ### Prerequisites |
| ```bash |
| pip install transformers torch |