| --- |
| license: mit |
| --- |
| library_name: transformers |
| tags: |
| - custom-code |
| - qwen2 |
| - mla |
| - gqa |
| - attention-sinks |
| license: apache-2.0 |
| language: |
| - en |
| - zh |
| --- |
| |
| # PyraCode-1.5B |
| |
| ## Model Overview |
| PyraCode-1.5B is a custom-architecture model based on `Qwen2.5-Coder-1.5B`. It introduces an **Asymmetric Hybrid Architecture (GQA + MLA)** with **Cross-Layer Shared Latent Gates** and **Attention Sinks**, which lets deep layers reuse features across layers and reduces the KV-cache memory footprint. |
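To make the KV-cache claim concrete, here is a back-of-the-envelope sketch of how many values one layer stores per token. It rests on assumptions this card does not state: that the base model uses 2 KV heads with a head dimension of 128 (the commonly published `Qwen2.5-Coder-1.5B` configuration), and that an MLA layer caches only a single latent of the rank quoted in the architecture list (320). The function names are hypothetical, and the real cache may hold extra tensors such as positional keys.

```python
# Rough per-layer, per-token KV-cache comparison.
# ASSUMPTIONS (not confirmed by the model card):
#   - base GQA layers use 2 KV heads with head_dim 128
#   - an MLA layer caches only one rank-320 latent per token

def gqa_values_per_token(num_kv_heads: int = 2, head_dim: int = 128) -> int:
    """Values cached by one GQA layer for one token (keys plus values)."""
    return 2 * num_kv_heads * head_dim


def latent_values_per_token(rank: int = 320) -> int:
    """Values cached by one latent-attention layer for one token."""
    return rank


def relative_saving(baseline: int, compressed: int) -> float:
    """Fraction of cache entries saved relative to the baseline."""
    return 1.0 - compressed / baseline


if __name__ == "__main__":
    gqa = gqa_values_per_token()
    mla = latent_values_per_token()
    print(f"GQA layer:    {gqa} values per token")
    print(f"Latent layer: {mla} values per token")
    print(f"Saving:       {relative_saving(gqa, mla):.1%} per latent layer")
```

Under these assumptions a latent layer stores 320 values per token where a GQA layer stores 512, a saving of 37.5% for each such layer. The actual figure depends on what the custom cache keeps.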
| |
| ## Architecture Innovations |
|  |
| |
| Unlike standard Qwen2 models, this `Hybrid-v9` backbone features: |
| 1. **Asymmetric Layers:** |
| * **L0-L6:** Standard GQA (Grouped-Query Attention) for robust low-level feature extraction. |
| * **L7 (Shared Hub):** Generates a global latent vector $c_{kv}$ (Rank 320). |
| * **L8-L27:** Soft MLA (Multi-Head Latent Attention) with SVD-initialized low-rank projections. |
| 2. **Shared Latent Gate:** Deep layers can dynamically access the global latent vector from L7 via a learnable gating mechanism (`warmup_alpha`). |
| 3. **HybridCache & Attention Sinks:** Combines an 8192-token sliding window with a 64-token attention sink, so the KV cache stays bounded and generation remains stable on sequences far longer than the window. |
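The layer layout and cache policy listed above can be sketched in a few lines of plain Python. This is an illustration only, not the model's implementation: `layer_kind` and `retained_positions` are hypothetical names, and the sketch assumes the 64 sink tokens are kept in addition to the 8192-token window (so up to 8256 positions are cached). The actual `HybridCache` may count them differently.

```python
# Illustrative sketch of the layer layout and the sink + sliding-window policy.
# ASSUMPTION: sink tokens are kept on top of the window, not inside it.

NUM_LAYERS = 28      # L0 .. L27, as listed in the model card
SINK_TOKENS = 64     # attention-sink size from the model card
WINDOW_TOKENS = 8192  # sliding-window size from the model card


def layer_kind(layer_idx: int) -> str:
    """Map a layer index to its attention role in the hybrid backbone."""
    if not 0 <= layer_idx < NUM_LAYERS:
        raise ValueError(f"layer index {layer_idx} is outside 0..{NUM_LAYERS - 1}")
    if layer_idx <= 6:
        return "gqa"          # L0-L6: standard grouped-query attention
    if layer_idx == 7:
        return "shared_hub"   # L7: produces the shared latent c_kv
    return "soft_mla"         # L8-L27: latent attention layers


def retained_positions(seq_len: int,
                       sink: int = SINK_TOKENS,
                       window: int = WINDOW_TOKENS) -> list[int]:
    """Token positions still cached after `seq_len` tokens have been seen."""
    if seq_len <= sink + window:
        return list(range(seq_len))  # nothing has been evicted yet
    # Keep the first `sink` tokens plus the most recent `window` tokens.
    return list(range(sink)) + list(range(seq_len - window, seq_len))


if __name__ == "__main__":
    print([layer_kind(i) for i in (0, 6, 7, 8, 27)])
    kept = retained_positions(10_000)
    print(len(kept), kept[:2], kept[64], kept[-1])
```

With these assumptions, a 10,000-token sequence keeps 8,256 positions: tokens 0 to 63 as sinks and tokens 1808 to 9999 as the sliding window.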
|
|
| ## Quick Start |
|
|
| **⚠️ IMPORTANT:** Because this model uses a custom architecture, you **MUST** pass `trust_remote_code=True` when loading it. |
|
|
| ### Prerequisites |
| ```bash |
| pip install transformers torch |