---
license: mit
datasets:
- tokyotech-llm/swallow-code-v2
language:
- en
- zh
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-Coder-1.5B
library_name: transformers
tags:
- Qwen
- HybridArch
- sinkAttention
- MLA
- GQA
---
# PyraCode-1.5B
## 🌟 Model Overview
PyraCode-1.5B is a custom-architecture model based on `Qwen2.5-Coder-1.5B`. It introduces an **Asymmetric Hybrid Architecture (GQA + MLA)** with **Cross-Layer Shared Latent Gates** and **Attention Sinks**, designed to enable efficient cross-layer feature communication and reduce the KV-cache memory footprint.
## 🏗️ Architecture Innovations

Unlike standard Qwen2 models, this `Hybrid-v9` backbone features:
1. **Asymmetric Layers:**
   * **L0-L6:** Standard GQA (Grouped-Query Attention) for robust low-level feature extraction.
   * **L7 (Shared Hub):** Generates a global latent vector $c_{kv}$ (Rank 320).
   * **L8-L27:** Soft MLA (Multi-Head Latent Attention) with SVD-initialized low-rank projections.
2. **Shared Latent Gate:** Deep layers can dynamically access the global latent vector from L7 via a learnable gating mechanism (`warmup_alpha`).
3. **HybridCache & Attention Sinks:** Implements a sliding window (8192 tokens) alongside a 64-token attention sink, intended to keep generation stable on sequences much longer than the window.
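The layer layout and cache policy listed above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual code: the names `layer_kind` and `kept_positions` are hypothetical, and the sketch assumes the 64 sink tokens are kept in addition to the 8192-token window, which the description above does not state explicitly.

```python
from typing import List

NUM_LAYERS = 28   # L0-L27, as listed above
HUB_LAYER = 7     # layer that generates the shared latent c_kv
WINDOW = 8192     # sliding-window size in tokens
SINK = 64         # attention-sink tokens kept from the start of the sequence


def layer_kind(idx: int) -> str:
    """Return the attention type the list above assigns to layer `idx`."""
    if not 0 <= idx < NUM_LAYERS:
        raise ValueError(f"layer index out of range: {idx}")
    if idx < HUB_LAYER:
        return "gqa"   # L0-L6: standard grouped-query attention
    if idx == HUB_LAYER:
        return "hub"   # L7: shared hub
    return "mla"       # L8-L27: soft multi-head latent attention


def kept_positions(seq_len: int, window: int = WINDOW, sink: int = SINK) -> List[int]:
    """Token positions still cached after `seq_len` tokens.

    Assumed policy: keep the first `sink` tokens plus the most recent
    `window` tokens, and evict everything in between.
    """
    if seq_len <= sink + window:
        return list(range(seq_len))  # nothing evicted yet
    return list(range(sink)) + list(range(seq_len - window, seq_len))
```

Under this assumption the number of cached positions is capped at 8192 + 64 = 8256, however long the sequence grows.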
## 🚀 Quick Start
**⚠️ IMPORTANT:**
This project is not yet complete, and the current weights are not yet a very good trade-off.
If I obtain new training results in the future, I will continue to update them here.
If you decide to test these imperfect weights, please be aware:
Because this model uses a custom architecture, you **MUST** pass `trust_remote_code=True` when loading it.
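A minimal loading sketch, assuming the prerequisites are installed and that the repository exposes its custom classes through the standard `transformers` Auto classes. The repository id below is a placeholder, not the real path, and the helper names `load_kwargs` and `load` are hypothetical.

```python
from typing import Any, Dict

# Placeholder repository id: replace with the actual Hub path of this model.
MODEL_ID = "your-namespace/PyraCode-1.5B"


def load_kwargs() -> Dict[str, Any]:
    """Keyword arguments required to load the custom architecture."""
    return {"trust_remote_code": True}


def load(model_id: str = MODEL_ID):
    """Load tokenizer and model, allowing the repo's custom code to run."""
    # Imported here so the sketch can be read and imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, **load_kwargs())
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", **load_kwargs()
    )
    return tokenizer, model


# Usage (downloads the weights, so it is left commented out):
# tokenizer, model = load()
```

Because `trust_remote_code=True` executes Python code shipped in the model repository, it is worth reviewing that code before loading.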
### Prerequisites
```bash
pip install transformers torch
```