---
license: mit
---
library_name: transformers
tags:
- custom-code
- qwen2
- mla
- gqa
- attention-sinks
license: apache-2.0
language:
- en
- zh
---

# PyraCode-1.5B

## 🌟 Model Overview
This is a custom-architecture model based on `Qwen2.5-Coder-1.5B`. It introduces an **Asymmetric Hybrid Architecture (GQA + MLA)** with **Cross-Layer Shared Latent Gates** and **Attention Sinks**, designed to let deep layers share features across layers and to reduce the KV-cache memory footprint.

## ๐Ÿ—๏ธ Architecture Innovations
*(่ฟ™้‡Œๆ’ๅ…ฅไฝ ็”จ picture.py ็”Ÿๆˆ็š„ๆžถๆž„ๅ›พ๏ผŒๅฏไปฅๆŠŠๅ›พ็‰‡ๆ‹–่ฟ› Hugging Face ็ฝ‘้กต็‰ˆ็š„็ผ–่พ‘ๆก†้‡Œ่‡ชๅŠจ็”Ÿๆˆ้“พๆŽฅ)*
![Hybrid Architecture](ๅกซๅ…ฅไฝ ็š„ๅ›พ็‰‡้“พๆŽฅ)

Unlike standard Qwen2 models, this `Hybrid-v9` backbone features:
1. **Asymmetric Layers:** 
   * **L0-L6:** Standard GQA (Grouped-Query Attention) for robust low-level feature extraction.
   * **L7 (Shared Hub):** Generates a global latent vector $c_{kv}$ (Rank 320).
   * **L8-L27:** Soft MLA (Multi-Head Latent Attention) with SVD-initialized low-rank projections.
2. **Shared Latent Gate:** Deep layers can dynamically access the global latent vector from L7 via a learnable gating mechanism (`warmup_alpha`).
3. **HybridCache & Attention Sinks:** Implements an 8192-token sliding window alongside a 64-token attention sink, so the KV cache stays bounded and generation remains stable on sequences far longer than the window.
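
The layer layout and cache policy listed above can be sketched in a few lines of plain Python. This is only an illustration, not the model's actual `HybridCache` code: the helper names `layer_kind` and `kept_positions` are hypothetical, and the sketch assumes the 64 sink tokens are kept in addition to the 8192-token window rather than counted inside it.

```python
# Illustrative sketch only; names and the sink/window accounting are assumptions.
NUM_LAYERS = 28     # layers L0-L27, as listed above
HUB_LAYER = 7       # L7 produces the shared latent vector c_kv
SINK_TOKENS = 64    # attention-sink size from the list above
WINDOW = 8192       # sliding-window size from the list above


def layer_kind(idx: int) -> str:
    """Return the attention type used by layer `idx` in the layout above."""
    if not 0 <= idx < NUM_LAYERS:
        raise ValueError(f"layer index {idx} is outside 0..{NUM_LAYERS - 1}")
    if idx < HUB_LAYER:
        return "gqa"
    if idx == HUB_LAYER:
        return "shared_hub"
    return "soft_mla"


def kept_positions(seq_len: int, sink: int = SINK_TOKENS, window: int = WINDOW) -> list[int]:
    """Token positions retained in the cache after `seq_len` tokens.

    Keeps the first `sink` positions plus the most recent `window` positions;
    everything in between is evicted once the sequence outgrows both.
    """
    if seq_len <= sink + window:
        return list(range(seq_len))
    return list(range(sink)) + list(range(seq_len - window, seq_len))


if __name__ == "__main__":
    print([layer_kind(i) for i in (0, 6, 7, 8, 27)])
    kept = kept_positions(10_000)
    print(len(kept), kept[:3], kept[-3:])
```

Under these assumptions the cache never holds more than `sink + window` entries per layer, no matter how long generation runs, which is what keeps memory bounded.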

## 🚀 Quick Start

**⚠️ IMPORTANT:** Because this model uses a custom architecture, you **MUST** pass `trust_remote_code=True` when loading it.

### Prerequisites
```bash
pip install transformers torch