# SWE-Pruner ONNX (code-pruner)
ONNX-converted version of [ayanami-kitasan/code-pruner](https://huggingface.co/ayanami-kitasan/code-pruner) for efficient CPU inference.
## Source
- **Original Model**: [ayanami-kitasan/code-pruner](https://huggingface.co/ayanami-kitasan/code-pruner) (safetensors)
- **Training Code**: [Ayanami1314/swe-pruner](https://github.com/Ayanami1314/swe-pruner)
## Architecture
- **Backbone**: Qwen/Qwen3-Reranker-0.6B (28 layers, hidden=1024)
- **Multi-layer Fusion**: Early (layer 7) + Middle (layer 14) + Final (layer 28) → fused_hidden=3072
- **Fusion**: 1-layer MultiheadAttention (8 heads) + LayerNorm
- **Compression Head**: CRF-style (LayerNorm → Linear(3072,256) → GELU → Linear(256,2))
- **Output**: `token_scores` — sigmoid scores per token (0-1, higher = keep)
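The fusion dimensions above can be sanity-checked with a short shape walk-through. This is an illustrative sketch of the assumed layout (random arrays standing in for real hidden states), not the exported graph: hidden states from layers 7, 14, and 28 are concatenated along the feature axis before the attention fusion block.

```python
import numpy as np

batch, seq_len, hidden = 1, 16, 1024
rng = np.random.default_rng(0)

# Stand-ins for the backbone hidden states at the three tapped layers
early = rng.standard_normal((batch, seq_len, hidden))   # layer 7
middle = rng.standard_normal((batch, seq_len, hidden))  # layer 14
final = rng.standard_normal((batch, seq_len, hidden))   # layer 28

# Concatenate along the feature axis -> input to the fusion attention block
fused = np.concatenate([early, middle, final], axis=-1)
print(fused.shape)  # (1, 16, 3072), matching fused_hidden=3072
```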
## Files
| File | Description |
|------|-------------|
| `model.onnx` | Quantized ONNX model (uint8, ~607MB) |
| `vocab.json` | BPE vocabulary (Qwen3 tokenizer) |
| `merges.txt` | BPE merge rules |
| `metadata.json` | Model metadata (token IDs, dimensions) |
| `crf_params.npz` | CRF transition parameters (optional, for Viterbi decoding) |
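For context on the optional `crf_params.npz`, here is a minimal sketch of binary Viterbi decoding over drop/keep states. The key names inside the archive, the 2×2 log-transition parameterization, and the example values are all assumptions for illustration, not the actual stored format:

```python
import numpy as np

def viterbi_binary(emissions, transitions):
    """Viterbi decode over 2 states (0 = drop, 1 = keep).

    emissions:   [seq_len, 2] per-token log-scores
    transitions: [2, 2] log transition weights
    """
    seq_len, n_states = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, n_states), dtype=np.int64)
    for t in range(1, seq_len):
        # cand[i, j] = score of arriving in state j from state i
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Backtrack from the best final state
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Transitions favoring self-loops produce contiguous keep/drop spans
trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
emis = np.log(np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]))
print(viterbi_binary(emis, trans))  # [0, 0, 1, 1]
```

Compared with per-token thresholding, the transition term smooths the output into contiguous spans, which tends to preserve syntactically complete code fragments.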
## Usage
```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")

# Token IDs and mask come from the Qwen3 BPE tokenizer (vocab.json + merges.txt)
input_ids = np.array([[...]], dtype=np.int64)       # [1, seq_len]
attention_mask = np.array([[...]], dtype=np.int64)  # [1, seq_len]

scores = sess.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})[0]
# scores: [1, seq_len] float32 in [0, 1]; higher = keep the token
```
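One hypothetical post-processing step is to threshold the scores and keep only the selected token positions. The 0.5 cutoff below is an illustrative choice, not a value prescribed by the model:

```python
import numpy as np

# Example scores as the model would return them: [1, seq_len] in [0, 1]
scores = np.array([[0.91, 0.12, 0.78, 0.34, 0.66]], dtype=np.float32)

threshold = 0.5                       # illustrative cutoff
keep_mask = scores[0] > threshold     # boolean mask over token positions
kept_positions = np.nonzero(keep_mask)[0]
print(kept_positions.tolist())        # [0, 2, 4]
```

The kept positions index back into `input_ids`, so the pruned sequence is recovered by gathering those tokens and decoding them with the tokenizer.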
## Conversion Details
- Exported with PyTorch 2.8 + transformers 4.57
- Opset version: 14
- Dynamic axes: batch and seq_len
- Quantized: dynamic uint8 quantization
- Causal mask patched for ONNX trace compatibility