# SWE-Pruner ONNX (code-pruner)

ONNX-converted version of [ayanami-kitasan/code-pruner](https://huggingface.co/ayanami-kitasan/code-pruner) for efficient CPU inference.

## Source

- **Original Model**: [ayanami-kitasan/code-pruner](https://huggingface.co/ayanami-kitasan/code-pruner) (safetensors)
- **Training Code**: [Ayanami1314/swe-pruner](https://github.com/Ayanami1314/swe-pruner)

## Architecture

- **Backbone**: Qwen/Qwen3-Reranker-0.6B (28 layers, hidden=1024)
- **Multi-layer Fusion**: Early (layer 7) + Middle (layer 14) + Final (layer 28) → fused_hidden=3072
- **Fusion**: 1-layer MultiheadAttention (8 heads) + LayerNorm
- **Compression Head**: CRF-style (LayerNorm → Linear(3072,256) → GELU → Linear(256,2))
- **Output**: `token_scores` — sigmoid scores per token (0–1, higher = keep)

## Files

| File | Description |
|------|-------------|
| `model.onnx` | Quantized ONNX model (uint8, ~607MB) |
| `vocab.json` | BPE vocabulary (Qwen3 tokenizer) |
| `merges.txt` | BPE merge rules |
| `metadata.json` | Model metadata (token IDs, dimensions) |
| `crf_params.npz` | CRF transition parameters (optional, for Viterbi decoding) |

## Usage

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("model.onnx")

input_ids = np.array([[...]], dtype=np.int64)       # [1, seq_len]
attention_mask = np.array([[...]], dtype=np.int64)  # [1, seq_len]

scores = sess.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})[0]
# scores: [1, seq_len] float32, 0-1 range, higher = keep
```

## Conversion Details

- Exported with PyTorch 2.8 + transformers 4.57
- Opset version: 14
- Dynamic axes: batch and seq_len
- Quantized: dynamic uint8 quantization
- Causal mask patched for ONNX trace compatibility
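Once you have `token_scores`, pruning is a matter of keeping the tokens whose score clears a cutoff. A minimal sketch of that post-processing step (the 0.5 threshold and the toy ids/scores below are illustrative assumptions, not documented defaults):

```python
import numpy as np

def prune_tokens(input_ids: np.ndarray, scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep only the tokens whose per-token score meets the threshold.

    input_ids: [seq_len] int64 token ids
    scores:    [seq_len] float32 scores in [0, 1] (higher = keep)
    """
    keep = scores >= threshold  # boolean mask over the sequence
    return input_ids[keep]

# Toy example with made-up ids and scores (threshold 0.5 is an assumption)
ids = np.array([101, 202, 303, 404], dtype=np.int64)
scores = np.array([0.9, 0.2, 0.8, 0.1], dtype=np.float32)
print(prune_tokens(ids, scores).tolist())  # → [101, 303]
```

The surviving ids can then be detokenized with the Qwen3 tokenizer to produce the pruned code.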
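The optional `crf_params.npz` enables structured decoding instead of independent thresholding: a 2-state (drop/keep) Viterbi pass that penalizes implausible keep/drop flips. A generic sketch of 2-state Viterbi decoding, assuming the archive holds a [2, 2] log-transition matrix (the key name and exact score convention are assumptions; the model card does not document the npz layout):

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Most likely label path for a linear-chain CRF.

    emissions:   [seq_len, 2] per-token log-scores for (drop, keep)
    transitions: [2, 2] log transition scores, transitions[i, j] = score of i -> j
    """
    seq_len, n_states = emissions.shape
    score = emissions[0].copy()                       # best score ending in each state
    backptr = np.zeros((seq_len, n_states), dtype=np.int64)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in state i at t-1, moving to j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Trace back the best path from the final state
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# With zero transitions this reduces to per-token argmax
emissions = np.log(np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], dtype=np.float32))
print(viterbi_decode(emissions, np.zeros((2, 2))))  # → [1, 0, 1]
```

In practice the emissions would come from the model's two-logit compression head (pre-sigmoid), and the transition matrix from `crf_params.npz`.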