---

library_name: setfit
license: mit
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
tags:
- setfit
- onnx
- attention-weights
- context-compression
- intent-classification
- multilingual
pipeline_tag: text-classification
---


# SetFit Multilingual OVR Router (ONNX with Attentions)

This is a **SetFit** model exported to **ONNX** format, trained to classify LLM tasks into three semantic categories: **Needle** (Fact Retrieval), **Reasoning** (Logic/Analysis), and **Summary** (General Recap).

The model is based on [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) and has been modified to expose **all 12 layers of raw attention weights**.

## Key Features

- **3-Class Classification:** High-precision separation of intents.
- **Multilingual:** Native support for Russian, English, and 50+ other languages.
- **Attention Output:** Every inference returns a full attention matrix `(batch, heads, seq_len, seq_len)` for all 12 layers; the sketch after this list shows how to inspect these outputs.
- **Dual Precision:** Both **FP32** (`model.onnx`) and **INT8 Quantized** (`model_quantized.onnx`) versions are available.
- **Optimized for CPU:** Fast ONNX inference via `onnxruntime`.
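
Output names and ordering depend on how the graph was exported, so it is worth listing them once before wiring the model into a pipeline. A minimal sketch, assuming `model.onnx` is in the current directory:

```python
import onnxruntime as ort

# Print every output of the exported graph: the token embeddings
# plus one attention tensor per layer should show up here.
session = ort.InferenceSession("model.onnx")
for out in session.get_outputs():
    print(out.name, out.shape)
```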

## Classification Map
- **Label 0:** Summary (Chatter, Recaps, TL;DR)
- **Label 1:** Needle (Pinpoint facts, parameters, keys, IPs)
- **Label 2:** Reasoning (Comparison, analysis, code debugging, logical chains)
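
For programmatic use, the same mapping can be kept as a plain lookup table; the dictionary below simply restates the labels above and is reused in the examples further down.

```python
# Label ids produced by the classification head, as listed above.
ID2LABEL = {
    0: "Summary",    # chatter, recaps, TL;DR
    1: "Needle",     # pinpoint facts, parameters, keys, IPs
    2: "Reasoning",  # comparison, analysis, code debugging, logical chains
}
```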

## Project Origin

This model is a core component of the **[WAMP-proxy](https://github.com/naranor/wamp-proxy)** project, an intelligent middleware for research into LLM context optimization.

## Quick Inference (Python)

```python
import json

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# 1. Load the ONNX model, the tokenizer, and the LogReg head weights
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained(".")
with open("router_weights_setfit.json", "r") as f:
    weights = json.load(f)

# 2. Prepare input
text = "What is the database port?"
inputs = tokenizer(text, return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}

# 3. Run the encoder; outputs[0] holds the token embeddings
outputs = session.run(None, onnx_inputs)

# Mean pooling over non-padding tokens only
mask = inputs["attention_mask"][..., None]
embeddings = (outputs[0] * mask).sum(axis=1) / mask.sum(axis=1)

# 4. Predict probabilities with the LogReg head (numerically stable softmax)
scores = np.dot(embeddings, np.array(weights["coef"]).T) + weights["intercept"]
scores -= scores.max(axis=-1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(f"Probabilities: {probs}")
```
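
The same pre- and post-processing works for the INT8 model. The sketch below is an assumption-level example: it presumes `model_quantized.onnx` exposes the same input/output signature as `model.onnx`, and it reuses `onnx_inputs`, `inputs`, `weights`, and the `ID2LABEL` dictionary defined earlier on this page.

```python
# INT8 inference: only the session changes, the head stays the same.
session_int8 = ort.InferenceSession("model_quantized.onnx")
outputs_int8 = session_int8.run(None, onnx_inputs)

# Masked mean pooling + LogReg head, as in the FP32 example above.
mask = inputs["attention_mask"][..., None]
emb = (outputs_int8[0] * mask).sum(axis=1) / mask.sum(axis=1)
scores = np.dot(emb, np.array(weights["coef"]).T) + weights["intercept"]
scores -= scores.max(axis=-1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print("Predicted intent:", ID2LABEL[int(probs.argmax(axis=-1)[0])])
```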

## License
MIT License.