---
library_name: setfit
license: mit
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
tags:
- setfit
- onnx
- attention-weights
- context-compression
- intent-classification
- multilingual
pipeline_tag: text-classification
---

# SetFit Multilingual OVR Router (ONNX with Attentions)

This is a **SetFit** model exported to **ONNX** format, trained to classify LLM tasks into three semantic categories: **Needle** (fact retrieval), **Reasoning** (logic/analysis), and **Summary** (general recap). The model is based on [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) and has been modified to expose **all 12 layers of raw attention weights**.

## Key Features

- **3-Class Classification:** High-precision separation of intents.
- **Multilingual:** Native support for Russian, English, and 50+ other languages.
- **Attention Output:** Every inference returns a full attention matrix `(batch, heads, seq_len, seq_len)` for each of the 12 layers.
- **Dual Precision:** Both **FP32** (`model.onnx`) and **INT8 quantized** (`model_quantized.onnx`) versions are available.
- **Optimized for CPU:** Fast ONNX inference via `onnxruntime`.

## Classification Map

- **Label 0:** Summary (chatter, recaps, TL;DR)
- **Label 1:** Needle (pinpoint facts, parameters, keys, IPs)
- **Label 2:** Reasoning (comparison, analysis, code debugging, logical chains)

## Project Origin

This model is a core component of the **[WAMP-proxy](https://github.com/naranor/wamp-proxy)** project, an intelligent middleware for research into LLM context optimization.

## Quick Inference (Python)

```python
import json

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# 1. Load the ONNX model and the logistic-regression head weights
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained(".")
with open("router_weights_setfit.json", "r") as f:
    weights = json.load(f)
# 2. Prepare input
text = "What is the database port?"
inputs = tokenizer(text, return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}

# 3. Run inference (outputs[0] holds the token embeddings)
outputs = session.run(None, onnx_inputs)

# Mask-aware mean pooling: exclude padding tokens from the average
mask = onnx_inputs["attention_mask"][:, :, None].astype(np.float32)
embeddings = (outputs[0] * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

# 4. Predict probabilities with the logistic-regression head (stable softmax)
scores = embeddings @ np.array(weights["coef"]).T + np.array(weights["intercept"])
scores -= scores.max(axis=-1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(f"Probabilities: {probs}")
```

## License

MIT License.
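## Working with the Attention Outputs

The attention matrices this model exposes can be reduced to per-token saliency scores, which is the typical starting point for context compression. The sketch below uses random stand-in tensors rather than real session outputs, and it assumes the 12 attention tensors follow the embeddings in the session's output list (verify the actual names via `session.get_outputs()`). Averaging the last layer over heads and query positions is one common heuristic, not necessarily the scoring WAMP-proxy itself uses.

```python
import numpy as np

# Stand-in for the 12 per-layer attention outputs; in real use these would
# come from `outputs[1:]` of the ONNX session (output ordering is an
# assumption -- inspect `session.get_outputs()` to confirm).
batch, heads, seq_len = 1, 12, 6
rng = np.random.default_rng(0)
attentions = [rng.random((batch, heads, seq_len, seq_len), dtype=np.float32)
              for _ in range(12)]
# Row-normalize so each query position's weights sum to 1, as real
# softmaxed attention does.
attentions = [a / a.sum(axis=-1, keepdims=True) for a in attentions]

# Token saliency: average attention *received* by each token, over the
# last layer's heads and all query positions.
last_layer = attentions[-1]              # (batch, heads, seq_len, seq_len)
saliency = last_layer.mean(axis=(1, 2))  # (batch, seq_len)

print(saliency.shape)                       # (1, 6)
print(np.allclose(saliency.sum(axis=-1), 1.0))  # rows are normalized, so True
```

Tokens with the highest saliency are the ones the final layer attends to most; a compressor can keep those spans and drop the rest.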