---
library_name: setfit
license: mit
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
tags:
- setfit
- onnx
- attention-weights
- context-compression
- intent-classification
- multilingual
pipeline_tag: text-classification
---

# SetFit Multilingual OVR Router (ONNX with Attentions)

This is a **SetFit** model exported to **ONNX** format and trained to classify LLM tasks into three semantic categories: **Needle** (Fact Retrieval), **Reasoning** (Logic/Analysis), and **Summary** (General Recap).

The model is based on [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) and has been modified to expose the raw attention weights of **all 12 layers**.

## Key Features

- **3-Class Classification:** Separates Needle, Reasoning, and Summary intents.
- **Multilingual:** Native support for Russian, English, and 50+ other languages.
- **Attention Output:** Every inference returns one attention tensor of shape `(batch, heads, seq_len, seq_len)` for each of the 12 layers.
- **Dual Precision:** Both **FP32** (`model.onnx`) and **INT8 Quantized** (`model_quantized.onnx`) versions are available.
- **Optimized for CPU:** Fast ONNX inference via `onnxruntime`.

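The per-layer attention tensors can be reduced to a per-token importance score, which is one possible way to use them for context compression. The following is a minimal sketch, not the project's actual method: it assumes the attention tensors are the session outputs that follow the token embeddings (for example `outputs[1:]`), which this card does not confirm. `token_importance` is a hypothetical helper, and averaging over layers and heads is only one of several reasonable reductions. The demo uses synthetic data so it runs without the model files.

```python
import numpy as np

def token_importance(attentions, attention_mask):
    """Average attention each token receives, over layers, heads and queries.

    attentions: list of arrays shaped (batch, heads, seq_len, seq_len)
    attention_mask: array shaped (batch, seq_len), 1 for real tokens
    Returns an array shaped (batch, seq_len) whose rows sum to 1.
    """
    stacked = np.stack(attentions, axis=0)      # (layers, batch, heads, q, k)
    per_pair = stacked.mean(axis=(0, 2))        # (batch, q, k)
    mask = attention_mask.astype(per_pair.dtype)
    # Mean over query positions, ignoring padded queries.
    received = (per_pair * mask[:, :, None]).sum(axis=1)
    received = received / mask.sum(axis=1, keepdims=True)
    received = received * mask                  # zero out padded keys
    return received / received.sum(axis=1, keepdims=True)

# Synthetic stand-in for 12 layers of attention (assumption: not real model output).
rng = np.random.default_rng(0)
fake = rng.random(size=(12, 1, 12, 5, 5))
fake /= fake.sum(axis=-1, keepdims=True)        # each attention row sums to 1
scores = token_importance(list(fake), np.ones((1, 5), dtype=np.int64))
print(scores.shape)
```

Tokens with the lowest scores would be the first candidates to drop when shortening a context, under the assumption that received attention tracks relevance.
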
## Classification Map

- **Label 0:** Summary (Chatter, Recaps, TL;DR)
- **Label 1:** Needle (Pinpoint facts, parameters, keys, IPs)
- **Label 2:** Reasoning (Comparison, analysis, code debugging, logical chains)

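The label map above can be applied to a probability vector as follows. This is a small sketch: `ID2LABEL` and `decode` are illustrative names, not part of the released files.

```python
import numpy as np

# Label ids as listed in the classification map above.
ID2LABEL = {0: "Summary", 1: "Needle", 2: "Reasoning"}

def decode(probs):
    """Return (label, confidence) for one probability vector of length 3."""
    probs = np.asarray(probs, dtype=float).ravel()
    idx = int(np.argmax(probs))
    return ID2LABEL[idx], float(probs[idx])

label, confidence = decode([0.1, 0.8, 0.1])
print(label, confidence)
```
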
## Project Origin

This model is a core component of the **[WAMP-proxy](https://github.com/naranor/wamp-proxy)** project, an intelligent middleware for research into LLM context optimization.

## Quick Inference (Python)

```python
import json

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# 1. Load the ONNX model, the tokenizer, and the logistic-regression head weights
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained(".")
with open("router_weights_setfit.json", "r") as f:
    weights = json.load(f)

# 2. Prepare the input
text = "What is the database port?"
inputs = tokenizer(text, return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}

# 3. Run the model; outputs[0] is pooled over the sequence axis
outputs = session.run(None, onnx_inputs)
embeddings = np.mean(outputs[0], axis=1)  # Mean pooling (single, unpadded input)

# 4. Predict probabilities (LogReg head)
scores = np.dot(embeddings, np.array(weights["coef"]).T) + np.array(weights["intercept"])
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
probs = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
print(f"Probabilities: {probs}")
```

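When several texts are tokenized together with padding (for example `tokenizer(texts, padding=True, return_tensors="np")`), a plain mean over the sequence axis would also average the padding positions, and probabilities have to be normalized per row. The sketch below shows mask-aware pooling and row-wise normalization on synthetic arrays. All three helper names are hypothetical. `ovr_probs` is an alternative to the softmax that assumes the head was trained one-vs-rest, as the model name suggests but this card does not state; it applies a sigmoid per class and then normalizes, which is how scikit-learn computes probabilities for one-vs-rest logistic regression.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Mean over real tokens only: (batch, seq, hidden) -> (batch, hidden)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ovr_probs(scores):
    """Per-class sigmoid followed by row normalization (one-vs-rest assumption)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return p / p.sum(axis=-1, keepdims=True)

# Synthetic batch: one sequence with two real tokens and one padding token.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(tokens, mask)
print(pooled)
print(softmax(np.array([[2.0, 0.5, -1.0]])))
```

In the quick-inference script, `masked_mean_pool(outputs[0], onnx_inputs["attention_mask"])` would take the place of `np.mean(outputs[0], axis=1)`; for a single unpadded input the two give the same result.
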
## License

MIT License.