---
tags:
- setfit
- sentence-transformers
- text-classification
- generated_from_setfit_trainer
pipeline_tag: text-classification
library_name: setfit
inference: true
base_model: sentence-transformers/paraphrase-mpnet-base-v2
---


# Onyx Information Content Classification using SetFit with sentence-transformers/paraphrase-mpnet-base-v2

This model is used by the [Onyx Enterprise Search](https://github.com/onyx-dot-app/onyx) system to identify whether a short
text segment contains information that is useful on its own for answering a RAG-style question.

It is based on the [SetFit](https://github.com/huggingface/setfit) approach, using [sentence-transformers/paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) as the Sentence Transformer embedding model. 
A trained [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.

The model has been trained using an efficient few-shot learning technique that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.
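The contrastive stage in step 1 works by turning a small labeled set into sentence pairs: segments with the same label become positive pairs, segments with different labels become negative pairs. As an illustration only (the `build_contrastive_pairs` helper and the example texts are hypothetical, not taken from the SetFit library or the Onyx training data), the pair construction can be sketched like this:

```python
import itertools

def build_contrastive_pairs(texts, labels):
    """Sketch of SetFit-style pair generation: every pair of examples
    becomes a training pair, labeled 1 if both share a class label
    (a positive pair) and 0 otherwise (a negative pair)."""
    pairs = []
    for (t1, l1), (t2, l2) in itertools.combinations(zip(texts, labels), 2):
        pairs.append((t1, t2, 1 if l1 == l2 else 0))
    return pairs

# Tiny hypothetical labeled set: 1 = informative segment, 0 = not
texts = ["Paris is in France", "Berlin is in Germany", "Hmm, okay."]
labels = [1, 1, 0]
pairs = build_contrastive_pairs(texts, labels)
```

The Sentence Transformer body is then fine-tuned so that positive pairs embed close together and negative pairs far apart, after which the logistic-regression head is trained on the resulting embeddings.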

## About Onyx

- **Website:** [Onyx](https://www.onyx.app/)
- **Repository:** [Open Source Gen-AI + Enterprise Search](https://github.com/onyx-dot-app/onyx)


## Model Details

### Core Model Description
- **Model Type:** SetFit
- **Sentence Transformer body:** [sentence-transformers/paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2)
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- **Maximum Sequence Length:** 512 tokens
- **Number of Classes:** 2 classes
- **Language:** English


### SetFit Resources

- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)


## Uses

### Use for Inference

The model is for use by the Onyx Enterprise Search system.

To test it locally, first install the SetFit library:

```bash
pip install setfit
```

Then you can load this model and run inference.

```python
from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("onyx-dot-app/information-content-model")
# Run inference: predicted class label
preds = model("Paris is in France")
# Alternatively, get per-class probabilities
pred_probability = model.predict_proba("Paris is in France")
```
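Since this is a binary classifier, a caller will typically turn the positive-class probability from `predict_proba` into a yes/no decision. A minimal sketch, assuming a hypothetical helper and a default 0.5 cutoff (the threshold actually used by Onyx is not stated here):

```python
def is_informative(prob_informative, threshold=0.5):
    """Decide whether a segment counts as informative, given the model's
    predicted probability for the positive class. The 0.5 threshold is
    an illustrative default, not a value taken from the Onyx codebase."""
    return prob_informative >= threshold

# Example with stand-in probabilities rather than real model output
assert is_informative(0.9)       # confidently informative
assert not is_informative(0.2)   # confidently uninformative
```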

### Framework Versions
- Python: 3.11.10
- SetFit: 1.1.1
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX (SetFit Approach)
```bibtex
@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}
```