Commit 33ce350 · 1 Parent(s): ee3c88c
Update README.md

README.md CHANGED
|
@@ -7,7 +7,9 @@ license: mit
|
|
| 7 |
arXiv link: https://arxiv.org/abs/2203.06875v2
|
| 8 |
Published in [**EMNLP 2022**](https://2022.emnlp.org/)
|
| 9 |
|
| 10 |
-
Our code is based on [SimCSE](https://github.com/princeton-nlp/SimCSE) and [P-tuning v2](https://github.com/THUDM/P-tuning-v2/). We sincerely thank the authors for their excellent work.
| 11 |
|
| 12 |
We have released our supervised and unsupervised models on Hugging Face, which achieve **Top 1** results on 1 domain-shifted STS task and 4 standard STS tasks:
|
| 13 |
|
|
@@ -27,31 +29,73 @@ We have released our supervised and unsupervised models on huggingface, which ac
|
|
| 27 |
|
| 28 |
<!-- <img src="https://github.com/YJiangcm/DCPCSE/blob/master/figure/leaderboard.png" width="700" height="380"> -->
|
| 29 |
|
|
|
|
| 30 |
| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|
| 31 |
|:-----------------------:|:-----:|:----------:|:---------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|
| 32 |
-
| unsup-
|
| 33 |
-
| sup-
|
| 34 |
-
| sup-
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
|
| 39 |
|
| 40 |
-
##
|
|
|
|
| 41 |
|
| 42 |
-
|
| 43 |
-
[](https://pytorch.org/get-started/previous-versions/)
|
| 44 |
|
| 45 |
-
| 46 |
|
|
|
|
| 47 |
```bash
|
| 48 |
-
pip install
| 49 |
```
|
| 50 |
| 51 |
## Train PromCSE
|
| 52 |
|
| 53 |
In the following section, we describe how to train a PromCSE model by using our code.
|
| 54 |
| 55 |
|
| 56 |
### Evaluation
|
| 57 |
[](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
|
@@ -180,84 +224,6 @@ All our experiments are conducted on Nvidia 3090 GPUs.
|
|
| 180 |
| Valid steps | 125 | 125 | 125 | 125 |
|
| 181 |
|
| 182 |
|
| 183 |
-
## Usage
|
| 184 |
-
We provide [tool.py](https://github.com/YJiangcm/PromCSE/blob/master/tool.py) which contains the following functions:
|
| 185 |
-
|
| 186 |
-
**(1) encode sentences into embedding vectors;
|
| 187 |
-
(2) compute cosine similarities between sentences;
|
| 188 |
-
(3) given queries, retrieve the top-k semantically similar sentences for each query.**
|
| 189 |
-
|
| 190 |
-
You can try it by running
|
| 191 |
-
```bash
|
| 192 |
-
python tool.py \
|
| 193 |
-
--model_name_or_path YuxinJiang/unsup-promcse-bert-base-uncased \
|
| 194 |
-
--pooler_type cls_before_pooler \
|
| 195 |
-
--pre_seq_len 16
|
| 196 |
-
```
|
| 197 |
-
|
| 198 |
-
which is expected to output the following results.
|
| 199 |
-
```
|
| 200 |
-
=========Calculate cosine similarities between queries and sentences============
|
| 201 |
-
|
| 202 |
-
100%|██████████| 1/1 [00:00<00:00, 1.18it/s]
100%|██████████| 1/1 [00:00<00:00, 42.26it/s]
[[0.5904227 0.70516586 0.65185255 0.82756 0.6969594 0.85966974
|
| 203 |
-
0.58715546 0.8467339 0.6583321 0.6792214 ]
|
| 204 |
-
[0.6125869 0.73508096 0.61479807 0.6182762 0.6161849 0.59476817
|
| 205 |
-
0.595963 0.61386335 0.694822 0.938746 ]]
|
| 206 |
-
|
| 207 |
-
=========Naive brute force search============
|
| 208 |
-
|
| 209 |
-
2022-10-09 11:59:06,004 : Encoding embeddings for sentences...
|
| 210 |
-
100%|██████████| 1/1 [00:00<00:00, 46.03it/s]
2022-10-09 11:59:06,029 : Building index...
|
| 211 |
-
2022-10-09 11:59:06,029 : Finished
|
| 212 |
-
100%|██████████| 1/1 [00:00<00:00, 95.40it/s]
100%|██████████| 1/1 [00:00<00:00, 115.25it/s]
Retrieval results for query: A man is playing music.
|
| 213 |
-
A man plays the piano. (cosine similarity: 0.8597)
|
| 214 |
-
A man plays a guitar. (cosine similarity: 0.8467)
|
| 215 |
-
A man plays the violin. (cosine similarity: 0.8276)
|
| 216 |
-
A woman is reading. (cosine similarity: 0.7051)
|
| 217 |
-
A man is eating food. (cosine similarity: 0.6969)
|
| 218 |
-
A woman is taking a picture. (cosine similarity: 0.6792)
|
| 219 |
-
A woman is slicing a meat. (cosine similarity: 0.6583)
|
| 220 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6518)
|
| 221 |
-
|
| 222 |
-
Retrieval results for query: A woman is making a photo.
|
| 223 |
-
A woman is taking a picture. (cosine similarity: 0.9387)
|
| 224 |
-
A woman is reading. (cosine similarity: 0.7351)
|
| 225 |
-
A woman is slicing a meat. (cosine similarity: 0.6948)
|
| 226 |
-
A man plays the violin. (cosine similarity: 0.6183)
|
| 227 |
-
A man is eating food. (cosine similarity: 0.6162)
|
| 228 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6148)
|
| 229 |
-
A man plays a guitar. (cosine similarity: 0.6139)
|
| 230 |
-
An animal is biting a persons finger. (cosine similarity: 0.6126)
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
=========Search with Faiss backend============
|
| 234 |
-
|
| 235 |
-
2022-10-09 11:59:06,055 : Loading faiss with AVX2 support.
|
| 236 |
-
2022-10-09 11:59:06,092 : Successfully loaded faiss with AVX2 support.
|
| 237 |
-
2022-10-09 11:59:06,093 : Encoding embeddings for sentences...
|
| 238 |
-
100%|██████████| 1/1 [00:00<00:00, 4.17it/s]
2022-10-09 11:59:06,335 : Building index...
|
| 239 |
-
2022-10-09 11:59:06,335 : Use GPU-version faiss
|
| 240 |
-
2022-10-09 11:59:06,447 : Finished
|
| 241 |
-
100%|██████████| 1/1 [00:00<00:00, 101.44it/s]
Retrieval results for query: A man is playing music.
|
| 242 |
-
A man plays the piano. (cosine similarity: 0.8597)
|
| 243 |
-
A man plays a guitar. (cosine similarity: 0.8467)
|
| 244 |
-
A man plays the violin. (cosine similarity: 0.8276)
|
| 245 |
-
A woman is reading. (cosine similarity: 0.7052)
|
| 246 |
-
A man is eating food. (cosine similarity: 0.6970)
|
| 247 |
-
A woman is taking a picture. (cosine similarity: 0.6792)
|
| 248 |
-
A woman is slicing a meat. (cosine similarity: 0.6583)
|
| 249 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6519)
|
| 250 |
-
|
| 251 |
-
Retrieval results for query: A woman is making a photo.
|
| 252 |
-
A woman is taking a picture. (cosine similarity: 0.9387)
|
| 253 |
-
A woman is reading. (cosine similarity: 0.7351)
|
| 254 |
-
A woman is slicing a meat. (cosine similarity: 0.6948)
|
| 255 |
-
A man plays the violin. (cosine similarity: 0.6183)
|
| 256 |
-
A man is eating food. (cosine similarity: 0.6162)
|
| 257 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6148)
|
| 258 |
-
A man plays a guitar. (cosine similarity: 0.6139)
|
| 259 |
-
An animal is biting a persons finger. (cosine similarity: 0.6126)
|
| 260 |
-
```
|
| 261 |
|
| 262 |
|
| 263 |
## Citation
|
|
|
|
| 7 |
arXiv link: https://arxiv.org/abs/2203.06875v2
|
| 8 |
Published in [**EMNLP 2022**](https://2022.emnlp.org/)
|
| 9 |
|
| 10 |
+
Our code is based on [SimCSE](https://github.com/princeton-nlp/SimCSE) and [P-tuning v2](https://github.com/THUDM/P-tuning-v2/). We sincerely thank the authors for their excellent work.
|
| 11 |
+
|
| 12 |
+
## Model List
|
| 13 |
|
| 14 |
We have released our supervised and unsupervised models on Hugging Face, which achieve **Top 1** results on 1 domain-shifted STS task and 4 standard STS tasks:
|
| 15 |
|
|
|
|
| 29 |
|
| 30 |
<!-- <img src="https://github.com/YJiangcm/DCPCSE/blob/master/figure/leaderboard.png" width="700" height="380"> -->
|
| 31 |
|
| 32 |
+
|
| 33 |
| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|
| 34 |
|:-----------------------:|:-----:|:----------:|:---------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|
| 35 |
+
| [YuxinJiang/unsup-promcse-bert-base-uncased](https://huggingface.co/YuxinJiang/unsup-promcse-bert-base-uncased) | 73.03 |85.18| 76.70| 84.19 |79.69| 80.62| 70.00| 78.49|
|
| 36 |
+
| [YuxinJiang/sup-promcse-roberta-base](https://huggingface.co/YuxinJiang/sup-promcse-roberta-base) | 76.75 |85.86| 80.98| 86.51 |83.51| 86.58| 80.41| 82.94|
|
| 37 |
+
| [YuxinJiang/sup-promcse-roberta-large](https://huggingface.co/YuxinJiang/sup-promcse-roberta-large) | 79.14 |88.64| 83.73| 87.33 |84.57| 87.84| 82.07| 84.76|
|
| 38 |
|
| 39 |
+
**Naming rules**: `unsup` and `sup` represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.
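As a quick sanity check on the table above, the `Avg.` column appears to be the plain arithmetic mean of the seven task scores in each row. The sketch below recomputes it from the numbers copied out of the table; the variable and function names (`scores`, `reported_avg`, `mean_score`) are illustrative only and are not part of the `promcse` package.

```python
# Scores copied from the table above, in column order:
# STS12, STS13, STS14, STS15, STS16, STS-B, SICK-R.
scores = {
    "YuxinJiang/unsup-promcse-bert-base-uncased": [73.03, 85.18, 76.70, 84.19, 79.69, 80.62, 70.00],
    "YuxinJiang/sup-promcse-roberta-base": [76.75, 85.86, 80.98, 86.51, 83.51, 86.58, 80.41],
    "YuxinJiang/sup-promcse-roberta-large": [79.14, 88.64, 83.73, 87.33, 84.57, 87.84, 82.07],
}

# The Avg. column as reported in the table.
reported_avg = {
    "YuxinJiang/unsup-promcse-bert-base-uncased": 78.49,
    "YuxinJiang/sup-promcse-roberta-base": 82.94,
    "YuxinJiang/sup-promcse-roberta-large": 84.76,
}


def mean_score(values):
    """Arithmetic mean of the task scores, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


computed = {name: mean_score(values) for name, values in scores.items()}

for name, value in computed.items():
    print(f"{name}: computed {value:.2f}, reported {reported_avg[name]:.2f}")
```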
|
| 40 |
|
| 41 |
|
| 42 |
|
| 43 |
+
## Usage
|
| 44 |
+
[](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
| 45 |
|
| 46 |
+
We provide an easy-to-use Python package `promcse` which contains the following functions:
|
|
|
|
| 47 |
|
| 48 |
+
**(1) encode sentences into embedding vectors;
|
| 49 |
+
(2) compute cosine similarities between sentences;
|
| 50 |
+
(3) given queries, retrieve the top-k semantically similar sentences for each query.**
|
| 51 |
|
| 52 |
+
To use the tool, first install the `promcse` package from [PyPI](https://pypi.org/project/promcse/):
|
| 53 |
```bash
|
| 54 |
+
pip install promcse
|
| 55 |
+
```
|
| 56 |
+
After installing the package, you can load our model with two lines of code:
|
| 57 |
+
```python
|
| 58 |
+
from promcse import PromCSE
|
| 59 |
+
model = PromCSE("YuxinJiang/unsup-promcse-bert-base-uncased", "cls_before_pooler", 16)
|
| 60 |
+
# model = PromCSE("YuxinJiang/sup-promcse-roberta-base")
|
| 61 |
+
# model = PromCSE("YuxinJiang/sup-promcse-roberta-large")
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
Then you can use our model for **encoding sentences into embeddings**
|
| 65 |
+
```python
|
| 66 |
+
embeddings = model.encode("A woman is reading.")
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
**Compute the cosine similarities** between two groups of sentences
|
| 70 |
+
```python
|
| 71 |
+
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
|
| 72 |
+
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
|
| 73 |
+
similarities = model.similarity(sentences_a, sentences_b)
|
| 74 |
+
```
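
For readers unfamiliar with the metric, the sketch below shows what a pairwise cosine similarity between two groups of vectors looks like in plain Python. It is not the `promcse` implementation: the helper names (`cosine_similarity`, `similarity_matrix`) and the toy 3-dimensional vectors are assumptions made for illustration, and the return type of `model.similarity` is not specified here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(group_a, group_b):
    """Pairwise similarities: one row per vector in group_a, one column per vector in group_b."""
    return [[cosine_similarity(u, v) for v in group_b] for u in group_a]


# Toy "embeddings" made up for illustration; real sentence embeddings
# would come from the model and have far more dimensions.
group_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
group_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

matrix = similarity_matrix(group_a, group_b)
for row in matrix:
    print([round(value, 4) for value in row])
```

Identical directions score 1, orthogonal directions score 0, and the measure ignores vector length, which is why it is a common choice for comparing sentence embeddings.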
|
| 75 |
+
|
| 76 |
+
Or build an index for a group of sentences and **search** among them:
|
| 77 |
+
```python
|
| 78 |
+
sentences = ['A woman is reading.', 'A man is playing a guitar.']
|
| 79 |
+
model.build_index(sentences)
|
| 80 |
+
results = model.search("He plays guitar.")
|
| 81 |
```
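
Conceptually, a search of this kind ranks every indexed sentence by its cosine similarity to the query and keeps the best matches. The following is a minimal brute-force sketch under stated assumptions: the `top_k` helper, the `(sentence, vector)` corpus layout, and the hand-written 3-dimensional vectors are all hypothetical stand-ins, not the package's actual index or output format.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k(query_vec, corpus, k=3):
    """Return the k (sentence, score) pairs most similar to query_vec.

    `corpus` is a list of (sentence, vector) pairs. Every entry is scored,
    so the cost grows linearly with the corpus size.
    """
    scored = [(sentence, cosine(query_vec, vec)) for sentence, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real sentence embeddings.
corpus = [
    ("A woman is reading.", [0.1, 0.9, 0.2]),
    ("A man is playing a guitar.", [0.9, 0.1, 0.3]),
    ("A man plays the piano.", [0.8, 0.2, 0.4]),
]
query = [1.0, 0.0, 0.3]  # stand-in for the embedding of "He plays guitar."

results = top_k(query, corpus, k=2)
for sentence, score in results:
    print(f"{sentence} (cosine similarity: {score:.4f})")
```

Scoring every sentence is fine for small collections; for large ones, an approximate nearest-neighbour library is the usual alternative.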
|
| 82 |
|
| 83 |
+
|
| 84 |
+
|
| 85 |
## Train PromCSE
|
| 86 |
|
| 87 |
In the following section, we describe how to train a PromCSE model by using our code.
|
| 88 |
|
| 89 |
+
### Setups
|
| 90 |
+
|
| 91 |
+
[](https://www.python.org/downloads/release/python-382/)
|
| 92 |
+
[](https://pytorch.org/get-started/previous-versions/)
|
| 93 |
+
|
| 94 |
+
Run the following script to install the remaining dependencies:
|
| 95 |
+
|
| 96 |
+
```bash
|
| 97 |
+
pip install -r requirements.txt
|
| 98 |
+
```
|
| 99 |
|
| 100 |
### Evaluation
|
| 101 |
[](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
|
|
|
| 224 |
| Valid steps | 125 | 125 | 125 | 125 |
|
| 225 |
|
| 226 |
| 227 |
|
| 228 |
|
| 229 |
## Citation
|