| | --- |
| | license: mit |
| | language: |
| | - zh |
| | pipeline_tag: sentence-similarity |
| | --- |
| | |
| |
|
| | # PromCSE(sup) |
| |
|
| |
|
| |
|
| |
|
| | ## Data List |
| | The following datasets are all in Chinese. |
| | | Data | size(train) | size(valid) | size(test) | |
| | |:----------------------:|:----------:|:----------:|:----------:| |
| | | [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c) | 62477| 20000| 20000| |
| | | [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9) | 100000| 10000| 10000| |
| | | [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w ) | 238766| 8802| 12500| |
| | | [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn) | 49401| 2000| 2000| |
| | | [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y) | 5231| 1458| 1361| |
| | | [*SNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v) | 146828| 2699| 2618| |
| | | [*MNLI*](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte) | 122547| 2932| 2397| |
| |
|
| |
|
| |
|
| |
|
| | ## Model List |
| | The evaluation dataset is in Chinese, and we used the same language model **RoBERTa base** on different methods. In addition, considering that the test set of some datasets is small, which may lead to a large deviation in evaluation accuracy, the evaluation data here uses train, valid and test at the same time, and the final evaluation result adopts the **weighted average (w-avg)** method. |
| | | Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. | |
| | |:-----------------------:|:------------:|:-----------:|:----------|:-------------|:------------:|:----------:| |
| | | BERT-Whitening | 65.27| -| -| -| -| -| |
| | | SimBERT | 70.01| -| -| -| -| -| |
| | | SBERT-Whitening | 71.75| -| -| -| -| -| |
| | | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | 78.61| -| -| -| -| -| |
| | | [hellonlp/simcse-base-zh](https://huggingface.co/hellonlp/simcse-roberta-base-zh) | 80.96| -| -| -| -| -| |
| | | [hellonlp/promcse-base-zh](https://huggingface.co/hellonlp/promcse-bert-base-zh) | **81.57**| -| -| -| -| -| |
| |
|
| |
|
| |
|
| |
|
| | ## Uses |
| | To use the tool, first install the `promcse` package from [PyPI](https://pypi.org/project/promcse/) |
| | ```bash |
| | pip install promcse |
| | ``` |
| |
|
| | After installing the package, you can load our model by two lines of code |
| | ```python |
| | from promcse import PromCSE |
| | model = PromCSE("hellonlp/promcse-bert-base-zh", "cls", 10) |
| | ``` |
| |
|
| | Then you can use our model for encoding sentences into embeddings |
| | ```python |
| | embeddings = model.encode("武汉是一个美丽的城市。") |
| | print(embeddings.shape) |
| | #torch.Size([768]) |
| | ``` |
| |
|
| | Compute the cosine similarities between two groups of sentences |
| | ```python |
| | sentences_a = ['你好吗'] |
| | sentences_b = ['你怎么样','我吃了一个苹果','你过的好吗','你还好吗','你', |
| | '你好不好','你好不好呢','我不开心','我好开心啊', '你吃饭了吗', |
| | '你好吗','你现在好吗','你好个鬼'] |
| | similarities = model.similarity(sentences_a, sentences_b) |
| | print(similarities) |
| | # [(1.0, '你好吗'), |
| | # (0.9167, '你好不好'), |
| | # (0.8956, '你好不好呢'), |
| | # (0.8431, '你还好吗'), |
| | # (0.7919, '你怎么样'), |
| | # (0.7649, '你现在好吗'), |
| | # (0.7458, '你过的好吗'), |
| | # (0.6844, '你好个鬼'), |
| | # (0.6177, '你'), |
| | # (0.5654, '你吃饭了吗'), |
| | # (0.3612, '我好开心啊'), |
| | # (0.1875, '我不开心'), |
| | # (0.0866, '我吃了一个苹果')] |
| | ``` |