Tao committed on
Commit 8bc9e45 · 1 Parent(s): 4011276

Add SetFit model

Files changed (3)
  1. README.md +62 -25
  2. model_head.pkl +2 -2
  3. pytorch_model.bin +1 -1
README.md CHANGED
@@ -5,40 +5,38 @@ tags:
5
  - feature-extraction
6
  - sentence-similarity
7
  - transformers
8
- - semantic-search
9
- - chinese
10
- ---
11
 
12
- # DMetaSoul/sbert-chinese-general-v2
13
 
14
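- This model is based on the [bert-base-chinese](https://huggingface.co/bert-base-chinese) BERT model and was trained on [SimCLUE](https://github.com/CLUEbenchmark/SimCLUE), a semantic-similarity dataset with millions of examples. It is suited to **general-purpose semantic matching**, and judging by the results it **generalizes better** across tasks.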
15
 
16
- Note: a [lightweight version](https://huggingface.co/DMetaSoul/sbert-chinese-general-v2-distill) of this model has also been open-sourced!
17
 
18
- # Usage
19
 
20
- ## 1. Sentence-Transformers
21
 
22
- To use this model through the [sentence-transformers](https://www.SBERT.net) framework, first install it:
23
 
24
  ```
25
  pip install -U sentence-transformers
26
  ```
27
 
28
- Then use the code below to load the model and extract text embedding vectors:
29
 
30
  ```python
31
  from sentence_transformers import SentenceTransformer
32
- sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
33
 
34
- model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')
35
  embeddings = model.encode(sentences)
36
  print(embeddings)
37
  ```
38
 
39
- ## 2. HuggingFace Transformers
40
 
41
- If you prefer not to use [sentence-transformers](https://www.SBERT.net), you can also load the model with HuggingFace Transformers and extract text embeddings:
 
 
42
 
43
  ```python
44
  from transformers import AutoTokenizer, AutoModel
@@ -53,11 +51,11 @@ def mean_pooling(model_output, attention_mask):
53
 
54
 
55
  # Sentences we want sentence embeddings for
56
- sentences = ["我的儿子!他猛然间喊道,我的儿子在哪儿?", "我的儿子呢!他突然喊道,我的儿子在哪里?"]
57
 
58
  # Load model from HuggingFace Hub
59
- tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v2')
60
- model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v2')
61
 
62
  # Tokenize sentences
63
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -73,17 +71,56 @@ print("Sentence embeddings:")
73
  print(sentence_embeddings)
74
  ```
75
 
76
- ## Evaluation
77
 
78
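- The model was evaluated on several public semantic-matching datasets by computing the correlation coefficient between embedding similarity and the ground-truth labels: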
79
 
80
- | | **csts_dev** | **csts_test** | **afqmc** | **lcqmc** | **bqcorpus** | **pawsx** | **xiaobu** |
- | ---------------------------- | ------------ | ------------- | ---------- | ---------- | ------------ | ---------- | ---------- |
- | **sbert-chinese-general-v1** | **84.54%** | **82.17%** | 23.80% | 65.94% | 45.52% | 11.52% | 48.51% |
- | **sbert-chinese-general-v2** | 77.20% | 72.60% | **36.80%** | **76.92%** | **49.63%** | **16.24%** | **63.16%** |
 
84
 
85
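- The table compares this model with our earlier release, [sbert-chinese-general-v1](https://huggingface.co/DMetaSoul/sbert-chinese-general-v1). This model generalizes better across multiple tasks: it scores higher on afqmc, lcqmc, bqcorpus, pawsx, and xiaobu, while v1 scores higher on csts_dev and csts_test.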
86
 
87
  ## Citing & Authors
88
 
89
- E-mail: xiaowenbin@dmetasoul.com
 
5
  - feature-extraction
6
  - sentence-similarity
7
  - transformers
 
 
 
8
 
9
+ ---
10
 
11
+ # {MODEL_NAME}
12
 
13
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
14
 
15
+ <!--- Describe your model here -->
16
 
17
+ ## Usage (Sentence-Transformers)
18
 
19
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
20
 
21
  ```
22
  pip install -U sentence-transformers
23
  ```
24
 
25
+ Then you can use the model like this:
26
 
27
  ```python
28
  from sentence_transformers import SentenceTransformer
29
+ sentences = ["This is an example sentence", "Each sentence is converted"]
30
 
31
+ model = SentenceTransformer('{MODEL_NAME}')
32
  embeddings = model.encode(sentences)
33
  print(embeddings)
34
  ```
35
 
 
36
 
37
+
38
+ ## Usage (HuggingFace Transformers)
39
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
40
 
41
  ```python
42
  from transformers import AutoTokenizer, AutoModel
 
51
 
52
 
53
  # Sentences we want sentence embeddings for
54
+ sentences = ['This is an example sentence', 'Each sentence is converted']
55
 
56
  # Load model from HuggingFace Hub
57
+ tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
58
+ model = AutoModel.from_pretrained('{MODEL_NAME}')
59
 
60
  # Tokenize sentences
61
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
71
  print(sentence_embeddings)
72
  ```
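
The pooling step mentioned above can be illustrated in isolation. The README's own `mean_pooling` function is not shown in this diff, so the following is only a dependency-free sketch of masked mean pooling for a single sentence, assuming one vector per token and a 0/1 attention mask. The name `mean_pool` and the toy inputs are hypothetical.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding tokens."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input so we never divide by zero.
    count = max(count, 1)
    return [total / count for total in summed]


# Toy input: two real tokens and one padding token (mask 0) that must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

In practice the same computation is done on batched tensors, but the idea is identical: padding positions contribute nothing to the sum or to the token count.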
73
 
 
74
 
 
75
 
76
+ ## Evaluation Results
77
+
78
+ <!--- Describe how your model was evaluated -->
79
+
80
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
81
 
82
+
83
+ ## Training
84
+ The model was trained with the parameters:
85
+
86
+ **DataLoader**:
87
+
88
+ `torch.utils.data.dataloader.DataLoader` of length 765 with parameters:
89
+ ```
90
+ {'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
91
+ ```
92
+
93
+ **Loss**:
94
+
95
+ `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
96
+
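
As a rough illustration of what a cosine-similarity loss optimizes, the sketch below computes the mean squared error between the cosine similarity of two embeddings and a target label. It is a plain-Python approximation under the assumption that the loss uses a squared-error objective with no extra score transformation; the function names and toy vectors are hypothetical and this is not the library's implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(pairs: List[Tuple[Vector, Vector]], labels: List[float]) -> float:
    """Mean squared error between each pair's cosine similarity and its target label."""
    errors = [
        (cosine_similarity(a, b) - label) ** 2
        for (a, b), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Toy batch: one identical pair (similarity 1) and one orthogonal pair (similarity 0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosine_similarity_loss(pairs, [1.0, 0.0]))  # labels match the similarities
print(cosine_similarity_loss(pairs, [0.0, 0.0]))  # first label is wrong
```

The loss is zero when every label equals the pair's cosine similarity and grows as the embeddings disagree with the labels.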
97
+ Parameters of the `fit()` method:
98
+ ```
+ {
+     "epochs": 1,
+     "evaluation_steps": 0,
+     "evaluator": "NoneType",
+     "max_grad_norm": 1,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 2e-05
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": 765,
+     "warmup_steps": 77,
+     "weight_decay": 0.01
+ }
+ ```
114
+
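
To make the scheduler settings above concrete, here is a minimal sketch of a warmup-then-linear-decay learning-rate schedule. It assumes that `WarmupLinear` ramps the rate linearly from 0 to `lr` over `warmup_steps` and then decays it linearly to 0 at the final step, with the total step count taken as `epochs * steps_per_epoch` (1 × 765). The function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(
    step: int,
    base_lr: float = 2e-5,
    warmup_steps: int = 77,
    total_steps: int = 765,
) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at the start, end of warmup, midpoint, and final step.
for step in (0, 77, 421, 765):
    print(step, warmup_linear_lr(step))
```

With these values the warmup covers roughly the first 10% of training (77 of 765 steps).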
115
+
116
+ ## Full Model Architecture
117
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+ )
+ ```
123
 
124
  ## Citing & Authors
125
 
126
+ <!--- Describe where people can find more information -->
model_head.pkl CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:295e6ff83ee8d3109f85127d6fafc14883f6ca76a1c78115508afd2329f21a50
3
- size 394
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6505755c03d5e772300a5208c3ea52bb50e81810662e37acd96630eefa83e065
3
+ size 31758
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ac3038f0a407cc45b3fa5685e47540b7f5a833e741ddf7e915017f96b8405f7d
3
  size 409138989
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3bebf8d13a2547ddd5c999f12ca8ac4ebce0181ead246c93d7061b3fa54011d1
3
  size 409138989