peanderson committed
Commit ff649e3 · 1 Parent(s): 75cec6c

First model commit
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
README.md CHANGED
@@ -1,3 +1,126 @@
  ---
  license: mit
+ language:
+ - en
+ base_model:
+ - intfloat/multilingual-e5-base
  ---
+ ## BAM Embeddings (multilingual-e5-base)
+
+ Text embeddings specialized for retrieval in the finance domain.
+
+ [Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt](https://aclanthology.org/2024.emnlp-industry.26.pdf).
+ Peter Anderson, Mano Vikash Janardhanan, Jason He, Wei Cheng, Charlie Flanagan. EMNLP 2024.
+
+ This model has 12 layers and an embedding size of 768.
+
+ ## Usage
+
+ Below is an example of encoding queries and passages for text retrieval.
+
+ ```python
+ import torch.nn.functional as F
+ from torch import Tensor
+ from transformers import AutoTokenizer, AutoModel
+
+
+ def average_pool(last_hidden_states: Tensor,
+                  attention_mask: Tensor) -> Tensor:
+     # Zero out padding positions, then mean-pool over the sequence dimension.
+     last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
+     return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
+
+
+ # Each input text should start with "query: " or "passage: ", even for non-English texts.
+ # For tasks other than retrieval, you can simply use the "query: " prefix.
+ input_texts = [
+     "query: What is a callback provision?",
+     "query: EverCommerce revenue headwinds",
+     "passage: Beazley PLC/ADR - But they're saying, do you confirm prior to issuing an invoice that this is the correct, or prior to paying an invoice that this is the correct...",
+     "passage: EverCommerce Inc\nWe are assuming coverage of EverCommerce, which is among the leading SaaS platforms in the services sector for SMBs..."
+ ]
+
+ tokenizer = AutoTokenizer.from_pretrained('BalyasnyAI/multilingual-e5-base')
+ model = AutoModel.from_pretrained('BalyasnyAI/multilingual-e5-base')
+
+ # Tokenize the input texts
+ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
+
+ outputs = model(**batch_dict)
+ embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
+
+ # Normalize embeddings, then score each query against each passage
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ scores = (embeddings[:2] @ embeddings[2:].T) * 100
+ print(scores.tolist())
+ ```
+
+ ## Supported Languages
+
+ This model is initialized from [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
+ and finetuned on English datasets. Other languages may see lower performance.
+
+ ## Training Details
+
+ **Initialization**: [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
+
+ **Finetuning**: contrastive loss with synthetically generated queries and hard negatives
+
+ | Dataset              | Weak supervision                | # of text pairs |
+ |----------------------|---------------------------------|-----------------|
+ | BAM internal dataset | (text passage, synthetic query) | 14.3M           |
+
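The finetuning objective above can be sketched as an InfoNCE-style contrastive loss over (query, passage) pairs. This is a minimal illustration with in-batch negatives only; the temperature value and batch handling are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # q, p: (batch, dim) L2-normalized query and passage embeddings.
    # The positive for query i is passage i; every other passage in the
    # batch (including any hard negatives mixed in) acts as a negative.
    logits = q @ p.T / temperature     # (batch, batch) scaled cosine similarities
    targets = torch.arange(q.size(0))  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy example with random normalized embeddings
q = F.normalize(torch.randn(8, 768), dim=1)
p = F.normalize(torch.randn(8, 768), dim=1)
loss = contrastive_loss(q, p)
```

Hard negatives are typically appended to the passage side of the batch so they enter the denominator of the softmax.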
+ ## Support for Sentence Transformers
+
+ Below is an example of usage with sentence_transformers.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer('BalyasnyAI/multilingual-e5-base')
+ input_texts = [
+     "query: What is a callback provision?",
+     "query: EverCommerce revenue headwinds",
+     "passage: Beazley PLC/ADR - But they're saying, do you confirm prior to issuing an invoice that this is the correct, or prior to paying an invoice that this is the correct...",
+     "passage: EverCommerce Inc\nWe are assuming coverage of EverCommerce, which is among the leading SaaS platforms in the services sector for SMBs..."
+ ]
+ embeddings = model.encode(input_texts, normalize_embeddings=True)
+ ```
+
+ Package requirements:
+
+ `pip install sentence_transformers~=2.2.2`
+
+ ## TIPS FOR BEST PERFORMANCE
+
+ **1. Always add the correct text prefix, either "query: " or "passage: ", to input texts**
+
+ This is how the model was trained; omitting the prefix degrades performance.
+
+ Here are some rules of thumb:
+ - Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval.
+ - Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
+ - Use the "query: " prefix if you want to use embeddings as features, e.g. for linear-probing classification or clustering.
+
+ **2. Add Context to Passages**
+
+ When a document is split into individual text passages for embedding, these passages are frequently missing crucial information such as the title of the document, or the name and ticker of the company it relates to. To overcome this problem, BAM embeddings are trained to work well with *one line of document context added to the beginning of each text passage* (followed by a newline).
+
+ It's up to you what document context to use. We have had success with combinations of the document title, author name and bio, company name, ticker, event, and date, depending on the application, e.g. "Google GOOG FY23 earnings call\n". Only one line of document context is needed.
+
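As a concrete sketch, a contextualized passage can be assembled like this (the helper function and field values below are illustrative, not part of the model's API):

```python
def build_passage(context_line: str, chunk: str) -> str:
    # One line of document context, a newline, then the passage text.
    # The required "passage: " retrieval prefix goes at the very front.
    return f"passage: {context_line}\n{chunk}"

chunk = "Revenue grew 12% year over year, driven by strength in ad sales."
passage = build_passage("Google GOOG FY23 earnings call", chunk)
```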
+ **3. Keep passages <= 512 tokens**
+
+ Long texts will be truncated to at most 512 tokens.
+
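To avoid silent truncation, split long documents before embedding. A minimal sketch using a whitespace split as a stand-in token count (in practice, measure length with the model's `AutoTokenizer`, since subword tokenizers usually produce more tokens per word):

```python
MAX_TOKENS = 512

def split_to_chunks(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    # Whitespace split approximates the token count; leave headroom
    # for the "passage: " prefix and subword expansion in practice.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

chunks = split_to_chunks("word " * 1200)  # 1200 words -> chunks of 512, 512, 176
```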
+ ## Citation
+
+ If you find our paper or models helpful, please consider citing as follows:
+
+ ```
+ @inproceedings{anderson-etal-2024-greenback,
+     title = "Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt",
+     author = "Anderson, Peter and Janardhanan, Mano Vikash and He, Jason and Cheng, Wei and Flanagan, Charlie",
+     booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
+     year = "2024",
+ }
+ ```
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "/local/peanderson/mixed-models/e5-base-v2",
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.39.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.39.0",
+     "pytorch": "2.1.2+cu121"
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60caa896327382bca0fe783e1f13d933200f8ec2c6454585facc7010d91689ae
+ size 1112197096
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f59925fcb90c92b894cb93e51bb9b4a6105c5c249fe54ce1c704420ac39b81af
+ size 17082756
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }