aminhaeri committed
Commit b37cea4 · verified · 1 Parent(s): 25e442f

Update README.md

Files changed (1):
  1. README.md +116 -92
README.md CHANGED
@@ -1,92 +1,116 @@
- ---
- library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
-
- ---
-
- # {MODEL_NAME}
-
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->
-
- ## Usage (Sentence-Transformers)
-
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
- ```
- pip install -U sentence-transformers
- ```
-
- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
-
- model = SentenceTransformer('{MODEL_NAME}')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
-
-
- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
-
-
- ## Training
- The model was trained with the parameters:
-
- **DataLoader**:
-
- `torch.utils.data.dataloader.DataLoader` of length 652 with parameters:
- ```
- {'batch_size': 12, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```
-
- **Loss**:
-
- `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
- ```
- {'scale': 20.0, 'similarity_fct': 'cos_sim'}
- ```
-
- Parameters of the fit()-Method:
- ```
- {
-     "epochs": 2,
-     "evaluation_steps": 100,
-     "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 130,
-     "weight_decay": 0.01
- }
- ```
-
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->
+ ---
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+
+ ---
+
+ # RiskEmbed
+
+ RiskEmbed is a finetuned Snowflake embedding model (arctic-embed-m) optimized for financial risk-related retrieval tasks.
+
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space.
+
+ ## Model
+
+ Our finetuned embedding model achieves state-of-the-art performance (88% HR@5), matching or surpassing leading closed-source models.
+ In particular, it outperforms Google Text-Embedding-004 (84%), Cohere Embed-English-v3.0 (85%), OpenAI Text-Embedding-3-Large (86%), and MistralAI Mistral-Embed (87%), none of which was finetuned on domain-specific data.
+ This result highlights the advantage of finetuning on risk management data: our model surpasses general-purpose embeddings in retrieval effectiveness.
+ Furthermore, despite having the smallest embedding size (768 dimensions, equal to Google's model but significantly smaller than OpenAI's 3072 dimensions), our model efficiently encodes domain-specific information without requiring a larger vector space.
+ Compared to VoyageAI's Voyage-Finance-2, which is also finetuned, albeit on general financial data, our model achieves the same HR@5 (88%).
+ The ability to reach peak performance with a more compact representation (768 vs. 1024 dimensions for VoyageAI) suggests that our model captures risk-related semantics more effectively.
+
+ | Model                         | HR@5 [%] | Relative Improvement [%] | Embedding Size |
+ |-------------------------------|----------|--------------------------|----------------|
+ | Google Text-Embedding-004     | 84       | 5                        | 768            |
+ | Cohere Embed-English-v3.0     | 85       | 4                        | 1024           |
+ | OpenAI Text-Embedding-3-Large | 86       | 2                        | 3072           |
+ | MistralAI Mistral-Embed       | 87       | 1                        | 1024           |
+ | VoyageAI Voyage-Finance-2     | 88       | 0                        | 1024           |
+ | RiskEmbed (ours)              | 88       | -                        | 768            |
+
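+ For reference, HR@5 (hit rate at the top 5) counts a query as a hit when at least one relevant document appears among its five highest-scoring retrieved results. The sketch below is illustrative only, assuming hypothetical `ranked` and `relevant` dictionaries; it is not the evaluation harness used for the table above.
+
+ ```python
+ # ranked[q]: doc ids sorted by descending similarity for query q
+ # relevant[q]: set of gold-relevant doc ids for query q
+ def hit_rate_at_k(ranked, relevant, k=5):
+     hits = sum(1 for q in ranked if any(d in relevant[q] for d in ranked[q][:k]))
+     return hits / len(ranked)
+
+ # Toy usage: the single query counts as a hit because "d7" is in its top 5
+ print(hit_rate_at_k({"q1": ["d3", "d7", "d1"]}, {"q1": {"d7"}}))  # 1.0
+ ```
+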
+ ## Usage
+
+ ### Using Sentence Transformers
+
+ You can run the model with the [sentence-transformers](https://www.SBERT.net) package, as shown below.
+
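+ If you do not already have it, first install the package:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+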
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("aminhaeri/RiskEmbed")
+
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ documents = ['The Data Cloud!', 'Mexico City of Course!']
+
+ query_embeddings = model.encode(queries, prompt_name="query")
+ document_embeddings = model.encode(documents)
+
+ scores = query_embeddings @ document_embeddings.T
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     # Output passages & scores
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
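+
+ Because the model L2-normalizes its output embeddings, the dot products above are already cosine similarities.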
+
+ ### Using Hugging Face Transformers
+
+ You can also run the model with the [transformers](https://huggingface.co/docs/transformers) package, as shown below. For optimal retrieval quality, use the CLS token to embed each text and apply the query prefix below (to queries only).
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('aminhaeri/RiskEmbed')
+ model = AutoModel.from_pretrained('aminhaeri/RiskEmbed', add_pooling_layer=False)
+ model.eval()
+
+ query_prefix = 'Represent this sentence for searching relevant passages: '
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
+ query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ documents = ['The Data Cloud!', 'Mexico City of Course!']
+ document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ # Compute token embeddings and keep the CLS token (first position)
+ with torch.no_grad():
+     query_embeddings = model(**query_tokens)[0][:, 0]
+     document_embeddings = model(**document_tokens)[0][:, 0]
+
+ # Normalize embeddings
+ query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
+ document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)
+
+ scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     # Output passages & scores
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
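+
+ This snippet manually reproduces the CLS pooling, query prefix, and L2 normalization that the sentence-transformers pipeline applies for you, so both usage paths should yield the same embeddings and rankings.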
+
+ ## Contact
+
+ Feel free to open an issue or pull request if you have any questions or suggestions about this project.
+ You can also email Amin Haeri (me@aminhaeri.com).
+
+ ## License
+
+ RiskEmbed is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The released models can be used for commercial purposes free of charge.
+
+ ## Acknowledgement
+
+ The authors would like to acknowledge the valuable contributions of the Risk Management team at TD Bank, whose expertise in regulatory frameworks, financial risk assessment, and compliance practices was instrumental in the finetuning of RiskEmbed.