aminhaeri committed
Commit b37cea4 · verified · 1 Parent(s): 25e442f

Update README.md

Files changed (1):
  1. README.md +116 -92
README.md CHANGED
@@ -1,92 +1,116 @@
- ---
- library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
-
- ---
-
- # {MODEL_NAME}
-
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-
- <!--- Describe your model here -->
-
- ## Usage (Sentence-Transformers)
-
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
- ```
- pip install -U sentence-transformers
- ```
-
- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["This is an example sentence", "Each sentence is converted"]
-
- model = SentenceTransformer('{MODEL_NAME}')
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
-
-
- ## Evaluation Results
-
- <!--- Describe how your model was evaluated -->
-
- For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
-
-
- ## Training
- The model was trained with the parameters:
-
- **DataLoader**:
-
- `torch.utils.data.dataloader.DataLoader` of length 652 with parameters:
- ```
- {'batch_size': 12, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
- ```
-
- **Loss**:
-
- `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
- ```
- {'scale': 20.0, 'similarity_fct': 'cos_sim'}
- ```
-
- Parameters of the fit()-Method:
- ```
- {
-     "epochs": 2,
-     "evaluation_steps": 100,
-     "evaluator": "sentence_transformers.evaluation.InformationRetrievalEvaluator.InformationRetrievalEvaluator",
-     "max_grad_norm": 1,
-     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
-     "optimizer_params": {
-         "lr": 2e-05
-     },
-     "scheduler": "WarmupLinear",
-     "steps_per_epoch": null,
-     "warmup_steps": 130,
-     "weight_decay": 0.01
- }
- ```
-
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```
-
- ## Citing & Authors
-
- <!--- Describe where people can find more information -->
+ ---
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+
+ ---
+
+ # RiskEmbed
+
+ RiskEmbed is a finetuned Snowflake embedding model (arctic-embed-m) optimized for financial risk-related retrieval tasks.
+
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space.
+
+ ## Model
+
+ Our finetuned embedding model achieves state-of-the-art performance (88% HR@5), matching or surpassing leading closed-source models.
+ In particular, it outperforms Google Text-Embedding-004 (84%), Cohere Embed-English-v3.0 (85%), OpenAI Text-Embedding-3-Large (86%), and MistralAI Mistral-Embed (87%), none of which was finetuned on domain-specific data.
+ This result highlights the advantage of finetuning on risk management data: our model surpasses general-purpose embeddings in retrieval effectiveness.
+ Furthermore, despite having the smallest embedding size (768 dimensions, equal to Google's model but significantly smaller than OpenAI's 3072 dimensions), our model efficiently encodes domain-specific information without requiring a larger vector space.
+ Compared to VoyageAI's Voyage-Finance-2, which is also finetuned, albeit on general financial data, our model achieves the same HR@5 (88%).
+ The ability to reach peak performance with a more compact representation (768 vs. 1024 dimensions for VoyageAI) suggests that our model captures risk-related semantics more effectively.
+
+ | Model                         | HR@5 [%] | Relative Improvement [%] | Embedding Size |
+ |-------------------------------|----------|--------------------------|----------------|
+ | Google Text-Embedding-004     | 84       | 5                        | 768            |
+ | Cohere Embed-English-v3.0     | 85       | 4                        | 1024           |
+ | OpenAI Text-Embedding-3-Large | 86       | 2                        | 3072           |
+ | MistralAI Mistral-Embed       | 87       | 1                        | 1024           |
+ | VoyageAI Voyage-Finance-2     | 88       | 0                        | 1024           |
+ | RiskEmbed (ours)              | 88       | -                        | 768            |
+
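+ For reference, HR@5 (hit rate at the top 5) counts a query as a hit when at least one relevant document appears among its five highest-scoring retrieved results. The sketch below is illustrative only, assuming hypothetical `ranked` and `relevant` dictionaries; it is not the evaluation harness used for the table above.
+
+ ```python
+ # ranked[q]: doc ids sorted by descending similarity for query q
+ # relevant[q]: set of gold-relevant doc ids for query q
+ def hit_rate_at_k(ranked, relevant, k=5):
+     hits = sum(1 for q in ranked if any(d in relevant[q] for d in ranked[q][:k]))
+     return hits / len(ranked)
+
+ # Toy usage: the single query counts as a hit because "d7" is in its top 5
+ print(hit_rate_at_k({"q1": ["d3", "d7", "d1"]}, {"q1": {"d7"}}))  # 1.0
+ ```
+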
+ ## Usage
+
+ ### Using Sentence Transformers
+
+ You can run the model with the [sentence-transformers](https://www.SBERT.net) package, as shown below.
+
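+ If you do not already have it, first install the package:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+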
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("aminhaeri/RiskEmbed")
+
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ documents = ['The Data Cloud!', 'Mexico City of Course!']
+
+ query_embeddings = model.encode(queries, prompt_name="query")
+ document_embeddings = model.encode(documents)
+
+ scores = query_embeddings @ document_embeddings.T
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     # Output passages & scores
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
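+
+ Because the model L2-normalizes its output embeddings, the dot products above are already cosine similarities.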
+
+ ### Using Hugging Face Transformers
+
+ You can also run the model with the [transformers](https://huggingface.co/docs/transformers) package, as shown below. For optimal retrieval quality, use the CLS token to embed each text and apply the query prefix below (to queries only).
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('aminhaeri/RiskEmbed')
+ model = AutoModel.from_pretrained('aminhaeri/RiskEmbed', add_pooling_layer=False)
+ model.eval()
+
+ query_prefix = 'Represent this sentence for searching relevant passages: '
+ queries = ['what is snowflake?', 'Where can I get the best tacos?']
+ queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
+ query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ documents = ['The Data Cloud!', 'Mexico City of Course!']
+ document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ # Compute token embeddings and keep the CLS token (first position)
+ with torch.no_grad():
+     query_embeddings = model(**query_tokens)[0][:, 0]
+     document_embeddings = model(**document_tokens)[0][:, 0]
+
+ # Normalize embeddings
+ query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
+ document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)
+
+ scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
+ for query, query_scores in zip(queries, scores):
+     doc_score_pairs = list(zip(documents, query_scores))
+     doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+     # Output passages & scores
+     print("Query:", query)
+     for document, score in doc_score_pairs:
+         print(score, document)
+ ```
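+
+ This snippet manually reproduces the CLS pooling, query prefix, and L2 normalization that the sentence-transformers pipeline applies for you, so both usage paths should yield the same embeddings and rankings.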
+
+ ## Contact
+
+ Feel free to open an issue or pull request if you have any questions or suggestions about this project.
+ You can also email Amin Haeri (me@aminhaeri.com).
+
+ ## License
+
+ RiskEmbed is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The released models can be used for commercial purposes free of charge.
+
+ ## Acknowledgement
+
+ The authors would like to acknowledge the valuable contributions of the Risk Management team at TD Bank, whose expertise in regulatory frameworks, financial risk assessment, and compliance practices was instrumental in the finetuning of RiskEmbed.