@@ -1,92 +1,183 @@
 ---
 library_name: sentence-transformers
-pipeline_tag: sentence-similarity
-tags:
-- sentence-transformers
-- feature-extraction
-- sentence-similarity
-- transformers
 ---
 # WhereIsAI/UAE-Code-Large-V1
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.
-<!--- Describe your model here -->
-## Usage (Sentence-Transformers)
-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-```
-pip install -U sentence-transformers
-```
-Then you can use the model like this:
-```python
-from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]
-model = SentenceTransformer('WhereIsAI/UAE-Code-Large-V1')
-embeddings = model.encode(sentences)
-print(embeddings)
-```
-## Usage (HuggingFace Transformers)
-Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
 ```python
-from transformers import AutoTokenizer, AutoModel
-import torch
-def cls_pooling(model_output, attention_mask):
-    return model_output[0][:,0]
-# Sentences we want sentence embeddings for
-sentences = ['This is an example sentence', 'Each sentence is converted']
-# Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('WhereIsAI/UAE-Code-Large-V1')
-model = AutoModel.from_pretrained('WhereIsAI/UAE-Code-Large-V1')
-# Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-# Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)
-# Perform pooling. In this case, cls pooling.
-sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
-print("Sentence embeddings:")
-print(sentence_embeddings)
 ```
-## Evaluation Results
-<!--- Describe how your model was evaluated -->
-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=WhereIsAI/UAE-Code-Large-V1)
-## Full Model Architecture
 ```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-)
 ```
-## Citing & Authors
-<!--- Describe where people can find more information -->

 ---
+license: mit
+datasets:
+- WhereIsAI/github-issue-similarity
+language:
+- en
 library_name: sentence-transformers
+pipeline_tag: feature-extraction
 ---
 # WhereIsAI/UAE-Code-Large-V1
+This model was trained on the [GIS: Github Issue Similarity](https://huggingface.co/datasets/WhereIsAI/github-issue-similarity) dataset with the [AnglE](https://github.com/SeanLee97/AnglE) loss ([arXiv:2309.12871](https://arxiv.org/abs/2309.12871)).
+It can be used to measure **code/issue similarity**.
+Results (test set):
+- Spearman correlation: 71.19
+- Accuracy: 84.37
+## Usage
+### 1. angle-emb
+You can use it via `angle-emb` as follows:
+install:
+```
+python -m pip install -U angle-emb
+```
+example:
 ```python
+from scipy import spatial
+from angle_emb import AnglE
+model = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()
+quick_sort = '''# Approach 2: Quicksort using list comprehension
+def quicksort(arr):
+    if len(arr) <= 1:
+        return arr
+    else:
+        pivot = arr[0]
+        left = [x for x in arr[1:] if x < pivot]
+        right = [x for x in arr[1:] if x >= pivot]
+        return quicksort(left) + [pivot] + quicksort(right)
+# Example usage
+arr = [1, 7, 4, 1, 10, 9, -2]
+sorted_arr = quicksort(arr)
+print("Sorted Array in Ascending Order:")
+print(sorted_arr)'''
+bubble_sort = '''def bubblesort(elements):
+    # Looping from size of array from last index[-1] to index [0]
+    for n in range(len(elements)-1, 0, -1):
+        swapped = False
+        for i in range(n):
+            if elements[i] > elements[i + 1]:
+                swapped = True
+                # swapping data if the element is less than next element in the array
+                elements[i], elements[i + 1] = elements[i + 1], elements[i]
+        if not swapped:
+            # exiting the function if we didn't make a single swap
+            # meaning that the array is already sorted.
+            return
+elements = [39, 12, 18, 85, 72, 10, 2, 18]
+print("Unsorted list is,")
+print(elements)
+bubblesort(elements)
+print("Sorted Array is, ")
+print(elements)'''
+vecs = model.encode([
+    'def echo(): print("hello world")',
+    quick_sort,
+    bubble_sort
+])
+print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
+print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
+print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))
 ```
+output:
+```
+cos sim (0, 1): 0.34329649806022644
+cos sim (0, 2): 0.3627094626426697
+cos sim (1, 2): 0.6972219347953796
+```
+### 2. sentence-transformers
+You can also use it via `sentence-transformers`:
+```python
+from scipy import spatial
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer('WhereIsAI/UAE-Code-Large-V1').cuda()
+quick_sort = '''# Approach 2: Quicksort using list comprehension
+def quicksort(arr):
+    if len(arr) <= 1:
+        return arr
+    else:
+        pivot = arr[0]
+        left = [x for x in arr[1:] if x < pivot]
+        right = [x for x in arr[1:] if x >= pivot]
+        return quicksort(left) + [pivot] + quicksort(right)
+# Example usage
+arr = [1, 7, 4, 1, 10, 9, -2]
+sorted_arr = quicksort(arr)
+print("Sorted Array in Ascending Order:")
+print(sorted_arr)'''
+bubble_sort = '''def bubblesort(elements):
+    # Looping from size of array from last index[-1] to index [0]
+    for n in range(len(elements)-1, 0, -1):
+        swapped = False
+        for i in range(n):
+            if elements[i] > elements[i + 1]:
+                swapped = True
+                # swapping data if the element is less than next element in the array
+                elements[i], elements[i + 1] = elements[i + 1], elements[i]
+        if not swapped:
+            # exiting the function if we didn't make a single swap
+            # meaning that the array is already sorted.
+            return
+elements = [39, 12, 18, 85, 72, 10, 2, 18]
+print("Unsorted list is,")
+print(elements)
+bubblesort(elements)
+print("Sorted Array is, ")
+print(elements)'''
+vecs = model.encode([
+    'def echo(): print("hello world")',
+    quick_sort,
+    bubble_sort
+])
+print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
+print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
+print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))
+```
+output:
 ```
+cos sim (0, 1): 0.34329649806022644
+cos sim (0, 2): 0.3627094626426697
+cos sim (1, 2): 0.6972219347953796
 ```
+## Citation
+```bibtex
+@article{li2023angle,
+  title={AnglE-optimized Text Embeddings},
+  author={Li, Xianming and Li, Jing},
+  journal={arXiv preprint arXiv:2309.12871},
+  year={2023}
+}
+```
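
The usage examples in this diff compute each score as `1 - spatial.distance.cosine(a, b)`, which is the cosine similarity of the two embedding vectors. As a rough, dependency-free sketch of that same calculation, the block below computes all pairwise scores at once. The helper names `cosine_similarity` and `pairwise_cosine` and the toy 3-dimensional vectors are illustrative assumptions; they are not part of `angle-emb` or `sentence-transformers`, and the toy vectors only stand in for what `model.encode(...)` would return.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (i.e. 1 - cosine distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pairwise_cosine(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the full matrix of cosine similarities between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical toy vectors standing in for real embeddings.
toy_vecs = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
]

sims = pairwise_cosine(toy_vecs)

# Print only the upper triangle, mirroring the (0, 1), (0, 2), (1, 2) pairs above.
for i in range(len(toy_vecs)):
    for j in range(i + 1, len(toy_vecs)):
        print(f"cos sim ({i}, {j}): {sims[i][j]:.4f}")
```

With real embeddings, `toy_vecs` would be replaced by the array returned from `model.encode([...])`; the helpers iterate over rows, so they should accept a NumPy array as well as nested lists, though a vectorized implementation would be faster for large batches.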