Update README.md
README.md
CHANGED
@@ -5,4 +5,88 @@ language:
tags:
- biology
- medical
---
# Introduction
## Installation 🔥
- We recommend `python 3.9` or higher, `torch 2.0.0` or higher, and `transformers 4.31.0` or higher.

- RagE is not yet on PyPI, so for now it can only be installed from source; a PyPI release is planned. Install it with the following commands:
```
git clone https://github.com/anti-aii/RagE.git
cd RagE
pip install -e .
```
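To check an existing environment against the versions recommended above, a small script along these lines may help. It is a minimal sketch and not part of RagE: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the comparison only looks at the leading numeric part of each version string.

```python
import re
import sys
from importlib import metadata

# Minimum versions quoted in the recommendation above.
MINIMUMS = {"torch": "2.0.0", "transformers": "4.31.0"}
MIN_PYTHON = (3, 9)


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0); any non-numeric suffix is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Map each requirement to True/False, or None if the package is missing."""
    report = {"python": sys.version_info[:2] >= MIN_PYTHON}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Running it before `pip install -e .` shows which of the recommended packages still need to be installed or upgraded.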
## Quick start 🥮
- [1. Initialize the model](#initialize_model)
- [2. Load model from Hugging Face Hub](#download_hf)
- [3. List of pretrained models](#list_pretrained)

We provide detailed instructions for running inference with our models. See the [notebook](notebook).
### 1. Initialize the model
<a name= 'initialize_model'></a>
Let's initialize the SentenceEmbedding model.

```python
>>> import torch
>>> from pyvi import ViTokenizer
>>> from rage import SentenceEmbedding
>>> device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> model= SentenceEmbedding(model_name= "vinai/phobert-base-v2", torch_dtype= torch.float32, aggregation_hidden_states= False, strategy_pooling= "dense_first")
>>> model.to(device)
SentenceEmbeddingConfig(model_base: {'model_type_base': 'RobertaModel', 'model_name': 'vinai/phobert-base-v2', 'type_backbone': 'mlm', 'required_grad_base_model': True, 'aggregation_hidden_states': False, 'concat_embeddings': False, 'dropout': 0.1, 'quantization_config': None}, pooling: {'strategy_pooling': 'dense_first'})
```
Then, we can show the number of parameters in the model.
```python
>>> model.summary_params()
trainable params: 135588864 || all params: 135588864 || trainable%: 100.0
>>> model.summary()
+---------------------------+-------------+------------------+
|       Layer (type)        |   Params    | Trainable params |
+---------------------------+-------------+------------------+
|   model (RobertaModel)    | 134,998,272 |    134998272     |
| pooling (PoolingStrategy) |   590,592   |      590592      |
|      drp1 (Dropout)       |      0      |        0         |
+---------------------------+-------------+------------------+
```
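The two outputs above are consistent with each other: the per-layer counts in the table add up to the total reported by `summary_params()`. The sketch below redoes that arithmetic in plain Python. The `LAYERS` dictionary and the `summarize` helper are illustrative names, not RagE APIs, and how RagE computes these figures internally is not shown here.

```python
# Per-layer counts copied from the summary table above: (params, trainable params).
LAYERS = {
    "model (RobertaModel)": (134_998_272, 134_998_272),
    "pooling (PoolingStrategy)": (590_592, 590_592),
    "drp1 (Dropout)": (0, 0),
}


def summarize(layers):
    """Return (trainable, total, trainable percentage) over all layers."""
    total = sum(params for params, _ in layers.values())
    trainable = sum(train for _, train in layers.values())
    percent = 100 * trainable / total if total else 0.0
    return trainable, total, percent


trainable, total, percent = summarize(LAYERS)
print(f"trainable params: {trainable} || all params: {total} || trainable%: {percent}")
# trainable params: 135588864 || all params: 135588864 || trainable%: 100.0
```

If part of the model were frozen, the trainable count and the percentage would drop while the total stayed the same.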
Now we can use the SentenceEmbedding model to encode input sentences. The sentences are first word-segmented with `ViTokenizer`, and the output of the model is a matrix of shape (batch, dim). We can also load weights that we previously trained and saved, as the first line below does.
```python
>>> model.load("best_sup_general_embedding_phobert2.pt", key= False)
>>> sentences= ["Tôi đang đi học", "Bạn tên là gì?"]
>>> sentences= [ViTokenizer.tokenize(s) for s in sentences]
>>> model.encode(sentences, batch_size= 1, normalize_embedding= "l2", return_tensors= "np", verbose= 1)
2/2 [==============================] - 0s 43ms/Sample
array([[ 0.00281098, -0.00829096, -0.01582766, ...,  0.00878178,
         0.01830498, -0.00459659],
       [ 0.00249859, -0.03076724,  0.00033016, ...,  0.01299141,
        -0.00984358, -0.00703243]], dtype=float32)
```
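The call above passes `normalize_embedding= "l2"`. Assuming this denotes standard L2 normalization, each row of the (batch, dim) matrix is divided by its Euclidean norm so that every embedding has unit length. The following is a pure-Python sketch of that operation on a toy matrix. `l2_normalize` and its `eps` parameter are hypothetical, not the RagE implementation.

```python
import math


def l2_normalize(rows, eps=1e-12):
    """Scale each row to unit Euclidean length; eps guards against all-zero rows."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / max(norm, eps) for x in row])
    return normalized


# Toy (batch=2, dim=2) matrix standing in for real embeddings.
embeddings = [[3.0, 4.0], [1.0, 0.0]]
unit = l2_normalize(embeddings)
print(unit)  # [[0.6, 0.8], [1.0, 0.0]]
```

With unit-length rows, the dot product of two embeddings equals their cosine similarity, which is convenient for retrieval.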
### 2. Load model from Hugging Face Hub
<a name= 'download_hf'> </a>

First, download a pretrained model.
```python
>>> model= SentenceEmbedding.from_pretrained('anti-ai/VieSemantic-base')
```
Then, we encode the input sentences and compare their similarity.
```python
>>> sentences = ["Nó rất thú_vị", "Nó không thú_vị ."]
>>> output= model.encode(sentences, batch_size= 1, return_tensors= 'pt')
2/2 [==============================] - 0s 40ms/Sample
>>> torch.cosine_similarity(output[0].view(1, -1), output[1].view(1, -1)).cpu().tolist()
[0.5605039596557617]
```
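For readers unfamiliar with the metric, cosine similarity is the dot product of two vectors divided by the product of their norms, giving a value between -1 and 1. The sketch below illustrates the formula in plain Python on toy vectors. It is not a substitute for `torch.cosine_similarity`, and the `cosine_similarity` helper and its `eps` guard are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between vectors a and b; eps avoids division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

A score around 0.56, as in the example above, therefore indicates that the two sentences are related but far from identical in meaning.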

### 3. List of pretrained models
<a name= 'list_pretrained'></a>
This list will be updated as we release notable models. Our models primarily target the Vietnamese language.
Additionally, you can access our datasets and pretrained models at https://huggingface.co/anti-ai.

| Model Name | Model Type | #params | Checkpoint |
| - | - | - | - |
| anti-ai/ViEmbedding-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/ViEmbedding-base) |
| anti-ai/BioViEmbedding-base-unsup | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/BioViEmbedding-base-unsup) |
| anti-ai/VieSemantic-base | SentenceEmbedding | 135.5M | [model](https://huggingface.co/anti-ai/VieSemantic-base) |


## Contacts
If you have any questions about this repo, please contact me (nduc0231@gmail.com).