Commit 172a9d5
Parent(s): 229b4c2

add model

Files changed:
- .gitattributes (+0 -5)
- training_assets/Readme.md (+37 -0)
- training_assets/cross_silver_scores_v3.pkl (+0 -3)
- training_assets/gold_eval_dataloader.pkl (+0 -3)
- training_assets/gold_train_dataloader.pkl (+0 -3)
- training_assets/silver_cross_samples.pkl (+0 -3)
- training_assets/silver_data.pkl (+0 -3)
- training_assets/{2_train_sts_cross_bm25.py → silver_database.py} (+1 -1)
- training_assets/train_augmented_bert.ipynb (+0 -0)
.gitattributes
CHANGED

@@ -25,8 +25,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zstandard filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-training_assets/cross_silver_scores_v3.pkl filter=lfs diff=lfs merge=lfs -text
-training_assets/silver_cross_samples.pkl filter=lfs diff=lfs merge=lfs -text
-training_assets/silver_data.pkl filter=lfs diff=lfs merge=lfs -text
-training_assets/gold_eval_dataloader.pkl filter=lfs diff=lfs merge=lfs -text
-training_assets/gold_train_dataloader.pkl filter=lfs diff=lfs merge=lfs -text
training_assets/Readme.md
ADDED

@@ -0,0 +1,37 @@
# Scripts for retraining BERT for the Anatel context

The scripts follow the strategy described in [Augmented SBERT](https://www.sbert.net/examples/training/data_augmentation/README.html) to retrain the BERT model for the Anatel context. The end goal is to retrain the BERT model even with few labeled texts for the target task. Running the scripts requires the cross-encoder model trained as described in the documentation [/Users/enniobastos/Documents/Anatel-local/sei-similaridade/scripts/treino_cross_encoder](TO-DO: fix the path in git).

The script describes how to train the BERT bi-encoder model for Augmented SBERT.

To complete this training in a reasonable time, a Google Colab GPU was used.

## Steps:
### 1- [Build the silver dataset](silver_database.py)

To augment the `gold` dataset, we use the entire remainder of the database of "análise"-type documents that have the `assunto` field. This new dataset is called `silver`. The script queries the local Solr instance to obtain the sentences. The dataset with the sentences is in [silver_database.joblib](TO-DO). For each sentence pair we request a prediction from the cross-encoder model trained for the Anatel context, and the resulting score is the similarity metric.
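The silver step above (sample candidate pairs, then label them with the trained cross-encoder) could be sketched as follows. This is only an illustration: `lexical_overlap` is a crude stand-in for the BM25 retrieval implied by the original script name, the Solr fetch is elided, and only the cross-encoder model id comes from the repository.

```python
def lexical_overlap(a: str, b: str) -> float:
    """Crude token-overlap score, a stand-in for BM25 retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def make_silver_pairs(sentences, top_k=3):
    """Pair each sentence with its top_k lexically closest neighbours.

    These unlabeled pairs are what the cross-encoder later scores to
    produce the silver similarity labels.
    """
    pairs = []
    for i, s in enumerate(sentences):
        scored = sorted(
            ((lexical_overlap(s, t), j) for j, t in enumerate(sentences) if j != i),
            reverse=True,
        )
        pairs.extend((s, sentences[j]) for _, j in scored[:top_k])
    return pairs


if __name__ == "__main__":
    # Hypothetical labeling step; the sentences would come from Solr.
    from sentence_transformers import CrossEncoder

    sentences = ["..."]  # fetched from the local Solr instance
    pairs = make_silver_pairs(sentences)
    model = CrossEncoder("anatel/cross-encoder-pt-anatel-metadados-assunto")
    silver_scores = model.predict(pairs)  # one similarity score per pair
```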
### 2- [Train the BERT bi-encoder](finetuning_bert_biencoder.ipynb)

The script trains the BERT bi-encoder model for the Anatel context. The `gold` and `silver` datasets are concatenated and the BERT bi-encoder model is retrained. The base BERT model used is `Luciano/bert-base-portuguese-cased-finetuned-tcu-acordaos`. The final model is saved on the [huggingface_hub](anatel/bert-augmented-pt-anatel).
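A minimal sketch of this fine-tuning step, assuming the sentence-transformers `fit` API with a cosine-similarity loss (the standard Augmented SBERT recipe). Only the base model id and the hyperparameters are taken from this README; the example data is a placeholder for the concatenated gold + silver pairs.

```python
# Hyperparameters as reported in the README; pure data, safe to inspect.
TRAIN_CONFIG = {
    "base_model": "Luciano/bert-base-portuguese-cased-finetuned-tcu-acordaos",
    "epochs": 3,
    "max_length": 512,
    "train_batch_size": 8,
}


def main():
    # Hedged sketch; loading of the real gold/silver examples is elided.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer(TRAIN_CONFIG["base_model"])
    model.max_seq_length = TRAIN_CONFIG["max_length"]

    # Each example is a sentence pair with a similarity label in [0, 1]
    # (gold labels plus cross-encoder silver scores, concatenated).
    examples = [InputExample(texts=["sentence A", "sentence B"], label=0.8)]
    loader = DataLoader(
        examples, shuffle=True, batch_size=TRAIN_CONFIG["train_batch_size"]
    )
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=TRAIN_CONFIG["epochs"])


if __name__ == "__main__":
    main()
```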
Config:
Total training examples = 646437
Total validation examples = 6530
Epochs = 3
max_length = 512
train_batch_size = 8
Training duration ~ 18h
Metrics = Cosine-Similarity: Pearson: 0.9359, Spearman: 0.8874
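For reference, the Pearson figure reported above measures the linear correlation between the model's cosine similarities and the gold labels on the validation set; a minimal implementation of the metric:

```python
import math


def pearson(xs, ys):
    """Pearson correlation between predicted similarities and labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```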
training_assets/cross_silver_scores_v3.pkl
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:cd9d6a0296f0a1e9589ac8550d6095d9f53985ecd3fc3a8f1e4398426acb84d0
-size 239383791
training_assets/gold_eval_dataloader.pkl
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8901155d353af2a1fad078daafb7eabab5bc6779d69ddb3768a359ac2b50bdad
-size 127396
training_assets/gold_train_dataloader.pkl
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c53ce2f26aab328f08d2db38d11718bb3579048ced91a9acb6a607b95228eaa2
-size 3586422
training_assets/silver_cross_samples.pkl
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:e6a9119d292328000dd27b5f674e0cf86c708d1b9042a9b8911c03a6726c2e50
-size 239072747
training_assets/silver_data.pkl
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:68e0f9acd5aea86b2e45ae6eee6840c1df4f70705b21b1b9a535ecae0580e5fc
-size 303024365
training_assets/{2_train_sts_cross_bm25.py → silver_database.py}
RENAMED

@@ -13,7 +13,7 @@ from solr_query_params import params
 ############################################################################
 
 
-cross_encoder_path = '
+cross_encoder_path = 'anatel/cross-encoder-pt-anatel-metadados-assunto'
 gold_sample_index = set()
 with open('gold_sample_index.txt', 'r') as f:
     for line in f:
training_assets/train_augmented_bert.ipynb
DELETED

The diff for this file is too large to render.