Update README.md
README.md
CHANGED
|
@@ -79,61 +79,3 @@ print(*outputs)
|
|
| 79 |
|
| 80 |
# [C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][=Branch1][C][=O][O][C][C@@H1][Branch2][Ring1][=Branch2][C][O][C][=Branch1][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][O][C][=Branch1][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][Branch1][C][C][C][C]
|
| 81 |
```
|
| 82 |
-
|
| 83 |
-
## Training Details
|
| 84 |
-
|
| 85 |
-
### Training Data
|
| 86 |
-
|
| 87 |
-
The model was trained on the **Neeze/LPM-24-extra-extend** dataset, which contains extended biomedical and scientific text samples in the Life/Pharma/Medical (LPM) domain. The dataset provides high-quality, molecule-centric descriptions used to train a generative mapping from natural language to chemical representations.
|
| 88 |
-
|
| 89 |
-
#### Training Hyperparameters
|
| 90 |
-
|
| 91 |
-
- **Epochs:** 20
|
| 92 |
-
- **Batch size:** 8
|
| 93 |
-
- **Gradient accumulation:** 96
|
| 94 |
-
- **Warmup ratio:** 0.0
|
| 95 |
-
- **Learning rate:** 5e-4
|
| 96 |
-
- **Number of devices:** 4
|
| 97 |
-
- **Precision:** fp32
|
| 98 |
-
- **Gradient clipping:** 1.0
|
| 99 |
-
- **Pretrained backbone:** QizhiPei/biot5-plus-base
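
The hyperparameters above imply an effective batch size per optimizer step. The sketch below works through that arithmetic. It assumes the listed batch size of 8 is per device and that training is data-parallel across the 4 devices; the README states neither, so treat the result as illustrative. The function name `effective_batch_size` is hypothetical.

```python
# Values copied from the hyperparameter list above.
per_device_batch_size = 8          # assumption: "Batch size" is per device
gradient_accumulation_steps = 96
num_devices = 4                    # assumption: plain data parallelism


def effective_batch_size(per_device, accumulation, devices):
    """Number of samples contributing to a single optimizer step."""
    return per_device * accumulation * devices


effective = effective_batch_size(
    per_device_batch_size, gradient_accumulation_steps, num_devices
)
print(effective)  # 3072
```

If the batch size of 8 is instead the global batch size, the same calculation gives 8 × 96 = 768.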
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
### Testing Data, Factors & Metrics
|
| 103 |
-
|
| 104 |
-
#### Testing Data
|
| 105 |
-
|
| 106 |
-
The evaluation was conducted on the **Neeze/LPM-24-eval-extend** dataset, which contains extended test samples for biomedical and scientific language tasks. It is designed to assess model performance on text generation and structured prediction within the Life/Pharma/Medical (LPM) domain.
|
| 107 |
-
|
| 108 |
-
#### Metrics
|
| 109 |
-
|
| 110 |
-
The following evaluation metrics were used to assess model performance:
|
| 111 |
-
|
| 112 |
-
- **BLEU** — Measures n-gram overlap between generated and reference text, commonly used in machine translation and generation quality.
|
| 113 |
-
- **ROUGE (1/2/L)** — Measures recall-oriented overlap of unigrams, bigrams, and longest common subsequence between generated and reference sequences.
|
| 114 |
-
- **METEOR** — Considers semantic similarity via stemming and synonymy, providing better sensitivity to meaning preservation than BLEU.
|
| 115 |
-
- **Exact Match (EM)** — Strict score representing the proportion of predictions that exactly match reference sequences (useful for structured output).
|
| 116 |
-
- **Levenshtein Distance** — Edit distance score showing how many insertions/deletions/substitutions are needed to transform the model output into the reference. Lower values indicate higher similarity.
|
| 117 |
-
- **Validity** — Measures the proportion of outputs that are valid under a predefined criterion (for example, valid syntax, valid structure, or domain constraints). A high value indicates that the model rarely produces malformed output.
|
| 118 |
-
- **MACCS Fingerprint Similarity** — Compares the similarity between molecules based on the MACCS fingerprint (166 bits), commonly used in medicinal chemistry to assess structural similarity.
|
| 119 |
-
- **RDK (RDKit) Fingerprint Similarity** — Assesses chemical structural similarity based on RDKit fingerprints. Helps determine how close two molecules are in their chemical features.
|
| 120 |
-
- **Morgan Fingerprint Similarity** — Based on extended-connectivity circular fingerprints (ECFP), measures the similarity between two molecules according to the local features around each atom. Commonly used in molecular tasks and QSAR/QSPR.
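
Three of the metrics above are simple enough to sketch in plain Python. This is a minimal illustration, not the evaluation code used for this model. The function names are hypothetical. Scoring fingerprint similarity with the Tanimoto coefficient is an assumption, since the README does not name the similarity function. Fingerprints are represented here as sets of on-bit indices rather than RDKit objects.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def exact_match(predictions, references):
    """Fraction of predictions that are identical to their reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(bits_a & bits_b) / union if union else 1.0


print(levenshtein("[C][O]", "[C][N]"))           # 1
print(exact_match(["[C]", "[O]"], ["[C]", "[N]"]))  # 0.5
print(tanimoto({1, 2, 3}, {2, 3, 4}))            # 0.5
```

In practice the fingerprint sets would come from a cheminformatics toolkit such as RDKit, and the edit distance would typically be computed on the decoded molecule strings.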
|
| 121 |
-
|
| 122 |
-
### Results
|
| 123 |
-
|
| 124 |
-
| Metric | Score (%) |
|
| 125 |
-
| ----------- | --------- |
|
| 126 |
-
| BLEU | 69.77 |
|
| 127 |
-
| BLEU-2 | 73.66 |
|
| 128 |
-
| BLEU-4 | 69.77 |
|
| 129 |
-
| ROUGE-1 | 76.10 |
|
| 130 |
-
| ROUGE-2 | 70.46 |
|
| 131 |
-
| ROUGE-L | 69.39 |
|
| 132 |
-
| METEOR | 75.05 |
|
| 133 |
-
| Exact Match | 0.01 |
|
| 134 |
-
| Levenshtein | 31.28 |
|
| 135 |
-
| Validity | 99.98 |
|
| 136 |
-
| MACCS FTs | 70.95 |
|
| 137 |
-
| RDK FTs | 63.51 |
|
| 138 |
-
| Morgan FTs | 45.84 |
|
| 139 |
-
|
|
|
|
| 79 |
|
| 80 |
# [C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][=Branch1][C][=O][O][C][C@@H1][Branch2][Ring1][=Branch2][C][O][C][=Branch1][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][O][C][=Branch1][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][Branch1][C][C][C][C]
|
| 81 |
```
|