OpenFace-CQUPT
committed on
Update README.md
README.md
CHANGED
@@ -21,13 +21,13 @@ We developed a domain-specific large language-vision assistant (PA-LLaVA) for pa
 ### Introduction
 These public datasets contain substantial amounts of data unrelated to human pathology. To obtain the human pathology image-text data, we performed two cleaning processes on the raw data, as illustrated in the following figure: (1) Removing nonpathological images. (2) Removing nonhuman pathology data. Additionally, we excluded image-text pairs with textual descriptions of fewer than 20 words. Ultimately, we obtained 518,413 image-text pairs (named "PCaption-0.5M") for the aligned training dataset.
 
-In the instruction fine-tuning phase, we cleaned only PMC-VQA in the same way and obtained 15,788 question-answer pairs related to human pathology. Lastly, we combined PathVQA with the human pathology data obtained from PMC-VQA, thereby constructing a dataset of 35,543 question-answer pairs
+In the instruction fine-tuning phase, we cleaned only PMC-VQA in the same way and obtained 15,788 question-answer pairs related to human pathology. Lastly, we combined PathVQA with the human pathology data obtained from PMC-VQA, thereby constructing a dataset of 35,543 question-answer pairs data.
 
 #### Data Cleaning Process
 
 
 
-
+## Get the Dataset
 
 ### Step 1 Download the public datasets.
 Here we only provide the download link for the public dataset and expose the image id index of our cleaned dataset on HuggingFace.