---
license: llama3.1
datasets:
- CNX-PathLLM/GTEx-WSI-CloseQA-Balanced
- CNX-PathLLM/GTEx-WSI-OpenQA
- CNX-PathLLM/TCGA-WSI-CloseQA-Balanced
- CNX-PathLLM/TCGA-WSI-OpenQA
- CNX-PathLLM/TCGA-BRCA-Details-CloseQA
- CNX-PathLLM/TCGA-BRCA-Details-OpenQA
- CNX-PathLLM/PathChat_CloseQA_Balanced
- CNX-PathLLM/PathChat_OpenQA
language:
- en
metrics:
- accuracy
- f1
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# ALPaCA: Adapting Llama for Pathology Context Analysis
Welcome to ALPaCA, a multimodal training framework tailored for slide-level question answering in computational pathology. ALPaCA integrates Llama3.1-8B-Instruct as the language backbone and CONCH as the vision encoder.
This repository aims to provide a straightforward reproduction of the ALPaCA framework.
The model trained using this framework is named **Llama-slideQA**.
To run ALPaCA, please first download **Llama3.1-8B-Instruct** as the base model.
For data from TCGA and GTEx, you can visit the [GDC Data Portal Homepage](https://portal.gdc.cancer.gov/) and the [GTEx Portal](https://www.gtexportal.org/) to download the slides and extract patch features yourself with [CONCH](https://huggingface.co/MahmoodLab/CONCH). The data processing code is available at https://github.com/ZeyuGaoAi/SMMILe.
Alternatively, you can use the features we have already extracted with CONCH: `CNX-PathLLM/GTEx-TCGA-Embeddings`, `CNX-PathLLM/GTEx-TCGA-KMeans-Embeddings`, and `CNX-PathLLM/GMM_Embeddings`. After downloading, please unzip them into their respective folders, `TCGA-Embedding` and `GMM_Embedding`.
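As one possible way to fetch the pre-extracted features, the `huggingface-cli download` command can pull a dataset repository into a local directory. The local directory names below are placeholders; adjust them to match the embedding paths you configure in the training scripts.

```
# Sketch: fetch the pre-extracted CONCH features (local dirs are placeholders)
huggingface-cli download CNX-PathLLM/GTEx-TCGA-Embeddings --repo-type dataset --local-dir ./TCGA-Embedding
huggingface-cli download CNX-PathLLM/GMM_Embeddings --repo-type dataset --local-dir ./GMM_Embedding
# If the repositories contain zip archives, unzip them in place:
unzip './TCGA-Embedding/*.zip' -d ./TCGA-Embedding
```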
Please ensure you have access to all the datasets.
After completing all the setups mentioned above and setting up the correct Python environment, you can start the training process using the provided shell script, e.g., `run_wsi_stage*.sh`, or follow the instructions in the [Train Step](#train-step-1) section below.
Do not forget to adjust the TCGA and GMM embedding paths to reflect your own file locations.
## Settings
### Different Aggregate Strategies
You can change the aggregation strategy using the `--agg_strategy` flag; supported values include `sample`, `kmeans`, `gmm`, `abmil`, `qformer`, and `longnet`. You can also reproduce the `hybrid` method described in our paper by setting `--agg_strategy gmm,longnet` in the `.sh` script.
### Configurable Settings
```
--vision_adaptor False (vision-query-question interaction)
--vision_adaptor True (vision-query interaction)
--hierarchical_adaptor False (same adaptor for all levels)
--hierarchical_adaptor True (different adaptors for different levels)
```
## Train Step 1 ##
```
accelerate launch --config_file=./accelerate_configs/deepspeed_zero2.yaml run_wsi.py --learning_rate 1e-4 --num_train_epochs 20 --warmup_steps 1000 \
--gpu 2 --train_batch_size 4 --eval_batch_size 2 --max_seq_length 512 \
--agg_strategy gmm,longnet --embed_dim 512 --vision_adaptor False --hierachical_token True --hierachical_adaptor True \
--n_heads 32,16,8 --llm_requires_grad False --resume_from_checkpoint False \
--llm_name /data_local/pxb/LLM_models/llama3/llama3.1-8b-instruct \
--dataset_name_list CNX-PathLLM/TCGA-WSI-Description-4onew,CNX-PathLLM/TCGA-WSI-Description-4omini,CNX-PathLLM/GTEx-WSI-Description \
--data_cache_dir /data_local/pxb/CNX-PathLLM/.cache \
--fea_root /path/to/CNX-PathLLM/GTEx-TCGA-Embeddings \
--gmm_root /path/to/GMM_Embeddings \
--output_dir path/to/output/of/step1
```
## Train Step 2 ##
```
accelerate launch --config_file=./accelerate_configs/deepspeed_zero2.yaml run_wsi.py --num_train_epochs 5 --warmup_steps 1000 \
--gpu 2 --train_batch_size 8 --eval_batch_size 2 --max_seq_length 256 \
--agg_strategy gmm,longnet --embed_dim 512 --vision_adaptor False --hierachical_token True --hierachical_adaptor True \
--n_heads 32,16,8 --llm_requires_grad True --resume_from_checkpoint False \
--llm_name /data_local/pxb/LLM_models/llama3/llama3.1-8b-instruct \
--dataset_name_list CNX-PathLLM/TCGA-WSI-CloseQA-Balanced,CNX-PathLLM/GTEx-WSI-CloseQA-Balanced,CNX-PathLLM/TCGA-WSI-OpenQA,CNX-PathLLM/GTEx-WSI-OpenQA \
--data_cache_dir /data_local/pxb/CNX-PathLLM/.cache \
--fea_root /path/to/CNX-PathLLM/GTEx-TCGA-Embeddings \
--gmm_root /path/to/GMM_Embeddings \
--output_dir path/to/output/of/step2 \
--ckpt_path path/to/ckpt.bin/of/step1
```
## Train Step 3 ##
You can continue training (`--ckpt_path path/to/ckpt.bin/of/step2`) with the detailed TCGA-BRCA datasets (`CNX-PathLLM/TCGA-BRCA-Details-CloseQA,CNX-PathLLM/TCGA-BRCA-Details-OpenQA`).
You can also continue training (`--ckpt_path path/to/ckpt.bin/of/step2`) with the morphological descriptions generated by [PathChat](https://www.nature.com/articles/s41586-024-07618-3) for TCGA-STAD, TCGA-KIRC, and TCGA-OV using `CNX-PathLLM/PathChat_CloseQA_Balanced,CNX-PathLLM/PathChat_OpenQA`.
Make sure you have access to these datasets, and modify the commands above to use the dataset you want.
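Concretely, a Step 3 run can reuse the Step 2 command with only the dataset list, checkpoint path, and output directory swapped. The sketch below uses the TCGA-BRCA datasets; all paths are placeholders to replace with your own.

```
accelerate launch --config_file=./accelerate_configs/deepspeed_zero2.yaml run_wsi.py --num_train_epochs 5 --warmup_steps 1000 \
--gpu 2 --train_batch_size 8 --eval_batch_size 2 --max_seq_length 256 \
--agg_strategy gmm,longnet --embed_dim 512 --vision_adaptor False --hierachical_token True --hierachical_adaptor True \
--n_heads 32,16,8 --llm_requires_grad True --resume_from_checkpoint False \
--llm_name /data_local/pxb/LLM_models/llama3/llama3.1-8b-instruct \
--dataset_name_list CNX-PathLLM/TCGA-BRCA-Details-CloseQA,CNX-PathLLM/TCGA-BRCA-Details-OpenQA \
--data_cache_dir /data_local/pxb/CNX-PathLLM/.cache \
--fea_root /path/to/CNX-PathLLM/GTEx-TCGA-Embeddings \
--gmm_root /path/to/GMM_Embeddings \
--output_dir path/to/output/of/step3 \
--ckpt_path path/to/ckpt.bin/of/step2
```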
## Checkpoints
- `Llama-slideQA.bin`: trained with general QA following [Train Step 2](#train-step-2).
- `Llama-slideQA-morphology.bin`: trained with detailed morphological QA generated by PathChat following [Train Step 3](#train-step-3).
- `Llama-slideQA-BRCA.bin`: trained with the detailed TCGA-BRCA dataset following [Train Step 3](#train-step-3).
## Test of Step2 General QA ##
```
python test_wsi.py --max_seq_length 128 --batch_size 1 --select_data_num -1 --eval_sample_size -1 --n_heads 32,16,8 --llm_name /data_local/pxb/LLM_models/llama3/llama3.1-8b-instruct --vision_adaptor False --hierachical_token True --hierachical_adaptor True \
--shuffle False --data_cache_dir /data_local/pxb/CNX-PathLLM/.cache \
--dataset_name_list CNX-PathLLM/TCGA-WSI-CloseQA-Balanced,CNX-PathLLM/GTEx-WSI-CloseQA-Balanced,CNX-PathLLM/TCGA-WSI-OpenQA,CNX-PathLLM/GTEx-WSI-OpenQA \
--agg_strategy gmm,longnet --embed_dim 512 \
--fea_root /path/to/CNX-PathLLM/GTEx-TCGA-Embeddings \
--gmm_root /path/to/GMM_Embeddings \
--ckpt_path path/to/ckpt.bin/of/step2 \
--results_save_path /path/to/the/output.csv \
--use_peft False
```
## Test of Step3 Specific QA ##
```
# For the PathChat morphology checkpoint, replace the dataset list below with
# CNX-PathLLM/PathChat_CloseQA_Balanced,CNX-PathLLM/PathChat_OpenQA
python test_wsi.py --max_seq_length 128 --batch_size 1 --select_data_num -1 --eval_sample_size -1 --n_heads 32,16,8 --llm_name /data_local/pxb/LLM_models/llama3/llama3.1-8b-instruct --vision_adaptor False --hierachical_token True --hierachical_adaptor True \
--shuffle False --data_cache_dir /data_local/pxb/CNX-PathLLM/.cache \
--dataset_name_list CNX-PathLLM/TCGA-BRCA-Details-CloseQA,CNX-PathLLM/TCGA-BRCA-Details-OpenQA \
--agg_strategy gmm,longnet --embed_dim 512 \
--fea_root /path/to/CNX-PathLLM/GTEx-TCGA-Embeddings \
--gmm_root /path/to/GMM_Embeddings \
--ckpt_path path/to/ckpt.bin/of/step3 \
--results_save_path /path/to/the/output.csv \
--use_peft False
```
## Toy test case
For a quick demo, you can use the toy datasets below; there is no need to download the full TCGA & GTEx embeddings.
Embeddings: `CNX-PathLLM/Toy-GTEx-TCGA-Embeddings`, `CNX-PathLLM/Toy_GMM_Embeddings`
Datasets (Slide-QA): `CNX-PathLLM/CloseQA-Toy`, `CNX-PathLLM/OpenQA-Toy`
Follow the same instructions as in [Test of Step2 General QA](#test-of-step2-general-qa), setting
```
--dataset_name_list CNX-PathLLM/CloseQA-Toy,CNX-PathLLM/OpenQA-Toy \
--fea_root /path/to/CNX-PathLLM/Toy-GTEx-TCGA-Embeddings \
--gmm_root /path/to/Toy_GMM_Embeddings \
```
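Putting it together, a toy run of the Step 2 test command might look like the following; the checkpoint, cache, and output paths are placeholders to replace with your own.

```
python test_wsi.py --max_seq_length 128 --batch_size 1 --select_data_num -1 --eval_sample_size -1 --n_heads 32,16,8 \
--llm_name /path/to/llama3.1-8b-instruct --vision_adaptor False --hierachical_token True --hierachical_adaptor True \
--shuffle False --data_cache_dir /path/to/.cache \
--dataset_name_list CNX-PathLLM/CloseQA-Toy,CNX-PathLLM/OpenQA-Toy \
--agg_strategy gmm,longnet --embed_dim 512 \
--fea_root /path/to/CNX-PathLLM/Toy-GTEx-TCGA-Embeddings \
--gmm_root /path/to/Toy_GMM_Embeddings \
--ckpt_path path/to/ckpt.bin/of/step2 \
--results_save_path /path/to/the/output.csv \
--use_peft False
```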
## Disclaimer
This repository and all associated models are intended solely for academic research and non-commercial use. The model involves medical data (e.g., TCGA, GTEx) and pathology-related tasks, but is not approved for clinical diagnosis or medical decision-making.
The developers are not responsible for any misuse of this code or model in medical or commercial contexts.
## License
This model is built on Meta's Llama 3.1 model as part of its architecture and is released under the Llama 3.1 Community License.