## Code Repository for the Paper
### 1. Basic Data Preparation: `1-data`
- `1-get_sample_uniprot_sprot.ipynb`: Sample 10,000 protein sequences from UniProtKB/Swiss-Prot
- `2-get_non_homologous_pairs.ipynb`: Generate non-homologous protein sequence pairs
- `3-get_homology_pairs.ipynb`: Generate homologous protein sequence pairs
- `4-get_distant_homology_pairs.ipynb`: Generate distantly homologous protein sequence pairs
- `mysql_part`: Engineering implementation using MySQL tables to accelerate data processing; includes ready-to-import SQL dump files
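
The sampling step in `1-get_sample_uniprot_sprot.ipynb` can be pictured with the minimal sketch below. It is not the notebook's code: it assumes the input is a plain FASTA file and uses reservoir sampling so the whole file never has to be held in memory. The function names `read_fasta` and `reservoir_sample`, the seed, and the file name in the usage comment are all illustrative assumptions.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Draw up to k records uniformly at random in a single pass."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


# Hypothetical usage (the file name is an assumption):
# with open("uniprot_sprot.fasta") as fh:
#     subset = reservoir_sample(read_fasta(fh), k=10_000)
```

Fixing the seed keeps the sampled subset reproducible across runs, which matters when later notebooks build pairs from it.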
### 2. GPT-2 Fine-tuning and Interpretability Experiments: `2-gpt_ft_test_explain`
- `1-gpt2_ft_en_test_protein_confusion.ipynb`: Fine-tune GPT-2 on the English PAWS-X dataset and evaluate it on protein sequences (with confusion matrix)
- `2-gpt2_test_protein.ipynb`: Directly test pretrained GPT-2 on protein homology tasks (with confusion matrix)
- `3-acc_distribution.ipynb`: Accuracy distribution analysis for both fine-tuned and base models
- `4-explain_***`: Interpretability studies on cross-domain language capability transfer
- `batch_run`: Scripts for batch execution of experiments
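
For readers unfamiliar with the confusion-matrix evaluation mentioned above, here is a minimal sketch for a binary homology task. It assumes labels are encoded as `1` (homologous) and `0` (non-homologous); that encoding and the helper names are assumptions made for illustration, and the notebooks may compute the matrix differently (for example with a library).

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP/FP/TN/FN for binary labels (1 = homologous, 0 = not)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        elif t == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Fraction of pairs classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0
```

Looking at the four counts separately, instead of accuracy alone, shows whether a model is simply predicting one class for every pair.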
### 3. LLaMA-3 Fine-tuning and Evaluation: `3-llama_sft_test`
- `1-llama_sft_**`: Fine-tuning code for LLaMA-3.1 with various quantization strategies
- `2-llama_sft_test.py`: Evaluate fine-tuned models on protein homology classification
- `3-llama**`: Benchmark results using official pretrained and fine-tuned LLaMA models
- `4-*_standard_protein`: Performance of state-of-the-art (SOTA) large models on standard protein homology detection
- `5-*_remote_protein`: Performance of SOTA large models on **distant homology** detection
- `6-qwen3_explain-`: Chain-of-Thought (CoT)-based interpretability analysis
### 4. BioPAWS Dataset Evaluation: `4-biopaws`
- `1-qwen3_dna`: DNA sequence homology classification
- `2-qwen3_dna_protein`: Assess DNA–protein coding relationships
- `3-qwen3_dna_single`: Single DNA sequence classification
- `4-qwen3_protein_single`: Single protein sequence classification
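
As background for the DNA–protein coding task, the sketch below shows one way a reference label could be derived: translate the DNA with the standard genetic code and compare the result with the protein. This is an illustrative assumption, not the repository's labeling procedure; it ignores reverse strands, alternative reading frames, and non-standard codon tables, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(dna: str) -> str:
    """Translate frame 0 of a DNA string, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def encodes(dna: str, protein: str) -> bool:
    """True if frame-0 translation of dna equals the given protein sequence."""
    return translate(dna) == protein.upper()
```

For example, `encodes("ATGGCCTAA", "MA")` holds under these assumptions, since `ATG` and `GCC` translate to `M` and `A` and `TAA` is a stop codon.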
---
> **Note**: Asterisks (`*`) in the names above are wildcards: each pattern covers several files or directories that share the same naming pattern.
---