## Code Repository for the Paper
### 1. Basic Data Preparation: `1-data`
- `1-get_sample_uniprot_sprot.ipynb`: Sample 10,000 protein sequences from UniProtKB/Swiss-Prot
- `2-get_non_homologous_pairs.ipynb`: Generate non-homologous protein sequence pairs
- `3-get_homology_pairs.ipynb`: Generate homologous protein sequence pairs
- `4-get_distant_homology_pairs.ipynb`: Generate distantly homologous protein sequence pairs
- `mysql_part`: MySQL-based implementation that speeds up data processing; includes ready-to-import SQL dump files
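
To illustrate the sampling step, here is a minimal sketch of drawing a fixed-size random sample from a FASTA file in one pass using reservoir sampling. It is not taken from the notebooks, which may sample differently; the function names `read_fasta` and `reservoir_sample` and the seed are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Return k records chosen uniformly at random without loading all records.

    Hypothetical helper: if fewer than k records exist, all are returned.
    """
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

With an open handle to a Swiss-Prot FASTA download, `reservoir_sample(read_fasta(handle), k=10_000)` would return 10,000 records while keeping only the sample in memory.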
### 2. GPT-2 Fine-tuning and Interpretability Experiments: `2-gpt_ft_test_explain`
- `1-gpt2_ft_en_test_protein_confusion.ipynb`: Fine-tune GPT-2 on the English PAWS-X dataset and evaluate it on protein sequences (with confusion matrix)
- `2-gpt2_test_protein.ipynb`: Test pretrained GPT-2 directly on protein homology tasks (with confusion matrix)
- `3-acc_distribution.ipynb`: Accuracy distribution analysis for both fine-tuned and base models
- `4-explain_***`: Interpretability studies on cross-domain language capability transfer
- `batch_run`: Scripts for batch execution of experiments
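
For readers unfamiliar with the confusion matrices mentioned above, the following sketch shows how one could be computed for a binary homologous / non-homologous task. It is an illustration only, not the notebooks' code; the label encoding (1 = homologous, 0 = non-homologous) and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP/FN/FP/TN for binary labels (assumed: 1 = homologous, 0 = not)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = Counter(zip(y_true, y_pred))
    return {
        "tp": counts[(1, 1)],  # homologous, predicted homologous
        "fn": counts[(1, 0)],  # homologous, predicted non-homologous
        "fp": counts[(0, 1)],  # non-homologous, predicted homologous
        "tn": counts[(0, 0)],  # non-homologous, predicted non-homologous
    }


def accuracy(cm: Dict[str, int]) -> float:
    """Fraction of correct predictions; 0.0 when there are no examples."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0
```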
### 3. LLaMA-3 Fine-tuning and Evaluation: `3-llama_sft_test`
- `1-llama_sft_**`: Fine-tuning code for LLaMA-3.1 with various quantization strategies
- `2-llama_sft_test.py`: Evaluate fine-tuned models on protein homology classification
- `3-llama**`: Benchmark results using official pretrained and fine-tuned LLaMA models
- `4-*_standard_protein`: Performance of state-of-the-art (SOTA) large models on standard protein homology detection
- `5-*_remote_protein`: Performance of SOTA large models on **distant homology** detection
- `6-qwen3_explain-`: Chain-of-Thought (CoT)-based interpretability analysis
### 4. BioPAWS Dataset Evaluation: `4-biopaws`
- `1-qwen3_dna`: DNA sequence homology classification
- `2-qwen3_dna_protein`: Assess DNA–protein coding relationships
- `3-qwen3_dna_single`: Single DNA sequence classification
- `4-qwen3_protein_single`: Single protein sequence classification
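
As background for the DNA–protein coding task, the sketch below checks whether a DNA sequence translates to a given protein under the standard genetic code. It is a rough illustration of the relationship being assessed, not how the dataset or the evaluation scripts work; it assumes forward-strand, frame-0 translation, and the names `translate` and `codes_for` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 0, stopping at the first stop codon.

    Codons containing unknown bases map to 'X'; trailing bases that do not
    form a full codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def codes_for(dna: str, protein: str) -> bool:
    """Return True if the DNA translates exactly to the given protein."""
    return translate(dna) == protein.upper()
```

A fuller check would also consider the other reading frames and the reverse complement, which this sketch deliberately leaves out.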
---
> **Note**: Asterisks (`*`) in the names above are wildcards. Some original notebook filenames have been preserved as-is and others generalized into patterns for clarity, so a single entry may cover several files or directories.
---