Code Repository for the Paper

1. Basic Data Preparation: 1-data

  • 1-get_sample_uniprot_sprot.ipynb: Sample 10,000 protein sequences from UniProtKB/Swiss-Prot
  • 2-get_non_homologous_pairs.ipynb: Generate non-homologous protein sequence pairs
  • 3-get_homology_pairs.ipynb: Generate homologous protein sequence pairs
  • 4-get_distant_homology_pairs.ipynb: Generate distantly homologous protein sequence pairs
  • mysql_part: Engineering implementation using MySQL tables to accelerate data processing; includes ready-to-import SQL dump files
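
The sampling step in the first notebook can be illustrated with a minimal sketch. It assumes the Swiss-Prot release is available as FASTA text and uses single-pass reservoir sampling so the full file never has to sit in memory. The function names `read_fasta` and `sample_records` are hypothetical and are not taken from the notebooks, which may sample differently.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_records(records, k, seed=0):
    """Reservoir-sample up to k records in a single pass (hypothetical helper)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Tiny in-memory stand-in for a FASTA file (illustrative data only).
demo = [">P1", "MKV", "LAA", ">P2", "GGS", ">P3", "TTP", ">P4", "PPQ"]
subset = sample_records(read_fasta(demo), k=2, seed=1)
print(len(subset))
```

With a real file, passing an open file handle to `read_fasta` and `k=10000` to `sample_records` would mirror the sample size stated above.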

2. GPT-2 Fine-tuning and Interpretability Experiments: 2-gpt_ft_test_explain

  • 1-gpt2_ft_en_test_protein_confusion.ipynb: Fine-tune GPT-2 on the English PAWS-X dataset and evaluate on protein sequences (with confusion matrix)
  • 2-gpt2_test_protein.ipynb: Directly test pretrained GPT-2 on protein homology tasks (with confusion matrix)
  • 3-acc_distribution.ipynb: Accuracy distribution analysis for both fine-tuned and base models
  • 4-explain_***: Interpretability studies on cross-domain language capability transfer
  • batch_run: Scripts for batch execution of experiments
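
The confusion matrices mentioned for the first two notebooks can be sketched in a few lines of plain Python. The label convention (1 = homologous pair, 0 = non-homologous pair) and the helper names are assumptions made for illustration; the notebooks may use a library routine and a different encoding.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Return a matrix where rows are true labels and columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Illustrative labels only: 1 = homologous, 0 = non-homologous (assumed).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)            # [[2, 1], [1, 2]]
print(accuracy(matrix))  # 4 correct out of 6
```

Collecting `accuracy` values over repeated runs or data subsets would give the kind of accuracy distribution that the third notebook analyzes.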

3. LLaMA-3 Fine-tuning and Evaluation: 3-llama_sft_test

  • 1-llama_sft_**: Fine-tuning code for LLaMA-3.1 with various quantization strategies
  • 2-llama_sft_test.py: Evaluate fine-tuned models on protein homology classification
  • 3-llama**: Benchmark results using official pretrained and fine-tuned LLaMA models
  • 4-*_standard_protein: Performance of state-of-the-art (SOTA) large models on standard protein homology detection
  • 5-*_remote_protein: Performance of SOTA large models on distant homology detection
  • 6-qwen3_explain-: Chain-of-Thought (CoT)-based interpretability analysis

4. BioPAWS Dataset Evaluation: 4-biopaws

  • 1-qwen3_dna: DNA sequence homology classification
  • 2-qwen3_dna_protein: Assessment of DNA–protein coding relationships
  • 3-qwen3_dna_single: Single DNA sequence classification
  • 4-qwen3_protein_single: Single protein sequence classification
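
As a rough illustration of what a DNA–protein coding relationship means, the sketch below translates a DNA sequence with the standard genetic code and compares the result to a protein sequence. This is an assumed ground-truth check, not the repository's method: how `2-qwen3_dna_protein` builds its labels is not stated here, and the names `translate` and `encodes` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases are rendered as 'X'.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def encodes(dna, protein):
    """True if the DNA sequence translates exactly to the given protein."""
    return translate(dna) == protein.upper()


print(translate("ATGGCCTGGTAA"))       # MAW
print(encodes("ATGGCCTGGTAA", "MAW"))  # True
print(encodes("ATGGCCTGGTAA", "MAV"))  # False
```

The sketch reads only the forward strand in frame 0; a fuller check would also consider the other reading frames and the reverse complement.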

Note: An asterisk (*) in a name above is a wildcard that stands for a group of related notebooks or scripts. Original filenames have been kept or generalized in this way without changing their meaning.