--- license: mit base_model: - nferruz/ProtGPT2 pipeline_tag: text-classification library_name: adapter-transformers --- # Healdette: Secure Multi-Ethnic Antibody Sequence Generation Pipeline https://doi.org/10.5281/zenodo.17213886 A secure and flexible computational pipeline for generating and validating antibody sequences with multi-ethnic support. The pipeline integrates ProtGPT2 for sequence generation with BioPython for structural analysis and includes multi-ethnic HLA frequency data for immunogenicity assessment, with optimizations for various population-specific binding motifs. ## Features ### Core Functionality - Antibody sequence generation using ProtGPT2 with template-based constraints - Multi-ethnic binding motif optimization with population-specific parameters - Comprehensive validation and analysis pipeline ### Multi-Interface Support - Modern web interface for easy configuration management - Command-line interface for automation and scripting - Python API for programmatic access ### Security Features - Comprehensive input validation and sanitization - CSRF protection and rate limiting - Secure file operations with integrity checks - Detailed security and audit logging - Automated backup system with validation - Population-specific sequence validation parameters: Celtic: - Aromatic content: 15-27% - Hydrophobic content: 35-45% - Net charge: +5 to +15 Asian: - Aromatic content: 12-25% - Hydrophobic content: 30-40% - Net charge: +3 to +12 Mediterranean: - Aromatic content: 18-30% - Hydrophobic content: 32-42% - Net charge: +4 to +14 - Population-specific immunogenicity assessment using HLA frequency data - Biophysical property analysis using BioPython - Structured output in JSON format with detailed analysis results ## Requirements - Python 3.8 or higher - CUDA-capable GPU (recommended for ProtGPT2) - Required Python packages listed in `requirements.txt` ## Installation 1. Clone the repository: ```bash git clone https://github.com/Raiff1982/healdette.git cd healdette ``` 2. Create and activate a virtual environment: ```bash python -m venv .venv # On Windows: .venv\Scripts\activate # On Unix/MacOS: source .venv/bin/activate ``` 3. Install dependencies: ```bash pip install -r requirements.txt ``` ## Multi-Ethnic Configuration Healdette now supports ancestry-weighted validation for multiple ethnic populations. The system uses: 1. Population-specific binding motifs and parameters 2. Ancestry weights from genetic analysis 3. HLA frequency data for immunogenicity assessment ### Configuration Structure Configuration files follow this structure: ```json { "global_params": { "sequence_length": { "min": 40, "max": 70 }, "structural_params": { "helix_propensity": { "min": 20, "max": 50 }, "sheet_propensity": { "min": 10, "max": 40 } }, "homopolymer_threshold": 4 }, "populations": { "french_german": { "ancestry_weight": 0.298, "binding_motifs": ["WY", "RF", "KH", "YF"], "biophysical_params": { "aromatic_content": { "min": 16, "max": 28 }, "hydrophobic_content": { "min": 33, "max": 43 }, "net_charge": { "min": 4, "max": 13 } }, "hla_frequencies": { "hla_a": {}, "hla_b": {}, "hla_c": {} } } } } ``` ### Ancestry-Weighted Validation The validation system considers: 1. **Ancestry Weights**: Each population's contribution is weighted by ancestry percentage 2. **Blended Parameters**: Biophysical parameters are blended based on ancestry weights 3. **Multiple Binding Motifs**: Scores binding motifs from all relevant populations 4. **HLA Compatibility**: Considers population-specific HLA frequencies ### Population-Specific Parameters Each population can define: - **Binding Motifs**: Amino acid pairs crucial for binding - **Biophysical Parameters**: - Aromatic content ranges - Hydrophobic content ranges - Net charge requirements - **HLA Frequencies**: Population-specific HLA allele distributions ## Usage 1. Create a configuration file following the schema (see `examples/` directory): ```json { "global_params": { "sequence_length": { "min": 40, "max": 70 } }, "populations": { "french_german": { "ancestry_weight": 0.298, "binding_motifs": ["WY", "RF", "KH", "YF"], "biophysical_params": { "aromatic_content": { "min": 16, "max": 28 } } }, "finnish": { "ancestry_weight": 0.057, "binding_motifs": ["WH", "RF", "KY", "FF"], "biophysical_params": { "aromatic_content": { "min": 14, "max": 26 } } } } } ``` 2. Validate sequences using the weighted validator: ```python from modules.weighted_validator import WeightedSequenceValidator from modules.config_validator import ConfigValidator # Load and validate configuration config_validator = ConfigValidator() config = "path/to/config.json" if config_validator.validate_file(config)['valid']: # Create validator with ancestry-weighted parameters validator = WeightedSequenceValidator(sequence, config) # Get detailed validation results results = validator.validate_sequence() # Check population-specific scores pop_scores = results['population_scores'] for pop, score in pop_scores.items(): print(f"{pop}: {score['score']} (weight: {score['weight']})") ``` 3. Run the pipeline with multi-ethnic configuration: ```bash python main.py config.json output.json --num-candidates 15 ``` ### Example Configurations Complete example configurations are available in the `examples/` directory: - `european_populations_config.json`: Configuration for European population clusters - `multi_ethnic_config.json`: General multi-ethnic configuration template - `celtic_test_input.json`: Celtic-specific test configuration ### Understanding Validation Results The weighted validator provides detailed results: ```json { "valid": true, "warnings": [], "metrics": { "aromatic_content": 22.5, "hydrophobic_content": 38.2, "binding_motifs": { "scores": { "french_german": {"score": 0.75, "weighted_score": 0.223}, "finnish": {"score": 0.5, "weighted_score": 0.029} }, "total_score": 0.252 } }, "population_scores": { "french_german": { "score": 0.8, "weight": 0.298 }, "finnish": { "score": 0.6, "weight": 0.057 } } } ], "num_sequences": 10, "global_validation_params": { "min_sequence_length": 40, "max_sequence_length": 70, "allow_homopolymers": false, "structure_requirements": { "helix_propensity": { "min": 0.2, "max": 0.5 }, "sheet_propensity": { "min": 0.1, "max": 0.4 } } } } ``` 2. Run the pipeline: ```bash python main.py --config input_config.json ``` ## Output Files The pipeline generates two types of output files in the `output` directory: 1. Detailed JSON output (`antibody_designs_{timestamp}.json`): - Generated antibody sequences with framework and CDR regions - Celtic binding motif analysis - Biophysical properties (hydrophobicity, charge, stability) - Aromatic content and distribution - Population-specific immunogenicity scores - Validation results against therapeutic antibodies 2. Summary report (`antibody_summary_{timestamp}.txt`): - Key metrics for each generated sequence - Celtic motif occurrence statistics - Population coverage statistics - Validation summary ## Reproducibility To reproduce the results: 1. Use the same random seed for ProtGPT2: ```python import torch torch.manual_seed(42) ``` 2. Ensure consistent data sources: - HLA frequency data: NetMHCpan 4.1 database - Therapeutic antibody dataset: THERAb database v2.0 - Framework templates: IMGT database - Celtic binding motif templates: Custom database 3. Run validation tests: ```bash python -m unittest discover tests ``` ## License MIT License. See LICENSE file for details. ## Citation If you use this software in your research, please cite: ```bibtex @software{healdette2025, title = {Healdette: Celtic-Optimized Antibody Generation Pipeline}, author = {Raiff, et al.}, year = {2025}, version = {1.0.0}, url = {https://github.com/Raiff1982/healdette} } ``` Harrison, J. (2025). Healdette: A Population-Aware Antibody Design Pipeline. GitHub repository: https://github.com/Raiff1982/healdette ``` ## Author Jonathan Harrison (Raiff1982)