	---
	license: mit
	base_model:
	- nferruz/ProtGPT2
	pipeline_tag: text-classification
	library_name: adapter-transformers
	---

	# Healdette: Secure Multi-Ethnic Antibody Sequence Generation Pipeline

	https://doi.org/10.5281/zenodo.17213886

	Healdette is a secure, flexible computational pipeline for generating and validating antibody sequences across multiple ethnic populations. It uses ProtGPT2 for sequence generation and Biopython for structural analysis. For immunogenicity assessment it draws on multi-ethnic HLA frequency data, and it tunes generation toward population-specific binding motifs.

	## Features

	### Core Functionality
	- Antibody sequence generation using ProtGPT2 with template-based constraints
	- Multi-ethnic binding motif optimization with population-specific parameters
	- Comprehensive validation and analysis pipeline

	### Multi-Interface Support
	- Modern web interface for easy configuration management
	- Command-line interface for automation and scripting
	- Python API for programmatic access

	### Security Features
	- Comprehensive input validation and sanitization
	- CSRF protection and rate limiting
	- Secure file operations with integrity checks
	- Detailed security and audit logging
	- Automated backup system with validation

	### Validation and Analysis
	- Population-specific sequence validation parameters:
	  - Celtic:
	    - Aromatic content: 15-27%
	    - Hydrophobic content: 35-45%
	    - Net charge: +5 to +15
	  - Asian:
	    - Aromatic content: 12-25%
	    - Hydrophobic content: 30-40%
	    - Net charge: +3 to +12
	  - Mediterranean:
	    - Aromatic content: 18-30%
	    - Hydrophobic content: 32-42%
	    - Net charge: +4 to +14
	- Population-specific immunogenicity assessment using HLA frequency data
	- Biophysical property analysis using Biopython
	- Structured output in JSON format with detailed analysis results

	## Requirements

	- Python 3.8 or higher
	- CUDA-capable GPU (recommended for ProtGPT2)
	- Required Python packages listed in `requirements.txt`

	## Installation

	1. Clone the repository:
	```bash
	git clone https://github.com/Raiff1982/healdette.git
	cd healdette
	```

	2. Create and activate a virtual environment:
	```bash
	python -m venv .venv
	# On Windows:
	.venv\Scripts\activate
	# On Unix/MacOS:
	source .venv/bin/activate
	```

	3. Install dependencies:
	```bash
	pip install -r requirements.txt
	```

	## Multi-Ethnic Configuration

	Healdette now supports ancestry-weighted validation for multiple ethnic populations. The system uses:
	1. Population-specific binding motifs and parameters
	2. Ancestry weights from genetic analysis
	3. HLA frequency data for immunogenicity assessment

	### Configuration Structure

	Configuration files follow this structure:
	```json
	{
	  "global_params": {
	    "sequence_length": {
	      "min": 40,
	      "max": 70
	    },
	    "structural_params": {
	      "helix_propensity": {
	        "min": 20,
	        "max": 50
	      },
	      "sheet_propensity": {
	        "min": 10,
	        "max": 40
	      }
	    },
	    "homopolymer_threshold": 4
	  },
	  "populations": {
	    "french_german": {
	      "ancestry_weight": 0.298,
	      "binding_motifs": ["WY", "RF", "KH", "YF"],
	      "biophysical_params": {
	        "aromatic_content": {
	          "min": 16,
	          "max": 28
	        },
	        "hydrophobic_content": {
	          "min": 33,
	          "max": 43
	        },
	        "net_charge": {
	          "min": 4,
	          "max": 13
	        }
	      },
	      "hla_frequencies": {
	        "hla_a": {},
	        "hla_b": {},
	        "hla_c": {}
	      }
	    }
	  }
	}
	```
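As a rough illustration of how the `global_params` block might be applied, the sketch below checks a sequence against `sequence_length` and `homopolymer_threshold`. It is not Healdette's own implementation: the function names are hypothetical, and it assumes that a run of `homopolymer_threshold` or more identical residues counts as a violation.

```python
import itertools

# Values copied from the "global_params" block in the configuration above.
GLOBAL_PARAMS = {
    "sequence_length": {"min": 40, "max": 70},
    "homopolymer_threshold": 4,
}


def longest_homopolymer_run(sequence: str) -> int:
    """Return the length of the longest run of one repeated residue."""
    return max((len(list(group)) for _, group in itertools.groupby(sequence)), default=0)


def check_global_params(sequence: str, params: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    limits = params["sequence_length"]
    if not limits["min"] <= len(sequence) <= limits["max"]:
        problems.append(f"length {len(sequence)} outside {limits['min']}-{limits['max']}")
    # Assumption: a run at or above the threshold is treated as a violation.
    run = longest_homopolymer_run(sequence)
    if run >= params["homopolymer_threshold"]:
        problems.append(f"homopolymer run of {run} residues")
    return problems
```

Returning a list of problems rather than a single boolean mirrors the `warnings` field that the validator reports, although the real message format may differ.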

	### Ancestry-Weighted Validation

	The validation system considers:
	1. Ancestry Weights: Each population's contribution is weighted by ancestry percentage
	2. Blended Parameters: Biophysical parameters are blended based on ancestry weights
	3. Multiple Binding Motifs: Scores binding motifs from all relevant populations
	4. HLA Compatibility: Considers population-specific HLA frequencies
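One plausible reading of "blended parameters" is an ancestry-weighted mean of each population's range bounds. The sketch below works under that assumption; `blend_range` is a hypothetical helper, not part of the Healdette API, and the renormalisation of weights is a design choice made here so that ancestry weights that do not sum to 1 still give a usable range. The sample values are the `french_german` and `finnish` entries from this README's usage example.

```python
def blend_range(populations: dict, param: str) -> dict:
    """Blend a min/max range across populations, weighting by ancestry."""
    total_weight = sum(pop["ancestry_weight"] for pop in populations.values())
    if total_weight <= 0:
        raise ValueError("ancestry weights must sum to a positive value")
    blended = {}
    for bound in ("min", "max"):
        weighted_sum = sum(
            pop["ancestry_weight"] * pop["biophysical_params"][param][bound]
            for pop in populations.values()
        )
        # Assumption: weights are renormalised by their sum before blending.
        blended[bound] = weighted_sum / total_weight
    return blended


populations = {
    "french_german": {
        "ancestry_weight": 0.298,
        "biophysical_params": {"aromatic_content": {"min": 16, "max": 28}},
    },
    "finnish": {
        "ancestry_weight": 0.057,
        "biophysical_params": {"aromatic_content": {"min": 14, "max": 26}},
    },
}

print(blend_range(populations, "aromatic_content"))
```

Because `french_german` carries the larger weight, the blended range sits closer to 16-28 than to 14-26.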

	### Population-Specific Parameters

	Each population can define:
	- Binding Motifs: Amino acid pairs crucial for binding
	- Biophysical Parameters:
	  - Aromatic content ranges
	  - Hydrophobic content ranges
	  - Net charge requirements
	- HLA Frequencies: Population-specific HLA allele distributions
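To make these parameters concrete, here is a minimal sketch of how the metrics could be computed for a sequence. The residue sets, the simplified charge formula, and the helper names are all assumptions made for illustration; Healdette's own definitions may differ (for example, in which residues count as hydrophobic).

```python
AROMATIC = set("FWY")         # assumption: Phe, Trp, Tyr
HYDROPHOBIC = set("AVILMFW")  # assumption: one common hydrophobic set


def percent_of(sequence: str, residues: set) -> float:
    """Percentage of positions in a non-empty sequence occupied by `residues`."""
    return 100.0 * sum(aa in residues for aa in sequence) / len(sequence)


def net_charge(sequence: str) -> int:
    """Crude net charge: (K + R) minus (D + E); histidine and termini are ignored."""
    return sum(aa in "KR" for aa in sequence) - sum(aa in "DE" for aa in sequence)


def motif_fraction(sequence: str, motifs: list) -> float:
    """Fraction of the given motifs that occur at least once in the sequence."""
    return sum(motif in sequence for motif in motifs) / len(motifs)


def within(value: float, bounds: dict) -> bool:
    """True if `value` lies inside the inclusive min/max range."""
    return bounds["min"] <= value <= bounds["max"]


# Short toy sequence for illustration only (far below the 40-residue minimum).
toy = "WYRFKHAD"
aromatic = percent_of(toy, AROMATIC)
print(aromatic, within(aromatic, {"min": 16, "max": 28}))
print(net_charge(toy), motif_fraction(toy, ["WY", "RF", "KH", "YF"]))
```

For real sequences, Biopython's `Bio.SeqUtils.ProtParam.ProteinAnalysis` offers better-founded measures such as `aromaticity()` and `gravy()`.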

	## Usage

	1. Create a configuration file following the schema (see `examples/` directory):
	```json
	{
	  "global_params": {
	    "sequence_length": {
	      "min": 40,
	      "max": 70
	    }
	  },
	  "populations": {
	    "french_german": {
	      "ancestry_weight": 0.298,
	      "binding_motifs": ["WY", "RF", "KH", "YF"],
	      "biophysical_params": {
	        "aromatic_content": {
	          "min": 16,
	          "max": 28
	        }
	      }
	    },
	    "finnish": {
	      "ancestry_weight": 0.057,
	      "binding_motifs": ["WH", "RF", "KY", "FF"],
	      "biophysical_params": {
	        "aromatic_content": {
	          "min": 14,
	          "max": 26
	        }
	      }
	    }
	  }
	}
	```

	2. Validate sequences using the weighted validator:
	```python
	from modules.weighted_validator import WeightedSequenceValidator
	from modules.config_validator import ConfigValidator

	# Candidate sequence to validate (placeholder: substitute a real sequence)
	sequence = "YOUR_ANTIBODY_SEQUENCE"

	# Load and validate configuration
	config_validator = ConfigValidator()
	config = "path/to/config.json"
	if config_validator.validate_file(config)['valid']:
	    # Create validator with ancestry-weighted parameters
	    validator = WeightedSequenceValidator(sequence, config)

	    # Get detailed validation results
	    results = validator.validate_sequence()

	    # Check population-specific scores
	    pop_scores = results['population_scores']
	    for pop, score in pop_scores.items():
	        print(f"{pop}: {score['score']} (weight: {score['weight']})")
	```

	3. Run the pipeline with multi-ethnic configuration:
	```bash
	python main.py config.json output.json --num-candidates 15
	```

	### Example Configurations

	Complete example configurations are available in the `examples/` directory:
	- `european_populations_config.json`: Configuration for European population clusters
	- `multi_ethnic_config.json`: General multi-ethnic configuration template
	- `celtic_test_input.json`: Celtic-specific test configuration

	### Understanding Validation Results

	The weighted validator provides detailed results:
	```json
	{
	  "valid": true,
	  "warnings": [],
	  "metrics": {
	    "aromatic_content": 22.5,
	    "hydrophobic_content": 38.2,
	    "binding_motifs": {
	      "scores": {
	        "french_german": {"score": 0.75, "weighted_score": 0.223},
	        "finnish": {"score": 0.5, "weighted_score": 0.029}
	      },
	      "total_score": 0.252
	    }
	  },
	  "population_scores": {
	    "french_german": {
	      "score": 0.8,
	      "weight": 0.298
	    },
	    "finnish": {
	      "score": 0.6,
	      "weight": 0.057
	    }
	  }
	}
	```

	Input files may also carry run-level settings such as `num_sequences` and `global_validation_params`:
	```json
	{
	  "num_sequences": 10,
	  "global_validation_params": {
	    "min_sequence_length": 40,
	    "max_sequence_length": 70,
	    "allow_homopolymers": false,
	    "structure_requirements": {
	      "helix_propensity": {
	        "min": 0.2,
	        "max": 0.5
	      },
	      "sheet_propensity": {
	        "min": 0.1,
	        "max": 0.4
	      }
	    }
	  }
	}
	```
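In the sample validation results in this section, each `weighted_score` under `binding_motifs` is consistent with `score` multiplied by the population's ancestry weight, and `total_score` with their sum: 0.75 × 0.298 = 0.2235 and 0.5 × 0.057 = 0.0285, which add up to 0.252. The sketch below reproduces that arithmetic. The formula is inferred from the sample numbers, not taken from the Healdette source, and `weighted_motif_scores` is a hypothetical name.

```python
def weighted_motif_scores(scores: dict, weights: dict) -> dict:
    """Weight each population's motif score by its ancestry weight and sum them."""
    per_population = {
        pop: {"score": score, "weighted_score": score * weights[pop]}
        for pop, score in scores.items()
    }
    total = sum(entry["weighted_score"] for entry in per_population.values())
    return {"scores": per_population, "total_score": total}


result = weighted_motif_scores(
    scores={"french_german": 0.75, "finnish": 0.5},
    weights={"french_german": 0.298, "finnish": 0.057},
)
print(round(result["total_score"], 3))
```

Note that the weights here are used as given, without renormalisation, which is why the total stays well below 1 when the listed populations cover only part of the ancestry.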

	Run the pipeline by passing the configuration file with `--config`:
	```bash
	python main.py --config input_config.json
	```

	## Output Files

	The pipeline generates two types of output files in the `output` directory:

	1. Detailed JSON output (`antibody_designs_{timestamp}.json`):
	   - Generated antibody sequences with framework and CDR regions
	   - Celtic binding motif analysis
	   - Biophysical properties (hydrophobicity, charge, stability)
	   - Aromatic content and distribution
	   - Population-specific immunogenicity scores
	   - Validation results against therapeutic antibodies

	2. Summary report (`antibody_summary_{timestamp}.txt`):
	   - Key metrics for each generated sequence
	   - Celtic motif occurrence statistics
	   - Population coverage statistics
	   - Validation summary
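Because output names include a timestamp, a script that post-processes results needs a way to find the most recent file. The sketch below picks the newest `antibody_designs_*.json` by modification time. The helper name is hypothetical, and since the timestamp format is not documented here, the code deliberately avoids parsing it.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_latest_designs(output_dir: str = "output") -> Optional[Any]:
    """Load the newest antibody_designs_*.json file, or None if there is none."""
    candidates = sorted(
        Path(output_dir).glob("antibody_designs_*.json"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Sorting by modification time rather than by file name keeps the helper independent of how `{timestamp}` is formatted.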

	## Reproducibility

	To reproduce the results:

	1. Use the same random seed for ProtGPT2:
	```python
	import torch
	torch.manual_seed(42)
	```

	2. Ensure consistent data sources:
	   - HLA frequency data: NetMHCpan 4.1 database
	   - Therapeutic antibody dataset: THERAb database v2.0
	   - Framework templates: IMGT database
	   - Celtic binding motif templates: Custom database

	3. Run validation tests:
	```bash
	python -m unittest discover tests
	```
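The seed in step 1 covers PyTorch only. If the pipeline also draws on Python's or NumPy's random generators (an assumption; this README mentions only `torch`), those need seeding too. The hypothetical helper below seeds whichever of these libraries are installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy and PyTorch RNGs where those libraries are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Ignored silently when no CUDA device is available.
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
```

Even with fixed seeds, GPU sampling is not guaranteed to be bit-identical across hardware or library versions, so recording package versions alongside the seed is worthwhile.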

	## License

	MIT License. See LICENSE file for details.

	## Citation

	If you use this software in your research, please cite:
	```bibtex
	@software{healdette2025,
	title = {Healdette: Celtic-Optimized Antibody Generation Pipeline},
	author = {Harrison, Jonathan},
	year = {2025},
	version = {1.0.0},
	url = {https://github.com/Raiff1982/healdette}
	}
	```

	Plain-text citation:
	```
	Harrison, J. (2025). Healdette: A Population-Aware Antibody Design Pipeline.
	GitHub repository: https://github.com/Raiff1982/healdette
	```

	## Author

	Jonathan Harrison (Raiff1982)