---
license: mit
base_model:
- nferruz/ProtGPT2
pipeline_tag: text-classification
library_name: adapter-transformers
---

# Healdette: Secure Multi-Ethnic Antibody Sequence Generation Pipeline

DOI: https://doi.org/10.5281/zenodo.17213886

Healdette is a secure, configurable computational pipeline for generating and validating antibody sequences with multi-ethnic support. It combines ProtGPT2 for sequence generation with BioPython for structural analysis, and uses multi-ethnic HLA frequency data and population-specific binding motifs for immunogenicity assessment.

## Features

### Core Functionality
- Antibody sequence generation using ProtGPT2 with template-based constraints
- Multi-ethnic binding motif optimization with population-specific parameters
- Comprehensive validation and analysis pipeline
- Population-specific sequence validation parameters:
  - Celtic:
    - Aromatic content: 15-27%
    - Hydrophobic content: 35-45%
    - Net charge: +5 to +15
  - Asian:
    - Aromatic content: 12-25%
    - Hydrophobic content: 30-40%
    - Net charge: +3 to +12
  - Mediterranean:
    - Aromatic content: 18-30%
    - Hydrophobic content: 32-42%
    - Net charge: +4 to +14
- Population-specific immunogenicity assessment using HLA frequency data
- Biophysical property analysis using BioPython
- Structured output in JSON format with detailed analysis results

### Multi-Interface Support
- Modern web interface for easy configuration management
- Command-line interface for automation and scripting
- Python API for programmatic access

### Security Features
- Comprehensive input validation and sanitization
- CSRF protection and rate limiting
- Secure file operations with integrity checks
- Detailed security and audit logging
- Automated backup system with validation

## Requirements

- Python 3.8 or higher
- CUDA-capable GPU (recommended for ProtGPT2)
- Required Python packages listed in `requirements.txt`

## Installation

1. Clone the repository:
```bash
git clone https://github.com/Raiff1982/healdette.git
cd healdette
```

2. Create and activate a virtual environment:
```bash
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On Unix/MacOS:
source .venv/bin/activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

## Multi-Ethnic Configuration

Healdette supports ancestry-weighted validation for multiple ethnic populations. The system uses:
1. Population-specific binding motifs and parameters
2. Ancestry weights from genetic analysis
3. HLA frequency data for immunogenicity assessment

### Configuration Structure

Configuration files follow this structure:
```json
{
    "global_params": {
        "sequence_length": {
            "min": 40,
            "max": 70
        },
        "structural_params": {
            "helix_propensity": {
                "min": 20,
                "max": 50
            },
            "sheet_propensity": {
                "min": 10,
                "max": 40
            }
        },
        "homopolymer_threshold": 4
    },
    "populations": {
        "french_german": {
            "ancestry_weight": 0.298,
            "binding_motifs": ["WY", "RF", "KH", "YF"],
            "biophysical_params": {
                "aromatic_content": {
                    "min": 16,
                    "max": 28
                },
                "hydrophobic_content": {
                    "min": 33,
                    "max": 43
                },
                "net_charge": {
                    "min": 4,
                    "max": 13
                }
            },
            "hla_frequencies": {
                "hla_a": {},
                "hla_b": {},
                "hla_c": {}
            }
        }
    }
}
```
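
Before running the pipeline, it can help to sanity-check a configuration file against the structure above. The following is a minimal sketch using only the standard library. It is not the project's `ConfigValidator`; the function name `check_config` and the rules it applies (weights between 0 and 1, `min` not exceeding `max`) are assumptions made for illustration.

```python
import json


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    populations = config.get("populations", {})
    if not populations:
        problems.append("no populations defined")
    for name, pop in populations.items():
        weight = pop.get("ancestry_weight")
        # Assumption: ancestry weights are fractions between 0 and 1.
        if not isinstance(weight, (int, float)) or not 0 <= weight <= 1:
            problems.append(f"{name}: ancestry_weight must be in [0, 1]")
        # Assumption: every biophysical parameter is a {"min", "max"} pair.
        for param, bounds in pop.get("biophysical_params", {}).items():
            if bounds["min"] > bounds["max"]:
                problems.append(f"{name}: {param} has min > max")
    return problems


example = json.loads("""
{
    "populations": {
        "french_german": {
            "ancestry_weight": 0.298,
            "binding_motifs": ["WY", "RF", "KH", "YF"],
            "biophysical_params": {
                "aromatic_content": {"min": 16, "max": 28}
            }
        }
    }
}
""")
print(check_config(example))  # an empty list means no problems were found
```

A check like this only catches structural mistakes; the bundled `ConfigValidator` remains the reference for what the pipeline actually accepts.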

### Ancestry-Weighted Validation

The validation system considers:
1. **Ancestry Weights**: Each population's contribution is weighted by ancestry percentage
2. **Blended Parameters**: Biophysical parameters are blended based on ancestry weights
3. **Multiple Binding Motifs**: Scores binding motifs from all relevant populations
4. **HLA Compatibility**: Considers population-specific HLA frequencies
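
The blending step can be sketched as a weighted average of each population's bounds. This is an illustrative sketch, not the project's implementation: the function name `blend_ranges` is hypothetical, and normalizing the ancestry weights so that they sum to 1 is an assumption about how the blend is computed.

```python
def blend_ranges(populations: dict, param: str) -> dict:
    """Blend the min/max bounds of one biophysical parameter by ancestry weight."""
    # Only populations that define this parameter take part in the blend.
    relevant = {
        name: pop
        for name, pop in populations.items()
        if param in pop.get("biophysical_params", {})
    }
    total_weight = sum(pop["ancestry_weight"] for pop in relevant.values())
    if total_weight <= 0:
        raise ValueError(f"no weighted populations define {param!r}")

    blended = {"min": 0.0, "max": 0.0}
    for pop in relevant.values():
        # Assumption: weights are normalized to sum to 1 before blending.
        share = pop["ancestry_weight"] / total_weight
        bounds = pop["biophysical_params"][param]
        blended["min"] += share * bounds["min"]
        blended["max"] += share * bounds["max"]
    return blended


populations = {
    "french_german": {
        "ancestry_weight": 0.298,
        "biophysical_params": {"aromatic_content": {"min": 16, "max": 28}},
    },
    "finnish": {
        "ancestry_weight": 0.057,
        "biophysical_params": {"aromatic_content": {"min": 14, "max": 26}},
    },
}
blended = blend_ranges(populations, "aromatic_content")
print(blended)
```

With these two populations the blended range sits closer to the `french_german` bounds, because that population carries the larger ancestry weight.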

### Population-Specific Parameters

Each population can define:
- **Binding Motifs**: Amino acid pairs crucial for binding
- **Biophysical Parameters**:
  - Aromatic content ranges
  - Hydrophobic content ranges
  - Net charge requirements
- **HLA Frequencies**: Population-specific HLA allele distributions
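
To make these parameters concrete, the sketch below computes the three biophysical metrics and a simple motif score for a sequence. It is a standalone illustration and does not reproduce the pipeline's BioPython-based analysis. The residue sets, the integer net-charge estimate (histidine treated as neutral), the "fraction of motifs present" score, and the toy sequence are all assumptions.

```python
AROMATIC = set("FWY")          # assumption: histidine is not counted as aromatic
HYDROPHOBIC = set("AVILMFWY")  # assumption: one common hydrophobic residue set
POSITIVE = set("KR")           # assumption: histidine is treated as neutral
NEGATIVE = set("DE")


def sequence_metrics(sequence: str) -> dict:
    """Compute percentage compositions and a crude integer net charge."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    return {
        "aromatic_content": 100.0 * sum(aa in AROMATIC for aa in seq) / n,
        "hydrophobic_content": 100.0 * sum(aa in HYDROPHOBIC for aa in seq) / n,
        "net_charge": sum(aa in POSITIVE for aa in seq)
        - sum(aa in NEGATIVE for aa in seq),
    }


def motif_fraction(sequence: str, motifs: list) -> float:
    """Fraction of the listed motifs that occur at least once in the sequence."""
    seq = sequence.upper()
    return sum(motif in seq for motif in motifs) / len(motifs)


def within(value: float, bounds: dict) -> bool:
    """True if value lies inside the inclusive {"min", "max"} range."""
    return bounds["min"] <= value <= bounds["max"]


toy_sequence = "WYRFKHAAKK"  # hypothetical sequence, far shorter than a real candidate
metrics = sequence_metrics(toy_sequence)
print(metrics)
print(motif_fraction(toy_sequence, ["WY", "RF", "KH", "YF"]))
print(within(metrics["aromatic_content"], {"min": 16, "max": 28}))
```

A real validator would likely use a pH-dependent charge model rather than counting charged residues, so treat the `net_charge` value here as a rough proxy.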

## Usage

1. Create a configuration file following the schema (see `examples/` directory):
```json
{
    "global_params": {
        "sequence_length": {
            "min": 40,
            "max": 70
        }
    },
    "populations": {
        "french_german": {
            "ancestry_weight": 0.298,
            "binding_motifs": ["WY", "RF", "KH", "YF"],
            "biophysical_params": {
                "aromatic_content": {
                    "min": 16,
                    "max": 28
                }
            }
        },
        "finnish": {
            "ancestry_weight": 0.057,
            "binding_motifs": ["WH", "RF", "KY", "FF"],
            "biophysical_params": {
                "aromatic_content": {
                    "min": 14,
                    "max": 26
                }
            }
        }
    }
}
```

2. Validate sequences using the weighted validator:
```python
from modules.weighted_validator import WeightedSequenceValidator
from modules.config_validator import ConfigValidator

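# Path to the configuration file created in step 1
config = "path/to/config.json"
# Candidate antibody sequence to validate (placeholder; substitute your own)
sequence = "YOUR_SEQUENCE_HERE"

# Load and validate the configuration before using it
config_validator = ConfigValidator()
if config_validator.validate_file(config)['valid']:
    # Create validator with ancestry-weighted parameters
    validator = WeightedSequenceValidator(sequence, config)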
    
    # Get detailed validation results
    results = validator.validate_sequence()
    
    # Check population-specific scores
    pop_scores = results['population_scores']
    for pop, score in pop_scores.items():
        print(f"{pop}: {score['score']} (weight: {score['weight']})")
```

3. Run the pipeline with multi-ethnic configuration:
```bash
python main.py config.json output.json --num-candidates 15
```

### Example Configurations

Complete example configurations are available in the `examples/` directory:
- `european_populations_config.json`: Configuration for European population clusters
- `multi_ethnic_config.json`: General multi-ethnic configuration template
- `celtic_test_input.json`: Celtic-specific test configuration

### Understanding Validation Results

The weighted validator provides detailed results:
```json
{
    "valid": true,
    "warnings": [],
    "metrics": {
        "aromatic_content": 22.5,
        "hydrophobic_content": 38.2,
        "binding_motifs": {
            "scores": {
                "french_german": {"score": 0.75, "weighted_score": 0.223},
                "finnish": {"score": 0.5, "weighted_score": 0.029}
            },
            "total_score": 0.252
        }
    },
    "population_scores": {
        "french_german": {
            "score": 0.8,
            "weight": 0.298
        },
        "finnish": {
            "score": 0.6,
            "weight": 0.057
        }
    }
}
```
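
The `weighted_score` values in this example are consistent with multiplying each population's motif `score` by its ancestry `weight`, and `total_score` with summing those products. Assuming that relationship holds, the arithmetic can be reproduced as follows; the dictionary layout here is a simplification for illustration, not the validator's internal data structure.

```python
# Motif scores and ancestry weights copied from the example result above.
motif_scores = {
    "french_german": {"score": 0.75, "weight": 0.298},
    "finnish": {"score": 0.5, "weight": 0.057},
}

# Assumption: weighted_score = score * ancestry weight.
weighted = {
    name: entry["score"] * entry["weight"]
    for name, entry in motif_scores.items()
}
# Assumption: total_score is the sum of the weighted scores.
total_score = sum(weighted.values())

for name, value in weighted.items():
    print(f"{name}: {value:.4f}")
print(f"total_score: {total_score:.3f}")
```

Because the ancestry weights in this example sum to less than 1, the total is bounded by that sum rather than by 1.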

## Output Files

The pipeline generates two types of output files in the `output` directory:

1. Detailed JSON output (`antibody_designs_{timestamp}.json`):
   - Generated antibody sequences with framework and CDR regions
   - Celtic binding motif analysis
   - Biophysical properties (hydrophobicity, charge, stability)
   - Aromatic content and distribution
   - Population-specific immunogenicity scores
   - Validation results against therapeutic antibodies

2. Summary report (`antibody_summary_{timestamp}.txt`):
   - Key metrics for each generated sequence
   - Celtic motif occurrence statistics
   - Population coverage statistics
   - Validation summary
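
For downstream analysis, the most recent detailed output can be loaded with a few lines of Python. This is a minimal sketch: the helper name `load_latest_designs` is hypothetical, and because the timestamp format in the file name is not specified here, the newest file is chosen by modification time rather than by parsing the name. The structure of the loaded JSON is not assumed.

```python
import json
from pathlib import Path


def load_latest_designs(output_dir: str = "output"):
    """Load the most recently modified antibody_designs_*.json file."""
    candidates = sorted(
        Path(output_dir).glob("antibody_designs_*.json"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    if not candidates:
        raise FileNotFoundError(f"no antibody_designs_*.json files in {output_dir!r}")
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `load_latest_designs()` from the repository root after a pipeline run returns the parsed JSON, or raises `FileNotFoundError` if no run has produced output yet.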

## Reproducibility

To reproduce the results:

1. Use the same random seed for ProtGPT2:
```python
import torch
torch.manual_seed(42)
```

2. Ensure consistent data sources:
   - HLA frequency data: NetMHCpan 4.1 database
   - Therapeutic antibody dataset: THERAb database v2.0
   - Framework templates: IMGT database
   - Celtic binding motif templates: Custom database

3. Run validation tests:
```bash
python -m unittest discover tests
```
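
Seeding PyTorch alone (step 1) may not cover every source of randomness in a run. The helper below is a hedged sketch that also seeds Python's `random` module and, when they are installed, NumPy and the CUDA generators. The name `seed_everything` is an assumption and not part of Healdette, and whether the pipeline draws on these other generators is not confirmed here.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the common random number generators, skipping missing libraries."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is optional for this sketch


seed_everything(42)
```

Even with identical seeds, GPU kernels can be nondeterministic, so results may still differ slightly between hardware or library versions.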

## License

MIT License. See LICENSE file for details.

## Citation

If you use this software in your research, please cite:
```bibtex
@software{healdette2025,
  title = {Healdette: Celtic-Optimized Antibody Generation Pipeline},
  author = {Raiff, et al.},
  year = {2025},
  version = {1.0.0},
  url = {https://github.com/Raiff1982/healdette}
}
```

Plain-text citation:

Harrison, J. (2025). Healdette: A Population-Aware Antibody Design Pipeline. GitHub repository: https://github.com/Raiff1982/healdette

## Author

Jonathan Harrison (Raiff1982)