cfDNA-Gen
Conditional Causal Transformer for Cell-Free DNA Sequence Generation
A 132M-parameter transformer model that generates realistic synthetic cell-free DNA (cfDNA) sequences for non-invasive prenatal testing (NIPT) simulation, benchmark development, and genomics research.
Model Description
cfDNA-Gen is trained on real cell-free DNA data and generates synthetic sequences with controllable properties:
- Fragment length: Control the length of generated fragments (typically 50-250 bp)
- GC content: Target specific GC content (typical cfDNA: ~42%)
- Fetal fraction: Simulate different fetal fractions for NIPT applications (0-40%)
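As an illustration of what the GC content condition refers to, the sketch below computes the GC fraction of a sequence and checks it against a target value. The helper names `gc_content` and `within_target` and the default tolerance are assumptions made for this example; they are not part of the `cfdna_gen` API.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G/C among called bases (A, C, G, T), ignoring case."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0  # empty or all-N sequence: no defined GC fraction, report 0.0
    return sum(base in "GC" for base in called) / len(called)


def within_target(seq: str, target_gc: float, tolerance: float = 0.05) -> bool:
    """Check whether a sequence's GC fraction is within `tolerance` of the target.

    The 0.05 default is an illustrative choice, not a value from the model card.
    """
    return abs(gc_content(seq) - target_gc) <= tolerance


example = "ACGTGGCCAATT"  # 6 of 12 bases are G or C
print(gc_content(example))
print(within_target(example, target_gc=0.42, tolerance=0.10))
```

A check like this can be applied to generated sequences to see how closely they follow the requested `target_gc`.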
The model captures realistic patterns found in cfDNA including:
- Bimodal fragment length distribution (fetal ~144 bp, maternal ~167 bp)
- Nucleosome-associated 10 bp periodicity
- Position-specific nucleotide preferences
- Characteristic end motifs
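To make the bimodal length distribution concrete, here is a minimal sketch that samples fragment lengths from a two-component mixture, with the fetal fraction used as the mixing weight. It is a toy model for intuition only and does not describe how cfDNA-Gen works internally. The peak positions come from the list above; the Gaussian shape, the 10 bp spread, and the function name `sample_fragment_lengths` are assumptions.

```python
import random

FETAL_PEAK_BP = 144     # peak position taken from the model card
MATERNAL_PEAK_BP = 167  # peak position taken from the model card
PEAK_SD_BP = 10.0       # assumption: illustrative spread, not from the model card


def sample_fragment_lengths(n, fetal_fraction, seed=0, min_len=50, max_len=250):
    """Draw n fragment lengths from a fetal/maternal Gaussian mixture.

    Each fragment is treated as fetal with probability `fetal_fraction`.
    Draws outside [min_len, max_len] are rejected and resampled.
    """
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        peak = FETAL_PEAK_BP if rng.random() < fetal_fraction else MATERNAL_PEAK_BP
        length = round(rng.gauss(peak, PEAK_SD_BP))
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths


lengths = sample_fragment_lengths(1000, fetal_fraction=0.10)
print(sum(lengths) / len(lengths))  # mean lies between the two peaks
```

Raising `fetal_fraction` shifts more of the mass toward the shorter peak, which is the kind of signal NIPT algorithms look for.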
Architecture
- Parameters: 132M
- Layers: 14 transformer blocks
- Hidden dimension: 768
- Attention heads: 12
- FFN dimension: 3072
- Position encoding: RoPE (Rotary Position Embeddings)
- Activation: SwiGLU
- Normalization: RMSNorm
- Attention: Flash Attention (SDPA)
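As a rough sanity check, the listed dimensions can be turned into a parameter estimate. The sketch below assumes bias-free linear layers, a three-matrix SwiGLU feed-forward block (gate, up, down) with inner width 3072, two RMSNorm weight vectors per block, and a small hypothetical vocabulary of 16 tokens. None of these details are confirmed by the model card, and any conditioning embeddings are ignored.

```python
def estimate_parameters(n_layers=14, d_model=768, d_ffn=3072, vocab_size=16):
    """Estimate the parameter count of a decoder-only transformer.

    Assumptions (not confirmed by the model card): no bias terms,
    SwiGLU with three weight matrices, RoPE adding no learned parameters,
    and a hypothetical vocab_size of 16.
    """
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 3 * d_model * d_ffn           # SwiGLU gate, up, and down projections
    norms = 2 * d_model                 # two RMSNorm weight vectors per block
    per_block = attention + ffn + norms
    embeddings = vocab_size * d_model   # token embedding table
    final_norm = d_model                # RMSNorm before the output head
    return n_layers * per_block + embeddings + final_norm


total = estimate_parameters()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to roughly 132M, in line with the parameter count listed above.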
Usage
from cfdna_gen import CfDNAGenerator
# Load model
generator = CfDNAGenerator.from_pretrained("EabhaSeq/cfdna-gen")
# Generate sequences
sequences = generator.generate(
    n_sequences=100,
    fragment_lengths=165,
    target_gc=0.42,
    target_ff=0.10,
)
for seq in sequences[:5]:
    print(seq)
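Generated sequences will often need to be saved for downstream tools. The sketch below writes a list of sequences to FASTA and summarizes their lengths. It assumes `generate` returns plain Python strings, which the card does not state explicitly, so a hard-coded list stands in for model output here. The helper names `write_fasta` and `length_summary` and the record prefix are hypothetical.

```python
import io
import statistics


def write_fasta(sequences, handle, prefix="synthetic_cfdna"):
    """Write sequences to an open text handle as FASTA records."""
    for i, seq in enumerate(sequences, start=1):
        handle.write(f">{prefix}_{i} length={len(seq)}\n{seq}\n")


def length_summary(sequences):
    """Return basic length statistics for a non-empty list of sequences."""
    lengths = [len(seq) for seq in sequences]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "median": statistics.median(lengths),
    }


# Stand-in for generator output (assumption: sequences are strings).
sequences = ["ACGTACGTAC", "GGCCTTAAGG", "ATATATGCGC"]

buffer = io.StringIO()  # swap for open("synthetic.fasta", "w") to write a file
write_fasta(sequences, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
print(length_summary(sequences))
```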
Installation
pip install "cfdna-gen[hub]"
Validation Results
| Metric | Score |
|---|---|
| Overall Similarity | 92.9% |
| Fragment Length Match | 98.4% |
| GC Content Match | 93.3% |
| Nucleotide Frequency | 99.0% |
| Bimodal Peaks Detection | 100% |
| Nucleosome Periodicity | 100% |
Use Cases
- NIPT Simulation: Generate synthetic samples with known conditions for algorithm development
- Benchmarking: Create standardized test datasets for cfDNA analysis pipelines
- Training Data: Augment real datasets for machine learning applications
- Method Development: Test new analysis methods on controlled synthetic data
Citation
@software{cfdna_gen,
title={cfDNA-Gen: Conditional Causal Transformer for cfDNA Sequence Generation},
author={Redelinghuys, Kyle},
year={2025},
url={https://github.com/eabhaseq/cfdna-gen}
}
License
This model is dual-licensed:
- Free for academic, research, educational, and personal non-commercial use under the PolyForm Noncommercial License 1.0.0.
- Commercial use requires a separate paid license. Contact kyle@eabhaseq.com for details.
See the LICENSE file for full terms.