cfDNA-Gen

Conditional Causal Transformer for Cell-Free DNA Sequence Generation

A 132M-parameter transformer model that generates realistic synthetic cell-free DNA (cfDNA) sequences for non-invasive prenatal testing (NIPT) simulation, benchmark development, and genomics research.

Model Description

cfDNA-Gen is trained on real cell-free DNA data and generates synthetic sequences with controllable properties:

  • Fragment length: Control the length of generated fragments (typically 50-250bp)
  • GC content: Target specific GC content (typical cfDNA: ~42%)
  • Fetal fraction: Simulate different fetal fractions for NIPT applications (0-40%)
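To make the GC content target concrete, here is a minimal sketch of how one might check a sequence against a target GC fraction. The helper names `gc_content` and `within_tolerance` and the 0.05 tolerance are illustrative assumptions and are not part of the cfdna-gen package.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def within_tolerance(seq: str, target_gc: float, tol: float = 0.05) -> bool:
    """True if the GC fraction is within `tol` of the target (tol is an assumed default)."""
    return abs(gc_content(seq) - target_gc) <= tol


example = "ACGTTGCAAT"  # 4 of the 10 bases are G or C
print(gc_content(example))              # 0.4
print(within_tolerance(example, 0.42))  # True
```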

The model captures realistic patterns found in cfDNA including:

  • Bimodal fragment length distribution (fetal ~144bp, maternal ~167bp)
  • Nucleosome-associated 10bp periodicity
  • Position-specific nucleotide preferences
  • Characteristic end motifs
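The bimodal length distribution can be pictured as a two-component mixture weighted by fetal fraction. The sketch below is only an illustration of that shape, not how the model produces lengths: the peak positions come from the list above, while the Gaussian form, the 10 bp standard deviation, and the function name are assumptions.

```python
import random

FETAL_PEAK_BP = 144     # fetal peak, from the list above
MATERNAL_PEAK_BP = 167  # maternal peak, from the list above


def sample_fragment_lengths(n, fetal_fraction, sd=10.0, seed=0,
                            min_len=50, max_len=250):
    """Draw n fragment lengths from a two-Gaussian mixture.

    Each draw picks the fetal peak with probability `fetal_fraction`,
    otherwise the maternal peak. `sd` is an assumed spread; draws
    outside [min_len, max_len] are rejected and redrawn.
    """
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        peak = FETAL_PEAK_BP if rng.random() < fetal_fraction else MATERNAL_PEAK_BP
        length = round(rng.gauss(peak, sd))
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths


lengths = sample_fragment_lengths(1000, fetal_fraction=0.10)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

With a fetal fraction of 0.10, roughly one draw in ten comes from the shorter fetal peak, so the mean sits slightly below the maternal peak.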

Architecture

  • Parameters: 132M
  • Layers: 14 transformer blocks
  • Hidden dimension: 768
  • Attention heads: 12
  • FFN dimension: 3072
  • Position encoding: RoPE (Rotary Position Embeddings)
  • Activation: SwiGLU
  • Normalization: RMSNorm
  • Attention: Flash Attention (SDPA)
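As a sanity check on the parameter count, the figures above can be combined in a back-of-the-envelope estimate. This sketch assumes bias-free linear layers, four attention projections (Q, K, V, output), and a SwiGLU feed-forward block with three weight matrices; it ignores embeddings and normalization weights, so it is an approximation rather than the exact count.

```python
def estimate_block_params(d_model: int, d_ffn: int, n_layers: int) -> int:
    """Approximate weight count of the transformer blocks (no biases, no embeddings)."""
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 3 * d_model * d_ffn     # SwiGLU: gate, up, and down projections
    return n_layers * (attn + ffn)


total = estimate_block_params(d_model=768, d_ffn=3072, n_layers=14)
print(f"{total:,}")  # 132,120,576
```

Under these assumptions the blocks alone account for about 132M weights, which agrees with the parameter count listed above.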

Usage

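from cfdna_gen import CfDNAGenerator

# Load the pretrained model from the Hugging Face Hub
generator = CfDNAGenerator.from_pretrained("EabhaSeq/cfdna-gen")

# Generate sequences with the desired conditioning
sequences = generator.generate(
    n_sequences=100,        # number of fragments to generate
    fragment_lengths=165,   # target fragment length in bp
    target_gc=0.42,         # target GC fraction (42%)
    target_ff=0.10,         # target fetal fraction (10%)
)

# Print the first five generated sequences
for seq in sequences[:5]:
    print(seq)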
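Generated sequences often need to be handed to downstream tools in FASTA format. The sketch below assumes that `generate` returns plain Python strings (the print loop above suggests this, but it is not confirmed here). The function name `format_fasta` and the record naming scheme are hypothetical and not part of the cfdna-gen package.

```python
def format_fasta(sequences, prefix="synthetic_cfdna", width=60):
    """Format an iterable of DNA strings as FASTA text.

    Records are named <prefix>_<index>; sequence lines are wrapped at `width`.
    """
    lines = []
    for i, seq in enumerate(sequences, start=1):
        lines.append(f">{prefix}_{i} length={len(seq)}")
        lines.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""


# Stand-in strings used here in place of real model output
demo = ["ACGTACGTAC", "GGCCTTAAGG"]
print(format_fasta(demo), end="")
```

The returned text can then be written to disk, for example with `pathlib.Path("synthetic.fasta").write_text(...)`.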

Installation

pip install "cfdna-gen[hub]"

Validation Results

Metric                     Score
Overall Similarity         92.9%
Fragment Length Match      98.4%
GC Content Match           93.3%
Nucleotide Frequency       99.0%
Bimodal Peaks Detection    100%
Nucleosome Periodicity     100%

Use Cases

  • NIPT Simulation: Generate synthetic samples with known conditions for algorithm development
  • Benchmarking: Create standardized test datasets for cfDNA analysis pipelines
  • Training Data: Augment real datasets for machine learning applications
  • Method Development: Test new analysis methods on controlled synthetic data

Citation

@software{cfdna_gen,
  title={cfDNA-Gen: Conditional Causal Transformer for cfDNA Sequence Generation},
  author={Redelinghuys, Kyle},
  year={2025},
  url={https://github.com/eabhaseq/cfdna-gen}
}

License

This model is dual-licensed. See the LICENSE file for full terms.
