GENATATOR-PIPELINE

GENATATOR-PIPELINE is a Hugging Face pipeline for ab initio gene annotation from genomic DNA. It accepts a FASTA file, finds candidate transcript intervals, assigns transcript type, predicts exon and CDS structure, and writes a GFF3 annotation file.

The pipeline combines interval discovery, transcript-type classification, segmentation, filtering, and GFF generation in one transformers.pipeline call. The output of the call is a Python string containing the path to the written GFF file.

GENATATOR-PIPELINE currently supports CUDA GPU execution only and float32 inference only. CPU execution and lower precision modes are not supported in the current release.

Hugging Face pipeline usage

Basic example

from transformers import pipeline

pipe = pipeline(
    task="genatator-pipeline",
    model="AIRI-Institute/genatator-pipeline",
    trust_remote_code=True,
    device=0,
    dtype="float32",
)

output_path = pipe(
    "genome.fasta",
    output_gff_path="genome.gff",
)
print(output_path)

Example with all main supported parameters defined

from transformers import pipeline

pipe = pipeline(
    task="genatator-pipeline",
    model="AIRI-Institute/genatator-pipeline",
    trust_remote_code=True,
    device=0,
    dtype="float32",
    edge_model_path="AIRI-Institute/genatator-moderngena-base-multispecies-edge-model",
    region_model_path="AIRI-Institute/genatator-moderngena-base-multispecies-region-model",
    transcript_type_model_path="AIRI-Institute/genatator-caduceus-ps-multispecies-transcript-type",
    segmentation_model_path="AIRI-Institute/genatator-caduceus-ps-multispecies-segmentation",
    edge_context_length=1024,
    region_context_length=8192,
    transcript_type_context_length=250000,
    segmentation_context_length=250000,
    edge_average_token_length=9.0,
    region_average_token_length=9.0,
    edge_max_genomic_chunk_ratio=1.5,
    region_max_genomic_chunk_ratio=1.5,
    edge_drop_last=False,
    region_drop_last=False,
    edge_apply_sigmoid=False,
    region_apply_sigmoid=False,
    transcript_type_apply_sigmoid=True,
    segmentation_apply_sigmoid=True,
    edge_gap_token_id=5,
    region_gap_token_id=5,
)

output_path = pipe(
    "genome.fasta",
    output_gff_path="genome.gff",
    edge_context_fraction=0.5,
    region_context_fraction=0.5,
    gene_finding_use_reverse_complement=True,
    transcript_type_use_reverse_complement=True,
    segmentation_use_reverse_complement=True,
    lp_frac=0.05,
    pk_prom=0.1,
    pk_dist=50,
    pk_height=None,
    interval_window_size=2_000_000,
    max_pairs_per_seed=10,
    gene_finding_global_chunk_size=70_000_000,
    prob_threshold=0.5,
    zero_fraction_drop_threshold=0.01,
    transcript_type_threshold=0.5,
    splice_filter=True,
    deduplicate=True,
    intronic_filtering=True,
    keep_longest_terminal_variant=True,
    predict_internal_structure=True,
    transcript_coloring_thresholds="auto",
    use_cds_heuristic=True,
    save_intermediate_files=False,
    intermediate_output_dir=None,
    pairing_progress_every=1000,
    chunk_log_every=1000,
    shift=None,
)
print(output_path)

All four stage models run with batch size 1.

Parameter reference

Parameters can be provided either when creating the Hugging Face pipeline or when calling it, depending on how your application is structured. Values supplied in the pipeline call override the stored defaults for that call.

Pipeline setup arguments

  • task: Hugging Face task name for this custom pipeline. Use "genatator-pipeline".
  • model: Hugging Face repository or local directory containing the GENATATOR pipeline wrapper. For the published version, use "AIRI-Institute/genatator-pipeline".
  • trust_remote_code: Must be True because the pipeline uses custom Python code from the model repository. Without it, Transformers will not load the custom pipeline class.
  • device: CUDA GPU index used for inference, for example 0 for the first GPU. CPU execution with device=-1 is currently not supported.
  • dtype: Tensor dtype used when loading the stage models. Only "float32" is currently supported.

Model repositories

  • edge_model_path: Repository or local path for the edge model. This model predicts transcript boundary signals, namely TSS and PolyA signals on both strands.
  • region_model_path: Repository or local path for the region model. This model predicts a strand-specific intragenic signal that is used to remove weak candidate transcript intervals.
  • transcript_type_model_path: Repository or local path for the transcript-type classifier. This model labels each retained candidate interval as mRNA or lnc_RNA.
  • segmentation_model_path: Repository or local path for the segmentation model. This model predicts the internal exon, intron, and CDS structure of each retained interval.

Context-length and chunking parameters

  • edge_context_length: Token length of each edge-model input window, including tokenizer system tokens. Larger values give the edge model more context, but also increase memory use.
  • region_context_length: Token length of each region-model input window, including tokenizer system tokens. The default is 8192 tokens, matching the gene-finding benchmark and manuscript configuration.
  • transcript_type_context_length: Maximum token length passed to the transcript-type model for each candidate interval. Only the leading prefix up to this context is evaluated, so sequence beyond this limit is ignored by the classifier.
  • segmentation_context_length: Nucleotide length of each segmentation-model block inside a retained interval. Segmentation blocks are processed consecutively and without overlap.
  • edge_average_token_length: Estimated average number of nucleotides represented by one edge-model tokenizer token. The pipeline uses this value to convert token context length into genomic window length before tokenization.
  • region_average_token_length: Estimated average number of nucleotides represented by one region-model tokenizer token. The pipeline uses this value to convert token context length into genomic window length before tokenization.
  • edge_max_genomic_chunk_ratio: Maximum expansion ratio for edge-model genomic extraction before tokenizer truncation. It gives the tokenizer extra nucleotide sequence so the final tokenized window can be filled reliably.
  • region_max_genomic_chunk_ratio: Maximum expansion ratio for region-model genomic extraction before tokenizer truncation. It plays the same role as edge_max_genomic_chunk_ratio, but for the region model.
  • edge_context_fraction: Fractional overlap between consecutive edge-model genomic windows. Higher overlap can smooth boundary predictions, but it increases the number of model calls.
  • region_context_fraction: Fractional overlap between consecutive region-model genomic windows. Higher overlap can make intragenic masks more stable, but it increases computation time.
  • edge_drop_last: If True, the final incomplete edge-model window is omitted. The default is False, which keeps the final window so the end of the sequence is still processed.
  • region_drop_last: If True, the final incomplete region-model window is omitted. The default is False, which keeps the final window so the end of the sequence is still processed.
  • edge_gap_token_id: Token ID used to correct edge-model offset mappings for gap tokens. Most users should keep the default unless they change the tokenizer.
  • region_gap_token_id: Token ID used to correct region-model offset mappings for gap tokens. Most users should keep the default unless they change the tokenizer.
  • gene_finding_global_chunk_size: Maximum nucleotide length of each global chunk used by the edge model only. For each global chunk, the pipeline computes edge predictions, runs FFT smoothing and peak calling inside that chunk, keeps only sparse peak coordinates, and discards the raw edge predictions. If the DNA sequence is longer than one chunk, it then moves to the next chunk.
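The relationship between token context length, average token length, and window overlap can be illustrated with a minimal sketch. The function names below are hypothetical and are not part of the pipeline. The sketch assumes that the nominal genomic window is context_length multiplied by average_token_length, that consecutive windows overlap by context_fraction of that window, and that the extraction length is the window scaled by max_genomic_chunk_ratio. The real pipeline may account for system tokens and truncation differently.

```python
def extraction_length(context_length, average_token_length, max_genomic_chunk_ratio):
    """Nucleotides pulled from the genome before tokenizer truncation (assumed formula)."""
    return int(context_length * average_token_length * max_genomic_chunk_ratio)


def genomic_windows(seq_len, context_length, average_token_length,
                    context_fraction, drop_last=False):
    """Return half-open (start, end) nucleotide windows tiling a sequence.

    Assumptions: window = context_length * average_token_length, and
    consecutive windows overlap by `context_fraction` of the window.
    """
    window = max(1, int(context_length * average_token_length))
    step = max(1, int(window * (1.0 - context_fraction)))
    windows = []
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        if (end - start) < window and drop_last:
            break  # omit the final incomplete window
        windows.append((start, end))
        if end == seq_len:
            break  # the end of the sequence has been covered
        start += step
    return windows


# With the default edge settings, one window spans 1024 * 9.0 = 9216 nt.
print(extraction_length(1024, 9.0, 1.5))
print(genomic_windows(105, 4, 5.0, 0.5))
```

With drop_last=False, the last window is shorter than the others but still covers the end of the sequence, which matches the behaviour described for edge_drop_last and region_drop_last.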

Interval-discovery parameters

  • lp_frac: Fraction of the Fourier spectrum retained by the low-pass smoother before peak detection. Smaller values produce smoother boundary tracks and can remove local noise.
  • pk_prom: Minimum peak prominence used during TSS and PolyA boundary detection. Higher values make peak calling more conservative.
  • pk_dist: Minimum nucleotide distance between neighboring peaks of the same boundary class. This helps avoid calling several nearby peaks for one broad signal.
  • pk_height: Optional minimum peak height after smoothing. Use None to disable this extra height filter.
  • interval_window_size: Maximum distance allowed when pairing a TSS peak with a PolyA peak on the same strand. Candidate transcript intervals longer than this pairing window are not created.
  • max_pairs_per_seed: Maximum number of nearest PolyA partners retained for each TSS seed. Larger values create more candidate intervals and can increase downstream computation.
  • prob_threshold: Threshold used to convert the region-model intragenic signal into a binary mask. A base is considered intragenic only when the model signal is above this threshold.
  • zero_fraction_drop_threshold: Maximum tolerated fraction of non-intragenic bases inside a candidate interval. Intervals in which a larger fraction of bases falls below prob_threshold are discarded.
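To give a feel for what lp_frac controls, here is a minimal sketch of a Fourier low-pass smoother using NumPy. It is an illustration under stated assumptions, not the pipeline's own code: the function name fft_lowpass is hypothetical, and the sketch assumes that lp_frac is the fraction of real-FFT coefficients kept, counted from the lowest frequency.

```python
import numpy as np


def fft_lowpass(track, lp_frac):
    """Keep the lowest `lp_frac` fraction of rFFT coefficients and invert.

    Assumption: the retained fraction is counted over the rFFT
    coefficients, always keeping at least the zero-frequency term.
    """
    track = np.asarray(track, dtype=np.float64)
    spectrum = np.fft.rfft(track)
    keep = max(1, int(np.ceil(lp_frac * spectrum.size)))
    spectrum[keep:] = 0.0  # zero out the high-frequency components
    return np.fft.irfft(spectrum, n=track.size)


rng = np.random.default_rng(0)
noisy = rng.normal(size=1024)
smooth = fft_lowpass(noisy, lp_frac=0.05)
print(noisy.std(), smooth.std())
```

Peak calling would then run on the smoothed track. The names pk_prom, pk_dist, and pk_height resemble the prominence, distance, and height arguments of scipy.signal.find_peaks, but the source does not state which peak finder the pipeline uses.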

Reverse-complement options

  • gene_finding_use_reverse_complement: Enables reverse-complement averaging for the edge and region models. This can improve strand-aware interval discovery, but it roughly doubles gene-finding model calls.
  • transcript_type_use_reverse_complement: Enables reverse-complement averaging for transcript-type classification. The forward and reverse-complement scores are averaged before the final mRNA or lnc_RNA decision.
  • segmentation_use_reverse_complement: Enables reverse-complement averaging for segmentation. This can stabilize structure prediction, but it increases segmentation compute cost.
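Reverse-complement averaging can be sketched in a few lines. The helper names below are hypothetical. The sketch assumes a single per-nucleotide score track; for strand-specific channels, a real implementation would presumably also swap the strand channels, which is not shown here.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def average_with_reverse_complement(forward_track, rc_track):
    """Average per-nucleotide scores from forward and reverse-complement passes.

    `rc_track` is indexed along the reverse-complemented sequence, so it is
    flipped back to forward coordinates before averaging.
    """
    if len(forward_track) != len(rc_track):
        raise ValueError("tracks must have the same length")
    realigned = rc_track[::-1]
    return [(f + r) / 2.0 for f, r in zip(forward_track, realigned)]


print(reverse_complement("ACGTN"))
print(average_with_reverse_complement([0.0, 1.0, 0.5], [0.5, 0.0, 1.0]))
```

Each enabled option adds a second model pass over the reverse-complemented input, which is where the extra compute cost comes from.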

Activation options

  • edge_apply_sigmoid: Applies an additional sigmoid to edge-model output channels before token-to-nucleotide projection. The default is False, because the published edge model outputs are already expected in the correct scale.
  • region_apply_sigmoid: Applies an additional sigmoid to region-model output channels before token-to-nucleotide projection. The default is False, because the published region model outputs are already expected in the correct scale.
  • transcript_type_apply_sigmoid: Applies sigmoid to single-logit transcript-type outputs before thresholding. For multi-logit outputs, the pipeline uses softmax instead.
  • segmentation_apply_sigmoid: Applies an additional sigmoid to segmentation-model output channels before structural decoding. The default is True for the published segmentation setup.
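The single-logit versus multi-logit behaviour described for transcript_type_apply_sigmoid can be illustrated as follows. This is a sketch only: the function lnc_rna_probability is hypothetical, and the position of the lnc_RNA class in a multi-logit output (lnc_index=1) is an assumption that the source does not confirm.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lnc_rna_probability(logits, apply_sigmoid=True, lnc_index=1):
    """Turn classifier logits into an lnc_RNA probability.

    Single logit: optional sigmoid. Several logits: softmax, then pick
    the class at `lnc_index` (assumed class order).
    """
    if len(logits) == 1:
        return sigmoid(logits[0]) if apply_sigmoid else logits[0]
    return softmax(logits)[lnc_index]


print(lnc_rna_probability([0.0]))
print(lnc_rna_probability([2.0, 0.0]))
```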

Transcript and segmentation parameters

  • transcript_type_threshold: Threshold applied to the predicted lnc_RNA probability. Intervals at or above this value are labeled lnc_RNA, and intervals below it are labeled mRNA.
  • splice_filter: Enables splice-motif filtering and terminal splice-boundary correction for exon and CDS segments. This post-processing step can remove or adjust segments that disagree with expected splice signals.
  • deduplicate: Removes duplicate final transcript predictions. This is applied near the end of GFF generation to avoid repeated identical transcripts.
  • intronic_filtering: Drops transcript predictions whose segmentation starts or ends with the intron class. This removes predictions that appear to begin or end inside an intron.
  • keep_longest_terminal_variant: For overlapping transcripts with the same internal structure, keeps the longest terminal variant. This reduces redundant terminal variants that differ mainly by transcript ends.
  • predict_internal_structure: Controls whether the pipeline continues past interval discovery into transcript-type classification, segmentation, and GFF generation. Keep it as True for normal annotation output.
  • use_cds_heuristic: Replaces predicted CDS segments with the exon-derived CDS heuristic used in the accompanying benchmark code. This affects mRNA transcripts only, and no CDS is emitted for lnc_RNA transcripts.
  • transcript_coloring_thresholds: Controls transcript color bins in the output GFF. Use "auto" to split the observed segmentation-confidence range into four bins, or provide a custom list of exactly four thresholds.
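Two of the filters above, intronic_filtering and deduplicate, can be sketched on a toy transcript representation. The dictionary layout (seqid, strand, exons, classes) and both function names are hypothetical and chosen only for illustration. The sketch also assumes that two transcripts are duplicates when they share sequence, strand, and exon coordinates.

```python
def drop_intron_terminal(transcripts):
    """Drop transcripts whose class sequence starts or ends with 'intron'."""
    kept = []
    for transcript in transcripts:
        classes = transcript["classes"]
        if classes and classes[0] != "intron" and classes[-1] != "intron":
            kept.append(transcript)
    return kept


def deduplicate(transcripts):
    """Keep the first transcript for each (seqid, strand, exon chain) key."""
    seen = set()
    unique = []
    for transcript in transcripts:
        key = (transcript["seqid"], transcript["strand"], tuple(transcript["exons"]))
        if key not in seen:
            seen.add(key)
            unique.append(transcript)
    return unique


tx_a = {"seqid": "chr1", "strand": "+", "exons": [(1, 100), (201, 300)],
        "classes": ["exon", "intron", "exon"]}
tx_b = dict(tx_a)  # identical copy of tx_a
tx_c = {"seqid": "chr1", "strand": "+", "exons": [(501, 600)],
        "classes": ["intron", "exon"]}
print(len(deduplicate(drop_intron_terminal([tx_a, tx_b, tx_c]))))
```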

The hardcoded transcript color map is applied to the final transcript set after filtering, deduplication, longest-terminal-variant selection, and optional CDS heuristic processing.

  • Lowest bin, #66cc66, light green.
  • Second bin, #006400, dark green.
  • Third bin, #dcdcff, light blue.
  • Top bin, #0c0c78, dark blue.
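As an illustration of the "auto" mode, the sketch below splits the observed confidence range into four equal-width bins and maps each bin to the colors listed above. Equal-width binning is an assumption, and so is the use of three interior cut points; the source does not describe how a custom list of four thresholds maps onto bins. The function names are hypothetical.

```python
import bisect

# Colors from lowest to top bin, as listed in this document.
COLORS = ["#66cc66", "#006400", "#dcdcff", "#0c0c78"]


def auto_cut_points(confidences):
    """Three interior cut points splitting [min, max] into four equal bins (assumed)."""
    low, high = min(confidences), max(confidences)
    width = (high - low) / 4.0
    return [low + width * i for i in (1, 2, 3)]


def color_for(confidence, cut_points):
    """Pick the color of the bin that `confidence` falls into."""
    return COLORS[bisect.bisect_right(cut_points, confidence)]


scores = [0.0, 0.3, 0.6, 1.0]
cuts = auto_cut_points(scores)
print([color_for(s, cuts) for s in scores])
```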

Intermediate-output, logging, and coordinate parameters

  • save_intermediate_files: If True, writes gene-finding intermediate artifacts for each FASTA record. In the memory-efficient path, these are compact edge peak .npz files, compact intragenic-mask .npz files, .bed interval files, and a compressed .h5 debug dump when h5py is installed.
  • intermediate_output_dir: Output directory for intermediate artifacts. If omitted, intermediate files are written next to the input FASTA file.
  • pairing_progress_every: Logging interval, measured in TSS seeds, during candidate interval construction. Increase it for less frequent logs.
  • chunk_log_every: Logging interval, measured in genomic chunks, during edge and region inference. Increase it for less frequent logs on large genomes.
  • shift: Coordinate offset applied to final GFF coordinates. Use an integer offset directly, or use "UCSC" to infer the offset from FASTA headers of the form chrom:start-end.
  • output_gff_path: Path of the GFF file written by the pipeline call. If you do not provide it, the pipeline writes to a default GFF path derived from the input FASTA path.
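The "UCSC" mode of shift relies on headers of the form chrom:start-end. A minimal sketch of parsing such a header is shown below. The function names are hypothetical, and using the parsed start directly as the offset is an assumption: the source does not say whether the pipeline adjusts it by one for 0-based versus 1-based coordinates.

```python
import re

# Matches an optional '>' followed by chrom:start-end at the start of a header.
_UCSC_HEADER = re.compile(r"^>?(?P<chrom>[^\s:]+):(?P<start>\d+)-(?P<end>\d+)")


def parse_ucsc_header(header):
    """Return (chrom, start, end) or None if the header has another form."""
    match = _UCSC_HEADER.match(header.strip())
    if match is None:
        return None
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))


def resolve_shift(shift, header):
    """Resolve a `shift` setting to an integer offset (assumed semantics)."""
    if shift is None:
        return 0
    if shift == "UCSC":
        parsed = parse_ucsc_header(header)
        if parsed is None:
            raise ValueError(f"header is not of the form chrom:start-end: {header!r}")
        return parsed[1]  # assumption: the offset is the start coordinate
    return int(shift)


print(parse_ucsc_header(">chr1:1000-2000"))
print(resolve_shift("UCSC", "chr2:300-900 some description"))
```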

What the pipeline does

1. Interval discovery

The first stage identifies candidate transcript intervals with two strand-aware DNA language models.

  • The edge model detects transcription start site and polyadenylation signals, abbreviated as TSS and PolyA.
  • The region model predicts intragenic signal, which is used to filter candidate intervals.

Edge prediction is processed in global chunks controlled by gene_finding_global_chunk_size. After each global chunk is peak-called, the raw edge predictions are discarded and only sparse peak coordinates are kept.

Region prediction uses streaming thresholded intragenic masks. The pipeline does not need to keep full chromosome-length float32 region tracks in memory.

Candidate intervals are formed by pairing strand-compatible TSS and PolyA peaks. Intervals with too much non-intragenic sequence are removed before transcript-type classification and segmentation.
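The pairing and filtering logic described above can be sketched for the forward strand as follows. Both functions are hypothetical illustrations, not the pipeline's code. The sketch assumes that each TSS is paired with its nearest downstream PolyA peaks and that the intragenic mask is a list of 0 and 1 values; the reverse strand would mirror this logic.

```python
import bisect


def pair_peaks(tss_peaks, polya_peaks, interval_window_size, max_pairs_per_seed):
    """Pair each TSS with its nearest downstream PolyA peaks (forward strand)."""
    polya_sorted = sorted(polya_peaks)
    intervals = []
    for tss in sorted(tss_peaks):
        first = bisect.bisect_right(polya_sorted, tss)  # first PolyA after the TSS
        for polya in polya_sorted[first:first + max_pairs_per_seed]:
            if polya - tss > interval_window_size:
                break  # farther partners would exceed the pairing window
            intervals.append((tss, polya))
    return intervals


def passes_region_filter(mask, start, end, zero_fraction_drop_threshold):
    """Keep an interval only if its non-intragenic fraction is small enough."""
    span = mask[start:end]
    if not span:
        return False
    zero_fraction = span.count(0) / len(span)
    return zero_fraction <= zero_fraction_drop_threshold


print(pair_peaks([10, 100], [50, 120, 5000], interval_window_size=1000,
                 max_pairs_per_seed=2))
```

Raising max_pairs_per_seed or interval_window_size increases the number of candidate intervals, and every surviving candidate is later classified and segmented, so both settings affect downstream run time.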

2. Transcript-type assignment

Each retained interval is classified by the transcript-type model as either mRNA or lnc_RNA. Only the leading token prefix defined by transcript_type_context_length is evaluated.

When reverse-complement averaging is enabled for this stage, forward and reverse-complement predictions are averaged. The final decision is controlled by transcript_type_threshold.
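The decision rule can be summarized in a short sketch. The function classify_transcript is hypothetical; it follows the rule stated in the parameter reference, where intervals at or above the threshold are labeled lnc_RNA and the rest mRNA.

```python
def classify_transcript(forward_prob, reverse_prob=None, threshold=0.5):
    """Return (label, lnc_RNA probability) for one candidate interval.

    If a reverse-complement probability is given, the two are averaged
    before thresholding.
    """
    if reverse_prob is None:
        prob = forward_prob
    else:
        prob = (forward_prob + reverse_prob) / 2.0
    label = "lnc_RNA" if prob >= threshold else "mRNA"
    return label, prob


print(classify_transcript(0.8, 0.4))
print(classify_transcript(0.2, 0.4))
```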

3. Segmentation

Each retained interval is segmented into nucleotide-level structural classes by the segmentation model. Exons are derived from exon-versus-intron competition, and CDS segments are derived from CDS-versus-non-CDS competition.

Segmentation is stitched from non-overlapping interval blocks. When tokenizer offset mappings are available, token-level outputs are projected to nucleotide coordinates.
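A minimal sketch of the two ideas in this section, class competition and non-overlapping blocks, is given below. The function names are hypothetical, and the sketch assumes that per-nucleotide scores for both competing classes are already available as plain lists.

```python
def call_segments(class_scores, competitor_scores):
    """Return half-open (start, end) runs where the first class wins.

    For exons, pass exon scores and intron scores; for CDS, pass CDS
    scores and non-CDS scores.
    """
    length = min(len(class_scores), len(competitor_scores))
    segments = []
    run_start = None
    for i in range(length):
        if class_scores[i] > competitor_scores[i]:
            if run_start is None:
                run_start = i  # a new winning run begins here
        elif run_start is not None:
            segments.append((run_start, i))
            run_start = None
    if run_start is not None:
        segments.append((run_start, length))
    return segments


def block_ranges(interval_length, block_length):
    """Consecutive, non-overlapping blocks covering an interval."""
    return [(start, min(start + block_length, interval_length))
            for start in range(0, interval_length, block_length)]


exon = [0.9, 0.8, 0.1, 0.2, 0.7, 0.9]
intron = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]
print(call_segments(exon, intron))
print(block_ranges(600_000, 250_000))
```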

4. GFF generation

The final annotation contains these feature types.

  • gene
  • mRNA or lnc_RNA
  • exon
  • CDS, for mRNA transcripts only

No CDS is emitted for lnc_RNA transcripts.

The GFF transcript attributes include lncRNA_probability, mRNA_probability, exon_segmentation_confidence, cds_segmentation_confidence, segmentation_confidence, and color. Exon and CDS features include mean_probability, and intron features are not emitted in the output GFF.
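For orientation, a GFF3 feature line has nine tab-separated columns, with attributes written as key=value pairs separated by semicolons. The sketch below formats such a line. The helper gff3_line is hypothetical, and the source column value "GENATATOR" is an assumption; the value the pipeline actually writes is not stated here.

```python
def gff3_line(seqid, feature_type, start, end, strand, attributes,
              source="GENATATOR", score=".", phase="."):
    """Format one GFF3 feature line (1-based, inclusive coordinates)."""
    attribute_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, feature_type, str(start), str(end),
               score, strand, phase, attribute_text]
    return "\t".join(columns)


print("##gff-version 3")
print(gff3_line("chr1", "mRNA", 101, 900, "+",
                {"ID": "tx1", "Parent": "gene1", "lncRNA_probability": 0.12,
                 "color": "#006400"}))
print(gff3_line("chr1", "exon", 101, 200, "+",
                {"ID": "tx1.exon1", "Parent": "tx1"}))
```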

Default model repositories

  • edge_model_path, AIRI-Institute/genatator-moderngena-base-multispecies-edge-model
  • region_model_path, AIRI-Institute/genatator-moderngena-base-multispecies-region-model
  • transcript_type_model_path, AIRI-Institute/genatator-caduceus-ps-multispecies-transcript-type
  • segmentation_model_path, AIRI-Institute/genatator-caduceus-ps-multispecies-segmentation

Input and output

Input

  • Path to a FASTA file.
  • The FASTA file may contain one record or multiple records.

Output

  • A single Python string, the path to the written .gff file.
  • The file contents follow the GFF3 specification.
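Because the output is a standard GFF3 file, a quick sanity check is to count feature types. The sketch below does this on GFF3 text; the function name is hypothetical, and the sample text is made up for illustration rather than taken from real pipeline output.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count the values in column 3 of GFF3 text, skipping comments and directives."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


sample = "\n".join([
    "##gff-version 3",
    "chr1\t.\tgene\t101\t900\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t101\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\t.\texon\t101\t200\t.\t+\t.\tParent=tx1",
    "chr1\t.\texon\t701\t900\t.\t+\t.\tParent=tx1",
])
print(count_feature_types(sample))
```

To run this on a real result, read the file at the path returned by the pipeline call and pass its contents to count_feature_types.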

Dependencies

Create the Conda environment from environment.yml before running the pipeline locally. This project currently requires a CUDA-capable GPU.

conda env create -f environment.yml
conda activate genatator_pipeline

If the simple setup fails, use the robust staged setup. This follows the same strategy as Docker startup.

conda env create -n genatator_pipeline -f docker/conda-core.yml
conda activate genatator_pipeline

pip install torch==2.2.2+cu121 torchvision==0.17.2+cu121 torchaudio==2.2.2+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install causal-conv1d==1.4.0 --no-build-isolation
pip install mamba-ssm==2.2.2 --no-build-isolation
pip install packaging==26.0 ninja==1.13.0 psutil==7.2.2
pip install flash-attn==2.6.3 --no-build-isolation
pip install -r docker/requirements.txt

Output annotation

The written GFF file contains one gene feature for each predicted gene locus and one transcript feature for each predicted transcript. Exons and CDS features are derived from the segmentation stage, and CDS features are emitted only for transcripts classified as mRNA.

The attribute field of each transcript includes transcript-type probabilities and segmentation-confidence values. The lncRNA_probability attribute stores the score produced by the transcript-type model.

Docker deployment

All Docker assets are in docker/.

Build.

docker build -f docker/Dockerfile -t genatator-pipeline:latest .

Run.

docker run --gpus all --rm -p 3000:3000 -v "$(pwd)":/generated genatator-pipeline:latest

API endpoint.

  • POST /api/genatator-pipeline/upload
  • Input, multipart file containing FASTA, or form field dna.
  • Output JSON fields, fasta_file, fai_file, gff_file, and archive.

Example.

curl -X POST "http://localhost:3000/api/genatator-pipeline/upload" -F "file=@genome.fasta"