GENATATOR-PIPELINE
GENATATOR-PIPELINE is a Hugging Face pipeline for ab initio gene annotation from genomic DNA. It accepts a FASTA file, finds candidate transcript intervals, assigns transcript type, predicts exon and CDS structure, and writes a GFF3 annotation file.
The pipeline combines interval discovery, transcript-type classification, segmentation, filtering, and GFF generation in one transformers.pipeline call. The output of the call is a Python string containing the path to the written GFF file.
GENATATOR-PIPELINE currently supports CUDA GPU execution only and float32 inference only. CPU execution and lower precision modes are not supported in the current release.
Hugging Face pipeline usage
Basic example
from transformers import pipeline

pipe = pipeline(
    task="genatator-pipeline",
    model="AIRI-Institute/genatator-pipeline",
    trust_remote_code=True,
    device=0,
    dtype="float32",
)

output_path = pipe(
    "genome.fasta",
    output_gff_path="genome.gff",
)
print(output_path)
Example with all main supported parameters defined
from transformers import pipeline

pipe = pipeline(
    task="genatator-pipeline",
    model="AIRI-Institute/genatator-pipeline",
    trust_remote_code=True,
    device=0,
    dtype="float32",
    edge_model_path="AIRI-Institute/genatator-moderngena-base-multispecies-edge-model",
    region_model_path="AIRI-Institute/genatator-moderngena-base-multispecies-region-model",
    transcript_type_model_path="AIRI-Institute/genatator-caduceus-ps-multispecies-transcript-type",
    segmentation_model_path="AIRI-Institute/genatator-caduceus-ps-multispecies-segmentation",
    edge_context_length=1024,
    region_context_length=8192,
    transcript_type_context_length=250000,
    segmentation_context_length=250000,
    edge_average_token_length=9.0,
    region_average_token_length=9.0,
    edge_max_genomic_chunk_ratio=1.5,
    region_max_genomic_chunk_ratio=1.5,
    edge_drop_last=False,
    region_drop_last=False,
    edge_apply_sigmoid=False,
    region_apply_sigmoid=False,
    transcript_type_apply_sigmoid=True,
    segmentation_apply_sigmoid=True,
    edge_gap_token_id=5,
    region_gap_token_id=5,
)

output_path = pipe(
    "genome.fasta",
    output_gff_path="genome.gff",
    edge_context_fraction=0.5,
    region_context_fraction=0.5,
    gene_finding_use_reverse_complement=True,
    transcript_type_use_reverse_complement=True,
    segmentation_use_reverse_complement=True,
    lp_frac=0.05,
    pk_prom=0.1,
    pk_dist=50,
    pk_height=None,
    interval_window_size=2_000_000,
    max_pairs_per_seed=10,
    gene_finding_global_chunk_size=70_000_000,
    prob_threshold=0.5,
    zero_fraction_drop_threshold=0.01,
    transcript_type_threshold=0.5,
    splice_filter=True,
    deduplicate=True,
    intronic_filtering=True,
    keep_longest_terminal_variant=True,
    predict_internal_structure=True,
    transcript_coloring_thresholds="auto",
    use_cds_heuristic=True,
    save_intermediate_files=False,
    intermediate_output_dir=None,
    pairing_progress_every=1000,
    chunk_log_every=1000,
    shift=None,
)
print(output_path)
All four stage models run with batch size 1.
Parameter reference
Parameters can be provided either when creating the Hugging Face pipeline or when calling it, depending on how your application is structured. Values supplied in the pipeline call override the stored defaults for that call.
Pipeline setup arguments
- task: Hugging Face task name for this custom pipeline. Use "genatator-pipeline".
- model: Hugging Face repository or local directory containing the GENATATOR pipeline wrapper. For the published version, use "AIRI-Institute/genatator-pipeline".
- trust_remote_code: Must be True because the pipeline uses custom Python code from the model repository. Without it, Transformers will not load the custom pipeline class.
- device: CUDA GPU index used for inference, for example 0 for the first GPU. CPU execution with device=-1 is not supported currently.
- dtype: Tensor dtype used when loading the stage models. Only "float32" is supported currently.
Model repositories
- edge_model_path: Repository or local path for the edge model. This model predicts transcript boundary signals, namely TSS and PolyA signals on both strands.
- region_model_path: Repository or local path for the region model. This model predicts strand-specific intragenic signal that is used to remove weak candidate transcript intervals.
- transcript_type_model_path: Repository or local path for the transcript-type classifier. This model labels each retained candidate interval as mRNA or lnc_RNA.
- segmentation_model_path: Repository or local path for the segmentation model. This model predicts the internal exon, intron, and CDS structure of each retained interval.
Context-length and chunking parameters
- edge_context_length: Token length of each edge-model input window, including tokenizer system tokens. Larger values give the edge model more context, but also increase memory use.
- region_context_length: Token length of each region-model input window, including tokenizer system tokens. The default is 8192 tokens, matching the gene-finding benchmark and manuscript configuration.
- transcript_type_context_length: Maximum token length passed to the transcript-type model for each candidate interval. Only the leading prefix up to this context is evaluated, so sequence beyond this limit is ignored by the classifier.
- segmentation_context_length: Nucleotide length of each segmentation-model block inside a retained interval. Segmentation blocks are processed consecutively and without overlap.
- edge_average_token_length: Estimated average number of nucleotides represented by one edge-model tokenizer token. The pipeline uses this value to convert token context length into genomic window length before tokenization.
- region_average_token_length: Estimated average number of nucleotides represented by one region-model tokenizer token. The pipeline uses this value to convert token context length into genomic window length before tokenization.
- edge_max_genomic_chunk_ratio: Maximum expansion ratio for edge-model genomic extraction before tokenizer truncation. It gives the tokenizer extra nucleotide sequence so the final tokenized window can be filled reliably.
- region_max_genomic_chunk_ratio: Maximum expansion ratio for region-model genomic extraction before tokenizer truncation. It plays the same role as edge_max_genomic_chunk_ratio, but for the region model.
- edge_context_fraction: Fractional overlap between consecutive edge-model genomic windows. Higher overlap can smooth boundary predictions, but it increases the number of model calls.
- region_context_fraction: Fractional overlap between consecutive region-model genomic windows. Higher overlap can make intragenic masks more stable, but it increases computation time.
- edge_drop_last: If True, the final incomplete edge-model window is omitted. The default is False, which keeps the final window so the end of the sequence is still processed.
- region_drop_last: If True, the final incomplete region-model window is omitted. The default is False, which keeps the final window so the end of the sequence is still processed.
- edge_gap_token_id: Token ID used to correct edge-model offset mappings for gap tokens. Most users should keep the default unless they change the tokenizer.
- region_gap_token_id: Token ID used to correct region-model offset mappings for gap tokens. Most users should keep the default unless they change the tokenizer.
- gene_finding_global_chunk_size: Maximum nucleotide length of each global chunk used by the edge model only. For each global chunk, the pipeline computes edge predictions, runs FFT smoothing and peak calling inside that chunk, keeps only sparse peak coordinates, discards the raw edge predictions, and then moves to the next chunk if the DNA sequence is longer.
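The relationship between token context length, average token length, and window overlap can be illustrated with a small sketch. This is not the pipeline's code. It assumes the simplest reading of the parameters above: the genomic window is context_length * average_token_length nucleotides long, consecutive windows overlap by context_fraction of that length, and system tokens and the max_genomic_chunk_ratio expansion are ignored. The function name genomic_windows is hypothetical.

```python
def genomic_windows(seq_len, context_length, average_token_length,
                    context_fraction, drop_last=False):
    """Return half-open (start, end) nucleotide windows covering a sequence.

    Assumptions (not confirmed by the source): window length is
    context_length * average_token_length, and consecutive windows
    overlap by context_fraction of that length.
    """
    if not 0.0 <= context_fraction < 1.0:
        raise ValueError("context_fraction must be in [0, 1)")
    window = int(context_length * average_token_length)
    step = max(1, int(window * (1.0 - context_fraction)))
    windows = []
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        incomplete = (end - start) < window
        if incomplete and drop_last:
            break  # mimic drop_last=True: skip the final short window
        windows.append((start, end))
        if end == seq_len:
            break
        start += step
    return windows


# Edge-model defaults from the example above: 1024 tokens, 9.0 nt per token.
print(genomic_windows(20_000, 1024, 9.0, 0.5))
```

Under these assumptions the default edge settings give windows of 9216 nucleotides that advance by 4608 nucleotides, so halving the overlap fraction roughly halves the number of model calls.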
Interval-discovery parameters
- lp_frac: Fraction of the Fourier spectrum retained by the low-pass smoother before peak detection. Smaller values produce smoother boundary tracks and can remove local noise.
- pk_prom: Minimum peak prominence used during TSS and PolyA boundary detection. Higher values make peak calling more conservative.
- pk_dist: Minimum nucleotide distance between neighboring peaks of the same boundary class. This helps avoid calling several nearby peaks for one broad signal.
- pk_height: Optional minimum peak height after smoothing. Use None to disable this extra height filter.
- interval_window_size: Maximum distance allowed when pairing a TSS peak with a PolyA peak on the same strand. Candidate transcript intervals longer than this pairing window are not created.
- max_pairs_per_seed: Maximum number of nearest PolyA partners retained for each TSS seed. Larger values create more candidate intervals and can increase downstream computation.
- prob_threshold: Threshold used to convert region-model intragenic signal into a binary mask. A base is considered intragenic only when the model signal is above this threshold.
- zero_fraction_drop_threshold: Maximum tolerated fraction of non-intragenic bases inside a candidate interval. Intervals with a larger fraction below prob_threshold are discarded.
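How interval_window_size and max_pairs_per_seed interact can be sketched as follows. This is an illustration, not the pipeline's implementation: it handles the plus strand only, assumes a PolyA partner must lie strictly downstream of its TSS, and the function name pair_peaks is hypothetical.

```python
from bisect import bisect_right


def pair_peaks(tss_peaks, polya_peaks, interval_window_size, max_pairs_per_seed):
    """Pair each TSS with its nearest downstream PolyA peaks (plus strand).

    Returns a list of (tss, polya) coordinate pairs. A pair is kept only if
    the PolyA peak lies within interval_window_size of the TSS, and at most
    max_pairs_per_seed nearest partners are kept per TSS.
    """
    polya_sorted = sorted(polya_peaks)
    intervals = []
    for tss in sorted(tss_peaks):
        # Index of the first PolyA peak strictly downstream of this TSS.
        first = bisect_right(polya_sorted, tss)
        kept = 0
        for polya in polya_sorted[first:]:
            if polya - tss > interval_window_size or kept == max_pairs_per_seed:
                break
            intervals.append((tss, polya))
            kept += 1
    return intervals


print(pair_peaks([100, 5000], [50, 900, 1500, 7000, 3_000_000],
                 interval_window_size=2000, max_pairs_per_seed=2))
```

Because every TSS seed can contribute up to max_pairs_per_seed intervals, the number of candidates passed downstream grows roughly linearly with that setting.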
Reverse-complement options
- gene_finding_use_reverse_complement: Enables reverse-complement averaging for the edge and region models. This can improve strand-aware interval discovery, but it roughly doubles gene-finding model calls.
- transcript_type_use_reverse_complement: Enables reverse-complement averaging for transcript-type classification. The forward and reverse-complement scores are averaged before the final mRNA or lnc_RNA decision.
- segmentation_use_reverse_complement: Enables reverse-complement averaging for segmentation. This can stabilize structure prediction, but it increases segmentation compute cost.
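The idea behind reverse-complement averaging can be sketched in a few lines. This is a generic illustration, not the pipeline's code: it assumes one per-nucleotide score track per strand orientation and ignores the swap of plus-strand and minus-strand channels that strand-specific models would also need. Both function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def average_with_reverse_complement(forward_track, rc_track):
    """Average per-base scores from the forward and reverse-complement passes.

    rc_track holds scores predicted on the reverse-complemented sequence, so
    its position i corresponds to forward position len - 1 - i. The track is
    reversed to realign it before averaging.
    """
    if len(forward_track) != len(rc_track):
        raise ValueError("tracks must have the same length")
    realigned = rc_track[::-1]
    return [(f + r) / 2.0 for f, r in zip(forward_track, realigned)]


print(reverse_complement("ACGTN"))
print(average_with_reverse_complement([0.25, 0.5, 0.75], [1.0, 0.0, 0.25]))
```

Each enabled option requires a second model pass over the reverse-complemented input, which is why the cost roughly doubles for that stage.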
Activation options
- edge_apply_sigmoid: Applies an additional sigmoid to edge-model output channels before token-to-nucleotide projection. The default is False, because the published edge model outputs are already expected in the correct scale.
- region_apply_sigmoid: Applies an additional sigmoid to region-model output channels before token-to-nucleotide projection. The default is False, because the published region model outputs are already expected in the correct scale.
- transcript_type_apply_sigmoid: Applies sigmoid to single-logit transcript-type outputs before thresholding. For multi-logit outputs, the pipeline uses softmax instead.
- segmentation_apply_sigmoid: Applies an additional sigmoid to segmentation-model output channels before structural decoding. The default is True for the published segmentation setup.
Transcript and segmentation parameters
- transcript_type_threshold: Threshold applied to the predicted lnc_RNA probability. Intervals at or above this value are labeled lnc_RNA, and intervals below it are labeled mRNA.
- splice_filter: Enables splice-motif filtering and terminal splice-boundary correction for exon and CDS segments. This post-processing step can remove or adjust segments that disagree with expected splice signals.
- deduplicate: Removes duplicate final transcript predictions. This is applied near the end of GFF generation to avoid repeated identical transcripts.
- intronic_filtering: Drops transcript predictions whose segmentation starts or ends with the intron class. This removes predictions that appear to begin or end inside an intron.
- keep_longest_terminal_variant: For overlapping transcripts with the same internal structure, keeps the longest terminal variant. This reduces redundant terminal variants that differ mainly by transcript ends.
- predict_internal_structure: Controls whether the pipeline continues past interval discovery into transcript-type classification, segmentation, and GFF generation. Keep it as True for normal annotation output.
- use_cds_heuristic: Replaces predicted CDS segments with the exon-derived CDS heuristic used in the accompanying benchmark code. This affects mRNA transcripts only, and no CDS is emitted for lnc_RNA transcripts.
- transcript_coloring_thresholds: Controls transcript color bins in the output GFF. Use "auto" to split the observed segmentation-confidence range into four bins, or provide a custom list of exactly four thresholds.
The hardcoded transcript color map is applied to the final transcript set after filtering, deduplication, longest-terminal-variant selection, and optional CDS heuristic processing.
- Lowest bin, #66cc66, light green.
- Second bin, #006400, dark green.
- Third bin, #dcdcff, light blue.
- Top bin, #0c0c78, dark blue.
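One way to picture the "auto" setting is to split the observed confidence range into four equal-width bins and map each bin to the colors listed above. The sketch below does exactly that and nothing more. The equal-width split, the use of three inner bin edges, and the names auto_bin_edges and color_for are assumptions for illustration; the text does not state how the pipeline computes its bins or how a custom list of four thresholds maps onto them.

```python
# Colors taken from the list above, ordered from lowest to top bin.
COLORS = ["#66cc66", "#006400", "#dcdcff", "#0c0c78"]


def auto_bin_edges(confidences):
    """Return three inner edges that split [min, max] into four equal bins."""
    lo, hi = min(confidences), max(confidences)
    width = (hi - lo) / 4.0
    return [lo + width * k for k in (1, 2, 3)]


def color_for(confidence, inner_edges):
    """Map a confidence value to a color by counting the edges it reaches."""
    bin_index = sum(confidence >= edge for edge in inner_edges)
    return COLORS[bin_index]


confidences = [0.0, 0.2, 0.5, 0.8, 1.0]
edges = auto_bin_edges(confidences)
for value in confidences:
    print(value, color_for(value, edges))
```

In this sketch, if all transcripts share the same confidence the range collapses and every transcript falls into the top bin.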
Intermediate-output, logging, and coordinate parameters
- save_intermediate_files: If True, writes gene-finding intermediate artifacts for each FASTA record. In the memory-efficient path, these are compact edge peak .npz files, compact intragenic-mask .npz files, .bed interval files, and a compressed .h5 debug dump when h5py is installed.
- intermediate_output_dir: Output directory for intermediate artifacts. If omitted, intermediate files are written next to the input FASTA file.
- pairing_progress_every: Logging interval, measured in TSS seeds, during candidate interval construction. Increase it for less frequent logs.
- chunk_log_every: Logging interval, measured in genomic chunks, during edge and region inference. Increase it for less frequent logs on large genomes.
- shift: Coordinate offset applied to final GFF coordinates. Use an integer offset directly, or use "UCSC" to infer the offset from FASTA headers of the form chrom:start-end.
- output_gff_path: Path of the GFF file written by the pipeline call. If you do not provide it, the pipeline writes a default GFF path derived from the input FASTA path.
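To check in advance whether your FASTA headers match the chrom:start-end form that shift="UCSC" expects, a small parser like the one below can help. It is a sketch with a hypothetical function name, and it only extracts the header fields. Whether the pipeline derives the offset as start or start - 1 is not stated here, so the sketch deliberately does not compute an offset.

```python
import re

# Matches headers such as ">chr1:1000-2000" (thousands separators allowed).
_UCSC_HEADER = re.compile(
    r"^>?(?P<chrom>[^\s:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)"
)


def parse_ucsc_header(header):
    """Return (chrom, start, end) from a UCSC-style header, or None."""
    match = _UCSC_HEADER.match(header.strip())
    if match is None:
        return None
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    return match.group("chrom"), start, end


print(parse_ucsc_header(">chr1:1000-2000"))
print(parse_ucsc_header(">scaffold_12"))
```

Headers that return None here carry no coordinate information, so an explicit integer shift (or no shift) is the safer choice for them.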
What the pipeline does
1. Interval discovery
The first stage identifies candidate transcript intervals with two strand-aware DNA language models.
- The edge model detects transcription start site and polyadenylation signals, abbreviated as TSS and PolyA.
- The region model predicts intragenic signal, which is used to filter candidate intervals.
Edge prediction is processed in global chunks controlled by gene_finding_global_chunk_size. After each global chunk is peak-called, the raw edge predictions are discarded and only sparse peak coordinates are kept.
Region prediction uses streaming thresholded intragenic masks. The pipeline does not need to keep full chromosome-length float32 region tracks in memory.
Candidate intervals are formed by pairing strand-compatible TSS and PolyA peaks. Intervals with too much non-intragenic sequence are removed before transcript-type classification and segmentation.
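The intragenic filter described above can be sketched as a simple fraction test. This follows the parameter descriptions for prob_threshold and zero_fraction_drop_threshold, but it is an illustration only: the function name keep_interval is hypothetical, and the real pipeline works on streamed binary masks rather than on full score lists.

```python
def keep_interval(region_scores, prob_threshold=0.5,
                  zero_fraction_drop_threshold=0.01):
    """Decide whether a candidate interval survives the intragenic filter.

    region_scores: per-base region-model scores inside the interval.
    A base counts as intragenic only when its score is above prob_threshold.
    The interval is kept when the fraction of non-intragenic bases does not
    exceed zero_fraction_drop_threshold.
    """
    if not region_scores:
        return False
    non_intragenic = sum(1 for score in region_scores if score <= prob_threshold)
    return non_intragenic / len(region_scores) <= zero_fraction_drop_threshold


# 1 weak base out of 100 is tolerated with the default 0.01; 2 are not.
print(keep_interval([0.9] * 99 + [0.1]))
print(keep_interval([0.9] * 98 + [0.1] * 2))
```

With the default of 0.01, an interval is dropped as soon as more than 1% of its bases fall at or below prob_threshold, so raising zero_fraction_drop_threshold makes the filter more permissive.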
2. Transcript-type assignment
Each retained interval is classified by the transcript-type model as either mRNA or lnc_RNA. Only the leading token prefix defined by transcript_type_context_length is evaluated.
When reverse-complement averaging is enabled for this stage, forward and reverse-complement predictions are averaged. The final decision is controlled by transcript_type_threshold.
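The decision rule can be written out as a short sketch. It assumes a single-logit classifier whose output is the lnc_RNA score, and it averages probabilities rather than logits; the text does not specify which of the two the pipeline averages. The function name classify_transcript is hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def classify_transcript(forward_score, rc_score=None, threshold=0.5,
                        apply_sigmoid=True):
    """Return (label, lnc_RNA probability) for one candidate interval.

    forward_score / rc_score: single-logit outputs for the forward and
    reverse-complement passes. Pass rc_score=None to skip averaging.
    """
    scores = [forward_score] if rc_score is None else [forward_score, rc_score]
    if apply_sigmoid:
        scores = [sigmoid(s) for s in scores]
    lnc_probability = sum(scores) / len(scores)
    # Values at or above the threshold are labeled lnc_RNA, as described above.
    label = "lnc_RNA" if lnc_probability >= threshold else "mRNA"
    return label, lnc_probability


print(classify_transcript(0.0))
print(classify_transcript(-2.0, 1.0))
```

Raising transcript_type_threshold shifts borderline intervals toward mRNA, which in turn affects whether CDS features are emitted for them.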
3. Segmentation
Each retained interval is segmented into nucleotide-level structural classes by the segmentation model. Exons are derived from exon-versus-intron competition, and CDS segments are derived from CDS-versus-non-CDS competition.
Segmentation is stitched from non-overlapping interval blocks. When tokenizer offset mappings are available, token-level outputs are projected to nucleotide coordinates.
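The block layout and the exon-versus-intron competition can be illustrated with a minimal sketch. It assumes that "competition" means a per-base comparison of the two scores, which is one plausible reading rather than a confirmed detail, and both function names are hypothetical.

```python
def segmentation_blocks(interval_start, interval_end, block_length):
    """Split a half-open interval into consecutive, non-overlapping blocks."""
    return [
        (start, min(start + block_length, interval_end))
        for start in range(interval_start, interval_end, block_length)
    ]


def exon_segments(exon_scores, intron_scores, offset=0):
    """Return half-open (start, end) runs where the exon score wins.

    offset is the genomic coordinate of the first score, so the returned
    segments are in genomic rather than interval-local coordinates.
    """
    segments = []
    run_start = None
    length = min(len(exon_scores), len(intron_scores))
    for i in range(length):
        is_exon = exon_scores[i] > intron_scores[i]
        if is_exon and run_start is None:
            run_start = i
        elif not is_exon and run_start is not None:
            segments.append((offset + run_start, offset + i))
            run_start = None
    if run_start is not None:
        segments.append((offset + run_start, offset + length))
    return segments


print(segmentation_blocks(0, 600_000, 250_000))
print(exon_segments([0.9, 0.8, 0.1, 0.2, 0.7], [0.1, 0.2, 0.9, 0.8, 0.3], offset=100))
```

With the default segmentation_context_length of 250000, only intervals longer than 250 kb are split into more than one block.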
4. GFF generation
The final annotation contains these feature types.
- gene
- mRNA or lnc_RNA
- exon
- CDS, for mRNA transcripts only
No CDS is emitted for lnc_RNA transcripts.
The GFF transcript attributes include lncRNA_probability, mRNA_probability, exon_segmentation_confidence, cds_segmentation_confidence, segmentation_confidence, and color. Exon and CDS features include mean_probability, and intron features are not emitted in the output GFF.
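To work with these attributes downstream, the written file can be read with a few lines of standard-library Python. The sketch below parses one GFF3 line into a dictionary. The example line is invented for illustration and is not real pipeline output; only the attribute names come from the list above, and the function name parse_gff3_line is hypothetical.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, or return None for comments."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, raw_attrs = columns
    attributes = {}
    for field in raw_attrs.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key.strip()] = unquote(value.strip())
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "phase": phase, "attributes": attributes,
    }


# Hypothetical transcript line, for illustration only.
example = ("chr1\tGENATATOR\tmRNA\t1000\t5000\t.\t+\t.\t"
           "ID=tx1;Parent=gene1;mRNA_probability=0.93;"
           "lncRNA_probability=0.07;segmentation_confidence=0.88;color=#0c0c78")
record = parse_gff3_line(example)
print(record["type"], record["attributes"]["segmentation_confidence"])
```

Attribute values are returned as strings, so convert probabilities and confidences with float() before filtering on them.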
Default model repositories
- edge_model_path, AIRI-Institute/genatator-moderngena-base-multispecies-edge-model
- region_model_path, AIRI-Institute/genatator-moderngena-base-multispecies-region-model
- transcript_type_model_path, AIRI-Institute/genatator-caduceus-ps-multispecies-transcript-type
- segmentation_model_path, AIRI-Institute/genatator-caduceus-ps-multispecies-segmentation
Input and output
Input
- Path to a FASTA file.
- The FASTA file may contain one record or multiple records.
Output
- A single Python string, the path to the written .gff file.
- The file contents follow the GFF3 specification.
Dependencies
Create the Conda environment from environment.yml before running the pipeline locally. This project currently requires a CUDA-capable GPU.
conda env create -f environment.yml
conda activate genatator_pipeline
If the simple setup fails, use the robust staged setup. This follows the same strategy as Docker startup.
conda env create -n genatator_pipeline -f docker/conda-core.yml
conda activate genatator_pipeline
pip install torch==2.2.2+cu121 torchvision==0.17.2+cu121 torchaudio==2.2.2+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install causal-conv1d==1.4.0 --no-build-isolation
pip install mamba-ssm==2.2.2 --no-build-isolation
pip install packaging==26.0 ninja==1.13.0 psutil==7.2.2
pip install flash-attn==2.6.3 --no-build-isolation
pip install -r docker/requirements.txt
Output annotation
The written GFF file contains one gene feature for each predicted gene locus and one transcript feature for each predicted transcript. Exons and CDS features are derived from the segmentation stage, and CDS features are emitted only for transcripts classified as mRNA.
The attribute field of each transcript includes transcript-type probabilities and segmentation-confidence values. The lncRNA_probability attribute stores the score produced by the transcript-type model.
Docker deployment
All Docker assets are in docker/.
Build.
docker build -f docker/Dockerfile -t genatator-pipeline:latest .
Run.
docker run --gpus all --rm -p 3000:3000 -v "$(pwd)":/generated genatator-pipeline:latest
API endpoint.
POST /api/genatator-pipeline/upload
- Input, multipart file containing FASTA, or form field dna.
- Output JSON fields, fasta_file, fai_file, gff_file, and archive.
Example.
curl -X POST "http://localhost:3000/api/genatator-pipeline/upload" -F "file=@genome.fasta"