metadata
language: rna
library_name: multimolecule
license: agpl-3.0
pipeline: splice-site
pipeline_tag: other
tags:
  - Biology
  - RNA
  - Splicing
widget:
  - example_title: microRNA 21
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: UAGCUUAUCAGACUGAUGUUGA
  - example_title: microRNA 146a
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: UGAGAACUGAAUUCCAUGGGUU
  - example_title: microRNA 155
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: UUAAUGCUAAUCGUGAUAGGGGUU
  - example_title: RNA component of mitochondrial RNA processing endoribonuclease
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: >-
      GGUUCGUGCUGAAGGCCUGUAUCCUAGGCUACACACUGAGGACUCUGUUCCUCCCCUUUCCGCCUAGGGGAAAGUCCCCGGACCUCGGGCAGAGAGUGCCACGUGCAUACGCACGUAGACAUUCCCCGCUUCCCACUCCAAAGUCCGCCAAGAAGCGUAUCCCGCUGAGCGGCGUGGCGCGGGGGCGUCAUCCGUCAGCUCCCUCUAGUUACGCAGGCAGUGCGUGUCCGCGCACCAACCACACGGGGCUCAUUCUCAGCGCGGCUGUAAAAAAAAA
  - example_title: 7SK small nuclear RNA
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: >-
      GGAUGUGAGGGCGAUCUGGCUGCGACAUCUGUCACCCCAUUGAUCGCCAGGGUUGAUUCGGCUGAUCUGGCUGGCUAGGCGGGUGUCCCCUUCCUCCCUCACCGCUCCAUGUGCGUCCCUCCCGAAGCUGCGCGCUCGGUCGAAGAGGACGACCAUCCCCGAUAGAGGAGGACCGGUCUUCGGUCAAGGGUAUACGAGUAGCUGCGCUCCCCUGCUAGAACCUCCAAACAAGCUCUCAAGGUCCAUUUGUAGGAGAACGUAGGGUAGUCAAGCUUCCAAGACUCCAGACACAUCCAAAUGAGGCGCUGCAUGUGGCAGUCUGCCUUUCUUUU
  - example_title: telomerase RNA component
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: >-
      GGGUUGCGGAGGGUGGGCCUGGGAGGGGUGGUGGCCAUUUUUUGUCUAACCCUAACUGAGAAGGGCGUAGGCGCCGUGCUUUUGCUCCCCGCGCGCUGUUUUUCUCGCUGACUUUCAGCGGGCGGAAAAGCCUCGGCCUGCCGCCUUCCACCGUUCAUUCUAGAGCAAACAAAAAAUGUCAGCUGCUGGCCCGUUCGCCCCUCCCGGGGACCUGCGGCGGGUCGCCUGCCCAGCCCCCGAACCCCGCCUGGAGGCCGCGGUCGGCCCGGGGCUUCUCCGGAGGCACCCACUGCCACCGCGAAGAGUUGGGCUCUGUCAGCCGCGGGUCUCUCGGGGGCGAGGGCGAGGUUCAGGCCUUUCAGGCCGCAGGAAGAGGAACGGAGCGAGUCCCCGCGCGCGGCGCGAUUCCCUGAGCUGUGGGACGUGCACCCAGGACUCGGCUCACACAUGC
  - example_title: vault RNA 2-1
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: >-
      CGGGUCGGAGUUAGCUCAAGCGGUUACCUCCUCAUGCCGGACUUUCUAUCUGUCCAUCUCUGUGCUGGGGUUCGAGACCCGCGGGUGCUUACUGACCCUUUUAUGCAA
  - example_title: brain cytoplasmic RNA 1
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: >-
      GGCCGGGCGCGGUGGCUCACGCCUGUAAUCCCAGCUCUCAGGGAGGCUAAGAGGCGGGAGGAUAGCUUGAGCCCAGGAGUUCGAGACCUGCCUGGGCAAUAUAGCGAGACCCCGUUCUCCAGAAAAAGGAAAAAAAAAAACAAAAGACAAAAAAAAAAUAAGCGUAACUUCCCUCAAAGCAACAACCCCCCCCCCCCUUU
  - example_title: HIV-1 TAR-WT
    pipeline_tag: splice-site
    sequence_type: ncRNA
    task: splice-site
    text: GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACC
  - example_title: prion protein (Kanno blood group)
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: AUGGCGAACCUUGGCUGCUGGAUGCUGGUUCUCUUUGUGGCCACAUGGAGUGACCUGGGCCUCUGC
  - example_title: interleukin 10
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: AUGCACAGCUCAGCACUGCUCUGUUGCCUGGUCCUCCUGACUGGGGUGAGGGCC
  - example_title: Zaire ebolavirus
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: >-
      AAUGUUCAAACACUUUGUGAAGCUCUGUUAGCUGAUGGUCUUGCUAAAGCAUUUCCUAGCAAUAUGAUGGUAGUCACAGAGCGUGAGCAAAAAGAAAGCUUAUUGCAUCAAGCAUCAUGGCACCACACAAGUGAUGAUUUUGGUGAGCAUGCCACAGUUAGAGGGAGUAGCUUUGUAACUGAUUUAGAGAAAUACAAUCUUGCAUUUAGAUAUGAGUUUACAGCACCUUUUAUAGAAUAUUGUAACCGUUGCUAUGGUGUUAAGAAUGUUUUUAAUUGGAUGCAUUAUACAAUCCCACAGUGUUAU
  - example_title: SARS coronavirus
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: >-
      AUGUUUAUUUUCUUAUUAUUUCUUACUCUCACUAGUGGUAGUGACCUUGACCGGUGCACCACUUUUGAUGAUGUUCAAGCUCCUAAUUACACUCAACAUACUUCAUCUAUGAGGGGGGUUUACUAUCCUGAUGAAAUUUUUAGAUCAGACACUCUUUAUUUAACUCAGGAUUUAUUUCUUCCAUUUUAUUCUAAUGUUACAGGGUUUCAUACUAUUAAUCAUACGUUUGACAACCCUGUCAUACCUUUUAAGGAUGGUAUUUAUUUUGCUGCCACAGAGAAAUCAAAUGUUGUCCGUGGUUGGGUUUUUGGUUCUACCAUGAACAACAAGUCACAGUCGGUGAUUAUUAUUAACAAUUCUACUAAUGUUGUUAUACGAGCAUGUAACUUUGAAUUGUGUGACAACCCUUUCUUUGCUGUUUCUAAACCCAUGGGUACACAGACACAUACUAUGAUAUUCGAUAAUGCAUUUAAAUGCACUUUCGAGUACAUAUCU
  - example_title: insulin
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: >-
      AUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCUUGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAUUGUGGAACAAUGCUGUACCAGCAUCUGCUCCCUCUACCAGCUGGAGAACUACUGCAACUAG
  - example_title: cyclin dependent kinase inhibitor 2A
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: >-
      AUGGAGCCGGCGGCGGGGAGCAGCAUGGAGCCUUCGGCUGACUGGCUGGCCACGGCCGCGGCCCGGGGUCGGGUAGAGGAGGUGCGGGCGCUGCUGGAGGCGGGGGCGCUGCCCAACGCACCGAAUAGUUACGGUCGGAGGCCGAUCCAGGUCAUGAUGAUGGGCAGCGCCCGAGUGGCGGAGCUGCUGCUGCUCCACGGCGCGGAGCCCAACUGCGCCGACCCCGCCACUCUCACCCGACCCGUGCACGACGCUGCCCGGGAGGGCUUCCUGGACACGCUGGUGGUGCUGCACCGGGCCGGGGCGCGGCUGGACGUGCGCGAUGCCUGGGGCCGUCUGCCCGUGGACCUGGCUGAGGAGCUGGGCCAUCGCGAUGUCGCACGGUACCUGCGCGCGGCUGCGGGGGGCACCAGAGGCAGUAACCAUGCCCGCAUAGAUGCCGCGGAAGGUCCCUCAGACAUCCCCGAUUGA
  - example_title: human papillomavirus type 16 E6
    pipeline_tag: splice-site
    sequence_type: mRNA
    task: splice-site
    text: >-
      AUGCACCAAAAGAGAACUGCAAUGUUUCAGGACCCACAGGAGCGACCCAGAAAGUUACCACAGUUAUGCACAGAGCUGCAAACAACUAUACAUGAUAUAAUAUUAGAAUGUGUGUACUGCAAGCAACAGUUACUGCGACGUGAGGUAUAUGACUUUGCUUUUCGGGAUUUAUGCAUAGUAUAUAGAGAUGGGAAUCCAUAUGCUGUAUGUGAUAAAUGUUUAAAGUUUUAUUCUAAAAUUAGUGAGUAUAGACAUUAUUGUUAUAGUUUGUAUGGAACAACAUUAGAACAGCAAUACAACAAACCGUUGUGUGAUUUGUUAAUUAGGUGUAUUAACUGUCAAAAGCCACUGUGUCCUGAAGAAAAGCAAAGACAUCUGGACAAAAAGCAAAGAUUCCAUAAUAUAAGGGGUCGGUGGACCGGUCGAUGUAUGUCUUGUUGCAGAUCAUCAAGAACACGUAGAGAAACCCAGCUGUAA
  - example_title: NRAS proto-oncogene
    pipeline_tag: splice-site
    sequence_type: 5' UTR
    task: splice-site
    text: >-
      GGGGCCGGAAGUGCCGCUCCUUGGUGGGGGCUGUUCAUGGCGGUUCCGGGGUCUCCAACAUUUUUCCCGGCUGUGGUCCUAAAUCUGUCCAAAGCAGAGGCAGUGGAGCUUGAGGUUCUUGCUGGUGUGAA
  - example_title: amyloid beta precursor protein
    pipeline_tag: splice-site
    sequence_type: 5' UTR
    task: splice-site
    text: >-
      GUCAGUUUCCUCGGCAGCGGUAGGCGAGAGCACGCGGAGGAGCGUGCGCGGGGGCCCCGGGAGACGGCGGCGGUGGCGGCGCGGGCAGAGCAAGGACGCGGCGGAUCCCACUCGCACAGCAGCGCACUCGGUGCCCCGCGCAGGGUCGCG
  - example_title: RUNX family transcription factor 1
    pipeline_tag: splice-site
    sequence_type: 5' UTR
    task: splice-site
    text: >-
      ACUUCUUUGGGCCUCAUAAACAACCACAGAACCACAAGUUGGGUAGCCUGGCAGUGUCAGAAGUCUGAACCCAGCAUAGUGGUCAGCAGGCAGGACGAAUCACACUGAAUGCAAACCACAGGGUUUCGCAGCGUGGUAAAAGAAAUCAUUGAGUCCCCCGCCUUCAGAAGAGGGUGCAUUUUCAGGAGGAAGCG
  - example_title: fragile X messenger ribonucleoprotein 1
    pipeline_tag: splice-site
    sequence_type: 5' UTR
    task: splice-site
    text: >-
      CUCAGUCAGGCGCUCAGCUCCGUUUCGGUUUCACUUCCGGUGGAGGGCCGCCUCUGAGCGGGCGGCGGGCCGACGGCGAGCGCGGGCGGCGGCGGUGACGGAGGCGCCGCUGCCAGGGGGCGUGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCUGGGCCUCGAGCGCCCGCAGCCCACCUCUCGGGGGCGGGCUCCCGGCGCUAGCAGGGCUGAAGAGAAG
  - example_title: MYC proto-oncogene
    pipeline_tag: splice-site
    sequence_type: 5' UTR
    task: splice-site
    text: >-
      AACUCGCUGUAGUAAUUCCAGCGAGAGGCAGAGGGAGCGAGCGGGCGGCCGGCUAGGGUGGAAGAGCCGGGCGAGCAGAGCUGCGCUGCGGGCGUCCUGGGAAGGGAGAUCCGGAGCGAAUAGGGGGCUUCGCCUCUGGCCCAGCCCUCCCGCUGAUCCCCCAGCCAGCGGUCCGCAACCCUUGCCGCAUCCACGAAACUUUGCCCAUAGCAGCGGGCGGGCACUUUGCACUGGAACUUACAACACCCGAGCAAGGACGCGACUCUCCCGACGCGGGGAGGCUAUUCUGCCCAUUUGGGGACACUUCCCCGCCGCUGCCAGGACCCGCUUCUCUGAAAGGCUCUCCUUGCAGCUGCUUAGACG
  - example_title: activating transcription factor 4
    pipeline_tag: splice-site
    sequence_type: 5' UTR
    task: splice-site
    text: >-
      CAUUUCUACUUUGCCCGCCCACAGAUGUAGUUUUCUCUGCGCGUGUGCGUUUUCCCUCCUCCCCGCCCUCAGGGUCCACGGCCACCAUGGCGUAUUAGGGGCAGCAGUGCCUGCGGCAGCAUUGGCCUUUGCAGCGGCGGCAGCAGCACCAGGCUCUGCAGCGGCAACCCCCAGCGGCUUAAGCCAUGGCGCUUCUCACGGCAUUCAGCAGCAGCGUUGCUGUAACCGACAAAGACACCUUCGAAUUAAGCACAUUCCUCGAUUCCAGCAAAGCACCGCAAC
  - example_title: Human GPI protein p137
    pipeline_tag: splice-site
    sequence_type: 3' UTR
    task: splice-site
    text: >-
      UUUUUAAAAGGAAAAGAUACCAAAUGCCUGCUGCUACCACCCUUUUCAAUUGCUAUGUUUUGAAAGGCACCAGUAUGUGUUUUAGAUUGAUUUAAAUGUUUCAUUUAAAUCACGGACAGUAGUUUCAGUUCUGAUGGUAUAAGCAAAACAAAUAAAACGUUUAUAAAAGUUGUAUCUUGAAACACUGGUGUUCAACAGCUAGCAGCUUAUGUGAUUCACCCCAUGCCACGUUAGUGUCACAAAUUUUAUGGUUUAUCUCCAGCAACAUUUCUCUAGUACUUGCACUUAUUAUCUGAAUUC
  - example_title: nucleophosmin 1
    pipeline_tag: splice-site
    sequence_type: 3' UTR
    task: splice-site
    text: >-
      GAAAAUAGUUUAAACAAUUUGUUAAAAAAUUUUCCGUCUUAUUUCAUUUCUGUAACAGUUGAUAUCUGGCUGUCCUUUUUAUAAUGCAGAGUGAGAACUUUCCCUACCGUGUUUGAUAAAUGUUGUCCAGGUUCUAUUGCCAAGAAUGUGUUGUCCAAAAUGCCUGUUUAGUUUUUAAAGAUGGAACUCCACCCUUUGCUUGGUUUUAAGUAUGUAUGGAAUGUUAUGAUAGGACAUAGUAGUAGCGGUGGUCAGACAUGGAAAUGGUGGGGAGACAAAAAUAUACAUGUGAAAUAAAACUCAGUAUUUUAAUAAAGUAGCACGGUUUCUAUUGA
  - example_title: superoxide dismutase 1
    pipeline_tag: splice-site
    sequence_type: 3' UTR
    task: splice-site
    text: >-
      ACAUUCCCUUGGAUGUAGUCUGAGGCCCCUUAACUCAUCUGUUAUCCUGCUAGCUGUAGAAAUGUAUCCUGAUAAACAUUAAACACUGUAAUCUUAAAAGUGUAAUUGUGUGACUUUUUCAGAGUUGCUUUAAAGUACCUGUAGUGAGAAACUGAUUUAUGAUCACUUGGAAGAUUUGUAUAGUUUUAUAAAACUCAGUUAAAAUGUCUGUUUCAAUGACCUGUAUUUUGCCAGACUUAAAUCACAGAUGGGUAUUAAACUUGUCAGAAUUUCUUUGUCAUUCAAGCCUGUGAAUAAAAACCCUGUAUGGCACUUAUUAUGAGGCUAUUAAAAGAAUCCAAAUUCAAACUAAA
  - example_title: hemoglobin subunit alpha 2
    pipeline_tag: splice-site
    sequence_type: 3' UTR
    task: splice-site
    text: >-
      CUGGAGCCUCGGUAGCCGUUCCUCCUGCCCGCUGGGCCUCCCAACGGGCCCUCCUCCCCUCCUUGCACCGGCCCUUCCUGGUCUUUGAAUAAAGUCUGAGUGGGCAGCA
  - example_title: BRAF proto-oncogene
    pipeline_tag: splice-site
    sequence_type: 3' UTR
    task: splice-site
    text: >-
      AACAAAUGAGUGAGAGAGUUCAGGAGAGUAGCAACAAAAGGAAAAUAAAUGAACAUAUGUUUGCUUAUAUGUUAAAUUGAAUAAAAUACUCUCUUUUUUUUUAAGGUGAACCAAAGAACACUUGUGUGGUUAAAGACUAGAUAUAAUUUUUCCCCAAACUAAAAUUUAUACUUAACAUUGGAUUUUUAACAUCCAAGGGUUAAAAUACAUAGACAUUGCUAAAAAUUGGCAGAGCCUCUUCUAGAGGCUUUACUUUCUGUUCCGGGUUUGUAUCAUUCACUUGGUUAUUUUAAGUAGUAAACUUCAGUUUCUCAUGCAACUUUUGUUGCCAGCUAUCACAUGUCCACUAGGGACUCCAGAAGAAGACCCUACCUAUGCCUGUGUUUGCAGGUGAGAAGUUGGCAGUCGGUUAGCCUGGG
  - example_title: H3 clustered histone 1
    pipeline_tag: splice-site
    sequence_type: 3' UTR
    task: splice-site
    text: UUACUGUGGUCUCUCUGACGGUCCAAGCAAAGGCUCUUUUCAGAGCCACCACCUUUUC

MaxEntScan

Maximum-entropy model for scoring short sequence motifs at RNA splice sites.

Disclaimer

This is an UNOFFICIAL implementation of "Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals" by Gene Yeo and Christopher B. Burge.

The OFFICIAL distribution of MaxEntScan is at the Burge Lab MaxEntScan page.

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released MaxEntScan did not write a model card for it; this model card was written by the MultiMolecule team.

Model Details

MaxEntScan is a maximum-entropy model for the splice donor (5') and splice acceptor (3') sequence motifs. It is not a neural network and has no trainable weights. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences.

Model Specification

MaxEntScan has no trainable parameters. It performs fixed table lookups, so the profiler finds no learnable weights or floating-point operations to attribute to a module, and the counts below are reported as zero.

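| Mode   | Window (nt) | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ------ | ----------- | ------------------ | --------- | -------- |
| score5 | 9           | 0.00               | 0.00      | 0.00     |
| score3 | 23          | 0.00               | 0.00      | 0.00     |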

Links

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

5' Splice-Site Scoring

>>> import torch
>>> from multimolecule import RnaTokenizer, MaxEntScanModel, MaxEntScanConfig

>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/maxentscan-score5")
>>> # MaxEntScan scores a raw fixed-length window; do not add special tokens.
>>> input_ids = tokenizer("CAGGUAAGU", add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> output = model(input_ids)
>>> output.logits.shape
torch.Size([1, 1])

3' Splice-Site Scoring

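>>> import torch
>>> from multimolecule import MaxEntScanModel, MaxEntScanConfig

>>> config = MaxEntScanConfig(mode="score3")
>>> model = MaxEntScanModel(config)
>>> # Random token ids stand in for a real 23-nt window; this only demonstrates the output shape.
>>> output = model(torch.randint(4, (1, config.window)))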
>>> output.logits.shape
torch.Size([1, 1])

Interface

  • Input length: 9 nt fixed window for score5; 23 nt fixed window for score3
  • Alphabet: ACGU only; unknown / N tokens are clamped onto A before table lookup
  • Special tokens: do not add (add_special_tokens=False)
  • inputs_embeds: not supported; the model scores discrete token windows only
  • Output: single scalar splice-site log-odds score per window
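The constraints listed above can be checked before a sequence reaches the tokenizer. The following is a minimal sketch in plain Python. WINDOW_SIZES, normalize_window, and sliding_windows are hypothetical helpers written for illustration and are not part of the multimolecule API. The sketch assumes that DNA input should be converted by replacing T with U, and it rejects non-ACGU characters instead of relying on the model to clamp them onto A.

```python
# Hypothetical helpers for illustration; not part of the multimolecule API.
WINDOW_SIZES = {"score5": 9, "score3": 23}  # window lengths from the Interface list


def normalize_window(sequence: str, mode: str = "score5") -> str:
    """Return an upper-case RNA window of the length required by `mode`."""
    # Assumption: DNA input is acceptable once T is rewritten as U.
    window = sequence.strip().upper().replace("T", "U")
    expected = WINDOW_SIZES[mode]
    if len(window) != expected:
        raise ValueError(f"{mode} expects {expected} nt, got {len(window)}")
    unknown = set(window) - set("ACGU")
    if unknown:
        # The model would clamp these onto A; failing early makes that visible.
        raise ValueError(f"non-ACGU characters: {sorted(unknown)}")
    return window


def sliding_windows(sequence: str, mode: str = "score5"):
    """Yield (offset, window) for every full-length window in a longer sequence."""
    size = WINDOW_SIZES[mode]
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


print(normalize_window("caggtaagt"))
windows = list(sliding_windows("AUGCAGGUAAGUCC"))
print(len(windows), windows[3])
```

Each window returned by sliding_windows can then be tokenized with add_special_tokens=False and scored individually, since the model accepts only fixed-length windows.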

Training Details

MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.

Scoring Modes

  • score5: scores 5' (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score combines the probability read from the published me2x5 maximum-entropy table with the consensus and background ratios.
  • score3: scores 3' (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
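The score5 arithmetic can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation shipped in this repository. It assumes the layout of the original score5.pl: the two near-invariant intronic positions (window indices 3 and 4, normally GU) contribute consensus-to-background ratios, the remaining seven positions are looked up in the me2x5 table, and the logarithm is base 2. All numeric values are toy placeholders rather than the published constants, and toy_me2x5 is a hypothetical stand-in for the real table.

```python
import math

# Toy placeholder values for illustration only; the real constants are defined
# in score5.pl and the real probabilities come from the me2x5 table.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
CONSENSUS_POS3 = {"A": 0.01, "C": 0.01, "G": 0.97, "U": 0.01}
CONSENSUS_POS4 = {"A": 0.01, "C": 0.01, "G": 0.01, "U": 0.97}


def toy_me2x5(rest: str) -> float:
    """Hypothetical stand-in for the me2x5 lookup of the 7 non-consensus bases."""
    return 0.5 if rest == "CAGAAGU" else 1e-4


def score5_sketch(window: str) -> float:
    """Log2-odds of a 9-nt donor window (3 exonic + 6 intronic nucleotides)."""
    if len(window) != 9:
        raise ValueError("score5 expects a 9-nt window")
    # Assumption: indices 3 and 4 hold the near-invariant GU dinucleotide.
    ratio = (CONSENSUS_POS3[window[3]] * CONSENSUS_POS4[window[4]]) / (
        BACKGROUND[window[3]] * BACKGROUND[window[4]]
    )
    rest = window[:3] + window[5:]  # the other seven positions
    return math.log2(ratio * toy_me2x5(rest))


strong = score5_sketch("CAGGUAAGU")  # canonical GU at indices 3-4
weak = score5_sketch("CAGGCAAGU")    # GC instead of GU
print(strong > weak)
```

With the toy values, breaking the GU dinucleotide lowers the score because the consensus-to-background ratio drops below 1. Real scores require the published table and constants.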

Training Data

  • Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
  • Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.

The model parameters are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: me2x5 for the 5' scorer and the nine maximum-entropy decomposition matrices me2x3acc1..9 for the 3' scorer. The consensus and background ratios are fixed constants from the original score5.pl and score3.pl programs.
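The role of the nine me2x3acc matrices in the 3' scorer can be sketched as follows. The slice offsets are an assumption based on the original score3.pl and should be verified against that script. toy_table is a hypothetical stand-in that returns a uniform probability instead of a published value. The sketch covers only the submodel product over the 21 positions that remain after removing the assumed AG dinucleotide at window indices 18 and 19; the consensus and background factor for that dinucleotide is omitted.

```python
import math

# (start, length) slices of the 21-nt remainder of the 23-mer.
# Assumption: these offsets follow the original score3.pl.
NUMERATOR_SLICES = [(0, 7), (7, 7), (14, 7), (4, 7), (11, 7)]
DENOMINATOR_SLICES = [(4, 3), (7, 4), (11, 3), (14, 4)]


def toy_table(index: int, kmer: str) -> float:
    """Hypothetical stand-in for a lookup in matrix me2x3acc{index + 1}."""
    return 0.25 ** len(kmer)  # uniform probability, for illustration only


def score3_core_sketch(window: str) -> float:
    """Log2 of the ratio between numerator and denominator submodel products."""
    if len(window) != 23:
        raise ValueError("score3 expects a 23-nt window")
    rest = window[:18] + window[20:]  # drop the assumed AG at indices 18-19
    slices = NUMERATOR_SLICES + DENOMINATOR_SLICES
    probs = [toy_table(i, rest[s:s + n]) for i, (s, n) in enumerate(slices)]
    numerator = math.prod(probs[:5])    # five 7-mer submodels
    denominator = math.prod(probs[5:])  # four submodels over the overlaps
    return math.log2(numerator / denominator)


print(score3_core_sketch("U" * 18 + "AG" + "UUU"))
```

The denominator slices are exactly the regions where two numerator slices overlap, so each of the 21 positions is counted once in the final ratio: the numerators span 35 positions and the denominators remove 14.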

Training Procedure

Pre-training

MaxEntScan does not use neural-network pre-training. Its maximum-entropy probability tables are reused from the original MaxEntScan distribution.

Citation

@article{yeo2004maximum,
  author    = {Yeo, Gene and Burge, Christopher B.},
  title     = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals},
  journal   = {Journal of Computational Biology},
  volume    = {11},
  number    = {2-3},
  pages     = {377--394},
  year      = {2004},
  publisher = {Mary Ann Liebert, Inc.},
  doi       = {10.1089/1066527041410418}
}

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the MaxEntScan paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later