| --- |
| language: dna |
| tags: |
| - Biology |
| - DNA |
| - RNA |
| - Splicing |
| license: agpl-3.0 |
| library_name: multimolecule |
| --- |
| |
| # MaxEntScan |
|
|
| Maximum-entropy model for scoring short sequence motifs at RNA splice sites. |
|
|
| ## Disclaimer |
|
|
| This is an UNOFFICIAL implementation of [Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals](https://doi.org/10.1089/1066527041410418) by Gene Yeo and Christopher B. Burge. |
|
|
| The OFFICIAL distribution of MaxEntScan is at [the Burge Lab MaxEntScan page](http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html). |
|
|
| > [!TIP] |
| > The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation. |
|
|
| **The team that released MaxEntScan did not write a model card for it, so this model card was written by the MultiMolecule team.** |
|
|
| ## Model Details |
|
|
| MaxEntScan is a maximum-entropy model for the splice donor (5') and splice acceptor (3') sequence motifs. It is **not a neural network** and has **no trainable weights**. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences. MultiMolecule registers these tables as persistent buffers on the model so they serialize with saved checkpoints. |
|
|
| Two scorers are provided: |
|
|
| - `score5`: scores 5' (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score combines a lookup in the published `me2x5` maximum-entropy probability table with the fixed consensus-to-background ratios. |
| - `score3`: scores 3' (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels, following the published decomposition; the score is the log-ratio of the product of the numerator submodels to the product of the denominator submodels. |
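
The structure of the two scorers can be sketched in plain Python. This is a minimal illustration, not the MultiMolecule implementation: the function names, the `A=0, C=1, G=2, T=3` base-4 ordering, the log base 2, and the slice boundaries of the nine `score3` submodels are assumptions based on a reading of the original `score5.pl`/`score3.pl` and should be checked against those scripts. The tables and consensus frequencies below are toy placeholders, not the published values.

```python
import math

BASES = "ACGT"  # assumed index order: A=0, C=1, G=2, T=3


def kmer_index(kmer):
    """Return the base-4 index of a k-mer made of A, C, G, T."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return index


def consensus_ratio(dinucleotide, cons_first, cons_second, background):
    """Odds of the two consensus bases against the background frequencies."""
    first, second = dinucleotide
    return (cons_first[first] * cons_second[second]) / (
        background[first] * background[second]
    )


def score5_sketch(window, table, cons_first, cons_second, background):
    """Score a 9-nt donor window (3 exonic + 6 intronic nucleotides)."""
    if len(window) != 9:
        raise ValueError("score5 expects a 9-nucleotide window")
    rest = window[:3] + window[5:]  # the 7 positions outside the consensus dinucleotide
    odds = consensus_ratio(window[3:5], cons_first, cons_second, background)
    return math.log2(odds * table[kmer_index(rest)])


# Slices into the 21 non-consensus positions (assumed from the original script).
NUMERATOR_SLICES = [(0, 7), (7, 14), (14, 21), (4, 11), (11, 18)]
DENOMINATOR_SLICES = [(4, 7), (7, 11), (11, 14), (14, 18)]


def score3_sketch(window, tables, cons_first, cons_second, background):
    """Score a 23-nt acceptor window from nine submodel tables."""
    if len(window) != 23:
        raise ValueError("score3 expects a 23-nucleotide window")
    rest = window[:18] + window[20:]  # drop the consensus dinucleotide
    ratio = consensus_ratio(window[18:20], cons_first, cons_second, background)
    slices = NUMERATOR_SLICES + DENOMINATOR_SLICES
    for position, ((start, stop), table) in enumerate(zip(slices, tables)):
        value = table[kmer_index(rest[start:stop])]
        if position < len(NUMERATOR_SLICES):
            ratio *= value  # numerator submodels
        else:
            ratio /= value  # denominator submodels
    return math.log2(ratio)


# Toy placeholders: uniform tables and made-up consensus frequencies.
UNIFORM = {base: 0.25 for base in BASES}


def peaked(favoured):
    """Toy frequency table that strongly favours one base."""
    return {base: (0.97 if base == favoured else 0.01) for base in BASES}


toy_table5 = [4.0 ** -7] * 4 ** 7
toy_tables3 = [
    [4.0 ** -(stop - start)] * 4 ** (stop - start)
    for start, stop in NUMERATOR_SLICES + DENOMINATOR_SLICES
]

donor = score5_sketch("CAGGTAAGT", toy_table5, peaked("G"), peaked("T"), UNIFORM)
acceptor = score3_sketch(
    "T" * 18 + "AG" + "GTA", toy_tables3, peaked("A"), peaked("G"), UNIFORM
)
print(donor, acceptor)
```

With real tables in place of the toy ones, the same two functions would reproduce the lookup-and-combine pattern described above; dividing by the denominator submodels corrects for positions that the overlapping numerator submodels count twice.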
|
|
| ### Model Specification |
|
|
| MaxEntScan is a parameter-free maximum-entropy model. It performs fixed table lookups and contains no learnable weights or floating-point arithmetic that the profiler can attribute to a module, so every count in the table is reported as zero. |
|
|
| | Mode | Window | Num Parameters (M) | FLOPs (G) | MACs (G) | |
| | ------ | ------ | ------------------ | --------- | -------- | |
| | score5 | 9 | 0.00 | 0.00 | 0.00 | |
| | score3 | 23 | 0.00 | 0.00 | 0.00 | |
|
|
| ### Links |
|
|
| - **Code**: [multimolecule.maxentscan](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/maxentscan) |
| - **Paper**: [Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals](https://doi.org/10.1089/1066527041410418) |
| - **Developed by**: Gene Yeo, Christopher B. Burge |
| - **Original Distribution**: [Burge Lab MaxEntScan](http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html) |
|
|
| ## Usage |
|
|
| The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: |
|
|
| ```bash |
| pip install multimolecule |
| ``` |
|
|
| ### Direct Use |
|
|
| #### 5' Splice-Site Scoring |
|
|
| ```python |
| >>> import torch |
| >>> from multimolecule import DnaTokenizer, MaxEntScanModel, MaxEntScanConfig |
| |
| >>> config = MaxEntScanConfig() |
| >>> model = MaxEntScanModel(config) |
| >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/maxentscan") |
| >>> # MaxEntScan scores a raw fixed-length window; do not add special tokens. |
| >>> input_ids = tokenizer("CAGGTAAGT", add_special_tokens=False, return_tensors="pt")["input_ids"] |
| >>> output = model(input_ids) |
| >>> output.logits.shape |
| torch.Size([1, 1]) |
| ``` |
|
|
| #### 3' Splice-Site Scoring |
|
|
| ```python |
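| >>> # Reuses the imports from the 5' example above. |
| >>> config = MaxEntScanConfig(mode="score3") |
| >>> model = MaxEntScanModel(config) |
| >>> # Random token ids stand in for a real 23-nucleotide acceptor window. |
| >>> output = model(torch.randint(4, (1, config.window))) |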
| >>> output.logits.shape |
| torch.Size([1, 1]) |
| ``` |
|
|
| ## Training Details |
|
|
| MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim. |
|
|
| ### Training Data |
|
|
| - Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004). |
| - Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window. |
|
|
| ## Conversion And Provenance |
|
|
| - MaxEntScan has no upstream PyTorch checkpoint. Its "parameters" are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: `me2x5` for the 5' scorer and the nine decomposition matrices `me2x3acc1..9` for the 3' scorer. The consensus and background ratios are fixed constants taken from the original `score5.pl`/`score3.pl`. |
| - The original Burge-lab tables are bundled verbatim in this package as `score5_me2x5.txt` and `score3_me2x3acc.txt`, in their native one-float-per-line order, which coincides with base-4 indexing and with the published `splice5sequences` enumeration. They were obtained from the original MaxEntScan release as redistributed under the MIT license by the [`maxentpy`](https://github.com/kepbod/maxentpy) package. They are also mirrored by [Kipoi](https://github.com/kipoi/models/tree/master/MaxEntScan) (`MaxEntScan/5prime`, `MaxEntScan/3prime`); Kipoi is referenced only for provenance and is not a runtime dependency. |
| - `convert_checkpoint.py` builds the persistent score-table buffers directly from those bundled plain-text tables. |
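
As a rough illustration of the one-float-per-line format, the sketch below parses such a table from text and looks up an entry by base-4 index. It is not the code in `convert_checkpoint.py`: `parse_table` and `lookup_index` are hypothetical helpers, the `A=0, C=1, G=2, T=3` ordering is an assumption, and the toy text stands in for a real table file.

```python
def parse_table(text, kmer_length):
    """Parse a one-float-per-line table covering every k-mer of the given length."""
    values = [float(line) for line in text.splitlines() if line.strip()]
    expected = 4 ** kmer_length
    if len(values) != expected:
        raise ValueError(f"expected {expected} entries, found {len(values)}")
    return values


def lookup_index(kmer):
    """Base-4 index of a k-mer, assuming A=0, C=1, G=2, T=3."""
    return int(kmer.translate(str.maketrans("ACGT", "0123")), 4)


# Toy stand-in for a table file: 16 values, one per dinucleotide.
toy_text = "\n".join(str(i / 100) for i in range(16))
toy_table = parse_table(toy_text, kmer_length=2)

print(toy_table[lookup_index("AC")])  # entry for the dinucleotide AC
print(toy_table[lookup_index("TT")])  # last entry in the table
```

In a PyTorch module, a list parsed this way could be converted to a tensor and stored with `register_buffer`, which is what makes it persist in saved checkpoints.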
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{yeo2004maximum, |
| author = {Yeo, Gene and Burge, Christopher B.}, |
| title = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals}, |
| journal = {Journal of Computational Biology}, |
| volume = {11}, |
| number = {2-3}, |
| pages = {377--394}, |
| year = {2004}, |
| publisher = {Mary Ann Liebert, Inc.}, |
| doi = {10.1089/1066527041410418} |
| } |
| ``` |
|
|
| > [!NOTE] |
| > The artifacts distributed in this repository are part of the MultiMolecule project. |
| > If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows: |
|
|
| ```bibtex |
| @software{chen_2024_12638419, |
| author = {Chen, Zhiyuan and Zhu, Sophia Y.}, |
| title = {MultiMolecule}, |
| doi = {10.5281/zenodo.12638419}, |
| publisher = {Zenodo}, |
| url = {https://doi.org/10.5281/zenodo.12638419}, |
| year = 2024, |
| month = may, |
| day = 4 |
| } |
| ``` |
|
|
| ## Known Limitations |
|
|
| - MaxEntScan models only the four canonical nucleotides `ACGT`. Unknown or `N` tokens are clamped to `A` before table lookup. |
| - Inputs must be a single fixed-length window matching the configured mode (9 for `score5`, 23 for `score3`). |
| - The model does not accept `inputs_embeds`; it scores discrete token windows only. |
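
The first two limitations can be handled before calling the model. The sketch below is a hypothetical pre-processing helper, not part of MultiMolecule: the names are illustrative, and the ids `0..3` are plain nucleotide indices rather than the tokenizer's vocabulary ids. It checks the window length for the chosen mode, mirrors the clamp of non-`ACGT` characters to `A`, and cuts a longer sequence into fixed-length windows.

```python
WINDOW_SIZES = {"score5": 9, "score3": 23}
NUCLEOTIDE_IDS = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical ids, not the tokenizer vocabulary


def encode_window(sequence, mode):
    """Validate the window length and map bases to ids, clamping unknowns to A."""
    expected = WINDOW_SIZES[mode]
    if len(sequence) != expected:
        raise ValueError(f"{mode} expects {expected} nucleotides, got {len(sequence)}")
    return [NUCLEOTIDE_IDS.get(base, NUCLEOTIDE_IDS["A"]) for base in sequence.upper()]


def sliding_windows(sequence, mode):
    """Yield (start, window) pairs covering a longer sequence one position at a time."""
    size = WINDOW_SIZES[mode]
    for start in range(len(sequence) - size + 1):
        yield start, sequence[start:start + size]


print(encode_window("CAGGTNAGT", "score5"))  # the N is clamped to the id of A
for start, window in sliding_windows("ACCAGGTAAGTA", "score5"):
    print(start, window)
```

Because the clamp silently changes the score, a caller may prefer to skip or flag windows that contain `N` instead of scoring them.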
|
|
| ## Contact |
|
|
| Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. |
|
|
| Please contact the authors of the [MaxEntScan paper](https://doi.org/10.1089/1066527041410418) for questions or comments on the paper/model. |
|
|
| ## License |
|
|
| This model implementation is licensed under the [GNU Affero General Public License](license.md). |
|
|
| For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). |
|
|
| ```spdx |
| SPDX-License-Identifier: AGPL-3.0-or-later |
| ``` |
|
|