---
language: dna
tags:
  - Biology
  - DNA
  - RNA
  - Splicing
license: agpl-3.0
library_name: multimolecule
---

# MaxEntScan

Maximum-entropy model for scoring short sequence motifs at RNA splice sites.

## Disclaimer

This is an UNOFFICIAL implementation of [Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals](https://doi.org/10.1089/1066527041410418) by Gene Yeo and Christopher B. Burge.

The OFFICIAL distribution of MaxEntScan is at [the Burge Lab MaxEntScan page](http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing MaxEntScan did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

MaxEntScan is a maximum-entropy model for the splice donor (5') and splice acceptor (3') sequence motifs. It is **not a neural network** and has **no trainable weights**. The model parameters are fixed maximum-entropy probability tables estimated by Yeo & Burge (2004) from human splice-site sequences. MultiMolecule registers these tables as persistent buffers on the model so they serialize with saved checkpoints.

Two scorers are provided:

- `score5`: scores 5' (donor) splice sites over a 9-nucleotide window (3 exonic + 6 intronic nucleotides). The score is read from the published `me2x5` maximum-entropy probability table combined with the consensus background ratios.
- `score3`: scores 3' (acceptor) splice sites over a 23-nucleotide window. The 23-mer is decomposed into nine overlapping maximum-entropy submodels following the published maximum-entropy decomposition; the score is the log-ratio of the numerator and denominator submodel products.
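The log-ratio form described for `score3` can be illustrated with a minimal sketch. The probability values below are placeholders invented for illustration, not entries from the published tables. The split into five numerator and four denominator terms and the base-2 logarithm are assumptions based on the original Perl scripts, and `log_ratio_score` is a hypothetical helper that is not part of `multimolecule`.

```python
import math
from typing import Sequence


def log_ratio_score(numerators: Sequence[float], denominators: Sequence[float]) -> float:
    """Return log2(prod(numerators) / prod(denominators)).

    Summing logarithms avoids underflow when many small probabilities are multiplied.
    """
    if any(p <= 0 for p in (*numerators, *denominators)):
        raise ValueError("all probabilities must be strictly positive")
    return sum(math.log2(p) for p in numerators) - sum(math.log2(p) for p in denominators)


# Placeholder values only; these are NOT taken from the published tables.
numerators = [0.5, 0.25, 0.5, 0.25, 0.5]
denominators = [0.125, 0.25, 0.5, 0.25]
score = log_ratio_score(numerators, denominators)
```

A positive score means the numerator product exceeds the denominator product; in the real model each term would come from a table lookup on a sub-window of the 23-mer.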

### Model Specification

MaxEntScan is a parameter-free maximum-entropy model. It performs fixed table lookups and has no learnable weights, and the profiler finds no floating-point arithmetic that it can attribute to a module, so every count in the table below is zero.

| Mode   | Window | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ------ | ------ | ------------------ | --------- | -------- |
| score5 | 9      | 0.00               | 0.00      | 0.00     |
| score3 | 23     | 0.00               | 0.00      | 0.00     |

### Links

- **Code**: [multimolecule.maxentscan](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/maxentscan)
- **Paper**: [Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals](https://doi.org/10.1089/1066527041410418)
- **Developed by**: Gene Yeo, Christopher B. Burge
- **Original Distribution**: [Burge Lab MaxEntScan](http://hollywood.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### 5' Splice-Site Scoring

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, MaxEntScanModel, MaxEntScanConfig

>>> config = MaxEntScanConfig()
>>> model = MaxEntScanModel(config)
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/maxentscan")
>>> # MaxEntScan scores a raw fixed-length window; do not add special tokens.
>>> input_ids = tokenizer("CAGGTAAGT", add_special_tokens=False, return_tensors="pt")["input_ids"]
>>> output = model(input_ids)
>>> output.logits.shape
torch.Size([1, 1])
```
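When the candidate donor site sits inside a longer sequence, the 9-nucleotide window has to be cut out before scoring. The sketch below shows one way to do that, assuming the caller knows the 0-based index of the first intronic base. `donor_window` is a hypothetical helper for illustration and is not part of `multimolecule`.

```python
def donor_window(sequence: str, intron_start: int) -> str:
    """Return the 9-nt donor window: 3 exonic bases followed by 6 intronic bases.

    `intron_start` is the 0-based index of the first intronic base.
    """
    start, end = intron_start - 3, intron_start + 6
    if start < 0 or end > len(sequence):
        raise ValueError("sequence is too short to cover the 9-nt window")
    return sequence[start:end].upper()


# Toy sequence; the intron begins at index 5 (the "GT" dinucleotide).
window = donor_window("ttcagGTAAGTcc", 5)
```

The resulting string can then be passed to the tokenizer with `add_special_tokens=False`, as in the example above.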

#### 3' Splice-Site Scoring

```python
>>> import torch
>>> from multimolecule import MaxEntScanConfig, MaxEntScanModel

>>> config = MaxEntScanConfig(mode="score3")
>>> model = MaxEntScanModel(config)
>>> output = model(torch.randint(4, (1, config.window)))
>>> output.logits.shape
torch.Size([1, 1])
```

## Training Details

MaxEntScan is not trained. Its maximum-entropy probability tables were estimated once by Yeo & Burge (2004) from a set of human constitutive splice-site sequences using an iterative maximum-entropy procedure. The published tables are reused verbatim.

### Training Data

- Source: human RefSeq splice-site sequences as described in Yeo & Burge (2004).
- Maximum-entropy constraints: pairwise and higher-order positional dependencies within the splice-site window.

## Conversion And Provenance

- MaxEntScan has no upstream PyTorch checkpoint. The "parameters" are the fixed maximum-entropy probability tables distributed as plain-text files with the original Yeo & Burge (2004) MaxEntScan tool: `me2x5` for the 5' scorer and the nine maximum-entropy decomposition matrices `me2x3acc1..9` for the 3' scorer. The consensus and background ratios are fixed constants from the original `score5.pl`/`score3.pl`.
- The original Burge-lab tables are bundled verbatim in this package as `score5_me2x5.txt` and `score3_me2x3acc.txt` (in their native one-float-per-line order, which corresponds to base-4 indexing, i.e. the published `splice5sequences` enumeration). They were obtained from the original MaxEntScan release as redistributed under the MIT license by the [`maxentpy`](https://github.com/kepbod/maxentpy) package, and are also mirrored by [Kipoi](https://github.com/kipoi/models/tree/master/MaxEntScan) (`MaxEntScan/5prime`, `MaxEntScan/3prime`); Kipoi is referenced only for provenance and is not a runtime dependency.
- `convert_checkpoint.py` builds the persistent score-table buffers directly from those bundled plain-text tables.
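The base-4 ordering of the table rows can be sketched as follows. This assumes the digit mapping A, C, G, T to 0, 1, 2, 3 with the leftmost base as the most significant digit, which this card does not state explicitly; `base4_index` is a hypothetical helper, not the function used by `convert_checkpoint.py`.

```python
# Assumed digit mapping; not confirmed by this model card.
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def base4_index(kmer: str) -> int:
    """Map a k-mer to its row index in 0 .. 4**k - 1, leftmost base most significant."""
    index = 0
    for base in kmer.upper():
        index = index * 4 + BASE_TO_DIGIT[base]
    return index


first_row = base4_index("AAAAAAA")  # all-A k-mer maps to the first row
last_row = base4_index("TTTTTTT")   # all-T k-mer maps to the last row
```

Under this assumption, reading a one-float-per-line file into a list and indexing it with `base4_index(kmer)` would retrieve the entry for that k-mer.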

## Citation

```bibtex
@article{yeo2004maximum,
  author    = {Yeo, Gene and Burge, Christopher B.},
  title     = {Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals},
  journal   = {Journal of Computational Biology},
  volume    = {11},
  number    = {2-3},
  pages     = {377--394},
  year      = {2004},
  publisher = {Mary Ann Liebert, Inc.},
  doi       = {10.1089/1066527041410418}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Known Limitations

- MaxEntScan only models the four canonical nucleotides `ACGT`. Unknown / `N` tokens are clamped onto `A` before table lookup.
- Inputs must be a single fixed-length window matching the configured mode (9 for `score5`, 23 for `score3`).
- The model does not accept `inputs_embeds`; it scores discrete token windows only.
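The first two limitations can be mirrored in a small pre-processing step before tokenization. The sketch below works on plain strings rather than token ids and only imitates the behavior described above; `prepare_window` and `WINDOW_LENGTHS` are hypothetical names and not part of `multimolecule`.

```python
# Window lengths per mode, as stated in this model card.
WINDOW_LENGTHS = {"score5": 9, "score3": 23}


def prepare_window(sequence: str, mode: str = "score5") -> str:
    """Check the window length and replace any non-ACGT character with 'A'."""
    expected = WINDOW_LENGTHS[mode]
    if len(sequence) != expected:
        raise ValueError(f"{mode} expects {expected} nucleotides, got {len(sequence)}")
    return "".join(base if base in "ACGT" else "A" for base in sequence.upper())


cleaned = prepare_window("CAGGTNAGT", mode="score5")
```

Because the clamp silently turns `N` into `A`, it may be worth counting ambiguous bases beforehand and deciding whether a window with many of them should be scored at all.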

## Contact

Please use the [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) GitHub issues for any questions or comments on the model card.

Please contact the authors of the [MaxEntScan paper](https://doi.org/10.1089/1066527041410418) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```