---
license: apache-2.0
tags:
- generated_from_trainer
pipeline_tag: feature-extraction
model-index:
- name: RNAMamba-14M
  results: []
---

# RNAMamba-14M

RNAMamba-14M is a small Mamba-based model trained from scratch on 1.96 million sequences (1.56 billion bases) extracted from the active-sequences FASTA file of RNAcentral release 24 (March 2024).

It is intended as a sequence embedding model for downstream processing of ncRNA sequences.
It was trained with a masked language modelling (MLM) objective and a context size of 8,192 nucleotides. This checkpoint has the MLM head removed, so it should be close to ready for use as an embedding model.
The [dataset](https://huggingface.co/datasets/afg1/rnacentral_subset) contains sequences ranging in length from 10 to 8,192 nucleotides, so the model should handle sequences in that range reasonably well.
The model is deliberately small, with only 14.1 million parameters (8 hidden layers, hidden dimension 512, intermediate size 1024), so that it runs quickly without a GPU. We may train a larger model if these embeddings turn out not to be good enough.
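
With the MLM head removed, the model presumably returns one hidden vector per token, which still has to be reduced to a single vector per sequence. One common option is masked mean pooling, sketched below in plain Python. The pooling strategy and the `mean_pool` helper are assumptions for illustration, not something this model card prescribes, and the toy vectors stand in for real model output.

```python
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average per-token vectors, ignoring positions where the mask is 0."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have the same length")
    # Keep only the vectors for real (non-padding) tokens.
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy stand-in for model output: three token vectors of width 2
# (the real hidden dimension is 512); the last position is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

sequence_embedding = mean_pool(toy_hidden, toy_mask)
print(sequence_embedding)  # [2.0, 3.0]
```

In practice the same reduction would be applied to the hidden states of each sequence in a batch, using the attention mask produced by the tokenizer.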


<!--## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure
-->
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
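
To make the schedule concrete, the sketch below estimates the number of optimizer steps in the single epoch and evaluates a linear decay from the initial learning rate to zero. It assumes a single device, no gradient accumulation, no warmup (none is listed), and that all of the roughly 1.96 million sequences were used for training; none of these is confirmed by the list above, so treat the step count as an approximation.

```python
import math

BASE_LR = 2e-05            # learning_rate from the list above
BATCH_SIZE = 32            # train_batch_size from the list above
NUM_SEQUENCES = 1_960_000  # approximate: the card states "1.96 million"
WARMUP_STEPS = 0           # assumption: no warmup is listed

# Assumption: one optimizer step per batch on a single device.
TOTAL_STEPS = math.ceil(NUM_SEQUENCES / BATCH_SIZE)


def linear_lr(step: int,
              total_steps: int = TOTAL_STEPS,
              base_lr: float = BASE_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(TOTAL_STEPS)  # 61250
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate starts at 2e-05, is about half that value midway through the epoch, and reaches zero at the final step.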

### Framework versions

- Transformers 4.39.3
- PyTorch 2.2.2+cu118
- Datasets 2.18.0
- Tokenizers 0.15.2
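
When reproducing results, it can help to compare the local environment with the versions listed above. The following is a minimal sketch using only the standard library; the `compare_versions` helper is hypothetical, and the distribution names are the usual PyPI names for these packages. A mismatch does not necessarily mean the model will fail to load.

```python
from importlib import metadata

# Versions listed in this model card, keyed by PyPI distribution name.
CARD_VERSIONS = {
    "transformers": "4.39.3",
    "torch": "2.2.2+cu118",
    "datasets": "2.18.0",
    "tokenizers": "0.15.2",
}


def compare_versions(expected=CARD_VERSIONS):
    """Return {name: (card_version, installed_version_or_None, exact_match)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, same) in compare_versions().items():
    status = "ok" if same else "differs"
    print(f"{name}: card={wanted} installed={installed} ({status})")
```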