Create README.md
README.md
ADDED
@@ -0,0 +1,39 @@
# Model Card for aedupuga/multioutput-regression-models
### Model Description
This model card describes the multi-output regression models trained on the aedupuga/2025-scaffold-strucutres dataset. The models predict structural properties of DNA sequences based on their sequence and other features.
- **Model developed by:** Anuhya Edupuganti
- **Model type:** Multi-output regression models (Ridge, Lasso, Elastic Net, Gradient Boosting, Hist Gradient Boosting, LGBM, and MLP regressors)
### Model Sources
- **Dataset:** https://huggingface.co/datasets/aedupuga/2025-scaffold-strucutres
### Direct Use
- These models can be used to predict structural properties of new DNA sequences. The inputs should be the one-hot encoded sequence, length_bp, GC_content, and AT_content, in the same format as the training data.
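
As a rough illustration of the input format described above, the following sketch builds a feature vector from a raw sequence. The `ACGT` alphabet, the zero-padding to a fixed `max_len`, the all-zero encoding of unknown bases, and the use of fractions (rather than percentages) for GC_content and AT_content are all assumptions; the helper names `one_hot_encode` and `build_features` are hypothetical. Check the dataset before relying on any of these choices.

```python
from typing import List

ALPHABET = "ACGT"  # assumed alphabet; the dataset's actual encoding may differ


def one_hot_encode(sequence: str, max_len: int) -> List[float]:
    """Flatten a DNA sequence into a one-hot vector, zero-padded to max_len bases."""
    sequence = sequence.upper()
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    encoded: List[float] = []
    for base in sequence:
        # Symbols outside ALPHABET (e.g. N) become four zeros (an assumption).
        encoded.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    # Zero-pad so every sequence yields the same number of columns.
    encoded.extend([0.0] * (len(ALPHABET) * (max_len - len(sequence))))
    return encoded


def build_features(sequence: str, max_len: int) -> List[float]:
    """Concatenate the one-hot sequence with length_bp, GC_content, and AT_content."""
    seq = sequence.upper()
    length_bp = len(seq)
    # Fractions in [0, 1]; the training data may store percentages instead.
    gc_content = (seq.count("G") + seq.count("C")) / length_bp if length_bp else 0.0
    at_content = (seq.count("A") + seq.count("T")) / length_bp if length_bp else 0.0
    return one_hot_encode(seq, max_len) + [float(length_bp), gc_content, at_content]


features = build_features("ACGT", max_len=6)
print(len(features))  # 6 * 4 one-hot columns + 3 scalar features = 27
```

Whatever encoding is used at prediction time has to match the one used for training column by column, otherwise the fitted coefficients are applied to the wrong features.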
## Bias, Risks, and Limitations
- The models are trained on a specific dataset and may not generalize well to sequences with significantly different characteristics.
## Training Data:
The models were trained on the original split of the aedupuga/2025-scaffold-strucutres dataset, which contains features such as sequence, length_bp, and GC_content, and the target variables mfe_energy, num_pairs, stem_len_mean, num_stems, num_hairpins, and num_internal_loops.
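
A minimal sketch of what multi-output regression means for these six targets, using scikit-learn's `Ridge`, which accepts a two-dimensional target matrix directly. The random placeholder arrays, the feature count of 12, and `alpha=1.0` are assumptions for illustration only; the actual preprocessing and hyperparameters are not stated in this card.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Target names as listed in this model card.
TARGETS = ["mfe_energy", "num_pairs", "stem_len_mean",
           "num_stems", "num_hairpins", "num_internal_loops"]

rng = np.random.default_rng(0)
X_train = rng.random((200, 12))            # placeholder features, not real data
Y_train = rng.random((200, len(TARGETS)))  # placeholder targets, one column each
X_new = rng.random((5, 12))                # placeholder rows to predict

model = Ridge(alpha=1.0)  # alpha is an assumed value
model.fit(X_train, Y_train)  # one fit call covers all six targets

Y_pred = model.predict(X_new)
print(Y_pred.shape)  # (5, 6): one row per sequence, one column per target
```

Each column of `Y_pred` lines up with the corresponding entry of `TARGETS`, so predictions can be labeled with `dict(zip(TARGETS, Y_pred[0]))`.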
## Evaluation Data:
The models were evaluated on a test set using mean absolute error (MAE) per target variable, overall mean squared error (MSE), and overall R² score. The results are below:

|Model|MAE mfe_energy|MAE num_pairs|MAE stem_len_mean|MAE num_stems|MAE num_hairpins|MAE num_internal_loops|Overall MSE|Overall R²|Training Time (s)|Prediction Time (s)|
|---|---|---|---|---|---|---|---|---|---|---|
|Elastic Net Regression|52.246284144510895|26.310440395684935|0.12521268046915585|11.824946984005694|6.362566878951059|10.42332493488957|1106.2239040178551|0.826949061716721|37.89513540267944|0.1340947151184082|
|Gradient Boosting Regressor|93.86046583448288|62.12858533728426|0.1195790099334551|19.521731017111673|8.17095118930435|13.708766069413938|8056.465535344057|0.6354714816262127|1064.1453528404236|0.1442549228668213|
|Hist Gradient Boosting Regressor|92.7948317451044|119.05137751966541|0.09455135368867978|38.937795002481145|14.538582916907997|17.869036566267987|22401.159492850904|0.8354263411439559|2276.7718391418457|0.05630350112915039|
|LGBM Regressor|101.99282118712706|118.43061288454638|0.09833922311726692|40.143725672660345|14.649323146842754|17.48710432164195|23866.947492270672|0.8261400755125136|110.61460065841675|2.587249279022217|
|Ridge Regression|53.306863779432625|25.654395957994026|0.08403309633471835|11.393997952747661|5.67977376648804|9.260745328034114|1260.7624462037288|0.9156932974948483|7.063617944717407|0.12312531471252441|
|Lasso Regression|67.2766660142239|31.48700612938905|0.12521713179836697|13.158785656539967|6.854702974737726|11.13869663689622|1823.6267070867707|0.8248397294025618|51.86927938461304|0.12734723091125488|
|MLP Regressor|113.60031276554486|76.11145098696264|1.7844990300743258|19.919928534641326|9.225894814725708|13.794781026278551|5507.494866833836|-34.39226684672794|68.65580224990845|0.13591504096984863|
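
The three metrics reported above could be computed along the following lines. This is a sketch, not the evaluation code used for this card: it assumes that overall MSE averages squared errors over all samples and targets, and that overall R² is the unweighted mean of per-target R² scores (the convention scikit-learn's `r2_score` uses by default for multi-output data). The function name `evaluate` is hypothetical.

```python
import numpy as np


def evaluate(y_true, y_pred, target_names):
    """Return MAE per target, overall MSE, and overall (uniformly averaged) R²."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    # One MAE value per target column.
    mae = np.abs(errors).mean(axis=0)
    # Mean of squared errors over every sample and every target.
    overall_mse = float((errors ** 2).mean())
    # Per-target R², then an unweighted mean across targets.
    # A constant target column (zero total variance) is not handled here.
    ss_res = (errors ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    overall_r2 = float((1.0 - ss_res / ss_tot).mean())

    return {
        "mae_per_target": dict(zip(target_names, mae.tolist())),
        "overall_mse": overall_mse,
        "overall_r2": overall_r2,
    }


result = evaluate(
    y_true=[[1, 10], [2, 20], [3, 30]],
    y_pred=[[1, 12], [2, 18], [4, 30]],
    target_names=["a", "b"],
)
print(result)
```

Under these assumptions, overall MSE is dominated by targets with large numeric ranges, while the averaged R² weights every target equally, so the two columns need not rank the models in the same order.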
## Model Card Contact
Anuhya Edupuganti (Carnegie Mellon University) - aedupuga@andrew.cmu.edu