---
license: apache-2.0
tags:
- glycan
- glycomics
- glycoinformatics
- mass_spectrometry
---

## Model Details

**Model Name:** GlycoBERT  
**Model Type:** Transformer-based sequence classifier for glycan structure prediction  
**Architecture:** BERT (Bidirectional Encoder Representations from Transformers)  
**Version:** 1.0  
**Date:** June 25, 2025

### Model Description
GlycoBERT is a transformer-based deep learning model designed to predict glycan structures from tandem mass spectrometry (MS/MS) data. The model treats mass spectra as tokenized sequences ("MS sentences") and performs multi-class classification to assign spectra to one of 3,590 possible glycan structure classes.
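The "MS sentence" idea can be sketched as discretizing spectral peaks into a token sequence, which is then fed to the classifier like ordinary text. The binning scheme below is a hypothetical illustration (bin width, token names, and intensity-based truncation are assumptions; the actual GlycoBERT vocabulary construction may differ):

```python
def spectrum_to_tokens(peaks, bin_width=0.1, max_tokens=512):
    """Turn (m/z, intensity) pairs into a BERT-style token sequence.

    Hypothetical scheme: keep the most intense peaks that fit the
    sequence budget, and map each m/z value to a discrete bin token.
    """
    tokens = ["[CLS]"]
    # Retain the strongest peaks, leaving room for [CLS] and [SEP].
    for mz, intensity in sorted(peaks, key=lambda p: -p[1])[: max_tokens - 2]:
        tokens.append(f"mz_{round(mz / bin_width)}")
    tokens.append("[SEP]")
    return tokens

# Two example peaks (e.g. common glycan oxonium-ion masses).
print(spectrum_to_tokens([(204.09, 1.0), (366.14, 0.8)]))
# → ['[CLS]', 'mz_2041', 'mz_3661', '[SEP]']
```

A scheme like this keeps the model vocabulary finite (here, one token per 0.1 Da bin) at the cost of merging peaks that fall in the same bin.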

### Architecture Details
- **Base Architecture:** BertForSequenceClassification
- **Parameters:** 96 million
- **Layers:** 12 transformer layers
- **Attention Heads:** 12 per layer
- **Hidden Size:** 768 dimensions
- **Max Sequence Length:** 512 tokens
- **Vocabulary Size:** 10,010 tokens
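The stated dimensions are consistent with the ~96M parameter figure. A back-of-envelope count for a standard BERT encoder with these hyperparameters (an approximation, not an exact accounting of the checkpoint):

```python
# Approximate parameter count for a BERT-base encoder with GlycoBERT's
# published dimensions. Rough sanity check, not the exact checkpoint size.
VOCAB, MAX_POS, TYPES = 10_010, 512, 2
HIDDEN, LAYERS, FFN = 768, 12, 4 * 768   # FFN intermediate size = 3072
NUM_CLASSES = 3_590

# Token, position, and segment embeddings, plus their LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + 2 * HIDDEN
# One transformer layer: attention (Q, K, V, output), feed-forward, LayerNorms.
per_layer = (
    4 * (HIDDEN * HIDDEN + HIDDEN)   # Q, K, V, and attention output projections
    + HIDDEN * FFN + FFN             # feed-forward up-projection
    + FFN * HIDDEN + HIDDEN          # feed-forward down-projection
    + 2 * 2 * HIDDEN                 # two LayerNorms (scale + bias each)
)
pooler = HIDDEN * HIDDEN + HIDDEN
classifier = HIDDEN * NUM_CLASSES + NUM_CLASSES

total = embeddings + LAYERS * per_layer + pooler + classifier
print(f"~{total / 1e6:.1f}M parameters")
```

This lands near 96M, matching the card; the 3,590-way classification head alone contributes roughly 2.8M of those parameters.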

## Intended Use

### Primary Use Cases
- Glycan structure prediction from MS/MS spectra
- High-throughput glycomics analysis
- Structural annotation of mass spectrometry data
- Research applications in glycobiology and glycoinformatics


## Ethical Considerations

### Responsible Use
- Model predictions should be validated experimentally
- Not intended for direct clinical decision-making without proper validation
- Users should understand the model's limitations and scope

### Potential Risks
- Over-reliance on computational predictions without experimental validation
- Misinterpretation of confidence scores as absolute certainty
- Application to data significantly different from training distribution
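On the confidence-score point: a softmax over class logits can be sharply peaked even for inputs unlike anything in training, so a high score is not calibrated certainty. A minimal illustration with hypothetical logits (not actual GlycoBERT output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for 5 of the 3,590 classes: a modest 3-point logit
# gap between the top two classes already yields >90% softmax "confidence",
# regardless of whether the input resembles the training distribution.
probs = softmax([5.0, 2.0, 1.0, 0.5, 0.0])
print(f"top-class probability: {max(probs):.3f}")
```

This is why out-of-distribution inputs are a listed risk: the model must assign the probability mass somewhere, and the winning class can look confident by construction.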

## Model Version
- **GlycoBERT-F:** Version trained on full dataset

### Code and Data Availability
- **Repository:** [GitHub (glycotrans)](https://github.com/cabsel/glycotrans)
- **Training Data:** [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15741423.svg)](https://doi.org/10.5281/zenodo.15741423)
- **Example Inference:** [Google Colab](https://colab.research.google.com/drive/1otVLVDQfLyldtIFcBxGnwVf9PeeTnJ17?usp=sharing)