---
library_name: pytorch
license: other
tags:
- glycans
- wurcs
- bertose
- ambiguity-resolution
- contrastive-learning
- pytorch
---

# BERTose IAR Resolver

This repository contains the contrastively refined BERTose checkpoint used for iterative ambiguity resolution (IAR) over ambiguous WURCS BPE tokens.

## Quick Start

The recommended user path is the companion notebook. To fetch the checkpoint and the ambiguity map directly from the Hub instead:

```python
from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="supanthadey1/bertose-iar-resolver",
    filename="checkpoints/bertose_iar_resolver.pt",
)
ambiguity_map = hf_hub_download(
    repo_id="supanthadey1/bertose-iar-resolver",
    filename="vocab/bpe_ambiguity_tokens.json",
)
```

The repository is public, so no Hugging Face token is required to download these files.

## Files

- `checkpoints/bertose_iar_resolver.pt` - BERTose IAR checkpoint.
- `vocab/bpe_vocabulary.json` - WURCS BPE vocabulary.
- `vocab/bpe_ambiguity_tokens.json` - ambiguous-token map used by the resolver.
- `src/bertose_model.py` - BERTose model definition.
- `src/bertose_layers.py` - Transformer layers used by BERTose.
- `src/wurcs_bpe_tokenizer.py` - WURCS BPE tokenizer.

## Input

Provide one WURCS glycan string or a CSV batch with `sample_id,wurcs`. The resolver is intended for glycans that already contain uncertainty markers in WURCS form.
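WURCS strings contain commas in their count section, so the `wurcs` column must be quoted when a batch CSV is written. The sketch below builds and re-reads such a file with Python's standard `csv` module. The helper names `write_batch_csv` and `read_batch_csv` and the sample row are illustrative assumptions, not part of this repository.

```python
import csv
import io


def write_batch_csv(rows, handle):
    """Write (sample_id, wurcs) pairs under a `sample_id,wurcs` header.

    Hypothetical helper: csv.writer quotes any field that contains a comma,
    which keeps each WURCS string in a single column.
    """
    writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["sample_id", "wurcs"])
    for sample_id, wurcs in rows:
        writer.writerow([sample_id, wurcs])


def read_batch_csv(handle):
    """Read the batch back as a list of (sample_id, wurcs) tuples."""
    return [(row["sample_id"], row["wurcs"]) for row in csv.DictReader(handle)]


# Illustrative row only: a single hexose with an unspecified anomer ("x").
EXAMPLE_ROWS = [("sample_1", "WURCS=2.0/1,1,0/[a2122h-1x_1-5]/1/")]

buffer = io.StringIO()
write_batch_csv(EXAMPLE_ROWS, buffer)
buffer.seek(0)
round_tripped = read_batch_csv(buffer)
```

Writing the file by joining fields with `","` would split the WURCS string across several columns, which is why the sketch goes through `csv.writer`.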

Free-text ambiguous glycan names are not parsed directly. Convert the name or IUPAC-condensed notation to WURCS first. If the structure is ambiguous, preserve that ambiguity in the WURCS string with WURCS-style uncertainty markers before running BERTose IAR.
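Because names and IUPAC-condensed strings are not accepted, a cheap pre-check can catch them before a batch run. The sketch below is a prefix test only, not a WURCS validator, and `looks_like_wurcs` is a hypothetical helper rather than a function shipped in `src/`.

```python
def looks_like_wurcs(text):
    """Return True if `text` starts like a WURCS string.

    Heuristic only: checks the `WURCS=` prefix and that a version field
    followed by `/` comes next. It does not parse residues or linkages.
    """
    candidate = text.strip()
    if not candidate.startswith("WURCS="):
        return False
    version, separator, rest = candidate[len("WURCS="):].partition("/")
    return bool(version) and separator == "/" and bool(rest)


# Illustrative inputs: one WURCS string and one free-text composition.
wurcs_ok = looks_like_wurcs("WURCS=2.0/1,1,0/[a2122h-1x_1-5]/1/")
name_ok = looks_like_wurcs("Man3GlcNAc2")
```

Inputs that fail the check need conversion to WURCS by an external tool before they are passed to the resolver.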

## Output

The resolver returns token-level ambiguity-resolution predictions, each with a confidence score. For batch runs, the companion notebook writes both a summary CSV and a detail CSV.
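One way to use the confidence scores is to separate predictions that can be accepted from those that need manual review. The sketch below assumes a detail CSV with columns named `sample_id`, `position`, `predicted_token`, and `confidence`. These column names, the placeholder rows, and the 0.9 threshold are all assumptions for illustration; check the header of the CSV the notebook actually writes.

```python
import csv
import io

# Placeholder rows with hypothetical column names, not real model output.
DETAIL_CSV = """sample_id,position,predicted_token,confidence
sample_1,4,tok_a,0.97
sample_1,9,tok_b,0.41
sample_2,2,tok_c,0.88
"""


def split_by_confidence(handle, threshold=0.9):
    """Split detail rows into (accepted, review) lists by confidence."""
    accepted, review = [], []
    for row in csv.DictReader(handle):
        # Rows at or above the threshold are accepted; the rest go to review.
        target = accepted if float(row["confidence"]) >= threshold else review
        target.append(row)
    return accepted, review


accepted, review = split_by_confidence(io.StringIO(DETAIL_CSV), threshold=0.9)
```

The threshold is a free parameter here; this README does not state a recommended value.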

## Scope

The resolver provides model-backed token updates and confidence values for ambiguous positions. It does not claim to reconstruct a final canonical WURCS string by itself, and it does not perform IUPAC-condensed/name-to-WURCS conversion.

License metadata is currently `other`; update it when the final release license and citation text are chosen.