BERTose and AFFINose Training-Code Manifest
Updated: 2026-06-10
Public repository: https://huggingface.co/supanthadey1/bertose-affinose-training-code
Current public contents: 130 files, approximately 54 MB
Scope: training, benchmarking, probe, split/vocabulary, and provenance code needed to understand and reproduce the released BERTose and AFFINose workflows.
This manifest describes the files actually present in the public Hugging Face training-code repository. Large corpora, intermediate mapping files, checkpoints, generated figures, and full result bundles are intentionally not bundled here; the released checkpoints and inference notebook are linked from the repository README.
Public Naming
Some executable files keep historical development names so that the code remains traceable to the original training logs. In public-facing text, each public name corresponds to the following historical code locations:
| Public name | Historical code locations |
|---|---|
| BERTose glycan encoder / multimodal pretraining | code/model/, code/training/, configs/ |
| BERTose IAR resolver / contrastive refinement | code/contrastive/, code/contrastive_training/, code/tokenizer/ |
| AFFINose protein-glycan interaction model | code/affinose/README.md, code/bertint/ |
Entry Points
| File | Purpose |
|---|---|
| README.md | Reviewer/user-facing overview, repository map, checkpoint links, install notes, and quick import check |
| MANIFEST.md | This file; exact public package map |
| RELEASE_AUDIT.md | Public-release checks, repaired gaps, verification results, and known scope limits |
| LICENSE | Apache License 2.0 terms for the BERTose/AFFINose code, notebooks and released model artifacts |
| requirements.txt | Core Python dependencies plus optional analysis/probe packages |
| SHA256SUMS | Checksums for every public file except SHA256SUMS itself |
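A downloaded snapshot can be checked against SHA256SUMS with a few lines of standard-library Python. The sketch below is illustrative only: it assumes the file uses the common `sha256sum` layout (`<hex digest>  <relative path>` per line), which this manifest does not state, and the helper names `sha256_of` and `verify_checksums` are hypothetical rather than part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, sums_name: str = "SHA256SUMS") -> list[str]:
    """Return relative paths that are missing or whose digest does not match.

    Assumption: each non-empty line is '<hex digest>  <relative path>'.
    """
    failures = []
    for line in (root / sums_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.strip().partition(" ")
        # sha256sum prefixes the path with '*' when run in binary mode.
        rel_path = rel_path.strip().lstrip("*")
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Calling `verify_checksums(Path("path/to/snapshot"))` returns an empty list when every listed file matches. Because SHA256SUMS does not list itself, it is read but never hashed.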
Workflow Map
| Workflow | Main files |
|---|---|
| BERTose multimodal pretraining | code/training/train_multimodal.py, code/training/multimodal_dataset.py, code/training/multimodal_masking.py, code/training/masking.py, configs/multimodal_config_v5b_excluded.yaml |
| WURCS-BPE tokenizer training | code/training/train_wurcs_bpe.py, code/model/wurcs_bpe_tokenizer.py, data/vocab/ |
| Ambiguity/IAR contrastive refinement | code/contrastive/contrastive_trainer_v51_FINAL.py, code/contrastive/generate_negatives_v4_FINAL.py, code/contrastive/mcns_filter_75k_v3_FINAL.py, code/contrastive_training/step3_train_v51_FIXED_FINAL.sh |
| AFFINose interaction training | code/affinose/README.md, code/bertint/build_combined_dataset.py, code/bertint/generate_glycan_splits.py, code/bertint/training_v8.py, code/bertint/bertint_v8.py, code/bertint/dataset_v8.py |
| Benchmark reproduction | code/benchmarks/, code/downstream_tasks/utils/ |
| Embedding and biology probes | code/probes/, code/probes/cluster_scripts/ |
| Provenance and compute context | provenance/, provenance/compute_provenance/ |
Version Selection
| Version label | Public role | Config or script | Checkpoint handling |
|---|---|---|---|
| v3 ORIGINAL | Legacy baseline pre-BPE | configs/multimodal_config_v3_ORIGINAL.yaml | Not bundled |
| v4 BPE | First BPE model | configs/multimodal_config_v4_bpe.yaml | Not bundled |
| v5-A IPA | IPA self-distillation | configs/multimodal_config_v5a_bpe_topo.yaml, code/training/ipa_bpe_distillation.py | Not bundled |
| v5b excluded | Production leakage-proof BERTose encoder | configs/multimodal_config_v5b_excluded.yaml | Released separately as supanthadey1/bertose-glycan-encoder |
| v5.1 contrastive | Production BERTose IAR resolver; historically called the contrastive/V6 model in development notes | code/contrastive/contrastive_trainer_v51_FINAL.py | Released separately as supanthadey1/bertose-iar-resolver |
| AFFINose interaction model | Protein-glycan interaction prediction | code/bertint/training_v8.py | Released separately as supanthadey1/affinose-interaction-model |
Included Directory Summary
| Directory | Included content |
|---|---|
| code/model/ | Core BERTose model, multimodal model, classifier heads, tokenizer helpers, dataset helpers |
| code/training/ | Pretraining, tokenizer training, IPA distillation, masking, multimodal dataset and masking dependencies |
| code/contrastive/ | IAR/contrastive trainer, negative generation, MCNS filtering, difficulty scoring, resolved-glycan extraction |
| code/contrastive_training/ | Representative contrastive launch script |
| code/bertint/ | Historical AFFINose implementation files for data construction, splits, model, dataset, training, inference, and ablations |
| code/affinose/ | Public AFFINose entrypoint explaining the historical bertint directory |
| code/benchmarks/ | Fine-tuning, benchmark consolidation, ranking calculation, result extraction, exclusion-dataset construction |
| code/downstream_tasks/utils/ | Package-local benchmark/probe utilities and tokenizer/dataset dependencies |
| code/probes/ | Probe and embedding-analysis scripts plus representative cluster launch scripts |
| code/data_processing/ | Pretraining-data augmentation and IUPAC/GlyCosmos conversion utilities |
| code/tokenizer/ | Ambiguity-token generation helper |
| configs/ | Four BERTose multimodal training configs |
| data/splits/ | Leakage-exclusion list and train/validation split metadata |
| data/vocab/ | BPE, character, ambiguity, confidence, and MS vocabulary assets |
| provenance/ | Source, tokenizer, model-lineage, and compute-provenance notes |
Included Data Assets
| File | Notes |
|---|---|
| data/vocab/bpe_vocabulary_clean.json | Production BPE vocabulary |
| data/vocab/bpe_vocabulary.json | Original BPE vocabulary before cleanup |
| data/vocab/vocabulary.json | Character-level vocabulary |
| data/vocab/ms_vocabulary.json | Mass-spectrometry fragment-token vocabulary |
| data/vocab/bpe_ambiguity_tokens.json | BPE ambiguity-token map |
| data/vocab/ambiguity_tokens_CORRECTED.json | Corrected ambiguity-token map |
| data/vocab/confidence_analysis.json | IPA confidence-analysis output used by release scripts |
| data/splits/train_val_split.json | BERTose train/validation split metadata |
| data/splits/complete_exclusion_list.txt | Leakage-proof exclusion list |
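As an illustration of how an exclusion list of this kind is typically applied, the sketch below drops excluded entries from a keyed collection before training. It is a minimal example under stated assumptions, not the repository's own loader: it assumes data/splits/complete_exclusion_list.txt holds one identifier per line, and both the identifiers and the helper names `load_exclusions` and `drop_excluded` are hypothetical.

```python
from pathlib import Path


def load_exclusions(path: Path) -> set[str]:
    """Read one identifier per line, skipping blank lines (assumed format)."""
    text = path.read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_excluded(records: dict[str, str], excluded: set[str]) -> dict[str, str]:
    """Keep only records whose key does not appear in the exclusion set."""
    return {key: value for key, value in records.items() if key not in excluded}
```

Filtering by set membership keeps the check linear in the number of records, so the same pattern applies whether the keys are accession IDs or sequence strings.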
Not Included Here
The public training-code repository intentionally excludes large artifacts that are either too large for a compact code companion or are hosted in separate public repositories:
- Full pretraining corpus pickles such as sequences_bpe.pkl and sequences_bpe_excluded.pkl.
- Multi-GB intermediate mapping files such as full multimodal/master mappings.
- Full model checkpoints; use the linked model repositories instead.
- ESM-C protein embeddings required for AFFINose training; users should generate or provide those according to ESM-C access rules.
- Generated manuscript figures, full probe result bundles, and large benchmark output folders; processed summaries/source data are handled separately for manuscript submission.
Verification Status
The public release was checked by reading it back from Hugging Face without an access token. The following checks passed:
- The repository is public and ungated.
- The remote file list exactly matches the local staged release.
- Model card license metadata is apache-2.0, and the repository includes an Apache-2.0 LICENSE file.
- A no-token snapshot download succeeds.
- SHA256SUMS verifies every downloaded file.
- No Hugging Face token-looking strings were found.
- No .secrets, __pycache__, .pyc, or .DS_Store artifacts were present in the uploaded release.
- The README quick import check passes from the downloaded public snapshot.
These checks verify packaging, public access, checksum integrity, token hygiene, and import completeness for the released code surface. They do not claim that a full multi-GPU retraining run can be launched without the large excluded corpora, embeddings, and compute resources.
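The artifact-hygiene check in the list above can be reproduced on a local snapshot with a short scan. This is a sketch of one way to do it, not the script used for the release audit; the function name `find_unwanted_artifacts` is hypothetical, and only the four patterns named in the list are checked.

```python
from pathlib import Path

# Names and suffixes taken from the hygiene check in the verification list.
FORBIDDEN_NAMES = {".secrets", "__pycache__", ".DS_Store"}
FORBIDDEN_SUFFIXES = {".pyc"}


def find_unwanted_artifacts(root: Path) -> list[str]:
    """Return sorted relative paths of files or directories that should not ship."""
    hits = []
    for path in root.rglob("*"):
        if path.name in FORBIDDEN_NAMES or path.suffix in FORBIDDEN_SUFFIXES:
            hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

An empty return value means none of the listed artifacts are present under `root`. Token scanning is deliberately left out here, since the manifest does not describe how token-looking strings were detected.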