BERTose and AFFINose Training-Code Manifest

Updated: 2026-06-10
Public repository: https://huggingface.co/supanthadey1/bertose-affinose-training-code
Current public contents: 130 files, approximately 54 MB
Scope: training, benchmarking, probe, split/vocabulary, and provenance code needed to understand and reproduce the released BERTose and AFFINose workflows.

This manifest describes the files actually present in the public Hugging Face training-code repository. Large corpora, intermediate mapping files, checkpoints, generated figures, and full result bundles are intentionally not bundled here; the released checkpoints and inference notebook are linked from the repository README.

Public Naming

Some executable files keep historical development names so that the code remains traceable to the original training logs. Public-facing text uses the names below, which map to the historical code locations as follows:

Public name	Historical code locations
BERTose glycan encoder / multimodal pretraining	`code/model/`, `code/training/`, `configs/`
BERTose IAR resolver / contrastive refinement	`code/contrastive/`, `code/contrastive_training/`, `code/tokenizer/`
AFFINose protein-glycan interaction model	`code/affinose/README.md`, `code/bertint/`

Entry Points

File	Purpose
`README.md`	Reviewer/user-facing overview, repository map, checkpoint links, install notes, and quick import check
`MANIFEST.md`	This file; exact public package map
`RELEASE_AUDIT.md`	Public-release checks, repaired gaps, verification results, and known scope limits
`LICENSE`	Apache License 2.0 terms for the BERTose/AFFINose code, notebooks, and released model artifacts
`requirements.txt`	Core Python dependencies plus optional analysis/probe packages
`SHA256SUMS`	Checksums for every public file except `SHA256SUMS` itself
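
A downloaded copy can be checked against `SHA256SUMS` with a few lines of standard-library Python. The sketch below assumes the file uses the common `sha256sum` layout (a hex digest, whitespace, then a path relative to the repository root); that layout is an assumption and is not stated in this manifest. The function names are illustrative and are not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest_name: str = "SHA256SUMS") -> list[str]:
    """Return a list of problems; an empty list means every entry verified.

    Assumed line layout: "<hex digest><whitespace><relative path>", with an
    optional leading "*" on the path (the sha256sum binary-mode marker).
    """
    problems = []
    for line in (root / manifest_name).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        expected, _, rel_path = line.partition(" ")
        rel_path = rel_path.strip().removeprefix("*")
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected.lower():
            problems.append(f"mismatch: {rel_path}")
    return problems
```

Calling `verify_manifest(Path("path/to/local/snapshot"))` should return an empty list for an intact copy. Because `SHA256SUMS` does not list itself, its own integrity is not covered by this check.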

Workflow Map

Workflow	Main files
BERTose multimodal pretraining	`code/training/train_multimodal.py`, `code/training/multimodal_dataset.py`, `code/training/multimodal_masking.py`, `code/training/masking.py`, `configs/multimodal_config_v5b_excluded.yaml`
WURCS-BPE tokenizer training	`code/training/train_wurcs_bpe.py`, `code/model/wurcs_bpe_tokenizer.py`, `data/vocab/`
Ambiguity/IAR contrastive refinement	`code/contrastive/contrastive_trainer_v51_FINAL.py`, `code/contrastive/generate_negatives_v4_FINAL.py`, `code/contrastive/mcns_filter_75k_v3_FINAL.py`, `code/contrastive_training/step3_train_v51_FIXED_FINAL.sh`
AFFINose interaction training	`code/affinose/README.md`, `code/bertint/build_combined_dataset.py`, `code/bertint/generate_glycan_splits.py`, `code/bertint/training_v8.py`, `code/bertint/bertint_v8.py`, `code/bertint/dataset_v8.py`
Benchmark reproduction	`code/benchmarks/`, `code/downstream_tasks/utils/`
Embedding and biology probes	`code/probes/`, `code/probes/cluster_scripts/`
Provenance and compute context	`provenance/`, `provenance/compute_provenance/`

Version Selection

Version label	Public role	Config or script	Checkpoint handling
v3 ORIGINAL	Legacy pre-BPE baseline	`configs/multimodal_config_v3_ORIGINAL.yaml`	Not bundled
v4 BPE	First BPE model	`configs/multimodal_config_v4_bpe.yaml`	Not bundled
v5-A IPA	IPA self-distillation	`configs/multimodal_config_v5a_bpe_topo.yaml`, `code/training/ipa_bpe_distillation.py`	Not bundled
v5b excluded	Production leakage-proof BERTose encoder	`configs/multimodal_config_v5b_excluded.yaml`	Released separately as `supanthadey1/bertose-glycan-encoder`
v5.1 contrastive	Production BERTose IAR resolver; historically called the contrastive/V6 model in development notes	`code/contrastive/contrastive_trainer_v51_FINAL.py`	Released separately as `supanthadey1/bertose-iar-resolver`
AFFINose interaction model	Protein-glycan interaction prediction	`code/bertint/training_v8.py`	Released separately as `supanthadey1/affinose-interaction-model`
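
For scripting, the table above can be transcribed into a small lookup. The sketch below copies the paths and repository names from the table; the `Version` type, the `VERSIONS` dictionary, and the helper function are illustrative names that do not exist in the repository.

```python
from typing import NamedTuple, Optional


class Version(NamedTuple):
    sources: tuple[str, ...]        # config or script paths, as listed in the table
    checkpoint_repo: Optional[str]  # None where the table says "Not bundled"


VERSIONS = {
    "v3 ORIGINAL": Version(("configs/multimodal_config_v3_ORIGINAL.yaml",), None),
    "v4 BPE": Version(("configs/multimodal_config_v4_bpe.yaml",), None),
    "v5-A IPA": Version(
        (
            "configs/multimodal_config_v5a_bpe_topo.yaml",
            "code/training/ipa_bpe_distillation.py",
        ),
        None,
    ),
    "v5b excluded": Version(
        ("configs/multimodal_config_v5b_excluded.yaml",),
        "supanthadey1/bertose-glycan-encoder",
    ),
    "v5.1 contrastive": Version(
        ("code/contrastive/contrastive_trainer_v51_FINAL.py",),
        "supanthadey1/bertose-iar-resolver",
    ),
    "AFFINose interaction model": Version(
        ("code/bertint/training_v8.py",),
        "supanthadey1/affinose-interaction-model",
    ),
}


def released_checkpoints() -> dict[str, str]:
    """Map each version label to its separately released checkpoint repository."""
    return {
        label: version.checkpoint_repo
        for label, version in VERSIONS.items()
        if version.checkpoint_repo is not None
    }
```

Only the three production entries carry a checkpoint repository; the legacy and intermediate versions are documented by their configs alone.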

Included Directory Summary

Directory	Included content
`code/model/`	Core BERTose model, multimodal model, classifier heads, tokenizer helpers, dataset helpers
`code/training/`	Pretraining, tokenizer training, IPA distillation, masking, and multimodal dataset and masking dependencies
`code/contrastive/`	IAR/contrastive trainer, negative generation, MCNS filtering, difficulty scoring, resolved-glycan extraction
`code/contrastive_training/`	Representative contrastive launch script
`code/bertint/`	Historical AFFINose implementation files for data construction, splits, model, dataset, training, inference, and ablations
`code/affinose/`	Public AFFINose entrypoint explaining the historical `bertint` directory
`code/benchmarks/`	Fine-tuning, benchmark consolidation, ranking calculation, result extraction, exclusion-dataset construction
`code/downstream_tasks/utils/`	Package-local benchmark/probe utilities and tokenizer/dataset dependencies
`code/probes/`	Probe and embedding-analysis scripts plus representative cluster launch scripts
`code/data_processing/`	Pretraining-data augmentation and IUPAC/GlyCosmos conversion utilities
`code/tokenizer/`	Ambiguity-token generation helper
`configs/`	Four BERTose multimodal training configs
`data/splits/`	Leakage-exclusion list and train/validation split metadata
`data/vocab/`	BPE, character, ambiguity, confidence, and MS vocabulary assets
`provenance/`	Source, tokenizer, model-lineage, and compute-provenance notes

Included Data Assets

File	Notes
`data/vocab/bpe_vocabulary_clean.json`	Production BPE vocabulary
`data/vocab/bpe_vocabulary.json`	Original BPE vocabulary before cleanup
`data/vocab/vocabulary.json`	Character-level vocabulary
`data/vocab/ms_vocabulary.json`	Mass-spectrometry fragment-token vocabulary
`data/vocab/bpe_ambiguity_tokens.json`	BPE ambiguity-token map
`data/vocab/ambiguity_tokens_CORRECTED.json`	Corrected ambiguity-token map
`data/vocab/confidence_analysis.json`	IPA confidence-analysis output used by release scripts
`data/splits/train_val_split.json`	BERTose train/validation split metadata
`data/splits/complete_exclusion_list.txt`	Leakage-proof exclusion list
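
As an illustration of how an exclusion list of this kind is typically applied, the sketch below drops excluded entries from a candidate set before training. It assumes `complete_exclusion_list.txt` holds one identifier or sequence per line; the file format is not documented in this manifest, and the function names are hypothetical rather than taken from the repository code.

```python
from pathlib import Path
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def load_exclusions(path: Path) -> set[str]:
    """Read one entry per line (assumed format), skipping blank lines."""
    entries = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry:
            entries.add(entry)
    return entries


def drop_excluded(
    items: Iterable[T],
    excluded: set[str],
    key: Callable[[T], str] = str,
) -> list[T]:
    """Keep only the items whose key does not appear in the exclusion set."""
    return [item for item in items if key(item) not in excluded]
```

The `key` argument lets the same filter work on richer records, for example `drop_excluded(records, excluded, key=lambda r: r["id"])` for dictionaries with a hypothetical `id` field.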

Not Included Here

The public training-code repository intentionally excludes artifacts that are either too large for a compact code companion or hosted in separate public repositories:

Full pretraining corpus pickles such as `sequences_bpe.pkl` and `sequences_bpe_excluded.pkl`.
Multi-GB intermediate mapping files such as full multimodal/master mappings.
Full model checkpoints; use the linked model repositories instead.
ESM-C protein embeddings required for AFFINose training; users should generate or provide those according to ESM-C access rules.
Generated manuscript figures, full probe result bundles, and large benchmark output folders; processed summaries/source data are handled separately for manuscript submission.

Verification Status

The public release was checked by reading it back from Hugging Face without an access token. The following checks passed:

The repository is public and ungated.
The remote file list exactly matches the local staged release.
Model card license metadata is `apache-2.0`, and the repository includes an Apache-2.0 `LICENSE` file.
A no-token snapshot download succeeds.
`SHA256SUMS` verifies every downloaded file.
No strings resembling Hugging Face access tokens were found.
No `.secrets`, `__pycache__`, `.pyc`, or `.DS_Store` artifacts were present in the uploaded release.
The README quick import check passes from the downloaded public snapshot.

These checks verify packaging, public access, checksum integrity, token hygiene, and import completeness for the released code surface. They do not claim that a full multi-GPU retraining run can be launched without the large excluded corpora, embeddings, and compute resources.
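
The artifact and token-hygiene checks can be approximated on any local copy. The sketch below is not the script used for the release audit; it is a minimal stand-in that flags the artifact names listed above and searches file contents for strings starting with `hf_`. The exact token pattern (prefix plus at least 30 alphanumeric characters) is an assumption chosen for illustration.

```python
import re
from pathlib import Path

FORBIDDEN_NAMES = {".secrets", "__pycache__", ".DS_Store"}
FORBIDDEN_SUFFIXES = {".pyc"}
# Assumed pattern: "hf_" followed by a long alphanumeric run.
TOKEN_PATTERN = re.compile(rb"hf_[A-Za-z0-9]{30,}")


def scan_release(root: Path) -> list[str]:
    """Return findings for forbidden artifacts and token-like strings."""
    findings = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.name in FORBIDDEN_NAMES or path.suffix in FORBIDDEN_SUFFIXES:
            findings.append(f"artifact: {rel}")
        elif path.is_file() and TOKEN_PATTERN.search(path.read_bytes()):
            findings.append(f"token-like string: {rel}")
    return findings
```

An empty result from `scan_release(Path("path/to/local/snapshot"))` is consistent with the hygiene checks reported above, although a regex scan of this kind can only catch tokens that match the assumed pattern.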