# BERTose and AFFINose Training-Code Manifest

**Updated**: 2026-06-10
**Public repository**: `https://huggingface.co/supanthadey1/bertose-affinose-training-code`
**Current public contents**: 130 files, approximately 54 MB
**Scope**: training, benchmarking, probe, split/vocabulary, and provenance code needed to understand and reproduce the released BERTose and AFFINose workflows.

This manifest describes the files actually present in the public Hugging Face training-code repository. Large corpora, intermediate mapping files, checkpoints, generated figures, and full result bundles are intentionally not bundled here; the released checkpoints and inference notebook are linked from the repository README.

## Public Naming

Some executable files keep historical development names so that the code remains traceable to the original training logs. Public-facing text uses the following names for the historical code locations:

| Public name | Historical code locations |
|---|---|
| BERTose glycan encoder / multimodal pretraining | `code/model/`, `code/training/`, `configs/` |
| BERTose IAR resolver / contrastive refinement | `code/contrastive/`, `code/contrastive_training/`, `code/tokenizer/` |
| AFFINose protein-glycan interaction model | `code/affinose/README.md`, `code/bertint/` |

## Entry Points

| File | Purpose |
|---|---|
| `README.md` | Reviewer/user-facing overview, repository map, checkpoint links, install notes, and quick import check |
| `MANIFEST.md` | This file; exact public package map |
| `RELEASE_AUDIT.md` | Public-release checks, repaired gaps, verification results, and known scope limits |
| `LICENSE` | Apache License 2.0 terms for the BERTose/AFFINose code, notebooks, and released model artifacts |
| `requirements.txt` | Core Python dependencies plus optional analysis/probe packages |
| `SHA256SUMS` | Checksums for every public file except `SHA256SUMS` itself |

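The `SHA256SUMS` entry can be checked locally with a short script. The sketch below assumes the file follows the common `sha256sum` layout (hex digest, whitespace, relative path), which this manifest does not specify; `sha256_of` and `verify_manifest` are hypothetical helper names, not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest_name: str = "SHA256SUMS") -> list[str]:
    """Return a list of problems; an empty list means every entry verified.

    Assumption: each non-empty line is '<hex digest><whitespace><relative path>',
    optionally with a '*' binary-mode marker in front of the path.
    """
    problems = []
    for line in (root / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.strip().partition(" ")
        rel_path = rel_path.strip().lstrip("*")
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected.lower():
            problems.append(f"mismatch: {rel_path}")
    return problems
```

Calling `verify_manifest(Path("path/to/snapshot"))` on a downloaded snapshot returns an empty list when every listed file is present and matches its digest. Because `SHA256SUMS` does not list itself, the script never hashes the manifest file.
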
## Workflow Map

| Workflow | Main files |
|---|---|
| BERTose multimodal pretraining | `code/training/train_multimodal.py`, `code/training/multimodal_dataset.py`, `code/training/multimodal_masking.py`, `code/training/masking.py`, `configs/multimodal_config_v5b_excluded.yaml` |
| WURCS-BPE tokenizer training | `code/training/train_wurcs_bpe.py`, `code/model/wurcs_bpe_tokenizer.py`, `data/vocab/` |
| Ambiguity/IAR contrastive refinement | `code/contrastive/contrastive_trainer_v51_FINAL.py`, `code/contrastive/generate_negatives_v4_FINAL.py`, `code/contrastive/mcns_filter_75k_v3_FINAL.py`, `code/contrastive_training/step3_train_v51_FIXED_FINAL.sh` |
| AFFINose interaction training | `code/affinose/README.md`, `code/bertint/build_combined_dataset.py`, `code/bertint/generate_glycan_splits.py`, `code/bertint/training_v8.py`, `code/bertint/bertint_v8.py`, `code/bertint/dataset_v8.py` |
| Benchmark reproduction | `code/benchmarks/`, `code/downstream_tasks/utils/` |
| Embedding and biology probes | `code/probes/`, `code/probes/cluster_scripts/` |
| Provenance and compute context | `provenance/`, `provenance/compute_provenance/` |

## Version Selection

| Version label | Public role | Config or script | Checkpoint handling |
|---|---|---|---|
| v3 ORIGINAL | Legacy pre-BPE baseline | `configs/multimodal_config_v3_ORIGINAL.yaml` | Not bundled |
| v4 BPE | First BPE model | `configs/multimodal_config_v4_bpe.yaml` | Not bundled |
| v5-A IPA | IPA self-distillation | `configs/multimodal_config_v5a_bpe_topo.yaml`, `code/training/ipa_bpe_distillation.py` | Not bundled |
| v5b excluded | Production leakage-proof BERTose encoder | `configs/multimodal_config_v5b_excluded.yaml` | Released separately as `supanthadey1/bertose-glycan-encoder` |
| v5.1 contrastive | Production BERTose IAR resolver; historically called the contrastive/V6 model in development notes | `code/contrastive/contrastive_trainer_v51_FINAL.py` | Released separately as `supanthadey1/bertose-iar-resolver` |
| AFFINose interaction model | Protein-glycan interaction prediction | `code/bertint/training_v8.py` | Released separately as `supanthadey1/affinose-interaction-model` |

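For readers who want a released checkpoint rather than a config, the version table can be read as a small lookup. The sketch below copies the version labels and model repository IDs from that table; the helper names are hypothetical, and the download step assumes that the `huggingface_hub` package is installed and that the linked model repositories can be fetched without further access steps, which this manifest does not state.

```python
# Version labels with a separately released checkpoint (copied from the table).
RELEASED_CHECKPOINTS = {
    "v5b excluded": "supanthadey1/bertose-glycan-encoder",
    "v5.1 contrastive": "supanthadey1/bertose-iar-resolver",
    "AFFINose interaction model": "supanthadey1/affinose-interaction-model",
}

# Version labels whose checkpoints are marked "Not bundled" in the table.
NOT_BUNDLED = ("v3 ORIGINAL", "v4 BPE", "v5-A IPA")


def checkpoint_repo(version_label: str) -> str:
    """Map a version label to its released model repository ID."""
    if version_label in NOT_BUNDLED:
        raise LookupError(f"{version_label!r} has no released checkpoint")
    return RELEASED_CHECKPOINTS[version_label]


def download_checkpoint(version_label: str, local_dir: str) -> str:
    """Download the released checkpoint snapshot and return its local path.

    Assumption: huggingface_hub is installed; it is imported lazily so the
    lookup above works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=checkpoint_repo(version_label),
        local_dir=local_dir,
    )
```

Keeping the lookup separate from the download means the label-to-repository mapping can be checked offline, and only `download_checkpoint` touches the network.
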
## Included Directory Summary

| Directory | Included content |
|---|---|
| `code/model/` | Core BERTose model, multimodal model, classifier heads, tokenizer helpers, dataset helpers |
| `code/training/` | Pretraining, tokenizer training, IPA distillation, masking, and multimodal dataset/masking dependencies |
| `code/contrastive/` | IAR/contrastive trainer, negative generation, MCNS filtering, difficulty scoring, resolved-glycan extraction |
| `code/contrastive_training/` | Representative contrastive launch script |
| `code/bertint/` | Historical AFFINose implementation files for data construction, splits, model, dataset, training, inference, and ablations |
| `code/affinose/` | Public AFFINose entrypoint explaining the historical `bertint` directory |
| `code/benchmarks/` | Fine-tuning, benchmark consolidation, ranking calculation, result extraction, exclusion-dataset construction |
| `code/downstream_tasks/utils/` | Package-local benchmark/probe utilities and tokenizer/dataset dependencies |
| `code/probes/` | Probe and embedding-analysis scripts plus representative cluster launch scripts |
| `code/data_processing/` | Pretraining-data augmentation and IUPAC/GlyCosmos conversion utilities |
| `code/tokenizer/` | Ambiguity-token generation helper |
| `configs/` | Four BERTose multimodal training configs |
| `data/splits/` | Leakage-exclusion list and train/validation split metadata |
| `data/vocab/` | BPE, character, ambiguity, confidence, and MS vocabulary assets |
| `provenance/` | Source, tokenizer, model-lineage, and compute-provenance notes |

## Included Data Assets

| File | Notes |
|---|---|
| `data/vocab/bpe_vocabulary_clean.json` | Production BPE vocabulary |
| `data/vocab/bpe_vocabulary.json` | Original BPE vocabulary before cleanup |
| `data/vocab/vocabulary.json` | Character-level vocabulary |
| `data/vocab/ms_vocabulary.json` | Mass-spectrometry fragment-token vocabulary |
| `data/vocab/bpe_ambiguity_tokens.json` | BPE ambiguity-token map |
| `data/vocab/ambiguity_tokens_CORRECTED.json` | Corrected ambiguity-token map |
| `data/vocab/confidence_analysis.json` | IPA confidence-analysis output used by release scripts |
| `data/splits/train_val_split.json` | BERTose train/validation split metadata |
| `data/splits/complete_exclusion_list.txt` | Leakage-proof exclusion list |

## Not Included Here

The public training-code repository intentionally excludes artifacts that are either too large for a compact code companion or hosted in separate public repositories:

- Full pretraining corpus pickles such as `sequences_bpe.pkl` and `sequences_bpe_excluded.pkl`.
- Multi-GB intermediate mapping files such as full multimodal/master mappings.
- Full model checkpoints; use the linked model repositories instead.
- ESM-C protein embeddings required for AFFINose training; users should generate or provide those according to ESM-C access rules.
- Generated manuscript figures, full probe result bundles, and large benchmark output folders; processed summaries/source data are handled separately for manuscript submission.

## Verification Status

The public release was checked by no-token Hugging Face readback. The following checks passed:

- The repository is public and ungated.
- The remote file list exactly matches the local staged release.
- Model card license metadata is `apache-2.0`, and the repository includes an Apache-2.0 `LICENSE` file.
- A no-token snapshot download succeeds.
- `SHA256SUMS` verifies every downloaded file.
- No strings resembling Hugging Face tokens were found.
- No `.secrets`, `__pycache__`, `.pyc`, or `.DS_Store` artifacts were present in the uploaded release.
- The README quick import check passes from the downloaded public snapshot.

These checks verify packaging, public access, checksum integrity, token hygiene, and import completeness for the released code surface. They do not claim that a full multi-GPU retraining run can be launched without the large excluded corpora, embeddings, and compute resources.

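Two of the checks listed in this section, the token scan and the stray-artifact scan, can be approximated locally. The sketch below is not the script used for the release audit; the `hf_` token pattern, the minimum token length, and the function name `scan_release` are assumptions for illustration.

```python
import re
from pathlib import Path

# Assumption: Hugging Face user access tokens start with "hf_" followed by a
# long alphanumeric run; the length threshold here is illustrative.
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{30,}")

# Artifact names and suffixes taken from the verification list.
UNWANTED_NAMES = {".secrets", "__pycache__", ".DS_Store"}
UNWANTED_SUFFIXES = {".pyc"}


def scan_release(root: Path) -> list[str]:
    """Return findings for a release folder; an empty list means it is clean."""
    findings = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.name in UNWANTED_NAMES or path.suffix in UNWANTED_SUFFIXES:
            findings.append(f"unwanted artifact: {rel}")
            continue
        if path.is_file():
            # Decode leniently so binary assets do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if TOKEN_PATTERN.search(text):
                findings.append(f"token-like string: {rel}")
    return findings
```

Running `scan_release(Path("path/to/snapshot"))` on a downloaded snapshot gives a quick local signal; a pattern-based scan can miss tokens in other formats, so it complements rather than replaces the audit described in `RELEASE_AUDIT.md`.
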