BERTose and AFFINose Training-Code Manifest

Updated: 2026-06-10
Public repository: https://huggingface.co/supanthadey1/bertose-affinose-training-code
Current public contents: 130 files, approximately 54 MB
Scope: training, benchmarking, probe, split/vocabulary, and provenance code needed to understand and reproduce the released BERTose and AFFINose workflows.

This manifest describes the files actually present in the public Hugging Face training-code repository. Large corpora, intermediate mapping files, checkpoints, generated figures, and full result bundles are intentionally not bundled here; the released checkpoints and inference notebook are linked from the repository README.

Public Naming

Some executable files keep historical development names so that the code remains traceable to the original training logs. The names used in public-facing text map to the historical code locations as follows:

| Public name | Historical code locations |
| --- | --- |
| BERTose glycan encoder / multimodal pretraining | code/model/, code/training/, configs/ |
| BERTose IAR resolver / contrastive refinement | code/contrastive/, code/contrastive_training/, code/tokenizer/ |
| AFFINose protein-glycan interaction model | code/affinose/README.md, code/bertint/ |

Entry Points

| File | Purpose |
| --- | --- |
| README.md | Reviewer/user-facing overview, repository map, checkpoint links, install notes, and quick import check |
| MANIFEST.md | This file; exact public package map |
| RELEASE_AUDIT.md | Public-release checks, repaired gaps, verification results, and known scope limits |
| LICENSE | Apache License 2.0 terms for the BERTose/AFFINose code, notebooks and released model artifacts |
| requirements.txt | Core Python dependencies plus optional analysis/probe packages |
| SHA256SUMS | Checksums for every public file except SHA256SUMS itself |
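
The SHA256SUMS entry can be checked locally after downloading the repository. The following is a minimal sketch, assuming the file uses the usual `sha256sum` layout of one `<hex digest>  <relative path>` pair per line; that layout and the helper names `sha256_of` and `verify_sha256sums` are assumptions made for illustration, not part of the released code.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(root: Path, sums_name: str = "SHA256SUMS") -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    Assumes each non-empty line looks like "<hex digest>  <relative path>".
    """
    failures = []
    for line in (root / sums_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.partition(" ")
        rel_path = rel_path.strip().lstrip("*")  # "*" marks binary mode
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Calling `verify_sha256sums(Path("path/to/snapshot"))` returns an empty list when every listed file is present and matches its digest.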

Workflow Map

| Workflow | Main files |
| --- | --- |
| BERTose multimodal pretraining | code/training/train_multimodal.py, code/training/multimodal_dataset.py, code/training/multimodal_masking.py, code/training/masking.py, configs/multimodal_config_v5b_excluded.yaml |
| WURCS-BPE tokenizer training | code/training/train_wurcs_bpe.py, code/model/wurcs_bpe_tokenizer.py, data/vocab/ |
| Ambiguity/IAR contrastive refinement | code/contrastive/contrastive_trainer_v51_FINAL.py, code/contrastive/generate_negatives_v4_FINAL.py, code/contrastive/mcns_filter_75k_v3_FINAL.py, code/contrastive_training/step3_train_v51_FIXED_FINAL.sh |
| AFFINose interaction training | code/affinose/README.md, code/bertint/build_combined_dataset.py, code/bertint/generate_glycan_splits.py, code/bertint/training_v8.py, code/bertint/bertint_v8.py, code/bertint/dataset_v8.py |
| Benchmark reproduction | code/benchmarks/, code/downstream_tasks/utils/ |
| Embedding and biology probes | code/probes/, code/probes/cluster_scripts/ |
| Provenance and compute context | provenance/, provenance/compute_provenance/ |

Version Selection

| Version label | Public role | Config or script | Checkpoint handling |
| --- | --- | --- | --- |
| v3 ORIGINAL | Legacy baseline pre-BPE | configs/multimodal_config_v3_ORIGINAL.yaml | Not bundled |
| v4 BPE | First BPE model | configs/multimodal_config_v4_bpe.yaml | Not bundled |
| v5-A IPA | IPA self-distillation | configs/multimodal_config_v5a_bpe_topo.yaml, code/training/ipa_bpe_distillation.py | Not bundled |
| v5b excluded | Production leakage-proof BERTose encoder | configs/multimodal_config_v5b_excluded.yaml | Released separately as supanthadey1/bertose-glycan-encoder |
| v5.1 contrastive | Production BERTose IAR resolver; historically called the contrastive/V6 model in development notes | code/contrastive/contrastive_trainer_v51_FINAL.py | Released separately as supanthadey1/bertose-iar-resolver |
| AFFINose interaction model | Protein-glycan interaction prediction | code/bertint/training_v8.py | Released separately as supanthadey1/affinose-interaction-model |
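
For scripting, the version table can be transcribed into a small lookup. The sketch below is illustrative only: the short keys (`v3`, `v5a`, `affinose`, and so on), the `Release` tuple, and the `describe` helper are hypothetical names introduced here, not identifiers from the codebase. The paths and repository names are copied from the table.

```python
from typing import NamedTuple, Optional, Tuple


class Release(NamedTuple):
    files: Tuple[str, ...]           # config or script paths from the table
    checkpoint_repo: Optional[str]   # None where the table says "Not bundled"


# Keys are illustrative short labels chosen for this sketch.
RELEASES = {
    "v3": Release(("configs/multimodal_config_v3_ORIGINAL.yaml",), None),
    "v4": Release(("configs/multimodal_config_v4_bpe.yaml",), None),
    "v5a": Release(
        ("configs/multimodal_config_v5a_bpe_topo.yaml",
         "code/training/ipa_bpe_distillation.py"),
        None,
    ),
    "v5b": Release(("configs/multimodal_config_v5b_excluded.yaml",),
                   "supanthadey1/bertose-glycan-encoder"),
    "v5.1": Release(("code/contrastive/contrastive_trainer_v51_FINAL.py",),
                    "supanthadey1/bertose-iar-resolver"),
    "affinose": Release(("code/bertint/training_v8.py",),
                        "supanthadey1/affinose-interaction-model"),
}


def describe(label: str) -> str:
    """Summarise where a version's entry files and checkpoint live."""
    release = RELEASES[label]
    where = release.checkpoint_repo or "not bundled"
    return f"{label}: {', '.join(release.files)} (checkpoint: {where})"
```

For example, `describe("v5b")` names the production config together with the separately released encoder repository.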

Included Directory Summary

| Directory | Included content |
| --- | --- |
| code/model/ | Core BERTose model, multimodal model, classifier heads, tokenizer helpers, dataset helpers |
| code/training/ | Pretraining, tokenizer training, IPA distillation, masking, multimodal dataset and masking dependencies |
| code/contrastive/ | IAR/contrastive trainer, negative generation, MCNS filtering, difficulty scoring, resolved-glycan extraction |
| code/contrastive_training/ | Representative contrastive launch script |
| code/bertint/ | Historical AFFINose implementation files for data construction, splits, model, dataset, training, inference, and ablations |
| code/affinose/ | Public AFFINose entrypoint explaining the historical bertint directory |
| code/benchmarks/ | Fine-tuning, benchmark consolidation, ranking calculation, result extraction, exclusion-dataset construction |
| code/downstream_tasks/utils/ | Package-local benchmark/probe utilities and tokenizer/dataset dependencies |
| code/probes/ | Probe and embedding-analysis scripts plus representative cluster launch scripts |
| code/data_processing/ | Pretraining-data augmentation and IUPAC/GlyCosmos conversion utilities |
| code/tokenizer/ | Ambiguity-token generation helper |
| configs/ | Four BERTose multimodal training configs |
| data/splits/ | Leakage-exclusion list and train/validation split metadata |
| data/vocab/ | BPE, character, ambiguity, confidence, and MS vocabulary assets |
| provenance/ | Source, tokenizer, model-lineage, and compute-provenance notes |

Included Data Assets

| File | Notes |
| --- | --- |
| data/vocab/bpe_vocabulary_clean.json | Production BPE vocabulary |
| data/vocab/bpe_vocabulary.json | Original BPE vocabulary before cleanup |
| data/vocab/vocabulary.json | Character-level vocabulary |
| data/vocab/ms_vocabulary.json | Mass-spectrometry fragment-token vocabulary |
| data/vocab/bpe_ambiguity_tokens.json | BPE ambiguity-token map |
| data/vocab/ambiguity_tokens_CORRECTED.json | Corrected ambiguity-token map |
| data/vocab/confidence_analysis.json | IPA confidence-analysis output used by release scripts |
| data/splits/train_val_split.json | BERTose train/validation split metadata |
| data/splits/complete_exclusion_list.txt | Leakage-proof exclusion list |
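
As a rough illustration of how an exclusion list like data/splits/complete_exclusion_list.txt might be applied, the sketch below filters a stream of records against it. It assumes the file holds one excluded entry per line, which this manifest does not confirm; `load_exclusions` and `drop_excluded` are hypothetical helpers, not functions from the released code.

```python
from pathlib import Path
from typing import Iterable, Iterator


def load_exclusions(path: Path) -> set[str]:
    """Read one excluded entry per line (assumed format), skipping blanks."""
    with path.open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def drop_excluded(records: Iterable[str], excluded: set[str]) -> Iterator[str]:
    """Yield only the records that do not appear in the exclusion set."""
    for record in records:
        if record.strip() not in excluded:
            yield record
```

Loading the list into a set keeps each membership test constant-time, which matters when the training corpus is much larger than the list itself.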

Not Included Here

The public training-code repository intentionally excludes artifacts that are either too large for a compact code companion or hosted in separate public repositories:

  • Full pretraining corpus pickles such as sequences_bpe.pkl and sequences_bpe_excluded.pkl.
  • Multi-GB intermediate mapping files such as full multimodal/master mappings.
  • Full model checkpoints; use the linked model repositories instead.
  • ESM-C protein embeddings required for AFFINose training; users should generate or provide those according to ESM-C access rules.
  • Generated manuscript figures, full probe result bundles, and large benchmark output folders; processed summaries/source data are handled separately for manuscript submission.

Verification Status

The public release was checked by reading it back from Hugging Face without an access token. The following checks passed:

  • The repository is public and ungated.
  • The remote file list exactly matches the local staged release.
  • Model card license metadata is apache-2.0, and the repository includes an Apache-2.0 LICENSE file.
  • A no-token snapshot download succeeds.
  • SHA256SUMS verifies every downloaded file.
  • No strings resembling Hugging Face tokens were found.
  • No .secrets, __pycache__, .pyc, or .DS_Store artifacts were present in the uploaded release.
  • The README quick import check passes from the downloaded public snapshot.

These checks verify packaging, public access, checksum integrity, token hygiene, and import completeness for the released code surface. They do not claim that a full multi-GPU retraining run can be launched without the large excluded corpora, embeddings, and compute resources.
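
The artifact and token-hygiene checks listed above could be approximated with a scan like the one below. This is a sketch under stated assumptions, not the audit script used for the release: the `hf_` prefix pattern and its length threshold are guesses at what a token-looking string is, and `scan_release` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumption: token-like strings start with "hf_" followed by a long
# alphanumeric run. The minimum length of 20 is a guess, not a specification.
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{20,}")
FORBIDDEN_NAMES = {".secrets", "__pycache__", ".DS_Store"}
FORBIDDEN_SUFFIXES = {".pyc"}


def scan_release(root: Path) -> dict[str, list[str]]:
    """Collect forbidden artifacts and token-like strings found under root."""
    findings: dict[str, list[str]] = {"artifacts": [], "token_like": []}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.name in FORBIDDEN_NAMES or path.suffix in FORBIDDEN_SUFFIXES:
            findings["artifacts"].append(rel)
        if path.is_file():
            # Decode leniently so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if TOKEN_PATTERN.search(text):
                findings["token_like"].append(rel)
    return findings
```

A clean snapshot yields two empty lists. Any entry under `artifacts` or `token_like` points to a file worth inspecting by hand.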