	# BERTose and AFFINose Training-Code Manifest

	Updated: 2026-06-10
	Public repository: `https://huggingface.co/supanthadey1/bertose-affinose-training-code`
	Current public contents: 130 files, approximately 54 MB
	Scope: training, benchmarking, probe, split/vocabulary, and provenance code needed to understand and reproduce the released BERTose and AFFINose workflows.

	This manifest describes the files actually present in the public Hugging Face training-code repository. Large corpora, intermediate mapping files, checkpoints, generated figures, and full result bundles are intentionally not bundled here; the released checkpoints and inference notebook are linked from the repository README.
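A minimal sketch of fetching the public snapshot described above, assuming the `huggingface_hub` package is installed. The function name `download_public_snapshot` and the `local_dir` argument are illustrative choices, not part of the repository. Passing `token=False` asks the client not to send any stored credential, so the call should only succeed if the repository is public and ungated.

```python
from pathlib import Path

# Repository id taken from the public URL in this manifest.
REPO_ID = "supanthadey1/bertose-affinose-training-code"


def download_public_snapshot(local_dir: str) -> Path:
    """Download every public file into local_dir without sending an access token."""
    # Imported lazily so this module can be read without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        token=False,  # do not send a stored token, even if one is configured locally
    )
    return Path(path)
```

Calling `download_public_snapshot("bertose_code")` would place the files under `bertose_code/`; the large corpora and checkpoints listed as excluded are not part of that download.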

	## Public Naming

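	Some executable files keep historical development names so that the code remains traceable to the original training logs. Public-facing text uses the names in the left column; the right column lists where the corresponding code lives under its historical names: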

	| Public name | Historical code locations |
	|---|---|
	| BERTose glycan encoder / multimodal pretraining | `code/model/`, `code/training/`, `configs/` |
	| BERTose IAR resolver / contrastive refinement | `code/contrastive/`, `code/contrastive_training/`, `code/tokenizer/` |
	| AFFINose protein-glycan interaction model | `code/affinose/README.md`, `code/bertint/` |

	## Entry Points

	| File | Purpose |
	|---|---|
	| `README.md` | Reviewer/user-facing overview, repository map, checkpoint links, install notes, and quick import check |
	| `MANIFEST.md` | This file; exact public package map |
	| `RELEASE_AUDIT.md` | Public-release checks, repaired gaps, verification results, and known scope limits |
	| `LICENSE` | Apache License 2.0 terms for the BERTose/AFFINose code, notebooks, and released model artifacts |
	| `requirements.txt` | Core Python dependencies plus optional analysis/probe packages |
	| `SHA256SUMS` | Checksums for every public file except `SHA256SUMS` itself |
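The `SHA256SUMS` entry can be checked against a downloaded snapshot with a short script. This is a sketch only: it assumes the file uses the common `sha256sum` layout (`<hex digest>  <relative path>` per line), which this manifest does not state, and the helper names `sha256_of` and `verify_checksums` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, sums_name: str = "SHA256SUMS") -> list[str]:
    """Return the relative paths that are missing or whose digest does not match."""
    failures = []
    for line in (root / sums_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<hex digest>  <relative path>" (sha256sum style).
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary mode with a leading "*"
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from `verify_checksums(Path("bertose_code"))` would mean every listed file is present and matches. Because `SHA256SUMS` does not list itself, its own integrity is not covered by this check.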

	## Workflow Map

	| Workflow | Main files |
	|---|---|
	| BERTose multimodal pretraining | `code/training/train_multimodal.py`, `code/training/multimodal_dataset.py`, `code/training/multimodal_masking.py`, `code/training/masking.py`, `configs/multimodal_config_v5b_excluded.yaml` |
	| WURCS-BPE tokenizer training | `code/training/train_wurcs_bpe.py`, `code/model/wurcs_bpe_tokenizer.py`, `data/vocab/` |
	| Ambiguity/IAR contrastive refinement | `code/contrastive/contrastive_trainer_v51_FINAL.py`, `code/contrastive/generate_negatives_v4_FINAL.py`, `code/contrastive/mcns_filter_75k_v3_FINAL.py`, `code/contrastive_training/step3_train_v51_FIXED_FINAL.sh` |
	| AFFINose interaction training | `code/affinose/README.md`, `code/bertint/build_combined_dataset.py`, `code/bertint/generate_glycan_splits.py`, `code/bertint/training_v8.py`, `code/bertint/bertint_v8.py`, `code/bertint/dataset_v8.py` |
	| Benchmark reproduction | `code/benchmarks/`, `code/downstream_tasks/utils/` |
	| Embedding and biology probes | `code/probes/`, `code/probes/cluster_scripts/` |
	| Provenance and compute context | `provenance/`, `provenance/compute_provenance/` |

	## Version Selection

	| Version label | Public role | Config or script | Checkpoint handling |
	|---|---|---|---|
	| v3 ORIGINAL | Legacy baseline pre-BPE | `configs/multimodal_config_v3_ORIGINAL.yaml` | Not bundled |
	| v4 BPE | First BPE model | `configs/multimodal_config_v4_bpe.yaml` | Not bundled |
	| v5-A IPA | IPA self-distillation | `configs/multimodal_config_v5a_bpe_topo.yaml`, `code/training/ipa_bpe_distillation.py` | Not bundled |
	| v5b excluded | Production leakage-proof BERTose encoder | `configs/multimodal_config_v5b_excluded.yaml` | Released separately as `supanthadey1/bertose-glycan-encoder` |
	| v5.1 contrastive | Production BERTose IAR resolver; historically called the contrastive/V6 model in development notes | `code/contrastive/contrastive_trainer_v51_FINAL.py` | Released separately as `supanthadey1/bertose-iar-resolver` |
	| AFFINose interaction model | Protein-glycan interaction prediction | `code/bertint/training_v8.py` | Released separately as `supanthadey1/affinose-interaction-model` |
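For scripts that need to pick a checkpoint by version label, the table above can be mirrored as a small lookup. This is an illustrative sketch rather than code from the repository; the names `CHECKPOINT_REPOS` and `checkpoint_repo` are hypothetical, and the mapping simply restates the "Checkpoint handling" column.

```python
from typing import Optional

# Version label -> released checkpoint repository; None where the table says "Not bundled".
CHECKPOINT_REPOS: dict[str, Optional[str]] = {
    "v3 ORIGINAL": None,
    "v4 BPE": None,
    "v5-A IPA": None,
    "v5b excluded": "supanthadey1/bertose-glycan-encoder",
    "v5.1 contrastive": "supanthadey1/bertose-iar-resolver",
    "AFFINose interaction model": "supanthadey1/affinose-interaction-model",
}


def checkpoint_repo(version_label: str) -> str:
    """Return the released checkpoint repository id for a version label."""
    repo = CHECKPOINT_REPOS[version_label]  # raises KeyError for labels not in the table
    if repo is None:
        raise ValueError(f"{version_label!r} has no released checkpoint (not bundled)")
    return repo
```

Here `checkpoint_repo("v5b excluded")` returns the encoder repository id, while a legacy label such as `"v4 BPE"` raises `ValueError` instead of silently falling back to another model.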

	## Included Directory Summary

	| Directory | Included content |
	|---|---|
	| `code/model/` | Core BERTose model, multimodal model, classifier heads, tokenizer helpers, dataset helpers |
	| `code/training/` | Pretraining, tokenizer training, IPA distillation, masking, multimodal dataset and masking dependencies |
	| `code/contrastive/` | IAR/contrastive trainer, negative generation, MCNS filtering, difficulty scoring, resolved-glycan extraction |
	| `code/contrastive_training/` | Representative contrastive launch script |
	| `code/bertint/` | Historical AFFINose implementation files for data construction, splits, model, dataset, training, inference, and ablations |
	| `code/affinose/` | Public AFFINose entrypoint explaining the historical `bertint` directory |
	| `code/benchmarks/` | Fine-tuning, benchmark consolidation, ranking calculation, result extraction, exclusion-dataset construction |
	| `code/downstream_tasks/utils/` | Package-local benchmark/probe utilities and tokenizer/dataset dependencies |
	| `code/probes/` | Probe and embedding-analysis scripts plus representative cluster launch scripts |
	| `code/data_processing/` | Pretraining-data augmentation and IUPAC/GlyCosmos conversion utilities |
	| `code/tokenizer/` | Ambiguity-token generation helper |
	| `configs/` | Four BERTose multimodal training configs |
	| `data/splits/` | Leakage-exclusion list and train/validation split metadata |
	| `data/vocab/` | BPE, character, ambiguity, confidence, and MS vocabulary assets |
	| `provenance/` | Source, tokenizer, model-lineage, and compute-provenance notes |

	## Included Data Assets

	| File | Notes |
	|---|---|
	| `data/vocab/bpe_vocabulary_clean.json` | Production BPE vocabulary |
	| `data/vocab/bpe_vocabulary.json` | Original BPE vocabulary before cleanup |
	| `data/vocab/vocabulary.json` | Character-level vocabulary |
	| `data/vocab/ms_vocabulary.json` | Mass-spectrometry fragment-token vocabulary |
	| `data/vocab/bpe_ambiguity_tokens.json` | BPE ambiguity-token map |
	| `data/vocab/ambiguity_tokens_CORRECTED.json` | Corrected ambiguity-token map |
	| `data/vocab/confidence_analysis.json` | IPA confidence-analysis output used by release scripts |
	| `data/splits/train_val_split.json` | BERTose train/validation split metadata |
	| `data/splits/complete_exclusion_list.txt` | Leakage-proof exclusion list |
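Before launching any workflow, it can help to confirm that the data assets listed above are present in a local checkout and that the JSON files parse. The sketch below makes no assumption about the internal schema of these files, which this manifest does not describe; `EXPECTED_ASSETS` and `check_assets` are hypothetical names.

```python
import json
from pathlib import Path

# Paths copied from the "Included Data Assets" table, relative to the repository root.
EXPECTED_ASSETS = (
    "data/vocab/bpe_vocabulary_clean.json",
    "data/vocab/bpe_vocabulary.json",
    "data/vocab/vocabulary.json",
    "data/vocab/ms_vocabulary.json",
    "data/vocab/bpe_ambiguity_tokens.json",
    "data/vocab/ambiguity_tokens_CORRECTED.json",
    "data/vocab/confidence_analysis.json",
    "data/splits/train_val_split.json",
    "data/splits/complete_exclusion_list.txt",
)


def check_assets(root: Path) -> dict[str, str]:
    """Map each problematic asset path to a short reason; an empty dict means all is well."""
    problems = {}
    for rel in EXPECTED_ASSETS:
        path = root / rel
        if not path.is_file():
            problems[rel] = "missing"
        elif path.suffix == ".json":
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                problems[rel] = "invalid JSON"
    return problems
```

This only checks presence and well-formedness. It does not confirm that a vocabulary matches the tokenizer a given config expects.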

	## Not Included Here

	The public training-code repository intentionally excludes artifacts that are either too large for a compact code companion or hosted in separate public repositories:

	- Full pretraining corpus pickles such as `sequences_bpe.pkl` and `sequences_bpe_excluded.pkl`.
	- Multi-GB intermediate mapping files such as full multimodal/master mappings.
	- Full model checkpoints; use the linked model repositories instead.
	- ESM-C protein embeddings required for AFFINose training; users should generate or provide those according to ESM-C access rules.
	- Generated manuscript figures, full probe result bundles, and large benchmark output folders; processed summaries/source data are handled separately for manuscript submission.

	## Verification Status

	The public release was checked by reading it back from Hugging Face without an access token (a no-token readback). The following checks passed:

	- The repository is public and ungated.
	- The remote file list exactly matches the local staged release.
	- Model card license metadata is `apache-2.0`, and the repository includes an Apache-2.0 `LICENSE` file.
	- A no-token snapshot download succeeds.
	- `SHA256SUMS` verifies every downloaded file.
	- No Hugging Face token-looking strings were found.
	- No `.secrets`, `__pycache__`, `.pyc`, or `.DS_Store` artifacts were present in the uploaded release.
	- The README quick import check passes from the downloaded public snapshot.

	These checks verify packaging, public access, checksum integrity, token hygiene, and import completeness for the released code surface. They do not claim that a full multi-GPU retraining run can be launched without the large excluded corpora, embeddings, and compute resources.
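The artifact and token-hygiene checks listed above could be approximated with a scan like the following. This is a sketch under stated assumptions, not the audit script actually used for the release: the regular expression assumes Hugging Face user access tokens begin with `hf_` followed by a long alphanumeric run, and `scan_release` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Names and suffixes the verification list says must not appear in the release.
FORBIDDEN_NAMES = {".secrets", "__pycache__", ".DS_Store"}
FORBIDDEN_SUFFIXES = {".pyc"}
# Assumption: token-looking strings start with "hf_" plus at least 30 alphanumerics.
TOKEN_PATTERN = re.compile(rb"hf_[A-Za-z0-9]{30,}")


def scan_release(root: Path) -> dict[str, list[str]]:
    """Collect forbidden artifacts and files that contain token-looking strings."""
    findings = {"artifacts": [], "token_like": []}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if path.name in FORBIDDEN_NAMES or path.suffix in FORBIDDEN_SUFFIXES:
            findings["artifacts"].append(rel)
        elif path.is_file() and TOKEN_PATTERN.search(path.read_bytes()):
            findings["token_like"].append(rel)
    return findings
```

Both lists being empty for a downloaded snapshot is consistent with the hygiene checks above. A pattern match is only a heuristic, so a clean scan does not prove that no credential of another form is present.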
