How I pre-trained an MS/MS model from scratch
Tandem mass spectrometry (MS/MS) is one of the most widely used tools for molecular characterization, but it is built on a deep and uncomfortable fact: a spectrum does not uniquely determine a molecule. That single constraint shapes nearly the entire field. A spectrum contains real chemical information, but it is only a partial projection of the molecule that produced it. Multiple chemically plausible structures can yield similar fragmentation patterns. This means the problem is not solved by scale alone. More data, more compute, or larger models do not eliminate ambiguity. The real question is whether machine learning can recover meaningful structure from the available signal, and where the boundary lies between what can be learned, what can be ranked, and what remains fundamentally uncertain.
This project began with a simple objective: determine whether recovering meaningful structure from spectra was actually possible in practice. Existing work, particularly around DreaMS-style systems, suggested that self-supervised spectral transformers could learn chemically meaningful representations. But reproducing a result and independently building, training, validating, and understanding a full system are fundamentally different tasks. The goal here was the latter. I wanted to construct the entire pipeline under my own control, understand its behavior, and push it until it broke.
That decision turned a reproduction effort into a full systems project. The result was an end-to-end machine learning stack for large-scale MS/MS representation learning, structure alignment, candidate retrieval, inference, and atlas construction. The dataset reached approximately 580GB of processed spectra organized into deterministic shards. Training covered roughly 201 million spectra across three pretraining phases. The model was a compact encoder-only transformer with a structure-alignment path using RDKit-derived Morgan fingerprints. The inference layer used Qdrant for embedding indexing and neighborhood exploration at scales up to five million spectra. Surrounding this core, I built inspection tools, family analysis pipelines, outlier detection systems, and evaluation harnesses. The goal was not just performance, but observability: understanding what the model knows, what it fails to capture, and where those failures occur.
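For context on the structure-alignment path: a Morgan fingerprint is a fixed-width bit vector encoding a molecule's local atomic environments, and it serves as the structural target the spectral embedding is aligned to. Below is a minimal sketch of computing such a target with RDKit; the radius and bit width are illustrative defaults, not necessarily the project's settings.

```python
# Minimal sketch: SMILES -> Morgan fingerprint target with RDKit.
# Radius and bit width here are illustrative assumptions.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_target(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Convert a SMILES string into a binary Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

print(morgan_target("CCO")[:16])  # ethanol, first 16 bits of the target
```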
One of the earliest lessons was that in scientific machine learning, the model is often the least interesting component. The architecture matters, but it sits atop a much larger system. If the data contract is flawed, the model can appear correct while being meaningless. If the runtime is unstable, failures may have nothing to do with learning. If evaluation is weak, performance can be misleading. The real achievement was not just training a model that worked, but constructing a system where outputs could be trusted enough to interpret.
The data system illustrates this clearly. Rather than treating the dataset as a monolithic blob, it was structured into deterministic, versioned shards. The final dataset consisted of 338 parquet shards totaling approximately 580GB, each around 2GB and containing roughly 2 million spectra. This was a deliberate design choice balancing I/O efficiency, scheduling flexibility, and evaluation clarity. More importantly, it allowed training runs to be defined explicitly as lists of shards, improving reproducibility and interpretability.
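To make the shard contract concrete: a training run can be fully specified by an explicit, ordered list of shard files. The manifest format and file names below are hypothetical, but they illustrate the idea.

```python
# Hypothetical run manifest: a training run is an explicit list of shards.
import json
import pyarrow.parquet as pq

def load_run_shards(manifest_path: str) -> list[str]:
    """Read the ordered shard list that defines a training run."""
    with open(manifest_path) as f:
        return json.load(f)["shards"]  # e.g. ["shard_0000.parquet", ...]

def iter_shards(shard_paths: list[str]):
    """Stream shards one at a time; each is a self-contained parquet file."""
    for path in shard_paths:
        yield path, pq.read_table(path)
```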
This design also mattered scientifically. MS/MS data is both redundant and variable. The same molecule can appear many times, measured under different conditions that produce different spectra. This redundancy is not noise; it is signal. It provides repeated views of the same underlying structure, making the data well-suited for representation learning if handled correctly. By enforcing shard-level splits instead of random sampling, leakage was minimized and generalization was meaningfully tested.
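A shard-level split can be made deterministic by hashing the shard name rather than sampling individual spectra, so that every spectrum in a shard lands on the same side. A minimal sketch, with the validation fraction as an assumed parameter:

```python
# Minimal sketch of a deterministic shard-level split.
import hashlib

def shard_split(shard_name: str, val_fraction: float = 0.05) -> str:
    """Assign a whole shard to train or val; spectra never straddle the split."""
    digest = hashlib.sha256(shard_name.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # deterministic in [0, 1)
    return "val" if bucket < val_fraction else "train"

# The same shard name always yields the same assignment across runs.
assert shard_split("shard_0001.parquet") == shard_split("shard_0001.parquet")
```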
Training was conducted in three phases (V1 through V3), each covering approximately 66–67 million spectra. Each phase reused model weights from the previous one but reset the optimizer state. This allowed a clean test of whether improvements were due to genuine representation learning or simply optimizer trajectory effects. The result was clear: representation quality continued improving across phases. Loss decreased, gradients stabilized, and embedding variance remained healthy, indicating no collapse. The final checkpoint represented a true foundation model within the scope of the project.
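In PyTorch terms, the phase transition amounts to loading the previous weights while constructing the optimizer from scratch. A minimal sketch, with the checkpoint layout and learning rate assumed:

```python
# Minimal sketch: carry weights across phases, reset optimizer state.
import torch

def start_next_phase(model: torch.nn.Module, prev_ckpt: str, lr: float = 1e-4):
    state = torch.load(prev_ckpt, map_location="cpu")
    model.load_state_dict(state["model"])  # reuse learned representations
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # fresh moments
    return model, optimizer  # deliberately no optimizer state restored
```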
Evaluation followed a layered strategy: heavy training, light evaluation, and deep evaluation. Light evaluation provided real-time feedback during training, while deep evaluation served as a promotion gate requiring full inference-path validation on held-out shards. This approach maintained throughput while enforcing rigorous validation standards.
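The promotion gate reduces to a threshold check over deep-evaluation metrics. The metric names and thresholds in this sketch are assumptions, but they show the shape of the contract:

```python
# Hypothetical promotion gate: metric names and thresholds are assumptions.
def promote_checkpoint(deep_eval: dict,
                       max_structure_loss: float = 1.0,
                       min_fp_similarity: float = 0.5) -> bool:
    """Deep evaluation on held-out shards gates checkpoint promotion."""
    return (deep_eval["structure_loss"] <= max_structure_loss
            and deep_eval["fingerprint_similarity"] >= min_fp_similarity)
```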
On the systems side, stability proved more important than scale. Training ran on two RTX PRO 6000 GPUs with BF16 precision, but the key was not hardware scale—it was reliability. A single-process dual-device execution path was used to avoid distributed system fragility. Preflight checks ensured correct GPU visibility, data readiness, and container integrity. Once stabilized, the system became predictably efficient, with minimal dataloader overhead and consistent throughput.
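The preflight checks are simple by design: fail fast before any training step runs. A sketch of the kind of assertions involved, using standard PyTorch calls (the specific checks are illustrative):

```python
# Illustrative preflight checks: fail fast before training starts.
import os
import torch

def preflight(expected_gpus: int = 2, shard_dir: str = "data/shards") -> None:
    assert torch.cuda.is_available(), "CUDA not visible inside the container"
    found = torch.cuda.device_count()
    assert found == expected_gpus, f"expected {expected_gpus} GPUs, found {found}"
    assert torch.cuda.is_bf16_supported(), "BF16 unsupported on this hardware"
    assert os.path.isdir(shard_dir) and os.listdir(shard_dir), "no shards found"
```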
A critical turning point occurred after V1–V3. A deeper audit revealed that the training data lacked usable molecular labels such as SMILES or InChIKeys. This meant that while the model learned strong spectral representations, any chemical interpretation from those phases would have been invalid. Because the system was observable, this issue could not be ignored.
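This is exactly the kind of problem a shard-level audit surfaces. A minimal sketch of such an audit, assuming parquet shards and hypothetical label column names:

```python
# Minimal label audit sketch; column names are hypothetical.
import pyarrow.parquet as pq

def audit_labels(shard_path: str, label_cols=("smiles", "inchikey")) -> dict:
    """Report what fraction of a shard carries a usable molecular label."""
    schema = pq.read_schema(shard_path)
    present = [c for c in label_cols if c in schema.names]
    if not present:
        return {"shard": shard_path, "labeled_fraction": 0.0}
    table = pq.read_table(shard_path, columns=present)
    rows = max(table.num_rows, 1)
    fracs = [1 - table.column(c).null_count / rows for c in present]
    return {"shard": shard_path, "labeled_fraction": max(fracs)}
```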
This led to a restructuring of the project. V1–V3 became the foundation-model phase, while molecular grounding was deferred to a corrected labeled phase, V26. In V26, labeled data with RDKit-derived targets was used to train structure-alignment heads. The results showed strong improvement: structure loss decreased, fingerprint similarity increased, and embeddings remained stable. Importantly, candidate-bank decoding recovered the correct molecular identity in 11 out of 20 validation examples. This is not full structure identification, but it is strong evidence that meaningful chemical signal is present in the representation.
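A structure-alignment head can be as small as a linear projection from the spectral embedding into fingerprint space, trained against the RDKit-derived target. The sketch below assumes a cosine alignment loss; dimensions and loss choice are illustrative:

```python
# Minimal structure-alignment head sketch; dims and loss are assumptions.
import torch
import torch.nn.functional as F

class StructureHead(torch.nn.Module):
    def __init__(self, embed_dim: int = 256, fp_bits: int = 2048):
        super().__init__()
        self.proj = torch.nn.Linear(embed_dim, fp_bits)

    def forward(self, spec_embed: torch.Tensor, fp_target: torch.Tensor) -> torch.Tensor:
        pred = self.proj(spec_embed)
        # Structure loss: 1 - cosine similarity to the Morgan target.
        return (1 - F.cosine_similarity(pred, fp_target, dim=-1)).mean()
```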
From this, the first major conclusion emerges: MS/MS spectra contain enough information to learn meaningful chemical structure. However, the second conclusion is more nuanced. While structure is learnable, ranking remains weak. The model often encodes the correct answer, but scoring mechanisms fail to consistently extract it. This creates a representation–decision gap.
This gap is best described as weak determination. The data supports candidate narrowing and top-k retrieval, but not robust top-1 ranking. Benchmark results reinforce this: the model outperforms simpler baselines but falls short of state-of-the-art systems. The limitation is not purely architectural; it reflects the intrinsic ambiguity of the data.
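Given a candidate-scoring matrix, the gap is easy to quantify: top-k hit rate stays high while top-1 accuracy lags. A minimal sketch, assuming scores of shape (queries, candidates) and known ground-truth indices:

```python
# Minimal sketch of measuring the top-1 vs top-k gap; inputs are assumed.
import numpy as np

def hit_rate(scores: np.ndarray, truth: np.ndarray, k: int) -> float:
    """Fraction of queries whose true candidate scores in the top k."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([t in row for t, row in zip(truth, topk)]))

# Weak determination looks like: hit_rate(s, t, 10) >> hit_rate(s, t, 1)
```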
This also explains why de novo structure generation remains difficult. It compounds structure learning, candidate generation, ranking, and uncertainty into a single problem. If ranking is already unstable within a constrained candidate set, unconstrained generation becomes significantly harder.
Given this, the system is best understood not as a molecular oracle, but as a candidate-narrowing and evidence-aggregation engine. The inference layer supports large-scale embedding, indexing, and atlas construction. At smaller scales, detailed inspection is possible. At larger scales, density-based representations and cluster summaries become necessary. This shift highlights a practical insight: beyond a certain point, raw data visualization becomes less useful than structured abstractions.
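A minimal sketch of the Qdrant path, assuming a running Qdrant instance; the collection name, vector size, and payload fields are illustrative, not the project's configuration:

```python
# Illustrative Qdrant usage: index embeddings, then explore a neighborhood.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="spectra",
    vectors_config=VectorParams(size=256, distance=Distance.COSINE),
)
client.upsert(
    collection_name="spectra",
    points=[PointStruct(id=0, vector=[0.0] * 256, payload={"shard": "shard_0000"})],
)
# Nearest neighbors of a query embedding define its spectral neighborhood.
hits = client.search(collection_name="spectra", query_vector=[0.0] * 256, limit=10)
```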
Tools like family mapping and outlier detection transform the model into a reasoning system. Instead of asking “what is this molecule,” the system helps answer “what neighborhood does this belong to” and “what should be investigated next.” This aligns more closely with real-world analytical workflows.
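Outlier detection in embedding space can be as simple as measuring how far a spectrum sits from its nearest neighbors: a spectrum with no close family is flagged for investigation. The sketch below is one plausible form of that tooling, not the project's implementation:

```python
# One plausible outlier score: mean cosine distance to k nearest neighbors.
import numpy as np

def outlier_scores(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T          # pairwise cosine distances
    np.fill_diagonal(dist, np.inf)          # exclude self-matches
    knn = np.sort(dist, axis=1)[:, :k]      # k closest neighbors per spectrum
    return knn.mean(axis=1)                 # high score = isolated spectrum
```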
Ultimately, the most important output of this project is not the model, but the system that makes the model interpretable. Deterministic data pipelines, explicit evaluation contracts, audit mechanisms, and inference tools together form a complete, reproducible framework for structure-aware MS/MS learning.
The key conclusions are straightforward. Structure is learnable. Ranking is weakly determined. Confidence remains unresolved. De novo generation is still the hardest frontier because it compounds all unresolved components.
This work is being released not as a solution to MS/MS, but as a clear map of what works, what does not, and where the boundary lies. If there is a central takeaway, it is this: in scientific machine learning, the critical question is not whether the model works, but what must be built around it for its outputs to be meaningful.
