
Mass Spectrometry AlphaFold Protein Design Materials ML ML for Science

Machine learning for chemistry reports strong benchmark numbers across spectral models, molecular property prediction, materials discovery, protein design, and chemistry LLMs. We audit how many of those claims hold up outside the benchmark, and test specific geometric methods for predicting cross-instrument failure.


Heterogeneous Validity in ML for Chemistry

Heterogeneous Validity in Machine Learning for Chemistry: A Claim-Level Audit of 52 Published Methods

ML for Science

We score 119 ML-for-chemistry claims across 12 families (including spectral foundation models, metabolomics biomarkers, molecular property prediction, reaction prediction, materials discovery, spectral matching, multi-omic integration, chemistry LLMs, docking/AlphaFold, and protein design) against 16 typed failure modes organized into four categories: data integrity, generalization, interpretation, and reproducibility. Of the scored claims, 68% are Benchmark-Bound: real on the reported benchmark but with transportability unresolved. Three entries earn Validated (AlphaFold2 for single-chain static structure, ProteinMPNN for inverse folding, and RFdiffusion for backbone generation), all validated on real experimental endpoints (CASP, X-ray, cryo-EM). Seventeen are Disconfirmed. The dominant unmeasured failure is external-validation collapse driven by batch/instrument confound and split-design leakage.

The audit adapts the Mechanistic Validity framework from interpretability to chemistry, demonstrating that the same validity structure applies across domains.


Spectral Transportability in Mass Spectrometry

Which Distance Metric Predicts Cross-Instrument MS/MS Degradation? A Systematic Comparison on MassBank Fingerprint Embeddings

ML for Science

Six distance metrics are compared on 2,140 MS/MS spectra from MassBank spanning 10 instrument types across three ionization families (ESI, EI, MALDI), yielding 45 pairwise instrument comparisons. Domain classifier AUC is the strongest predictor of spectral match degradation (rho = 0.51, p < 0.001). Geodesic distance on the Grassmannian predicts neither outcome: the PCA subspace differences are real (all 45 pairs exceed the permutation null at p < 0.001) but do not predict functional degradation.
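As a rough illustration of the Grassmannian comparison, the sketch below computes a geodesic distance between two PCA subspaces from their principal angles and checks it against a label-permutation null. It is a minimal sketch under stated assumptions, not the pipeline used in the study: the function names (`pca_basis`, `grassmann_geodesic`, `permutation_pvalue`), the subspace dimension `k`, and the number of permutations are all hypothetical, and the inputs are assumed to be embedding matrices with one row per spectrum.

```python
import numpy as np


def pca_basis(X, k):
    """Return a (d x k) orthonormal basis for the top-k PCA subspace of the rows of X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # The right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T


def grassmann_geodesic(A, B):
    """Geodesic distance between span(A) and span(B) for orthonormal A and B.

    The singular values of A^T B are the cosines of the principal angles;
    the distance is the Euclidean norm of those angles.
    """
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.sqrt(np.sum(angles ** 2)))


def permutation_pvalue(X1, X2, k, n_perm=999, seed=0):
    """Observed subspace distance and its p-value under shuffled instrument labels.

    n_perm and seed are illustrative defaults, not values from the study.
    """
    rng = np.random.default_rng(seed)
    observed = grassmann_geodesic(pca_basis(X1, k), pca_basis(X2, k))
    pooled = np.vstack([X1, X2])
    n1 = len(X1)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        d = grassmann_geodesic(
            pca_basis(pooled[idx[:n1]], k),
            pca_basis(pooled[idx[n1:]], k),
        )
        exceed += int(d >= observed)
    # Add-one correction: with a finite number of permutations the smallest
    # reportable p-value is 1 / (n_perm + 1), never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

With the add-one correction, a pair whose observed distance exceeds every permuted distance is reported as p = 1 / (n_perm + 1) rather than as zero, which is why a bound such as p < 0.001 is the more accurate way to state that result.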

The apparent signal in raw correlations is driven by the ionization-family boundary, not by gradual distributional differences a metric could usefully rank. Within the ESI family alone (21 pairs), no metric reaches p < 0.25. The negative result is informative: statistical distance and functional degradation are dissociated when the shift is categorical rather than continuous.
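The pooled-versus-within-family contrast can be sketched as follows. This is an illustration only: the instrument names, the 7/2/1 split across families, and the toy distance and degradation values are hypothetical and are not data from the study. The 7 ESI instruments are inferred from the 21 within-family pairs, since C(7, 2) = 21. The Spearman implementation is a plain standard-library version with average ranks for ties.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def within_family_pairs(family_of, family):
    """All unordered instrument pairs whose members both belong to `family`."""
    members = sorted(name for name, fam in family_of.items() if fam == family)
    return list(combinations(members, 2))


# Hypothetical instrument-to-family map: 10 instruments, 7 of them ESI.
family_of = {f"esi_{i}": "ESI" for i in range(7)}
family_of.update({"ei_0": "EI", "ei_1": "EI", "maldi_0": "MALDI"})

all_pairs = list(combinations(sorted(family_of), 2))
esi_pairs = within_family_pairs(family_of, "ESI")

# Toy numbers: within a family the metric does not rank degradation at all,
# but cross-family pairs are both far apart and badly degraded.
within_dist, within_deg = [1, 2, 3, 4], [0.2, 0.1, 0.1, 0.2]
cross_dist, cross_deg = [10, 11, 12, 13], [0.8, 0.9, 0.9, 0.8]

rho_pooled = spearman_rho(within_dist + cross_dist, within_deg + cross_deg)
rho_within = spearman_rho(within_dist, within_deg)
print(len(all_pairs), len(esi_pairs), rho_pooled, rho_within)
```

In the toy data the pooled correlation is clearly positive while the within-family correlation vanishes, which is the pattern a categorical family boundary produces: the metric separates the two groups but ranks nothing inside either one.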
