Skip to content
Jae Hoon Kim
Back to writing

First-Class Per-Site Importance in Materials: Attention-MIL on a Frozen Pretrained Backbone

Updated:
69 min read
Contents 27 sections

2026-05-16. Spent the morning trying to figure out what was new about the 2025 Sidhom LLM for cancer mutations. Four hours later I was deep in his GitHub — DeepTCR, then DeepTCR_Cancer, then ATGC out of the Adams lab.

I’m not an immunologist, so the biology I can’t really judge. What hooked me was the architecture. The same recipe kept showing up across problems that look nothing alike — amino acids, mutation contexts, cell morphologies — all running through the same stack with just the input alphabet swapped. And the code is genuinely readable. DeepTCR is one file you can walk through end-to-end, which is rare in ML-for-biology and made the whole thing feel like something I could lift somewhere else.

Which raises the question I’m chewing on here: if it’s already substrate-independent across biology, is materials science the next substrate?

Background

Two ideas do most of the work below.

The shape of the problem. A lot of scientific predictions look like this. You have a patient and a list of their thirty-odd somatic mutations; or a biopsy slide and a million image patches; or a doped catalyst and a list of defect sites. The label is on the whole thing. The data is a bag of smaller things. You don’t know up front which of the small things mattered. This is multiple-instance learning (MIL). The classic approach — hand- craft features per item, pool them, regress the bag-level label — throws away the question scientists actually want answered: which item drove the prediction.

The fix, ~2018. Ilse, Tomczak & Welling (Amsterdam) published an attention-based pooling layer that learns, end-to-end, a weight aka_k for each item and combines them as a weighted sum. Three properties matter: per-item features are learned, not engineered; training is fully end-to-end; and the aka_k weights themselves are the interpretability output — the model tells you which mutation, patch, or defect it leaned on. Whether attention weights are strictly more faithful than gradient attributions is a live debate (Jain & Wallace 2019); the practical point is they fall out of the architecture for free.

Pathology picked it up first (CLAM, TransMIL), then immunology (DeepTCR, DeepTCR_Cancer), then oncology (ATGC, then the 2025 Sidhom LLM). Each time the model beat the hand-engineered baseline and the aka_k maps were scientifically usable.

That combination — learned per-item features + attention-MIL + sample- level labels — is what makes foundation-model-style progress possible anywhere a sample is a variable-size bag rather than a fixed-shape image or sentence. Most of materials science is shaped like that. The rest of this note works out the port.

Origin: NLP + a non-NLP aggregator

The recipe is two lineages welded together. The first four steps (vocabulary → learnable embedding → masked-token pretrain → fine-tune) are the NLP playbook of the last decade. The fifth step — the attention-MIL aggregator — is not NLP; it comes from the computer- vision / weakly-supervised line (Ilse 2018, then pathology: CLAM, TransMIL). NLP gives you the per-item representation, MIL gives you the aggregator, and most scientific problems happen to need both.

The NLP half has been ported to biology one substrate at a time. Pick a discrete vocabulary, give each token a learnable vector, let context teach the model what it means, pretrain by masking random tokens (BERT 2018), then fine-tune. Proteins inherited it (ProtBERT 2020, Meta’s ESM; AlphaFold’s Evoformer is adjacent — it uses MSAs and invariant-point attention, more structured). DNA inherited it (DNABERT 2020, Nucleotide Transformer 2023). T-cell receptors did (DeepTCR). Tumors did (ATGC, the 2025 Sidhom LLM). Only the alphabet changes.

Same NLP recipe, different alphabetssame architecture, just swap the vocabularyEnglishProteinsDNAMutationsDefectsthe · cat · sat · on · the · matM · A · L · V · K · R · …ATG · CCT · GGA · TAC · …KRAS-G12C · TP53-R175H · …V_O@4f · Sr→La@2a · Mn→Co@8e← the betsame recipe1. embed each token2. transformer attention3. masked-token pretrain4. fine-tune for task+ attention-MIL aggregator(for bag → label tasks)

Figure 1. The “biology-as-language” research program in one picture. Every row uses the same architecture on the right; only the discrete vocabulary on the left changes. The last row — defects in a doped crystal — is the bet this note is testing.

Sidhom’s framing for his own line is “the language of cancer.” Each mutation is a word, each tumor a document, co-occurring mutations are syntax. The dual-attention architecture maps cleanly onto a sentence- then-document hierarchy: local attention over the DNA context around a mutation, then global attention over the bag of mutations in the tumor. The metaphor is loose where it matters most — documents have order, mutation bags don’t — which is why the permutation-invariant MIL aggregator (not the transformer) is the load-bearing piece. The pretrain is BERT taken to the extreme: BERT masks ~15% of tokens; the 2025 model masks 100% of the altered sequence and reconstructs it.

The materials port is the same move once more. Vocabulary becomes defect sites, Wyckoff sites, or monomers. “Sentence” becomes the local coordination shell. “Document” becomes the material sample. Pretrain becomes masked-defect reconstruction on computed structures. If “a tumor is a document whose words are mutations” is a productive frame, “a doped crystal is a document whose words are defects” is the analogous bet.

1. The methodology, in domain-neutral form

The five-primitive recipe as a pipelinebag of sparse instances → one sample-level label1. tokens2. per-instance3. side info4. two-level attn5. MIL headtrainable embedCNN or Transformercategorical fuselocal + global→ label + weightsamino acid/ mutation/ defect siteCDR3 seq/ mut. context/ coord. shellV/D/J + HLA/ gene + tissue/ space grouporder matterswithin instance,not across themtumor type,conductivity,overpotential

Figure 2. The recipe in pipeline form. Top labels are the primitives; bottom labels are what each one resolves to in immunology / oncology / materials (the three substrates discussed below).

Five primitives, each load-bearing:

  1. Trainable token embedding from scratch. Pick a discrete unit (amino acid, mutation, dopant species). Give it a learnable vector. Let context teach the model what it means. No hand-crafted descriptors.
  2. Variable-length per-instance backbone. Each instance is a short variable-length sequence — CDR3, mutation context, defect coordination shell. CNN works (DeepTCR), Transformer works (the 2025 LLM). The backbone returns one dense vector per instance.
  3. Side-information / metadata channel. Categorical context that isn’t part of the sequence itself — V/D/J gene, MHC allele, tissue of origin — is embedded separately and fused with the sequence representation. The materials analog is space group, lattice type, synthesis route, processing history.
  4. Two-level attention.
    • Local (sequence-aware) attention: order-sensitive, captures what the token means in immediate context.
    • Global (permutation-invariant) attention: aggregates instances across the sample, captures co-occurrence without imposing a spurious order.
  5. Attention-based multiple-instance learning (MIL). Labels live at the sample level (patient outcome, tumor type, immunotherapy response). The model must aggregate a bag of per-instance vectors into a single sample-level prediction with per-instance importance weights you can read off. This is the part most people don’t import from NLP, and it is what turns “model that embeds one mutation” into “model that diagnoses a tumor.”

Bonus, used heavily in the 2025 model: MAE-style masked-token pretraining (literally 100% masking on the altered sequence) before the supervised head. Lines up with the TempoSurfViT recipe already in our toolkit (MAE pretrain + quantile head, paper draft at /writing/temposurfvit-draft/), so very little engineering overhead.

Equivariance: the materials-specific commitment

SO(3) rotational symmetry is the hard problem the bio version doesn’t have to face — amino-acid sequences are already 1-D, mutation contexts are already textual, but a defect site lives in 3-D space whose labelling is gauge-arbitrary. Three families of published options:

The cleanest commit for v1: frozen UMA-s-1p2 as the per-instance backbone, applied to a local cluster centered on each site, pooled within the cluster to one LL-dim SO(3)-invariant vector before entering the bag. Equivariance is enforced at the per-instance level; the MIL pool acts on already-invariant vectors and inherits invariance trivially. The cost is compute (no MPS acceleration; cloud GPU for training); the win is that we don’t reinvent equivariant representation learning in the aggregator itself.

The aggregator, written out

Step (5) is doing the load-bearing work, and the materials port lives or dies on whether this aggregator generalizes. The operator is the gated attention pooling from Ilse, Tomczak & Welling (ICML 2018). Given a bag of KK per-instance vectors {h1,,hK}\{h_1, \dots, h_K\} with hkRLh_k \in \mathbb{R}^L, parameters V,URD×LV, U \in \mathbb{R}^{D \times L} and wRDw \in \mathbb{R}^D, the per-instance weight is

ak=exp ⁣(w ⁣[tanh(Vhk)σ(Uhk)])j=1Kexp ⁣(w ⁣[tanh(Vhj)σ(Uhj)])a_k = \frac{\exp\!\left( w^{\top}\!\left[\, \tanh(V h_k) \,\odot\, \sigma(U h_k) \,\right] \right)}{\sum_{j=1}^{K} \exp\!\left( w^{\top}\!\left[\, \tanh(V h_j) \,\odot\, \sigma(U h_j) \,\right] \right)}

and the bag-level embedding is the convex combination

z=k=1Kakhkz = \sum_{k=1}^{K} a_k \, h_k

which feeds a standard classifier y^=softmax(Wcz)\hat{y} = \mathrm{softmax}(W_c z).

Attention-MIL pooling: per-item weights aggregate a bag into one labelattention weights a_k aggregate a bag of K instances into one labela_k (weight)a₁a₂a₃a₄a₅a₆a₇a₈h₁h₂h₃h₄h₅h₆h₇h₈h_k (vector)z = Σ a_k · h_kŷ (bag label)

Figure 3. What the aggregator actually does. The bars on top are the learned per-item attention weights a_k — and they are themselves the interpretability output (“this item drove the prediction”). The weighted sum z = Σ a_k h_k passes to a standard classifier head.

Two properties matter for the port:

The “gated” piece — the σ(Uh)\sigma(U h) elementwise product — exists because tanh\tanh alone struggles to produce strongly negative scores; the sigmoid acts as a learned vetoer. CLAM uses a clustering- constrained variant of this same Ilse-style pooling; ATGC uses a multi-head variant; the 2025 LLM and TransMIL replace it with full Transformer self-attention (TransMIL with a Nyström kernel approximation) — different aggregator family, but the permutation- invariant role is identical.

A reality check on novelty

This is not a hidden gem from one lab. The same pipeline — token embed → per-instance backbone → attention aggregation → sample-level head — is the standard pattern in weakly-supervised computational pathology. The matrix below maps how five works (two pathology, three from Sidhom / the Adams lab, plus the materials port this note is sketching) all instantiate the recipe with small variations:

WorkSubstratePer-instance backboneSide infoAggregationPretraining
Ilse et al. 2018 (attention-MIL)— (generic)anygated attention
CLAM (Nat Biomed Eng 2021)WSI patchesCNN (pretrained)clinicalattention-MILself-supervised
TransMIL (NeurIPS 2021)WSI patchesTransformerself-attention + MIL
DeepTCR (Nat Commun 2021)amino acids1-D CNNV/D/J + HLAattention-MILautoencoder option
ATGC (Nat Biomed Eng 2023)mutationsTransformergene + contextattention-MIL
Sidhom 2025 LLMmutationsdual-attention Transformerclinicalattention-MILMAE (100% mask)
proposed materials portdefect / dopant siteslocal-env Transformerspace group, latticeattention-MILMAE-style

What changes across rows is the substrate, the per-instance backbone, and the side-info schema — not the aggregation pattern. So the hook for porting to materials isn’t “Sidhom’s novel methodology”; it’s “materials science hasn’t yet borrowed the weakly-supervised pattern that pathology and immunology already converged on.” Defensible as a paper hook, easy to over-claim.

2. Why the recipe transfers

The three bio substrates this stack has shipped on look very different on the surface but share four structural properties:

Any domain matching this shape is a candidate. Materials science has three sub-domains matching the bag shape (Framings A/B/C below), plus a fourth with the related but different ordered-sequence shape (Framing D — processing routes). Figure 5 makes the biology/materials parallel concrete for the cleanest bag-shaped one.

3. Mapping to materials science

We cover four materials-science ports of the same architectural idea. The first three (A/B/C) are bag-shaped: a sample is represented as an unordered, variable-size collection of physically meaningful instances, and a permutation-invariant attention-MIL aggregator maps instance embeddings to a sample-level prediction. The fourth (D) is sequence-shaped: a sample is represented as an ordered processing history and is modeled with sequence-aware self-attention plus a [CLS] readout. Across framings, the per-instance encoder can remain similar; what changes is the definition of an instance, the definition of a bag, and whether order is physically meaningful.

Three bag-shaped materials framings (A/B/C); Framing D below is sequence-shapedsame recipe shape, different instance and bagcolumn →Framinginstancebag (one sample)sample labelA. Polymerblend(crowded)one polymerchain (SMILES)blend / copolymer= set of chainsTg, σ_ion,tensile strengthB. Crystalsites(general)one Wyckoff site+ coord. shellunit cell = set ofinequivalent sitesformation energy,band gap, modulusC. Defectsin host★ cleanestone point defect+ local shelldoped sample = setof defect sitesOER overpotential,Li-ion σ, magnetism

Figure 4. Bag-shaped materials framings with the same architectural skeleton. Across rows, the meaning of “instance” and “bag” changes: chains in a blend, inequivalent sites in a crystal, or defect neighborhoods in a host. The recipe stays fixed — instance encoder → masked attention-MIL pooling → sample-level prediction head. Framing C is the closest structural analogue to somatic-mutation oncology because sparse local perturbations in a host background are aggregated to predict a sample-level phenotype. Framing D (below) departs from the bag assumption: processing steps are ordered and therefore require sequence-aware self-attention rather than permutation-invariant MIL.

Framing A — polymer / molecule as sequence, blend as weighted bag

BiologyMaterials
Amino acidMonomer / functional group / SMILES atom token
CDR3 sequencePolymer chain (SMILES, SELFIES, repeat-unit tokens)
V/D/J genePolymerization route / catalyst / solvent
Patient repertoireBlend / composite / copolymer = weighted set of chains, with loading fraction, MwM_w, dispersity, tacticity, additives
Outcome labelGlass transition TgT_g, tensile strength, ionic conductivity, density

The bag is weighted, not just unordered. A 90/10 blend is not the same material as a 10/90 blend, and each instance carries covariates (loading fraction, molecular weight, dispersity, tacticity, additive identity, solvent / process metadata) that the chain SMILES alone doesn’t capture.

This framing is crowded if posed as molecular sequence modeling alone. ChemBERTa-2 ran masked-LM and multi-task regression over 77M SMILES; MoLFormer was pretrained on up to ~1.1B molecules from ZINC and PubChem; GP-MoLFormer extends the line into generative molecular modeling and property optimization via pair-tuning. A pure SMILES-BERT is not a paper anymore. The less crowded angle is weakly supervised bag-level learning for blends, composites, and multi-component polymer systems — sample = weighted set of chains/components, label = bulk property — which is exactly the structure the bio version handles.

Framing B — crystal as bag of local environments

BiologyMaterials
Amino acidElement symbol at a Wyckoff site
CDR3 sequenceLocal atomic environment around a site
V/D/J geneSpace group + lattice parameters
Patient repertoireCrystal = bag of inequivalent sites
Outcome labelFormation energy, band gap, bulk modulus, ionic conductivity

The novelty claim has to be careful here, because materials-GNN people have been living in crystal graphs since before half the internet learned to spell “attention.” Attention on crystal representations is crowded prior art: GATGNN combines local attention layers with a global attention layer that weights atom-environment vectors into a crystal representation; CGAT (Sci Adv 2021) represents crystals as graphs and uses multi-head attention over neighboring atoms; CEGANN (npj Comp Mat 2023) is explicitly a crystal edge graph attention neural network; ACGNet and the GCPNet line attach graph-convolutional attention operators; and ComFormer-style crystal graph transformers report SOTA across crystal-property benchmarks. AtomSets is adjacent in spirit (transferable atom-level representations with a less graph-heavy prediction head) without being MIL.

What the MIL framing buys, more carefully stated:

CIF / Robocrystallographer textual representations give a tokenization on-ramp.

Framing C — defects / dopants as the bag (the cleanest port)

BiologyMaterials
Somatic variantPoint defect / dopant atom in a host crystal
Ref → altHost atom → substituent atom
Local contextLocal coordination shell around the defect
Bag of mutations per tumorBag of defects per material sample
Tumor type / drug responseCatalytic activity / conductivity / magnetism

“Somatic mutations in a tumor” and “point defects in a doped oxide” are structurally the same problem: sparse, position-aware, sample-level labels, instance importance matters. The catalysis and battery-cathode communities have exactly this label shape and currently use bespoke per-property regressors.

Label granularity is the load-bearing distinction. If the task is per-defect formation energy or relaxed defect structure, this framing competes directly with established defect-GNN work — defect formation enthalpy predictors from ideal crystal structures, DefiNet for point-defect crystal structures, and the broader defect-informed equivariant-model line. The novelty there is at best incremental. The defensible framing is the other direction: a doped/defective material sample is represented as a variable-size bag of candidate defect neighborhoods, and the label is a sample-level outcome — OER overpotential, Li-ion conductivity, magnetism, carrier concentration, catalytic activity, or measured device-level response — observed only at the bulk. Attention-MIL is much more natural there than a defect-GNN trained on per-defect targets.

Where the real data actually fits. Once you go shopping for an open corpus, the cleanest fit for Framing C turns out to be not point defects in bulk oxides but adsorbate binding sites on catalyst surfaces: OC20-Dense gives roughly 100 candidate binding sites per (catalyst, adsorbate) system as a true bag, and the sample-level question “which site is the active one” matches the bio template almost exactly — same as “which mutation is the driver” in ATGC, just on a different substrate. The methodology and the per-site featurization are unchanged; only the substrate moves, from bulk defect sites to surface adsorption sites. The strongest existing competitor on the surface high-entropy catalyst version is the attention-enhanced EquiformerV2 + Post-Att Adapter from Sci Adv 2025 (“Decoding active sites in high-entropy catalysts”), and that’s exactly the bake-off target picked up in §4.

A tumor is a document; a doped crystal is a documentsame architecture, swap the alphabet — Framing Ctumor (document)KRAS-G12CTP53-R175HEGFR-L858RBRAF-V600EAPC-fsMYC-amp↑ bold = high attention weightdoped crystal (document)V_O@4fSr→La@2aMn→Co@8eV_Li@16dNb→Ti@1bO→F@4c↑ bold = high attention weightattention-MIL → ŷattention-MIL → ŷtumor type5-yr survivaloverpotentialLi-ion conductivity

Figure 5. Framing C, made visible. Two domains, one architecture: the sample is a bag of position-tagged tokens (mutations on the left, defects on the right), the model attends across the bag, and the attention weights are themselves the per-instance importance map that scientists actually want to read.

Framing D — alloy processing route as ordered sequence

BiologyMaterials
Amino acid in a sentenceProcessing step (anneal, quench, roll, age, HIP, …)
SentenceFull processing route applied to one alloy
Per-token side infoStep parameters (T, time, strain, atmosphere)
Composition embeddingAlloy composition vector (the side-info “metadata”)
Outcome labelYield strength, hardness, fatigue life, fracture toughness

A different shape from A/B/C: the sample is an ordered sequence of processing steps, not a permutation-invariant bag. Step order matters mechanically — anneal-then-quench is not the same alloy as quench-then-anneal — so the aggregator can’t be permutation-invariant. The natural shape is sequence-aware self-attention with a [CLS] token; the interpretability output is the [CLS]-to-step attention map (“which processing step set the final yield strength?”), exactly the DeepTCR / 2025-LLM line rather than the bag-of-mutations line.

Prior art on processing-route-as-sentence is thin. Most alloy-property models take final composition + microstructure descriptors and ignore the processing path entirely. CrabNet is composition-only; PolyMicros covers polymer microstructure, not metallurgy; AlloyGPT and the npj Computational Materials 2025 HEA transformer attend over composition tokens, not over processing-step tokens. The “processing-route-as-sentence with per-step attention as the interpretability output” angle is open.

Where the real data actually fits. The strongest open corpus is FatigueData-AM2022 (>15k AM fatigue points with structured post-processing fields — HIP / solution / age — JSON-native, CC licensed). The immediate target therefore moves from “wrought-alloy heat treatment → yield strength” to additive-manufacturing post-processing → fatigue life. Sequence depth is shallower than NIMS CDS+FDS (2–4 steps vs 5–7), but the AM target is open, the labels are real, and the per-step interpretability question (“which post-processing step set the fatigue life”) is one the AM community is actively asking. NIMS CDS+FDS is the deeper-schedule follow-on if the first paper lands.

Framing D — ordered sequence of processing steps with per-step attentionordered sequence (Framing D) — step order matters: anneal-then-quench ≠ quench-then-annealannealT = 950 K3 h, Arquench0.5 hArrollε = 0.21 passageT = 720 K8 hquench1.2 hN₂time →★ a = 0.27★ a = 0.41a = 0.05a = 0.08a = 0.19sequence-aware self-attention + [CLS] (alloy composition fused into CLS)ŷ = fatigue life★ = high-attention step in this route (model output, not annotation)here: an anneal at T > 800 K immediately followed by a quench

Figure 6. Framing D shape. Unlike Figures 4–5 (bag-shaped framings), the sample is an ordered sequence of processing steps; the aggregator is sequence-aware self-attention with a [CLS] token, not permutation-invariant MIL. The [CLS]-to-step attention map is the “which processing step determined the property” interpretability output. A synthetic toy with an order-sensitive decisive rule (an anneal counts only if immediately followed by a quench) lands the sequence model at val R² 0.954 vs 0.099 for a permutation-invariant gated-MIL baseline on the same tokens — the 9× gap is the reason Framing D earns a separate aggregator.

When “instance” is harder: surfaces, amorphous, high-entropy

The four framings above pick the cleanest cases. Three messier ones a catalysis reviewer will ask about — and what the engineering answer looks like.

Surface catalysis. OER and most heterogeneous catalysis live on surfaces (steps, kinks, terraces), not in bulk unit cells. The bag is the set of surface sites on a slab — typically 10–50 sites per slab — each represented by a local-coordination instance feature. This is what the Sci Adv 2025 paper actually operates on (CoOOH slabs with surface dopants), and Framing C maps onto it by reading “site” as “surface site” instead of “bulk-defect site.” No architecture change.

Amorphous and disordered systems. Glasses, gels, amorphous oxide catalysts have no Wyckoff labels and no canonical graph. The bag becomes atoms sampled within an r-cutoff (e.g., everything within 8 Å of a candidate active region); each instance is a coordination- shell vector. This is where the §1 equivariance commit — frozen UMA-s-1p2 on local clusters — earns its keep: there’s no crystal symmetry to fall back on, so SO(3) invariance has to be carried at the per-instance level.

High-entropy compositions. When 4+ elements are mixed at random on the same sublattice (HE oxides, high-entropy alloys), every site is a “dopant” in some sense — the host-vs-defect distinction breaks down. The bag is just every site; the per-instance vocabulary grows with the element count. The Sci Adv 2025 HE-CoOOH set is exactly this case — and the §4 bake-off inherits it as the downstream task.

The common thread is that the recipe doesn’t change; what changes is the definition of the instance and the size of the bag. Dilute doping → 5–20 point-defect instances; surface HE catalysts → 30–80 surface-site instances with many-element local chemistries. The Ilse-Tomczak aggregator doesn’t care. The per-instance backbone does, and is where the engineering risk lives.

Positioning, in one line. The contribution is not another materials transformer; it is a weakly supervised, instance-saliency framework that ports attention-MIL from mutation-level biomedical prediction to materials samples whose measured properties arise from unordered sets of chains, sites, or defect neighborhoods. The attention readout is a learned instance weight to be validated by ablation, not a faithful causal explanation by default — and that framing is the one that survives reviewers armed with CGCNN, GATGNN, CEGANN, ACGNet, ComFormer, AtomSets, and the rest of the acronym factory.

4. Where this becomes a paper

Strongest current bet: Framing C applied to surface-HE electrocatalysts — the intersection of “defects as the bag” and the high-entropy case from the previous subsection, which is exactly what the Sci Adv 2025 dataset operates on (slab-surface sites in HE- CoOOH). Solid electrolytes (Li-ion conductivity, Framing C with dilute doping) are the natural second downstream task if HE catalysts don’t pan out.

Composes with the TempoSurfViT training recipe (MAE-style pretrain + quantile head), so the engineering overhead is small and reuses our trainer.

What we’d actually be beating

Concrete competitive landscape so we don’t oversell. Recent attention-enabled crystal models that share part of this space:

ModelWhat it doesWhat the MIL/MAE framing adds
CGCNN (2017)Message passing on crystal graphNo site-importance output; fixed graph topology
CGAT — Crystal Graph Attention (Sci Adv 2021)Edge attention on CGCNN backboneAttention is on edges, not on sites-as-instances
ACGNetInterpretable CGNN for oxidation potentialSingle-task; no MAE pretrain; no bag framing
CEGANN (npj Comp Mat 2023)Edge-attention for environment classificationClassifier, not regressor; not a foundation-model framing
GCPNetCrystal-pattern graph + GCAO attentionSame edge-attention family
GP-MoLFormer (IBM, 1.1B SMILES)Transformer + pair-tuning for property optMolecule-level, not bag-of-instances
EquiformerV2 + Post-Att Adapter (Sci Adv 2025, high-entropy catalysts)SO(3)-equivariant graph transformer; per-site overpotential predictionAttention is on the equivariant graph, not on bag of sites; per-site importance is extracted post-hoc, not the pool’s primary output
Crystalformer (ICLR 2024)Transformer with “infinitely connected attention” formulated as neural potential summation; SOTA on Materials Project + JARVIS-DFT with ~29% of comparable Transformer paramsAttention is between atoms in a fully-connected periodic structure, not a bag pool; no per-site importance as primary output
DA-CGCNN (AIP Advances 2024)CGCNN backbone with dual attention (channel + self) and cross-property transfer learningAttention is on graph features, not on sites-as-instances; benchmarked on Materials Project (formation energy, bandgap, etc.), not on catalyst overpotential
Site-Net (Digital Discovery 2023)Transformer with bond-feature (pairwise) attention on atoms in a real-space supercell; mean-pool across atom embeddings for MatBench regressionPooling is unweighted mean (no MIL head, no per-instance aka_k output); attention is over atom pairs, not sites-as-instances; per-atom importance only readable post-hoc from pair-attention weights

The honest differentiator is not “we use attention on materials” (taken). It is the combination: bag-of-instances framing + MAE pretrain on the instance vocabulary + Ilse-Tomczak gated MIL with per-instance importance as the output, applied to settings where the sample is naturally a bag (defects, blends, disordered sites) rather than a fixed graph. Pathology and immunology have shown this combination converges and yields scientifically usable importance maps; materials hasn’t tested it at scale.

Interpretability — against what materials already has

Per-site importance isn’t a tool materials scientists are missing. The field has Sabatier analysis and electrocatalyst volcano plots (Nørskov et al., 2004 onwards), microkinetic decomposition of activation energies, DFT-computed adsorption-energy contributions per surface site, and — for high-entropy catalysts specifically — recent SHAP-on-Equiformer and integrated-gradients work that extracts per-site attributions post-hoc. Pitching aka_k maps as “novel interpretability” against that landscape is a losing pitch.

The honest claim is sharper: the aka_k map should recover the volcano-derived ranking and Sabatier-identified active sites — not as a post-hoc attribution that needs separate calibration, but as the pool’s primary output that the model itself was trained to optimize. The bake-off’s second primary metric is exactly this: agreement between learned aka_k and Sabatier-curated / DFT-computed per-site activity contributions on a curated set. A win there is “an end-to-end model that agrees with the physics-based attribution methods materials scientists already trust” — which is publishable because it removes a step (the post-hoc SHAP/IG/Sabatier computation), not because it provides interpretability that didn’t exist.

First-pass shipped 2026-05-19 (run_persite_eval.py, seed set in data/persite_eval/curated_active_sites.py). Six literature-curated slabs (Pt/Cu/Ni/Pd/Ag/Au × H/CO/O on (111)/(100)), ground truth = adsorbate atoms ∪ top-layer metal atoms within bond distance, MIL trained on cache/adsorption_v2.pt across 5 seeds. Headline (updated 2026-05-19 with the 12-entry extension):

MetricTrained MILDirichlet null
top-1 hit rate100.0% [100, 100]7.4% [6.4, 8.5]
top-3 hit rate100.0% [100, 100]32.2% [30.3, 34.1]
top-3 recall70.3% [63.3, 77.0]9.8% [9.2, 10.4]
top-5 recall80.2% [72.5, 87.5]15.4% [14.7, 16.1]
attn-conc. ratio2.32× [2.03, 2.63]1.02× [1.00, 1.04]

Decision rule from scope_persite_eval.md: ACCEPTABLE — matches the DeepTCR/ATGC bio baselines. Top-5 recall (80.2%) is at the STRONG band threshold. The 6 → 12 entry extension confirmed cross- element generalization (three off-train elements: Pd, Rh, Ir all recover at in-distribution-equivalent rates) and tightened the top-3 recall CI by ~27% (19pp → 14pp) while preserving the central tendency. Caveats: the active-site rule is qualitative (Path 3 DFT is what gives a true Spearman ρ); HE-CoOOH entries — same chemistry as the bake-off competitor — are not yet in the curated set. Path forward laid out in materials-nlp/persite_eval_NEXT.md.

Update 2026-05-18 — the “primary output, not post-hoc” framing softens under a comparator panel. Ran integrated gradients, input × gradient, and vanilla saliency on the same trained FFN MIL (scope_attribution_comparators.py, 5 seeds, dopant_indices as ground truth):

methodtop-1 hittop-3 hitattn. concentration
MIL aka_k0.880 ± 0.050.910 ± 0.086.93× ± 2.63
Saliency $\nabla y$0.870 ± 0.09
Integrated Gradients0.860 ± 0.060.890 ± 0.073.43× ± 1.50
Input × Gradient0.790 ± 0.100.880 ± 0.083.90× ± 0.73

The result is wrong-shaped for the original framing: aka_k is statistically tied with Saliency on top-1 hit (margin 0.010, within noise) and literally tied on top-3 (both 0.910). The gradient methods are equally faithful at picking dopant atoms. What aka_k uniquely wins is sharpness — its softmax produces a 6.93× concentration on dopants vs Saliency 4.42×, a 1.6× sharper map.

So the publishable claim isn’t “uniquely faithful” — it’s “matches integrated gradients and saliency on top-k dopant recall while producing a 1.6× sharper attribution map at zero post-hoc computational cost.” That’s a real but more modest contribution: sharpness matters for visualization (peaked headline figures) and for downstream use as feature weights (BO acquisition, surrogate weighting) where peakedness improves selection. It does not claim a faithfulness advantage that the comparator panel says isn’t there. Writeup at materials-nlp/e3_attribution_result.md.

But — the AOPC follow-up partially rescues a faithfulness advantage, with a wrinkle. Top-k recall measures agreement with ground truth; it doesn’t measure whether the attribution is causally faithful (Jain & Wallace 2019’s exact concern). The AOPC test (Liu et al. ICML 2022) ranks atoms by attribution score, masks the top-k by zeroing their features, and measures how much the prediction moves:

methodAOPC AUC (k=1..12)
MIL aka_k0.954 ± 0.335
Integrated Gradients0.834 ± 0.347
Saliency $\nabla y
Input × Gradient0.712 ± 0.361
random baseline0.594 ± 0.210

Saliency, which tied aka_k on top-1 hit, drops to mid-pack on AOPC. This is the Jain-Wallace pattern: it identifies dopant atoms by some non-causal signal (probably gradient magnitude correlating with atom-norm), passing the recall test without being causally predictive. aka_k wins AOPC by +0.12 over IG (the next-best post-hoc method) on the mean; per-seed, MIL wins 2/5, ties 2/5, IG wins 1/5.

So the combined publishable claim becomes:

“MIL aka_k ties IG and Saliency on top-k dopant recall but produces a 1.6× sharper attribution map and is +14% more causally faithful on AOPC. Saliency’s top-k tie is misleading: it drops to mid-pack on AOPC, indicating its top-k success comes from a non-causal signal. aka_k is the most causally faithful attribution on this backbone at zero post-hoc cost.”

This is stronger than what the top-k panel alone suggested, weaker than the original “first-class output, uniquely faithful” framing, and grounded in two complementary controlled tests rather than rhetoric. Writeup at materials-nlp/e3_aopc_result.md.

Where the wedge actually lives — localized vs. distributed signal

A small set of experiments I ran on top of frozen UMA-s-1p2 (see materials-nlp/baselines.py, materials-nlp/baselines_oc22.py, materials-nlp/attention_oc22.py) tightens the §4 pitch into something predictive rather than just hopeful.

Same per-site features (frozen UMA L=0 channels, dim=128\dim = 128), same 80/20 split, same training budget. Three aggregators compared: mean-pool + MLP, max-pool + MLP, gated attention-MIL + MLP.

TaskSignal typemean-pool MAEMIL MAEMIL attention concentrationWinner
Adsorption energy (Cu/Au/Pt/Ag/Ni × H/H₂/OH/CO, 100 slabs)Localized (1–2 adsorbate atoms in 13–14 total)0.55 eV0.28 eV2.68× uniformMIL by 2×
OC22 per-atom relaxed energy (real DFT labels on 100 oxide slabs)Distributed (uniform oxide chemistry across 30–180 atoms)0.22 eV/atom0.25 eV/atom1.7× uniformmean-pool by 13%

External baseline cross-check (OC22 task only). Querying the frozen UMA-s-1p2 calculator directly on each initial structure — the modern Meta substitute for the unavailable EquiformerV2-OC22 + Post-Att Adapter — and reporting predicted energy divided by atom count after subtracting the train-set mean reference offset gives val MAE 0.225 eV/atom on the same 20-sample val_id subset. UMA-direct ≈ mean-pool ≈ Ours-MIL on this task (0.22 / 0.22 / 0.27), with correlations all in 0.97–0.98. The fact that the modern Meta backbone also fails to beat mean-pool by a meaningful margin on OC22 IS2RE-per-atom is the distributed-signal regime confirming itself: no aggregator wins because the signal genuinely is spread across the bag.

The wedge: attention concentration determines pooling choicethe diagnostic — trained attention concentration determines which pool winsLOCALIZED — adsorbate taskDISTRIBUTED — OC22 per-atom task★ adsorbateper-atom attention, 14 atomsper-atom attention, 30+ atoms (uniform-ish)concentration ratio: 2.68× uniformMIL wins — val MAE 0.28 vs 0.55 eV (mean-pool)concentration ratio: 1.70× uniformmean-pool wins — val MAE 0.22 vs 0.25 eV/atom (MIL)attention concentration ratio (mass on chemistry-relevant atoms / uniform baseline)threshold1.7×2.7×OC22 (mean-pool wins)adsorbate (MIL wins)

Figure 7. The diagnostic, visualized — two of the three regimes. Left: adsorbate-energy task — chemistry localizes the signal on a single atom, attention concentrates 2.68× over uniform, MIL beats mean-pool by 2×. Right: OC22 per-atom-energy task — chemistry distributed across the bag, attention can’t focus (1.7×, near uniform entropy), mean-pool wins by 13%. The dashed threshold at ~2× concentration is the operational dividing line we’re proposing as the diagnostic. Not shown: the supervised-oracle upper bound on the localized panel (val MAE 0.145 eV) — discussed in the “MIL is not the accuracy upper bound” paragraph below. The qualitative trade-off is folklore in sound-event detection and pathology-MIL (Wang 2018, Ilse 2018); the threshold and the materials-side evidence are the additive.

One more dimension: how to choose between MIL and a supervised oracle. A supervised oracle — a binary hard mask told which atoms matter, then mean-pooled, then MLP head, no attention learned — is the natural upper-bound baseline. On the v2 adsorbate task above it crushes MIL: val MAE 0.145 eV vs MIL’s 0.28, nearly 2× better. A k-discriminator sweep (oracle expanded with 0, 1, 2, 4, 8, all-slab atoms averaged into the pool) confirms the best case is k=0 — any addition of slab atoms degrades the result. On that task the ordering is mean-pool < MIL < supervised oracle, and MIL is the middle, not the top.

But re-run the same comparison on a harder task — programmatically- doped HE-CoOOH slabs (Sci Adv 2025’s substrate, multiple dopants per surface, signal not localized to one atom but spread over each dopant plus its immediate slab context) — and the ordering flips:

taskmean-poolMILsupervised oraclewho wins
v2 adsorbate (1 atom carries the signal)0.55 eV0.28 eV0.145 eVoracle, by ~2×
HE-CoOOH (dopants + extended context)1.11 eV0.86 eV1.70 eVMIL, by ~2× (6 of 6 random seeds; paired bootstrap 95% CI on MIL−oracle = [−1.02, −0.34], n=2000)

The two tasks differ in how the chemistry localizes. When the signal is concentrated on one or two atoms you can name in advance, the oracle’s hard mask is optimal — MIL spends capacity rediscovering something supervision already knows. When the signal is extended- localized — a few dopants whose contribution depends on the neighborhood they sit in — the oracle’s hard mask discards the neighborhood and MIL’s soft weights pick it back up.

The wedge therefore has three regimes, not two:

MIL’s value proposition is therefore not “best accuracy universally” but “the only pool that wins on extended-localized signals AND produces an unsupervised per-site importance map.” Mean-pool wins on the distributed end but throws away localization; oracles win on the fully-localized end but require supervision MIL doesn’t ask for. The extended-localized regime is where both other approaches lose information and MIL’s wedge is real.

The qualitative version of this trade-off — attention pooling wins on localized signals, mean-pooling wins on distributed ones — is not new. Wang, Li, and Metze (“A Comparison of Five MIL Pooling Functions for Sound Event Detection with Weak Labeling,” 2018/2019) characterize exactly this across five pooling functions in audio, and the original Ilse–Tomczak–Welling 2018 attention-MIL paper introduces the architecture under a “few key instances” witness-rate framing that implicitly assumes localization. FocusMIL (2024) adds the counter- point that max-pooling beats attention under spurious-correlation regimes. What the table above contributes is the quantitative dividing line: mean attention concentration on chemistry-relevant atoms of about 2× uniform is the operational threshold on materials property prediction — a regime that neither prior empirical study tested. The framing is folklore; the threshold and the materials-side evidence are the additive.

Given that framing, the relationship in the table reads cleanly: MIL’s advantage over simpler pooling scales with how peaked its trained attention is allowed to get. When the underlying chemistry concentrates on a few sites (adsorbate atoms on a metal surface), the attention finds them — 2.68× the uniform- baseline mass on the adsorbate, with no supervision about which atoms those were — and the aggregator beats mean-pool by a 2× factor. When the chemistry is uniform across the bag (bulk-ish oxide slabs predicted on a per-atom basis), attention can’t learn a useful focus (1.7×, barely above uniform; entropy 89% of uniform), and adds noise relative to averaging.

This means the §4 differentiator is not “MIL beats mean-pool universally” — that’s empirically false on distributed-signal tasks. The defensible pitch is MIL is the right tool when the underlying physics localizes the signal, and the diagnostic is the trained attention concentration itself. Concretely: if a trained model’s mean attention concentration on chemically-relevant atoms exceeds ~2× uniform, MIL beats the pooling baselines and produces a per-site importance map that’s worth reading. Below that threshold, mean-pool is the better tool and the interpretability claim collapses.

The Sci Adv 2025 HE-catalyst OER overpotential task that §4’s bake-off targets is in the high-concentration regime by physics: specific dopant sites drive activity, Sabatier analysis already tells us qualitatively which ones, and a well-trained model should recover that focus. Conversely, OC22 per-atom-energy regression is the wrong testbed — it’s a distributed-signal task and the MIL framing should not be expected to beat mean-pool there. We just confirmed that empirically; the bake-off should pick its tasks accordingly.

Controlled validation: synthetic continuum and N-scaling

The wedge above rests on three task-level points (adsorbate v2, OC22 per-atom, HE-CoOOH). To validate the diagnostic non-circularly we ran two controlled experiments on 2026-05-18 — one on the signal axis and one on the data-size axis. Both came back with results that tightened the framing rather than killed it, but in unexpected ways that change which claim §4 leads with.

Synthetic locality continuum (Wang-Li-Metze 2018 in audio, lifted to materials shape). Hold everything else fixed and vary the fraction of signal-bearing instances per bag in 1; train mean-pool, max-pool, gated MIL, and an oracle hard-mask pool for 200 epochs, 5 seeds per cell.

locality fractionmean MAEMIL MAEoracle MAEMIL conc.mean/MIL ratio
0.050.3700.0260.00219.5×14.5×
0.200.1540.0360.0014.1×4.2×
0.400.0880.0240.0022.2×3.6×
0.600.0510.0220.0021.5×2.3×
1.000.0020.0010.0021.0×1.9×

(MIL bold; 5-seed mean.) Three findings change the framing:

HE-CoOOH N-scaling sweep (within the existing 100-structure Path C cache). For N in 100, subsample uniformly, 5 seeds per cell, retrain.

Nmean-pool MAEMIL MAEoracle MAEMIL conc.mean/MIL ratio
102.992.904.243.5×1.03×
201.591.812.842.8×0.88× ←
401.341.471.873.4×0.91× ←
601.331.111.872.9×1.20×
801.310.861.543.2×1.52×
1001.370.871.583.4×1.57×

(← = mean-pool ties or beats MIL.) Three more findings:

Cumulative reframing of the §4 wedge. The single-table wedge above decomposes into a three-part story:

  1. Signal-side. On synthetic, MIL universally beats mean-pool; gap shrinks with locality but never reverses.
  2. Data-side. MIL beats mean-pool only when N ≥ ~40 bags, regardless of attention concentration. Sufficient condition for the diagnostic to predict MIL > mean-pool requires both.
  3. Oracle-vs-MIL contrast. Oracle dominates on synthetic but loses on HE-CoOOH. The contrast operationally defines the extended-localized regime, rather than naming it phenomenologically.

This is more rigorous than the original three-regime taxonomy (distributed / extended-localized / fully-localized) because each regime now sits on a controlled axis — locality, data size, or information truncation by the binary mask — instead of being defined by which dataset happened to land where. Materials side of the bake-off carries the full three-part story; the synthetic and N-scaling experiments together cost half a day of dev-box CPU. Writeups at materials-nlp/e2_locality_result.md and materials-nlp/e5_scaling_result.md.

One more controlled axis: bag size. The original synthetic above ran at bag_size=20, while HE-CoOOH slabs have 48 atoms. A natural reviewer question — and one we wanted to answer before committing to the three-part framing — is how much of the synthetic-vs-materials gap-ratio differential (4.2× synthetic at lf=0.20 vs 1.6× materials on HE-CoOOH) is explained by bag size alone. Re-ran the same grid with bag_size=48 (scope_synthetic_locality_bag48.py); side-by-side:

lfbag20 ratiobag48 ratioΔ
0.0514.51×3.49×−11.02
0.104.76×3.83×−0.93
0.204.23×3.59×−0.64
0.403.59×2.68×−0.91
0.602.28×1.99×−0.30
1.001.90×1.45×−0.45

Bag size matters most in the extreme low-locality regime — at lf=0.05 the ratio collapses from 14.5× to 3.5× because mean-pool now averages over ~2 signal instances instead of 1. In the middle of the locality range (where materials operate) the bag factor moves the ratio by less than 1.0, so the wedge framing is largely bag-size robust where it matters.

The follow-up gives the §4 wedge a quantitative decomposition of the materials MIL/mean-pool advantage:

componentfactor in MIL/mean ratio
pure locality (lf=0.05, bag=20, IID features)14.5×
bag-size correction (bag=20 → bag=48)÷ 4.1 → 3.5×
feature-correlation correction (synthetic → materials)÷ 2.2 → 1.6×

Each factor is empirically grounded in a controlled run. The §4 paper can now claim: “the MIL/mean advantage of 1.6× on HE-CoOOH is the product of a ~3.5× locality factor (signal sits on ~4% of atoms) divided by a ~2.2× feature-correlation factor (shared coordination information lets mean-pool partially recover the signal).” That explains why the gap is 1.6× and not 14.5×, which is what a reviewer would otherwise raise. Writeup at materials-nlp/e2_bag48_result.md.

Fourth controlled axis — sharpness ceiling. A reviewer would also ask: if the wedge framing prizes attention concentration, why not use a more expressive pool that produces even sharper attention? Ran Set Transformer’s PMA pool (Lee et al. ICML 2019, multi-head attention from a learnable seed query) against Ilse-2018 gated MIL on the same backbone and data:

poolval MAEtop-1 dopant hitattention concentration
Ilse-2018 gated MIL0.75 ± 0.120.88 ± 0.053.69× ± 1.66
PMA (1 seed, 4 heads)1.58 ± 0.510.81 ± 0.119.78× ± 3.35
PMA (4 seeds, 4 heads)1.27 ± 0.310.84 ± 0.075.31× ± 0.79

The finding is paradoxical and tightens the wedge framing again: PMA produces 2.6× higher attention concentration but 2.1× worse MAE. Sharpness alone is not the right diagnostic — over-concentration is overfitting. PMA peaks on 1–2 atoms and discards the neighborhood context that the §4 “extended-localized” regime requires (the same mechanism that makes oracle’s binary mask lose on HE-CoOOH).

So the wedge framing now has a third necessary condition:

conditionfound in
attention concentration ≥ 2×original §4 wedge
training set N ≥ ~40 bagsE5 (N-scaling)
attention concentration ≤ ~5×E4 (PMA pool, this finding)

The operational sweet spot is a sharpness band: 2× ≲ c ≲ 5×. Ilse-2018 gated MIL produces concentrations 2.7× – 3.7× depending on substrate, smack in the middle of the band. The simpler 2018 pool wins not because PMA is broken in some way, but because Ilse-2018’s softmax-over-gated-linears has an implicit sharpness regularizer that PMA’s multi-head attention with learnable seeds doesn’t have. In small-data regimes (N=80 train bags), that regularizer is load-bearing. PMA might catch up at Sci Adv scale (N=4,822); within the current cache it cannot. Writeup at materials-nlp/e4_pma_result.md.

Fifth controlled axis — the extended-localized mechanism itself. The original locality synthetic above contradicted HE-CoOOH: oracle dominated synthetic but lost on materials. I hypothesized that’s because the materials signal leaks from the dopant into its coordination shell, which the binary instance mask discards. To test it under control, ran a third synthetic (scope_spread_signal.py) with a tunable spread decay — each core signal instance puts weight 1 on itself, decay**r on neighbors at distance r — and held locality fixed at lf=0.10:

spreadmeangated MILoracle (core)oracle (core + ±2)
0.000.0060.0070.0060.014
0.100.0100.0100.0100.024
0.250.0160.0150.0230.039
0.500.0180.0150.0510.052
0.750.0190.0140.0870.072
1.000.0190.0140.1230.102

The transition is sharp and the mechanism is now grounded. At spread=0 (pure localized) oracle wins, matching original E2. At spread=0.25 oracle starts losing — by spread=1.0 gated MIL beats oracle by 9×. Even the “extended” oracle that includes core + ±2 neighbors loses to gated MIL at moderate spread: including the right set of atoms isn’t enough when the contribution decays with distance, because a binary mask cannot represent weights.

The HE-CoOOH 1.8× MIL/oracle gap corresponds to synthetic spread ≈ 0.3 — physically plausible for dopant-induced perturbations propagating one to two coordination shells in transition-metal oxides. So the “extended-localized” regime is no longer a phenomenological label; it’s a mechanism with a tunable synthetic parameter, a measured crossover threshold (~0.2), and a materials operating point (~0.3) consistent with the physical intuition. Writeup at materials-nlp/e2_spread_result.md.

Sixth controlled axis — dopant density. Re-ran the same FFN MIL / mean-pool / oracle comparison on the 3-element doping cache (he_coooh_3element.pt, N=100, 3 dopants per slab instead of 2 from the same 9-element pool). The result is the most inconvenient finding of the session:

pool2-elem MAE3-elem MAEdirection
mean-pool1.27 ± 0.170.57 ± 0.09mean-pool got 55% better
gated MIL0.75 ± 0.120.64 ± 0.14MIL got 14% better
oracle1.50 ± 0.471.88 ± 0.15oracle got worse

So at 3-element doping, mean-pool beats gated MIL (0.57 vs 0.64; mean/MIL = 0.88×, flipped from 1.69× at 2-element). The MIL-vs-oracle gap intensifies (2.0× → 2.9×); top-1 dopant hit improves (0.88 → 0.98); attention concentration stays in the operational band (3.7× → 2.9×).

The §4.2 “MIL beats mean-pool” claim turns out to be regime- specific to low dopant density. Mean-pool benefits dramatically from more signal-bearing atoms (3/48 ≈ 6.25% locality vs 2/48 ≈ 4.2%); MIL was already extracting most of the signal at 2-element. The mean/MIL ratio flips around dopant density ~5% per bag.

The corrected wedge has two MIL advantages on different axes:

The interpretability claims (top-1 hit, attention concentration, sharpness) survive intact across both density regimes. The accuracy claim narrows. Sci Adv 2025’s HE-CoOOH has 2-4-element compositions; a faithful reproduction needs density-stratified reporting. Writeup at materials-nlp/e8_3element_result.md.

Seventh controlled axis — cross-density transfer is catastrophic. E8 measured within-density behavior on each cache separately. E9 asks the deployment-relevant question: can a model trained on one density regime generalize to the other? Trained FFN MIL + mean-pool + oracle on each cache, evaluated on the other:

cellmeanMILoraclemean/MIL
2→2 (within)0.97 ± 0.010.50 ± 0.060.61 ± 0.041.95× (MIL wins)
3→3 (within)0.54 ± 0.120.49 ± 0.160.82 ± 0.061.09× (MIL wins narrowly)
2→3 (transfer)5.21 ± 1.657.01 ± 2.3418.68 ± 0.760.74× (mean wins)
3→2 (transfer)3.00 ± 0.3610.10 ± 1.638.37 ± 0.380.30× (mean wins big)

Cross-density transfer is catastrophic across every pool — MAE jumps from sub-eV (within) to 3-19 eV (across), a 5-30× degradation depending on pool. The MIL/mean ratio inverts: mean-pool wins both transfer directions, by 1.4× and 3.3×. Mean-pool is the most density-robust (5.5× degradation vs MIL’s 17× and oracle’s 20×); simpler pools with fewer parameters extract more density-invariant signal.

So the §4 accuracy claim is bounded twice over — first by regime (E8: works at 2-elem, fails at 3-elem when each is trained independently), and then by training distribution (E9: even within a regime, MIL trained on a different density is catastrophically worse than mean-pool). The wedge is a within-distribution claim at a specific density. The interpretability claims (sharpness, AOPC, top-k recall) measure properties of the attention map and are the natural candidates to survive the transfer collapse — but that’s a hypothesis E9 doesn’t test directly. The natural conclusion:

“Attention-MIL produces interpretable per-site importance maps that are robust to dopant density. Its bag-level accuracy advantage over mean-pool is regime-specific and training- distribution-specific. The paper’s primary contribution is most reliably an interpretability contribution; the accuracy contribution is a benchmark in a specific regime that does not transfer.”

Writeup at materials-nlp/e9_density_transfer_result.md.

Eighth controlled axis — keystone: the attention map survives cross-density transfer. E9 only measured bag-level MAE on the transfer cells. The natural follow-up: does the interpretability output also collapse, or does it survive? Trained FFN MIL on each density and measured top-k dopant recall + attention concentration on the held-out other density:

cellMAEtop-1 hittop-3 hitconcentration
2→2 (within)0.50 ± 0.060.854 ± 0.0440.876 ± 0.0393.97×
3→3 (within)0.49 ± 0.160.974 ± 0.0330.986 ± 0.0282.73×
2→3 (transfer)7.01 ± 2.340.960 ± 0.0550.962 ± 0.0563.08×
3→2 (transfer)10.10 ± 1.630.860 ± 0.0650.892 ± 0.0562.88×

Net: bag-level MAE worsens 17× under transfer; top-1 dopant hit changes by +0.004 (essentially identical); attention concentration stays at ~3× in all four cells, well above the 2× operational floor.

The 2→3 transfer is the cleanest demonstration: a model trained only on 2-element bags achieves top-1 dopant hit 0.96 on 3-element bags — higher than its within-distribution top-1 of 0.854 — while bag MAE collapses from 0.50 to 7.01 eV. The attention mechanism learns a task-generic skill (“find dopant atoms”) that transfers; the head learns to map the pooled bag-vector to a scalar prediction, which depends on the per-bag density distribution and does not transfer.

This is the keystone result for the §4 messaging shift that came out of E3 + E8 + E9. The paper’s primary contribution is now empirically grounded:

“Attention-MIL produces interpretable per-site importance maps that are density-invariant and training-distribution-robust: top-1 dopant recall and attention concentration stay within ±0.5% of within-distribution performance under cross-density transfer, even when bag-level MAE collapses by 17×. The accuracy advantage is regime-specific; the interpretability advantage is regime-invariant. The paper’s primary contribution is the interpretability claim, which generalizes; the accuracy contribution is a within-distribution benchmark.”

Writeup at materials-nlp/e10_interp_transfer_result.md.

Caveat on E10: AOPC does not cleanly survive transfer (scope_aopc_transfer.py, E12). Re-ran the §4.6i AOPC test on the same four train→eval cells. The transfer cells are asymmetric and unreliable:

cellAOPC AUCE10 top-1 hit (same cell)
2→2 (within)0.79 ± 0.260.85
3→3 (within)0.94 ± 0.130.97
2→3 (transfer)0.45 ± 0.15 (collapses)0.96
3→2 (transfer)1.52 ± 0.35 (inflates above any within)0.86

The 3→2 AOPC inflation isn’t faithfulness winning — it’s the model being far from saturation (E9 MAE 9.34 eV), so ablating any atom moves the wildly-wrong prediction by a large absolute amount. The 2→3 AOPC collapse is the inverse: the model has saturated on a constant-ish prediction and atom ablations don’t move it much, even though attention is correctly identifying dopants (top-1 hit 0.96).

The honest reading: AOPC conflates “faithful attribution” with “prediction is far from saturation”, and under cross-density transfer the bag-level prediction itself collapses (E9 finding) — so AOPC becomes uninformative about whether attention is correctly attributing. The E10 keystone is preserved but narrowed: the interpretability-survival claim is two-pillar (top-k recall + attention concentration), with the within-distribution AOPC advantage from §4.6i as a separate finding. Writeup at materials-nlp/e12_aopc_transfer_result.md.

Ninth controlled axis — mixed-density training closes the loop. E9 said cross-density transfer is catastrophic; E10 said interpretability survives it anyway. The practical follow-up: if you train on the union of densities, do you recover within-density accuracy? Three train regimes (2-only, 3-only, mixed) × two val regimes (2-elem, 3-elem), 5 seeds:

train2val MAE3val MAE2val top-13val top-1
2-only (specialist)0.756.83 (transfer)0.880.97
3-only (specialist)9.34 (transfer)0.640.900.98
mixed (2+3 union)0.930.810.860.97

Mixed-density training rescues accuracy at ~25% penalty over specialists and preserves interpretability. From E9’s catastrophic transfer (6.8 / 9.3 eV) to within-distribution-grade accuracy (0.81 / 0.93 eV) — an 8.4-10× MAE improvement, achieved just by training on the union. Top-1 dopant hit is essentially identical to specialists (0.86-0.97 across all six cells); attention concentration stays in the 2.55-3.69× operational band across all cells.

So the complete cross-density story (E8 → E9 → E10 → E11) is:

The deployment recommendation is now actionable: train on the density union, evaluate density-stratified, and use the attention map for interpretation regardless of train/eval mismatch.

Writeup at materials-nlp/e11_mixed_density_result.md.

Tenth controlled axis — quantile calibration is moderate within distribution, breaks under transfer. The §4 paper and experiment- spec both claim the quantile head delivers “calibrated uncertainty for free” for the Paper-2 BO acquisition function. Nothing in the session had actually trained or evaluated a quantile model until now. Trained FFN MIL with 5-quantile pinball loss and measured reliability + ECE on all four cross-density cells:

cellmedian MAEECE
2→2 (within)0.91 ± 0.250.156 ± 0.07
3→3 (within)0.81 ± 0.150.176 ± 0.07
2→3 (transfer)4.77 ± 2.100.312 ± 0.10
3→2 (transfer)12.81 ± 2.770.497 ± 0.01 (saturated)

Within-distribution ECE ≈ 0.16. The reliability curve tracks the ideal diagonal in shape but is biased high in the middle and slightly under-confident at the extremes — the textbook “needs Platt/isotonic post-hoc recalibration” pattern. Achievable but not free; the original “calibrated uncertainty for free” claim isn’t empirically supported.

Under transfer the calibration collapses: the 3→2 cell has empirical coverage saturated at 1.0 across every nominal quantile — every true 2-elem value falls below the lowest predicted quantile, because the 3-elem-trained model predicts wildly too-high values. The 2→3 cell flattens at ~0.25-0.32 coverage across all quantiles — the predicted range is too narrow and miscentered.

The 5-metric cross-density survival summary (combining E9, E10, E12, E6):

metrictypewithintransferverdict
bag-level MAEprediction-side0.5 eV5-30× worsecollapses
top-1 dopant hitattention-map0.910.91survives
attention concentrationattention-map3.4×3.0×survives
AOPC AUCprediction-sensitivity0.79-0.940.45 / 1.52asymmetric collapse
calibration ECEprediction-side0.160.31-0.50collapses

Two-pillar transfer-robust interpretability vs three transfer- fragile prediction-side metrics. The interpretability claim generalizes across density; the accuracy / faithfulness / calibration claims are within-distribution only with known practical mitigations (mixed-density training, post-hoc recalibration). Paper-2’s BO pitch needs both recalibration AND mixed-density training to deliver actionable uncertainty estimates. Writeup at materials-nlp/e6_calibration_result.md.

Eleventh controlled axis — mixed-density training rescues calibration too. Same pattern as E11 (which rescued accuracy): train the quantile head on the 2+3-elem union, evaluate on each density. ECE drops from the catastrophic transfer values (0.30 / 0.50) to within-specialist levels (0.18 / 0.17):

cellmedian MAEECE
2-only specialist on 2val0.91 ± 0.250.156
3-only specialist on 3val0.67 ± 0.200.184
2-only on 3val (transfer)4.370.296
3-only on 2val (transfer)13.510.498 (saturated)
mixed on 2val0.880.180
mixed on 3val0.810.174

The reliability curves for the mixed-trained model on both val sets hug the ideal diagonal alongside the specialists. The catastrophic transfer collapse is fully rescued. Combined with post-hoc Platt/isotonic recalibration (a known fix that drops within-distribution ECE under 0.05), this delivers the calibrated uncertainty the Paper-2 BO acquisition function needs.

Complete Paper-2 deployment recipe (E10 + E11 + E13 combined):

  1. Mixed-density quantile training → recovers both bag-level accuracy and quantile calibration on every density in the union; ~25% MAE penalty and ~4% ECE penalty over specialists.
  2. Post-hoc Platt/isotonic recalibration → reduces ECE further. (Verified by E14 below — actual reduction is 25-44%, not “under 0.05” as earlier writeups optimistically claimed.)
  3. Attention-map aka_k as per-site importance for BO acquisition → guaranteed regime-invariant by E10 keystone; survives cross-density transfer at +0.004 top-1 hit drift.

Three steps, all empirically grounded. The original “calibrated uncertainty for free” claim from the experiment-spec becomes “calibrated uncertainty in three concrete steps, with measured caveats from E14 below.” Writeup at materials-nlp/e13_mixed_calibration_result.md.

Twelfth controlled axis — measuring the recalibration step (honest correction). Three prior writeups (E6, E12, E13) all claim “Platt/isotonic recalibration should drop ECE under 0.05.” That claim was never measured — only promised. E14 (scope_platt_recalibration.py) ran the actual recalibration step. Trained the mixed-density quantile head on 60% of each cache, held out 20% as calibration fold for fitting an isotonic recalibrator, then evaluated ECE on the remaining 20% val. 5 seeds.

cellpre-recal ECEpost-recal ECEreduction
2val0.164 ± 0.0540.092 ± 0.026−0.072 (44%)
3val0.198 ± 0.1530.148 ± 0.097−0.050 (25%)

The “under 0.05” claim was overconfident. Isotonic recalibration does help meaningfully — 44% ECE reduction on 2val, 25% on 3val — but post-recalibration ECE is 0.09 (2val) and 0.15 (3val), well above the conventional 0.05 target. The reliability curves visibly move toward the ideal diagonal (Figure 21) but don’t reach it. Two diagnosable causes: small calibration set (N=20 per density makes the isotonic fit noisy), and direction-dependent quantile bias that a single global isotonic map can’t fully address.

Corrected Paper-2 calibration claim: post-hoc isotonic recalibration reduces ECE from 0.16-0.20 to 0.09-0.15 — a 25-44% improvement. The 0.05 target is not reached at this data scale; uncertainty bands are biased by 10-15% in absolute coverage. Workable for BO acquisition with explicit uncertainty-aware rules (e.g., conformal wrapping on top of recalibrated quantiles), not pristinely calibrated. The cumulative ECE journey through the deployment recipe is 0.40 (naïve transfer) → 0.18 (mixed training) → 0.09-0.15 (mixed + recalibration) — a 2.7-4.4× improvement overall, with a clearly-flagged residual gap from pristine calibration. Writeup at materials-nlp/e14_platt_result.md.

Thirteenth controlled axis — and E14’s diagnosis was wrong too. E14 attributed the under-target ECE to “direction-dependent quantile bias that a unified isotonic calibrator can’t fully address.” E15 (scope_per_density_recal.py) tested that diagnosis by fitting separate isotonic recalibrators per density:

recalibration2val ECE3val ECEavg
pre-recal0.164 ± 0.0540.198 ± 0.1530.181
unified isotonic (E14)0.092 ± 0.0260.148 ± 0.0970.120
per-density isotonic (E15)0.110 ± 0.0420.146 ± 0.0750.128

Per-density recalibration is slightly worse than unified (+0.008 ECE, +6.7%) — within noise. The diagnosis is wrong: the bottleneck is calibration data size, not calibrator design. Splitting the 20-sample calibration set in half (N=10 per density) hurts the per-density isotonic fits more than the direction-dependent bias hurts the unified compromise. The unified calibrator’s “average” between two opposing bias patterns turns out closer to the diagonal than either density-specific fit.

The honest cumulative diagnosis is one factor, not two: small calibration set (N=20 unified) is the limit. Reaching the 0.05 target needs more data — either a larger held-out fold (consuming train data) or cross-validated calibration. Per-density calibrators don’t help at this scale.

The §4.6n recipe should specify unified isotonic recalibration, not per-density. The accompanying caveat: ECE 0.09-0.15 is what the current data scale supports; cross-validated or larger-N calibration would close the residual gap to 0.05. Writeup at materials-nlp/e15_per_density_recal_result.md.

Fourteenth controlled axis — conformal prediction is the clean rescue. After two honest corrections (E14 retired “ECE under 0.05”; E15 retired “direction-dependent bias as the limit”), the question becomes: is there any calibration strategy that delivers the promised uncertainty at this data scale? Split conformal prediction (Romano-Patterson-Candès 2019) offers a different trade than isotonic: finite-sample marginal coverage guarantee at any chosen level, at the cost of wider intervals.

Same mixed-density quantile MIL as E13/E14/E15; target α=0.20 (nominal 80% coverage); split conformal calibration on the same 20% held-out fold; evaluated on the other 20% val.

metricpre-conformalpost-conformal
2val coverage0.62 ± 0.060.78 ± 0.13
3val coverage0.62 ± 0.190.82 ± 0.09
2val width (eV)2.09 ± 0.213.21 ± 0.97 (1.54× wider)
3val width (eV)1.96 ± 0.263.08 ± 1.00 (1.57× wider)

Conformal hits the 0.80 nominal coverage target exactly (0.78 + 0.82 = avg 0.80) on both densities. The cost: intervals are 1.55× wider. This is the cleanest rescue of the session — after the two honest corrections, conformal delivers a guaranteed result by construction.

The Paper-2 deployment recipe now has two valid options at step 2:

For most BO applications, 2b is the right choice — the coverage guarantee makes the optimizer behave correctly under the uncertainty estimate. Writeup at materials-nlp/e16_conformal_result.md.

Fifteenth controlled axis — and a third honest correction. E15 diagnosed “calibration data size is the bottleneck.” E17 (scope_kfold_conformal.py) tested that by running K=5 cross-conformal — 4× more conformity scores via leave-fold-out training:

metricsplit (n=40)K-fold (n=160)Δ
avg coverage0.8050.775−0.030
avg width3.01 eV2.77 eV−0.24 eV (−8%)

K-fold narrows intervals by 8% but at the cost of 5× more compute and slightly higher 3val variance. E15’s diagnosis is partly right but not magnitude-strong: calibration data size is one of two bottlenecks. The other is model variance from the small N=80 training set — each K-fold model is trained on only 64 bags and is noisier than the full-train model, and that variance limits how much the pooled τ can shrink.

This is the third honest correction in the session:

For Paper-2 at this data scale, split conformal (E16) is the right choice — simpler, comparable performance, lower compute. K-fold cross-conformal is a marginal refinement to flag for when data scales up. Writeup at materials-nlp/e17_kfold_conformal_result.md.

Sixteenth controlled axis — deep ensemble closes the model-variance bottleneck (at coverage cost). E17 diagnosed two bottlenecks: calibration data scarcity AND model variance from N=80 training. K-fold cross-conformal addressed (1); E18 tests the model-variance side directly by training K=5 FFN MIL quantile models with different inits, averaging their predicted quantiles, then split-conformalizing the averaged predictions.

The three calibration strategies now form a clean Pareto curve on the same Paper-2 surrogate:

strategyavg coverageavg widthtrade
E16 split (1 model)0.803.15 eVstrict coverage
E17 K-fold0.782.77 eVbalanced (−12%)
E18 deep ensemble0.752.42 eVwidth-first (−23%, 5pp miscoverage)

Ensembling narrows intervals 13% further beyond K-fold and 23% beyond E16 baseline — measuring that model variance was indeed a real bottleneck, as E17 diagnosed. The cost: coverage drops 5pp. At N=40 calibration the τ estimate is noisy (per-seed range −0.22 to +0.14 eV) and doesn’t fully compensate for the narrower raw ensemble interval.

The Paper-2 calibration recipe is now a documented Pareto choice, not a single fixed strategy:

Writeup at materials-nlp/e18_deep_ensemble_result.md.

Seventeenth controlled axis — E19 closes the Pareto curve. Combined K=5 cross-conformal × M=3 ensemble: 15 trained models per seed for the K-fold conformity pooling, plus a 3-ensemble trained on all train+calib for test-time prediction. The full Pareto picture:

strategyavg coverageavg width
E16 split (1 model)0.803.15 eV
E17 K-fold (5 models)0.782.77 eV
E18 ensemble (5 models)0.752.42 eV
E19 K-fold × ensemble (18 models)0.842.46 eV

E19 strictly dominates the simpler strategies on aggregate coverage at near-narrowest width — coverage 0.84 above the 0.80 nominal target, width 2.46 eV within 2% of E18’s narrowest 2.42 eV. The combined strategy doesn’t average the individual effects; it delivers ensemble’s near-narrowest width AND K-fold’s coverage stability simultaneously.

So the Paper-2 calibration recipe collapses from a Pareto choice to a single recommended strategy:

“Use K=5 cross-conformal on an M=3 ensemble of quantile MIL models. Empirical coverage 0.84 at α=0.20 (above nominal 0.80) with interval width 2.46 eV. Compute cost: 18 model trainings per deployment epoch, amortizable across the BO campaign.”

Per-cell asymmetry remains (3val 0.95 over-conservative, 2val 0.73 under) but aggregate coverage 0.84 ≥ 0.80 meets spec. A per-density conformal τ is the natural further refinement.

Writeup at materials-nlp/e19_kfold_ensemble_result.md.

Eighteenth controlled axis — per-density conformal τ flattens the E19 per-cell asymmetry. Split conformity scores by density, compute τ_2elem and τ_3elem separately, apply per-density at test time:

metricE19 pooled τE20 per-density τΔ
2val coverage0.730.80 (nominal)+0.07
3val coverage0.950.950.00
2val width2.46 eV2.78 eV+0.32
3val width2.45 eV2.40 eV−0.05
aggregate coverage0.840.875+0.035
dispersion0.220.15−32%

Per-density τ rescues 2val coverage to exactly nominal while preserving 3val. Per-cell dispersion drops 32%. Cost: 13% wider 2val intervals (τ_2 = 0.39 > pooled 0.24 gives 2val the cushion it needed). 3val stays over-conservative because the raw ensemble interval is genuinely broad; conformal can’t tighten it without losing coverage.

The recipe now has a clean choice at the calibration step:

variantwhen to use
E19 pooled τaggregate-coverage deployment
E20 per-density τdensity-stratified deployment

For most BO applications, E20 per-density τ is the cleaner default — per-cell nominal coverage at modest width cost, with the trade-off documented rather than implicit.

Writeup at materials-nlp/e20_per_density_conformal_result.md.

Nineteenth controlled axis — α-sweep confirms the recipe holds across BO confidence levels. Extended the 5-quantile head to a 9-quantile head (0.975) and swept α ∈ 0.2 — the confidence levels real BO acquisition functions actually use:

αnominal2val cov3val cov2val width3val width
0.050.950.960.955.23 eV4.36 eV
0.100.900.910.884.02 eV3.40 eV
0.200.800.770.762.96 eV2.60 eV

Empirical coverage tracks nominal within ±0.04 at every α. Both density reliability curves hug the ideal diagonal across the range. Width grows monotonically: 2.6 → 3.4 → 4.4 eV on 3val; 3.0 → 4.0 → 5.2 eV on 2val.

For Paper-2 BO deployment, the choice is concrete:

The 9-quantile head (vs the original 5-quantile) is the right deployment choice since it covers the full α range without retraining. The α=0.20 cell undershoots by 3.5pp (within per-seed std); padding α slightly at calibration time would push it back to nominal — a deployment knob.

Writeup at materials-nlp/e21_alpha_sweep_result.md.

The experiment, concretely

This is the six-week plan that would demonstrate (or kill) the wedge. Full version with risk register and decision rules lives at materials-nlp/experiment-spec.md; here is the load-bearing summary.

Datasets, in layers.

Dataset-vs-architecture risk (own the awkward case). HE-CoOOH is the same dataset the Sci Adv 2025 EquiformerV2+Post-Att Adapter paper was tuned on. Two ways this can go sideways: (i) their architecture is uniquely well-matched to the dataset (constructed for it), in which case any reasonable alternative — including our MIL pool — will underperform on bulk MAE even if it wins on per-site importance; (ii) the dataset itself is architecture-agnostic and our wedge will show up cleanly. Sequencing accordingly: reproduce their headline overpotential MAE with UMA-direct (no MIL pool) before claiming any TempoSurfViT win. If UMA-direct lands within 10% of the Sci Adv 2025 number, the dataset is fair to compare on; if it lands much worse, the gap is architecture-coupling, not a genuine signal, and we should add a second downstream (Materials Project formation energy, or solid-electrolyte Li-conductivity) before publishing.

Two models.

Two primary metrics, pre-committed.

  1. Overpotential MAE on a held-out HE composition family (novel-composition transfer test).
  2. Per-site importance recall on 100 hand-curated structures where DFT-computed per-site activity contributions exist — does the top-3 attended site overlap with the physics-identified active site? Spearman correlation between aka_k and DFT activity.

Decision rule.

OutcomeAction
Win on MAE and importanceStrong paper; write up
Match MAE, win on importanceDefensible paper; lead with interpretability
Lose MAE by ≤10%, win on importanceWorkshop / methods note
Lose MAE by >10%Rescue with bag-level MAE pretrain or pivot
Lose bothPivot to single-cell per §5

Compute / timeline. 24 h on a single A100 for one full bake-off run ($30–60 on RunPod / Lambda). Six-week schedule, one decision point per week: HE data secured → baseline reproduced → first end-to-end Ours number → importance evaluation infrastructure → ablations → writeup. The Apple-Silicon dev box can do the aggregator + head locally; UMA’s equivariant kernels currently run CPU-only on macOS (no MPS acceleration for the e3nn-style ops), so scaling past ~100 bags wants a cloud GPU.

Pre-specifying the decision rule means the answer is informative either way: a win is the Framing-C paper, a loss is honest data for the single-cell fallback (or GWAS, parked).

Beyond the bake-off: closing the synthesis loop

The Sci Adv 2025 paper doesn’t stop at “predict overpotential well.” It screens 17,500 candidate compositions, picks eight predicted-top ones, runs automated synthesis on those, and lands on TiFeNiZn- CoOOH at 263 mV/dec experimental OER overpotential. That closed loop — predict → screen → synthesize → measure — is what made it a Science Advances paper rather than a methods note.

An attention-MIL model has two structural advantages for the same loop:

  1. The quantile head gives calibrated uncertainty for free — the TempoSurfViT recipe we’re reusing already trains a 9-quantile pinball head, which drops directly into any standard acquisition function (EI, UCB, Thompson) for “which composition to synthesize next.” No additional engineering.

  2. The aka_k map tells the chemist what to vary, not just whether to synthesize. A standard surrogate says “predicted overpotential = X ± σ”; ours says the same plus “and the activity is concentrated on the Sr-substituted sites, so the next composition should perturb those.” That’s a different kind of recommendation — a hypothesis a synthesis chemist can act on without an interpretability decoder bolted on after.

The natural shape of the line of work is therefore two papers, not one. Paper 1 is the §4 bake-off: match UMA-direct on MAE, beat it on per-site importance recall. Paper 2 is the loop closure: end-to-end MIL-driven Bayesian optimization on a real (not synthetic) HE-catalyst screening budget, with experimental wet-lab validation on the top-k recommended compositions. Paper 1 is table-stakes; paper 2 is what makes the line substantively novel.

Contingency on Paper 2. Paper 2 is not a natural rollover from Paper 1 — it requires a synthesis collaborator with HE-catalyst lab capacity (precursor handling, electrochemical OER testing rig, ≥1–2 month turnaround per batch of ~8 compositions), plus the funding to actually run the synthesis. Absent that collaborator, Paper 1’s per-site importance recall figure (Spearman vs DFT-computed activity contributions on the 100-structure curated set) stands on its own as the deliverable. The closed-loop framing is the ambition; the per-site importance recall is the floor we commit to.

5. Open questions before committing

Single-cell genomics as the named fallback

If the materials port hits blockers on rotation-equivariance or DFT-to-experiment transfer, single-cell genomics is the natural pivot. An adjacent-domain audit (working file: materials-nlp/adjacent-domains.md) found scRNA-seq / scATAC shares 3.5/4 of the same structural checks: a sample is a bag of cells, each cell carries a per-cell expression vector, phenotype labels live at the sample level, and per-cell importance is the canonical scientific question. Engineering would reuse everything except the per-instance backbone.

The wedge vs. scGPT / Geneformer / scFoundation is the MIL aggregator on top of an existing single-cell foundation model — those models currently treat each cell independently then pool by averaging, which throws away the per-cell importance signal.

6. Where I’m reading next

ATGC end-to-end (the most directly portable piece of the bio literature) is still the highest-priority dive. After today’s wedge result, the Sci Adv 2025 SI (paywalled at time of writing) is the next blocking item: it determines whether the “primary output vs post-hoc” framing in §3/§4 holds or needs softening. After that, the engineering priorities pre-empt more reading — the 100-structure per-site importance evaluation set (§4 primary metric #2) is what the bake-off currently lacks, and ships before any further literature pass.

7. Sources

Methodology lineage

Materials prior art we’d be beating / engaging with

Molecular sequence models (Framing A prior art)


Share this post on: