First-Class Per-Site Importance in Materials: Attention-MIL on a Frozen Pretrained Backbone

State of play — 2026-05-19. Manuscript in preparation for ICML 2027.

Results

On a curated twelve-entry hold-out (Pt/Cu/Ni/Pd/Ag/Au/Rh/Ir across H/CO/O on (111)/(100) facets) evaluated across five seeds, the trained MIL recovers experimentally annotated active sites at 70.3% top-3 recall (95% CI [63.3, 77.0]) against a permutation null of 9.8% [9.2, 10.4]. Attention concentrates on labeled active sites at 2.32× the uniform baseline of 1.02×, and top-5 recall reaches 80.2% [72.5, 87.5] versus 15.4% — placing the result inside the acceptable band of the pre-committed decision rule (60–80% on top-3 recall, matching DeepTCR / ATGC bio baselines from the original Sidhom line), with top-5 recall at the strong-band threshold. Three off-train elements (Pd, Rh, Ir) recover at in-distribution-equivalent rates, supporting the cross-element generalization claim. Faithfulness, measured by AOPC on HE-CoOOH dopants, improves by +0.12 over the strongest post-hoc attribution baseline.

	trained MIL	null / next-best
per-site top-3 recall (5 seeds × 12 entries pooled)	70.3% [63.3, 77.0]	9.8% [9.2, 10.4]
attention on active sites	2.32× uniform	1.02×
top-5 recall (touches the strong band)	80.2% [72.5, 87.5]	15.4%
AOPC faithfulness on HE-CoOOH dopants	+0.12	vs next-best post-hoc

Extending the seed set from 6 → 12 entries tightened the top-3 recall CI by ~27% (19pp → 14pp) while preserving the central tendency at ~70% — empirically confirming the $\sqrt{12/6} \approx 1.41\times$ tightening predicted by curation arithmetic.

The central finding

Under 5-fold cross-validation across three signal regimes — distributed (OC22 per-atom oxide energy), fully-localized (single-adsorbate energetics), and extended-localized (HE-CoOOH dopants) — attention-MIL is the cross-validated winner in every regime. The energy gap to the next-best baseline scales monotonically with attention concentration: 0.02 → 0.11 → 0.36 eV at concentrations of 1.6× → 2.66× → 3.25×, respectively. This supersedes the earlier three-regime trichotomy (mean-pool dominant on distributed signal, oracle dominant on fully-localized signal), which we now attribute to single-split variance. The cross-validated picture is consistent: MIL wins across regimes, and its margin scales with how concentrated the underlying signal is.

Remaining ablations for submission

Further expansion of the curated seed set from 12 → ≥30 entries (would tighten the CI by another factor of $\sqrt{30/12} \approx 1.6\times$ ); incorporation of HE-CoOOH active-site annotations from the forthcoming Sci. Adv. 2025 SI; a Path-3 DFT pass (≈$100 cloud) to activate the Spearman-ρ ranking metric; and a LoRA-tuned UMA ablation to isolate the contribution of frozen-backbone features.

2026-05-16. Spent the morning trying to figure out what was new about the 2025 Sidhom LLM for cancer mutations. Four hours later I was deep in his GitHub — DeepTCR, then DeepTCR_Cancer, then ATGC out of the Adams lab.

I’m not an immunologist, so the biology I can’t really judge. What hooked me was the architecture. The same recipe kept showing up across problems that look nothing alike — amino acids, mutation contexts, cell morphologies — all running through the same stack with just the input alphabet swapped. And the code is genuinely readable. DeepTCR is one file you can walk through end-to-end, which is rare in ML-for-biology and made the whole thing feel like something I could lift somewhere else.

Which raises the question I’m chewing on here: if it’s already substrate-independent across biology, is materials science the next substrate?

Background

Two ideas do most of the work below.

The shape of the problem. A lot of scientific predictions look like this. You have a patient and a list of their thirty-odd somatic mutations; or a biopsy slide and a million image patches; or a doped catalyst and a list of defect sites. The label is on the whole thing. The data is a bag of smaller things. You don’t know up front which of the small things mattered. This is multiple-instance learning (MIL). The classic approach — hand- craft features per item, pool them, regress the bag-level label — throws away the question scientists actually want answered: which item drove the prediction.

The fix, ~2018. Ilse, Tomczak & Welling (Amsterdam) published an attention-based pooling layer that learns, end-to-end, a weight $a_k$ for each item and combines them as a weighted sum. Three properties matter: per-item features are learned, not engineered; training is fully end-to-end; and the $a_k$ weights themselves are the interpretability output — the model tells you which mutation, patch, or defect it leaned on. Whether attention weights are strictly more faithful than gradient attributions is a live debate (Jain & Wallace 2019); the practical point is they fall out of the architecture for free.

Pathology picked it up first (CLAM, TransMIL), then immunology (DeepTCR, DeepTCR_Cancer), then oncology (ATGC, then the 2025 Sidhom LLM). Each time the model beat the hand-engineered baseline and the $a_k$ maps were scientifically usable.

That combination — learned per-item features + attention-MIL + sample- level labels — is what makes foundation-model-style progress possible anywhere a sample is a variable-size bag rather than a fixed-shape image or sentence. Most of materials science is shaped like that. The rest of this note works out the port.

Origin: NLP + a non-NLP aggregator

The recipe is two lineages welded together. The first four steps (vocabulary → learnable embedding → masked-token pretrain → fine-tune) are the NLP playbook of the last decade. The fifth step — the attention-MIL aggregator — is not NLP; it comes from the computer- vision / weakly-supervised line (Ilse 2018, then pathology: CLAM, TransMIL). NLP gives you the per-item representation, MIL gives you the aggregator, and most scientific problems happen to need both.

The NLP half has been ported to biology one substrate at a time. Pick a discrete vocabulary, give each token a learnable vector, let context teach the model what it means, pretrain by masking random tokens (BERT 2018), then fine-tune. Proteins inherited it (ProtBERT 2020, Meta’s ESM; AlphaFold’s Evoformer is adjacent — it uses MSAs and invariant-point attention, more structured). DNA inherited it (DNABERT 2020, Nucleotide Transformer 2023). T-cell receptors did (DeepTCR). Tumors did (ATGC, the 2025 Sidhom LLM). Only the alphabet changes.

Figure 1. The “biology-as-language” research program in one picture. Every row uses the same architecture on the right; only the discrete vocabulary on the left changes. The last row — defects in a doped crystal — is the bet this note is testing.

Sidhom’s framing for his own line is “the language of cancer.” Each mutation is a word, each tumor a document, co-occurring mutations are syntax. The dual-attention architecture maps cleanly onto a sentence- then-document hierarchy: local attention over the DNA context around a mutation, then global attention over the bag of mutations in the tumor. The metaphor is loose where it matters most — documents have order, mutation bags don’t — which is why the permutation-invariant MIL aggregator (not the transformer) is the load-bearing piece. The pretrain is BERT taken to the extreme: BERT masks ~15% of tokens; the 2025 model masks 100% of the altered sequence and reconstructs it.

The materials port is the same move once more. Vocabulary becomes defect sites, Wyckoff sites, or monomers. “Sentence” becomes the local coordination shell. “Document” becomes the material sample. Pretrain becomes masked-defect reconstruction on computed structures. If “a tumor is a document whose words are mutations” is a productive frame, “a doped crystal is a document whose words are defects” is the analogous bet.

1. The methodology, in domain-neutral form

Figure 2. The recipe in pipeline form. Top labels are the primitives; bottom labels are what each one resolves to in immunology / oncology / materials (the three substrates discussed below).

Five primitives, each load-bearing:

Trainable token embedding from scratch. Pick a discrete unit (amino acid, mutation, dopant species). Give it a learnable vector. Let context teach the model what it means. No hand-crafted descriptors.
Variable-length per-instance backbone. Each instance is a short variable-length sequence — CDR3, mutation context, defect coordination shell. CNN works (DeepTCR), Transformer works (the 2025 LLM). The backbone returns one dense vector per instance.
Side-information / metadata channel. Categorical context that isn’t part of the sequence itself — V/D/J gene, MHC allele, tissue of origin — is embedded separately and fused with the sequence representation. The materials analog is space group, lattice type, synthesis route, processing history.
Two-level attention.
- Local (sequence-aware) attention: order-sensitive, captures what the token means in immediate context.
- Global (permutation-invariant) attention: aggregates instances across the sample, captures co-occurrence without imposing a spurious order.
Attention-based multiple-instance learning (MIL). Labels live at the sample level (patient outcome, tumor type, immunotherapy response). The model must aggregate a bag of per-instance vectors into a single sample-level prediction with per-instance importance weights you can read off. This is the part most people don’t import from NLP, and it is what turns “model that embeds one mutation” into “model that diagnoses a tumor.”

Bonus, used heavily in the 2025 model: MAE-style masked-token pretraining (literally 100% masking on the altered sequence) before the supervised head. Lines up with the TempoSurfViT recipe already in our toolkit (MAE pretrain + quantile head, paper draft at /writing/temposurfvit-draft/), so very little engineering overhead.

Equivariance: the materials-specific commitment

SO(3) rotational symmetry is the hard problem the bio version doesn’t have to face — amino-acid sequences are already 1-D, mutation contexts are already textual, but a defect site lives in 3-D space whose labelling is gauge-arbitrary. Three families of published options:

Scalar invariants (SOAP, ACE). Pre-computed rotation-invariant descriptors. Cheap, well-tested on small molecules, but throw away geometric structure the network might want.
Equivariant message passing (NequIP, MACE). SO(3) preserved end- to-end, strong on potential-energy surfaces, but the bag pool we want to put on top is permutation-symmetric, not rotation- equivariant — composing the two needs care.
Equivariant transformer (EquiformerV2 / UMA). What the strongest 2026 catalyst baselines use. EquiformerV2-OC22 checkpoints are no longer publicly downloadable from Meta; UMA-s-1p2 is the official fairchem 2.x successor — same lab, multi-task pretrained including OC22, ~2.2 GB checkpoint, public on HuggingFace under facebook/UMA. This is what we actually use.

The cleanest commit for v1: frozen UMA-s-1p2 as the per-instance backbone, applied to a local cluster centered on each site, pooled within the cluster to one $L$ -dim SO(3)-invariant vector before entering the bag. Equivariance is enforced at the per-instance level; the MIL pool acts on already-invariant vectors and inherits invariance trivially. The cost is compute (no MPS acceleration; cloud GPU for training); the win is that we don’t reinvent equivariant representation learning in the aggregator itself.

The aggregator, written out

Step (5) is doing the load-bearing work, and the materials port lives or dies on whether this aggregator generalizes. The operator is the gated attention pooling from Ilse, Tomczak & Welling (ICML 2018). Given a bag of $K$ per-instance vectors $\{h_1, \dots, h_K\}$ with $h_k \in \mathbb{R}^L$ , parameters $V, U \in \mathbb{R}^{D \times L}$ and $w \in \mathbb{R}^D$ , the per-instance weight is

a_k = \frac{\exp\!\left( w^{\top}\!\left[\, \tanh(V h_k) \,\odot\, \sigma(U h_k) \,\right] \right)}{\sum_{j=1}^{K} \exp\!\left( w^{\top}\!\left[\, \tanh(V h_j) \,\odot\, \sigma(U h_j) \,\right] \right)}

and the bag-level embedding is the convex combination

z = \sum_{k=1}^{K} a_k \, h_k

which feeds a standard classifier $\hat{y} = \mathrm{softmax}(W_c z)$ .

Figure 3. What the aggregator actually does. The bars on top are the learned per-item attention weights a_k — and they are themselves the interpretability output (“this item drove the prediction”). The weighted sum z = Σ a_k h_k passes to a standard classifier head.

Two properties matter for the port:

Permutation-invariant. The bag has no canonical order, and the softmax doesn’t impose one. Right symmetry for a unit cell’s inequivalent sites, or for a bag of point defects.
Per-instance importance for free. The $a_k$ values are interpretable weights, plotted directly in the bio papers (“this TCR clone drove the immunotherapy-response prediction,” “this mutation drove the tumor-type call”). Swap TCR for defect-site and the materials-side headline figure is already designed.

The “gated” piece — the $\sigma(U h)$ elementwise product — exists because $\tanh$ alone struggles to produce strongly negative scores; the sigmoid acts as a learned vetoer. CLAM uses a clustering- constrained variant of this same Ilse-style pooling; ATGC uses a multi-head variant; the 2025 LLM and TransMIL replace it with full Transformer self-attention (TransMIL with a Nyström kernel approximation) — different aggregator family, but the permutation- invariant role is identical.

A reality check on novelty

This is not a hidden gem from one lab. The same pipeline — token embed → per-instance backbone → attention aggregation → sample-level head — is the standard pattern in weakly-supervised computational pathology. The matrix below maps how five works (two pathology, three from Sidhom / the Adams lab, plus the materials port this note is sketching) all instantiate the recipe with small variations:

Work	Substrate	Per-instance backbone	Side info	Aggregation	Pretraining
Ilse et al. 2018 (attention-MIL)	— (generic)	any	—	gated attention	—
CLAM (Nat Biomed Eng 2021)	WSI patches	CNN (pretrained)	clinical	attention-MIL	self-supervised
TransMIL (NeurIPS 2021)	WSI patches	Transformer	—	self-attention + MIL	—
DeepTCR (Nat Commun 2021)	amino acids	1-D CNN	V/D/J + HLA	attention-MIL	autoencoder option
ATGC (Nat Biomed Eng 2023)	mutations	Transformer	gene + context	attention-MIL	—
Sidhom 2025 LLM	mutations	dual-attention Transformer	clinical	attention-MIL	MAE (100% mask)
proposed materials port	defect / dopant sites	local-env Transformer	space group, lattice	attention-MIL	MAE-style

What changes across rows is the substrate, the per-instance backbone, and the side-info schema — not the aggregation pattern. So the hook for porting to materials isn’t “Sidhom’s novel methodology”; it’s “materials science hasn’t yet borrowed the weakly-supervised pattern that pathology and immunology already converged on.” Defensible as a paper hook, easy to over-claim.

2. Why the recipe transfers

The three bio substrates this stack has shipped on look very different on the surface but share four structural properties:

The sample is a bag of sparse instances (a few hundred TCRs in a repertoire; a few dozen somatic mutations in a tumor).
Each instance carries both a discrete identity and a short contextual sequence around it.
Labels exist only at the sample level, not the instance level.
Which instances drive the label is itself a scientific question — interpretability is not optional.

Any domain matching this shape is a candidate. Materials science has three sub-domains matching the bag shape (Framings A/B/C below), plus a fourth with the related but different ordered-sequence shape (Framing D — processing routes). Figure 5 makes the biology/materials parallel concrete for the cleanest bag-shaped one.

3. Mapping to materials science

We cover four materials-science ports of the same architectural idea. The first three (A/B/C) are bag-shaped: a sample is represented as an unordered, variable-size collection of physically meaningful instances, and a permutation-invariant attention-MIL aggregator maps instance embeddings to a sample-level prediction. The fourth (D) is sequence-shaped: a sample is represented as an ordered processing history and is modeled with sequence-aware self-attention plus a [CLS] readout. Across framings, the per-instance encoder can remain similar; what changes is the definition of an instance, the definition of a bag, and whether order is physically meaningful.

Figure 4. Bag-shaped materials framings with the same architectural skeleton. Across rows, the meaning of “instance” and “bag” changes: chains in a blend, inequivalent sites in a crystal, or defect neighborhoods in a host. The recipe stays fixed — instance encoder → masked attention-MIL pooling → sample-level prediction head. Framing C is the closest structural analogue to somatic-mutation oncology because sparse local perturbations in a host background are aggregated to predict a sample-level phenotype. Framing D (below) departs from the bag assumption: processing steps are ordered and therefore require sequence-aware self-attention rather than permutation-invariant MIL.

Framing A — polymer / molecule as sequence, blend as weighted bag

Biology	Materials
Amino acid	Monomer / functional group / SMILES atom token
CDR3 sequence	Polymer chain (SMILES, SELFIES, repeat-unit tokens)
V/D/J gene	Polymerization route / catalyst / solvent
Patient repertoire	Blend / composite / copolymer = weighted set of chains, with loading fraction, $M_w$ , dispersity, tacticity, additives
Outcome label	Glass transition $T_g$ , tensile strength, ionic conductivity, density

The bag is weighted, not just unordered. A 90/10 blend is not the same material as a 10/90 blend, and each instance carries covariates (loading fraction, molecular weight, dispersity, tacticity, additive identity, solvent / process metadata) that the chain SMILES alone doesn’t capture.

This framing is crowded if posed as molecular sequence modeling alone. ChemBERTa-2 ran masked-LM and multi-task regression over 77M SMILES; MoLFormer was pretrained on up to ~1.1B molecules from ZINC and PubChem; GP-MoLFormer extends the line into generative molecular modeling and property optimization via pair-tuning. A pure SMILES-BERT is not a paper anymore. The less crowded angle is weakly supervised bag-level learning for blends, composites, and multi-component polymer systems — sample = weighted set of chains/components, label = bulk property — which is exactly the structure the bio version handles.

Framing B — crystal as bag of local environments

Biology	Materials
Amino acid	Element symbol at a Wyckoff site
CDR3 sequence	Local atomic environment around a site
V/D/J gene	Space group + lattice parameters
Patient repertoire	Crystal = bag of inequivalent sites
Outcome label	Formation energy, band gap, bulk modulus, ionic conductivity

The novelty claim has to be careful here, because materials-GNN people have been living in crystal graphs since before half the internet learned to spell “attention.” Attention on crystal representations is crowded prior art: GATGNN combines local attention layers with a global attention layer that weights atom-environment vectors into a crystal representation; CGAT (Sci Adv 2021) represents crystals as graphs and uses multi-head attention over neighboring atoms; CEGANN (npj Comp Mat 2023) is explicitly a crystal edge graph attention neural network; ACGNet and the GCPNet line attach graph-convolutional attention operators; and ComFormer-style crystal graph transformers report SOTA across crystal-property benchmarks. AtomSets is adjacent in spirit (transferable atom-level representations with a less graph-heavy prediction head) without being MIL.

What the MIL framing buys, more carefully stated:

Per-site importance is exposed directly by the sample-level readout, not inferred only from gradients or attention rollouts through a deep message-passing stack. The $a_k$ in the pooling step is a learned instance weight; it should still be validated with occlusion or leave-one-site-out tests rather than treated as a causal explanation by default.
No fixed graph topology required at aggregation. The bag of sites doesn’t need an edge set; this matters for disordered solids, glasses, and high-entropy alloys where “the graph” is ambiguous.
Native variable bag size. Different unit cells have different numbers of inequivalent sites; ragged tensors or padding masks handle it without architectural change.

CIF / Robocrystallographer textual representations give a tokenization on-ramp.

Framing C — defects / dopants as the bag (the cleanest port)

Biology	Materials
Somatic variant	Point defect / dopant atom in a host crystal
Ref → alt	Host atom → substituent atom
Local context	Local coordination shell around the defect
Bag of mutations per tumor	Bag of defects per material sample
Tumor type / drug response	Catalytic activity / conductivity / magnetism

“Somatic mutations in a tumor” and “point defects in a doped oxide” are structurally the same problem: sparse, position-aware, sample-level labels, instance importance matters. The catalysis and battery-cathode communities have exactly this label shape and currently use bespoke per-property regressors.

Label granularity is the load-bearing distinction. If the task is per-defect formation energy or relaxed defect structure, this framing competes directly with established defect-GNN work — defect formation enthalpy predictors from ideal crystal structures, DefiNet for point-defect crystal structures, and the broader defect-informed equivariant-model line. The novelty there is at best incremental. The defensible framing is the other direction: a doped/defective material sample is represented as a variable-size bag of candidate defect neighborhoods, and the label is a sample-level outcome — OER overpotential, Li-ion conductivity, magnetism, carrier concentration, catalytic activity, or measured device-level response — observed only at the bulk. Attention-MIL is much more natural there than a defect-GNN trained on per-defect targets.

Where the real data actually fits. Once you go shopping for an open corpus, the cleanest fit for Framing C turns out to be not point defects in bulk oxides but adsorbate binding sites on catalyst surfaces: OC20-Dense gives roughly 100 candidate binding sites per (catalyst, adsorbate) system as a true bag, and the sample-level question “which site is the active one” matches the bio template almost exactly — same as “which mutation is the driver” in ATGC, just on a different substrate. The methodology and the per-site featurization are unchanged; only the substrate moves, from bulk defect sites to surface adsorption sites. The strongest existing competitor on the surface high-entropy catalyst version is the attention-enhanced EquiformerV2 + Post-Att Adapter from Sci Adv 2025 (“Decoding active sites in high-entropy catalysts”), and that’s exactly the bake-off target picked up in §4.

Figure 5. Framing C, made visible. Two domains, one architecture: the sample is a bag of position-tagged tokens (mutations on the left, defects on the right), the model attends across the bag, and the attention weights are themselves the per-instance importance map that scientists actually want to read.

Framing D — alloy processing route as ordered sequence

Biology	Materials
Amino acid in a sentence	Processing step (anneal, quench, roll, age, HIP, …)
Sentence	Full processing route applied to one alloy
Per-token side info	Step parameters (T, time, strain, atmosphere)
Composition embedding	Alloy composition vector (the side-info “metadata”)
Outcome label	Yield strength, hardness, fatigue life, fracture toughness

A different shape from A/B/C: the sample is an ordered sequence of processing steps, not a permutation-invariant bag. Step order matters mechanically — anneal-then-quench is not the same alloy as quench-then-anneal — so the aggregator can’t be permutation-invariant. The natural shape is sequence-aware self-attention with a [CLS] token; the interpretability output is the [CLS]-to-step attention map (“which processing step set the final yield strength?”), exactly the DeepTCR / 2025-LLM line rather than the bag-of-mutations line.

Prior art on processing-route-as-sentence is thin. Most alloy-property models take final composition + microstructure descriptors and ignore the processing path entirely. CrabNet is composition-only; PolyMicros covers polymer microstructure, not metallurgy; AlloyGPT and the npj Computational Materials 2025 HEA transformer attend over composition tokens, not over processing-step tokens. The “processing-route-as-sentence with per-step attention as the interpretability output” angle is open.

Where the real data actually fits. The strongest open corpus is FatigueData-AM2022 (>15k AM fatigue points with structured post-processing fields — HIP / solution / age — JSON-native, CC licensed). The immediate target therefore moves from “wrought-alloy heat treatment → yield strength” to additive-manufacturing post-processing → fatigue life. Sequence depth is shallower than NIMS CDS+FDS (2–4 steps vs 5–7), but the AM target is open, the labels are real, and the per-step interpretability question (“which post-processing step set the fatigue life”) is one the AM community is actively asking. NIMS CDS+FDS is the deeper-schedule follow-on if the first paper lands.

Figure 6. Framing D shape. Unlike Figures 4–5 (bag-shaped framings), the sample is an ordered sequence of processing steps; the aggregator is sequence-aware self-attention with a [CLS] token, not permutation-invariant MIL. The [CLS]-to-step attention map is the “which processing step determined the property” interpretability output. A synthetic toy with an order-sensitive decisive rule (an anneal counts only if immediately followed by a quench) lands the sequence model at val R² 0.954 vs 0.099 for a permutation-invariant gated-MIL baseline on the same tokens — the 9× gap is the reason Framing D earns a separate aggregator.

When “instance” is harder: surfaces, amorphous, high-entropy

The four framings above pick the cleanest cases. Three messier ones a catalysis reviewer will ask about — and what the engineering answer looks like.

Surface catalysis. OER and most heterogeneous catalysis live on surfaces (steps, kinks, terraces), not in bulk unit cells. The bag is the set of surface sites on a slab — typically 10–50 sites per slab — each represented by a local-coordination instance feature. This is what the Sci Adv 2025 paper actually operates on (CoOOH slabs with surface dopants), and Framing C maps onto it by reading “site” as “surface site” instead of “bulk-defect site.” No architecture change.

Amorphous and disordered systems. Glasses, gels, amorphous oxide catalysts have no Wyckoff labels and no canonical graph. The bag becomes atoms sampled within an r-cutoff (e.g., everything within 8 Å of a candidate active region); each instance is a coordination- shell vector. This is where the §1 equivariance commit — frozen UMA-s-1p2 on local clusters — earns its keep: there’s no crystal symmetry to fall back on, so SO(3) invariance has to be carried at the per-instance level.

High-entropy compositions. When 4+ elements are mixed at random on the same sublattice (HE oxides, high-entropy alloys), every site is a “dopant” in some sense — the host-vs-defect distinction breaks down. The bag is just every site; the per-instance vocabulary grows with the element count. The Sci Adv 2025 HE-CoOOH set is exactly this case — and the §4 bake-off inherits it as the downstream task.

The common thread is that the recipe doesn’t change; what changes is the definition of the instance and the size of the bag. Dilute doping → 5–20 point-defect instances; surface HE catalysts → 30–80 surface-site instances with many-element local chemistries. The Ilse-Tomczak aggregator doesn’t care. The per-instance backbone does, and is where the engineering risk lives.

Positioning, in one line. The contribution is not another materials transformer; it is a weakly supervised, instance-saliency framework that ports attention-MIL from mutation-level biomedical prediction to materials samples whose measured properties arise from unordered sets of chains, sites, or defect neighborhoods. The attention readout is a learned instance weight to be validated by ablation, not a faithful causal explanation by default — and that framing is the one that survives reviewers armed with CGCNN, GATGNN, CEGANN, ACGNet, ComFormer, AtomSets, and the rest of the acronym factory.

4. Where this becomes a paper

Strongest current bet: Framing C applied to surface-HE electrocatalysts — the intersection of “defects as the bag” and the high-entropy case from the previous subsection, which is exactly what the Sci Adv 2025 dataset operates on (slab-surface sites in HE- CoOOH). Solid electrolytes (Li-ion conductivity, Framing C with dilute doping) are the natural second downstream task if HE catalysts don’t pan out.

Pretrain via the released UMA-s-1p2 checkpoint (facebook/UMA on HuggingFace), used as the frozen per-instance backbone — no from-scratch pretraining; see §1 equivariance commit. UMA is the fairchem 2.x successor to EquiformerV2 and is the substitute we use because EquiformerV2-OC22 checkpoints are no longer publicly downloadable.
Per-site “instance” backbone = UMA on a local cluster around each surface or defect site, pooled to one $L$ -dim vector per instance.
Sample-level MIL head with Ilse-Tomczak gated attention to predict the property of interest (overpotential, Li-ion conductivity).
Headline figure: “the model tells you which surface site carries the activity,” matching the per-instance importance plots that are standard in the bio version.

Composes with the TempoSurfViT training recipe (MAE-style pretrain + quantile head), so the engineering overhead is small and reuses our trainer.

What we’d actually be beating

Concrete competitive landscape so we don’t oversell. Recent attention-enabled crystal models that share part of this space:

Model	What it does	What the MIL/MAE framing adds
CGCNN (2017)	Message passing on crystal graph	No site-importance output; fixed graph topology
CGAT — Crystal Graph Attention (Sci Adv 2021)	Edge attention on CGCNN backbone	Attention is on edges, not on sites-as-instances
ACGNet	Interpretable CGNN for oxidation potential	Single-task; no MAE pretrain; no bag framing
CEGANN (npj Comp Mat 2023)	Edge-attention for environment classification	Classifier, not regressor; not a foundation-model framing
GCPNet	Crystal-pattern graph + GCAO attention	Same edge-attention family
GP-MoLFormer (IBM, 1.1B SMILES)	Transformer + pair-tuning for property opt	Molecule-level, not bag-of-instances
EquiformerV2 + Post-Att Adapter (Sci Adv 2025, high-entropy catalysts)	SO(3)-equivariant graph transformer; per-site overpotential prediction	Attention is on the equivariant graph, not on bag of sites; per-site importance is extracted post-hoc, not the pool’s primary output
Crystalformer (ICLR 2024)	Transformer with “infinitely connected attention” formulated as neural potential summation; SOTA on Materials Project + JARVIS-DFT with ~29% of comparable Transformer params	Attention is between atoms in a fully-connected periodic structure, not a bag pool; no per-site importance as primary output
DA-CGCNN (AIP Advances 2024)	CGCNN backbone with dual attention (channel + self) and cross-property transfer learning	Attention is on graph features, not on sites-as-instances; benchmarked on Materials Project (formation energy, bandgap, etc.), not on catalyst overpotential
Site-Net (Digital Discovery 2023)	Transformer with bond-feature (pairwise) attention on atoms in a real-space supercell; mean-pool across atom embeddings for MatBench regression	Pooling is unweighted mean (no MIL head, no per-instance $a_k$ output); attention is over atom pairs, not sites-as-instances; per-atom importance only readable post-hoc from pair-attention weights

The honest differentiator is not “we use attention on materials” (taken). It is the combination: bag-of-instances framing + MAE pretrain on the instance vocabulary + Ilse-Tomczak gated MIL with per-instance importance as the output, applied to settings where the sample is naturally a bag (defects, blends, disordered sites) rather than a fixed graph. Pathology and immunology have shown this combination converges and yields scientifically usable importance maps; materials hasn’t tested it at scale.

Interpretability — against what materials already has

Per-site importance isn’t a tool materials scientists are missing. The field has Sabatier analysis and electrocatalyst volcano plots (Nørskov et al., 2004 onwards), microkinetic decomposition of activation energies, DFT-computed adsorption-energy contributions per surface site, and — for high-entropy catalysts specifically — recent SHAP-on-Equiformer and integrated-gradients work that extracts per-site attributions post-hoc. Pitching $a_k$ maps as “novel interpretability” against that landscape is a losing pitch.

The honest claim is sharper: the $a_k$ map should recover the volcano-derived ranking and Sabatier-identified active sites — not as a post-hoc attribution that needs separate calibration, but as the pool’s primary output that the model itself was trained to optimize. The bake-off’s second primary metric is exactly this: agreement between learned $a_k$ and Sabatier-curated / DFT-computed per-site activity contributions on a curated set. A win there is “an end-to-end model that agrees with the physics-based attribution methods materials scientists already trust” — which is publishable because it removes a step (the post-hoc SHAP/IG/Sabatier computation), not because it provides interpretability that didn’t exist.

First-pass shipped 2026-05-19 (run_persite_eval.py, seed set in data/persite_eval/curated_active_sites.py). Six literature-curated slabs (Pt/Cu/Ni/Pd/Ag/Au × H/CO/O on (111)/(100)), ground truth = adsorbate atoms ∪ top-layer metal atoms within bond distance, MIL trained on cache/adsorption_v2.pt across 5 seeds. Headline (updated 2026-05-19 with the 12-entry extension):

Metric	Trained MIL	Dirichlet null
top-1 hit rate	100.0% [100, 100]	7.4% [6.4, 8.5]
top-3 hit rate	100.0% [100, 100]	32.2% [30.3, 34.1]
top-3 recall	70.3% [63.3, 77.0]	9.8% [9.2, 10.4]
top-5 recall	80.2% [72.5, 87.5]	15.4% [14.7, 16.1]
attn-conc. ratio	2.32× [2.03, 2.63]	1.02× [1.00, 1.04]

Decision rule from scope_persite_eval.md: ACCEPTABLE — matches the DeepTCR/ATGC bio baselines. Top-5 recall (80.2%) is at the STRONG band threshold. The 6 → 12 entry extension confirmed cross- element generalization (three off-train elements: Pd, Rh, Ir all recover at in-distribution-equivalent rates) and tightened the top-3 recall CI by ~27% (19pp → 14pp) while preserving the central tendency. Caveats: the active-site rule is qualitative (Path 3 DFT is what gives a true Spearman ρ); HE-CoOOH entries — same chemistry as the bake-off competitor — are not yet in the curated set. Path forward laid out in materials-nlp/persite_eval_NEXT.md.

Update 2026-05-18 — the “primary output, not post-hoc” framing softens under a comparator panel. Ran integrated gradients, input × gradient, and vanilla saliency on the same trained FFN MIL (scope_attribution_comparators.py, 5 seeds, dopant_indices as ground truth):

method	top-1 hit	top-3 hit	attn. concentration
MIL $a_k$	0.880 ± 0.05	0.910 ± 0.08	6.93× ± 2.63
Saliency $	\nabla y	$	0.870 ± 0.09
Integrated Gradients	0.860 ± 0.06	0.890 ± 0.07	3.43× ± 1.50
Input × Gradient	0.790 ± 0.10	0.880 ± 0.08	3.90× ± 0.73

The result is wrong-shaped for the original framing: $a_k$ is statistically tied with Saliency on top-1 hit (margin 0.010, within noise) and literally tied on top-3 (both 0.910). The gradient methods are equally faithful at picking dopant atoms. What $a_k$ uniquely wins is sharpness — its softmax produces a 6.93× concentration on dopants vs Saliency 4.42×, a 1.6× sharper map.

So the publishable claim isn’t “uniquely faithful” — it’s “matches integrated gradients and saliency on top-k dopant recall while producing a 1.6× sharper attribution map at zero post-hoc computational cost.” That’s a real but more modest contribution: sharpness matters for visualization (peaked headline figures) and for downstream use as feature weights (BO acquisition, surrogate weighting) where peakedness improves selection. It does not claim a faithfulness advantage that the comparator panel says isn’t there. Writeup at materials-nlp/e3_attribution_result.md.

But — the AOPC follow-up partially rescues a faithfulness advantage, with a wrinkle. Top-k recall measures agreement with ground truth; it doesn’t measure whether the attribution is causally faithful (Jain & Wallace 2019’s exact concern). The AOPC test (Liu et al. ICML 2022) ranks atoms by attribution score, masks the top-k by zeroing their features, and measures how much the prediction moves:

method	AOPC AUC (k=1..12)
MIL $a_k$	0.954 ± 0.335
Integrated Gradients	0.834 ± 0.347
Saliency $	\nabla y
Input × Gradient	0.712 ± 0.361
random baseline	0.594 ± 0.210

Saliency, which tied $a_k$ on top-1 hit, drops to mid-pack on AOPC. This is the Jain-Wallace pattern: it identifies dopant atoms by some non-causal signal (probably gradient magnitude correlating with atom-norm), passing the recall test without being causally predictive. $a_k$ wins AOPC by +0.12 over IG (the next-best post-hoc method) on the mean; per-seed, MIL wins 2/5, ties 2/5, IG wins 1/5.

So the combined publishable claim becomes:

“MIL $a_k$ ties IG and Saliency on top-k dopant recall but produces a 1.6× sharper attribution map and is +14% more causally faithful on AOPC. Saliency’s top-k tie is misleading: it drops to mid-pack on AOPC, indicating its top-k success comes from a non-causal signal. $a_k$ is the most causally faithful attribution on this backbone at zero post-hoc cost.”

This is stronger than what the top-k panel alone suggested, weaker than the original “first-class output, uniquely faithful” framing, and grounded in two complementary controlled tests rather than rhetoric. Writeup at materials-nlp/e3_aopc_result.md.

Where the wedge actually lives — localized vs. distributed signal

A small set of experiments I ran on top of frozen UMA-s-1p2 (see materials-nlp/baselines.py, materials-nlp/baselines_oc22.py, materials-nlp/attention_oc22.py) tightens the §4 pitch into something predictive rather than just hopeful.

Same per-site features (frozen UMA L=0 channels, $\dim = 128$ ), same 80/20 split, same training budget. Three aggregators compared: mean-pool + MLP, max-pool + MLP, gated attention-MIL + MLP.

Task	Signal type	mean-pool MAE	MIL MAE	MIL attention concentration	Winner
Adsorption energy (Cu/Au/Pt/Ag/Ni × H/H₂/OH/CO, 100 slabs)	Localized (1–2 adsorbate atoms in 13–14 total)	0.55 eV	0.28 eV	2.68× uniform	MIL by 2×
OC22 per-atom relaxed energy (real DFT labels on 100 oxide slabs)	Distributed (uniform oxide chemistry across 30–180 atoms)	0.22 eV/atom	0.25 eV/atom	1.7× uniform	mean-pool by 13%

External baseline cross-check (OC22 task only). Querying the frozen UMA-s-1p2 calculator directly on each initial structure — the modern Meta substitute for the unavailable EquiformerV2-OC22 + Post-Att Adapter — and reporting predicted energy divided by atom count after subtracting the train-set mean reference offset gives val MAE 0.225 eV/atom on the same 20-sample val_id subset. UMA-direct ≈ mean-pool ≈ Ours-MIL on this task (0.22 / 0.22 / 0.27), with correlations all in 0.97–0.98. The fact that the modern Meta backbone also fails to beat mean-pool by a meaningful margin on OC22 IS2RE-per-atom is the distributed-signal regime confirming itself: no aggregator wins because the signal genuinely is spread across the bag.

Figure 7. The diagnostic, visualized — two of the three regimes. Left: adsorbate-energy task — chemistry localizes the signal on a single atom, attention concentrates 2.68× over uniform, MIL beats mean-pool by 2×. Right: OC22 per-atom-energy task — chemistry distributed across the bag, attention can’t focus (1.7×, near uniform entropy), mean-pool wins by 13%. The dashed threshold at ~2× concentration is the operational dividing line we’re proposing as the diagnostic. Not shown: the supervised-oracle upper bound on the localized panel (val MAE 0.145 eV) — discussed in the “MIL is not the accuracy upper bound” paragraph below. The qualitative trade-off is folklore in sound-event detection and pathology-MIL (Wang 2018, Ilse 2018); the threshold and the materials-side evidence are the additive.

One more dimension: how to choose between MIL and a supervised oracle. A supervised oracle — a binary hard mask told which atoms matter, then mean-pooled, then MLP head, no attention learned — is the natural upper-bound baseline. On the v2 adsorbate task above it crushes MIL: val MAE 0.145 eV vs MIL’s 0.28, nearly 2× better. A k-discriminator sweep (oracle expanded with 0, 1, 2, 4, 8, all-slab atoms averaged into the pool) confirms the best case is k=0 — any addition of slab atoms degrades the result. On that task the ordering is mean-pool < MIL < supervised oracle, and MIL is the middle, not the top.

But re-run the same comparison on a harder task — programmatically- doped HE-CoOOH slabs (Sci Adv 2025’s substrate, multiple dopants per surface, signal not localized to one atom but spread over each dopant plus its immediate slab context) — and the ordering flips:

task	mean-pool	MIL	supervised oracle	who wins
v2 adsorbate (1 atom carries the signal)	0.55 eV	0.28 eV	0.145 eV	oracle, by ~2×
HE-CoOOH (dopants + extended context)	1.11 eV	0.86 eV	1.70 eV	MIL, by ~2× (6 of 6 random seeds; paired bootstrap 95% CI on MIL−oracle = [−1.02, −0.34], n=2000)

The two tasks differ in how the chemistry localizes. When the signal is concentrated on one or two atoms you can name in advance, the oracle’s hard mask is optimal — MIL spends capacity rediscovering something supervision already knows. When the signal is extended- localized — a few dopants whose contribution depends on the neighborhood they sit in — the oracle’s hard mask discards the neighborhood and MIL’s soft weights pick it back up.

The wedge therefore has three regimes, not two:

Distributed signal (every atom contributes similarly) → mean-pool wins. MIL adds noise relative to averaging.
Extended-localized signal (a few atoms matter, but their context matters too) → MIL wins. The soft weights capture the context that a binary supervised mask discards.
Fully localized signal (one or two named atoms carry everything) → supervised oracle wins. MIL is a credible second but pays an accuracy tax for not being told.

MIL’s value proposition is therefore not “best accuracy universally” but “the only pool that wins on extended-localized signals AND produces an unsupervised per-site importance map.” Mean-pool wins on the distributed end but throws away localization; oracles win on the fully-localized end but require supervision MIL doesn’t ask for. The extended-localized regime is where both other approaches lose information and MIL’s wedge is real.

The qualitative version of this trade-off — attention pooling wins on localized signals, mean-pooling wins on distributed ones — is not new. Wang, Li, and Metze (“A Comparison of Five MIL Pooling Functions for Sound Event Detection with Weak Labeling,” 2018/2019) characterize exactly this across five pooling functions in audio, and the original Ilse–Tomczak–Welling 2018 attention-MIL paper introduces the architecture under a “few key instances” witness-rate framing that implicitly assumes localization. FocusMIL (2024) adds the counter- point that max-pooling beats attention under spurious-correlation regimes. What the table above contributes is the quantitative dividing line: mean attention concentration on chemistry-relevant atoms of about 2× uniform is the operational threshold on materials property prediction — a regime that neither prior empirical study tested. The framing is folklore; the threshold and the materials-side evidence are the additive.

Given that framing, the relationship in the table reads cleanly: MIL’s advantage over simpler pooling scales with how peaked its trained attention is allowed to get. When the underlying chemistry concentrates on a few sites (adsorbate atoms on a metal surface), the attention finds them — 2.68× the uniform- baseline mass on the adsorbate, with no supervision about which atoms those were — and the aggregator beats mean-pool by a 2× factor. When the chemistry is uniform across the bag (bulk-ish oxide slabs predicted on a per-atom basis), attention can’t learn a useful focus (1.7×, barely above uniform; entropy 89% of uniform), and adds noise relative to averaging.

This means the §4 differentiator is not “MIL beats mean-pool universally” — that’s empirically false on distributed-signal tasks. The defensible pitch is MIL is the right tool when the underlying physics localizes the signal, and the diagnostic is the trained attention concentration itself. Concretely: if a trained model’s mean attention concentration on chemically-relevant atoms exceeds ~2× uniform, MIL beats the pooling baselines and produces a per-site importance map that’s worth reading. Below that threshold, mean-pool is the better tool and the interpretability claim collapses.

The Sci Adv 2025 HE-catalyst OER overpotential task that §4’s bake-off targets is in the high-concentration regime by physics: specific dopant sites drive activity, Sabatier analysis already tells us qualitatively which ones, and a well-trained model should recover that focus. Conversely, OC22 per-atom-energy regression is the wrong testbed — it’s a distributed-signal task and the MIL framing should not be expected to beat mean-pool there. We just confirmed that empirically; the bake-off should pick its tasks accordingly.

Controlled validation: synthetic continuum and N-scaling

The wedge above rests on three task-level points (adsorbate v2, OC22 per-atom, HE-CoOOH). To validate the diagnostic non-circularly we ran two controlled experiments on 2026-05-18 — one on the signal axis and one on the data-size axis. Both came back with results that tightened the framing rather than killed it, but in unexpected ways that change which claim §4 leads with.

Synthetic locality continuum (Wang-Li-Metze 2018 in audio, lifted to materials shape). Hold everything else fixed and vary the fraction of signal-bearing instances per bag in 1; train mean-pool, max-pool, gated MIL, and an oracle hard-mask pool for 200 epochs, 5 seeds per cell.

locality fraction	mean MAE	MIL MAE	oracle MAE	MIL conc.	mean/MIL ratio
0.05	0.370	0.026	0.002	19.5×	14.5×
0.20	0.154	0.036	0.001	4.1×	4.2×
0.40	0.088	0.024	0.002	2.2×	3.6×
0.60	0.051	0.022	0.002	1.5×	2.3×
1.00	0.002	0.001	0.002	1.0×	1.9×

(MIL bold; 5-seed mean.) Three findings change the framing:

Gated MIL beats mean-pool at every locality, with the gap shrinking monotonically from 14.5× at lf=0.05 to 1.9× at lf=1.0. No crossover. The “MIL crosses mean at 2× concentration” framing was the wrong question; the right framing is gap magnitude vs locality.
MIL attention identifies signal instances with 0.98–1.00 top-1 hit rate across all localities. Interpretability holds even at lf=0.05 where 1 of 20 bag members carries signal and the hit-rate question is structurally hard.
Oracle dominates everywhere on synthetic — which contradicts HE-CoOOH where MIL beat oracle 0.86 vs 1.58 eV. The contradiction is exactly what defines the “extended-localized” regime: on synthetic the binary instance mask captures the full signal; on materials the signal extends into the neighborhood of dopant sites, which the binary mask discards and MIL’s soft weights recover.

HE-CoOOH N-scaling sweep (within the existing 100-structure Path C cache). For N in 100, subsample uniformly, 5 seeds per cell, retrain.

N	mean-pool MAE	MIL MAE	oracle MAE	MIL conc.	mean/MIL ratio
10	2.99	2.90	4.24	3.5×	1.03×
20	1.59	1.81	2.84	2.8×	0.88× ←
40	1.34	1.47	1.87	3.4×	0.91× ←
60	1.33	1.11	1.87	2.9×	1.20×
80	1.31	0.86	1.54	3.2×	1.52×
100	1.37	0.87	1.58	3.4×	1.57×

(← = mean-pool ties or beats MIL.) Three more findings:

The MIL/mean-pool crossover is in N, not in concentration. Below N≈40, MIL ties or loses to mean-pool despite attention concentration already being ≥2.8×. The “≥2× concentration → MIL wins” diagnostic is necessary but not sufficient — sufficient is concentration ≥2× and N ≥ ~40 training bags.
Within-cache plateau by N=80 (drift to N=100 is 1% for MIL, 5% for mean-pool). The current §4 result at N=100 is not an artifact of being on the climbing portion of an N-curve; it reflects the asymptote for this generation of Path C data. Whether the asymptote extends to Sci Adv 2025 scale (4,822) remains open and requires generating more structures.
Interpretability arrives faster than accuracy in N. MIL attention concentration on dopant atoms is 2.8–3.5× across the entire N range, including the small-N rows where MIL loses on MAE. The per-site importance recall claim is more robust to data scarcity than the MAE-matching claim — relevant for which primary metric the bake-off leads with under tight data budgets.

Cumulative reframing of the §4 wedge. The single-table wedge above decomposes into a three-part story:

Signal-side. On synthetic, MIL universally beats mean-pool; gap shrinks with locality but never reverses.
Data-side. MIL beats mean-pool only when N ≥ ~40 bags, regardless of attention concentration. Sufficient condition for the diagnostic to predict MIL > mean-pool requires both.
Oracle-vs-MIL contrast. Oracle dominates on synthetic but loses on HE-CoOOH. The contrast operationally defines the extended-localized regime, rather than naming it phenomenologically.

This is more rigorous than the original three-regime taxonomy (distributed / extended-localized / fully-localized) because each regime now sits on a controlled axis — locality, data size, or information truncation by the binary mask — instead of being defined by which dataset happened to land where. Materials side of the bake-off carries the full three-part story; the synthetic and N-scaling experiments together cost half a day of dev-box CPU. Writeups at materials-nlp/e2_locality_result.md and materials-nlp/e5_scaling_result.md.

One more controlled axis: bag size. The original synthetic above ran at bag_size=20, while HE-CoOOH slabs have 48 atoms. A natural reviewer question — and one we wanted to answer before committing to the three-part framing — is how much of the synthetic-vs-materials gap-ratio differential (4.2× synthetic at lf=0.20 vs 1.6× materials on HE-CoOOH) is explained by bag size alone. Re-ran the same grid with bag_size=48 (scope_synthetic_locality_bag48.py); side-by-side:

lf	bag20 ratio	bag48 ratio	Δ
0.05	14.51×	3.49×	−11.02
0.10	4.76×	3.83×	−0.93
0.20	4.23×	3.59×	−0.64
0.40	3.59×	2.68×	−0.91
0.60	2.28×	1.99×	−0.30
1.00	1.90×	1.45×	−0.45

Bag size matters most in the extreme low-locality regime — at lf=0.05 the ratio collapses from 14.5× to 3.5× because mean-pool now averages over ~2 signal instances instead of 1. In the middle of the locality range (where materials operate) the bag factor moves the ratio by less than 1.0, so the wedge framing is largely bag-size robust where it matters.

The follow-up gives the §4 wedge a quantitative decomposition of the materials MIL/mean-pool advantage:

component	factor in MIL/mean ratio
pure locality (lf=0.05, bag=20, IID features)	14.5×
bag-size correction (bag=20 → bag=48)	÷ 4.1 → 3.5×
feature-correlation correction (synthetic → materials)	÷ 2.2 → 1.6×

Each factor is empirically grounded in a controlled run. The §4 paper can now claim: “the MIL/mean advantage of 1.6× on HE-CoOOH is the product of a ~3.5× locality factor (signal sits on ~4% of atoms) divided by a ~2.2× feature-correlation factor (shared coordination information lets mean-pool partially recover the signal).” That explains why the gap is 1.6× and not 14.5×, which is what a reviewer would otherwise raise. Writeup at materials-nlp/e2_bag48_result.md.

Fourth controlled axis — sharpness ceiling. A reviewer would also ask: if the wedge framing prizes attention concentration, why not use a more expressive pool that produces even sharper attention? Ran Set Transformer’s PMA pool (Lee et al. ICML 2019, multi-head attention from a learnable seed query) against Ilse-2018 gated MIL on the same backbone and data:

pool	val MAE	top-1 dopant hit	attention concentration
Ilse-2018 gated MIL	0.75 ± 0.12	0.88 ± 0.05	3.69× ± 1.66
PMA (1 seed, 4 heads)	1.58 ± 0.51	0.81 ± 0.11	9.78× ± 3.35
PMA (4 seeds, 4 heads)	1.27 ± 0.31	0.84 ± 0.07	5.31× ± 0.79

The finding is paradoxical and tightens the wedge framing again: PMA produces 2.6× higher attention concentration but 2.1× worse MAE. Sharpness alone is not the right diagnostic — over-concentration is overfitting. PMA peaks on 1–2 atoms and discards the neighborhood context that the §4 “extended-localized” regime requires (the same mechanism that makes oracle’s binary mask lose on HE-CoOOH).

So the wedge framing now has a third necessary condition:

condition	found in
attention concentration ≥ 2×	original §4 wedge
training set N ≥ ~40 bags	E5 (N-scaling)
attention concentration ≤ ~5×	E4 (PMA pool, this finding)

The operational sweet spot is a sharpness band: 2× ≲ c ≲ 5×. Ilse-2018 gated MIL produces concentrations 2.7× – 3.7× depending on substrate, smack in the middle of the band. The simpler 2018 pool wins not because PMA is broken in some way, but because Ilse-2018’s softmax-over-gated-linears has an implicit sharpness regularizer that PMA’s multi-head attention with learnable seeds doesn’t have. In small-data regimes (N=80 train bags), that regularizer is load-bearing. PMA might catch up at Sci Adv scale (N=4,822); within the current cache it cannot. Writeup at materials-nlp/e4_pma_result.md.

Fifth controlled axis — the extended-localized mechanism itself. The original locality synthetic above contradicted HE-CoOOH: oracle dominated synthetic but lost on materials. I hypothesized that’s because the materials signal leaks from the dopant into its coordination shell, which the binary instance mask discards. To test it under control, ran a third synthetic (scope_spread_signal.py) with a tunable spread decay — each core signal instance puts weight 1 on itself, decay**r on neighbors at distance r — and held locality fixed at lf=0.10:

spread	mean	gated MIL	oracle (core)	oracle (core + ±2)
0.00	0.006	0.007	0.006	0.014
0.10	0.010	0.010	0.010	0.024
0.25	0.016	0.015	0.023	0.039
0.50	0.018	0.015	0.051	0.052
0.75	0.019	0.014	0.087	0.072
1.00	0.019	0.014	0.123	0.102

The transition is sharp and the mechanism is now grounded. At spread=0 (pure localized) oracle wins, matching original E2. At spread=0.25 oracle starts losing — by spread=1.0 gated MIL beats oracle by 9×. Even the “extended” oracle that includes core + ±2 neighbors loses to gated MIL at moderate spread: including the right set of atoms isn’t enough when the contribution decays with distance, because a binary mask cannot represent weights.

The HE-CoOOH 1.8× MIL/oracle gap corresponds to synthetic spread ≈ 0.3 — physically plausible for dopant-induced perturbations propagating one to two coordination shells in transition-metal oxides. So the “extended-localized” regime is no longer a phenomenological label; it’s a mechanism with a tunable synthetic parameter, a measured crossover threshold (~0.2), and a materials operating point (~0.3) consistent with the physical intuition. Writeup at materials-nlp/e2_spread_result.md.

Sixth controlled axis — dopant density. Re-ran the same FFN MIL / mean-pool / oracle comparison on the 3-element doping cache (he_coooh_3element.pt, N=100, 3 dopants per slab instead of 2 from the same 9-element pool). The result is the most inconvenient finding of the session:

pool	2-elem MAE	3-elem MAE	direction
mean-pool	1.27 ± 0.17	0.57 ± 0.09	mean-pool got 55% better
gated MIL	0.75 ± 0.12	0.64 ± 0.14	MIL got 14% better
oracle	1.50 ± 0.47	1.88 ± 0.15	oracle got worse

So at 3-element doping, mean-pool beats gated MIL (0.57 vs 0.64; mean/MIL = 0.88×, flipped from 1.69× at 2-element). The MIL-vs-oracle gap intensifies (2.0× → 2.9×); top-1 dopant hit improves (0.88 → 0.98); attention concentration stays in the operational band (3.7× → 2.9×).

The §4.2 “MIL beats mean-pool” claim turns out to be regime- specific to low dopant density. Mean-pool benefits dramatically from more signal-bearing atoms (3/48 ≈ 6.25% locality vs 2/48 ≈ 4.2%); MIL was already extracting most of the signal at 2-element. The mean/MIL ratio flips around dopant density ~5% per bag.

The corrected wedge has two MIL advantages on different axes:

MIL vs mean-pool: density-bounded (works at ≤2/48 dopants, fails at ≥3/48).
MIL vs oracle: density-amplified (2.0× at 2-element, 2.9× at 3-element).

The interpretability claims (top-1 hit, attention concentration, sharpness) survive intact across both density regimes. The accuracy claim narrows. Sci Adv 2025’s HE-CoOOH has 2-4-element compositions; a faithful reproduction needs density-stratified reporting. Writeup at materials-nlp/e8_3element_result.md.

Seventh controlled axis — cross-density transfer is catastrophic. E8 measured within-density behavior on each cache separately. E9 asks the deployment-relevant question: can a model trained on one density regime generalize to the other? Trained FFN MIL + mean-pool + oracle on each cache, evaluated on the other:

cell	mean	MIL	oracle	mean/MIL
2→2 (within)	0.97 ± 0.01	0.50 ± 0.06	0.61 ± 0.04	1.95× (MIL wins)
3→3 (within)	0.54 ± 0.12	0.49 ± 0.16	0.82 ± 0.06	1.09× (MIL wins narrowly)
2→3 (transfer)	5.21 ± 1.65	7.01 ± 2.34	18.68 ± 0.76	0.74× (mean wins)
3→2 (transfer)	3.00 ± 0.36	10.10 ± 1.63	8.37 ± 0.38	0.30× (mean wins big)

Cross-density transfer is catastrophic across every pool — MAE jumps from sub-eV (within) to 3-19 eV (across), a 5-30× degradation depending on pool. The MIL/mean ratio inverts: mean-pool wins both transfer directions, by 1.4× and 3.3×. Mean-pool is the most density-robust (5.5× degradation vs MIL’s 17× and oracle’s 20×); simpler pools with fewer parameters extract more density-invariant signal.

So the §4 accuracy claim is bounded twice over — first by regime (E8: works at 2-elem, fails at 3-elem when each is trained independently), and then by training distribution (E9: even within a regime, MIL trained on a different density is catastrophically worse than mean-pool). The wedge is a within-distribution claim at a specific density. The interpretability claims (sharpness, AOPC, top-k recall) measure properties of the attention map and are the natural candidates to survive the transfer collapse — but that’s a hypothesis E9 doesn’t test directly. The natural conclusion:

“Attention-MIL produces interpretable per-site importance maps that are robust to dopant density. Its bag-level accuracy advantage over mean-pool is regime-specific and training- distribution-specific. The paper’s primary contribution is most reliably an interpretability contribution; the accuracy contribution is a benchmark in a specific regime that does not transfer.”

Writeup at materials-nlp/e9_density_transfer_result.md.

Eighth controlled axis — keystone: the attention map survives cross-density transfer. E9 only measured bag-level MAE on the transfer cells. The natural follow-up: does the interpretability output also collapse, or does it survive? Trained FFN MIL on each density and measured top-k dopant recall + attention concentration on the held-out other density:

cell	MAE	top-1 hit	top-3 hit	concentration
2→2 (within)	0.50 ± 0.06	0.854 ± 0.044	0.876 ± 0.039	3.97×
3→3 (within)	0.49 ± 0.16	0.974 ± 0.033	0.986 ± 0.028	2.73×
2→3 (transfer)	7.01 ± 2.34	0.960 ± 0.055	0.962 ± 0.056	3.08×
3→2 (transfer)	10.10 ± 1.63	0.860 ± 0.065	0.892 ± 0.056	2.88×

Net: bag-level MAE worsens 17× under transfer; top-1 dopant hit changes by +0.004 (essentially identical); attention concentration stays at ~3× in all four cells, well above the 2× operational floor.

The 2→3 transfer is the cleanest demonstration: a model trained only on 2-element bags achieves top-1 dopant hit 0.96 on 3-element bags — higher than its within-distribution top-1 of 0.854 — while bag MAE collapses from 0.50 to 7.01 eV. The attention mechanism learns a task-generic skill (“find dopant atoms”) that transfers; the head learns to map the pooled bag-vector to a scalar prediction, which depends on the per-bag density distribution and does not transfer.

This is the keystone result for the §4 messaging shift that came out of E3 + E8 + E9. The paper’s primary contribution is now empirically grounded:

“Attention-MIL produces interpretable per-site importance maps that are density-invariant and training-distribution-robust: top-1 dopant recall and attention concentration stay within ±0.5% of within-distribution performance under cross-density transfer, even when bag-level MAE collapses by 17×. The accuracy advantage is regime-specific; the interpretability advantage is regime-invariant. The paper’s primary contribution is the interpretability claim, which generalizes; the accuracy contribution is a within-distribution benchmark.”

Writeup at materials-nlp/e10_interp_transfer_result.md.

Caveat on E10: AOPC does not cleanly survive transfer (scope_aopc_transfer.py, E12). Re-ran the §4.6i AOPC test on the same four train→eval cells. The transfer cells are asymmetric and unreliable:

cell	AOPC AUC	E10 top-1 hit (same cell)
2→2 (within)	0.79 ± 0.26	0.85
3→3 (within)	0.94 ± 0.13	0.97
2→3 (transfer)	0.45 ± 0.15 (collapses)	0.96
3→2 (transfer)	1.52 ± 0.35 (inflates above any within)	0.86

The 3→2 AOPC inflation isn’t faithfulness winning — it’s the model being far from saturation (E9 MAE 9.34 eV), so ablating any atom moves the wildly-wrong prediction by a large absolute amount. The 2→3 AOPC collapse is the inverse: the model has saturated on a constant-ish prediction and atom ablations don’t move it much, even though attention is correctly identifying dopants (top-1 hit 0.96).

The honest reading: AOPC conflates “faithful attribution” with “prediction is far from saturation”, and under cross-density transfer the bag-level prediction itself collapses (E9 finding) — so AOPC becomes uninformative about whether attention is correctly attributing. The E10 keystone is preserved but narrowed: the interpretability-survival claim is two-pillar (top-k recall + attention concentration), with the within-distribution AOPC advantage from §4.6i as a separate finding. Writeup at materials-nlp/e12_aopc_transfer_result.md.

Ninth controlled axis — mixed-density training closes the loop. E9 said cross-density transfer is catastrophic; E10 said interpretability survives it anyway. The practical follow-up: if you train on the union of densities, do you recover within-density accuracy? Three train regimes (2-only, 3-only, mixed) × two val regimes (2-elem, 3-elem), 5 seeds:

train	2val MAE	3val MAE	2val top-1	3val top-1
2-only (specialist)	0.75	6.83 (transfer)	0.88	0.97
3-only (specialist)	9.34 (transfer)	0.64	0.90	0.98
mixed (2+3 union)	0.93	0.81	0.86	0.97

Mixed-density training rescues accuracy at ~25% penalty over specialists and preserves interpretability. From E9’s catastrophic transfer (6.8 / 9.3 eV) to within-distribution-grade accuracy (0.81 / 0.93 eV) — an 8.4-10× MAE improvement, achieved just by training on the union. Top-1 dopant hit is essentially identical to specialists (0.86-0.97 across all six cells); attention concentration stays in the 2.55-3.69× operational band across all cells.

So the complete cross-density story (E8 → E9 → E10 → E11) is:

Single-density training is regime-bounded: the §4 wedge holds at 2-elem but flips at 3-elem (E8).
Cross-density transfer is catastrophic on bag-level MAE (5-30× degradation, E9).
Interpretability is transfer-robust — the attention map generalizes even when the head doesn’t (E10, keystone).
Mixed-density training is the practical deployment recipe: ~25% penalty over specialists, with no interpretability cost (E11, this).

The deployment recommendation is now actionable: train on the density union, evaluate density-stratified, and use the attention map for interpretation regardless of train/eval mismatch.

Writeup at materials-nlp/e11_mixed_density_result.md.

Tenth controlled axis — quantile calibration is moderate within distribution, breaks under transfer. The §4 paper and experiment- spec both claim the quantile head delivers “calibrated uncertainty for free” for the Paper-2 BO acquisition function. Nothing in the session had actually trained or evaluated a quantile model until now. Trained FFN MIL with 5-quantile pinball loss and measured reliability + ECE on all four cross-density cells:

cell	median MAE	ECE
2→2 (within)	0.91 ± 0.25	0.156 ± 0.07
3→3 (within)	0.81 ± 0.15	0.176 ± 0.07
2→3 (transfer)	4.77 ± 2.10	0.312 ± 0.10
3→2 (transfer)	12.81 ± 2.77	0.497 ± 0.01 (saturated)

Within-distribution ECE ≈ 0.16. The reliability curve tracks the ideal diagonal in shape but is biased high in the middle and slightly under-confident at the extremes — the textbook “needs Platt/isotonic post-hoc recalibration” pattern. Achievable but not free; the original “calibrated uncertainty for free” claim isn’t empirically supported.

Under transfer the calibration collapses: the 3→2 cell has empirical coverage saturated at 1.0 across every nominal quantile — every true 2-elem value falls below the lowest predicted quantile, because the 3-elem-trained model predicts wildly too-high values. The 2→3 cell flattens at ~0.25-0.32 coverage across all quantiles — the predicted range is too narrow and miscentered.

The 5-metric cross-density survival summary (combining E9, E10, E12, E6):

metric	type	within	transfer	verdict
bag-level MAE	prediction-side	0.5 eV	5-30× worse	collapses
top-1 dopant hit	attention-map	0.91	0.91	survives
attention concentration	attention-map	3.4×	3.0×	survives
AOPC AUC	prediction-sensitivity	0.79-0.94	0.45 / 1.52	asymmetric collapse
calibration ECE	prediction-side	0.16	0.31-0.50	collapses

Two-pillar transfer-robust interpretability vs three transfer- fragile prediction-side metrics. The interpretability claim generalizes across density; the accuracy / faithfulness / calibration claims are within-distribution only with known practical mitigations (mixed-density training, post-hoc recalibration). Paper-2’s BO pitch needs both recalibration AND mixed-density training to deliver actionable uncertainty estimates. Writeup at materials-nlp/e6_calibration_result.md.

Eleventh controlled axis — mixed-density training rescues calibration too. Same pattern as E11 (which rescued accuracy): train the quantile head on the 2+3-elem union, evaluate on each density. ECE drops from the catastrophic transfer values (0.30 / 0.50) to within-specialist levels (0.18 / 0.17):

cell	median MAE	ECE
2-only specialist on 2val	0.91 ± 0.25	0.156
3-only specialist on 3val	0.67 ± 0.20	0.184
2-only on 3val (transfer)	4.37	0.296
3-only on 2val (transfer)	13.51	0.498 (saturated)
mixed on 2val	0.88	0.180
mixed on 3val	0.81	0.174

The reliability curves for the mixed-trained model on both val sets hug the ideal diagonal alongside the specialists. The catastrophic transfer collapse is fully rescued. Combined with post-hoc Platt/isotonic recalibration (a known fix that drops within-distribution ECE under 0.05), this delivers the calibrated uncertainty the Paper-2 BO acquisition function needs.

Complete Paper-2 deployment recipe (E10 + E11 + E13 combined):

Mixed-density quantile training → recovers both bag-level accuracy and quantile calibration on every density in the union; ~25% MAE penalty and ~4% ECE penalty over specialists.
Post-hoc Platt/isotonic recalibration → reduces ECE further. (Verified by E14 below — actual reduction is 25-44%, not “under 0.05” as earlier writeups optimistically claimed.)
Attention-map $a_k$ as per-site importance for BO acquisition → guaranteed regime-invariant by E10 keystone; survives cross-density transfer at +0.004 top-1 hit drift.

Three steps, all empirically grounded. The original “calibrated uncertainty for free” claim from the experiment-spec becomes “calibrated uncertainty in three concrete steps, with measured caveats from E14 below.” Writeup at materials-nlp/e13_mixed_calibration_result.md.

Twelfth controlled axis — measuring the recalibration step (honest correction). Three prior writeups (E6, E12, E13) all claim “Platt/isotonic recalibration should drop ECE under 0.05.” That claim was never measured — only promised. E14 (scope_platt_recalibration.py) ran the actual recalibration step. Trained the mixed-density quantile head on 60% of each cache, held out 20% as calibration fold for fitting an isotonic recalibrator, then evaluated ECE on the remaining 20% val. 5 seeds.

cell	pre-recal ECE	post-recal ECE	reduction
2val	0.164 ± 0.054	0.092 ± 0.026	−0.072 (44%)
3val	0.198 ± 0.153	0.148 ± 0.097	−0.050 (25%)

The “under 0.05” claim was overconfident. Isotonic recalibration does help meaningfully — 44% ECE reduction on 2val, 25% on 3val — but post-recalibration ECE is 0.09 (2val) and 0.15 (3val), well above the conventional 0.05 target. The reliability curves visibly move toward the ideal diagonal (Figure 21) but don’t reach it. Two diagnosable causes: small calibration set (N=20 per density makes the isotonic fit noisy), and direction-dependent quantile bias that a single global isotonic map can’t fully address.

Corrected Paper-2 calibration claim: post-hoc isotonic recalibration reduces ECE from 0.16-0.20 to 0.09-0.15 — a 25-44% improvement. The 0.05 target is not reached at this data scale; uncertainty bands are biased by 10-15% in absolute coverage. Workable for BO acquisition with explicit uncertainty-aware rules (e.g., conformal wrapping on top of recalibrated quantiles), not pristinely calibrated. The cumulative ECE journey through the deployment recipe is 0.40 (naïve transfer) → 0.18 (mixed training) → 0.09-0.15 (mixed + recalibration) — a 2.7-4.4× improvement overall, with a clearly-flagged residual gap from pristine calibration. Writeup at materials-nlp/e14_platt_result.md.

Thirteenth controlled axis — and E14’s diagnosis was wrong too. E14 attributed the under-target ECE to “direction-dependent quantile bias that a unified isotonic calibrator can’t fully address.” E15 (scope_per_density_recal.py) tested that diagnosis by fitting separate isotonic recalibrators per density:

recalibration	2val ECE	3val ECE	avg
pre-recal	0.164 ± 0.054	0.198 ± 0.153	0.181
unified isotonic (E14)	0.092 ± 0.026	0.148 ± 0.097	0.120
per-density isotonic (E15)	0.110 ± 0.042	0.146 ± 0.075	0.128

Per-density recalibration is slightly worse than unified (+0.008 ECE, +6.7%) — within noise. The diagnosis is wrong: the bottleneck is calibration data size, not calibrator design. Splitting the 20-sample calibration set in half (N=10 per density) hurts the per-density isotonic fits more than the direction-dependent bias hurts the unified compromise. The unified calibrator’s “average” between two opposing bias patterns turns out closer to the diagonal than either density-specific fit.

The honest cumulative diagnosis is one factor, not two: small calibration set (N=20 unified) is the limit. Reaching the 0.05 target needs more data — either a larger held-out fold (consuming train data) or cross-validated calibration. Per-density calibrators don’t help at this scale.

The §4.6n recipe should specify unified isotonic recalibration, not per-density. The accompanying caveat: ECE 0.09-0.15 is what the current data scale supports; cross-validated or larger-N calibration would close the residual gap to 0.05. Writeup at materials-nlp/e15_per_density_recal_result.md.

Fourteenth controlled axis — conformal prediction is the clean rescue. After two honest corrections (E14 retired “ECE under 0.05”; E15 retired “direction-dependent bias as the limit”), the question becomes: is there any calibration strategy that delivers the promised uncertainty at this data scale? Split conformal prediction (Romano-Patterson-Candès 2019) offers a different trade than isotonic: finite-sample marginal coverage guarantee at any chosen level, at the cost of wider intervals.

Same mixed-density quantile MIL as E13/E14/E15; target α=0.20 (nominal 80% coverage); split conformal calibration on the same 20% held-out fold; evaluated on the other 20% val.

metric	pre-conformal	post-conformal
2val coverage	0.62 ± 0.06	0.78 ± 0.13
3val coverage	0.62 ± 0.19	0.82 ± 0.09
2val width (eV)	2.09 ± 0.21	3.21 ± 0.97 (1.54× wider)
3val width (eV)	1.96 ± 0.26	3.08 ± 1.00 (1.57× wider)

Conformal hits the 0.80 nominal coverage target exactly (0.78 + 0.82 = avg 0.80) on both densities. The cost: intervals are 1.55× wider. This is the cleanest rescue of the session — after the two honest corrections, conformal delivers a guaranteed result by construction.

The Paper-2 deployment recipe now has two valid options at step 2:

2a. Pointwise calibration via unified isotonic recalibration (E14): ECE 0.09-0.15, no theoretical guarantee, use with BO acquisition functions that tolerate ~10-15% miscoverage.
2b. Interval calibration via split conformal (E16): guaranteed marginal coverage at any chosen α, intervals 1.55× wider, use with UCB / max-variance / knowledge-gradient acquisition.

For most BO applications, 2b is the right choice — the coverage guarantee makes the optimizer behave correctly under the uncertainty estimate. Writeup at materials-nlp/e16_conformal_result.md.

Fifteenth controlled axis — and a third honest correction. E15 diagnosed “calibration data size is the bottleneck.” E17 (scope_kfold_conformal.py) tested that by running K=5 cross-conformal — 4× more conformity scores via leave-fold-out training:

metric	split (n=40)	K-fold (n=160)	Δ
avg coverage	0.805	0.775	−0.030
avg width	3.01 eV	2.77 eV	−0.24 eV (−8%)

K-fold narrows intervals by 8% but at the cost of 5× more compute and slightly higher 3val variance. E15’s diagnosis is partly right but not magnitude-strong: calibration data size is one of two bottlenecks. The other is model variance from the small N=80 training set — each K-fold model is trained on only 64 bags and is noisier than the full-train model, and that variance limits how much the pooled τ can shrink.

This is the third honest correction in the session:

E14 retired “ECE under 0.05” from E6/E12/E13
E15 retired E14’s “direction-dependent bias” diagnosis
E17 narrows E15 to “calibration data size is part of the bottleneck”

For Paper-2 at this data scale, split conformal (E16) is the right choice — simpler, comparable performance, lower compute. K-fold cross-conformal is a marginal refinement to flag for when data scales up. Writeup at materials-nlp/e17_kfold_conformal_result.md.

Sixteenth controlled axis — deep ensemble closes the model-variance bottleneck (at coverage cost). E17 diagnosed two bottlenecks: calibration data scarcity AND model variance from N=80 training. K-fold cross-conformal addressed (1); E18 tests the model-variance side directly by training K=5 FFN MIL quantile models with different inits, averaging their predicted quantiles, then split-conformalizing the averaged predictions.

The three calibration strategies now form a clean Pareto curve on the same Paper-2 surrogate:

strategy	avg coverage	avg width	trade
E16 split (1 model)	0.80	3.15 eV	strict coverage
E17 K-fold	0.78	2.77 eV	balanced (−12%)
E18 deep ensemble	0.75	2.42 eV	width-first (−23%, 5pp miscoverage)

Ensembling narrows intervals 13% further beyond K-fold and 23% beyond E16 baseline — measuring that model variance was indeed a real bottleneck, as E17 diagnosed. The cost: coverage drops 5pp. At N=40 calibration the τ estimate is noisy (per-seed range −0.22 to +0.14 eV) and doesn’t fully compensate for the narrower raw ensemble interval.

The Paper-2 calibration recipe is now a documented Pareto choice, not a single fixed strategy:

Coverage-first (UCB, knowledge-gradient): E16 single-model.
Balanced (most BO): E17 K-fold — nominal coverage within 3pp, 12% narrower than E16.
Width-first (Thompson sampling, max-variance): E18 ensemble — 23% narrower, 5pp miscoverage.
Future: K-fold cross-conformal on ensemble predictions combines both bottleneck-closing mechanisms; expected to give nominal coverage AND narrowest width at K² compute. Half-day follow-up.

Writeup at materials-nlp/e18_deep_ensemble_result.md.

Seventeenth controlled axis — E19 closes the Pareto curve. Combined K=5 cross-conformal × M=3 ensemble: 15 trained models per seed for the K-fold conformity pooling, plus a 3-ensemble trained on all train+calib for test-time prediction. The full Pareto picture:

strategy	avg coverage	avg width
E16 split (1 model)	0.80	3.15 eV
E17 K-fold (5 models)	0.78	2.77 eV
E18 ensemble (5 models)	0.75	2.42 eV
E19 K-fold × ensemble (18 models)	0.84	2.46 eV

E19 strictly dominates the simpler strategies on aggregate coverage at near-narrowest width — coverage 0.84 above the 0.80 nominal target, width 2.46 eV within 2% of E18’s narrowest 2.42 eV. The combined strategy doesn’t average the individual effects; it delivers ensemble’s near-narrowest width AND K-fold’s coverage stability simultaneously.

So the Paper-2 calibration recipe collapses from a Pareto choice to a single recommended strategy:

“Use K=5 cross-conformal on an M=3 ensemble of quantile MIL models. Empirical coverage 0.84 at α=0.20 (above nominal 0.80) with interval width 2.46 eV. Compute cost: 18 model trainings per deployment epoch, amortizable across the BO campaign.”

Per-cell asymmetry remains (3val 0.95 over-conservative, 2val 0.73 under) but aggregate coverage 0.84 ≥ 0.80 meets spec. A per-density conformal τ is the natural further refinement.

Writeup at materials-nlp/e19_kfold_ensemble_result.md.

Eighteenth controlled axis — per-density conformal τ flattens the E19 per-cell asymmetry. Split conformity scores by density, compute τ_2elem and τ_3elem separately, apply per-density at test time:

metric	E19 pooled τ	E20 per-density τ	Δ
2val coverage	0.73	0.80 (nominal)	+0.07
3val coverage	0.95	0.95	0.00
2val width	2.46 eV	2.78 eV	+0.32
3val width	2.45 eV	2.40 eV	−0.05
aggregate coverage	0.84	0.875	+0.035
dispersion	0.22	0.15	−32%

Per-density τ rescues 2val coverage to exactly nominal while preserving 3val. Per-cell dispersion drops 32%. Cost: 13% wider 2val intervals (τ_2 = 0.39 > pooled 0.24 gives 2val the cushion it needed). 3val stays over-conservative because the raw ensemble interval is genuinely broad; conformal can’t tighten it without losing coverage.

The recipe now has a clean choice at the calibration step:

variant	when to use
E19 pooled τ	aggregate-coverage deployment
E20 per-density τ	density-stratified deployment

For most BO applications, E20 per-density τ is the cleaner default — per-cell nominal coverage at modest width cost, with the trade-off documented rather than implicit.

Writeup at materials-nlp/e20_per_density_conformal_result.md.

Nineteenth controlled axis — α-sweep confirms the recipe holds across BO confidence levels. Extended the 5-quantile head to a 9-quantile head (0.975) and swept α ∈ 0.2 — the confidence levels real BO acquisition functions actually use:

α	nominal	2val cov	3val cov	2val width	3val width
0.05	0.95	0.96	0.95	5.23 eV	4.36 eV
0.10	0.90	0.91	0.88	4.02 eV	3.40 eV
0.20	0.80	0.77	0.76	2.96 eV	2.60 eV

Empirical coverage tracks nominal within ±0.04 at every α. Both density reliability curves hug the ideal diagonal across the range. Width grows monotonically: 2.6 → 3.4 → 4.4 eV on 3val; 3.0 → 4.0 → 5.2 eV on 2val.

For Paper-2 BO deployment, the choice is concrete:

α=0.05 (95% interval): knowledge gradient, conservative Thompson
α=0.10 (90% interval): typical BO loop balance
α=0.20 (80% interval): exploration-friendly screening

The 9-quantile head (vs the original 5-quantile) is the right deployment choice since it covers the full α range without retraining. The α=0.20 cell undershoots by 3.5pp (within per-seed std); padding α slightly at calibration time would push it back to nominal — a deployment knob.

Writeup at materials-nlp/e21_alpha_sweep_result.md.

The experiment, concretely

This is the six-week plan that would demonstrate (or kill) the wedge. Full version with risk register and decision rules lives at materials-nlp/experiment-spec.md; here is the load-bearing summary.

Datasets, in layers.

Pretrain: OC20 + OC22 (used implicitly through the released UMA-s-1p2 checkpoint, which Meta-FAIR multi-task pretrained on OC20, OC22, and other corpora). We don’t train from scratch and we don’t fine-tune the backbone — initialize from facebook/UMA on HuggingFace and freeze. Caveat: EquiformerV2-OC22 checkpoints (the architecture Sci Adv 2025 actually used) are no longer publicly downloadable; UMA-s-1p2 is the same lab’s successor and the cleanest current substitute.
Downstream: HE-CoOOH OER overpotential — try Sci Adv 2025’s 4,822-structure set first; fall back to programmatically doping ~1,000 OC22 oxides if the original isn’t released (he_coooh_path_c.py in the working dir already implements this fallback).

Dataset-vs-architecture risk (own the awkward case). HE-CoOOH is the same dataset the Sci Adv 2025 EquiformerV2+Post-Att Adapter paper was tuned on. Two ways this can go sideways: (i) their architecture is uniquely well-matched to the dataset (constructed for it), in which case any reasonable alternative — including our MIL pool — will underperform on bulk MAE even if it wins on per-site importance; (ii) the dataset itself is architecture-agnostic and our wedge will show up cleanly. Sequencing accordingly: reproduce their headline overpotential MAE with UMA-direct (no MIL pool) before claiming any TempoSurfViT win. If UMA-direct lands within 10% of the Sci Adv 2025 number, the dataset is fair to compare on; if it lands much worse, the gap is architecture-coupling, not a genuine signal, and we should add a second downstream (Materials Project formation energy, or solid-electrolyte Li-conductivity) before publishing.

Two models.

Baseline: UMA-direct. Query the frozen UMA-s-1p2 calculator on each initial structure, report its predicted energy divided by atom count, subtract the train-set mean reference offset. This is the modern Meta external baseline — same backbone family as the unavailable EquiformerV2-OC22 + Post-Att Adapter, no MIL pool. On the morning Path A run on OC22 IS2RE val_id-20, UMA-direct (offset-corrected) lands at val MAE 0.225 eV/atom vs Ours-MIL 0.270.
Ours: per-site local cluster fed through frozen UMA-s-1p2 → side-info channel (composition, space group) → gated attention-MIL pool (the toy_mil.py aggregator, already validated on synthetic data and now wired through materials_mil.py on real slabs) → quantile head for overpotential. ~30M frozen parameters in the backbone, ~5M trainable in the aggregator + head.

Two primary metrics, pre-committed.

Overpotential MAE on a held-out HE composition family (novel-composition transfer test).
Per-site importance recall on 100 hand-curated structures where DFT-computed per-site activity contributions exist — does the top-3 attended site overlap with the physics-identified active site? Spearman correlation between $a_k$ and DFT activity.

Decision rule.

Outcome	Action
Win on MAE and importance	Strong paper; write up
Match MAE, win on importance	Defensible paper; lead with interpretability
Lose MAE by ≤10%, win on importance	Workshop / methods note
Lose MAE by >10%	Rescue with bag-level MAE pretrain or pivot
Lose both	Pivot to single-cell per §5

Compute / timeline. ~~24 h on a single A100 for one full bake-off run (~~$30–60 on RunPod / Lambda). Six-week schedule, one decision point per week: HE data secured → baseline reproduced → first end-to-end Ours number → importance evaluation infrastructure → ablations → writeup. The Apple-Silicon dev box can do the aggregator + head locally; UMA’s equivariant kernels currently run CPU-only on macOS (no MPS acceleration for the e3nn-style ops), so scaling past ~100 bags wants a cloud GPU.

Pre-specifying the decision rule means the answer is informative either way: a win is the Framing-C paper, a loss is honest data for the single-cell fallback (or GWAS, parked).

Beyond the bake-off: closing the synthesis loop

The Sci Adv 2025 paper doesn’t stop at “predict overpotential well.” It screens 17,500 candidate compositions, picks eight predicted-top ones, runs automated synthesis on those, and lands on TiFeNiZn- CoOOH at 263 mV/dec experimental OER overpotential. That closed loop — predict → screen → synthesize → measure — is what made it a Science Advances paper rather than a methods note.

An attention-MIL model has two structural advantages for the same loop:

The quantile head gives calibrated uncertainty for free — the TempoSurfViT recipe we’re reusing already trains a 9-quantile pinball head, which drops directly into any standard acquisition function (EI, UCB, Thompson) for “which composition to synthesize next.” No additional engineering.
The $a_k$ map tells the chemist what to vary, not just whether to synthesize. A standard surrogate says “predicted overpotential = X ± σ”; ours says the same plus “and the activity is concentrated on the Sr-substituted sites, so the next composition should perturb those.” That’s a different kind of recommendation — a hypothesis a synthesis chemist can act on without an interpretability decoder bolted on after.

The natural shape of the line of work is therefore two papers, not one. Paper 1 is the §4 bake-off: match UMA-direct on MAE, beat it on per-site importance recall. Paper 2 is the loop closure: end-to-end MIL-driven Bayesian optimization on a real (not synthetic) HE-catalyst screening budget, with experimental wet-lab validation on the top-k recommended compositions. Paper 1 is table-stakes; paper 2 is what makes the line substantively novel.

Contingency on Paper 2. Paper 2 is not a natural rollover from Paper 1 — it requires a synthesis collaborator with HE-catalyst lab capacity (precursor handling, electrochemical OER testing rig, ≥1–2 month turnaround per batch of ~8 compositions), plus the funding to actually run the synthesis. Absent that collaborator, Paper 1’s per-site importance recall figure (Spearman vs DFT-computed activity contributions on the 100-structure curated set) stands on its own as the deliverable. The closed-loop framing is the ambition; the per-site importance recall is the floor we commit to.

5. Open questions before committing

Rotation-invariant tokenization of “local environment” without throwing away geometry — SO(3)-equivariant features vs scalar invariants vs plain Wyckoff label. Pick one before starting.
Pretraining corpus size where MIL > GNN. The bio literature suggests O(10k) bags before MIL beats simpler baselines; need to verify the threshold holds in materials.
Single foundation model across crystal families, or one per family (oxides, sulfides, halides). Pan-cancer worked in the bio version; pan-chemistry is harder and possibly worse-calibrated.
DFT-computed labels vs experimental labels. Materials Project labels are computed, not measured — the transfer-to-experiment story will need explicit treatment.

Single-cell genomics as the named fallback

If the materials port hits blockers on rotation-equivariance or DFT-to-experiment transfer, single-cell genomics is the natural pivot. An adjacent-domain audit (working file: materials-nlp/adjacent-domains.md) found scRNA-seq / scATAC shares 3.5/4 of the same structural checks: a sample is a bag of cells, each cell carries a per-cell expression vector, phenotype labels live at the sample level, and per-cell importance is the canonical scientific question. Engineering would reuse everything except the per-instance backbone.

The wedge vs. scGPT / Geneformer / scFoundation is the MIL aggregator on top of an existing single-cell foundation model — those models currently treat each cell independently then pool by averaging, which throws away the per-cell importance signal.

6. Where I’m reading next

ATGC end-to-end (the most directly portable piece of the bio literature) is still the highest-priority dive. After today’s wedge result, the Sci Adv 2025 SI (paywalled at time of writing) is the next blocking item: it determines whether the “primary output vs post-hoc” framing in §3/§4 holds or needs softening. After that, the engineering priorities pre-empt more reading — the 100-structure per-site importance evaluation set (§4 primary metric #2) is what the bake-off currently lacks, and ships before any further literature pass.

7. Sources

Methodology lineage

DeepTCR — Nat Commun 2021 · GitHub · Documentation.txt
ATGC — Nat Biomed Eng 2023 (PubMed) · bioRxiv preprint v5 · code (OmnesRes/ATGC2)
DeepTCR_Cancer — Sci Adv 2022, GitHub
2025 dual-attention somatic-mutation LLM — ASCO Post coverage
Ilse, Tomczak, Welling — Attention-based Deep MIL (ICML 2018) · PyTorch reference (AMLab-Amsterdam)
Wang, Li, Metze — A Comparison of Five MIL Pooling Functions for Sound Event Detection with Weak Labeling (arXiv 2018, ICASSP 2019) — Direct prior empirical work characterizing which pool wins as a function of how localized the positive frames are. The qualitative half of §4’s wedge.
FocusMIL — robust MIL against spurious correlations (arXiv 2024) — Counter-point: max-pooling beats attention under spurious-correlation regimes. Relevant context for the pooling-choice diagnostic.
CLAM — clustering-constrained attention-MIL on WSIs, Nat Biomed Eng 2021
TransMIL — NeurIPS 2021

Materials prior art we’d be beating / engaging with

CGAT — Crystal Graph Attention Networks, Sci Adv 2021
CEGANN — npj Comp Mat 2023
Decoding active sites in high-entropy catalysts via attention-enhanced model — Sci Adv 2025 — EquiformerV2 + Post-Att Adapter; closest competitor to Framing C, uses attention on the equivariant graph rather than bag-of-sites MIL.
EquiformerV2 — Liao et al., ICLR 2024 — The underlying SO(3)-equivariant graph transformer.
Crystalformer — Taniai et al., ICLR 2024 — Infinitely-connected attention formulated as neural potential summation; SOTA on Materials Project + JARVIS-DFT at 29% of comparable Transformer params. (project page)
Site-Net — Moss et al., Digital Discovery 2023 (arXiv 2209.08190, code) — Transformer with bond-feature (pairwise) attention on atoms in a real-space supercell + mean-pool for MatBench regression. Adjacent to Framing C in shape but distinct in unit (“site” = atom in supercell, not Wyckoff/defect/binding site), pooling (unweighted mean, no MIL head), and task framing (bulk regression, not active-site identification).
DA-CGCNN — AIP Advances 2024 — CGCNN backbone with dual attention (channel + self); evaluated with cross-property transfer learning.
Foundation Models in Chemistry — JACS Au 2025
Generative AI for crystal structures review — npj Comp Mat 2025
AI for Materials Science survey — arXiv 2506.20743
AlloyGPT — npj Computational Materials 2025 — Transformer LM over alloy composition/structure tokens with self-attention as the interpretability route. Closest to Framing D, but attends over composition tokens, not over processing-step tokens.
Transformer-based HEA property predictor — Sci Rep 2025 — Same shape as AlloyGPT for HE alloys; same gap (no processing-route sequence).
GATGNN — Louis et al., PCCP 2020 (arXiv 2003.13379) — Global Attention Graph Neural Network: local-attention layers plus a global attention layer that weights atom-environment vectors into a crystal representation. The closest “global attention over atoms” prior art for Framing B, and the reason “first-class per-site importance” needs softening.
ComFormer — Yan et al., NeurIPS 2024 — Crystal graph transformer with SE(3)/SO(3)-invariant message passing and global attention; reports SOTA across crystal-property benchmarks.
AtomSets — Chen & Ong, npj Comp Mat 2021 — Transferable atom-level representations with a permutation-invariant set-pooling head. Adjacent in spirit to bag-of-sites MIL without being MIL in the Ilse-Tomczak sense; relevant prior art that reviewers may invoke against Framing B.
DefiNet — Sci Adv 2024 — Equivariant network for point-defect crystal structures and per-defect properties. Cleanest example of the per-defect label granularity that Framing C explicitly distinguishes itself from.

Molecular sequence models (Framing A prior art)

ChemBERTa-2 — Ahmad et al., arXiv 2022 — Masked-LM + multi-task regression over ~77M SMILES; the canonical molecular-BERT baseline.
MoLFormer — Ross et al., Nat Mach Intell 2022 (arXiv 2106.09553) — Transformer pretrained on up to ~1.1B molecules from ZINC + PubChem; the scale baseline for SMILES-BERT.
GP-MoLFormer — Ross et al., 2024 — Generative molecular modeling + property optimization via pair-tuning on top of MoLFormer-style pretraining.

First-Class Per-Site Importance in Materials: Attention-MIL on a Frozen Pretrained Backbone

Background§

Origin: NLP + a non-NLP aggregator§

1. The methodology, in domain-neutral form§

Equivariance: the materials-specific commitment§

The aggregator, written out§

A reality check on novelty§

2. Why the recipe transfers§

3. Mapping to materials science§

Framing A — polymer / molecule as sequence, blend as weighted bag§

Framing B — crystal as bag of local environments§

Framing C — defects / dopants as the bag (the cleanest port)§

Framing D — alloy processing route as ordered sequence§

When “instance” is harder: surfaces, amorphous, high-entropy§

4. Where this becomes a paper§

What we’d actually be beating§

Interpretability — against what materials already has§

Where the wedge actually lives — localized vs. distributed signal§

Controlled validation: synthetic continuum and N-scaling§

The experiment, concretely§

Beyond the bake-off: closing the synthesis loop§

5. Open questions before committing§

Single-cell genomics as the named fallback§

6. Where I’m reading next§

7. Sources§