Beyond 37%: When Optimal Stopping Stops Being Useful

All numbers reproducible from the snippets in §10. Simulations at n = 100 (252 for the market backtest in §8.5), 20,000–50,000 trials, seed 20260515.

Abstract

The classical secretary problem yields the $1/e \approx 0.368$ rule for maximizing the probability of selecting the unique best of $n$ candidates observed in random order. We document, via Monte Carlo at $n = 100$ with 20,000–50,000 trials per cell, how the result decays under each of its load-bearing assumptions. Under cardinal payoffs (Bearden 2006), the optimal explore fraction collapses to $\sim 1/\sqrt{n}$ and expected rank improves by a factor of two. Partial recall raises the optimal cutoff to $\sim 60\%$ at recall probability $0.5$ . Adverse selection between arrival order and quality ( $\sigma = 0.5$ ) cuts success probability by a third. A symmetric two-sided application of the rule, in which a match requires mutual acceptance, collapses the joint match rate by a factor of more than twenty. Full-information access to candidate values (Gilbert & Mosteller 1966) raises achievable success to $\approx 0.58$ . We then situate these results in the modern learning-augmented stopping literature — prophet inequalities (Hill & Kertz 1992), sample-driven thresholds (Dütting et al. 2020), and predicted-prior algorithms — and identify production systems (Hyperband, Bayesian optimization, LLM cascade routing, posted-price ad auctions) that operationalize the same problem under richer information structures. The 37% rule is the no-information, indicator-loss corner of a much larger design space; in real decision settings, the choice of metric and the available information structure dominate the choice of cutoff.

Keywords: secretary problem, optimal stopping, prophet inequalities, learning-augmented algorithms, online decisions, Bayesian optimization, Gilbert–Mosteller, Monte Carlo. MSC: 60G40, 62L15, 68W40.

TL;DR

The 37% rule is correct, but for the wrong objective: it maximizes the probability of picking the unique single best — an indicator loss almost no real decision uses.
Under cardinal payoffs (Bearden 2006), the optimal explore fraction collapses to $\sim 1/\sqrt{n}$ , and $E[\text{rank}]$ improves from 20 to $\sim 10$ for $n = 100$ .
Each classical assumption fails in practice: recall ( $q = 0.5$ → explore $\sim 60\%$ ); adverse selection ( $\sigma = 0.5$ cuts $P(\text{best})$ by a third); two-sided (joint 37% rule collapses the match rate by more than $20\times$ ); full information (Gilbert-Mosteller → $P(\text{best}) \approx 0.58$ , a 21 pp gain over the no-information bound).
The modern frame is learning-augmented stopping: prophet inequalities, sample-driven thresholds (Dütting et al. 2020), and predicted-prior algorithms that are consistent under good predictions and robust under bad ones.
Production reality: Hyperband/ASHA/BOHB, Bayesian optimization, LLM cascade routing, and posted-price ad auctions are all stopping problems with learned priors. The 37% rule is the minimax, no-information, indicator-loss corner of a much larger design space.
Takeaway: stopping is online learning, not solo optimization; the choice of metric (P(best) vs. E[value] vs. regret) dominates the answer far more than the cutoff.

1. The popular claim

If you’ve heard one piece of mathematical advice for making big life decisions, it’s probably the 37% rule. Line up your candidates — apartments, jobs, dates — reject the first 37 percent of them, then take the next one that’s better than everything you’ve seen so far. The proof is clean. The statement fits on a coffee mug. It shows up in self-help books and Hacker News threads, and people quote it in earnest.

It’s also the wrong answer to most of the questions it gets used for. Not because the math is wrong — the math is fine. The question is. The classical rule maximizes the probability of picking the single best candidate out of n, assuming you observe only relative ranks, can’t go back to anyone you’ve passed, and know up front exactly how many candidates you’ll see. Each of those caveats does real work. Remove any one and the optimal strategy moves — sometimes a little, sometimes a lot. The rest of this post derives the classical answer, then walks through what changes when each assumption breaks.

2. The theorem

The problem first circulated as a folklore puzzle in the late 1940s (Merrill Flood’s “fiancée problem,” 1949) and was popularized in Martin Gardner’s 1960 Scientific American column; Ferguson (1989) gives the definitive history.

Let $n$ candidates arrive in uniformly random order. After candidate $i$ you see only its rank within $\{1, \ldots, i\}$ . You must accept or reject before seeing candidate $i+1$ . The “threshold policy with cutoff $r$ ” rejects the first $r-1$ and then accepts the first subsequent candidate strictly better than all earlier ones. Lindley (1961) and Dynkin (1963) established:

P_n(r) \;=\; \frac{r-1}{n}\sum_{i=r}^{n}\frac{1}{i-1}.

Letting $n \to \infty$ with $x = r/n$ gives $-x \ln x$ , maximized at $x = 1/e \approx 0.3679$ , with corresponding success probability $1/e \approx 0.3679$ . Hence “37%.”

The result is striking because it is scale-invariant: the optimal fraction is the same whether n = 100 or n = 10^8, and the success rate does not decay with n.

3. Verifying the curve empirically

A Monte Carlo with n = 100 and 50,000 permutations recovers the theoretical curve exactly. Empirical maximum at r/n = 0.36–0.37 with P(best) = 0.373.

Figure 1. Empirical P(picks best) under threshold-r policy, n = 100, 50,000 trials. Each point is a Monte-Carlo estimate of P(best) at that r. Peak coincides with r/n = 1/e.

The function is symmetric-ish in a deceptive way: errors of 10 percentage points in either direction in the cutoff cost only ~2 points of success probability. The rule is robust to mis-tuning, which is part of why it travels well as advice.

4. What the rule actually optimizes

The classical objective is binary: you get credit only for picking the single best candidate. In practice you almost always care about how good the candidate is — picking the second-best is much better than picking the fiftieth-best. Under expected-rank loss (cardinal-payoff variant), the optimal explore fraction collapses to roughly 1/√n — i.e. about √n candidates — and the achieved expected rank scales as √n. The $\sqrt{n}$ result is most commonly cited from Bearden (2006); related cardinal formulations appear earlier (e.g., Mucci 1973), and the broader rank-based secretary literature is reviewed in Freeman (1983).

For n = 100:

Policy	Explore fraction	P(best)	E[rank of chosen]
Random pick	—	0.010	50.5
37% rule (r = 37)	0.37	0.373	20.1
Bearden √n rule (r = 10)	0.10	0.234	9.7
Oracle (sees all first)	—	1.000	1.0

The trade-off is real. The 37% rule maximizes the probability you do the absolute best; the √n rule cuts your expected rank in half, at the cost of some probability of landing the very best. If you would not feel catastrophically worse choosing the third-best apartment than the very best, you should be exploring less than 37%, not more.

5. The hidden assumptions

The classical theorem rests on five assumptions, each of which fails in common applications.

5.1 Recall

If a passed candidate can be re-approached with probability q of accepting the apology, the optimal explore fraction rises sharply. Christian and Griffiths (2016) report ~61% at q = 0.5; a 30,000-trial sweep agrees:

`q` (recall prob)	optimal `r*/n`	P(best)
0.00	0.36	0.368
0.25	0.50	0.476
0.50	0.62	0.610
0.75	0.78	0.782
1.00	1.00	1.000

If you can plausibly call the second-place apartment back even half the time, the right behavior is closer to “look at two-thirds of them, then start deciding,” not 37%.

Figure 2. As recall probability q rises, the optimal explore fraction r*/n rises with it, and so does success. n = 100, 30,000 trials per point. q = 0 recovers the classical 37% rule; q = 1 lets you look at everyone and pick the best.

5.2 Unknown horizon

If n is itself uncertain, the fixed-r rule degrades. With n ~ Uniform[50, 150] and a policy that assumes n = 100 (so r = 37), the success probability falls from 0.368 to 0.350 — small, but the policy is no longer optimal. Presman and Sonin (1972) characterize the optimal policy under random horizons.

5.3 Independence and exchangeability

The classical setup assumes candidates arrive in uniformly random order. Real labor, dating, and rental markets exhibit adverse selection: the candidates still available in month 6 are not a uniform sample of the original pool. To get a sense of how much this matters, I generated n = 100 i.i.d. uniform values and sorted them by arrival time t_i = −v_i + σ · Z_i, where Z_i is standard normal. Large σ recovers random arrival; σ → 0 is strict descending (best candidate first).

`σ` (arrival noise)	regime	P(best) under 37% rule
10.0	effectively random	0.370
2.0	mild adverse selection	0.358
1.0	moderate	0.328
0.5	strong	0.233
0.25	severe	0.079
0.0	best leaves first	0.000

The rule collapses faster than intuition predicts. Once arrival is correlated with quality at all (σ = 0.5), success drops by more than a third. At σ = 0.25 — a market where good candidates are noticeably more likely to be gone by next month — the 37% rule is essentially random. This pushes the optimal policy toward shorter exploration: each draw is informative not just about the sample but about the pool’s drift.

Figure 3. The 37% rule’s success probability as arrival becomes correlated with quality, n = 100, 30,000 trials. Arrival time is t_i = −v_i + σ·Z_i; large σ is the classical uniform-random regime (dashed reference line), small σ pulls high-quality candidates to the front of the queue where the rule is still observing.

5.4 The other side is also choosing

The classical problem treats candidates as passive objects waiting to be picked. In dating, hiring, and most other markets people actually run a 37% rule on, that is glaringly false: the other side has its own preferences and its own stopping rule. So what happens when both sides apply the 37% rule and a match only happens when both say yes?

Two-sided simulation, n = 100 candidates, both sides applying the 37% rule on their own private rankings, 50,000 trials:

Setting	Probability
One-sided: A picks A’s #1	0.368
Two-sided: a match happens at all	0.017
Two-sided: the match is A’s #1	0.010
Two-sided: the match is A’s #1 and B’s #1	0.007

Figure 4. The same 37% rule applied unilaterally vs. by both sides with mutual-acceptance required. n = 100, 50,000 trials. Two-sided success is the joint probability that both sides accept the same candidate at the same moment.

Requiring mutual acceptance under the same rule both sides individually think is optimal collapses the match rate by a factor of more than twenty. The reason: the 37% rule’s whole job is to be picky, and two picky agents very rarely have their “running best” align at the same moment in time. (When they do match, it’s usually high-quality — average rank 1.8 for both sides — but you only get that conditional on the rare match actually occurring.)

A caveat on what this is and isn’t. The simulation runs each side’s unilateral optimal rule and requires mutual acceptance ex post. Concretely: neither side stops on its own running best alone — both keep observing every candidate, both keep updating their internal running bests, and a match registers only when a single candidate simultaneously beats both running maxima. It is a heuristic baseline, not a game-theoretic equilibrium analysis — neither side is best-responding to the other’s strategy or re-optimizing under the joint structure. The equilibrium analysis is the subject of the two-sided secretary literature itself (Brams and Kilgour, 2001), which generally finds equilibrium thresholds that are less picky than $1/e$ when both sides are choosing. The point of the simulation here is to bound the cost of naïvely lifting a unilateral rule into a two-sided setting; the equilibrium would do better, but not enough to salvage the 37% prescription as advice in markets where the other side is also choosing. The broader takeaway generalizes: any time the “candidate” is also an agent running an optimization, the right framework is game theory, not stopping theory in isolation. Posted-price mechanisms (Hill and Kertz, 1992; prophet inequalities), stable matching, and online ad auctions are all versions of the same point — your stopping rule is one move in a game where the others get to move too.

5.5 No prior on values

The classical problem is the no-information variant: you see only ranks. The next section is what changes when you see the values themselves.

6. Full information: the Gilbert-Mosteller variant

Suppose candidates arrive with i.i.d. values from a known distribution (e.g. uniform on [0, 1]). Moser (1956), Sakaguchi (1961), and Gilbert and Mosteller (1966) solve the corresponding optimal stopping problem by backward induction. The optimal policy is a decreasing threshold sequence: at step i, accept the current value y if it exceeds the running maximum AND y ≥ s_{n-i}, where s_k is computed by dynamic programming.

For n = 100, the threshold solve produces:

step `i`	candidates left `k`	threshold `s_k`
1	99	0.992
25	75	0.990
50	50	0.984
75	25	0.967
90	10	0.917
99	1	0.000

Empirically P(pick max) = 0.583, against the asymptotic limit of ≈ 0.580 (Gilbert and Mosteller, 1966). That is a 21 percentage-point gain over the no-information 1/e, achieved purely by being allowed to see the numerical values. The intuition: rank information tells you about the sample so far, but not the population; a prior on values gives you both.

Figure 5. Gilbert-Mosteller optimal threshold s as a function of step i, for n = 100, uniform[0,1] values, from a backward-induction solve. Be picky early; relax steadily; by the last step the threshold drops to zero, so the policy accepts whatever is left.

The shape — picky early, monotonically relaxing — generalizes to most full-information stopping problems and is the working intuition behind posted-price mechanisms in algorithmic mechanism design.

7. What’s actually being researched now

The classical and full-information variants frame two extremes:

No information (secretary): worst case over priors, success → 1/e ≈ 0.368.
Full information (Moser/Gilbert-Mosteller): exact prior, success → ≈ 0.5802.

Modern work lives between these, in three directions:

Prophet inequalities (Krengel and Sucheston, 1977; Samuel-Cahn, 1984; surveyed in Hill and Kertz, 1992) generalize the full-information setting and underpin posted-price mechanism design. Simple threshold policies achieve a tight 1/2-approximation to the offline optimum even when distributions differ across steps.
Sample-driven and advice-augmented stopping. Dütting, Lattanzi, Paes Leme, and Vassilvitskii (2020) give a unified LP framework — “Secretaries with Advice” — covering the no-information classic, sample-access variants, full-information prophet inequalities, and a noisy-binary-advice model meant to represent ML predictions. This is probably the cleanest entry point to the modern literature.
Learning-augmented optimal stopping (e.g. Optimal Stopping with a Predicted Prior, arXiv:2511.03289, 2025) treats the prior as ML-predicted and asks for algorithms that are consistent when the prediction is good and robust when it isn’t. This is the natural framing whenever the “value distribution” is itself something a model is estimating, as in ad auctions, recommendation, and high-dimensional American-option pricing (cf. Deep Optimal Stopping, Becker, Cheridito, and Jentzen, JMLR 2019).

A small empirical curiosity: a 2025 Frontiers in AI study (Sheehan and Yakimenko) trains 672 neural architectures across 20,000 runs and finds that randomly exploring approximately 37% of a search space recovers the best architecture about as often as theory predicts, suggesting that even in modern NAS pipelines the no-information bound is not far from operative when value structure is sufficiently flat.

8. The ML engineer’s view: stopping is online learning

Strip the romance of “secretary problems” and what’s left is a setup any ML engineer recognizes: examples arrive one at a time, you take an irreversible action on each, you can’t query the same example twice, and your performance is measured against an offline oracle. The same description fits multi-armed bandits, Bayesian optimization, hyperparameter search (Hyperband, ASHA, BOHB), neural-architecture search, online auctions, LLM-cascade routing, and early stopping in model training. The 37% rule is the minimax-over-orderings, no-information, no-prediction, indicator-loss corner of this design space — optimal against an adversary who can permute candidates arbitrarily and pays the analyst zero for a near-best pick — and once you’re framing the problem as an engineer rather than a mathematician, three things change.

8.1 P(best) is the wrong metric

Probability of selecting the unique single best is a 0/1 loss. In production, you measure regret against the offline best — OPT − reward(policy) — or equivalently expected value, plus tail risk if the loss is asymmetric. Look at what that does to the leaderboard for the same problem: candidate values from a mixture distribution (0.8 × Beta(2, 5) + 0.2 × Beta(5, 1) — mostly mediocre, with occasional gems), n = 100, 20,000 trials.

Sample-driven rows below are averaged over 200 independent sample draws (each draw → solve thresholds once → evaluate on 2,000 sequences), so the numbers reflect the policy class rather than a single lucky sample.

Policy	P(best)	E[value]	E[rank]
Classical 37% rule (rank-only)	0.371	0.76	20.0
Sample-driven GM, `m = 5` samples	0.149	0.85	11.2
Sample-driven GM, `m = 10` samples	0.236	0.88	9.9
Sample-driven GM, `m = 15` samples	0.263	0.90	8.8
Sample-driven GM, `m = 25` samples	0.354	0.86	12.5
Sample-driven GM, `m = 50` samples	0.438	0.84	13.8
Full-info GM (known distribution)	0.585	0.86	11.9

Three things jump out. First, P(best) rises monotonically with m, converging from rank-only’s 0.37 toward full-info GM’s 0.59 — more distributional information always helps for the indicator objective. Second, E[value] is not monotone: it peaks around m = 15 at $\approx 0.90$ , above full-info GM’s 0.86, then declines as m grows further. Third, E[rank] mirrors E[value], with a minimum at m = 15 of $\approx 9$ that rises back toward the full-info bound of 12 as samples grow. The reason is exactly the wrong-objective critique: full-info GM is optimal for P(best), and on a heavy-lower-tail distribution it patiently waits for near-1.0 values and often ends up taking whatever’s left. A moderately-noisy threshold sequence — what you get from $\sim 15$ samples — accidentally accepts earlier on average and lands on higher values, even though it picks the best less often.

Restated as regret against the offline oracle (OPT ≈ E[max of n] ≈ 0.99 on this distribution):

Policy	Regret = OPT − E[value]
Classical 37% rule	0.23
Full-info GM	0.13
Sample-driven GM (`m = 15`)	0.09

A roughly 60% regret reduction at m = 15 relative to the classical rule — and a smaller but real improvement over the known-distribution full-info policy — is the kind of result an engineer would notice in a dashboard, and it survives averaging over the sample draws. The figure below shows the full mean-over-draws curve.

Figure 6. Sample-driven Gilbert-Mosteller stopping on a Beta mixture, n = 100, averaged over 200 sample draws × 2,000 sequences each. E[value] is non-monotone in m: it peaks near m ≈ 15 above full-info GM, then declines back toward the full-info bound as samples grow. The dashed lines are the rank-only classical rule and the full-information GM-optimal policy for reference.

8.2 Predictor noise is graceful

Same set-up, but values are uniform on [0, 1] and the policy sees a noisy observation x_i = v_i + σ · Z_i. The 37% rule discards x and uses rank; GM uses x directly — re-using the same thresholds that were solved for clean uniform-[0, 1] values, even though x can lie outside that range. (A properly redesigned predicted-prior policy would re-solve the DP for the observation distribution v + σ·Z itself; this is intentionally the naïve adaptation.) As σ grows, the predicted-prior policy degrades smoothly:

`σ` (predictor noise)	P(best)	E[value]	E[rank]
0.00 (full info)	0.588	0.89	11.3
0.05 (mild noise)	0.210	0.94	6.3
0.25	0.049	0.84	16.5
1.00 (heavy noise)	0.016	0.59	41.4
Classical 37% rule (no predictor at all)	0.372	0.80	19.8

P(best) collapses with noise, but E[value] stays high until σ is quite large; and the predicted-prior policy still beats the classical rule on E[value] at σ ≤ 0.25. The lesson is the one Dütting et al. (2020) and the learning-augmented stopping literature have been arguing: a partially broken predictor is still better than no predictor, as long as you report on the right metric.

Figure 7. Predicted-prior Gilbert-Mosteller stopping on uniform[0,1] values, with noisy observations x = v + σ·Z. The dashed reference is the classical 37% rule (which ignores the predictor and uses ranks only). At σ ≈ 0.35 the noise has degraded the predictor enough that the rank-only rule catches up; below that, even a noisy predictor buys you several points of expected value.

(One small empirical curiosity worth flagging: the policy’s E[value] peaks slightly above the σ = 0 baseline, around σ = 0.05 — because GM is optimal for P(best), not for expected value, and mild noise accidentally nudges the policy toward earlier acceptance. It is one of the cleanest demonstrations available that “optimal under objective A” does not imply “good under objective B.” Reporting matters.)

8.3 Where this lives in production

The names change and the assumptions differ; strict equivalences are rare, but the underlying problem structure recurs:

Bayesian hyperparameter search (Hyperband, ASHA, BOHB). Fundamentally sequential halving in a multi-armed bandit; the bracket schedule plays the role of a budgeted explore phase and shares mathematical DNA with prophet-style cutoffs without being identical to them.
Bayesian optimization (Google Vizier, BoTorch, Ax). The acquisition functions (EI, UCB, Thompson sampling) are not prophet-inequality thresholds; they share a Bayesian decision-theoretic foundation in which a learned posterior over values plays the role that a known prior plays in the full-information stopping problem.
LLM cascade routing. “Try the small model first; commit if confidence ≥ τ” is the cleanest direct analogue on this list — a full-information stopping policy with the confidence score as the realized value.
Online ad auctions and posted-price mechanisms. Prophet inequalities are operative here in a precise sense; posted-price mechanisms in algorithmic mechanism design use threshold rules whose competitive ratio is a prophet bound.
Neural architecture search. The Sheehan and Yakimenko (2025) result — random 37% exploration is empirically near-optimal for NAS — is a point of contact, not a derivation; it suggests rank-only reasoning is the right abstraction when value structure is flat enough.

The pop-science framing — “use the 37% rule for big life decisions” — is the deepest possible reduction of this problem space. Any engineer building any of the systems above would be cashiered for using it.

8.4 An open problem at the foundations

For all the closed-form policies above, the closest natural cousin to the engineer’s actual objective — minimize expected rank of the chosen candidate — is the Robbins problem: given n i.i.d. uniform values arriving in random order, observe each value (not just its rank), and choose a stopping policy to minimize E[rank of chosen]. As of 2026, this problem is unsolved. The asymptotic optimum is known to lie strictly between two bounds, and no closed-form rule is known to achieve the lower one.

Policy	Asymptotic E[rank]	Status
Random pick	`n/2 ≈ 50` (for n=100)	trivial
Classical 37% rule (rank-only)	grows linearly in $n$ (≈ 20.1 at $n=100$ ; see §4)	empirical
Optimal rank-only rule (Chow et al. 1964)	bounded; $\to 3.8695\ldots$	closed form
Bearden √n rule (rank-only, cardinal payoff)	`≈ √n`	closed form
Robbins optimum (value-aware)	∈ [≈1.908, ≈2.327]	open since 1990

Two things make this remarkable. First, the gap between the best-known-policy upper bound and the best-known lower bound is a literal constant of about 0.4, and has stayed roughly that wide for three decades despite serious attention. Second, the problem statement fits in a single sentence and concerns iid uniform values, which means “we still don’t know the optimal way to do best-of-N sampling under expected-rank loss” is, in 2026, a true statement about the state of mathematics. Anyone building rerank or sampling policies for LLM inference is implicitly making a choice on this frontier (Bruss, 2005 is the survey to start with).

8.5 A real-data sanity check

The experiments above are on clean i.i.d. data. Here is what happens when the i.i.d. assumption is replaced with something one degree closer to a real decision. You hold one share of an index ETF and must sell on exactly one trading day in the next year. Daily prices follow a geometric Brownian motion with parameters approximating long-run US equities (μ = 7%/yr, σ = 16%/yr, starting price 100, n = 252 trading days). For the sample-driven policy, you also get five prior years of price history. Twenty thousand simulated years.

Policy	Mean sale price	Regret vs peak
Random day	103.61	$13.67
85th-percentile threshold	102.67	$14.61
Classical 37% rule	104.63	$12.64
Sample-driven GM (5 yrs of history)	104.79	$12.49
Hold to year-end	107.19	$10.09
Oracle (sells at peak)	117.28	$0

A single representative path makes the failure mode visible. Below, one simulated year is plotted with each policy’s stopping point marked. The classical 37% rule pulls the trigger in early summer at $108, well before the year-end peak of$ 116; the sample-driven GM does slightly better at $113; the no-policy “hold to year-end” lands on the year’s high almost by definition, because drift carries the running maximum upward over the horizon.

Figure 8. One simulated trading year, daily prices under geometric Brownian motion (μ = 7%, σ = 16%, n = 252). On this realization the year-end coincides with the peak; the 37% rule sells in early summer at $108, eight dollars below the year-end high. Across 20,000 such years (table above), the no-policy “hold to year-end” strategy beats every threshold rule.

The simplest strategy — never stop, just take the last price — beats every stopping rule, including the sample-driven policy that learned the value distribution from five years of historical data. This is §5.3 made empirical. The secretary problem assumes exchangeable arrivals; a market with positive drift has anything but. New running maxes tend to occur early in the year, before drift has had time to push prices up, so a rule that fires on “first new running max after day 93” systematically sells too early. The sample-driven GM doesn’t save you because the GM thresholds — calibrated for the marginal price distribution — say nothing about the temporal evolution of the running maximum.

The lesson is the one the post has been pulling on in different forms the whole way through: every theorem above presupposes a model of how values arrive, and most real decisions violate that model. The 37% rule, the √n rule, Gilbert-Mosteller, sample-driven GM — all of them get beaten by “wait” in any setting where waiting is monotonically informative, which is most settings worth caring about. Production ML systems escape this not by using better stopping rules but by modeling the structure explicitly: drift, autocorrelation, multi-fidelity evaluation, learned predictors over time. The conceptual descendant of the secretary problem in industry is not the 37% rule; it is the acquisition function in Bayesian optimization, which holds the same explore-then-exploit shape but throws away the iid assumption that made the original theorem clean.

9. Takeaways

The 37% rule is correct, but for an objective almost no one actually has: maximizing the probability of selecting the unique single best candidate. If you instead care about the expected quality of the candidate you pick — and in dating, hiring, and apartment-hunting you almost certainly do — you should explore far less, on the order of √n rather than n/e. Allow even partial recall and the right behavior moves toward exploring sixty percent of the pool. Let yourself see actual values rather than ranks and the achievable success rate jumps from thirty-seven percent to fifty-eight. Once correlations between arrival order and quality enter the picture, the classical rule degrades much faster than its reputation for robustness suggests. And in any market where the other side is also choosing — dating, hiring, school admissions — applying the 37% rule unilaterally is worse still, because two picky agents almost never have their preferences line up at the same moment. Stopping is a game, not a solo optimization.

The engineering version of this story is starker. Even fifteen historical observations of the value distribution cut regret against the offline oracle by roughly 60% relative to the classical rule — and on a heavy-lower-tail distribution actually beat the full-information policy on E[value] and E[rank], because the moderately-noisy thresholds accept earlier than a policy that patiently waits for near-1.0 values almost never sees. The literature has long since moved past closed-form thresholds toward learning-augmented stopping rules, where the prior on values is itself estimated from data; that is the part of the field worth reading now, and it is the version any engineer building Hyperband, Vizier, an ad auction, or an LLM cascade is already using under a different name.

Open questions

How robust are the assumptions in practice? §5.3 suggests the 37% rule collapses with even mild adverse selection. I would like to see the same simulation done on real labor-market or rental-market panel data, where one can estimate the actual correlation between arrival order and quality and read off what the optimal explore fraction should have been.

What is the right consistency-robustness tradeoff for an ML-predicted prior? Learning-augmented stopping rules are designed to fall back gracefully when a predictor is wrong, but the empirical question of “how wrong is the predictor in deployment, and does that match the theoretical worst case?” is open in most application areas I have looked at.

10. Simulation code

All numbers in the post come from four short numpy-only scripts seeded with 20260515.

Classical sweep (§3 / Figure 1):

import numpy as np
rng = np.random.default_rng(20260515)
N, TRIALS = 100, 50_000

def p_best(r):
    wins = 0
    for _ in range(TRIALS):
        perm = rng.permutation(N) + 1
        best_seen = perm[:r].min() if r else np.inf
        chosen = perm[-1]
        for x in perm[r:]:
            if x < best_seen:
                chosen = x; break
        wins += (chosen == 1)
    return wins / TRIALS

for r in range(0, N + 1):
    print(r, p_best(r))

Recall variant (§5.1 / Figure 2):

def p_best_recall(r, q):
    wins = 0
    for _ in range(TRIALS):
        perm = rng.permutation(N) + 1
        best_seen = perm[:r].min() if r else np.inf
        chosen = None
        for x in perm[r:]:
            if x < best_seen:
                chosen = x; break
        if chosen is None:
            chosen = best_seen if rng.random() < q else perm[-1]
        wins += (chosen == 1)
    return wins / TRIALS

Gilbert-Mosteller DP (§6 / Figure 5):

GRID = 2000
xs = np.linspace(0, 1, GRID + 1)
f = [np.zeros(GRID + 1)]            # f_0(x) = 0
thresholds = []
for k in range(1, N + 1):
    accept = xs ** (k - 1)
    cont = f[k - 1]
    integrand = np.maximum(accept, cont)
    thresholds.append(xs[np.argmax(accept >= cont)])
    dx = 1.0 / GRID
    tail = np.zeros(GRID + 1)
    for i in range(GRID - 1, -1, -1):
        tail[i] = tail[i + 1] + 0.5 * (integrand[i] + integrand[i + 1]) * dx
    f.append(xs * cont + tail)

Unknown horizon (§5.2). Policy fixes r = 37 (calibrated for n = 100) while the true horizon is drawn from Uniform{50, …, 150}:

def p_best_unknown_horizon():
    wins = 0
    for _ in range(TRIALS):
        n = int(rng.integers(50, 151))
        perm = rng.permutation(n) + 1
        r = 37
        if r >= n:
            chosen = perm[-1]
        else:
            best_seen = perm[:r].min()
            chosen = perm[-1]
            for x in perm[r:]:
                if x < best_seen:
                    chosen = x; break
        wins += (chosen == 1)
    return wins / TRIALS

Adverse selection (§5.3):

def p_best_adverse(sigma):
    wins = 0
    for _ in range(TRIALS):
        v = rng.random(N)
        order = np.argsort(-v + sigma * rng.standard_normal(N))
        a = v[order]
        r = round(N / np.e)
        best_seen = a[:r].max()
        chosen = a[-1]
        for x in a[r:]:
            if x > best_seen:
                chosen = x; break
        wins += (chosen == v.max())
    return wins / TRIALS

Two-sided 37% rule (§5.4):

def two_sided():
    matches = a_top = 0
    for _ in range(TRIALS):
        pa, pb = rng.permutation(N) + 1, rng.permutation(N) + 1
        a_best, b_best = pa[:r].min(), pb[:r].min()
        for i in range(r, N):
            if pa[i] < a_best and pb[i] < b_best:
                matches += 1
                a_top += (pa[i] == 1)
                break
            a_best = min(a_best, pa[i])
            b_best = min(b_best, pb[i])
    return matches / TRIALS, a_top / TRIALS

Sample-driven GM with empirical CDF (§8.1). Given m historical samples, build the empirical CDF and run a backward-induction threshold solve on the remaining n − m candidates:

def gm_thresholds_empirical(samples, k_max, grid=1000):
    xs = np.linspace(0, 1, grid + 1)
    F = np.searchsorted(np.sort(samples), xs, side="right") / len(samples)
    f_prev = np.zeros_like(xs)
    thresholds = []
    sample_bin = np.clip(np.searchsorted(xs, np.sort(samples)), 0, grid)
    for k in range(1, k_max + 1):
        accept = F ** (k - 1)
        integrand = np.maximum(accept, f_prev)
        thresholds.append(xs[np.argmax(accept >= f_prev)])
        cell = np.zeros_like(xs)
        np.add.at(cell, sample_bin, integrand[sample_bin] / len(samples))
        tail = np.r_[np.cumsum(cell[::-1])[::-1][1:], 0.0]
        f_prev = F * f_prev + tail
    return thresholds

Mean-over-draws driver for §8.1. The §8.1 table averages over independent sample realizations so the policy class — not a single lucky draw — is what gets measured:

def mixture(rng, size):  # 0.8 * Beta(2,5) + 0.2 * Beta(5,1)
    pick_high = rng.random(size) < 0.2
    return np.where(pick_high,
                    rng.beta(5, 1, size),
                    rng.beta(2, 5, size))

def run_policy(thresholds, v):
    running_max = -np.inf
    chosen = v[-1]
    for i in range(N):
        s = thresholds[N - i - 1]
        if v[i] > running_max:
            running_max = v[i]
            if v[i] >= s:
                chosen = v[i]; break
    return chosen

def evaluate_sample_driven(m, draws=200, seqs=2_000):
    pbs, evs = [], []
    for d in range(draws):
        rng_s = np.random.default_rng(20260515 + 1000 * d)
        samples = mixture(rng_s, m)
        thresh = gm_thresholds_empirical(samples, N)
        rng_e = np.random.default_rng(20260515 + 1000 * d + 7)
        pb = ev = 0
        for _ in range(seqs):
            v = mixture(rng_e, N)
            chosen = run_policy(thresh, v)
            ev += chosen
            pb += int(chosen == v.max())
        pbs.append(pb / seqs); evs.append(ev / seqs)
    return np.mean(pbs), np.mean(evs)

A single fixed draw of size m can give P(best) and E[value] anywhere in roughly $[0.15, 0.56] \times [0.63, 0.97]$ for m = 25 (we checked 20 independent draws); the averaging is what produces the non-monotone E[value] curve in Figure 6.

Noisy-predictor GM (§8.2). Re-use the GM threshold sequence for uniform values, but feed it a noisy observation in place of the true value:

def predictor_policy(sigma, thresholds_u):
    v = rng.random(N)
    x = v + sigma * rng.standard_normal(N)
    running_max, chosen_i = -np.inf, N - 1
    for i in range(N):
        s = thresholds_u[N - i - 1]
        if x[i] > running_max:
            running_max = x[i]
            if x[i] >= s:
                chosen_i = i
                break
    return v[chosen_i]

Market-timing backtest (§8.5). Geometric Brownian motion for a year of trading days, then apply each policy:

N = 252
MU, SIGMA = 0.07 / N, 0.16 / np.sqrt(N)

def simulate_year():
    logrets = rng.normal(MU - 0.5 * SIGMA**2, SIGMA, N)
    return 100 * np.exp(np.cumsum(logrets))

def classical_37(prices):
    r = round(N / np.e)
    best = prices[:r].max()
    for p in prices[r:]:
        if p > best:
            return p
    return prices[-1]

def hold_to_end(prices):
    return prices[-1]

oracle = np.array([simulate_year().max() for _ in range(TRIALS)])
got = np.array([classical_37(simulate_year()) for _ in range(TRIALS)])
print("regret vs oracle =", (oracle - got).mean())

References

Cited in text.

Bearden, J. N. (2006). A new secretary problem with rank-based selection and cardinal payoffs. Journal of Mathematical Psychology, 50(1), 58–59. — The √n rule and the cardinal-payoff result of §4.
Becker, S., Cheridito, P., & Jentzen, A. (2019). Deep optimal stopping. Journal of Machine Learning Research, 20(74), 1–25.
Brams, S. J., & Kilgour, D. M. (2001). Revisiting the Secretary Problem. — The two-sided variant cited in §5.4.
Bruss, F. T. (2005). What is known about Robbins’ problem? Journal of Applied Probability, 42(1), 108–120. — Survey of the open expected-rank problem cited in §8.4.
Chow, Y. S., Moriguti, S., Robbins, H., & Samuels, S. M. (1964). Optimal selection based on relative rank (the “secretary problem”). Israel Journal of Mathematics, 2(2), 81–90. — The optimal rank-only policy for expected rank; asymptote $\to 3.8695$ , cited in §8.4.
Christian, B., & Griffiths, T. (2016). Algorithms to Live By: The Computer Science of Human Decisions. Henry Holt.
Dütting, P., Lattanzi, S., Paes Leme, R., & Vassilvitskii, S. (2020). Secretaries with Advice. arXiv:2011.06726. — The LP unification cited in §7.
Dynkin, E. B. (1963). The optimum choice of the instant for stopping a Markov process. Soviet Mathematics Doklady, 4, 627–629.
Ferguson, T. S. (1989). Who solved the secretary problem? Statistical Science, 4(3), 282–289. — The historical record cited in §2.
Gardner, M. (1960). Mathematical Games. Scientific American, February.
Gilbert, J. P., & Mosteller, F. (1966). Recognizing the maximum of a sequence. Journal of the American Statistical Association, 61(313), 35–73.
Hill, T. P., & Kertz, R. P. (1992). A Survey of Prophet Inequalities in Optimal Stopping Theory. Contemporary Mathematics, 125, 191–207.
Krengel, U., & Sucheston, L. (1977). Semiamarts and finite values. Bulletin of the AMS, 83, 745–747.
Lindley, D. V. (1961). Dynamic programming and decision theory. Journal of the Royal Statistical Society. Series C, 10(1), 39–51.
Moser, L. (1956). On a problem of Cayley. Scripta Mathematica, 22, 289–292.
Mucci, A. G. (1973). Differential equations and optimal choice problems. Annals of Statistics, 1(1), 104–113. — Early treatment of cardinal-payoff variants relevant to the √n result of §4.
Presman, E. L., & Sonin, I. M. (1972). The best choice problem for a random number of objects. Theory of Probability and Its Applications, 17(4), 657–668.
Sakaguchi, M. (1961). Dynamic programming of some sequential sampling design. Journal of Mathematical Analysis and Applications, 2, 446–466.
Samuel-Cahn, E. (1984). Comparison of threshold stop rules and maximum for independent nonnegative random variables. Annals of Probability, 12(4), 1213–1216.
Optimal Stopping with a Predicted Prior. arXiv:2511.03289 (2025).
Sheehan, M., & Yakimenko, O. (2025). Neural architecture search applying optimal stopping theory. Frontiers in Artificial Intelligence, 8.

Further reading.

Correa, J., Dütting, P., Fischer, F., & Schewior, K. (2019). Prophet Inequalities for IID Random Variables from an Unknown Distribution. arXiv:1904.02050. — A constant number of samples per distribution nearly matches the full-information bound.
Freeman, P. R. (1983). The Secretary Problem and Its Extensions: A Review. International Statistical Review, 51(2), 189–206. — Older but exhaustive: 60+ variants catalogued.

Production ML systems cited in §8.3. Each of these is the same stopping problem under a different name; reading them in sequence is the fastest way to see the unifying structure.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2018). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR, 18(185), 1–52. — Bracketed early stopping as the explicit explore-then-commit primitive.
Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Ben-tzur, J., Hardt, M., Recht, B., & Talwalkar, A. (2020). A System for Massively Parallel Hyperparameter Tuning (ASHA). MLSys. — Async successive halving; what Hyperband looks like in production.
Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., & Sculley, D. (2017). Google Vizier: A Service for Black-Box Optimization. KDD. — Bayesian optimization as a Google-scale service. The acquisition function is a prophet-inequality threshold with a learned prior.
Frazier, P. I. (2018). A Tutorial on Bayesian Optimization. arXiv:1807.02811. — The cleanest single reference for the prior-over-values stopping view.

Beyond 37%: When Optimal Stopping Stops Being Useful

1. The popular claim§

2. The theorem§

3. Verifying the curve empirically§

4. What the rule actually optimizes§

5. The hidden assumptions§

5.1 Recall§

5.2 Unknown horizon§

5.3 Independence and exchangeability§

5.4 The other side is also choosing§

5.5 No prior on values§

6. Full information: the Gilbert-Mosteller variant§

7. What’s actually being researched now§

8. The ML engineer’s view: stopping is online learning§

8.1 P(best) is the wrong metric§

8.2 Predictor noise is graceful§

8.3 Where this lives in production§

8.4 An open problem at the foundations§

8.5 A real-data sanity check§

9. Takeaways§

Open questions§

10. Simulation code§

References§