All numbers reproducible from the snippets in §10. Simulations at n = 100 (252 for the market backtest in §8.5), 20,000–50,000 trials, seed 20260515.
The classical secretary problem yields the rule for maximizing the probability of selecting the unique best of candidates observed in random order. We document, via Monte Carlo at with 20,000–50,000 trials per cell, how the result decays under each of its load-bearing assumptions. Under cardinal payoffs (Bearden 2006), the optimal explore fraction collapses to and expected rank improves by a factor of two. Partial recall raises the optimal cutoff to at recall probability . Adverse selection between arrival order and quality () cuts success probability by a third. A symmetric two-sided application of the rule, in which a match requires mutual acceptance, collapses the joint match rate by a factor of more than twenty. Full-information access to candidate values (Gilbert & Mosteller 1966) raises achievable success to . We then situate these results in the modern learning-augmented stopping literature — prophet inequalities (Hill & Kertz 1992), sample-driven thresholds (Dütting et al. 2020), and predicted-prior algorithms — and identify production systems (Hyperband, Bayesian optimization, LLM cascade routing, posted-price ad auctions) that operationalize the same problem under richer information structures. The 37% rule is the no-information, indicator-loss corner of a much larger design space; in real decision settings, the choice of metric and the available information structure dominate the choice of cutoff.
Keywords: secretary problem, optimal stopping, prophet inequalities, learning-augmented algorithms, online decisions, Bayesian optimization, Gilbert–Mosteller, Monte Carlo. MSC: 60G40, 62L15, 68W40.
1. The popular claim
If you’ve heard one piece of mathematical advice for making big life decisions, it’s probably the 37% rule. Line up your candidates — apartments, jobs, dates — reject the first 37 percent of them, then take the next one that’s better than everything you’ve seen so far. The proof is clean. The statement fits on a coffee mug. It shows up in self-help books and Hacker News threads, and people quote it in earnest.
It’s also the wrong answer to most of the questions it gets used for.
Not because the math is wrong — the math is fine. The question is. The
classical rule maximizes the probability of picking the single best
candidate out of n, assuming you observe only relative ranks, can’t go
back to anyone you’ve passed, and know up front exactly how many
candidates you’ll see. Each of those caveats does real work. Remove any
one and the optimal strategy moves — sometimes a little, sometimes
a lot. The rest of this post derives the classical answer, then walks
through what changes when each assumption breaks.
2. The theorem
The problem first circulated as a folklore puzzle in the late 1940s (Merrill Flood’s “fiancée problem,” 1949) and was popularized in Martin Gardner’s 1960 Scientific American column; Ferguson (1989) gives the definitive history.
Let candidates arrive in uniformly random order. After candidate you see only its rank within . You must accept or reject before seeing candidate . The “threshold policy with cutoff ” rejects the first and then accepts the first subsequent candidate strictly better than all earlier ones. Lindley (1961) and Dynkin (1963) established:
Letting with gives , maximized at , with corresponding success probability . Hence “37%.”
The result is striking because it is scale-invariant: the optimal fraction
is the same whether n = 100 or n = 10^8, and the success rate does not
decay with n.
3. Verifying the curve empirically
A Monte Carlo with n = 100 and 50,000 permutations recovers the
theoretical curve exactly. Empirical maximum at r/n = 0.36–0.37 with
P(best) = 0.373.
The function is symmetric-ish in a deceptive way: errors of 10 percentage points in either direction in the cutoff cost only ~2 points of success probability. The rule is robust to mis-tuning, which is part of why it travels well as advice.
4. What the rule actually optimizes
The classical objective is binary: you get credit only for picking the
single best candidate. In practice you almost always care about how good
the candidate is — picking the second-best is much better than picking
the fiftieth-best. Under expected-rank loss (cardinal-payoff variant),
the optimal explore fraction collapses to roughly 1/√n — i.e. about
√n candidates — and the achieved expected rank scales as √n. The
result is most commonly cited from
Bearden (2006);
related cardinal formulations appear earlier (e.g., Mucci 1973), and
the broader rank-based secretary literature is reviewed in
Freeman (1983).
For n = 100:
| Policy | Explore fraction | P(best) | E[rank of chosen] |
|---|---|---|---|
| Random pick | — | 0.010 | 50.5 |
| 37% rule (r = 37) | 0.37 | 0.373 | 20.1 |
| Bearden √n rule (r = 10) | 0.10 | 0.234 | 9.7 |
| Oracle (sees all first) | — | 1.000 | 1.0 |
The trade-off is real. The 37% rule maximizes the probability you do the absolute best; the √n rule cuts your expected rank in half, at the cost of some probability of landing the very best. If you would not feel catastrophically worse choosing the third-best apartment than the very best, you should be exploring less than 37%, not more.
5. The hidden assumptions
The classical theorem rests on five assumptions, each of which fails in common applications.
5.1 Recall
If a passed candidate can be re-approached with probability q of accepting
the apology, the optimal explore fraction rises sharply. Christian and
Griffiths (2016) report ~61% at q = 0.5; a 30,000-trial sweep agrees:
q (recall prob) | optimal r*/n | P(best) |
|---|---|---|
| 0.00 | 0.36 | 0.368 |
| 0.25 | 0.50 | 0.476 |
| 0.50 | 0.62 | 0.610 |
| 0.75 | 0.78 | 0.782 |
| 1.00 | 1.00 | 1.000 |
If you can plausibly call the second-place apartment back even half the time, the right behavior is closer to “look at two-thirds of them, then start deciding,” not 37%.
5.2 Unknown horizon
If n is itself uncertain, the fixed-r rule degrades. With
n ~ Uniform[50, 150] and a policy that assumes n = 100
(so r = 37), the success probability falls from 0.368 to 0.350 — small,
but the policy is no longer optimal. Presman and Sonin (1972) characterize
the optimal policy under random horizons.
5.3 Independence and exchangeability
The classical setup assumes candidates arrive in uniformly random order.
Real labor, dating, and rental markets exhibit adverse selection: the
candidates still available in month 6 are not a uniform sample of the
original pool. To get a sense of how much this matters, I generated
n = 100 i.i.d. uniform values and sorted them by arrival time
t_i = −v_i + σ · Z_i, where Z_i is standard normal. Large σ recovers
random arrival; σ → 0 is strict descending (best candidate first).
σ (arrival noise) | regime | P(best) under 37% rule |
|---|---|---|
| 10.0 | effectively random | 0.370 |
| 2.0 | mild adverse selection | 0.358 |
| 1.0 | moderate | 0.328 |
| 0.5 | strong | 0.233 |
| 0.25 | severe | 0.079 |
| 0.0 | best leaves first | 0.000 |
The rule collapses faster than intuition predicts. Once arrival is
correlated with quality at all (σ = 0.5), success drops by more than a
third. At σ = 0.25 — a market where good candidates are noticeably more
likely to be gone by next month — the 37% rule is essentially random.
This pushes the optimal policy toward shorter exploration: each draw is
informative not just about the sample but about the pool’s drift.
5.4 The other side is also choosing
The classical problem treats candidates as passive objects waiting to be picked. In dating, hiring, and most other markets people actually run a 37% rule on, that is glaringly false: the other side has its own preferences and its own stopping rule. So what happens when both sides apply the 37% rule and a match only happens when both say yes?
Two-sided simulation, n = 100 candidates, both sides applying the
37% rule on their own private rankings, 50,000 trials:
| Setting | Probability |
|---|---|
| One-sided: A picks A’s #1 | 0.368 |
| Two-sided: a match happens at all | 0.017 |
| Two-sided: the match is A’s #1 | 0.010 |
| Two-sided: the match is A’s #1 and B’s #1 | 0.007 |
Requiring mutual acceptance under the same rule both sides individually think is optimal collapses the match rate by a factor of more than twenty. The reason: the 37% rule’s whole job is to be picky, and two picky agents very rarely have their “running best” align at the same moment in time. (When they do match, it’s usually high-quality — average rank 1.8 for both sides — but you only get that conditional on the rare match actually occurring.)
A caveat on what this is and isn’t. The simulation runs each side’s unilateral optimal rule and requires mutual acceptance ex post. Concretely: neither side stops on its own running best alone — both keep observing every candidate, both keep updating their internal running bests, and a match registers only when a single candidate simultaneously beats both running maxima. It is a heuristic baseline, not a game-theoretic equilibrium analysis — neither side is best-responding to the other’s strategy or re-optimizing under the joint structure. The equilibrium analysis is the subject of the two-sided secretary literature itself (Brams and Kilgour, 2001), which generally finds equilibrium thresholds that are less picky than when both sides are choosing. The point of the simulation here is to bound the cost of naïvely lifting a unilateral rule into a two-sided setting; the equilibrium would do better, but not enough to salvage the 37% prescription as advice in markets where the other side is also choosing. The broader takeaway generalizes: any time the “candidate” is also an agent running an optimization, the right framework is game theory, not stopping theory in isolation. Posted-price mechanisms (Hill and Kertz, 1992; prophet inequalities), stable matching, and online ad auctions are all versions of the same point — your stopping rule is one move in a game where the others get to move too.
5.5 No prior on values
The classical problem is the no-information variant: you see only ranks. The next section is what changes when you see the values themselves.
6. Full information: the Gilbert-Mosteller variant
Suppose candidates arrive with i.i.d. values from a known distribution
(e.g. uniform on [0, 1]). Moser (1956), Sakaguchi (1961), and Gilbert
and Mosteller (1966) solve the corresponding optimal stopping problem by
backward induction. The optimal policy is a decreasing threshold
sequence: at step i, accept the current value y if it exceeds the
running maximum AND y ≥ s_{n-i}, where s_k is computed by
dynamic programming.
For n = 100, the threshold solve produces:
step i | candidates left k | threshold s_k |
|---|---|---|
| 1 | 99 | 0.992 |
| 25 | 75 | 0.990 |
| 50 | 50 | 0.984 |
| 75 | 25 | 0.967 |
| 90 | 10 | 0.917 |
| 99 | 1 | 0.000 |
Empirically P(pick max) = 0.583, against the asymptotic limit of
≈ 0.580 (Gilbert and Mosteller, 1966). That is a 21 percentage-point
gain over the no-information 1/e, achieved purely by being allowed to see
the numerical values. The intuition: rank information tells you about the
sample so far, but not the population; a prior on values gives you both.
The shape — picky early, monotonically relaxing — generalizes to most full-information stopping problems and is the working intuition behind posted-price mechanisms in algorithmic mechanism design.
7. What’s actually being researched now
The classical and full-information variants frame two extremes:
- No information (secretary): worst case over priors, success →
1/e ≈ 0.368. - Full information (Moser/Gilbert-Mosteller): exact prior, success →
≈ 0.5802.
Modern work lives between these, in three directions:
- Prophet inequalities (Krengel and Sucheston, 1977; Samuel-Cahn,
1984; surveyed in Hill and Kertz, 1992)
generalize the full-information setting and underpin posted-price
mechanism design. Simple threshold policies achieve a tight
1/2-approximation to the offline optimum even when distributions differ across steps. - Sample-driven and advice-augmented stopping. Dütting, Lattanzi, Paes Leme, and Vassilvitskii (2020) give a unified LP framework — “Secretaries with Advice” — covering the no-information classic, sample-access variants, full-information prophet inequalities, and a noisy-binary-advice model meant to represent ML predictions. This is probably the cleanest entry point to the modern literature.
- Learning-augmented optimal stopping (e.g. Optimal Stopping with a Predicted Prior, arXiv:2511.03289, 2025) treats the prior as ML-predicted and asks for algorithms that are consistent when the prediction is good and robust when it isn’t. This is the natural framing whenever the “value distribution” is itself something a model is estimating, as in ad auctions, recommendation, and high-dimensional American-option pricing (cf. Deep Optimal Stopping, Becker, Cheridito, and Jentzen, JMLR 2019).
A small empirical curiosity: a 2025 Frontiers in AI study (Sheehan and Yakimenko) trains 672 neural architectures across 20,000 runs and finds that randomly exploring approximately 37% of a search space recovers the best architecture about as often as theory predicts, suggesting that even in modern NAS pipelines the no-information bound is not far from operative when value structure is sufficiently flat.
8. The ML engineer’s view: stopping is online learning
Strip the romance of “secretary problems” and what’s left is a setup any ML engineer recognizes: examples arrive one at a time, you take an irreversible action on each, you can’t query the same example twice, and your performance is measured against an offline oracle. The same description fits multi-armed bandits, Bayesian optimization, hyperparameter search (Hyperband, ASHA, BOHB), neural-architecture search, online auctions, LLM-cascade routing, and early stopping in model training. The 37% rule is the minimax-over-orderings, no-information, no-prediction, indicator-loss corner of this design space — optimal against an adversary who can permute candidates arbitrarily and pays the analyst zero for a near-best pick — and once you’re framing the problem as an engineer rather than a mathematician, three things change.
8.1 P(best) is the wrong metric
Probability of selecting the unique single best is a 0/1 loss. In
production, you measure regret against the offline best — OPT − reward(policy) — or equivalently expected value, plus tail risk if the
loss is asymmetric. Look at what that does to the leaderboard for the
same problem: candidate values from a mixture distribution
(0.8 × Beta(2, 5) + 0.2 × Beta(5, 1) — mostly mediocre, with occasional
gems), n = 100, 20,000 trials.
Sample-driven rows below are averaged over 200 independent sample draws (each draw → solve thresholds once → evaluate on 2,000 sequences), so the numbers reflect the policy class rather than a single lucky sample.
| Policy | P(best) | E[value] | E[rank] |
|---|---|---|---|
| Classical 37% rule (rank-only) | 0.371 | 0.76 | 20.0 |
Sample-driven GM, m = 5 samples | 0.149 | 0.85 | 11.2 |
Sample-driven GM, m = 10 samples | 0.236 | 0.88 | 9.9 |
Sample-driven GM, m = 15 samples | 0.263 | 0.90 | 8.8 |
Sample-driven GM, m = 25 samples | 0.354 | 0.86 | 12.5 |
Sample-driven GM, m = 50 samples | 0.438 | 0.84 | 13.8 |
| Full-info GM (known distribution) | 0.585 | 0.86 | 11.9 |
Three things jump out. First, P(best) rises monotonically with m,
converging from rank-only’s 0.37 toward full-info GM’s 0.59 — more
distributional information always helps for the indicator objective.
Second, E[value] is not monotone: it peaks around m = 15 at
, above full-info GM’s 0.86, then declines as m
grows further. Third, E[rank] mirrors E[value], with a minimum at
m = 15 of that rises back toward the full-info bound of
12 as samples grow. The reason is exactly the wrong-objective
critique: full-info GM is optimal for P(best), and on a
heavy-lower-tail distribution it patiently waits for near-1.0 values
and often ends up taking whatever’s left. A moderately-noisy threshold
sequence — what you get from samples — accidentally
accepts earlier on average and lands on higher values, even though
it picks the best less often.
Restated as regret against the offline oracle (OPT ≈ E[max of n] ≈ 0.99
on this distribution):
| Policy | Regret = OPT − E[value] |
|---|---|
| Classical 37% rule | 0.23 |
| Full-info GM | 0.13 |
Sample-driven GM (m = 15) | 0.09 |
A roughly 60% regret reduction at m = 15 relative to the classical
rule — and a smaller but real improvement over the
known-distribution full-info policy — is the kind of result an
engineer would notice in a dashboard, and it survives averaging over
the sample draws. The figure below shows the full mean-over-draws
curve.
8.2 Predictor noise is graceful
Same set-up, but values are uniform on [0, 1] and the policy sees a
noisy observation x_i = v_i + σ · Z_i. The 37% rule discards x and
uses rank; GM uses x directly — re-using the same thresholds that
were solved for clean uniform-[0, 1] values, even though x can lie
outside that range. (A properly redesigned predicted-prior policy
would re-solve the DP for the observation distribution v + σ·Z
itself; this is intentionally the naïve adaptation.) As σ grows,
the predicted-prior policy degrades smoothly:
σ (predictor noise) | P(best) | E[value] | E[rank] |
|---|---|---|---|
| 0.00 (full info) | 0.588 | 0.89 | 11.3 |
| 0.05 (mild noise) | 0.210 | 0.94 | 6.3 |
| 0.25 | 0.049 | 0.84 | 16.5 |
| 1.00 (heavy noise) | 0.016 | 0.59 | 41.4 |
| Classical 37% rule (no predictor at all) | 0.372 | 0.80 | 19.8 |
P(best) collapses with noise, but E[value] stays high until σ is
quite large; and the predicted-prior policy still beats the classical
rule on E[value] at σ ≤ 0.25. The lesson is the one
Dütting et al. (2020) and the
learning-augmented stopping literature have been arguing: a partially
broken predictor is still better than no predictor, as long as you
report on the right metric.
(One small empirical curiosity worth flagging: the policy’s E[value]
peaks slightly above the σ = 0 baseline, around σ = 0.05 — because GM
is optimal for P(best), not for expected value, and mild noise
accidentally nudges the policy toward earlier acceptance. It is one of
the cleanest demonstrations available that “optimal under objective A”
does not imply “good under objective B.” Reporting matters.)
8.3 Where this lives in production
The names change and the assumptions differ; strict equivalences are rare, but the underlying problem structure recurs:
- Bayesian hyperparameter search (Hyperband, ASHA, BOHB). Fundamentally sequential halving in a multi-armed bandit; the bracket schedule plays the role of a budgeted explore phase and shares mathematical DNA with prophet-style cutoffs without being identical to them.
- Bayesian optimization (Google Vizier, BoTorch, Ax). The acquisition functions (EI, UCB, Thompson sampling) are not prophet-inequality thresholds; they share a Bayesian decision-theoretic foundation in which a learned posterior over values plays the role that a known prior plays in the full-information stopping problem.
- LLM cascade routing. “Try the small model first; commit if confidence ≥ τ” is the cleanest direct analogue on this list — a full-information stopping policy with the confidence score as the realized value.
- Online ad auctions and posted-price mechanisms. Prophet inequalities are operative here in a precise sense; posted-price mechanisms in algorithmic mechanism design use threshold rules whose competitive ratio is a prophet bound.
- Neural architecture search. The Sheehan and Yakimenko (2025) result — random 37% exploration is empirically near-optimal for NAS — is a point of contact, not a derivation; it suggests rank-only reasoning is the right abstraction when value structure is flat enough.
The pop-science framing — “use the 37% rule for big life decisions” — is the deepest possible reduction of this problem space. Any engineer building any of the systems above would be cashiered for using it.
8.4 An open problem at the foundations
For all the closed-form policies above, the closest natural cousin to
the engineer’s actual objective — minimize expected rank of the chosen
candidate — is the Robbins problem: given n i.i.d. uniform values
arriving in random order, observe each value (not just its rank), and
choose a stopping policy to minimize E[rank of chosen]. As of 2026,
this problem is unsolved. The asymptotic optimum is known to lie
strictly between two bounds, and no closed-form rule is known to achieve
the lower one.
| Policy | Asymptotic E[rank] | Status |
|---|---|---|
| Random pick | n/2 ≈ 50 (for n=100) | trivial |
| Classical 37% rule (rank-only) | grows linearly in (≈ 20.1 at ; see §4) | empirical |
| Optimal rank-only rule (Chow et al. 1964) | bounded; | closed form |
| Bearden √n rule (rank-only, cardinal payoff) | ≈ √n | closed form |
| Robbins optimum (value-aware) | ∈ [≈1.908, ≈2.327] | open since 1990 |
Two things make this remarkable. First, the gap between the best-known-policy upper bound and the best-known lower bound is a literal constant of about 0.4, and has stayed roughly that wide for three decades despite serious attention. Second, the problem statement fits in a single sentence and concerns iid uniform values, which means “we still don’t know the optimal way to do best-of-N sampling under expected-rank loss” is, in 2026, a true statement about the state of mathematics. Anyone building rerank or sampling policies for LLM inference is implicitly making a choice on this frontier (Bruss, 2005 is the survey to start with).
8.5 A real-data sanity check
The experiments above are on clean i.i.d. data. Here is what happens
when the i.i.d. assumption is replaced with something one degree closer
to a real decision. You hold one share of an index ETF and must sell on
exactly one trading day in the next year. Daily prices follow a
geometric Brownian motion with parameters approximating long-run US
equities (μ = 7%/yr, σ = 16%/yr, starting price 100, n = 252
trading days). For the sample-driven policy, you also get five prior
years of price history. Twenty thousand simulated years.
| Policy | Mean sale price | Regret vs peak |
|---|---|---|
| Random day | 103.61 | $13.67 |
| 85th-percentile threshold | 102.67 | $14.61 |
| Classical 37% rule | 104.63 | $12.64 |
| Sample-driven GM (5 yrs of history) | 104.79 | $12.49 |
| Hold to year-end | 107.19 | $10.09 |
| Oracle (sells at peak) | 117.28 | $0 |
A single representative path makes the failure mode visible. Below, one simulated year is plotted with each policy’s stopping point marked. The classical 37% rule pulls the trigger in early summer at 116; the sample-driven GM does slightly better at $113; the no-policy “hold to year-end” lands on the year’s high almost by definition, because drift carries the running maximum upward over the horizon.
The simplest strategy — never stop, just take the last price — beats every stopping rule, including the sample-driven policy that learned the value distribution from five years of historical data. This is §5.3 made empirical. The secretary problem assumes exchangeable arrivals; a market with positive drift has anything but. New running maxes tend to occur early in the year, before drift has had time to push prices up, so a rule that fires on “first new running max after day 93” systematically sells too early. The sample-driven GM doesn’t save you because the GM thresholds — calibrated for the marginal price distribution — say nothing about the temporal evolution of the running maximum.
The lesson is the one the post has been pulling on in different forms the whole way through: every theorem above presupposes a model of how values arrive, and most real decisions violate that model. The 37% rule, the √n rule, Gilbert-Mosteller, sample-driven GM — all of them get beaten by “wait” in any setting where waiting is monotonically informative, which is most settings worth caring about. Production ML systems escape this not by using better stopping rules but by modeling the structure explicitly: drift, autocorrelation, multi-fidelity evaluation, learned predictors over time. The conceptual descendant of the secretary problem in industry is not the 37% rule; it is the acquisition function in Bayesian optimization, which holds the same explore-then-exploit shape but throws away the iid assumption that made the original theorem clean.
9. Takeaways
The 37% rule is correct, but for an objective almost no one actually
has: maximizing the probability of selecting the unique single best
candidate. If you instead care about the expected quality of the
candidate you pick — and in dating, hiring, and apartment-hunting you
almost certainly do — you should explore far less, on the order of √n
rather than n/e. Allow even partial recall and the right behavior moves
toward exploring sixty percent of the pool. Let yourself see actual
values rather than ranks and the achievable success rate jumps from
thirty-seven percent to fifty-eight. Once correlations between
arrival order and quality enter the picture, the classical rule degrades
much faster than its reputation for robustness suggests. And in any
market where the other side is also choosing — dating, hiring, school
admissions — applying the 37% rule unilaterally is worse still, because
two picky agents almost never have their preferences line up at the same
moment. Stopping is a game, not a solo optimization.
The engineering version of this story is starker. Even fifteen historical observations of the value distribution cut regret against the offline oracle by roughly 60% relative to the classical rule — and on a heavy-lower-tail distribution actually beat the full-information policy on E[value] and E[rank], because the moderately-noisy thresholds accept earlier than a policy that patiently waits for near-1.0 values almost never sees. The literature has long since moved past closed-form thresholds toward learning-augmented stopping rules, where the prior on values is itself estimated from data; that is the part of the field worth reading now, and it is the version any engineer building Hyperband, Vizier, an ad auction, or an LLM cascade is already using under a different name.
Open questions
How robust are the assumptions in practice? §5.3 suggests the 37% rule collapses with even mild adverse selection. I would like to see the same simulation done on real labor-market or rental-market panel data, where one can estimate the actual correlation between arrival order and quality and read off what the optimal explore fraction should have been.
What is the right consistency-robustness tradeoff for an ML-predicted prior? Learning-augmented stopping rules are designed to fall back gracefully when a predictor is wrong, but the empirical question of “how wrong is the predictor in deployment, and does that match the theoretical worst case?” is open in most application areas I have looked at.
10. Simulation code
All numbers in the post come from four short numpy-only scripts seeded
with 20260515.
Classical sweep (§3 / Figure 1):
import numpy as np
rng = np.random.default_rng(20260515)
N, TRIALS = 100, 50_000
def p_best(r):
wins = 0
for _ in range(TRIALS):
perm = rng.permutation(N) + 1
best_seen = perm[:r].min() if r else np.inf
chosen = perm[-1]
for x in perm[r:]:
if x < best_seen:
chosen = x; break
wins += (chosen == 1)
return wins / TRIALS
for r in range(0, N + 1):
print(r, p_best(r))
Recall variant (§5.1 / Figure 2):
def p_best_recall(r, q):
wins = 0
for _ in range(TRIALS):
perm = rng.permutation(N) + 1
best_seen = perm[:r].min() if r else np.inf
chosen = None
for x in perm[r:]:
if x < best_seen:
chosen = x; break
if chosen is None:
chosen = best_seen if rng.random() < q else perm[-1]
wins += (chosen == 1)
return wins / TRIALS
Gilbert-Mosteller DP (§6 / Figure 5):
GRID = 2000
xs = np.linspace(0, 1, GRID + 1)
f = [np.zeros(GRID + 1)] # f_0(x) = 0
thresholds = []
for k in range(1, N + 1):
accept = xs ** (k - 1)
cont = f[k - 1]
integrand = np.maximum(accept, cont)
thresholds.append(xs[np.argmax(accept >= cont)])
dx = 1.0 / GRID
tail = np.zeros(GRID + 1)
for i in range(GRID - 1, -1, -1):
tail[i] = tail[i + 1] + 0.5 * (integrand[i] + integrand[i + 1]) * dx
f.append(xs * cont + tail)
Unknown horizon (§5.2). Policy fixes r = 37 (calibrated for n = 100) while the true horizon is drawn from Uniform{50, …, 150}:
def p_best_unknown_horizon():
wins = 0
for _ in range(TRIALS):
n = int(rng.integers(50, 151))
perm = rng.permutation(n) + 1
r = 37
if r >= n:
chosen = perm[-1]
else:
best_seen = perm[:r].min()
chosen = perm[-1]
for x in perm[r:]:
if x < best_seen:
chosen = x; break
wins += (chosen == 1)
return wins / TRIALS
Adverse selection (§5.3):
def p_best_adverse(sigma):
wins = 0
for _ in range(TRIALS):
v = rng.random(N)
order = np.argsort(-v + sigma * rng.standard_normal(N))
a = v[order]
r = round(N / np.e)
best_seen = a[:r].max()
chosen = a[-1]
for x in a[r:]:
if x > best_seen:
chosen = x; break
wins += (chosen == v.max())
return wins / TRIALS
Two-sided 37% rule (§5.4):
def two_sided():
matches = a_top = 0
for _ in range(TRIALS):
pa, pb = rng.permutation(N) + 1, rng.permutation(N) + 1
a_best, b_best = pa[:r].min(), pb[:r].min()
for i in range(r, N):
if pa[i] < a_best and pb[i] < b_best:
matches += 1
a_top += (pa[i] == 1)
break
a_best = min(a_best, pa[i])
b_best = min(b_best, pb[i])
return matches / TRIALS, a_top / TRIALS
Sample-driven GM with empirical CDF (§8.1). Given m historical
samples, build the empirical CDF and run a backward-induction threshold
solve on the remaining n − m candidates:
def gm_thresholds_empirical(samples, k_max, grid=1000):
xs = np.linspace(0, 1, grid + 1)
F = np.searchsorted(np.sort(samples), xs, side="right") / len(samples)
f_prev = np.zeros_like(xs)
thresholds = []
sample_bin = np.clip(np.searchsorted(xs, np.sort(samples)), 0, grid)
for k in range(1, k_max + 1):
accept = F ** (k - 1)
integrand = np.maximum(accept, f_prev)
thresholds.append(xs[np.argmax(accept >= f_prev)])
cell = np.zeros_like(xs)
np.add.at(cell, sample_bin, integrand[sample_bin] / len(samples))
tail = np.r_[np.cumsum(cell[::-1])[::-1][1:], 0.0]
f_prev = F * f_prev + tail
return thresholds
Mean-over-draws driver for §8.1. The §8.1 table averages over independent sample realizations so the policy class — not a single lucky draw — is what gets measured:
def mixture(rng, size): # 0.8 * Beta(2,5) + 0.2 * Beta(5,1)
pick_high = rng.random(size) < 0.2
return np.where(pick_high,
rng.beta(5, 1, size),
rng.beta(2, 5, size))
def run_policy(thresholds, v):
running_max = -np.inf
chosen = v[-1]
for i in range(N):
s = thresholds[N - i - 1]
if v[i] > running_max:
running_max = v[i]
if v[i] >= s:
chosen = v[i]; break
return chosen
def evaluate_sample_driven(m, draws=200, seqs=2_000):
pbs, evs = [], []
for d in range(draws):
rng_s = np.random.default_rng(20260515 + 1000 * d)
samples = mixture(rng_s, m)
thresh = gm_thresholds_empirical(samples, N)
rng_e = np.random.default_rng(20260515 + 1000 * d + 7)
pb = ev = 0
for _ in range(seqs):
v = mixture(rng_e, N)
chosen = run_policy(thresh, v)
ev += chosen
pb += int(chosen == v.max())
pbs.append(pb / seqs); evs.append(ev / seqs)
return np.mean(pbs), np.mean(evs)
A single fixed draw of size m can give P(best) and E[value] anywhere
in roughly for m = 25 (we
checked 20 independent draws); the averaging is what produces the
non-monotone E[value] curve in Figure 6.
Noisy-predictor GM (§8.2). Re-use the GM threshold sequence for uniform values, but feed it a noisy observation in place of the true value:
def predictor_policy(sigma, thresholds_u):
v = rng.random(N)
x = v + sigma * rng.standard_normal(N)
running_max, chosen_i = -np.inf, N - 1
for i in range(N):
s = thresholds_u[N - i - 1]
if x[i] > running_max:
running_max = x[i]
if x[i] >= s:
chosen_i = i
break
return v[chosen_i]
Market-timing backtest (§8.5). Geometric Brownian motion for a year of trading days, then apply each policy:
N = 252
MU, SIGMA = 0.07 / N, 0.16 / np.sqrt(N)
def simulate_year():
logrets = rng.normal(MU - 0.5 * SIGMA**2, SIGMA, N)
return 100 * np.exp(np.cumsum(logrets))
def classical_37(prices):
r = round(N / np.e)
best = prices[:r].max()
for p in prices[r:]:
if p > best:
return p
return prices[-1]
def hold_to_end(prices):
return prices[-1]
oracle = np.array([simulate_year().max() for _ in range(TRIALS)])
got = np.array([classical_37(simulate_year()) for _ in range(TRIALS)])
print("regret vs oracle =", (oracle - got).mean())
References
Cited in text.
- Bearden, J. N. (2006). A new secretary problem with rank-based selection and cardinal payoffs. Journal of Mathematical Psychology, 50(1), 58–59. — The √n rule and the cardinal-payoff result of §4.
- Becker, S., Cheridito, P., & Jentzen, A. (2019). Deep optimal stopping. Journal of Machine Learning Research, 20(74), 1–25.
- Brams, S. J., & Kilgour, D. M. (2001). Revisiting the Secretary Problem. — The two-sided variant cited in §5.4.
- Bruss, F. T. (2005). What is known about Robbins’ problem? Journal of Applied Probability, 42(1), 108–120. — Survey of the open expected-rank problem cited in §8.4.
- Chow, Y. S., Moriguti, S., Robbins, H., & Samuels, S. M. (1964). Optimal selection based on relative rank (the “secretary problem”). Israel Journal of Mathematics, 2(2), 81–90. — The optimal rank-only policy for expected rank; asymptote , cited in §8.4.
- Christian, B., & Griffiths, T. (2016). Algorithms to Live By: The Computer Science of Human Decisions. Henry Holt.
- Dütting, P., Lattanzi, S., Paes Leme, R., & Vassilvitskii, S. (2020). Secretaries with Advice. arXiv:2011.06726. — The LP unification cited in §7.
- Dynkin, E. B. (1963). The optimum choice of the instant for stopping a Markov process. Soviet Mathematics Doklady, 4, 627–629.
- Ferguson, T. S. (1989). Who solved the secretary problem? Statistical Science, 4(3), 282–289. — The historical record cited in §2.
- Gardner, M. (1960). Mathematical Games. Scientific American, February.
- Gilbert, J. P., & Mosteller, F. (1966). Recognizing the maximum of a sequence. Journal of the American Statistical Association, 61(313), 35–73.
- Hill, T. P., & Kertz, R. P. (1992). A Survey of Prophet Inequalities in Optimal Stopping Theory. Contemporary Mathematics, 125, 191–207.
- Krengel, U., & Sucheston, L. (1977). Semiamarts and finite values. Bulletin of the AMS, 83, 745–747.
- Lindley, D. V. (1961). Dynamic programming and decision theory. Journal of the Royal Statistical Society. Series C, 10(1), 39–51.
- Moser, L. (1956). On a problem of Cayley. Scripta Mathematica, 22, 289–292.
- Mucci, A. G. (1973). Differential equations and optimal choice problems. Annals of Statistics, 1(1), 104–113. — Early treatment of cardinal-payoff variants relevant to the √n result of §4.
- Presman, E. L., & Sonin, I. M. (1972). The best choice problem for a random number of objects. Theory of Probability and Its Applications, 17(4), 657–668.
- Sakaguchi, M. (1961). Dynamic programming of some sequential sampling design. Journal of Mathematical Analysis and Applications, 2, 446–466.
- Samuel-Cahn, E. (1984). Comparison of threshold stop rules and maximum for independent nonnegative random variables. Annals of Probability, 12(4), 1213–1216.
- Optimal Stopping with a Predicted Prior. arXiv:2511.03289 (2025).
- Sheehan, M., & Yakimenko, O. (2025). Neural architecture search applying optimal stopping theory. Frontiers in Artificial Intelligence, 8.
Further reading.
- Correa, J., Dütting, P., Fischer, F., & Schewior, K. (2019). Prophet Inequalities for IID Random Variables from an Unknown Distribution. arXiv:1904.02050. — A constant number of samples per distribution nearly matches the full-information bound.
- Freeman, P. R. (1983). The Secretary Problem and Its Extensions: A Review. International Statistical Review, 51(2), 189–206. — Older but exhaustive: 60+ variants catalogued.
Production ML systems cited in §8.3. Each of these is the same stopping problem under a different name; reading them in sequence is the fastest way to see the unifying structure.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2018). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR, 18(185), 1–52. — Bracketed early stopping as the explicit explore-then-commit primitive.
- Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Ben-tzur, J., Hardt, M., Recht, B., & Talwalkar, A. (2020). A System for Massively Parallel Hyperparameter Tuning (ASHA). MLSys. — Async successive halving; what Hyperband looks like in production.
- Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., & Sculley, D. (2017). Google Vizier: A Service for Black-Box Optimization. KDD. — Bayesian optimization as a Google-scale service. The acquisition function is a prophet-inequality threshold with a learned prior.
- Frazier, P. I. (2018). A Tutorial on Bayesian Optimization. arXiv:1807.02811. — The cleanest single reference for the prior-over-values stopping view.