Observations
Notable patterns found in the data — newest at top.
Spread × Entry Timing — NEXTDAY vs TODAYNO
Observation Date: May 7, 2026 | Based on: 583 settled forecast-no HIGH-market signals split by entry_timing, joined to scan ensemble_spread at signal time. LOW market excluded (pre-May 6 ensemble data was bug-contaminated — see May 6 fix note in CLAUDE.md).
The Question
Two days ago we established that ensemble spread (6-model disagreement) is the strongest single predictor we have for forecast-NO performance. The strategy fires entries at two timings: NEXTDAY (8pm ET the evening before, before Kalshi’s first wave of overnight repricing) and TODAYNO (9:45am ET the morning of, after NWS has digested overnight global-model runs). Does the spread signal carry equal weight across both timings, or does one capture more of the edge?
The Cross-Tab
| Spread | NEXTDAY miss% (eve before) | NEXTDAY P&L | TODAY miss% (morning of) | TODAY P&L |
|---|---|---|---|---|
| < 3°F (calm) | 67.7% (n=31) | −$24 | 75.0% (n=32) | +$12 |
| 3-5°F (low) | 60.8% (n=102) | −$32 | 55.0% (n=111) | −$9 |
| 5-8°F (moderate) | 71.0% (n=100) | +$6 | 71.8% (n=110) | +$61 |
| 8-12°F (high) | 80.6% (n=36) | +$21 | 86.8% (n=38) | +$54 |
| ≥ 12°F (extreme) | 71.4% (n=7) | +$4 | 83.3% (n=6) | +$6 |
Three Findings
- TODAY has a higher miss-rate ceiling at high spread. 86.8% at 8-12°F vs NEXTDAY’s 80.6% — a 6.2pp gap. Sample sizes are comparable (n=38 vs n=36), but at this size the gap is suggestive rather than statistically conclusive; it does point the same way as the mechanism below.
- Dollar-edge is concentrated in TODAY at 5-12°F spread. Combined +$115 across n=148 settled signals. NEXTDAY at the same spread bands only managed +$27 across n=136. Miss rates are similar, but TODAY captures more dollar edge per signal because Kalshi prices have settled into a tighter range overnight, and high-spread TODAYNO entries are buying real residual uncertainty in a less-liquid market.
- NEXTDAY at low spread is actively bleeding. <5°F NEXTDAY: combined −$56 across n=133. The strategy is firing on calm-weather days where there’s no real disagreement to exploit, and Kalshi has correctly priced the bracket. These signals fire at the classifier-driven base rate but lose because the day-of forecast holds steady — no surprise to bust the bracket. TODAY at the same spread band is closer to break-even (combined −$1, n=143).
Why TODAY Wins on High-Spread Days
The mechanism that explains the pattern: by the time TODAYNO scans run (9:45am ET morning of), three things have happened that don’t apply to NEXTDAY:
- NWS has refreshed. The forecast discussion has been rewritten with overnight global-model output. If models still disagree by morning, the disagreement is real signal, not noise.
- Kalshi liquidity has dried up overnight. Most participant volume on day-ahead markets fires the prior afternoon. By morning of, the orderbook is thinner, spreads between YES and NO are tighter, and our NO entry can be cheaper relative to the actual probability.
- The remaining bust window is shorter and observable. A day-ahead forecast might bust due to model error or a system that hasn’t shown up yet. A morning-of forecast that’s still off has the same model error but the system is now visible — tighter, more confident bet.
NEXTDAY signals essentially fire too early to capture this. They commit to a probability before the day’s atmospheric setup has resolved, and Kalshi’s already adjusted prices. The historical advantage is small; the dollar capture is small.
Practical Implications
- NEXTDAY at <5°F spread is a candidate for live suppression. Combined −$56 across n=133 with no obvious upside — the spread filter would prune this cohort cleanly. Simplest implementation: gate NEXTDAY firings on `ensemble_spread >= 5` (see the sketch after this list); no model logic change required.
- TODAY at 5-12°F spread is the strongest cohort across both timings. If signal allocation is tight, these get priority. The current scanner doesn’t differentiate — both timings get equal budget — but the data suggests TODAYNO at moderate-to-high spread should be sized up relative to NEXTDAY.
- TODAY at 3-5°F low spread is the worst TODAYNO cohort and worth investigating. 55% miss rate is below the 70% threshold the strategy uses to even fire — signals shouldn’t be qualifying here, but they are. Either the parity classifier is over-promoting cities with weak base rates at low spread, or the bucketing is misaligned with the strategy’s qualification logic. Worth a query against /weather/sql Q3 to dig into.
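A minimal sketch of the suppression gate from the first bullet, assuming a scanner `signal` object with `entryTiming` and `ensembleSpread` fields (names hypothetical):

```js
// Hypothetical field names — adapt to the scanner's actual signal shape.
const MIN_NEXTDAY_SPREAD = 5; // °F, from the cross-tab above

function passesSpreadGate(signal) {
  if (signal.entryTiming !== 'nextday') return true; // TODAYNO unaffected
  return signal.ensembleSpread >= MIN_NEXTDAY_SPREAD;
}
```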
Caveats
- HIGH-market only. LOW-market analysis can’t be done yet because the pre-May 6 LOW dayConviction.ensembleSpread values were HIGH-derived (bug fixed May 6). After ~30 days of clean LOW data accumulates, this same cross-tab can be re-run for LOW.
- Sample sizes thin in extreme bucket (n=6-7 each). The ≥ 12°F row is suggestive but underpowered. Don’t commit allocation decisions to that band alone.
- The TODAY <3°F calm cohort missing more often than NEXTDAY (75% vs 67.7%) is an inversion worth watching. Tiny effect on similar sample sizes. Could be noise, or calm-day TODAYNO signals could be firing on the cities Kalshi did misprice, with wins large enough to keep the cohort net positive (+$12). The dollar damage at calm is small for both timings; the miss-rate inversion is a minor curiosity.
Methodology note: each settlement is joined to the scan that would have driven its entry — NEXTDAY settlements join to the latest is_next_day = TRUE scan for that (city, date, market_type), TODAYNO settlements join to the morning-of is_next_day = FALSE AND entry_timing = 'today' scan. The ensemble_spread captured at scan time is what gates the bucket, not retrospectively. Cross-tab can be reproduced via /weather/sql Q3 with the “Market = high” filter applied; splitting further by entry_timing requires editing the WHERE clause manually.
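For reference, one way the settlement-to-scan join could be expressed (a sketch only; table and column names are assumptions, and the canonical query is Q3 on /weather/sql):

```js
// Hypothetical Postgres schema. Each settlement joins to the latest scan that
// would have driven its entry, then buckets by the spread seen at scan time.
const crossTabSql = `
  SELECT
    CASE
      WHEN sc.ensemble_spread <  3 THEN '<3'
      WHEN sc.ensemble_spread <  5 THEN '3-5'
      WHEN sc.ensemble_spread <  8 THEN '5-8'
      WHEN sc.ensemble_spread < 12 THEN '8-12'
      ELSE '>=12'
    END AS spread_bucket,
    sc.entry_timing,
    ROUND(AVG(st.bracket_missed::int) * 100, 1) AS miss_pct,
    SUM(st.pnl) AS pnl,
    COUNT(*)    AS n
  FROM settlements st
  JOIN LATERAL (
    SELECT *
    FROM scans s
    WHERE s.city = st.city
      AND s.date = st.date
      AND s.market_type = st.market_type
      AND s.entry_timing = st.entry_timing -- nextday vs today, as described above
    ORDER BY s.scanned_at DESC
    LIMIT 1
  ) sc ON TRUE
  WHERE st.market_type = 'high'
  GROUP BY 1, 2
  ORDER BY 1, 2;
`;
```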
Ensemble Spread vs City Classification — When Disagreement Overrides Cohort
Observation Date: May 5, 2026 | Based on: 547 settled forecast-no nextday signals (~45-day window in PG, with scan-level ensemble spread joined)
The Question
Earlier today the SQL workbench surfaced an aggregate finding: 6-model ensemble spread ≥ 8°F correlates with an 80%+ bracket-miss rate, vs 65-69% baseline. Useful as a slate-wide signal — but it raised an obvious follow-up: does that effect apply equally to cities the classifier already flags as high-miss-rate, or is it concentrated in the cities we don’t normally trade?
If forecast disagreement is just a city-independent variability signal, both cohorts should elevate together. If it’s a redundant restatement of what the classifier already knows, only the low-miss cohort should react. Splitting the spread bucket by is_high_mae (the classifier flag at scan time) tests this directly.
The Cross-Tab
| Spread bucket | high-miss cohort | low-miss cohort |
|---|---|---|
| < 3°F (calm) | 67.6% (n=34) | 70.4% (n=27) |
| 3-5°F (low) | 64.5% (n=93) | 66.7% (n=102) |
| 5-8°F (moderate) | 67.9% (n=112) | 75.0% (n=84) |
| 8-12°F (high) | 81.1% (n=37) | 79.4% (n=34) |
| ≥ 12°F (extreme) | 72.7% (n=11) | 100% (n=3 — ignore) |
Three Findings
- At high spread (8-12°F), the two cohorts converge to ~80% miss. Forecast disagreement overrides the city classification — when models genuinely disagree, a city that normally forecasts well misses the bracket nearly as often as a chronically high-miss city. Spread is doing more work than the classifier in this regime.
- Spread is decision-relevant for the low-miss cohort and largely redundant for the high-miss cohort. Going from calm to high spread adds +9pp for low-miss cities (70 → 79) and +14pp for high-miss cities (68 → 81), but the high-miss cohort was already above the 70% bar at calm. For low-miss cities the lift crosses the trading threshold; for high-miss cities the spread signal mostly restates the variability the classifier already encodes.
- At calm/low spread, low-miss cities miss slightly more than high-miss cities. Tiny inversion (1-3pp), but it’s there. Suggests the high-miss classification is partly tracking baseline volatility that’s already roughly priced in by Kalshi — and low-miss cities only become interesting to trade when there’s something specific about the day (high spread, AFD trough, etc.) elevating their bust risk.
The Strategy Gap
The current Forecast NO strategy completely ignores low-miss cities — they don’t clear the 70% blended-miss-rate threshold, so the scanner returns HOLD. But on high-spread days (≥8°F), low-miss cities have a 79% historical miss rate — well above the 70% bar the strategy uses for the high-miss cohort. That’s an untapped cohort where the signal isn’t “this city always busts” but rather “this day will bust.”
Conceptual extension: a “Disagreement NO” strategy that fires BUY_NO on low-miss cities specifically when ensemble spread ≥ 8°F. The classifier provides the long-run base rate; spread provides the day-specific overlay. Both layers required to fire. Low-miss cities at low spread = HOLD (already well-handled); low-miss cities at high spread = candidate for entry.
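A sketch of that two-layer gate (field names are assumptions):

```js
const SPREAD_FLOOR = 8; // °F — the day-specific overlay from the cross-tab

// Hypothetical shapes: city.isHighMae is the classifier flag at scan time,
// scan.ensembleSpread is the 6-model spread captured by the scan.
function disagreementNoSignal(city, scan) {
  if (city.isHighMae) return null;   // high-miss cohort: existing forecast-no owns it
  if (scan.ensembleSpread < SPREAD_FLOOR) return null; // calm day: HOLD
  return { action: 'BUY_NO', strategy: 'disagreement-no' };
}
```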
Caveats
- Sample size in the high-spread / low-miss cell is n=34. Directionally robust but not bulletproof — one bad week of weather variance could move the rate by 5-10pp.
- Correlation, not necessarily mispricing. Kalshi participants see weather alerts and forecast spreads too. The 79% miss rate doesn’t guarantee 79% NO entries clear the safe-entry cap; some of the value may already be in the price. The Kalshi NO ask on low-miss cities at high spread needs to be checked empirically before this becomes tradable.
- The 8°F threshold is data-fitted to current windows. Other thresholds (6°F, 10°F) might segment differently. The query lets you re-bucket.
- The high-spread cohort itself may cluster on shared weather events. If 4 of the 34 high-spread misses came from the same atmospheric river, that’s 1 event masquerading as 4 data points. Worth checking by date-clustering before scaling.
Next Steps
Two reasonable paths:
- Dry-run a Disagreement NO sleeve — add a parallel signal type that fires on low-miss cities at high spread, persisting alongside existing forecast-no but with its own strategy tag (`'disagreement-no'`). Settle and aggregate independently. ~2-3 weeks of accumulation before reading the live performance.
- Validate with backtest first — the extended scan window (now 60 days back to Mar 22) plus historical_conviction (back to Jan 18) gives enough data to simulate “what would have fired” under this rule for a couple of months. Cheaper, faster, and less risky than committing to live tracking before the math holds up.
Backtest first feels like the right order. If the 60-day simulation shows positive WR/ROI at realistic Kalshi prices, then a dry-run sleeve. If the backtest is a wash, the “79%” is statistical noise and we move on.
Methodology note: this analysis joins each settlement to the latest is_next_day = TRUE scan for that (city, date, market_type), which captures the ensemble_spread that was visible at signal-generation time, not retrospectively. The is_high_mae flag reflects what the classifier said when the signal would have fired — the right cohort definition because it matches what the live strategy uses to gate entry. Cross-tab query lives as Q3 on /weather/sql with an additional is_high_mae split available by adding it to the SELECT + GROUP BY.
Pricing vs Wx Conviction — Two Independent Layers
Observation Date: May 5, 2026 | Triggered by: MIA HIGH May 6 scan (40.6pp edge from pricing layer, “SKIP” from Wx column)
The Trigger
Tonight’s next-day scan for May 6 surfaced a strong signal on MIA HIGH: 89°F forecast, bracket 88-89, NO ask 49¢, 40.6pp edge with $94 of buyer depth within 3¢ of the best ask. This was the strongest single signal on the whole slate — the pricing layer was screaming take it.
At the same time, the Wx column on the dashboard read SKIP. Two analysis layers disagreeing on the same row, with one of them using a label that visually reads as “skip the trade.” That’s a real cognitive trap, especially for the cohort of cities (LAX, SFO, SEA, MIA) where the two layers always disagree by design.
The Two Layers Measure Different Things
The site has two structurally separate signal layers stacked on top of each other:
| Layer | What it measures | How it’s computed |
|---|---|---|
| Pricing (Edge / Parity / Blended) | Historical base rate — how often does this bracket miss for this city? | 14d/30d/60d blended miss rate from settled scans, optionally split by edge-position (top vs bottom of bracket) for cities with sufficient samples |
| Wx Conviction (HIGH / MED / BASE / LOW / N/A) | Today’s specific weather pattern — does this particular day look like a bust day on top of the base rate? | NWS Area Forecast Discussion keyword scoring + 6-model ensemble spread + ensemble-vs-NWS diff + active alerts |
These are orthogonal inputs. Pricing fires when the base rate exceeds 70% and the ask is below the safe-entry cap. Wx is a confidence overlay on top of pricing — it doesn’t override the pricing decision, it just adds context about today specifically.
Why the Inverted Cities Are Special
The Wx layer’s AFD heuristic was calibrated against inland-city patterns — keywords like trough, front, thunderstorm, uncertainty reliably correlate with bracket-miss days in cities like ATL, CHI, AUS, HOU, BOS, PHL, DC, PHX. In those cities, the AFD is a real signal: when the forecaster’s own discussion text contains bust language, miss rate elevates by 8-12 percentage points above baseline.
Marine-layer cities behave differently. LAX, SFO, SEA, MIA get their bust days from sea-breeze patterns, onshore flow timing, marine cloud burnoff, and coastal frontal interactions — not from the keywords the AFD heuristic looks for. The 1,530-day backfill analysis found these four cities had negative AFD–miss-rate correlation: stable AFD days actually missed more than volatile ones, opposite to inland cities. So the Wx layer flags them as inverted-tier and refuses to make a confidence call — not because the trade is bad, but because the Wx signal has nothing useful to say.
How to Read the Two Layers Together
| Pricing | Wx | Read |
|---|---|---|
| FIRE | HIGH/MED | Take it. Both layers agree. |
| FIRE | BASE | Take it. No conviction overlay either way; pricing carries the day. |
| FIRE | LOW | Trim or skip. Stable pattern + responsive city = Wx layer is leaning against pricing’s historical bias. |
| FIRE | N/A | Take it. Inverted city. Wx layer doesn’t apply. Pricing is the only signal — trust it. |
| HOLD | HIGH | Pass — pricing already rejected. HIGH on Wx alone is not a buy trigger. |
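The table reduces to a small decision rule. A sketch (labels mirror the dashboard; the function is illustrative, not shipped code):

```js
// pricing ∈ {FIRE, HOLD}; wx ∈ {HIGH, MED, BASE, LOW, N/A}
function readSignal(pricing, wx) {
  if (pricing === 'HOLD') return 'PASS';   // Wx alone is never a buy trigger
  if (wx === 'LOW') return 'TRIM_OR_SKIP'; // Wx leans against pricing's bias
  return 'TAKE'; // HIGH / MED / BASE / N/A: pricing carries the day
}
```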
UX Fix Shipped
The SKIP label on the Wx column for inverted cities was renamed to N/A. Same logic, clearer reading. The previous label conflated two unrelated meanings: “skip the Wx column” (intent) vs “skip the trade” (visual reading). With pricing now firing strong edge-based signals on MIA, the conflict was no longer hypothetical — the dashboard was actively producing contradictory advice on the same row.
Tooltip on the cell now states explicitly: “Inverted city — Wx column does not apply (AFD unreliable for marine-layer cities). Use pricing-layer signals as-is. This is NOT a skip-the-trade flag.”
What This Doesn’t Solve
The four inverted cities still have no day-specific conviction overlay — only their long-run base rate. That’s a real gap. A complete fix would be a marine-layer-specific conviction layer trained on those four cities’ history, looking at sea-breeze indicators, onshore-flow timing, and marine-stratus burnoff probabilities instead of the inland AFD vocabulary. Different project. Not high-leverage today — the pricing-layer edge on inverted cities (MIA top-edge cohort: 89.6% miss, n=25) is already strong enough to act on without conviction overlay.
For now: when a pricing signal fires on LAX, SFO, SEA, or MIA, and the Wx column reads N/A, that’s the system working as designed. The signal stands alone, and that’s sufficient.
Architecture note: the May 5 edge-pricing pilot (CHI + MIA flipped to edge-position-based pricing) made this conflict visible because MIA HIGH started firing larger, more confident signals from the pricing layer. The Wx layer was already saying “ignore me for MIA”; that just wasn’t a problem when MIA HIGH rarely fired strong signals under parity pricing. Pilot rollouts surface UX issues that didn’t matter before.
First 3 Days Live — Diagnosing the Underperformance
Observation Date: April 23, 2026 | Based on: 33 settled live signals (Apr 21–22) vs 270-signal backtest baseline
The Headline Numbers Are Ugly
Three days into live deployment of the adjacent-bracket YES strategy, live performance is running far below backtest:
| | n | WR | P&L | ROI |
|---|---|---|---|---|
| Backtest | 270 | 49% | +$44.02 | +50% |
| Live (Apr 21–22) | 33 | 21% | −$3.54 | −34% |
| Delta | — | −28pp | — | −84pp |
Apr 21 was 5W/10L (33% WR). Apr 22 was 2W/16L (11% WR). That prompted this diagnostic. The question isn’t “is the strategy broken” — 33 bets is not enough data to answer that. The question is: does the mechanism still look right, or did something change?
The Forecast Error Distribution Shifted
The adjacent-bracket YES strategy structurally needs the forecast to miss its own bracket by ≥ 1.5–2°F. Comparing the forecast error distribution between backtest and live windows:
| Metric | Backtest (246 days) | Live (28 days) | Δ |
|---|---|---|---|
| Mean \|error\| | 1.80°F | 1.68°F | −0.12°F |
| Median \|error\| | 2.0°F | 1.0°F | −1.0°F |
| Within 1°F (forecast bullseye) | 49.6% | 60.7% | +11pp |
| Error ≥ 2°F (strategy-winnable) | 50.4% | 39.3% | −11pp |
Mean error looks similar, but that hides the real story. The median dropped from 2°F to 1°F. The live window is dominated by the “forecast nailed it” bin — 60.7% of day-city pairs had the actual within 1°F of forecast, vs 49.6% in backtest. That’s exactly the condition where the adjacent-bracket strategy cannot win: the truth sits inside the forecast bracket, so both ±1 bets lose by design.
Bucketing the 33 Live Bets by What Actually Happened
For each settled bet, classify by (a) how big the forecast miss was, and (b) whether the miss direction matched the bet’s offset:
| Condition | Count | % | Record | WR |
|---|---|---|---|---|
| Error < 1.5°F (quiet weather, both adj bets lose) | 14 | 42% | — | structural loss |
| Error ≥ 1.5°F, wrong direction (bet +1 but miss was cold) | 11 | 33% | 0W/11L | 0% |
| Error ≥ 1.5°F, right direction (bet direction matched miss) | 8 | 24% | 2W/6L | 25% |
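A minimal restatement of that classification (field names hypothetical; the 1.5°F threshold is from above):

```js
function classifyBet(bet) {
  const err = bet.actualF - bet.forecastF;      // signed forecast miss
  if (Math.abs(err) < 1.5) return 'unwinnable'; // truth stayed near the forecast bracket
  const sameDirection = Math.sign(err) === Math.sign(bet.offset); // offset is ±1
  return sameDirection ? 'right-direction' : 'wrong-direction';
}
```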
Two findings jump out:
- 42% of live bets faced unwinnable conditions, vs roughly 25% expected from backtest distribution. Nearly half the sample was set up to lose regardless of model quality.
- When the miss was big, direction was essentially random (right-direction 8 / wrong-direction 11). This is the more uncomfortable finding. On winnable days, the model’s offset pick should beat 50%. A 42% right-direction rate on the big-miss subset is consistent with zero directional edge, though sample is far too small to conclude.
Per-Day Read
- Apr 21: 15 bets. Mean |error| 1.23°F, 77% of actuals within 1°F. This was a quiet weather day across the board. 11 of 15 bets had errors below the winnable threshold. 33% WR is actually decent given the conditions — the wins on offset +1 (PHL, SEA, MIA nextday, LAS) are exactly where the limited forecast-miss cases leaned warm.
- Apr 22: 18 bets. Mean |error| 2.07°F, only 47% within 1°F — a genuinely more volatile day. But among the bets whose errors exceeded 1.5°F, the offset call was roughly a coin flip (5 right-direction, 4 wrong-direction). 11% WR is what you get on a day with real weather movement and no directional edge on the offset call.
Is the Strategy Broken?
Probably not, but it’s also not exonerated. Two things are true simultaneously:
- Exonerating: The weather was quieter than the backtest period. 11pp fewer winnable days is enough to drag WR from 49% to something like 39%, which combined with small-sample variance could land at 21% on 33 bets without any model defect.
- Not exonerating: On the days that were winnable, the model’s offset selection went 2W/6L (25% WR). If this persists over a larger sample, the directional edge implied by the backtest is not there and the strategy economics collapse regardless of weather regime.
Both effects need more data to disentangle. 33 bets is one noisy week of weather; the backtest was 246 days of varied conditions.
What I’m Watching
- “Within 1°F” rate: should drift back toward the 50% baseline as spring turns to summer and storm systems become more active. If it stays above 58% through May, the strategy economics need to be revisited regardless of backtest WR.
- Right-direction WR on big-miss days: the clean test of model edge. Need ~30–40 big-miss bets to stabilize. That’s probably 4–6 weeks of data at current signal volume.
- MIA regression: was the #1 backtest city at 63% WR, live at 25% on 4 bets. Watch whether this snaps back or persists — if it stays broken the backtest was cherry-picked by luck.
No Action Yet
33 bets is not a sample. I will not retune city groups, offset rules, or conviction thresholds on this data. The strategy stays in dry-run through at least Sunday. If by end of next weekend the WR is still sub-30% on 80+ bets, that’s the re-assessment point.
Mechanism note: this observation quantifies a known failure mode. The adjacent-bracket YES strategy loses money on quiet-weather days and loses more money on volatile days with random-direction misses. Profitability depends on (1) weather being variable enough to produce 1.5°F+ forecast misses, and (2) the model correctly predicting which direction those misses lean per city. Apr 21–22 underdelivered on both. Unclear yet whether (1) is a seasonal dip or (2) is a structural gap.
Forecast-Bracket YES — The Mirror Edge
Observation Date: April 22, 2026 | Based on: same 392 joined city-days from the Apr 21 research, re-sliced at offset 0 per (city, timepoint)
Setup
The Apr 21 observation established that buying YES on the adjacent (±1) bracket beats buying YES on the forecast bracket in aggregate — the forecast bracket has slight negative EV across the full sample. That conclusion is correct in aggregate but masks a cleaner per-city split that flipped immediately when we asked: are there any cities where the forecast-bracket YES actually works?
Answer: yes. Seventeen (city, timepoint) combinations show positive-EV forecast-bracket YES with n ≥ 5 tradeable opportunities; the top 10 are tabled below, and five of those clear +60% ROI. The strongest is DEN nextday at +106% ROI, the best forecast-bracket ROI in the weather dataset.
Top Positive-EV Forecast-Bracket Strategies
| City | Timepoint | n | Hit% | Avg cost | EV | ROI |
|---|---|---|---|---|---|---|
| DEN | nextday | 10 | 60% | $0.29 | +$0.309 | +106% |
| DEN | morning | 10 | 60% | $0.34 | +$0.261 | +77% |
| CHI | morning | 16 | 50% | $0.28 | +$0.221 | +79% |
| ATL | afternoon | 6 | 100% | $0.78 | +$0.220 | +28% |
| SFO | nextday | 11 | 55% | $0.33 | +$0.212 | +63% |
| CHI | nextday | 7 | 43% | $0.25 | +$0.181 | +73% |
| ATL | nextday | 10 | 60% | $0.44 | +$0.165 | +38% |
| LAX | nextday | 6 | 50% | $0.34 | +$0.163 | +49% |
| PHX | nextday | 10 | 60% | $0.45 | +$0.151 | +34% |
| AUS | afternoon | 8 | 75% | $0.62 | +$0.134 | +22% |
Plus seven weaker-but-positive entries (ATL morning, MIA ND/AM, PHL ND, HOU AM, NYC AM, PHX AM) in the +2% to +16% ROI range. Complete list and methodology in scripts/forecast-evolution.js.
The Two Cohorts — Forecast-Accurate vs Forecast-Inaccurate
The cities that make forecast-bracket YES profitable are exactly the cities where the NWS forecast lands inside its 2°F bracket more often than the market prices in. Buy a 60%-probability outcome at $0.30 = +100% ROI. There’s no mystery once you sort the table.
- Forecast-accurate (buy offset 0): DEN, CHI, ATL, SFO, LAX, PHX. NWS hits the forecast bracket at these stations more often than traders price in. Good grid-to-station alignment, few microclimate surprises, consistent bias that the market hasn’t learned.
- Forecast-inaccurate (buy offset ±1): AUS, DAL, OKC, SEA, LAS, HOU, NYC, BOS (for various timepoints). The NWS forecast misses the bracket often enough that YES on the forecast itself is a losing trade, but YES one bracket to either side is the value play. This is exactly the cohort our existing Apr 21 config targets.
These two strategies are mirror images, not competitors. For a given (city, timepoint):
| Forecast behavior | Winning trade | Why |
|---|---|---|
| Forecast lands in its own bracket reliably | Buy YES on offset 0 | Market underprices the forecast-bracket hit rate |
| Forecast misses its own bracket reliably | Buy YES on offset ±1 | Market anchors on forecast; adjacent brackets are where the real winner lives |
Why Our “Loser” Cities Aren’t Actually Losers
The Apr 21 blog post labeled DEN, ATL, and LAX as loser cities, with LAX tagged as “doesn’t play.” That labeling was correct for the adjacent-bracket strategy but misleading as a blanket statement. These cities are actually among the best targets for forecast-bracket YES. We called them losers because we were only looking through the ±1 lens.
In particular, DEN is the strongest single city in the weather dataset when you include offset 0: +106% ROI on nextday, +77% on morning. These were zero-positive-EV cells under the Apr 21 analysis because the adjacent brackets lose at DEN — which is exactly the observable consequence of DEN being forecast-accurate.
Updated City Cohort Assignments (proposed)
If we want to deploy forecast-bracket YES as a complementary strategy, the clean assignment per (city, timepoint) looks like this. No cell should have both offset 0 and offset ±1 firing at once — that would be internally contradictory.
| City | Nextday | Morning | Afternoon |
|---|---|---|---|
| DEN | offset 0 (+106%) | offset 0 (+77%) | — |
| CHI | offset 0 (+73%) | offset 0 (+79%) | — |
| ATL | offset 0 (+38%) | offset 0 (+16%) | offset 0 (+28%) |
| SFO | offset 0 (+63%) | offset ±1 (+8%) | — |
| LAX | offset 0 (+49%) | — | — |
| PHX | offset 0 (+34%) | offset 0 marginal (+2%) | — |
| AUS | offset +1 (+90%) | offset +1 (+46%) | offset 0 (+22%) |
| MIA | offset +1 (+53%) | offset +1 (+18%) | — |
| PHL | offset +1 (+66%) | offset +1 (+63%) | — |
| DAL | offset −1 (+130%) | offset −1 (+78%) | — |
| OKC | offset −1 (+114%) | offset −1 (+130%) | — |
| HOU | offset −1 (+37%) | offset −1 (+14%) | — |
| SEA | offset +1 (+57%) | offset +1 (+35%) | offset +1 (+46%) |
| LAS | offset −1 (+23%) | — | — |
| BOS | offset +1 (+38%) | offset +1 (+137%) | — |
| NYC | offset +1 (+57%) | — | — |
| DC | — | offset −1 (+209%) | — |
AUS is the only city with a split pattern — forecast-inaccurate early (ND/AM use offset +1) but forecast-accurate late (afternoon uses offset 0). Most cities have one consistent direction.
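If this ever ships, the table above collapses into a small config. A hypothetical shape (not the current `config/yes-strategy.js` format):

```js
// Offset per (city, timepoint); null = no positive-EV play for that cell.
// ROI figures justifying each entry are in the table above.
const offsetByCityTimepoint = {
  DEN: { nextday: 0,  morning: 0,  afternoon: null },
  CHI: { nextday: 0,  morning: 0,  afternoon: null },
  AUS: { nextday: +1, morning: +1, afternoon: 0 },  // the one split-pattern city
  DAL: { nextday: -1, morning: -1, afternoon: null },
  SEA: { nextday: +1, morning: +1, afternoon: +1 },
  // ...remaining cities per the table
};
```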
Caveats
- Small samples. n=6 to 19 per (city, timepoint) cell. The ATL afternoon 100% hit rate on n=6 will almost certainly regress. DEN nextday +106% at n=10 would need ~25 bets (~2–3 months of live data) before the estimate is confidence-interval stable.
- Selection bias risk. These cells were picked after-the-fact from the same 392-row backtest, so multiple-comparisons inflation is real. Across 17 cities × 3 timepoints = 51 cells, on the order of fifteen marginally positive cells could show up by chance alone. But the top 5 are all > +60% ROI, which is not a plausible noise outcome.
- Regime dependence. DEN being forecast-accurate today doesn’t guarantee it will be forecast-accurate next month. The entire premise depends on the NWS-to-market mispricing holding over time. Apr 21 live data (PHL 0-for-3 on an ±1 strategy that backtested at +17% ROI) is a small but real reminder that cells can flip.
- Not deployed. The current live config still fires only ±1 strategies per `config/yes-strategy.js`. No code changes from this observation yet.
Implication for the Apr 21 Dashboard
The live dashboard currently suppresses signals on “loser” cities (DEN, ATL, LAX). Under the mirror-edge frame, those suppressions are actively costing us signals — DEN nextday and ATL nextday would be among the highest-EV fires in the whole system if we surfaced them. If we decide to integrate forecast-bracket YES into live logic, the dashboard will need a way to show which offset is being played per city, and the existing “HIGH / MEDIUM / none” conviction system needs a third bucket for offset-0 signals.
Corollary to the Apr 21 Forecast Evolution observation. Same source data, same caveats about sample size. This finding complements rather than replaces the adjacent-bracket edge — they apply to disjoint city subsets and neither is a free lunch. Re-run scripts/forecast-evolution.js with the offset-0 slice any time to refresh. If/when live Apr 21+ data confirms the ±1 edge is holding, this offset-0 set is the natural next expansion.
Forecast Evolution — The Adjacent-Bracket YES Edge
Observation Date: April 21, 2026 | Based on: 392 joined city-days (Jan 20 – Apr 19), NWS forecast at 3 timepoints × Kalshi expiration_value truth
Setup
We have four temperature readings per city-day that matter for YES/NO pricing:
- Nextday forecast — earliest `NEXTDAY` scan the evening before
- Morning forecast — earliest `HIGH` scan before noon ET (~9:45am run)
- Afternoon forecast — latest `HIGH` scan 2pm–8pm ET (3pm or 5pm run)
- Kalshi `expiration_value` — the authoritative truth that settles the market
We joined all four per (city, date) using scripts/forecast-evolution.js. For each scan, we know (a) which bracket contained the forecast, (b) which bracket actually won on Kalshi, (c) the yesAsk / noAsk for every bracket in the book, and (d) the offset between forecast and winner (in 2°F bracket indices). Then we evaluated the strategy “always buy offset X YES at timepoint T” for every (T, X), limited to brackets that were tradeable (15¢ ≤ yesAsk < 95¢).
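The bookkeeping is simple once brackets are reduced to integer indices. A sketch (assuming the 2°F grid can be indexed this way; `bracketIndex` is hypothetical):

```js
const bracketIndex = (tempF) => Math.floor(tempF / 2); // 2°F grid → integer index

// Offset between the bracket containing the forecast and the bracket that won.
const offsetOf = (forecastF, winnerF) =>
  bracketIndex(winnerF) - bracketIndex(forecastF);

// "Always buy offset X YES at timepoint T": each row carries the offset-X
// bracket's yesAsk and whether that bracket won.
function evPerContract(rows) {
  const tradeable = rows.filter(r => r.yesAsk >= 0.15 && r.yesAsk < 0.95);
  const hitRate = tradeable.filter(r => r.won).length / tradeable.length;
  const avgCost = tradeable.reduce((s, r) => s + r.yesAsk, 0) / tradeable.length;
  return { hitRate, avgCost, ev: hitRate - avgCost, roi: hitRate / avgCost - 1 };
}
```

Sanity check against the ranked table below: afternoon +1 gives 0.519 − 0.394 ≈ +$0.126 per contract, or +32% ROI.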
Forecast Accuracy Improves Through the Day
| Timepoint | n | Miss rate (offset ≠ 0) | P(forecast bracket wins) |
|---|---|---|---|
| nextday (eve before) | 158 | 65.8% | 34.2% |
| morning | 189 | 60.3% | 39.7% |
| afternoon | 169 | 54.4% | 45.6% |
The forecast bracket wins 45.6% of the time when we look at 3pm–5pm ET forecasts, up from 34.2% the evening before. Operationally this confirms what we already believed: fresher forecasts are better, and afternoon scans should drive signal conviction.
The Drift Sign-Flip
When we bucket by how the forecast evolved between timepoints, miss rate is asymmetric in a surprising way:
| Drift | Morning miss vs nextday | Afternoon miss vs morning |
|---|---|---|
| cool 1–3°F | 58.6% (n=29) | 63.2% (n=38) |
| stable \|d\| < 1°F | 61.6% (n=99) | 53.2% (n=79) |
| warm 1–3°F | 69.2% (n=26) | 46.2% (n=39) |
Forecast warming from nextday to morning predicts MORE misses (69.2%). Forecast warming from morning to afternoon predicts FEWER misses (46.2%). Likely mechanism: morning warming revisions are often overcorrection to a single model run; afternoon warming revisions are the forecaster tracking actual observed warming. Early warming = noise; late warming = signal.
The Central Finding: Adjacent Brackets Have Positive YES EV
Ranked by expected value per $1 contract, tradeable only (15¢ ≤ yesAsk < 95¢):
| Timepoint | Offset | n | Hit rate | Avg cost | EV/contract | ROI |
|---|---|---|---|---|---|---|
| afternoon | +1 | 52 | 51.9% | $0.394 | +$0.126 | +32.0% |
| nextday | −1 | 122 | 35.2% | $0.300 | +$0.053 | +17.7% |
| nextday | +1 | 116 | 31.0% | $0.279 | +$0.032 | +11.5% |
| afternoon | −1 | 57 | 54.4% | $0.516 | +$0.028 | +5.4% |
| morning | ±1 | 139/124 | 33.1% | $0.316 | +$0.015 | +4.7% |
| morning | 0 (forecast) | 194 | 37.6% | $0.388 | −$0.012 | −3.1% |
| nextday | 0 (forecast) | 163 | 31.9% | $0.336 | −$0.017 | −5.1% |
| afternoon | 0 (forecast) | 97 | 49.5% | $0.583 | −$0.088 | −15.1% |
The forecast bracket itself has slightly negative YES EV at every timepoint. Every adjacent-bracket strategy has positive EV. The afternoon +1 bracket is the strongest single edge: 51.9% hit rate at an average cost of 39.4¢ = 32% ROI over 52 observations. This is comparable to our live Forecast NO P&L (+34.3% ROI on 24 signals).
Per-City Breakdown (Top Strategies)
Aggregate wins conceal a lot. Per-city EV for strategies with n ≥ 5 tradeable opportunities:
| City | Tier | ND +1 | ND −1 | AM +1 | AM −1 | PM +0 (fcst) |
|---|---|---|---|---|---|---|
| AUS | responsive | n=9 56% +$0.263 | n=7 29% +$0.020 | n=16 44% +$0.137 | n=11 27% −$0.057 | n=8 75% +$0.134 |
| MIA | inverted | n=9 67% +$0.232 | — | n=17 53% +$0.080 | — | — |
| PHL | responsive | n=9 44% +$0.178 | n=8 25% +$0.014 | n=9 56% +$0.217 | n=5 20% $0.000 | — |
| DAL | neutral | n=5 40% +$0.134 | n=9 67% +$0.377 | n=5 20% −$0.052 | n=9 56% +$0.244 | n=7 43% −$0.070 |
| OKC | neutral | n=8 13% −$0.134 | n=9 56% +$0.297 | n=7 14% −$0.196 | n=6 67% +$0.377 | n=7 43% −$0.087 |
| HOU | responsive | — | n=9 56% +$0.152 | — | n=9 44% +$0.054 | — |
| BOS | responsive | n=7 29% +$0.079 | n=6 17% −$0.062 | n=6 50% +$0.288 | n=6 17% −$0.092 | — |
| DC | responsive | n=7 29% −$0.020 | — | n=6 33% −$0.030 | n=5 80% +$0.542 | — |
| SEA | inverted | n=8 50% +$0.182 | n=6 33% +$0.045 | n=9 44% +$0.114 | — | n=8 25% −$0.250 |
| LAS | neutral | — | n=11 55% +$0.102 | — | n=11 45% −$0.054 | n=8 25% −$0.216 |
| NYC | neutral | n=7 43% +$0.157 | n=5 20% −$0.106 | n=13 8% −$0.242 | n=10 10% −$0.198 | — |
| SFO | inverted | n=9 22% −$0.001 | n=11 18% −$0.095 | n=7 29% +$0.080 | n=10 40% +$0.080 | n=9 33% −$0.097 |
| ATL | responsive | n=10 10% −$0.156 | n=8 25% +$0.016 | n=7 14% −$0.119 | n=9 22% −$0.024 | n=6 100% +$0.220 |
| DEN | neutral | n=6 17% −$0.048 | n=10 20% −$0.075 | n=6 17% −$0.092 | n=9 22% −$0.073 | n=7 57% −$0.091 |
| CHI | responsive | n=9 11% −$0.119 | — | n=15 33% +$0.057 | n=8 13% −$0.144 | n=7 57% −$0.131 |
| LAX | inverted | n=5 20% −$0.120 | n=5 20% −$0.126 | n=6 33% +$0.005 | — | — |
| PHX | responsive | — | n=8 38% −$0.021 | n=7 29% −$0.030 | n=6 33% +$0.025 | n=10 50% −$0.120 |
Cells show n / hit% / EV per contract. Dashes mean <5 tradeable opportunities. Old AFD tier labels are included for reference — they are not predictive of which strategies work per city.
Three Groups Emerge
- Consistent winners (multiple positive-EV strategies, decent samples): AUS, MIA, PHL, DAL, OKC, HOU, SEA, LAS. These cities have at least one ≥+$0.15 EV strategy and no strategies with severely negative EV. Most of the aggregate edge comes from here.
- Mixed: BOS, DC, NYC, SFO, PHX, CHI. Some strategies positive, some clearly negative. These cities need per-strategy filtering rather than any fixed offset.
- Consistent losers (negative EV across most strategies): DEN, ATL, LAX. Every strategy I tested is slightly underwater or worse. These cities might be unplayable for YES entirely and should be traded on NO only.
The “AFD tier” label does NOT predict membership in these groups. MIA and SEA (labeled “inverted” in AFD) are in the consistent-winner group; ATL (labeled “responsive”) is in the consistent-loser group. Whatever AFD tier was measuring, it isn’t “will this city yield positive YES EV.”
Why the Market Misprices Adjacent Brackets
Hypothesis: traders anchor to the NWS forecast and preferentially buy YES on the forecast bracket, inflating its price. Adjacent brackets get the residual liquidity — thin books and wider spreads mean they’re systematically underpriced relative to their true 20–35% hit probability. The afternoon +1 bracket is especially extreme because by 3pm ET the actual afternoon temperature is already climbing, and the +1 bracket is often the correct call that the book hasn’t fully repriced yet.
This is the mirror image of the Forecast NO edge: that strategy exploits the market underpricing the “forecast misses” outcome. This new finding exploits the market overpricing the “forecast hits” outcome. They’re consistent — both say the market is too confident in the forecast bracket being the winner.
Caveats & Next Steps
- Per-city sample sizes are small (n=5–17 for most strategies). Point estimates have wide CIs. A +$0.263 EV at n=9 could easily be +$0.05 or −$0.10 on the next 20 observations.
- Selection bias — we only have (city, date) pairs where Kalshi had a live market AND we have Scan coverage AND HC has gold truth. Early backfill days could behave differently from current market conditions.
- Backfill-sourced NWS forecasts (Jan 20 – Mar 21) come from the IEM ZFP archive, not live Scans. ZFP text parsing has a 3% miss rate and loses precision (e.g., “highs in the mid 70s” → 75). This may introduce a small noise floor.
- The losing cells need explaining. Why does NYC fail YES so badly (AM +1: n=13, 8% hit, −$0.242)? Is it a Central Park microclimate issue (the Apr 20 station-vs-grid observation)? Or something else?
Operational Implications
If we want to act on this, the clearest minimum-risk starting point is:
- Target only the consistent-winner cities (AUS, MIA, PHL, DAL, OKC, HOU, SEA, LAS).
- Fire YES signals on offset ±1 at scan time, preferring afternoon scans where we can get them.
- Do NOT fire YES on the forecast bracket itself — it is consistently mispriced against us.
- Keep current Forecast NO logic separate: NO on forecast bracket and YES on ±1 are compatible strategies (they exploit the same market anchor in opposite directions).
- Accumulate 30+ days of dry-run data per strategy per city before sizing up.
This analysis uses scripts/forecast-evolution.js and the HistoricalConviction + Scan + Settlement collections. Row-level data is exported to scripts/out/forecast-evolution.jsonl. Re-run after each settlement cycle to track how edge estimates evolve.
Station vs Grid — Kalshi Settlement Locations & NWS Forecast Variance
Observation Date: April 20, 2026 | Status: initial documentation, warrants further research
The Issue
Kalshi settles temperature markets on the CF6 climate report from a specific airport (or park) weather station. NWS forecasts are for a ~2.5km grid cell at a lat/lon point, not the station itself. The grid cell averages over a broader area, while the station is a single-point reading affected by its immediate surroundings. This mismatch is a systematic source of forecast error that varies by city.
Some of the “forecast bust” days in our data may not be NWS getting the weather wrong — they may be the NWS grid forecast not matching the specific microclimate at the Kalshi settlement station.
Settlement Stations by City
| City | Station | Station Name | Lat | Lon | Microclimate Risk |
|---|---|---|---|---|---|
| NYC | KNYC | Central Park | 40.779 | -73.969 | HIGH — Not an airport. Urban heat island + park cooling. Unique microclimate that NWS grid doesn’t specifically model. Central Park can read 2-4°F different from surrounding Manhattan. |
| LAX | KLAX | LAX Airport | 33.938 | -118.389 | HIGH — Coastal airport directly affected by marine layer. Can be 10°F+ colder than points 5 miles inland on the same day. NWS grid averages over the LA basin; LAX station sits right on the marine-layer boundary. Likely explains why LAX is “AFD-inverted” in the backfill. |
| SFO | KSFO | SFO Airport | 37.621 | -122.379 | HIGH — Peninsula airport surrounded by bay water on three sides. Extreme fog/marine-layer sensitivity. Same “inverted” AFD pattern as LAX. The NWS grid cell includes inland areas that behave completely differently. |
| PHX | KPHX | Sky Harbor | 33.437 | -112.008 | MED |
| MIA | KMIA | Miami International | 25.793 | -80.291 | MED — Coastal proximity + sea-breeze effects. Another “inverted” AFD city. The station is inland enough to avoid direct ocean moderation but close enough for sea-breeze timing to matter. |
| SEA | KSEA | SeaTac Airport | 47.450 | -122.309 | MED — Puget Sound marine influence. Inland from the coast but maritime air penetrates the gap. Third “inverted” city in the AFD analysis. |
| BOS | KBOS | Logan Airport | 42.366 | -71.010 | MED — Harbor-adjacent airport. Sea breeze can drop temps 5-10°F on summer afternoons vs inland. East wind = marine cooling; west wind = continental heating. NWS grid may not capture the harbor effect precisely. |
| DC | KDCA | Reagan National | 38.851 | -77.040 | MED — Potomac River airport. Urban heat island + river cooling creates a complex microclimate. Can differ from Dulles (KIAD) by 3-5°F on the same day. |
| CHI | KMDW | Midway Airport | 41.787 | -87.752 | LOW — Inland urban airport. Less microclimate variability than coastal stations. Lake Michigan effect is weaker at Midway (10 miles inland) vs O’Hare or lakefront. |
| ATL | KATL | Hartsfield | 33.641 | -84.428 | LOW — Large inland airport. Minimal microclimate effects. Good grid-to-station alignment expected. |
| AUS | KAUS | Bergstrom Airport | 30.195 | -97.670 | LOW — Inland airport in flat terrain. Minimal microclimate offset expected. |
| DEN | KDEN | Denver Intl | 39.856 | -104.674 | MED — Airport is on the eastern plains, ~25 miles from the foothills. Chinook winds can create extreme local warming (20°F+ in hours) that the grid may underforecast. Elevation: 5,431 ft. |
| PHL | KPHL | PHL Airport | 39.872 | -75.241 | LOW — Inland airport. Delaware River nearby but minimal direct marine influence. |
| HOU | KHOU | Hobby Airport | 29.646 | -95.279 | MED — Corrected Apr 21: Kalshi settles on KHOU (Hobby) not KIAH (Intercontinental). Hobby is ~20mi south, closer to Galveston Bay — Gulf moisture reaches Hobby before IAH. Explains the +67% Kalshi/ACIS drift we saw in Apr 20 data when we were fetching KIAH actuals. |
| OKC | KOKC | Will Rogers Airport | 35.393 | -97.601 | LOW — Great Plains airport. Flat terrain, minimal microclimate effects. Good grid alignment. |
| LAS | KLAS | McCarran Airport | 36.084 | -115.154 | LOW — Desert airport. Consistent conditions. Urban heat island effect in the valley, but NWS grid likely captures it. |
| DAL | KDFW | DFW Airport | 32.900 | -97.040 | LOW — Large inland airport. Flat terrain, minimal microclimate effects. |
Correlation with AFD Tier Findings
The four “AFD-inverted” cities from the backfill (Apr 18 finding) are all high-microclimate-risk stations, each coastal or marine-influenced:
- LAX — marine layer boundary (inverted, -26pp)
- SFO — peninsula fog zone (inverted, -10pp)
- SEA — Puget Sound marine air (inverted, -2pp)
- MIA — sea-breeze timing (inverted, -8pp)
This is likely not a coincidence. The AFD discusses synoptic-scale weather patterns (fronts, troughs, ridges). For coastal/marine-layer cities, the local station temperature is dominated by micro-scale marine effects that the AFD doesn’t address. On “stable” days (high pressure, ridge), these cities have MORE micro-variability because the synoptic pattern is quiet but the marine layer position shifts unpredictably. On “volatile” days (fronts, troughs), the strong synoptic forcing actually OVERRIDES the marine variability and makes the station more predictable.
This explains the inversion: stable AFD → marine layer dominates → station is unpredictable. Volatile AFD → synoptic pattern dominates → station follows the grid forecast more closely.
Research Needed
- Quantify the grid-to-station offset per city — for each settlement in our data, compute (NWS grid forecast − actual station reading); a minimal sketch follows this list. Is the offset consistent (systematic bias) or variable (noise)?
- Compare NWS forecast to Kalshi’s CF6 settlement value directly — we have both in the Settlement collection since Apr 13. The difference tells us how much of our “forecast error” is weather-wrong vs station-mismatch.
- Test marine-layer-specific forecast sources — does the NWS MOS (Model Output Statistics) for the specific station do better than the grid forecast for LAX/SFO? MOS is station-specific and trained on local biases.
- Wind direction as a predictor for coastal cities — onshore wind (west at LAX) = marine layer present = cooler. Offshore wind (east at LAX) = marine layer absent = warmer. The NWS hourly forecast includes wind direction; a simple “onshore vs offshore” binary might be a better predictor than AFD for these cities.
- NYC Central Park anomaly — KNYC is the only non-airport station. How does Central Park’s microclimate compare to what NWS forecasts for the NYC grid? The urban heat island + park tree canopy creates unique diurnal patterns (cooler afternoons, warmer mornings than surrounding streets).
- PHX concrete heat island — Sky Harbor sits in a massive concrete/asphalt zone. Does the station consistently read warmer than the NWS grid forecast? If so, there’s a systematic warm bias we could exploit for NO bets on warm-biased days.
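A sketch of research item 1 (field names are assumptions about the Settlement collection):

```js
// Per-city grid-to-station offset from existing settlements.
function gridStationOffsets(settlements, city) {
  return settlements
    .filter(s => s.city === city)
    .map(s => s.nwsForecastF - s.stationActualF); // positive = grid runs warm
}
// A consistent sign across days means systematic bias (calibratable);
// large variance around zero means irreducible microclimate noise (widen sigma).
```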
Potential Impact on Strategy
If the grid-to-station offset is consistent per city (e.g., LAX station always reads 2°F cooler than the grid forecast on marine-layer days), that’s a free calibration adjustment we could add to the model. It would effectively give us a better “station-specific forecast” without needing a new data source.
If the offset is variable (sometimes +3°F, sometimes -2°F), it means the station microclimate adds irreducible noise that no forecast can capture — and the right response is to widen the confidence interval (larger sigma) for those cities, or avoid them entirely for tight-bracket NO bets.
This observation connects the Apr 18 AFD inversion finding (coastal cities have inverted AFD signal) to a physical mechanism (marine-layer microclimate at the station). The per-city AFD tiers (responsive vs inverted) may be a proxy for “how much does the station microclimate differ from the NWS grid forecast.” Further research: quantify the offset per city from existing settlement data.
Wx Overlay Refinement Roadmap — NO/YES Signal Integration
Observation Date: April 19, 2026 | Based on: 1,530 city-day backfill analysis (Apr 18 findings)
Context
The Apr 18 backfill validated that AFD keyword analysis HAS predictive value for forecast bust days — but the current scoring is saturated (67% of days at cap), the signal is city-dependent (inverted for coastal cities), and specific keywords carry almost all the weight. This roadmap prioritizes the next steps by impact × effort.
Tier 1 — Data-Driven Fixes (High Impact, Low Effort)
1. Recalibrate AFD Keyword Weights
The empirical keyword deltas from the 1,530-day backfill tell us exactly what to change:
| Change | Keyword | Current Weight | Empirical Delta | Action |
|---|---|---|---|---|
| ↑ | trough | +0.04 | +9.9pp miss | Raise to +0.10 or higher |
| ↑ | front | +0.04 | +9.7pp miss | Raise to +0.10 |
| ↑ | rapidly | +0.05 | +7.8pp miss | Raise to +0.08 |
| ↓ | uncertainty | +0.06 (instability) | −1.7pp miss | Move to neutral (0.00) or slight stability |
| ↓ | dry | −0.03 | −3.5pp miss | Keep or reduce slightly |
| ↓ | clear | −0.03 | −2.7pp miss | Keep |
| − | fair, calm, ridge, high pressure | −0.03 to −0.06 | < ±3pp | Reduce toward 0 — too noisy |
Estimated effort: 30 minutes. Config change in src/stability.js keyword arrays. Immediate impact on live Wx column.
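A sketch of what the reweighted arrays could look like (the real structure in `src/stability.js` may differ; treat names as assumptions):

```js
// Empirical deltas from the 1,530-day backfill drive the new weights.
const INSTABILITY_KEYWORDS = {
  trough:  0.10, // was +0.04; +9.9pp miss delta
  front:   0.10, // was +0.04; +9.7pp
  rapidly: 0.08, // was +0.05; +7.8pp
  // 'uncertainty' dropped: empirically neutral (−1.7pp)
};
const STABILITY_KEYWORDS = {
  dry:   -0.03, // keep; −3.5pp
  clear: -0.03, // keep; −2.7pp
  // fair / calm / ridge / 'high pressure' reduced toward 0: too noisy
};
```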
2. Per-City AFD Enablement
Split cities into three tiers based on the backfill’s volatile-vs-stable gap:
| Tier | Cities | Gap | Action |
|---|---|---|---|
| AFD-Responsive | HOU, ATL, PHX, AUS, BOS, PHL, DC, CHI | +20 to +44pp | Apply AFD conviction to NO/YES pricing |
| AFD-Neutral | DEN, LAS, OKC, DAL, NYC | < 15pp | Use overall miss rate only; show Wx for info |
| AFD-Inverted | LAX, SFO, SEA, MIA | Negative | Exclude from AFD-based pricing entirely |
Store as a per-city config flag in config/cities.js (e.g., afdTier: 'responsive' | 'neutral' | 'inverted'). The Wx column on Forecast NO would still show for all cities (observational), but only AFD-responsive cities would have their cap adjusted in Phase 2.
Estimated effort: 30 minutes. Config + classifier change.
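A sketch of the flag and the gate (hypothetical shape for `config/cities.js`):

```js
// afdTier: 'responsive' | 'neutral' | 'inverted'
const cities = {
  HOU: { afdTier: 'responsive' },
  DEN: { afdTier: 'neutral' },
  LAX: { afdTier: 'inverted' },
  // ...remaining cities per the tier table above
};

// Only responsive cities would get their cap adjusted in Phase 2.
const afdAppliesTo = (city) => cities[city]?.afdTier === 'responsive';
```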
3. Recompute Correlation Using NWS Forecasts + Real Brackets
The backfill used Open-Meteo GFS forecasts (1.2°F avg error) and synthetic brackets. The Settlement collection has NWS forecasts + Kalshi-verified actuals for 30+ days. Cross-joining those with HistoricalConviction AFD data would give numbers directly comparable to the live Forecast NO classifier — and likely show a wider volatile-vs-stable gap since NWS point forecasts are less accurate than GFS.
Estimated effort: 1 hour. Query + analysis script.
Tier 2 — New Signals to Integrate (Medium Impact, Medium Effort)
4. Binary “Trough or Front” Flag
The keyword analysis shows “trough” and “front” carry almost all the predictive signal (+10pp each). A simple boolean — “did the AFD mention trough or front?” — might outperform the complex 24-keyword composite. Test it against the backfill data before building. If it works, it’s the simplest possible conviction signal: one bit, +10pp edge.
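Testing that one bit against the backfill rows is a few lines (field names hypothetical):

```js
// Expect roughly a +10pp gap if the single-flag signal holds.
function troughFrontDelta(rows) {
  const flag = (r) => /\b(trough|front)\b/i.test(r.afdText);
  const missRate = (xs) => xs.filter(r => r.bracketMissed).length / xs.length;
  return missRate(rows.filter(flag)) - missRate(rows.filter(r => !flag(r)));
}
```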
5. Forecast-vs-Model Diff (Free Alternative to Ensemble Spread)
The backfill has both GFS forecasts and NWS forecasts (via Settlement). Computing |GFS − NWS| for each historical day gives a “model disagreement” signal without needing the paid Open-Meteo ensemble tier. Large divergence = NWS anchored to a different model = higher bust probability. Can be computed from existing data with zero new API calls.
6. Time-of-Day AFD Weighting
The IEM archive timestamps each AFD product. The morning AFD (~4am local) is the forecast the market prices off of. The afternoon AFD often reflects what actually happened. Scoring only the morning AFD (closest to market-open) may produce a cleaner signal than averaging all AFDs for the day.
Tier 3 — New Strategy Components (High Impact, Needs More Data)
7. Forecast YES Scanner
Build the inverse of Forecast NO: on neutral-AFD days (0.96–1.05), scan for YES contracts on the forecast bracket priced ≤ 35¢. The backfill shows a 67% hit rate on these days. Even if the realized hit rate regresses to 35–40%, entry at 29–35¢ still yields roughly +20–40% ROI (see the Apr 17 payout math).
Needs its own page, signal tracking, and settlement scoring — parallel to the existing Forecast NO pipeline. The conviction overlay becomes the signal switch: Red Wx → buy NO, Green Wx → buy YES, Gray → baseline only.
Wait until keyword recalibration (Tier 1) is done — otherwise the YES scanner uses the same saturated scoring.
8. Conviction-Weighted Sizing
Instead of flat $10/signal, size each Forecast NO position by conviction:
- AFD-responsive city + volatile day (AFD ≥ 1.10 with “trough” or “front”) → $20
- AFD-neutral city + neutral day → $5
- AFD-inverted city → skip or $5
This is the Phase 2 the Apr 15 observation described, but now calibrated with empirical weights from the backfill rather than guessed.
9. 180-Day Backfill Extension + Seasonal Analysis
Current 90 days cover late January through mid-April (winter → spring transition). Extending to 180 days adds the fall → winter transition and reveals whether the AFD signal is seasonal. “Trough” and “front” might be more predictive in transitional seasons than in stable summer ridges.
Run: node scripts/backfill-historical-conviction.js --days 180 (takes ~45 min with AFD scraping).
Recommended Sequence
- Items 1 + 2 together (keyword recalibration + per-city enablement) — ~1 hour, immediately improves the live Wx column
- Item 4 (binary trough/front flag) — test against backfill data, 30 min
- Item 3 (NWS forecast correlation) — validates with the real data source, 1 hour
- Item 7 (Forecast YES scanner) — biggest new revenue stream, builds on recalibrated scoring
- Items 5, 6, 8, 9 as time permits
This roadmap follows the Apr 18 backfill findings. All Tier 1 items use data we already have — no new API calls or paid services. The Forecast YES scanner (Tier 3) is the largest upside opportunity but depends on Tier 1 calibration being done first so the conviction signal is trustworthy.
Historical Conviction Backfill — AFD Validation Results
Observation Date: April 18, 2026 | Data: 1,530 city-days (17 cities × 90 days, Jan 18 – Apr 17 2026) | Sources: Open-Meteo Historical Forecast API + RCC-ACIS actuals + Iowa State IEM AFD archive
Summary
Backfilled 90 days of historical conviction data to validate whether the Wx overlay (AFD keyword scoring) correlates with forecast bust days. The answer is mixed: the signal exists but the scoring needs recalibration before it’s actionable.
Finding 1: AFD Scoring Is Saturated
67% of all days hit the AFD factor cap (1.25). Keywords like “front,” “trough,” and “storm” appear in nearly every AFD because forecasters always discuss some weather feature. The scoring has no discrimination power for most days.
| AFD Level | Miss Rate | n | % of Days |
|---|---|---|---|
| ≤ 0.90 (very stable) | 48.0% | 229 | 15% |
| 0.96–1.05 (neutral) | 33.3% | 87 | 6% |
| 1.06–1.15 (unsettled) | 47.5% | 101 | 7% |
| 1.21–1.25 (extreme/cap) | 51.4% | 1,033 | 67% |
Implication: the threshold for “volatile” needs to be much higher, or the keyword weights need restructuring. The current scoring cannot distinguish “a front is mentioned in passing for next week” from “a dangerous front is arriving tomorrow.”
Finding 2: The MIDDLE of the AFD Range Has the Signal
The lowest miss rate (33.3%) is in the neutral zone (0.96–1.05) — days where the AFD has roughly equal stability and instability language. Both extremes (≤ 0.90 and ≥ 1.21) have ~48–51% miss rates.
This directly supports the Forecast YES thesis (see Apr 17 observation): on neutral-AFD days, forecasts are right 67% of the time. At 29–35¢ YES prices, that’s profitable edge. The conviction overlay can identify which days to buy YES, not just which days to buy NO.
Finding 3: AFD Value Varies Enormously by City
| City | Volatile Miss (n) | Stable Miss (n) | Gap | AFD Useful? |
|---|---|---|---|---|
| HOU | 55% (71) | 11% (9) | +44pp | YES — strong |
| ATL | 75% (59) | 44% (16) | +31pp | YES |
| PHX | 78% (32) | 51% (45) | +27pp | YES |
| AUS | 70% (60) | 50% (10) | +20pp | YES |
| DEN | 56% (66) | 43% (7) | +13pp | Marginal |
| LAS | 16% (32) | 15% (41) | +1pp | NO — both low |
| SEA | 74% (35) | 76% (34) | −2pp | NO — both high |
| LAX | 29% (34) | 56% (36) | −26pp | INVERTED |
| SFO | 47% (36) | 57% (35) | −10pp | Inverted |
Coastal cities (LAX, SFO, SEA) are inverted or neutral. Marine-layer micro-climate variability doesn’t respond to synoptic-scale AFD language. A “stable high-pressure ridge” over California can produce either 65°F or 80°F at LAX depending on marine layer position — and the AFD can’t predict that.
Interior/Gulf cities (HOU, ATL, PHX, AUS) respond strongly. These are the cities where AFD-based conviction should be applied. 20–44pp gap between volatile and stable days is real edge for signal discrimination.
Finding 4: Specific Keywords Matter More Than the Aggregate Score
| Keyword | Miss When Present | Miss When Absent | Delta |
|---|---|---|---|
| trough | 51.7% | 41.8% | +9.9pp |
| front | 50.9% | 41.1% | +9.7pp |
| rapidly | 55.8% | 48.0% | +7.8pp |
| light winds | 43.2% | 50.9% | −7.7pp |
| sunny | 44.9% | 50.3% | −5.4pp |
| uncertainty | 48.1% | 49.8% | −1.7pp |
“Uncertainty” REDUCES miss rate. Counterintuitive but consistent: when NWS forecasters explicitly flag uncertainty, they anchor to climatology and make conservative predictions — which paradoxically makes the point forecast more accurate. “Trough” and “front” are the real bust predictors at ~+10pp each.
Finding 5: Overall Forecast Quality
Average absolute forecast error (Open-Meteo GFS): 1.2°F. Overall bracket miss rate: 49.1%. These numbers are lower than the live classifier’s 70–88% because: (a) GFS outperforms NWS point forecasts operationally, (b) the data covers all 17 cities including low-miss ones, and (c) the synthetic bracket assumption (hardcoded odd-top grid) affects the count.
Implications for Phase 2
- Recalibrate AFD scoring before integrating into signal firing. Options: raise keyword weights for “trough”/“front”/“rapidly,” reduce generic terms like “dry”/“fair,” or switch to a binary “has trough OR front” flag. Current composite score is too noisy (67% at cap).
- Build per-city AFD enablement. Apply AFD conviction ONLY to interior/Gulf cities (HOU, ATL, PHX, AUS, DEN, BOS, PHL, DC, CHI). Exclude LAX, SFO, SEA where the signal is inverted or zero.
- Reweight “uncertainty” — move it from instability (+0.06) to stability (−0.03 or neutral). The data shows it’s a conservative-forecasting indicator, not a bust indicator.
- Forecast YES on neutral-AFD days — 33% miss = 67% hit rate. This is the cleanest single finding and directly feeds the Apr 17 observation.
- Recompute using NWS-specific forecasts from the Settlement collection (instead of GFS) and real Kalshi brackets (instead of synthetic) for numbers directly comparable to the live classifier. The GFS data confirms the direction but the magnitudes will differ.
Data Source Details
- Historical forecasts: Open-Meteo Historical Forecast API (`historical-forecast-api.open-meteo.com/v1/forecast`). GFS model, daily max/min in °F by city timezone. Free tier, 10k calls/day.
- Historical actuals: RCC-ACIS (`data.rcc-acis.org/StnData`). ASOS station daily high/low. Same source as the existing backtest module. Free.
- AFD text: Iowa State IEM archive (`mesonet.agron.iastate.edu/api/1/nws/afos/list.json` for the product list, `mesonet.agron.iastate.edu/api/1/nwstext/{product_id}` for text retrieval). Scored using the same 24-keyword matcher as the live `analyzeAFD` in `src/stability.js`. Free.
- Ensemble spread: NOT backfilled (requires Open-Meteo paid tier). The Previous Runs API has historical GFS data from March 2021, but costs money. AFD alone is sufficient for initial validation.
Script: scripts/backfill-historical-conviction.js. Collection: HistoricalConviction (1,530 flat documents). Safe to re-run (upserts by city+date+marketType).
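The re-run safety is just an upsert keyed on the natural key. A minimal sketch, assuming the Node MongoDB driver and illustrative field names:

```js
// Idempotent write: re-running the backfill overwrites rather than duplicates.
// Field names are illustrative; the real schema lives in the backfill script.
async function saveConvictionDoc(db, doc) {
  await db.collection('HistoricalConviction').updateOne(
    { city: doc.city, date: doc.date, marketType: doc.marketType }, // natural key
    { $set: doc },
    { upsert: true } // insert if missing, replace fields if already present
  );
}
```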
This backfill was prompted by the Apr 15 conviction overlay observation and the Apr 17 Forecast YES hypothesis. The data validates the YES thesis (neutral AFD = 67% hit rate) while revealing that the NO-side AFD signal requires per-city calibration and keyword reweighting before it can drive automated decisions.
Forecast YES — Inverse Strategy on Stable Weather Days
Observation Date: April 17, 2026 | Triggered by: first day of Wx conviction data showing 9/14 cities at max AFD, with 3 cities (DEN, LAS, SEA) showing stable patterns
The Insight
On the first day of Wx conviction data, a cold front was driving most cities to AFD ≥ 1.20 (unstable). But three cities — DEN (0.89), LAS (0.93), SEA (0.90) — were stable, behind the front. The YES contract on Denver’s forecast bracket was priced at 29¢.
At 29¢, you only need 29% accuracy to break even. You can be wrong 70% of the time and still profit. This inverts the Forecast NO logic: instead of betting NWS is WRONG on volatile days, bet NWS is RIGHT on stable days, at cheap YES prices.
The Binary Payout Math
| YES Entry | Win Profit | Loss | Break-even WR |
|---|---|---|---|
| 25¢ | +$300 | −$100 | 25% |
| 29¢ | +$245 | −$100 | 29% |
| 33¢ | +$203 | −$100 | 33% |
| 40¢ | +$150 | −$100 | 40% |
The payout asymmetry is extreme. A 35% hit rate at 29¢ yields +21% ROI. A 40% hit rate yields +38% ROI. The strategy is very forgiving on accuracy because you’re buying cheap.
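The whole table reduces to one formula. A quick sketch (`price` in dollars per contract, $1 payout on a win):

```js
// ROI of buying YES at `price` (e.g. 0.29) given a bracket hit rate.
// Each contract pays $1 on a win, $0 on a loss; fees ignored.
const yesRoi = (hitRate, price) => hitRate / price - 1;

yesRoi(0.35, 0.29); // ≈ +0.21 → +21% ROI
yesRoi(0.40, 0.29); // ≈ +0.38 → +38% ROI
yesRoi(0.29, 0.29); //    0.00 → breakeven WR equals the entry price
```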
Historical Bracket HIT Rate (60-day, overall — not filtered by stability)
The complement of miss rate — % of days the actual high STAYED INSIDE the NWS 2°F bracket. These rates are BEFORE any stability filtering; the hypothesis is that stable-day filtering pushes hit rates higher.
| City | Samples | Hit Rate | 29¢ ROI | 25¢ ROI | Note |
|---|---|---|---|---|---|
| PHX | 25 | 52% | +79% | +108% | Already above breakeven without filtering |
| NYC | 25 | 44% | +52% | +76% | Already above breakeven |
| ATL | 25 | 44% | +52% | +76% | Already above breakeven |
| OKC | 25 | 44% | +52% | +76% | Already above breakeven |
| DAL | 25 | 40% | +38% | +60% | Already above breakeven |
| LAS | 25 | 36% | +24% | +44% | Already above breakeven |
| SEA | 25 | 36% | +24% | +44% | Already above breakeven |
| LAX | 25 | 32% | +10% | +28% | Above breakeven |
| SFO | 25 | 32% | +10% | +28% | Above breakeven |
| MIA | 25 | 32% | +10% | +28% | Above breakeven |
| HOU | 25 | 28% | −3% | +12% | Near breakeven |
| DEN | 25 | 24% | −17% | −4% | Below breakeven at base rate — needs stability filter |
| PHL | 25 | 24% | −17% | −4% | Below breakeven |
| DC | 25 | 24% | −17% | −4% | Below breakeven |
| BOS | 25 | 20% | −31% | −20% | Below breakeven |
| AUS | 25 | 20% | −31% | −20% | Below breakeven |
| CHI | 25 | 16% | −45% | −36% | Far below breakeven |
Ten cities clear the 29% breakeven at the overall rate without any stability filtering, the top seven by 7pp or more. If stability filtering raises hit rates by even 5-10pp on clean days, the ROI numbers become very compelling.
Why This Is Different From The Old BUY_YES (Which Failed)
The active model’s BUY_YES was disabled in April (0/2 WR, −$51). That approach used the model’s probability estimate to decide when YES was underpriced — which required the model to be well-calibrated (it wasn’t).
This idea is fundamentally different:
| Approach | Signal | What It Relies On |
|---|---|---|
| Old BUY_YES (failed) | “Our model says YES is cheap” | Model probability calibration (broken) |
| Forecast YES (new) | “NWS is likely RIGHT today” | Wx conviction overlay identifying stable days |
The conviction overlay becomes the signal switch between two complementary strategies that run on the same dashboard:
| Wx Signal | Day Type | Strategy |
|---|---|---|
| Red (AFD ≥ 1.15, spread ≥ 8°) | Volatile | Forecast NO — NWS likely to bust, buy NO |
| Green (AFD ≤ 0.92, spread ≤ 3°) | Stable | Forecast YES — NWS likely right, buy YES cheap |
| Gray | Neutral | Baseline only or skip |
Caveats
- Hit rates use the synthetic bracket, which we know is wrong for some cities due to the grid-offset issue. Need to recompute with the real-bracket join.
- 25 samples per city — PHX at 52% has a Wilson 95% CI of roughly [33%, 71%] (see the sketch after this list). The true rate could be 33% (barely above breakeven) or 71% (incredible edge).
- The 29¢ YES price was DEN-specific. Other cities with higher hit rates (PHX at 52%) probably have YES priced at 45-55¢ — the market isn’t that inefficient. The edge (if any) would be smaller.
- Zero conviction-conditioned data exists yet. We need 2-3 weeks of Wx data to compute: “on days where AFD ≤ 0.92 AND spread ≤ 3°, what is the hit rate?” That’s the conditional rate that matters — the overall rates above are just a starting point.
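For the CI caveat above, the Wilson score interval is the standard formula; a minimal sketch:

```js
// Wilson 95% score interval for a binomial proportion (no external deps).
function wilson(hits, n, z = 1.96) {
  const p = hits / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return [center - half, center + half];
}

wilson(13, 25); // ≈ [0.33, 0.70] — PHX's 52% on 25 samples
```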
Analysis To Run When Data Is Ready
- Conditional hit rate by AFD level — for each city, what’s the bracket hit rate on days where AFD ≤ 0.92 vs days where AFD ≥ 1.15?
- Actual YES ask price at scan time — is 29¢ typical for stable cities, or was DEN an outlier? Need to capture YES prices alongside NO prices on scans.
- Edge = conditional hit rate − YES ask price. If this exceeds 5pp consistently, the strategy has real edge.
- Correlation between ensemble spread and hit rate — does low spread (≤ 3°) independently predict higher hit rates, or is it redundant with AFD?
- Interplay with edge-position — on a stable day, does a bottom-edge forecast (cold-biased city where 1°F cooling stays in bracket) have an even higher hit rate? That would be the tightest filter: stable + favorable edge + cheap YES.
Revisit Criteria
- 2-3 weeks of Wx conviction data accumulated (at least 20 “stable” city-days with AFD ≤ 0.92)
- At least 5 cities with ≥ 10 stable-day samples each
- Conditional hit rate computed and compared to YES ask price at scan time
- Real-bracket hit rates (not synthetic) calculated via Settlement → Scan join
This observation was prompted by the first day of Wx conviction data (Apr 17). William noticed the inverse opportunity: if the conviction overlay identifies days when NWS is likely to be right, buying YES on the forecast bracket at cheap prices exploits the same data from the opposite direction. The conviction overlay was originally designed for NO-side edge detection — this is a completely new use case that emerged from the data itself.
Forecast NO — Day-Level Conviction Overlays
Observation Date: April 15, 2026 | Design notes — revisit when ready to build
The Insight That Prompted This
The April 11 post-mortem showed 71% of all Forecast NO losses (5 of 7) clustered on a single day. When the strategy loses, it loses multiple positions simultaneously because the model is collectively wrong about a weather pattern. Losses are day-correlated, not time-of-day-correlated.
Therefore the highest-leverage next improvement is a filter that can flag “today is a high-uncertainty day, size down or skip” vs “today is a clean baseline day, go full size.” Day-level conviction, not time-of-day tuning.
Already Computed But NOT Consumed by Forecast NO
These signals exist in the codebase (used by the active model for sigma adjustment) but are completely ignored by Forecast NO. Wiring them in is pure plumbing.
- AFD keyword analysis (`src/stability.js::analyzeAFD`) — parses the NWS Area Forecast Discussion for 24 weighted keywords. Unstable terms: cold front, warm front, thunderstorm, unstable, uncertainty, volatile, rapidly, trough, low pressure, wind shift, gusty, storm, rain changing, wintry mix. Stable terms: high pressure, ridge, stable, dry, light winds, clear, sunny, calm, fair. Returns a factor 0.85-1.25 plus matched terms.
- Ensemble spread (`src/fetchers/open-meteo.js::fetchEnsemble`) — 6 models (GFS, ECMWF, ICON, GEM, JMA, MeteoFrance). High spread = models disagree = dynamic weather.
- Ensemble vs NWS disagreement — when the ensemble mean diverges from NWS by ≥2°F, NWS is anchored on one model the others disagree with. Currently console-logged only.
- Hourly volatility (`analyzeHourlyVolatility`) — sharp temperature swings within the hourly forecast.
The AFD signal is the biggest miss. NWS forecasters literally use the word “uncertainty” in their discussion text on busted-forecast days — and we're ignoring it for the strategy that bets on their uncertainty.
Medium-Effort New Integrations
- NWS alerts — Wind/Frost/Heat Advisory as separate boolean signals.
- Dew point / cloud cover from NWS hourly. Dew point near forecast high = suppressed daytime heating (common forecast-bust pattern).
- Wind shift detection — a direction change >45° within the day indicates frontal passage.
- NWS SPC Day 1 convective outlook — active area = higher probability of forecast bust.
Harder / Longer-Term
- Temperature anomaly vs climatology — extreme anomalies are harder to forecast. Open-Meteo has climate normals.
- Model-specific outliers — ECMWF is typically the best single model. If NWS matches GFS but disagrees with ECMWF, that's more surgical than overall spread.
- Radar/satellite convection — GIS work, not trivial.
Recommended Build Sequence
Phase 1 — Wire existing signals as display-only (2-3 hours). Create a `dayConviction` object on every Forecast NO scan with `afdFactor`, `afdKeywords`, `ensembleSpread`, `ensembleDiff`, `hourlyVolatility`, `alertsPresent`, and an aggregated `dayConvictionScore` (0-100, where 50 = neutral, >70 = high uncertainty). Surface on `/weather/forecast-no` as a new column. Do not change firing logic yet — let it accumulate for 2-3 weeks as display-only. (A sketch of the object follows the phase list.)
Phase 2 — Validate and fold in (after 60+ settled signals). On days where `dayConvictionScore` ≥ 70, was the realized miss rate actually higher than the city's baseline? If yes, use it as a cap multiplier: `adjustedCap = baselineCap × (1 + 0.1 × convictionBonus)`. High uncertainty → accept higher NO ask prices (bigger positions). Stable days → tighten.
Phase 3 — April 11 post-mortem validation. Before investing in Phase 1 wiring, pull historical AFD + ensemble data for April 9-13 and check whether the dayConviction signal would have flagged April 11 as high-uncertainty. If yes, the signal is real and we should build. If no, the filter doesn't work and we need different data sources. This is the most valuable next step — it proves the signal has predictive value on the one day we already know mattered.
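A sketch of what the Phase 1 object and Phase 2 cap adjustment could look like. The aggregation weights below are placeholders to make the shape concrete, not a validated formula:

```js
// Phase 1 sketch: assemble the display-only dayConviction object.
// Every weight here is an illustrative placeholder pending Phase 2 validation.
function buildDayConviction({ afdFactor, afdKeywords, ensembleSpread,
                              ensembleDiff, hourlyVolatility, alertsPresent }) {
  const raw =
    50 +                                     // 50 = neutral
    (afdFactor - 1.0) * 100 +                // 0.85-1.25 factor → −15..+25
    (Math.min(ensembleSpread, 12) - 5) * 2 + // ~5°F spread treated as neutral
    (ensembleDiff >= 2 ? 10 : 0) +           // NWS diverging from ensemble mean
    (alertsPresent ? 5 : 0);
  return {
    afdFactor, afdKeywords, ensembleSpread, ensembleDiff,
    hourlyVolatility, alertsPresent, // stored for display even when unscored
    dayConvictionScore: Math.max(0, Math.min(100, Math.round(raw))),
  };
}

// Phase 2 sketch: conviction-aware price cap (only after validation).
const adjustedCap = (baselineCap, convictionBonus) =>
  baselineCap * (1 + 0.1 * convictionBonus);
```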
Priority Order When Building
- AFD factor — biggest payoff, already computed, highest encoded human judgment
- Ensemble spread — biggest quantitative signal, already fetched
- Ensemble vs NWS diff — subtle but real edge indicator
- Hourly volatility — cheap, already partially done
- Everything else — wait and see if 1-4 are enough
Estimated Impact
If AFD flags 1 in 5 days as "high uncertainty" with meaningfully higher miss rates: back-of-envelope, April 11's loss was ~−$46 on 5 positions. A conviction filter that halved sizing on that day would have cut the loss to ~−$23, lifting period P&L from +$143 to +$166 — a ~16% improvement from a single filter during one documented bad day.
Open Questions for Future-Me
- What's the right aggregation formula for
dayConvictionScore? Simple weighted sum, or non-linear? - Per-city only (AFD office is per-city), or also a regional component?
- Does stability predict high NO-loss days, or only low-edge days? Two different questions worth measuring.
- Interaction with edge-position classifier — does a warm-biased city on a high-uncertainty day have amplified or dampened edge?
- Kill-switch integration — if
dayConvictionScoredrops mid-day, should the execution-automation cron (see the separate observation) cancel open intents that haven't filled yet?
Revisit Criteria
Don't start this until:
- Forecast NO has 50+ settled signals with the new edge-position metric in place
- At least one more "bad day" has occurred so there are multiple validation points, not just April 11
- Phase 3 retrospective test has been run and confirms AFD/ensemble would have flagged April 11
Ensemble Models
The 6 global weather models used for ensemble spread and NWS-divergence analysis:
| Model | Origin | API ID | Notes |
|---|---|---|---|
| GFS | US (NOAA) | gfs_seamless | NWS's own parent model. When NWS agrees with GFS but disagrees with others, that's a strong NWS-anchoring signal. |
| ECMWF | European | ecmwf_ifs025 | Generally considered the most accurate global model. When ECMWF diverges from NWS, the ECMWF read is often closer to truth. |
| ICON | German (DWD) | icon_seamless | Strong on European weather patterns. Independent initialization from GFS/ECMWF. |
| GEM | Canadian | gem_seamless | Good for northern US cities. Independent data assimilation. |
| JMA | Japanese | jma_seamless | Strongest on Pacific-influenced weather (west coast cities). |
| MeteoFrance | French | meteofrance_seamless | Independent Arpège/AROME system. Adds diversity to the ensemble. |
All fetched via the Open-Meteo API (free, no API key). Source: `src/fetchers/open-meteo.js`.
How spread is computed: max(high) − min(high) across all 6 models for the forecast day. A spread of 3°F means the models all roughly agree. A spread of 10°F+ means at least one model sees a fundamentally different weather outcome (e.g., a front arriving 6 hours earlier/later than others expect).
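The spread computation itself is a one-liner (sketch; `highs` is the array of per-model forecast highs):

```js
// Ensemble spread: max minus min of the six per-model forecast highs (°F).
const ensembleSpread = (highs) => Math.max(...highs) - Math.min(...highs);

ensembleSpread([82, 83, 84, 83, 82, 84]); // 2 → calm, models agree
ensembleSpread([78, 84, 88, 83, 80, 86]); // 10 → at least one dissenting model
```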
Related observation: Forecast NO Execution Automation Design Notes (Apr 14, stored in recaps) — the conviction-score signal and the execution-intent automation share a natural integration point. When the conviction score drops, the automation should cancel open unfilled intents; when it rises, it should permit higher per-signal size.
Forecast Parity Interacts With City Bias
Observation Date: April 14, 2026 | Data: 21 HIGH settlements per city over last 30 days
The Structural Setup
Kalshi weather brackets are 2°F wide and aligned in even-odd pairs: 82-83, 84-85, 86-87, and so on. The top edge of every bracket is an odd number. The bottom edge is even.
This creates a structural interaction with each city's directional forecast bias:
- Warm-biased city + ODD forecast (e.g., fcst 83, bracket 82-83): a 1°F warming to 84 exits the top of the bracket → WIN
- Warm-biased city + EVEN forecast (e.g., fcst 82, bracket 82-83): a 1°F warming to 83 stays inside the bracket → LOSS
- Cold-biased city + EVEN forecast (e.g., fcst 80, bracket 80-81): a 1°F cooling to 79 exits the bottom → WIN
- Cold-biased city + ODD forecast (e.g., fcst 81, bracket 80-81): a 1°F cooling to 80 stays inside → LOSS
In short: the parity of the forecast determines which side of the bracket a city’s typical drift crosses.
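The alignment rule reduces to a parity check. A sketch (`bias` is the city's blended °F bias):

```js
// Does today's forecast parity align with the city's typical drift direction?
// Brackets are 2°F wide with an odd top edge (82-83, 84-85, ...).
function parityAligned(forecastHigh, bias) {
  const isOdd = forecastHigh % 2 !== 0;
  if (bias > 0) return isOdd;  // warm drift exits through the odd top edge
  if (bias < 0) return !isOdd; // cold drift exits through the even bottom edge
  return false;                // neutral bias: no parity preference
}

parityAligned(83, +2.0); // true  — 1°F warming to 84 exits the bracket
parityAligned(82, +2.0); // false — warming to 83 stays inside
```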
The Data
Splitting each city’s historical miss rate by forecast parity shows a dramatic effect:
| City | Bias | ODD miss% | EVEN miss% | Edge to Preferred |
|---|---|---|---|---|
| HOU | -1.4°F (cold) | 50% (n=10) | 91% (n=11) | +41pp EVEN |
| SEA | +2.7°F | 75% (n=8) | 38% (n=13) | +37pp ODD |
| MIA | +1.0°F | 80% (n=10) | 45% (n=11) | +35pp ODD |
| DC | +2.0°F | 91% (n=11) | 60% (n=10) | +31pp ODD |
| AUS | +1.0°F | 88% (n=16) | 60% (n=5) | +28pp ODD |
| PHL | +3.8°F | 86% (n=7) | 64% (n=14) | +22pp ODD |
Where It Breaks Down
The effect is strongest for cities with modest directional bias (~1-2°F). For cities with extreme bias, the magnitude of drift overwhelms the parity effect:
- CHI (+4.8°F bias) — drift is so strong (4-5°F warming) that actuals clear both bracket types. Parity doesn’t matter.
- BOS (+2.4°F bias) — similar, large bias washes out the effect.
- NYC (+0.1°F), LAX (-0.1°F), LAS (0.0°F) — no directional preference, so no parity preference either.
Implications
- A blended city miss rate masks two very different populations. AUS’s 78% rate = 88% on odd days + 60% on even days — wildly different confidence levels.
- Per-parity miss rate would give sharper entry signals and better safe-entry prices.
- Moving to a parity-aware classifier could raise effective WR from ~76% to ~85%+ on the aligned days, while filtering out the ~40% of days where we shouldn’t be betting at all.
Current Status
- Apr 14, 2026: Display-only parity column added to the Forecast NO page. Each row shows whether today’s forecast parity aligns with the city’s preferred direction.
- Strategy not yet modified. Observing in dry-run to confirm the pattern holds forward.
- Dynamic bias calculation: re-evaluated on every page load using the same blended 14d/30d window as miss rate. Cities at risk of flipping (near-zero bias) are currently the neutral ones; cities with solid bias (HOU, CHI, AUS, etc.) should be stable.
First Live Bias-Change Signal (same day)
After adding bias change detection (comparing recent 14d to prior 14d), the first classifier run surfaced meaningful movement right away. These are not small adjustments — they suggest real regime shifts are happening even within a 4-week window:
| City | Prior 14d | Recent 14d | Δ | Flag | Note |
|---|---|---|---|---|---|
| LAX | -2.0°F (cold) | +0.6°F (warm) | +3.1°F | ⚠ FLIP | Cold bias reversed to warm — parity preference now inverted. Treat with caution until confirmed. |
| ATL | +3.3°F | +0.9°F | -2.7°F | ↓ SHIFT | Still warm, but bias magnitude cut to a third. Moving into the "parity-sensitive" sweet spot where the effect is strongest. |
| NYC | +1.6°F | -0.4°F | -2.4°F | ↓ SHIFT | Drifted from warm into the neutral zone. Parity preference now none. |
| OKC | +3.6°F | +1.7°F | -2.2°F | ↓ SHIFT | Halved the warm bias. Still ODD-preferring but less confident. |
| DC | +3.9°F | +1.4°F | -3.0°F | ↓ SHIFT | Biggest shift. Was the highest-confidence ODD city; now in the middle of the pack. |
The pattern: all four "shift" cities are cooling — their warm bias is dropping. LAX reversed entirely. This points to NWS models catching up to recent weather patterns (spring warm-up already priced in), or an actual regime change (cold snap suppressing the usual warm bias).
Stable cities (no flag): CHI (+5.1°F), BOS (+2.4°F), AUS (+0.9°F), MIA (+1.1°F), PHX (+1.3°F), HOU (-1.6°F). These have been consistent across both windows — trustworthy parity signals right now.
Implication for the strategy: the five shift/flip cities need higher confidence margins on their entry prices, or we should explicitly mark them as "bias uncertain" and skip for a few days. Specifically LAX — we were about to trust its "prefer ODD" signal, but with the flip flag, that recommendation is unreliable.
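The detection logic behind the flags, roughly. Thresholds here (2°F move, 0.5°F neutral dead zone) are inferred from the table, not read out of the classifier code:

```js
// Flag bias movement between the prior and recent 14d windows.
// Thresholds are inferred from the table above, not confirmed from code.
function biasChangeFlag(prior14d, recent14d) {
  const move = Math.abs(recent14d - prior14d);
  if (move < 2) return null; // stable — parity signal trustworthy for now
  const flipped = Math.sign(prior14d) !== Math.sign(recent14d)
    && Math.abs(prior14d) >= 0.5 && Math.abs(recent14d) >= 0.5;
  return flipped ? 'FLIP' : 'SHIFT'; // LAX → FLIP; NYC (into neutral) → SHIFT
}
```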
Credit: William spotted this. The observation was sparked by noticing that forecast brackets had odd numbers at the top, which immediately suggested the asymmetry. Sample sizes are still small (5-16 per city-parity bucket) — pattern should be re-verified at 45+ days of data.
Weather Forecast Error Patterns
Observation Date: April 7, 2026 | Data: 289 HIGH settlements across 17 cities (March 21 - April 7, 2026)
Key Finding: Weather Errors Persist, Not Reverse
Unlike BTC 15-minute markets where price mean-reverts (57% reversal rate after 3+ streaks), weather forecast errors persist in the same direction. Streak reversal strategies fail for weather:
| Streak Length | Reversal Rate (miss >1°) | Verdict |
|---|---|---|
| 2-day | 47% | Coin flip |
| 3-day | 31% | Continuation favored |
| 4-day | 14% | Strong continuation |
This is the opposite of BTC. Weather errors compound — if the NWS missed hot for 3 days, they'll likely miss hot again tomorrow.
Autocorrelation: The Warm Bias
- After HOT miss: next day avg error = +1.92° (still hot)
- After COLD miss: next day avg error = +0.85° (reverts toward warm bias)
- Overall: errors persist same sign 40% of the time, reverse 60% — but when they reverse, they revert to the warm bias, not to cold
The NWS has a systematic warm bias. After a cold miss, it returns to warm. After a hot miss, it stays hot. The bias is the attractor.
After Big Misses: Forecast Improves But Undercorrects
When the NWS misses by 4°+, the next day improves but doesn't fully correct:
| Miss Size | Next Day "Improved" | Full Reversal |
|---|---|---|
| ≥2° | 74% (HOT) / 83% (COLD) | 31% / 39% |
| ≥3° | 81% / 86% | 37% / 41% |
| ≥4° | 85% / 100% | 42% / 62% |
| ≥5° | 95% / 100% | 46% / 56% |
Cold misses correct more aggressively than hot misses (they revert to the warm bias). Hot misses persist.
Tradeable Signal: Regression to Forecast
After a big miss, the next day's actual temp tends to land closer to the forecast but within a known band:
| Miss Threshold | Bracket Offset | Win Rate | Sample |
|---|---|---|---|
| ≥3° | ±2° | 62% | 95 |
| ≥4° | ±2° | 67% | 73 |
| ≥5° | ±2° | 70% | 50 |
| ≥4° | ±3° | 73% | 73 |
| ≥5° | ±3° | 78% | 50 |
Sweet spot: after a 5°+ miss, bet the next day’s actual will be within 3° of the forecast. 78% win rate, ~3 opportunities per week.
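Recomputing that win rate from settlement rows is straightforward. A sketch (rows are per-city daily `{forecastHigh, actualHigh}` pairs in date order; field names illustrative):

```js
// "Regression to forecast": after a big miss, count how often the next
// day's actual lands within `band` °F of the next day's forecast.
function regressionWinRate(rows, missThreshold = 5, band = 3) {
  let triggers = 0, wins = 0;
  for (let i = 0; i + 1 < rows.length; i++) {
    const miss = Math.abs(rows[i].actualHigh - rows[i].forecastHigh);
    if (miss < missThreshold) continue; // no trigger today
    triggers++;
    const nextErr = Math.abs(rows[i + 1].actualHigh - rows[i + 1].forecastHigh);
    if (nextErr <= band) wins++; // next day landed inside the band
  }
  return { triggers, winRate: triggers ? wins / triggers : null };
}
```

At the defaults (≥5° miss, ±3° band) this should reproduce the 78% figure from the table above.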
Per-City Overcorrection Patterns
Some cities overcorrect after big misses (error switches sign by >1°), others never do:
| City | Overcorrection Rate | Pattern |
|---|---|---|
| AUS | 75% | Frequently overcorrects — tradeable |
| LAS | 67% | Overcorrects often |
| DEN, PHL, BOS | 50% | Coin flip |
| CHI, SFO, SEA, DAL | 0% | Never overcorrects — errors persist |
CHI’s 0% overcorrection explains why our warm bias trades on CHI keep winning — when CHI misses hot, it stays hot.
Conclusions
- No BTC-style reversal play exists for weather. Errors persist, not revert.
- "Regression to forecast" signal works at 67-78% WR after 4-5°+ misses, but triggers infrequently (~3x/week).
- The existing active model already captures this edge via EMA bias calibration. CHI’s persistent warm bias is why it’s our top performer.
- Potential future enhancement: when a city’s error persists hot for 3+ days, increase the bias correction aggressiveness. Currently the EMA uses α=0.3; bumping to 0.5 for persistent streaks could improve responsiveness (see the sketch after this list).
- AUS overcorrection could be a separate signal — after a 4°+ miss on AUS, bet the opposite direction next day (75% historical rate). Worth monitoring but small sample (4 instances).
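A sketch of the streak-adaptive EMA idea from the bullet above (the 3-day streak test is illustrative):

```js
// EMA bias update with streak-adaptive alpha: when the error has held the
// same sign for 3+ days, respond faster (alpha 0.5 instead of 0.3).
function updateBias(prevBias, recentErrors) {
  const latest = recentErrors[recentErrors.length - 1];
  const streak = recentErrors.length >= 3 &&
    recentErrors.slice(-3).every((e) => Math.sign(e) === Math.sign(latest));
  const alpha = streak ? 0.5 : 0.3;
  return alpha * latest + (1 - alpha) * prevBias;
}
```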
This observation is based on 17 days of settlement data. Patterns should be validated over 60+ days before trading on them. Filed for review.