Observations

Notable patterns found in the data — newest at top.

Spread × Entry Timing — NEXTDAY vs TODAYNO

Observation Date: May 7, 2026 | Based on: 583 settled forecast-no HIGH-market signals split by entry_timing, joined to scan ensemble_spread at signal time. LOW market excluded (pre-May 6 ensemble data was bug-contaminated — see May 6 fix note in CLAUDE.md).

The Question

Two days ago we established that ensemble spread (6-model disagreement) is the strongest single predictor we have for forecast-NO performance. The strategy fires entries at two timings: NEXTDAY (8pm ET the evening before, before Kalshi’s first wave of overnight repricing) and TODAYNO (9:45am ET the morning of, after NWS has digested overnight global-model runs). Does the spread signal carry equal weight across both timings, or does one capture more of the edge?

The Cross-Tab

| Spread | NEXTDAY (eve before) miss% | NEXTDAY P&L | TODAY (morning of) miss% | TODAY P&L |
|---|---|---|---|---|
| < 3°F (calm) | 67.7% (n=31) | −$24 | 75.0% (n=32) | +$12 |
| 3-5°F (low) | 60.8% (n=102) | −$32 | 55.0% (n=111) | −$9 |
| 5-8°F (moderate) | 71.0% (n=100) | +$6 | 71.8% (n=110) | +$61 |
| 8-12°F (high) | 80.6% (n=36) | +$21 | 86.8% (n=38) | +$54 |
| ≥ 12°F (extreme) | 71.4% (n=7) | +$4 | 83.3% (n=6) | +$6 |

Three Findings

Why TODAY Wins on High-Spread Days

The mechanism that explains the pattern: by the time TODAYNO scans run (9:45am ET morning of), three things have happened that don’t apply to NEXTDAY:

NEXTDAY signals simply fire too early to capture this. They commit to a probability before the day’s atmospheric setup has resolved, and by the time that information arrives, Kalshi has already repriced. The historical advantage is small; the dollar capture is small.

Practical Implications

Caveats

Methodology note: each settlement is joined to the scan that would have driven its entry — NEXTDAY settlements join to the latest is_next_day = TRUE scan for that (city, date, market_type); TODAYNO settlements join to the morning-of is_next_day = FALSE AND entry_timing = 'today' scan. Bucket assignment uses the ensemble_spread captured at scan time, not a retrospective value. The cross-tab can be reproduced via /weather/sql Q3 with the “Market = high” filter applied; splitting further by entry_timing requires editing the WHERE clause manually.
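The join-then-bucket step in the methodology note can be sketched in plain JavaScript. This is a minimal sketch, not the live pipeline: the `settlements` and `scans` arrays are hypothetical stand-ins for the real collections, and field names are illustrative.

```javascript
// Bucket boundaries match the cross-tab above: calm / low / moderate / high / extreme.
function spreadBucket(spreadF) {
  if (spreadF < 3) return '<3';
  if (spreadF < 5) return '3-5';
  if (spreadF < 8) return '5-8';
  if (spreadF < 12) return '8-12';
  return '>=12';
}

// Join each settlement to the latest scan that would have driven its entry,
// keyed by (city, date, market_type) plus entry timing, as described above.
function crossTab(settlements, scans) {
  const table = {};
  for (const s of settlements) {
    const scan = scans
      .filter(
        (c) =>
          c.city === s.city &&
          c.date === s.date &&
          c.market_type === s.market_type &&
          c.entry_timing === s.entry_timing
      )
      .sort((a, b) => b.scanned_at - a.scanned_at)[0]; // latest matching scan
    if (!scan) continue; // no scan visible at signal time: excluded, never backfilled
    const key = `${s.entry_timing}|${spreadBucket(scan.ensemble_spread)}`;
    const cell = (table[key] ??= { n: 0, misses: 0, pnl: 0 });
    cell.n += 1;
    cell.misses += s.bracket_missed ? 1 : 0;
    cell.pnl += s.pnl;
  }
  return table;
}
```

The key property is that `ensemble_spread` comes from the joined scan, so the bucket reflects what was knowable at entry time.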

Ensemble Spread vs City Classification — When Disagreement Overrides Cohort

Observation Date: May 5, 2026 | Based on: 547 settled forecast-no nextday signals (~45-day window in PG, with scan-level ensemble spread joined)

The Question

Earlier today the SQL workbench surfaced an aggregate finding: 6-model ensemble spread ≥ 8°F correlates with an 80%+ bracket-miss rate, vs 65-69% baseline. Useful as a slate-wide signal — but it raised an obvious follow-up: does that effect apply equally to cities the classifier already flags as high-miss-rate, or is it concentrated in the cities we don’t normally trade?

If forecast disagreement is just a city-independent variability signal, both cohorts should elevate together. If it’s a redundant restatement of what the classifier already knows, only the low-miss cohort should react. Splitting the spread bucket by is_high_mae (the classifier flag at scan time) tests this directly.

The Cross-Tab

| Spread bucket | HIGH-miss cohort | low-miss cohort |
|---|---|---|
| < 3°F (calm) | 67.6% (n=34) | 70.4% (n=27) |
| 3-5°F (low) | 64.5% (n=93) | 66.7% (n=102) |
| 5-8°F (moderate) | 67.9% (n=112) | 75.0% (n=84) |
| 8-12°F (high) | 81.1% (n=37) | 79.4% (n=34) |
| ≥ 12°F (extreme) | 72.7% (n=11) | 100% (n=3 — ignore) |

Three Findings

The Strategy Gap

The current Forecast NO strategy completely ignores low-miss cities — they don’t clear the 70% blended-miss-rate threshold, so the scanner returns HOLD. But on high-spread days (≥8°F), low-miss cities have a 79% historical miss rate — well above the 70% bar the strategy uses for the high-miss cohort. That’s an untapped cohort where the signal isn’t “this city always busts” but rather “this day will bust.”

Conceptual extension: a “Disagreement NO” strategy that fires BUY_NO on low-miss cities specifically when ensemble spread ≥ 8°F. The classifier provides the long-run base rate; spread provides the day-specific overlay. Both layers required to fire. Low-miss cities at low spread = HOLD (already well-handled); low-miss cities at high spread = candidate for entry.
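The two-layer gate described above can be sketched in a few lines. This is a conceptual sketch under the stated rules, not deployed logic; the field names (`isHighMae`, `ensembleSpread`) are illustrative, not the live schema.

```javascript
// "Disagreement NO": fire BUY_NO on low-miss cities only when the day-specific
// ensemble-spread overlay says this day will bust. High-miss cities stay HOLD
// here because the existing Forecast NO strategy already covers them.
const SPREAD_THRESHOLD_F = 8; // >= 8°F 6-model spread, per the cross-tab above

function disagreementNoSignal({ isHighMae, ensembleSpread }) {
  if (isHighMae) return 'HOLD'; // classifier cohort: handled by the existing strategy
  return ensembleSpread >= SPREAD_THRESHOLD_F ? 'BUY_NO' : 'HOLD';
}
```

Both layers are required to fire: the classifier supplies the long-run base rate, the spread supplies the day-specific overlay.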

Caveats

Next Steps

Two reasonable paths:

Backtest first feels like the right order. If 60-day simulation shows positive WR/ROI at realistic Kalshi prices, then a dry-run sleeve. If the backtest is wash, the “79%” is statistical noise and we move on.

Methodology note: this analysis joins each settlement to the latest is_next_day = TRUE scan for that (city, date, market_type), which captures the ensemble_spread that was visible at signal-generation time, not retrospectively. The is_high_mae flag reflects what the classifier said when the signal would have fired — the right cohort definition because it matches what the live strategy uses to gate entry. Cross-tab query lives as Q3 on /weather/sql with an additional is_high_mae split available by adding it to the SELECT + GROUP BY.

Pricing vs Wx Conviction — Two Independent Layers

Observation Date: May 5, 2026 | Triggered by: MIA HIGH May 6 scan (40.6pp edge from pricing layer, “SKIP” from Wx column)

The Trigger

Tonight’s next-day scan for May 6 surfaced a strong signal on MIA HIGH: 89°F forecast, bracket 88-89, NO ask 49¢, 40.6pp edge with $94 of buyer depth within 3¢ of the best ask. This was the strongest single signal on the whole slate — the pricing layer was screaming take it.

At the same time, the Wx column on the dashboard read SKIP. Two analysis layers disagreeing on the same row, with one of them using a label that visually reads as “skip the trade.” That’s a real cognitive trap, especially for the cohort of cities (LAX, SFO, SEA, MIA) where the two layers always disagree by design.

The Two Layers Measure Different Things

The site has two structurally separate signal layers stacked on top of each other:

| Layer | What it measures | How it’s computed |
|---|---|---|
| Pricing (Edge / Parity / Blended) | Historical base rate — how often does this bracket miss for this city? | 14d/30d/60d blended miss rate from settled scans, optionally split by edge-position (top vs bottom of bracket) for cities with sufficient samples |
| Wx Conviction (HIGH / MED / BASE / LOW / N/A) | Today’s specific weather pattern — does this particular day look like a bust day on top of the base rate? | NWS Area Forecast Discussion keyword scoring + 6-model ensemble spread + ensemble-vs-NWS diff + active alerts |

These are orthogonal inputs. Pricing fires when the base rate exceeds 70% and the ask is below the safe-entry cap. Wx is a confidence overlay on top of pricing — it doesn’t override the pricing decision, it just adds context about today specifically.

Why the Inverted Cities Are Special

The Wx layer’s AFD heuristic was calibrated against inland-city patterns — keywords like trough, front, thunderstorm, uncertainty reliably correlate with bracket-miss days in cities like ATL, CHI, AUS, HOU, BOS, PHL, DC, PHX. In those cities, the AFD is a real signal: when the forecaster’s own discussion text contains bust language, miss rate elevates by 8-12 percentage points above baseline.

Marine-layer cities behave differently. LAX, SFO, SEA, MIA get their bust days from sea-breeze patterns, onshore flow timing, marine cloud burnoff, and coastal frontal interactions — not from the keywords the AFD heuristic looks for. The 1,530-day backfill analysis found these four cities had negative AFD–miss-rate correlation: stable AFD days actually missed more than volatile ones, opposite to inland cities. So the Wx layer flags them as inverted-tier and refuses to make a confidence call — not because the trade is bad, but because the Wx signal has nothing useful to say.

How to Read the Two Layers Together

| Pricing | Wx | Read |
|---|---|---|
| FIRE | HIGH/MED | Take it. Both layers agree. |
| FIRE | BASE | Take it. No conviction overlay either way; pricing carries the day. |
| FIRE | LOW | Trim or skip. Stable pattern + responsive city = Wx layer is leaning against pricing’s historical bias. |
| FIRE | N/A | Take it. Inverted city. Wx layer doesn’t apply. Pricing is the only signal — trust it. |
| HOLD | HIGH | Pass — pricing already rejected. HIGH on Wx alone is not a buy trigger. |
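The decision table can be encoded directly as a lookup. A minimal sketch (function and label names are illustrative, not the live dashboard code):

```javascript
// Pricing decides; Wx only annotates. HOLD from pricing is final even on Wx HIGH.
function readSignal(pricing, wx) {
  if (pricing !== 'FIRE') return 'PASS';
  switch (wx) {
    case 'LOW':
      return 'TRIM_OR_SKIP'; // Wx leaning against pricing's historical bias
    case 'HIGH':
    case 'MED':
    case 'BASE':
    case 'N/A': // inverted city: Wx has nothing useful to say, pricing stands alone
    default:
      return 'TAKE';
  }
}
```

Note the asymmetry the table encodes: Wx can trim a pricing FIRE, but it can never promote a pricing HOLD into a trade.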

UX Fix Shipped

The SKIP label on the Wx column for inverted cities was renamed to N/A. Same logic, clearer reading. The previous label conflated two unrelated meanings: “skip the Wx column” (intent) vs “skip the trade” (visual reading). With pricing now firing strong edge-based signals on MIA, the conflict was no longer hypothetical — the dashboard was actively producing contradictory advice on the same row.

Tooltip on the cell now states explicitly: “Inverted city — Wx column does not apply (AFD unreliable for marine-layer cities). Use pricing-layer signals as-is. This is NOT a skip-the-trade flag.”

What This Doesn’t Solve

The four inverted cities still have no day-specific conviction overlay — only their long-run base rate. That’s a real gap. A complete fix would be a marine-layer-specific conviction layer trained on those four cities’ history, looking at sea-breeze indicators, onshore-flow timing, and marine-stratus burnoff probabilities instead of the inland AFD vocabulary. Different project. Not high-leverage today — the pricing-layer edge on inverted cities (MIA top-edge cohort: 89.6% miss, n=25) is already strong enough to act on without conviction overlay.

For now: when a pricing signal fires on LAX, SFO, SEA, or MIA, and the Wx column reads N/A, that’s the system working as designed. The signal stands alone, and that’s sufficient.

Architecture note: the May 5 edge-pricing pilot (CHI + MIA flipped to edge-position-based pricing) made this conflict visible because MIA HIGH started firing larger, more confident signals from the pricing layer. The Wx layer was already saying “ignore me for MIA”; that just wasn’t a problem when MIA HIGH rarely fired strong signals under parity pricing. Pilot rollouts surface UX issues that didn’t matter before.

First 3 Days Live — Diagnosing the Underperformance

Observation Date: April 23, 2026 | Based on: 33 settled live signals (Apr 21–22) vs 270-signal backtest baseline

The Headline Numbers Are Ugly

Three days into live deployment of the adjacent-bracket YES strategy, live performance is running far below backtest:

| | n | WR | P&L | ROI |
|---|---|---|---|---|
| Backtest | 270 | 49% | +$44.02 | +50% |
| Live (Apr 21–22) | 33 | 21% | −$3.54 | −34% |
| Delta | | −28pp | | −84pp |

Apr 21 was 5W/10L (33% WR). Apr 22 was 2W/16L (11% WR). That prompted this diagnostic. The question isn’t “is the strategy broken” — 33 bets is not enough data to answer that. The question is: does the mechanism still look right, or did something change?

The Forecast Error Distribution Shifted

The adjacent-bracket YES strategy structurally needs the forecast to miss its own bracket by ≥ 1.5–2°F. Comparing the forecast error distribution between backtest and live windows:

| Metric | Backtest (246 days) | Live (28 days) | Δ |
|---|---|---|---|
| Mean \|error\| | 1.80°F | 1.68°F | −0.12°F |
| Median \|error\| | 2.0°F | 1.0°F | −1.0°F |
| Within 1°F (forecast bullseye) | 49.6% | 60.7% | +11pp |
| Error ≥ 2°F (strategy-winnable) | 50.4% | 39.3% | −11pp |

Mean error looks similar, but that hides the real story. The median dropped from 2°F to 1°F. The live window is dominated by the “forecast nailed it” bin — 60.7% of day-city pairs had the actual within 1°F of forecast, vs 49.6% in backtest. That’s exactly the condition where the adjacent-bracket strategy cannot win: the truth sits inside the forecast bracket, so both ±1 bets lose by design.

Bucketing the 33 Live Bets by What Actually Happened

For each settled bet, classify by (a) how big the forecast miss was, and (b) whether the miss direction matched the bet’s offset:

| Condition | Count | % | Record | WR |
|---|---|---|---|---|
| Error < 1.5°F (quiet weather, both adj bets lose) | 14 | 42% | structural loss | — |
| Error ≥ 1.5°F, wrong direction (bet +1 but miss was cold) | 11 | 33% | 0W/11L | 0% |
| Error ≥ 1.5°F, right direction (bet direction matched miss) | 8 | 24% | 2W/6L | 25% |
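The two-way classification above reduces to a small function. A sketch (field names are illustrative: `offsetSign` is the bet's bracket offset, +1 or −1; `errorF` is actual minus forecast in °F):

```javascript
// Classify a settled adjacent-bracket bet by (a) forecast-miss size and
// (b) whether the miss direction matched the bet's offset.
function classifyBet({ offsetSign, errorF }) {
  // Misses under ~1.5°F keep the truth inside the forecast bracket,
  // so both +/-1 bets lose by design.
  if (Math.abs(errorF) < 1.5) return 'quiet';
  const missSign = errorF > 0 ? +1 : -1;
  return missSign === offsetSign ? 'right-direction' : 'wrong-direction';
}
```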

Two findings jump out:

Per-Day Read

Is the Strategy Broken?

Probably not, but it’s also not exonerated. Two things are true simultaneously:

Both effects need more data to disentangle. 33 bets is one noisy week of weather; the backtest was 246 days of varied conditions.

What I’m Watching

No Action Yet

33 bets is not a sample. I will not retune city groups, offset rules, or conviction thresholds on this data. The strategy stays in dry-run through at least Sunday. If by end of next weekend the WR is still sub-30% on 80+ bets, that’s the re-assessment point.

Mechanism note: this observation quantifies a known failure mode. The adjacent-bracket YES strategy loses money on quiet-weather days and loses more money on volatile days with random-direction misses. Profitability depends on (1) weather being variable enough to produce 1.5°F+ forecast misses, and (2) the model correctly predicting which direction those misses lean per city. Apr 21–22 underdelivered on both. Unclear yet whether (1) is a seasonal dip or (2) is a structural gap.

Forecast-Bracket YES — The Mirror Edge

Observation Date: April 22, 2026 | Based on: same 392 joined city-days from the Apr 21 research, re-sliced at offset 0 per (city, timepoint)

Setup

The Apr 21 observation established that buying YES on the adjacent (±1) bracket beats buying YES on the forecast bracket in aggregate — the forecast bracket has slight negative EV across the full sample. That conclusion is correct in aggregate but masks a cleaner per-city split, which surfaced as soon as we asked: are there any cities where forecast-bracket YES actually works?

Answer: yes, 10 (city, timepoint) combinations have positive-EV forecast-bracket YES with n ≥ 5 tradeable opportunities. Five of those are above +60% ROI. The top entry is DEN nextday at +106% ROI, the single strongest individual-strategy ROI in the entire weather dataset.

Top Positive-EV Forecast-Bracket Strategies

| City | Timepoint | n | Hit% | Avg cost | EV | ROI |
|---|---|---|---|---|---|---|
| DEN | nextday | 10 | 60% | $0.29 | +$0.309 | +106% |
| DEN | morning | 10 | 60% | $0.34 | +$0.261 | +77% |
| CHI | morning | 16 | 50% | $0.28 | +$0.221 | +79% |
| ATL | afternoon | 6 | 100% | $0.78 | +$0.220 | +28% |
| SFO | nextday | 11 | 55% | $0.33 | +$0.212 | +63% |
| CHI | nextday | 7 | 43% | $0.25 | +$0.181 | +73% |
| ATL | nextday | 10 | 60% | $0.44 | +$0.165 | +38% |
| LAX | nextday | 6 | 50% | $0.34 | +$0.163 | +49% |
| PHX | nextday | 10 | 60% | $0.45 | +$0.151 | +34% |
| AUS | afternoon | 8 | 75% | $0.62 | +$0.134 | +22% |

Plus seven weaker-but-positive entries (ATL morning, MIA ND/AM, PHL ND, HOU AM, NYC AM, PHX AM) in the +2% to +16% ROI range. Complete list and methodology in scripts/forecast-evolution.js.

The Two Cohorts — Forecast-Accurate vs Forecast-Inaccurate

The cities that make forecast-bracket YES profitable are exactly the cities where the NWS forecast lands inside its 2°F bracket more often than the market prices in. Buy a 60%-probability outcome at $0.30 = +100% ROI. There’s no mystery once you sort the table.

These two strategies are mirror images, not competitors. For a given (city, timepoint):

| Forecast behavior | Winning trade | Why |
|---|---|---|
| Forecast lands in its own bracket reliably | Buy YES on offset 0 | Market underprices the forecast-bracket hit rate |
| Forecast misses its own bracket reliably | Buy YES on offset ±1 | Market anchors on forecast; adjacent brackets are where the real winner lives |

Why Our “Loser” Cities Aren’t Actually Losers

The Apr 21 blog post labeled DEN, ATL, and LAX as “loser” cities and said LAX “doesn’t play.” That labeling was correct for the adjacent-bracket strategy but misleading as a blanket statement. These cities are actually among the best targets for forecast-bracket YES. We called them losers because we were only looking through the ±1 lens.

In particular, DEN is the strongest single city in the weather dataset when you include offset 0: +106% ROI on nextday, +77% on morning. These were zero-positive-EV cells under the Apr 21 analysis because the adjacent brackets lose at DEN — which is exactly the observable consequence of DEN being forecast-accurate.

Updated City Cohort Assignments (proposed)

If we want to deploy forecast-bracket YES as a complementary strategy, the clean assignment per (city, timepoint) looks like this. No cell should have both offset 0 and offset ±1 firing at once — that would be internally contradictory.

| City | Nextday | Morning | Afternoon |
|---|---|---|---|
| DEN | offset 0 (+106%) | offset 0 (+77%) | — |
| CHI | offset 0 (+73%) | offset 0 (+79%) | — |
| ATL | offset 0 (+38%) | offset 0 (+16%) | offset 0 (+28%) |
| SFO | offset 0 (+63%) | offset ±1 (+8%) | — |
| LAX | offset 0 (+49%) | — | — |
| PHX | offset 0 (+34%) | offset 0 marginal (+2%) | — |
| AUS | offset +1 (+90%) | offset +1 (+46%) | offset 0 (+22%) |
| MIA | offset +1 (+53%) | offset +1 (+18%) | — |
| PHL | offset +1 (+66%) | offset +1 (+63%) | — |
| DAL | offset −1 (+130%) | offset −1 (+78%) | — |
| OKC | offset −1 (+114%) | offset −1 (+130%) | — |
| HOU | offset −1 (+37%) | offset −1 (+14%) | — |
| SEA | offset +1 (+57%) | offset +1 (+35%) | offset +1 (+46%) |
| LAS | offset −1 (+23%) | — | — |
| BOS | offset +1 (+38%) | offset +1 (+137%) | — |
| NYC | offset +1 (+57%) | — | — |
| DC | — | offset −1 (+209%) | — |

AUS is the only city with a split pattern — forecast-inaccurate early (ND/AM use offset +1) but forecast-accurate late (afternoon uses offset 0). Most cities have one consistent direction.

Caveats

Implication for the Apr 21 Dashboard

The live dashboard currently suppresses signals on “loser” cities (DEN, ATL, LAX). Under the mirror-edge frame, those suppressions are actively costing us signals — DEN nextday and ATL nextday would be among the highest-EV fires in the whole system if we surfaced them. If we decide to integrate forecast-bracket YES into live logic, the dashboard will need a way to show which offset is being played per city, and the existing “HIGH / MEDIUM / none” conviction system needs a third bucket for offset-0 signals.

Corollary to the Apr 21 Forecast Evolution observation. Same source data, same caveats about sample size. This finding complements rather than replaces the adjacent-bracket edge — they apply to disjoint city subsets and neither is a free lunch. Re-run scripts/forecast-evolution.js with the offset-0 slice any time to refresh. If/when live Apr 21+ data confirms the ±1 edge is holding, this offset-0 set is the natural next expansion.

Forecast Evolution — The Adjacent-Bracket YES Edge

Observation Date: April 21, 2026 | Based on: 392 joined city-days (Jan 20 – Apr 19), NWS forecast at 3 timepoints × Kalshi expiration_value truth

Setup

We have four temperature readings per city-day that matter for YES/NO pricing:

We joined all four per (city, date) using scripts/forecast-evolution.js. For each scan, we know (a) which bracket contained the forecast, (b) which bracket actually won on Kalshi, (c) the yesAsk / noAsk for every bracket in the book, and (d) the offset between forecast and winner (in 2°F bracket indices). Then we evaluated the strategy “always buy offset X YES at timepoint T” for every (T, X), limited to brackets that were tradeable (15¢ ≤ yesAsk < 95¢).
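The evaluation step above can be sketched as follows. The row shape (`timepoint`, `yesAskByOffset`, `winnerOffset`) is a hypothetical stand-in for the forecast-evolution.js join output; the tradeable filter and the EV definition follow the text.

```javascript
// Evaluate "always buy offset X YES at timepoint T" over joined city-days,
// restricted to tradeable brackets (15¢ <= yesAsk < 95¢).
// EV per $1 contract = P(win) * $1 - avg cost; ROI = EV / avg cost.
function evaluateStrategy(rows, timepoint, offset) {
  const picked = rows.filter(
    (r) =>
      r.timepoint === timepoint &&
      r.yesAskByOffset[offset] !== undefined &&
      r.yesAskByOffset[offset] >= 0.15 &&
      r.yesAskByOffset[offset] < 0.95
  );
  const n = picked.length;
  if (n === 0) return null;
  const hits = picked.filter((r) => r.winnerOffset === offset).length;
  const avgCost = picked.reduce((s, r) => s + r.yesAskByOffset[offset], 0) / n;
  const hitRate = hits / n;
  const ev = hitRate - avgCost;
  return { n, hitRate, avgCost, ev, roi: ev / avgCost };
}
```

For example, two tradeable afternoon rows at asks 40¢ and 30¢ with one win give hitRate 0.5, avgCost 0.35, EV +$0.15 per contract.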

Forecast Accuracy Improves Through the Day

| Timepoint | n | Miss rate (offset ≠ 0) | P(forecast bracket wins) |
|---|---|---|---|
| nextday (eve before) | 158 | 65.8% | 34.2% |
| morning | 189 | 60.3% | 39.7% |
| afternoon | 169 | 54.4% | 45.6% |

The forecast bracket wins 45.6% of the time when we look at 3pm–5pm ET forecasts, up from 34.2% the evening before. Operationally this confirms what we already believed: fresher forecasts are better, and afternoon scans should drive signal conviction.

The Drift Sign-Flip

When we bucket by how the forecast evolved between timepoints, miss rate is asymmetric in a surprising way:

| Drift | Morning miss vs nextday | Afternoon miss vs morning |
|---|---|---|
| cool 1–3°F | 58.6% (n=29) | 63.2% (n=38) |
| stable \|d\|<1°F | 61.6% (n=99) | 53.2% (n=79) |
| warm 1–3°F | 69.2% (n=26) | 46.2% (n=39) |

Forecast warming from nextday to morning predicts MORE misses (69.2%). Forecast warming from morning to afternoon predicts FEWER misses (46.2%). Likely mechanism: morning warming revisions are often overcorrection to a single model run; afternoon warming revisions are the forecaster tracking actual observed warming. Early warming = noise; late warming = signal.

The Central Finding: Adjacent Brackets Have Positive YES EV

Ranked by expected value per $1 contract, tradeable only (15¢ ≤ yesAsk < 95¢):

| Strategy | n | Hit rate | Avg cost | EV/contract | ROI |
|---|---|---|---|---|---|
| afternoon, offset +1 | 52 | 51.9% | $0.394 | +$0.126 | +32.0% |
| nextday, offset −1 | 122 | 35.2% | $0.300 | +$0.053 | +17.7% |
| nextday, offset +1 | 116 | 31.0% | $0.279 | +$0.032 | +11.5% |
| afternoon, offset −1 | 57 | 54.4% | $0.516 | +$0.028 | +5.4% |
| morning, offset ±1 | 139/124 | 33.1% | $0.316 | +$0.015 | +4.7% |
| morning, offset 0 (forecast) | 194 | 37.6% | $0.388 | −$0.012 | −3.1% |
| nextday, offset 0 (forecast) | 163 | 31.9% | $0.336 | −$0.017 | −5.1% |
| afternoon, offset 0 (forecast) | 97 | 49.5% | $0.583 | −$0.088 | −15.1% |

The forecast bracket itself has slightly negative YES EV at every timepoint. Every adjacent-bracket strategy has positive EV. The afternoon +1 bracket is the strongest single edge: 51.9% hit rate at an average cost of 39.4¢ = 32% ROI over 52 observations. This is comparable to our live Forecast NO P&L (+34.3% ROI on 24 signals).

Per-City Breakdown (Top Strategies)

Aggregate wins conceal a lot. Per-city EV for strategies with n ≥ 5 tradeable opportunities:

| City | Tier | ND +1 | ND −1 | AM +1 | AM −1 | PM +0 (fcst) |
|---|---|---|---|---|---|---|
| AUS | responsive | n=9 56% +$0.263 | n=7 29% +$0.020 | n=16 44% +$0.137 | n=11 27% −$0.057 | n=8 75% +$0.134 |
| MIA | inverted | n=9 67% +$0.232 | — | n=17 53% +$0.080 | — | — |
| PHL | responsive | n=9 44% +$0.178 | n=8 25% +$0.014 | n=9 56% +$0.217 | n=5 20% $0.000 | — |
| DAL | neutral | n=5 40% +$0.134 | n=9 67% +$0.377 | n=5 20% −$0.052 | n=9 56% +$0.244 | n=7 43% −$0.070 |
| OKC | neutral | n=8 13% −$0.134 | n=9 56% +$0.297 | n=7 14% −$0.196 | n=6 67% +$0.377 | n=7 43% −$0.087 |
| HOU | responsive | — | n=9 56% +$0.152 | — | n=9 44% +$0.054 | — |
| BOS | responsive | n=7 29% +$0.079 | n=6 17% −$0.062 | n=6 50% +$0.288 | n=6 17% −$0.092 | — |
| DC | responsive | n=7 29% −$0.020 | n=6 33% −$0.030 | — | n=5 80% +$0.542 | — |
| SEA | inverted | n=8 50% +$0.182 | n=6 33% +$0.045 | n=9 44% +$0.114 | n=8 25% −$0.250 | — |
| LAS | neutral | — | n=11 55% +$0.102 | — | n=11 45% −$0.054 | n=8 25% −$0.216 |
| NYC | neutral | n=7 43% +$0.157 | n=5 20% −$0.106 | n=13 8% −$0.242 | n=10 10% −$0.198 | — |
| SFO | inverted | n=9 22% −$0.001 | n=11 18% −$0.095 | n=7 29% +$0.080 | n=10 40% +$0.080 | n=9 33% −$0.097 |
| ATL | responsive | n=10 10% −$0.156 | n=8 25% +$0.016 | n=7 14% −$0.119 | n=9 22% −$0.024 | n=6 100% +$0.220 |
| DEN | neutral | n=6 17% −$0.048 | n=10 20% −$0.075 | n=6 17% −$0.092 | n=9 22% −$0.073 | n=7 57% −$0.091 |
| CHI | responsive | n=9 11% −$0.119 | n=15 33% +$0.057 | n=8 13% −$0.144 | — | n=7 57% −$0.131 |
| LAX | inverted | n=5 20% −$0.120 | n=5 20% −$0.126 | n=6 33% +$0.005 | — | — |
| PHX | responsive | n=8 38% −$0.021 | n=7 29% −$0.030 | n=6 33% +$0.025 | — | n=10 50% −$0.120 |

Cells show n / hit% / EV. Green = positive EV, red = clearly negative. Dashes mean <5 tradeable opportunities. Old AFD tier labels included for reference — they are not predictive of which strategies work per city.

Three Groups Emerge

The “AFD tier” label does NOT predict membership in these groups. MIA and SEA (labeled “inverted” in AFD) are in the consistent-winner group; ATL (labeled “responsive”) is in the consistent-loser group. Whatever AFD tier was measuring, it isn’t “will this city yield positive YES EV.”

Why the Market Misprices Adjacent Brackets

Hypothesis: traders anchor to the NWS forecast and preferentially buy YES on the forecast bracket, inflating its price. Adjacent brackets get the residual liquidity — thin books and wider spreads mean they’re systematically underpriced relative to their true 20–35% hit probability. The afternoon +1 bracket is especially extreme because by 3pm ET the actual afternoon temperature is already climbing, and the +1 bracket is often the correct call that the book hasn’t fully repriced yet.

This is the mirror image of the Forecast NO edge: that strategy exploits the market underpricing the “forecast misses” outcome. This new finding exploits the market overpricing the “forecast hits” outcome. They’re consistent — both say the market is too confident in the forecast bracket being the winner.

Caveats & Next Steps

Operational Implications

If we want to act on this, the clearest minimum-risk starting point is:

This analysis uses scripts/forecast-evolution.js and the HistoricalConviction + Scan + Settlement collections. Row-level data is exported to scripts/out/forecast-evolution.jsonl. Re-run after each settlement cycle to track how edge estimates evolve.

Station vs Grid — Kalshi Settlement Locations & NWS Forecast Variance

Observation Date: April 20, 2026 | Status: initial documentation, warrants further research

The Issue

Kalshi settles temperature markets on the CF6 climate report from a specific airport (or park) weather station. NWS forecasts are for a ~2.5km grid cell at a lat/lon point, not the station itself. The grid cell averages over a broader area, while the station is a single-point reading affected by its immediate surroundings. This mismatch is a systematic source of forecast error that varies by city.

Some of the “forecast bust” days in our data may not be NWS getting the weather wrong — they may be the NWS grid forecast not matching the specific microclimate at the Kalshi settlement station.

Settlement Stations by City

| City | Station | Station Name | Lat | Lon | Microclimate Risk |
|---|---|---|---|---|---|
| NYC | KNYC | Central Park | 40.779 | −73.969 | HIGH — Not an airport. Urban heat island + park cooling. Unique microclimate that NWS grid doesn’t specifically model. Central Park can read 2-4°F different from surrounding Manhattan. |
| LAX | KLAX | LAX Airport | 33.938 | −118.389 | HIGH — Coastal airport directly affected by marine layer. Can be 10°F+ colder than points 5 miles inland on the same day. NWS grid averages over the LA basin; LAX station sits right on the marine-layer boundary. Likely explains why LAX is “AFD-inverted” in the backfill. |
| SFO | KSFO | SFO Airport | 37.621 | −122.379 | HIGH — Peninsula airport surrounded by bay water on three sides. Extreme fog/marine-layer sensitivity. Same “inverted” AFD pattern as LAX. The NWS grid cell includes inland areas that behave completely differently. |
| PHX | KPHX | Sky Harbor | 33.437 | −112.008 | MED |
| MIA | KMIA | Miami International | 25.793 | −80.291 | MED — Coastal proximity + sea-breeze effects. Another “inverted” AFD city. The station is inland enough to avoid direct ocean moderation but close enough for sea-breeze timing to matter. |
| SEA | KSEA | SeaTac Airport | 47.450 | −122.309 | MED — Puget Sound marine influence. Inland from the coast but maritime air penetrates the gap. Third “inverted” city in the AFD analysis. |
| BOS | KBOS | Logan Airport | 42.366 | −71.010 | MED — Harbor-adjacent airport. Sea breeze can drop temps 5-10°F on summer afternoons vs inland. East wind = marine cooling; west wind = continental heating. NWS grid may not capture the harbor effect precisely. |
| DC | KDCA | Reagan National | 38.851 | −77.040 | MED — Potomac River airport. Urban heat island + river cooling creates a complex microclimate. Can differ from Dulles (KIAD) by 3-5°F on the same day. |
| CHI | KMDW | Midway Airport | 41.787 | −87.752 | LOW — Inland urban airport. Less microclimate variability than coastal stations. Lake Michigan effect is weaker at Midway (10 miles inland) vs O’Hare or lakefront. |
| ATL | KATL | Hartsfield | 33.641 | −84.428 | LOW — Large inland airport. Minimal microclimate effects. Good grid-to-station alignment expected. |
| AUS | KAUS | Bergstrom Airport | 30.195 | −97.670 | LOW — Inland airport in flat terrain. Minimal microclimate offset expected. |
| DEN | KDEN | Denver Intl | 39.856 | −104.674 | MED — Airport is on the eastern plains, ~25 miles from the foothills. Chinook winds can create extreme local warming (20°F+ in hours) that the grid may underforecast. Elevation: 5,431 ft. |
| PHL | KPHL | PHL Airport | 39.872 | −75.241 | LOW — Inland airport. Delaware River nearby but minimal direct marine influence. |
| HOU | KHOU | Hobby Airport | 29.646 | −95.279 | MED — Corrected Apr 21: Kalshi settles on KHOU (Hobby) not KIAH (Intercontinental). Hobby is ~20mi south, closer to Galveston Bay — Gulf moisture reaches Hobby before IAH. Explains the +67% Kalshi/ACIS drift we saw in Apr 20 data when we were fetching KIAH actuals. |
| OKC | KOKC | Will Rogers Airport | 35.393 | −97.601 | LOW — Great Plains airport. Flat terrain, minimal microclimate effects. Good grid alignment. |
| LAS | KLAS | McCarran Airport | 36.084 | −115.154 | LOW — Desert airport. Consistent conditions. Urban heat island effect in the valley, but NWS grid likely captures it. |
| DAL | KDFW | DFW Airport | 32.900 | −97.040 | LOW — Large inland airport. Flat terrain, minimal microclimate effects. |

Correlation with AFD Tier Findings

The four “AFD-inverted” cities from the backfill (Apr 18 finding) are exactly the four highest microclimate-risk stations:

This is likely not a coincidence. The AFD discusses synoptic-scale weather patterns (fronts, troughs, ridges). For coastal/marine-layer cities, the local station temperature is dominated by micro-scale marine effects that the AFD doesn’t address. On “stable” days (high pressure, ridge), these cities have MORE micro-variability because the synoptic pattern is quiet but the marine layer position shifts unpredictably. On “volatile” days (fronts, troughs), the strong synoptic forcing actually OVERRIDES the marine variability and makes the station more predictable.

This explains the inversion: stable AFD → marine layer dominates → station is unpredictable. Volatile AFD → synoptic pattern dominates → station follows the grid forecast more closely.

Research Needed

Potential Impact on Strategy

If the grid-to-station offset is consistent per city (e.g., LAX station always reads 2°F cooler than the grid forecast on marine-layer days), that’s a free calibration adjustment we could add to the model. It would effectively give us a better “station-specific forecast” without needing a new data source.

If the offset is variable (sometimes +3°F, sometimes -2°F), it means the station microclimate adds irreducible noise that no forecast can capture — and the right response is to widen the confidence interval (larger sigma) for those cities, or avoid them entirely for tight-bracket NO bets.
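The consistent-vs-variable decision above can be sketched as a per-city policy function. The thresholds here are illustrative assumptions, not tuned values, and `offsetsF` (actual station reading minus grid forecast, in °F, one entry per day) is a hypothetical input.

```javascript
// Decide per city whether the grid-to-station offset is a free calibration
// (consistent bias) or irreducible noise (variable), per the logic above.
function stationOffsetPolicy(offsetsF) {
  const n = offsetsF.length;
  const mean = offsetsF.reduce((a, b) => a + b, 0) / n;
  const variance = offsetsF.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const stdev = Math.sqrt(variance);
  if (Math.abs(mean) >= 1 && stdev < 1.5) {
    // Consistent bias: fold the mean offset into the forecast.
    return { policy: 'calibrate', adjustF: mean };
  }
  if (stdev >= 2.5) {
    // Station noise no forecast captures: widen sigma or avoid tight-bracket NO.
    return { policy: 'widen-sigma', stdev };
  }
  return { policy: 'no-action', mean, stdev };
}
```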

This observation connects the Apr 18 AFD inversion finding (coastal cities have inverted AFD signal) to a physical mechanism (marine-layer microclimate at the station). The per-city AFD tiers (responsive vs inverted) may be a proxy for “how much does the station microclimate differ from the NWS grid forecast.” Further research: quantify the offset per city from existing settlement data.

Wx Overlay Refinement Roadmap — NO/YES Signal Integration

Observation Date: April 19, 2026 | Based on: 1,530 city-day backfill analysis (Apr 18 findings)

Context

The Apr 18 backfill validated that AFD keyword analysis HAS predictive value for forecast bust days — but the current scoring is saturated (67% of days at cap), the signal is city-dependent (inverted for coastal cities), and specific keywords carry almost all the weight. This roadmap prioritizes the next steps by impact × effort.

Tier 1 — Data-Driven Fixes (High Impact, Low Effort)

1. Recalibrate AFD Keyword Weights

The empirical keyword deltas from the 1,530-day backfill tell us exactly what to change:

| Keyword | Current Weight | Empirical Delta | Action |
|---|---|---|---|
| trough | +0.04 | +9.9pp miss | Raise to +0.10 or higher |
| front | +0.04 | +9.7pp miss | Raise to +0.10 |
| rapidly | +0.05 | +7.8pp miss | Raise to +0.08 |
| uncertainty | +0.06 (instability) | −1.7pp miss | Move to neutral (0.00) or slight stability |
| dry | −0.03 | −3.5pp miss | Keep or reduce slightly |
| clear | −0.03 | −2.7pp miss | Keep |
| fair, calm, ridge, high pressure | −0.03 to −0.06 | < ±3pp | Reduce toward 0 — too noisy |

Estimated effort: 30 minutes. Config change in src/stability.js keyword arrays. Immediate impact on live Wx column.
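A sketch of what the recalibrated weights could look like, in the general shape of keyword arrays; the actual src/stability.js structure may differ, and the weights below restate the table above rather than any final tuning.

```javascript
// Recalibrated weights per the empirical deltas above (sketch, not the live config).
const VOLATILITY_KEYWORDS = [
  { term: 'trough', weight: 0.10 },  // was +0.04; +9.9pp empirical miss delta
  { term: 'front', weight: 0.10 },   // was +0.04; +9.7pp
  { term: 'rapidly', weight: 0.08 }, // was +0.05; +7.8pp
];
const STABILITY_KEYWORDS = [
  { term: 'dry', weight: -0.03 },    // -3.5pp: keep
  { term: 'clear', weight: -0.03 },  // -2.7pp: keep
  // 'uncertainty' moved to neutral (0.00): empirically -1.7pp, mis-signed before
];

// Score an AFD text by summing weights of matched keywords (simple substring match).
function scoreAfd(text, keywords = [...VOLATILITY_KEYWORDS, ...STABILITY_KEYWORDS]) {
  const lower = text.toLowerCase();
  return keywords
    .filter((k) => lower.includes(k.term))
    .reduce((s, k) => s + k.weight, 0);
}
```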

2. Per-City AFD Enablement

Split cities into three tiers based on the backfill’s volatile-vs-stable gap:

| Tier | Cities | Gap | Action |
|---|---|---|---|
| AFD-Responsive | HOU, ATL, PHX, AUS, BOS, PHL, DC, CHI | +20 to +44pp | Apply AFD conviction to NO/YES pricing |
| AFD-Neutral | DEN, LAS, OKC, DAL, NYC | < 15pp | Use overall miss rate only; show Wx for info |
| AFD-Inverted | LAX, SFO, SEA, MIA | Negative | Exclude from AFD-based pricing entirely |
Store as a per-city config flag in config/cities.js (e.g., afdTier: 'responsive' | 'neutral' | 'inverted'). The Wx column on Forecast NO would still show for all cities (observational), but only AFD-responsive cities would have their cap adjusted in Phase 2.

Estimated effort: 30 minutes. Config + classifier change.
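The per-city flag could take roughly this shape (a sketch in the `afdTier` form suggested above; assignments follow the tier table, but this is not the actual config/cities.js contents):

```javascript
// Per-city AFD tier flag: 'responsive' | 'neutral' | 'inverted'.
const CITY_AFD_TIER = {
  HOU: 'responsive', ATL: 'responsive', PHX: 'responsive', AUS: 'responsive',
  BOS: 'responsive', PHL: 'responsive', DC: 'responsive', CHI: 'responsive',
  DEN: 'neutral', LAS: 'neutral', OKC: 'neutral', DAL: 'neutral', NYC: 'neutral',
  LAX: 'inverted', SFO: 'inverted', SEA: 'inverted', MIA: 'inverted',
};

// Only responsive cities get their pricing cap adjusted; every city still
// shows the Wx column for information.
function afdAdjustsPricing(city) {
  return CITY_AFD_TIER[city] === 'responsive';
}
```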

3. Recompute Correlation Using NWS Forecasts + Real Brackets

The backfill used Open-Meteo GFS forecasts (1.2°F avg error) and synthetic brackets. The Settlement collection has NWS forecasts + Kalshi-verified actuals for 30+ days. Cross-joining those with HistoricalConviction AFD data would give numbers directly comparable to the live Forecast NO classifier — and likely show a wider volatile-vs-stable gap since NWS point forecasts are less accurate than GFS.

Estimated effort: 1 hour. Query + analysis script.

Tier 2 — New Signals to Integrate (Medium Impact, Medium Effort)

4. Binary “Trough or Front” Flag

The keyword analysis shows “trough” and “front” carry almost all the predictive signal (+10pp each). A simple boolean — “did the AFD mention trough or front?” — might outperform the complex 24-keyword composite. Test it against the backfill data before building. If it works, it’s the simplest possible conviction signal: one bit, +10pp edge.
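A sketch of the one-bit flag, assuming word-boundary matching (whether "frontal" should also count is exactly the kind of question the backfill test should answer):

```javascript
// One-bit conviction signal: did the AFD mention a trough or a front?
// \b boundaries deliberately exclude derived forms like "frontal";
// relaxing that is a tuning decision for the backfill test.
const TROUGH_OR_FRONT = /\b(trough|front)\b/i;

function troughOrFrontFlag(afdText) {
  return TROUGH_OR_FRONT.test(afdText);
}
```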

5. Forecast-vs-Model Diff (Free Alternative to Ensemble Spread)

The backfill has both GFS forecasts and NWS forecasts (via Settlement). Computing |GFS − NWS| for each historical day gives a “model disagreement” signal without needing the paid Open-Meteo ensemble tier. Large divergence = NWS anchored to a different model = higher bust probability. Can be computed from existing data with zero new API calls.
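A sketch of the divergence signal, assuming the function name and a 3°F threshold (the threshold should be tuned against the backfill, not taken from here):

```javascript
// Free "model disagreement" proxy: absolute gap between the GFS high
// (backfill) and the NWS point forecast (Settlement collection).
// The 3°F default threshold is an assumption.
function forecastModelDiff(gfsHigh, nwsHigh, threshold = 3) {
  const diff = Math.abs(gfsHigh - nwsHigh);
  return { diff, divergent: diff >= threshold };
}
```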

6. Time-of-Day AFD Weighting

The IEM archive timestamps each AFD product. The morning AFD (~4am local) is the forecast the market prices off of. The afternoon AFD often reflects what actually happened. Scoring only the morning AFD (closest to market-open) may produce a cleaner signal than averaging all AFDs for the day.

Tier 3 — New Strategy Components (High Impact, Needs More Data)

7. Forecast YES Scanner

Build the inverse of Forecast NO: on neutral-AFD days (0.96–1.05), scan for YES contracts on the forecast bracket priced ≤ 35¢. The backfill shows a 67% hit rate on these days; even if the realized hit rate falls to a far more conservative 35–40%, a 29¢ entry still returns roughly +21–38% ROI.

Needs its own page, signal tracking, and settlement scoring — parallel to the existing Forecast NO pipeline. The conviction overlay becomes the signal switch: Red Wx → buy NO, Green Wx → buy YES, Gray → baseline only.

Wait until keyword recalibration (Tier 1) is done — otherwise the YES scanner uses the same saturated scoring.

8. Conviction-Weighted Sizing

Instead of flat $10/signal, size each Forecast NO position by conviction:

This is the Phase 2 the Apr 15 observation described, but now calibrated with empirical weights from the backfill rather than guessed.

9. 180-Day Backfill Extension + Seasonal Analysis

Current 90 days cover late January through mid-April (winter → spring transition). Extending to 180 days adds the fall → winter transition and reveals whether the AFD signal is seasonal. “Trough” and “front” might be more predictive in transitional seasons than in stable summer ridges.

Run: node scripts/backfill-historical-conviction.js --days 180 (takes ~45 min with AFD scraping).

Recommended Sequence

  1. Items 1 + 2 together (keyword recalibration + per-city enablement) — ~1 hour, immediately improves the live Wx column
  2. Item 4 (binary trough/front flag) — test against backfill data, 30 min
  3. Item 3 (NWS forecast correlation) — validates with the real data source, 1 hour
  4. Item 7 (Forecast YES scanner) — biggest new revenue stream, builds on recalibrated scoring
  5. Items 5, 6, 8, 9 as time permits

This roadmap follows the Apr 18 backfill findings. All Tier 1 items use data we already have — no new API calls or paid services. The Forecast YES scanner (Tier 3) is the largest upside opportunity but depends on Tier 1 calibration being done first so the conviction signal is trustworthy.

Historical Conviction Backfill — AFD Validation Results

Observation Date: April 18, 2026 | Data: 1,530 city-days (17 cities × 90 days, Jan 18 – Apr 17 2026) | Sources: Open-Meteo Historical Forecast API + RCC-ACIS actuals + Iowa State IEM AFD archive

Summary

Backfilled 90 days of historical conviction data to validate whether the Wx overlay (AFD keyword scoring) correlates with forecast bust days. The answer is mixed: the signal exists but the scoring needs recalibration before it’s actionable.

Finding 1: AFD Scoring Is Saturated

67% of all days hit the AFD factor cap (1.25). Keywords like “front,” “trough,” and “storm” appear in nearly every AFD because forecasters always discuss some weather feature. The scoring has no discrimination power for most days.

| AFD Level | Miss Rate | n | % of Days |
| --- | --- | --- | --- |
| ≤ 0.90 (very stable) | 48.0% | 229 | 15% |
| 0.96–1.05 (neutral) | 33.3% | 87 | 6% |
| 1.06–1.15 (unsettled) | 47.5% | 101 | 7% |
| 1.21–1.25 (extreme/cap) | 51.4% | 1,033 | 67% |

Implication: the threshold for “volatile” needs to be much higher, or the keyword weights need restructuring. The current scoring cannot distinguish “a front is mentioned in passing for next week” from “a dangerous front is arriving tomorrow.”

Finding 2: The MIDDLE of the AFD Range Has the Signal

The lowest miss rate (33.3%) is in the neutral zone (0.96–1.05) — days where the AFD has roughly equal stability and instability language. Both extremes (≤ 0.90 and ≥ 1.21) have ~48–51% miss rates.

This directly supports the Forecast YES thesis (see Apr 17 observation): on neutral-AFD days, forecasts are right 67% of the time. At 29–35¢ YES prices, that’s profitable edge. The conviction overlay can identify which days to buy YES, not just which days to buy NO.

Finding 3: AFD Value Varies Enormously by City

| City | Volatile Miss (n) | Stable Miss (n) | Gap | AFD Useful? |
| --- | --- | --- | --- | --- |
| HOU | 55% (71) | 11% (9) | +44pp | YES — strong |
| ATL | 75% (59) | 44% (16) | +31pp | YES |
| PHX | 78% (32) | 51% (45) | +27pp | YES |
| AUS | 70% (60) | 50% (10) | +20pp | YES |
| DEN | 56% (66) | 43% (7) | +13pp | Marginal |
| LAS | 16% (32) | 15% (41) | +1pp | NO — both low |
| SEA | 74% (35) | 76% (34) | −2pp | NO — both high |
| LAX | 29% (34) | 56% (36) | −26pp | INVERTED |
| SFO | 47% (36) | 57% (35) | −10pp | INVERTED |

Coastal cities (LAX, SFO, SEA) are inverted or neutral. Marine-layer micro-climate variability doesn’t respond to synoptic-scale AFD language. A “stable high-pressure ridge” over California can produce either 65°F or 80°F at LAX depending on marine layer position — and the AFD can’t predict that.

Interior/Gulf cities (HOU, ATL, PHX, AUS) respond strongly. These are the cities where AFD-based conviction should be applied. 20–44pp gap between volatile and stable days is real edge for signal discrimination.

Finding 4: Specific Keywords Matter More Than the Aggregate Score

| Keyword | Miss When Present | Miss When Absent | Delta |
| --- | --- | --- | --- |
| trough | 51.7% | 41.8% | +9.9pp |
| front | 50.9% | 41.1% | +9.7pp |
| rapidly | 55.8% | 48.0% | +7.8pp |
| light winds | 43.2% | 50.9% | −7.7pp |
| sunny | 44.9% | 50.3% | −5.4pp |
| uncertainty | 48.1% | 49.8% | −1.7pp |

“Uncertainty” REDUCES miss rate. Counterintuitive but consistent: when NWS forecasters explicitly flag uncertainty, they anchor to climatology and make conservative predictions — which paradoxically makes the point forecast more accurate. “Trough” and “front” are the real bust predictors at ~+10pp each.

Finding 5: Overall Forecast Quality

Average absolute forecast error (Open-Meteo GFS): 1.2°F. Overall bracket miss rate: 49.1%. These numbers are lower than the live classifier’s 70–88% because: (a) GFS outperforms NWS point forecasts operationally, (b) the data covers all 17 cities including low-miss ones, and (c) the synthetic bracket assumption (hardcoded odd-top grid) affects the count.

Implications for Phase 2

Data Source Details

Script: scripts/backfill-historical-conviction.js. Collection: HistoricalConviction (1,530 flat documents). Safe to re-run (upserts by city+date+marketType).

This backfill was prompted by the Apr 15 conviction overlay observation and the Apr 17 Forecast YES hypothesis. The data validates the YES thesis (neutral AFD = 67% hit rate) while revealing that the NO-side AFD signal requires per-city calibration and keyword reweighting before it can drive automated decisions.

Forecast YES — Inverse Strategy on Stable Weather Days

Observation Date: April 17, 2026 | Triggered by: first day of Wx conviction data showing 9/14 cities at max AFD, with 3 cities (DEN, LAS, SEA) showing stable patterns

The Insight

On the first day of Wx conviction data, a cold front was driving most cities to AFD ≥ 1.20 (unstable), but three cities (DEN 0.89, LAS 0.93, SEA 0.90) sat stable behind the front. The YES contract on Denver’s forecast bracket was priced at 29¢.

At 29¢, you only need 29% accuracy to break even. You can be wrong 70% of the time and still profit. This inverts the Forecast NO logic: instead of betting NWS is WRONG on volatile days, bet NWS is RIGHT on stable days, at cheap YES prices.

The Binary Payout Math

| YES Entry | Win Profit | Loss | Break-even WR |
| --- | --- | --- | --- |
| 25¢ | +$300 | −$100 | 25% |
| 29¢ | +$245 | −$100 | 29% |
| 33¢ | +$203 | −$100 | 33% |
| 40¢ | +$150 | −$100 | 40% |

The payout asymmetry is extreme. A 35% hit rate at 29¢ yields +21% ROI. A 40% hit rate yields +38% ROI. The strategy is very forgiving on accuracy because you’re buying cheap.
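The table reduces to two one-line formulas. Names are assumptions; prices are expressed as fractions of $1 per contract:

```javascript
// Binary-payout math for a YES contract: pay `price`, collect $1 on a win.
// ROI per dollar risked = hitRate / price − 1.
function yesRoi(price, hitRate) {
  return hitRate / price - 1;
}

// Break-even win rate equals the entry price itself.
function breakevenWinRate(price) {
  return price;
}
```

This reproduces the figures in the text: a 35% hit rate at 29¢ is about +21% ROI, and 40% is about +38%.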

Historical Bracket HIT Rate (60-day, overall — not filtered by stability)

The complement of miss rate — % of days the actual high STAYED INSIDE the NWS 2°F bracket. These rates are BEFORE any stability filtering; the hypothesis is that stable-day filtering pushes hit rates higher.

| City | Samples | Hit Rate | 29¢ ROI | 25¢ ROI | Note |
| --- | --- | --- | --- | --- | --- |
| PHX | 25 | 52% | +79% | +108% | Already above breakeven without filtering |
| NYC | 25 | 44% | +52% | +76% | Already above breakeven |
| ATL | 25 | 44% | +52% | +76% | Already above breakeven |
| OKC | 25 | 44% | +52% | +76% | Already above breakeven |
| DAL | 25 | 40% | +38% | +60% | Already above breakeven |
| LAS | 25 | 36% | +24% | +44% | Already above breakeven |
| SEA | 25 | 36% | +24% | +44% | Already above breakeven |
| LAX | 25 | 32% | +10% | +28% | Above breakeven |
| SFO | 25 | 32% | +10% | +28% | Above breakeven |
| MIA | 25 | 32% | +10% | +28% | Above breakeven |
| HOU | 25 | 28% | −3% | +12% | Near breakeven |
| DEN | 25 | 24% | −17% | −4% | Below breakeven at base rate — needs stability filter |
| PHL | 25 | 24% | −17% | −4% | Below breakeven |
| DC | 25 | 24% | −17% | −4% | Below breakeven |
| BOS | 25 | 20% | −31% | −20% | Below breakeven |
| AUS | 25 | 20% | −31% | −20% | Below breakeven |
| CHI | 25 | 16% | −45% | −36% | Far below breakeven |

Ten cities clear the 29¢ breakeven at their overall base rate, seven of them (PHX through SEA) by a comfortable margin, before any stability filtering. If stability filtering raises hit rates by even 5-10pp on clean days, the ROI numbers become very compelling.

Why This Is Different From The Old BUY_YES (Which Failed)

The active model’s BUY_YES was disabled in April (0/2 WR, −$51). That approach used the model’s probability estimate to decide when YES was underpriced — which required the model to be well-calibrated (it wasn’t).

This idea is fundamentally different:

| Approach | Signal | What It Relies On |
| --- | --- | --- |
| Old BUY_YES (failed) | “Our model says YES is cheap” | Model probability calibration (broken) |
| Forecast YES (new) | “NWS is likely RIGHT today” | Wx conviction overlay identifying stable days |

The conviction overlay becomes the signal switch between two complementary strategies that run on the same dashboard:

| Wx Signal | Day Type | Strategy |
| --- | --- | --- |
| Red (AFD ≥ 1.15, spread ≥ 8°) | Volatile | Forecast NO — NWS likely to bust, buy NO |
| Green (AFD ≤ 0.92, spread ≤ 3°) | Stable | Forecast YES — NWS likely right, buy YES cheap |
| Gray | Neutral | Baseline only or skip |
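The switch can be sketched as a tiny router over those thresholds (function name and shape are assumptions; the thresholds are the ones in the table):

```javascript
// Route a day to a strategy from its AFD factor and ensemble spread (°F).
function strategyForDay({ afd, spreadF }) {
  if (afd >= 1.15 && spreadF >= 8) return 'FORECAST_NO';  // Red: volatile
  if (afd <= 0.92 && spreadF <= 3) return 'FORECAST_YES'; // Green: stable
  return 'BASELINE';                                      // Gray: neutral
}
```

Days where AFD and spread disagree (e.g. high AFD, low spread) deliberately fall through to Gray rather than forcing a side.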

Caveats

Analysis To Run When Data Is Ready

  1. Conditional hit rate by AFD level — for each city, what’s the bracket hit rate on days where AFD ≤ 0.92 vs days where AFD ≥ 1.15?
  2. Actual YES ask price at scan time — is 29¢ typical for stable cities, or was DEN an outlier? Need to capture YES prices alongside NO prices on scans.
  3. Edge = conditional hit rate − YES ask price. If this exceeds 5pp consistently, the strategy has real edge.
  4. Correlation between ensemble spread and hit rate — does low spread (≤ 3°) independently predict higher hit rates, or is it redundant with AFD?
  5. Interplay with edge-position — on a stable day, does a bottom-edge forecast (cold-biased city where 1°F cooling stays in bracket) have an even higher hit rate? That would be the tightest filter: stable + favorable edge + cheap YES.

Revisit Criteria

This observation was prompted by the first day of Wx conviction data (Apr 17). William noticed the inverse opportunity: if the conviction overlay identifies days when NWS is likely to be right, buying YES on the forecast bracket at cheap prices exploits the same data from the opposite direction. The conviction overlay was originally designed for NO-side edge detection — this is a completely new use case that emerged from the data itself.

Forecast NO — Day-Level Conviction Overlays

Observation Date: April 15, 2026 | Design notes — revisit when ready to build

The Insight That Prompted This

The April 11 post-mortem showed 71% of all Forecast NO losses (5 of 7) clustered on a single day. When the strategy loses, it loses multiple positions simultaneously because the model is collectively wrong about a weather pattern. Losses are day-correlated, not time-of-day-correlated.

Therefore the highest-leverage next improvement is a filter that can flag “today is a high-uncertainty day, size down or skip” vs “today is a clean baseline day, go full size.” Day-level conviction, not time-of-day tuning.

Already Computed But NOT Consumed by Forecast NO

These signals exist in the codebase (used by the active model for sigma adjustment) but are completely ignored by Forecast NO. Wiring them in is pure plumbing.

The AFD signal is the biggest miss. NWS forecasters literally use the word “uncertainty” in their discussion text on busted-forecast days — and we're ignoring it for the strategy that bets on their uncertainty.

Medium-Effort New Integrations

Harder / Longer-Term

Recommended Build Sequence

Phase 1 — Wire existing signals as display-only (2-3 hours). Create a dayConviction object on every Forecast NO scan with afdFactor, afdKeywords, ensembleSpread, ensembleDiff, hourlyVolatility, alertsPresent, and an aggregated dayConvictionScore (0-100, where 50 = neutral, >70 = high uncertainty). Surface on /weather/forecast-no as a new column. Do not change firing logic yet — let it accumulate for 2-3 weeks as display-only.
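A display-only sketch of that aggregation, using the field names from the paragraph above but with placeholder weights that Phase 2 would calibrate (every coefficient here is an assumption):

```javascript
// Aggregate existing signals into a 0-100 score (50 = neutral,
// > 70 = high uncertainty). Weights are placeholders for calibration.
function dayConvictionScore({ afdFactor, ensembleSpread, ensembleDiff,
                              hourlyVolatility, alertsPresent }) {
  let score = 50;
  score += (afdFactor - 1.0) * 100;          // ±0.25 AFD band → ±25 points
  score += Math.min(ensembleSpread, 12) * 2; // up to +24 for extreme spread
  score += Math.min(ensembleDiff, 5) * 2;    // NWS-vs-ensemble divergence
  score += alertsPresent ? 5 : 0;            // active NWS alerts
  score += Math.min(hourlyVolatility, 5);    // small residual term
  return Math.max(0, Math.min(100, Math.round(score)));
}
```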

Phase 2 — Validate and fold in (after 60+ settled signals). On days where dayConvictionScore ≥ 70, was the realized miss rate actually higher than the city's baseline? If yes, use it as a cap multiplier: adjustedCap = baselineCap × (1 + 0.1 × convictionBonus). High uncertainty → accept higher NO ask prices (bigger positions). Stable days → tighten.
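The cap adjustment as code. How convictionBonus is derived from the 0-100 score is still open; the example simply treats it as a small non-negative number:

```javascript
// Phase 2 cap multiplier, straight from the formula above:
// adjustedCap = baselineCap × (1 + 0.1 × convictionBonus).
function adjustedCap(baselineCap, convictionBonus) {
  return baselineCap * (1 + 0.1 * convictionBonus);
}
```

So a conviction bonus of 2 on a 80¢ baseline cap would accept NO asks up to 96¢ worth of equivalent sizing; a bonus of 0 leaves the cap untouched.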

Phase 3 — April 11 post-mortem validation. Before investing in Phase 1 wiring, pull historical AFD + ensemble data for April 9-13 and check whether the dayConviction signal would have flagged April 11 as high-uncertainty. If yes, the signal is real and we should build. If no, the filter doesn't work and we need different data sources. This is the most valuable next step — proves the signal has predictive value on the one day we already know mattered.

Priority Order When Building

  1. AFD factor — biggest payoff, already computed, highest encoded human judgment
  2. Ensemble spread — biggest quantitative signal, already fetched
  3. Ensemble vs NWS diff — subtle but real edge indicator
  4. Hourly volatility — cheap, already partially done
  5. Everything else — wait and see if 1-4 are enough

Estimated Impact

If AFD flags 1 in 5 days as "high uncertainty" with meaningfully higher miss rates: back-of-envelope, April 11's loss was ~−$46 on 5 positions. A conviction filter that halved sizing on that day would have cut the loss to ~−$23, lifting period P&L from +$143 to +$166 — a ~16% improvement from a single filter during one documented bad day.

Open Questions for Future-Me

Revisit Criteria

Don't start this until:

Ensemble Models

The 6 global weather models used for ensemble spread and NWS-divergence analysis:

| Model | Origin | API ID | Notes |
| --- | --- | --- | --- |
| GFS | US (NOAA) | gfs_seamless | NWS's own parent model. When NWS agrees with GFS but disagrees with others, that's a strong NWS-anchoring signal. |
| ECMWF | European | ecmwf_ifs025 | Generally considered the most accurate global model. When ECMWF diverges from NWS, the ECMWF read is often closer to truth. |
| ICON | German (DWD) | icon_seamless | Strong on European weather patterns. Independent initialization from GFS/ECMWF. |
| GEM | Canadian | gem_seamless | Good for northern US cities. Independent data assimilation. |
| JMA | Japanese | jma_seamless | Strongest on Pacific-influenced weather (west coast cities). |
| MeteoFrance | French | meteofrance_seamless | Independent Arpège/AROME system. Adds diversity to the ensemble. |

All fetched via Open-Meteo API (free, no API key). Source: src/fetchers/open-meteo.js.

How spread is computed: max(high) − min(high) across all 6 models for the forecast day. A spread of 3°F means the models all roughly agree. A spread of 10°F+ means at least one model sees a fundamentally different weather outcome (e.g., a front arriving 6 hours earlier/later than others expect).
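That definition as code (trivial, but it pins down the units: input is the six per-model daily highs in °F):

```javascript
// Ensemble spread = max(high) − min(high) across the model highs.
function ensembleSpread(modelHighs) {
  return Math.max(...modelHighs) - Math.min(...modelHighs);
}
```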

Related observation: Forecast NO Execution Automation Design Notes (Apr 14, stored in recaps) — the conviction-score signal and the execution-intent automation share a natural integration point. When the conviction score drops, the automation should cancel open unfilled intents; when it rises, it should permit higher per-signal size.

Forecast Parity Interacts With City Bias

Observation Date: April 14, 2026 | Data: 21 HIGH settlements per city over last 30 days

The Structural Setup

Kalshi weather brackets are 2°F wide and aligned in even-odd pairs: 82-83, 84-85, 86-87, and so on. The top edge of every bracket is an odd number. The bottom edge is even.

This creates a structural interaction with each city's directional forecast bias:

In short: the parity of the forecast determines which side of the bracket a city’s typical drift crosses.
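The bracket geometry as a small helper (hypothetical name; the even-bottom/odd-top rule is from the setup above):

```javascript
// Kalshi HIGH brackets are 2°F wide with an even bottom edge and an odd
// top edge (82-83, 84-85, ...). An even forecast sits on the bottom edge,
// an odd forecast on the top edge.
function bracketFor(forecastHigh) {
  const bottom = forecastHigh % 2 === 0 ? forecastHigh : forecastHigh - 1;
  return {
    bottom,
    top: bottom + 1,
    forecastEdge: forecastHigh % 2 === 0 ? 'bottom' : 'top',
  };
}
```

So a warm-biased city with an odd forecast starts on the top edge and exits the bracket with just 1°F of upward drift, which is the mechanism behind the table below.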

The Data

Splitting each city’s historical miss rate by forecast parity shows a dramatic effect:

| City | Bias | ODD miss% | EVEN miss% | Edge to Preferred |
| --- | --- | --- | --- | --- |
| HOU | −1.4°F (cold) | 50% (n=10) | 91% (n=11) | +41pp EVEN |
| SEA | +2.7°F | 75% (n=8) | 38% (n=13) | +37pp ODD |
| MIA | +1.0°F | 80% (n=10) | 45% (n=11) | +35pp ODD |
| DC | +2.0°F | 91% (n=11) | 60% (n=10) | +31pp ODD |
| AUS | +1.0°F | 88% (n=16) | 60% (n=5) | +28pp ODD |
| PHL | +3.8°F | 86% (n=7) | 64% (n=14) | +22pp ODD |

Where It Breaks Down

The effect is strongest for cities with modest directional bias (~1-2°F). For cities with extreme bias, the magnitude of drift overwhelms the parity effect:

Implications

Current Status

First Live Bias-Change Signal (same day)

After adding bias change detection (comparing recent 14d to prior 14d), the first classifier run surfaced meaningful movement right away. These are not small adjustments — they suggest real regime shifts are happening even within a 4-week window:

CityPrior 14dRecent 14dΔFlagNote
LAX-2.0°F (cold)+0.6°F (warm)+3.1°F⚠ FLIPCold bias reversed to warm — parity preference now inverted. Treat with caution until confirmed.
ATL+3.3°F+0.9°F-2.7°F↓ SHIFTStill warm, but bias magnitude cut to a third. Moving into the "parity-sensitive" sweet spot where the effect is strongest.
NYC+1.6°F-0.4°F-2.4°F↓ SHIFTDrifted from warm into the neutral zone. Parity preference now none.
OKC+3.6°F+1.7°F-2.2°F↓ SHIFTHalved the warm bias. Still ODD-preferring but less confident.
DC+3.9°F+1.4°F-3.0°F↓ SHIFTBiggest shift. Was the highest-confidence ODD city; now in the middle of the pack.

The pattern: all four "shift" cities are cooling — their warm bias is dropping. LAX reversed entirely. This points to NWS models catching up to recent weather patterns (spring warm-up already priced in), or an actual regime change (cold snap suppressing the usual warm bias).

Stable cities (no flag): CHI (+5.1°F), BOS (+2.4°F), AUS (+0.9°F), MIA (+1.1°F), PHX (+1.3°F), HOU (-1.6°F). These have been consistent across both windows — trustworthy parity signals right now.

Implication for the strategy: the five shift/flip cities need higher confidence margins on their entry prices, or we should explicitly mark them as "bias uncertain" and skip for a few days. Specifically LAX — we were about to trust its "prefer ODD" signal, but with the flip flag, that recommendation is unreliable.

Credit: William spotted this. The observation was sparked by noticing that forecast brackets had odd numbers at the top, which immediately suggested the asymmetry. Sample sizes are still small (5-16 per city-parity bucket) — pattern should be re-verified at 45+ days of data.

Weather Forecast Error Patterns

Observation Date: April 7, 2026 | Data: 289 HIGH settlements across 17 cities (March 21 - April 7, 2026)

Key Finding: Weather Errors Persist, Not Reverse

Unlike BTC 15-minute markets where price mean-reverts (57% reversal rate after 3+ streaks), weather forecast errors persist in the same direction. Streak reversal strategies fail for weather:

| Streak Length | Reversal Rate (miss >1°) | Verdict |
| --- | --- | --- |
| 2-day | 47% | Coin flip |
| 3-day | 31% | Continuation favored |
| 4-day | 14% | Strong continuation |

This is the opposite of BTC. Weather errors compound — if the NWS missed hot for 3 days, they'll likely miss hot again tomorrow.

Autocorrelation: The Warm Bias

The NWS has a systematic warm bias. After a cold miss, it returns to warm. After a hot miss, it stays hot. The bias is the attractor.

After Big Misses: Forecast Improves But Undercorrects

When the NWS misses by 4°+, the next day improves but doesn't fully correct:

| Miss Size | Next Day "Improved" | Full Reversal |
| --- | --- | --- |
| ≥2° | 74% (HOT) / 83% (COLD) | 31% / 39% |
| ≥3° | 81% / 86% | 37% / 41% |
| ≥4° | 85% / 100% | 42% / 62% |
| ≥5° | 95% / 100% | 46% / 56% |

Cold misses correct more aggressively than hot misses (they revert to the warm bias). Hot misses persist.

Tradeable Signal: Regression to Forecast

After a big miss, the next day's actual temp tends to land closer to the forecast but within a known band:

| Miss Threshold | Bracket Offset | Win Rate | Sample |
| --- | --- | --- | --- |
| ≥3° | ±2° | 62% | 95 |
| ≥4° | ±2° | 67% | 73 |
| ≥5° | ±2° | 70% | 50 |
| ≥4° | ±3° | 73% | 73 |
| ≥5° | ±3° | 78% | 50 |

Sweet spot: after a 5°+ miss, bet the next day’s actual will be within 3° of the forecast. 78% win rate, ~3 opportunities per week.
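The sweet-spot rule as a sketch (hypothetical names; the ≥5° trigger and ±3° band come straight from the table):

```javascript
// Regression-to-forecast signal: after yesterday's forecast missed by
// 5°F or more, bet that tomorrow's actual lands within ±3°F of
// tomorrow's forecast (historically 78% WR on this sample).
function regressionSignal(yesterdayForecastHigh, yesterdayActualHigh) {
  const miss = Math.abs(yesterdayActualHigh - yesterdayForecastHigh);
  return miss >= 5 ? { fire: true, bandF: 3 } : { fire: false };
}
```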

Per-City Overcorrection Patterns

Some cities overcorrect after big misses (error switches sign by >1°), others never do:

| City | Overcorrection Rate | Pattern |
| --- | --- | --- |
| AUS | 75% | Frequently overcorrects — tradeable |
| LAS | 67% | Overcorrects often |
| DEN, PHL, BOS | 50% | Coin flip |
| CHI, SFO, SEA, DAL | 0% | Never overcorrects — errors persist |

CHI’s 0% overcorrection explains why our warm-bias trades on CHI keep winning — when CHI misses hot, it stays hot.

Conclusions

This observation is based on 17 days of settlement data. Patterns should be validated over 60+ days before trading on them. Filed for review.